
ITU Journal on Future and Evolving Technologies, Volume 2 (2021), Issue 4





          equally important and prioritize the preservation of more important aspects of a feed over others in the compression/decompression process. Despite these advancements, much work remains to be done in order to give an enhanced videoconferencing experience under unreliable network conditions [1] such as glitches, lags, low internet bandwidth, etc.

          In this paper, we propose an audio-driven videoconferencing methodology that helps to improve video quality in adverse network scenarios. In the proposed method, we use a GAN-based approach at the receiver's end to generate video with enhanced quality under unreliable conditions. One possible concern with this methodology is that it shifts the burden from communication bandwidth to increased computation at the receiver's end. The use of a GAN-based [2] approach can increase latency, resulting in video lag during streaming. However, with the rapid improvement of hardware capabilities in mobiles and personal computers, this is unlikely to be a major obstacle. With the recent development of the NVIDIA Maxine project [3], such hurdles can be resolved, resulting in a practical system that provides immense gains over conventional methods.

          Given an arbitrary image and an audio sample, we propose multimodal adaptive normalization in the proposed architecture to generate realistic videos. We built the architecture based on [4] to show how multimodal adaptive normalization helps to generate highly expressive videos using audio and a person's image as input. The proposed GAN architecture consists of a generator and a discriminator. The generator has two major components, namely the multimodal adaptive normalization framework and the class activation attention map. The multimodal adaptive normalization framework feeds various features, such as optical flow/keypoint heatmaps, a single image, the audio melspectrogram, and the pitch and energy of the audio frames, to the generator to produce realistic and expressive video. The class activation attention map helps the generator to properly produce global features, such as the eyes, nose and lips, and local features, such as the movements of facial action units, which increases the video quality. The discriminator used in the proposed method is multiscale, with a class activation attention layer to discriminate fake from real frames at the global and local levels.

          Our main contributions are:

            • The proposed speech-driven facial video synthesis architecture is a GAN-based approach that consists of a generator and a discriminator (Section 4). The generator incorporates the multimodal adaptive normalization framework (Fig. 9), an optical flow/keypoint predictor and a class activation map-based attention layer to generate expressive videos. The discriminator is a multiscale patchGAN-based discriminator with a class activation map-based layer to classify images as fake or real.

            • We have shown how the Quality of Experience (QoE) in videoconferencing is improved in low-bandwidth networks by the proposed architecture (Section 7.2.2). The proposed videoconferencing pipeline helps to control the QoE based on the compute resources, the bandwidth availability and the importance of the speaker in the videoconference. It can further be used for data privacy by synthesizing the video of a person or an avatar. Noisy audio can be handled by the proposed model, which still generates expressive output and gives a high quality of experience.

            • Various experiments (Section 7.2) and ablation studies (Section 7.3) have shown that the proposed multimodal adaptive normalization is flexible in building the architecture with various networks, such as 2D convolution, partial 2D convolution, attention, LSTM and Conv1D, for extracting and modeling the mutual information.

            • The proposed multimodal adaptive normalization-based architecture for video synthesis, using audio and a single image as input, has shown superior performance on multiple qualitative and quantitative metrics, such as the Structural Similarity Index (SSIM), Peak Signal to Noise Ratio (PSNR), Cumulative Probability of Blur Detection (CPBD), Word Error Rate (WER), blinks/sec and Landmark Distance (LMD), in Tables 1, 2, 3 and 4. The generated videos are available at https://sites.google.com/view/itu2021.

          2.  BACKGROUND

          2.1 Audio to video generation

          Audio to video generation is an active area of research due to its wide range of applications in the entertainment industry, education, healthcare and many other fields. Computer Generated Imagery (CGI) has become an important part of the entertainment industry due to its ability to produce high-quality results in a controllable manner.

          Facial animation is an important part of CGI, as it is capable of conveying a lot of information, not only about the character but also about the scene in general. The generation of realistic and expressive animation is highly complex due to its multiple properties, such as lip synchronization with the audio, movements of facial action units for expressiveness, and natural eye blinks. Facial synthesis in CGI is traditionally performed using face capture methods, which have seen drastic improvements over the past years and can produce faces that exhibit a high level of realism. However, these approaches require expensive equipment and significant amounts of labour. In order to drive down the cost and time required to produce high quality, researchers are looking into automatic
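The equations of the multimodal adaptive normalization layer are given later in the paper; as a rough illustration of the general idea, the following sketch shows a SPADE/AdaIN-style conditional normalization: the generator's activations are instance-normalized and then scaled and shifted by parameters predicted from a fused multimodal embedding. The function name, the projection matrices `W_gamma`/`W_beta` and the flat feature layout are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def multimodal_adaptive_norm(x, modality_feats, W_gamma, W_beta, eps=1e-5):
    """Sketch of adaptive normalization conditioned on multimodal features.

    x:               (C, H, W) activation map inside the generator.
    modality_feats:  (D,) fused embedding of the conditioning modalities
                     (e.g. audio melspectrogram, pitch/energy, keypoint
                     heatmap features).
    W_gamma, W_beta: (C, D) projections mapping the embedding to
                     per-channel scale and shift (learned in practice).
    """
    # Instance-normalize each channel of the activation map.
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    x_norm = (x - mean) / np.sqrt(var + eps)

    # Predict per-channel modulation parameters from the fused modalities.
    gamma = W_gamma @ modality_feats  # (C,)
    beta = W_beta @ modality_feats    # (C,)

    # Scale and shift: the modalities steer the generator's statistics.
    return (1.0 + gamma)[:, None, None] * x_norm + beta[:, None, None]
```

With zero projection matrices the layer reduces to plain instance normalization; the learned projections let audio and keypoint features modulate every generator layer they are injected into.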
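Of the quantitative metrics listed in the contributions, PSNR has the simplest closed form and illustrates how frame-level quality is scored against a ground-truth frame. This is a standard textbook definition (10 log10 of the squared peak value over the mean squared error), not code from the paper.

```python
import numpy as np

def psnr(reference, generated, max_val=255.0):
    """Peak Signal-to-Noise Ratio (dB) between a reference frame and a
    generated frame, assuming 8-bit pixel values by default."""
    diff = reference.astype(np.float64) - generated.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Higher is better; video-level scores are typically obtained by averaging PSNR over all generated frames.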




          28                                 © International Telecommunication Union, 2021