
ITU Journal on Future and Evolving Technologies, Volume 2 (2021), Issue 4





          equally important and prioritize the preservation of more important aspects of a feed over others in the compression/decompression process. Despite these advancements, much work remains to be done in order to give an enhanced videoconferencing experience under unreliable network conditions [1] such as glitches, lags, low internet bandwidth, etc.

          In this paper, we propose an audio-driven videoconferencing methodology that helps to improve video quality in adverse network scenarios. In the proposed method, we use a GAN-based approach at the receiver's end to generate video with enhanced quality under unreliable conditions. One possible concern with this methodology is that it shifts the burden from communication bandwidth to increased computation at the receiver's end. The use of a GAN-based [2] approach can increase latency, resulting in video lag during streaming. However, with the rapid improvement of hardware capabilities in mobiles and personal computers, this is unlikely to be a major obstacle. With the recent development of the NVIDIA Maxine project [3], such hurdles can be resolved, resulting in a practical system that provides immense gains over conventional methods.

          Given an arbitrary image and an audio sample, we propose multimodal adaptive normalization in the proposed architecture to generate realistic videos. We built the architecture based on [4] to show how multimodal adaptive normalization helps to generate highly expressive videos using audio and a person's image as input. The proposed GAN architecture consists of a generator and a discriminator. The generator has two major components, namely the multimodal adaptive normalization framework and the class activation attention map. The multimodal adaptive normalization framework feeds various features, such as optical flow/keypoint heatmaps, a single image, the audio melspectrogram, and the pitch and energy of the audio frames, to the generator to produce realistic and expressive video. The class activation attention map helps the generator to properly produce global features, such as the eyes, nose and lips, and local features, such as the movements of facial action units, which increases the video quality. The discriminator used in the proposed method is multiscale, with a class activation attention layer to discriminate fake from real frames at the global and local levels.

          Our main contributions are:

            • The proposed speech-driven facial video synthesis architecture is a GAN-based approach that consists of a generator and a discriminator (Section 4). The generator incorporates the multimodal adaptive normalization framework (Fig. 9), an optical flow/keypoint predictor and a class activation map-based attention layer to generate expressive videos. The discriminator is a multiscale patchGAN-based discriminator with a class activation map-based layer to classify images as fake or real.

            • We have shown how the Quality of Experience (QoE) in videoconferencing is improved in low-bandwidth networks by the proposed architecture (Section 7.2.2). The proposed videoconferencing pipeline helps to control the QoE based on the compute resources, the bandwidth availability and the importance of the speaker in the videoconference. It can further be used for data privacy by synthesizing the video of a person or an avatar. Noisy audio can be handled by the proposed model, which still generates expressive output and gives a high quality of experience.

            • Various experiments (Section 7.2) and ablation studies (Section 7.3) have shown that the proposed multimodal adaptive normalization is flexible in building the architecture with various networks, such as 2D convolution, partial 2D convolution, attention, LSTM and Conv1D, for extracting and modeling the mutual information.

            • The proposed multimodal adaptive normalization-based architecture for video synthesis, using audio and a single image as input, has shown superior performance on multiple qualitative and quantitative metrics, such as the Structural Similarity Index (SSIM), Peak Signal to Noise Ratio (PSNR), Cumulative Probability of Blur Detection (CPBD), Word Error Rate (WER), blinks/sec and Landmark Distance (LMD), in Tables 1, 2, 3 and 4. The generated videos are available at https://sites.google.com/view/itu2021.

          2.  BACKGROUND

          2.1 Audio to video generation

          Audio to video generation is an active area of research due to its wide range of applications in the entertainment industry, education, healthcare and many other fields. Computer Generated Imagery (CGI) has become an important part of the entertainment industry due to its ability to produce high-quality results in a controllable manner.

          Facial animation is an important part of CGI, as it is capable of conveying a lot of information, not only about the character but also about the scene in general. The generation of realistic and expressive animation is highly complex due to its multiple properties, such as lip synchronization with the audio, movements of facial action units for expressiveness, and natural eye blinks. Facial synthesis in CGI is traditionally performed using face capture methods, which have seen drastic improvements over the past years and can produce faces that exhibit a high level of realism. However, these approaches require expensive equipment and significant amounts of labour. In order to drive down the cost and time required to produce high quality, researchers are looking into automatic
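The equations of the multimodal adaptive normalization layer are given later in the paper; as a rough illustration of the general idea, the following sketch shows a SPADE/AdaIN-style conditional normalization: the generator's activations are instance-normalized and then scaled and shifted by parameters predicted from a fused multimodal embedding. The function name, the projection matrices `W_gamma`/`W_beta` and the flat feature layout are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def multimodal_adaptive_norm(x, modality_feats, W_gamma, W_beta, eps=1e-5):
    """Sketch of adaptive normalization conditioned on multimodal features.

    x:               (C, H, W) activation map inside the generator.
    modality_feats:  (D,) fused embedding of the conditioning modalities
                     (e.g. audio melspectrogram, pitch/energy, keypoint
                     heatmap features).
    W_gamma, W_beta: (C, D) projections mapping the embedding to
                     per-channel scale and shift (learned in practice).
    """
    # Instance-normalize each channel of the activation map.
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    x_norm = (x - mean) / np.sqrt(var + eps)

    # Predict per-channel modulation parameters from the fused modalities.
    gamma = W_gamma @ modality_feats  # (C,)
    beta = W_beta @ modality_feats    # (C,)

    # Scale and shift: the modalities steer the generator's statistics.
    return (1.0 + gamma)[:, None, None] * x_norm + beta[:, None, None]
```

With zero projection matrices the layer reduces to plain instance normalization; the learned projections let audio and keypoint features modulate every generator layer they are injected into.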
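Of the quantitative metrics listed in the contributions, PSNR has the simplest closed form and illustrates how frame-level quality is scored against a ground-truth frame. This is a standard textbook definition (10 log10 of the squared peak value over the mean squared error), not code from the paper.

```python
import numpy as np

def psnr(reference, generated, max_val=255.0):
    """Peak Signal-to-Noise Ratio (dB) between a reference frame and a
    generated frame, assuming 8-bit pixel values by default."""
    diff = reference.astype(np.float64) - generated.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)
```

Higher is better; video-level scores are typically obtained by averaging PSNR over all generated frames.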




          28                                 © International Telecommunication Union, 2021