Fig. 5 – Proposed architecture for audio to video synthesis

Fig. 6 – Generator architecture


4.  ARCHITECTURAL DESIGN OF SPEECH-DRIVEN VIDEO SYNTHESIS

Given an arbitrary image and an audio sample, the proposed method generates a realistic, speech-synchronized video of the target face. The method uses a multimodal adaptive normalization technique to generate realistic, expressive videos. The proposed architecture is GAN-based and consists of a generator and a discriminator; see Fig. 5.
The architecture consists of four important subparts, i.e., the Generator, the Discriminator, Multimodal Adaptive Normalization, and the Feature Extractor Modules. The role of the generator is to generate realistic video frames (Fig. 6). The discriminator distinguishes between real and fake images and thereby helps the generator produce more realistic images (Fig. 14). Multimodal adaptive normalization provides the generator with the necessary information/features, i.e., pitch, energy and Audio Melspectrogram Features (AMF) from the audio domain, together with the static image and Optical Flow (OF)/facial Keypoint Heatmap (KH) features from the video domain (Figures 7, 10 and 11). The feature extractor modules consist of various predictor modules, such as the optical flow predictor, the keypoint heatmap predictor, and the pitch, energy and audio melspectrogram extractors, which extract the necessary features (OF/KH, pitch, energy and melspectrogram) that go into the normalization framework.
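For concreteness, the following is a minimal PyTorch-style sketch of how these four subparts could be wired together. All class names and call signatures here (SpeechDrivenVideoGAN, feature_extractors, etc.) are illustrative assumptions, since the paper specifies the design only at the block-diagram level of Fig. 5.

import torch.nn as nn

class SpeechDrivenVideoGAN(nn.Module):
    """Illustrative wiring of the four subparts of Fig. 5 (names assumed)."""
    def __init__(self, generator, discriminator, feature_extractors):
        super().__init__()
        self.generator = generator            # synthesizes video frames (Fig. 6)
        self.discriminator = discriminator    # real/fake critic (Fig. 14)
        self.extractors = feature_extractors  # OF/KH, pitch, energy, AMF modules

    def forward(self, still_image, audio):
        # Feature extractor modules: optical-flow and keypoint-heatmap
        # predictors plus pitch, energy and melspectrogram extractors.
        multimodal_feats = self.extractors(still_image, audio)
        # The multimodal features enter the generator through its
        # adaptive normalization layers rather than its main input.
        fake_frames = self.generator(still_image, multimodal_feats)
        return fake_frames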
4.1  Generator

Fig. 6 shows the generator architecture used to generate realistic images. It consists of convolution layers, several layers of the multimodal adaptive normalization-based ResNet [53] block (MANResNet), and a class activation map layer. Fig. 7 shows the residual architecture built around Multimodal Adaptive Normalization (MAN), together with 2D convolution and ReLU [54] activation layers. The audio and video features, namely the person's image, the predicted optical flow/predicted keypoint heatmap, the melspectrogram features, pitch and energy, go into the multimodal adaptive normalization network. Figures 10 and 11 show the multimodal adaptive normalization architecture, which takes the various features of the audio and video domains and calculates the affine parameters, i.e., a scale, γ, and a shift, β, for the normalization.
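The figures describe MAN only at the block level; the sketch below renders one plausible reading in PyTorch, assuming a SPADE-style design in which γ and β are predicted from a single fused multimodal feature map. The fusion into one map, the layer widths, and all names (MAN, MANResnetBlock, mm_feats) are assumptions, not the paper's exact specification.

import torch.nn as nn
import torch.nn.functional as F

class MAN(nn.Module):
    """Multimodal Adaptive Normalization: predicts the scale (gamma) and
    shift (beta) of the normalization from the fused multimodal features."""
    def __init__(self, channels, feat_channels, hidden=128):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)  # parameter-free
        self.shared = nn.Sequential(
            nn.Conv2d(feat_channels, hidden, 3, padding=1), nn.ReLU())
        self.to_gamma = nn.Conv2d(hidden, channels, 3, padding=1)  # scale
        self.to_beta = nn.Conv2d(hidden, channels, 3, padding=1)   # shift

    def forward(self, x, mm_feats):
        # Bring the fused audio/video feature map to the activation size.
        mm = F.interpolate(mm_feats, size=x.shape[2:], mode='nearest')
        h = self.shared(mm)
        gamma, beta = self.to_gamma(h), self.to_beta(h)
        return self.norm(x) * (1 + gamma) + beta  # affine modulation

class MANResnetBlock(nn.Module):
    """Residual block around MAN with 2D convolution and ReLU (cf. Fig. 7)."""
    def __init__(self, channels, feat_channels):
        super().__init__()
        self.man1 = MAN(channels, feat_channels)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.man2 = MAN(channels, feat_channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x, mm_feats):
        h = self.conv1(F.relu(self.man1(x, mm_feats)))
        h = self.conv2(F.relu(self.man2(h, mm_feats)))
        return x + h  # residual skip connection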
Class Activation Map (CAM)-based layer: This layer is employed as a layer of the generator to capture the global and local features of the face. In the class activation map [55], the concatenation of adaptive average pooling and adaptive max pooling of the feature map creates the CAM features, which capture global and local facial features respectively. It helps the generator to focus on the image regions that
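One common way to realize the described concatenation of adaptive average and adaptive max pooling is sketched below; the per-branch logit heads, the 1×1 fusion convolution, and all names are assumptions for illustration rather than the paper's exact design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CAMLayer(nn.Module):
    """One possible CAM-based layer: channel-attention weights are learned
    from the average- and max-pooled feature map, and the two attended maps
    are concatenated (global + local facial features)."""
    def __init__(self, channels):
        super().__init__()
        self.fc_avg = nn.Linear(channels, 1, bias=False)  # CAM head, avg branch
        self.fc_max = nn.Linear(channels, 1, bias=False)  # CAM head, max branch
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x):
        b, c, _, _ = x.shape
        # Adaptive average pooling -> global facial features.
        avg = F.adaptive_avg_pool2d(x, 1).view(b, c)
        # Adaptive max pooling -> local (most salient) facial features.
        mx = F.adaptive_max_pool2d(x, 1).view(b, c)
        logits = torch.cat([self.fc_avg(avg), self.fc_max(mx)], dim=1)
        # Re-weight the feature map by each head's learned channel weights.
        x_avg = x * self.fc_avg.weight.view(1, c, 1, 1)
        x_max = x * self.fc_max.weight.view(1, c, 1, 1)
        out = F.relu(self.fuse(torch.cat([x_avg, x_max], dim=1)))
        return out, logits  # fused CAM features; logits for an auxiliary loss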



