Some of these methods target rigged 3D characters or meshes with predefined mouth blend shapes that correspond to speech sounds [33, 34, 35, 36, 37, 38]. These works have primarily focused on mouth motion and cover only a finite set of emotions, blinks and facial action unit movements.

3.1.2   Deep learning techniques for video generation
CNN-based architectures for audio to video generation: A lot of work has been done with CNNs to generate realistic videos given an audio signal and a static image as input. Speech2Vid [39] uses an encoder-decoder architecture to generate realistic videos and is trained with an L1 loss between the synthesized image and the target image. Our approach instead uses multimodal adaptive normalization in a GAN-based architecture to generate realistic videos.
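For illustration, the listing below is a minimal PyTorch-style sketch of this family of models: an encoder-decoder generator that fuses a still image with an audio feature and is trained with an L1 loss between the synthesized frame and the target frame. The layer sizes, the audio_dim parameter and the fusion by concatenation are illustrative assumptions, not the Speech2Vid [39] implementation.

import torch
import torch.nn as nn

class AudioToFrameGenerator(nn.Module):
    """Minimal encoder-decoder sketch: a still image and an audio
    feature vector are encoded, fused and decoded into one frame."""
    def __init__(self, audio_dim=256):
        super().__init__()
        # Image encoder: 3x128x128 -> 256x16x16 (sizes are illustrative)
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Audio encoder: projects an audio feature vector to a spatial map
        self.audio_encoder = nn.Sequential(
            nn.Linear(audio_dim, 256 * 16 * 16), nn.ReLU(),
        )
        # Decoder: fused 512x16x16 features -> 3x128x128 output frame
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(512, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, still_image, audio_feat):
        img_code = self.image_encoder(still_image)          # (B, 256, 16, 16)
        aud_code = self.audio_encoder(audio_feat)           # (B, 256*16*16)
        aud_code = aud_code.view(-1, 256, 16, 16)
        return self.decoder(torch.cat([img_code, aud_code], dim=1))

# Reconstruction objective used by this class of methods: an L1 loss
# between the synthesized frame and the ground-truth frame.
generator = AudioToFrameGenerator()
still = torch.randn(2, 3, 128, 128)     # identity image
audio = torch.randn(2, 256)             # per-frame audio feature
target = torch.randn(2, 3, 128, 128)    # ground-truth video frame
loss = nn.L1Loss()(generator(still, audio), target)
loss.backward()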
Synthesizing Obama: Learning Lip Sync from Audio [38] is able to generate high-quality videos of Obama speaking with accurate lip sync using an RNN-based architecture. However, it can generate videos of only a single person, whereas the proposed GAN-based model can generate videos for multiple input images.
GAN-based architectures for audio to video generation: Temporal GAN [40] and Generating Videos with Scene Dynamics [41] adapt GANs to video generation in a straightforward way by replacing 2D convolution layers with 3D convolution layers. Such methods are able to capture temporal dependencies but require fixed-length videos. The proposed model is able to generate videos of variable length with a low word error rate.
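The 2D-to-3D convolution adaptation mentioned above can be sketched as follows; this is an assumption-level illustration rather than the architecture of [40] or [41]. Because nn.Conv3d slides jointly over time, height and width, the discriminator judges a whole clip at once, which is also why a fixed clip length is baked into the input shape.

import torch
import torch.nn as nn

class VideoDiscriminator3D(nn.Module):
    """Clip-level discriminator: nn.Conv3d slides over (time, height,
    width), so temporal dependencies are captured, but every clip must
    have the same fixed length T."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            # input: (B, 3, T=16, 64, 64)
            nn.Conv3d(3, 64, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv3d(64, 128, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv3d(128, 256, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            # collapse the remaining (2, 8, 8) volume into one real/fake logit
            nn.Conv3d(256, 1, kernel_size=(2, 8, 8)),
        )

    def forward(self, clip):
        return self.net(clip).view(clip.size(0))  # one logit per clip

disc = VideoDiscriminator3D()
fake_clip = torch.randn(4, 3, 16, 64, 64)  # batch of fixed-length clips
print(disc(fake_clip).shape)               # torch.Size([4])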
Realistic Speech-Driven Facial Animation with GANs (RSDGAN) [42] uses a GAN-based approach to produce high-quality videos. It uses an identity encoder, a context encoder and a frame decoder to generate images, together with several discriminators that each handle a different aspect of video generation. The proposed method instead uses multimodal adaptive normalization, along with class activation layers, an optical flow predictor and a keypoint heatmap predictor, in a GAN-based setting to generate expressive videos.
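The general idea behind such conditional normalization can be sketched as follows. This is only an assumption-level illustration of an adaptive normalization layer whose scale and shift are predicted from fused multimodal features; the module names, feature dimensions and fusion by concatenation are placeholders and do not reproduce the authors' exact multimodal adaptive normalization.

import torch
import torch.nn as nn

class MultimodalAdaptiveNorm2d(nn.Module):
    """Instance-normalizes generator activations, then re-scales and
    re-shifts them with gamma/beta predicted from a fused multimodal
    feature vector (e.g. audio plus keypoint/optical-flow features)."""
    def __init__(self, num_channels, audio_dim=256, visual_dim=256):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        fused_dim = audio_dim + visual_dim
        self.to_gamma = nn.Linear(fused_dim, num_channels)
        self.to_beta = nn.Linear(fused_dim, num_channels)

    def forward(self, feat_map, audio_feat, visual_feat):
        # feat_map: (B, C, H, W) intermediate generator activations
        fused = torch.cat([audio_feat, visual_feat], dim=1)       # (B, A+V)
        gamma = self.to_gamma(fused).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = self.to_beta(fused).unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(feat_map) + beta

layer = MultimodalAdaptiveNorm2d(num_channels=128)
x = torch.randn(2, 128, 32, 32)   # generator feature map
a = torch.randn(2, 256)           # audio embedding
v = torch.randn(2, 256)           # e.g. keypoint-heatmap / optical-flow embedding
print(layer(x, a, v).shape)       # torch.Size([2, 128, 32, 32])

In this formulation the normalization statistics remove the static style of the feature map, and the multimodal condition re-injects audio-driven scale and shift at every layer where such a block is placed.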
The X2face [43] model uses a GAN-based approach to generate videos given a driving audio or driving video and a source image as input. The model learns face embeddings of the source frame and driving vectors of the driving frames or audio, from which the videos are generated. X2face processes video at 1 fps, whereas the proposed model generates video at 25 fps, and its output quality is lower than that of our proposed method with audio as input.

MoCoGAN [44] uses an RNN-based generator with separate latent spaces for motion and content, and a sliding-window approach so that the discriminator can handle variable-length sequences. The model is trained to disentangle content and motion vectors so that it can generate videos with different emotions and content. Our approach uses multimodal adaptive normalization to generate expressive videos.
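The motion/content decomposition can be sketched as follows (a simplified illustration under assumed latent dimensions, not the published MoCoGAN implementation): a content code is sampled once per clip, a GRU turns per-frame noise into a motion trajectory, and the concatenated per-frame latent would be fed to an image generator.

import torch
import torch.nn as nn

class MotionContentLatents(nn.Module):
    """Samples one content vector per video and a sequence of motion
    vectors from a GRU driven by per-frame noise, as in MoCoGAN-style
    decompositions of the latent space."""
    def __init__(self, content_dim=50, motion_dim=10, noise_dim=10):
        super().__init__()
        self.content_dim = content_dim
        self.noise_dim = noise_dim
        self.rnn = nn.GRU(noise_dim, motion_dim, batch_first=True)

    def forward(self, batch_size, num_frames):
        # Content: fixed across all frames of a video.
        z_content = torch.randn(batch_size, self.content_dim)
        z_content = z_content.unsqueeze(1).expand(-1, num_frames, -1)
        # Motion: the RNN turns i.i.d. noise into a correlated trajectory,
        # so clips of any length can be produced by unrolling further.
        eps = torch.randn(batch_size, num_frames, self.noise_dim)
        z_motion, _ = self.rnn(eps)
        # Per-frame latent to be fed to an image generator (not shown here).
        return torch.cat([z_content, z_motion], dim=2)

latents = MotionContentLatents()
print(latents(batch_size=2, num_frames=16).shape)  # torch.Size([2, 16, 60])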
[45] extracts the expression and pose from an audio signal, and a 3D face is reconstructed on the target image. The model renders the 3D facial animation into video frames using the texture and lighting information obtained from the input video, and then fine-tunes these synthesized frames into realistic frames using a novel memory-augmented GAN module. The proposed approach instead uses multimodal adaptive normalization, with a predicted optical flow/keypoint heatmap as input, to learn the movements and facial expressions on the target image with audio as input. CascadedGAN [46] uses an L-GAN and a T-GAN for motion (landmark) and texture generation and a noise vector for blink generation; Model-Agnostic Meta-Learning (MAML) [47] is used to generate videos for an unseen person image. The proposed method uses multimodal adaptive normalization to generate realistic videos.

[48] uses an Audio Transformation network (AT-net) for audio-to-landmark generation and a visual generation network for facial generation. [49] uses audio and identity encoders and a three-stream GAN discriminator over audio, visual and optical flow streams to generate lip movement from input speech. [50] enables arbitrary-subject talking face generation by learning a disentangled audio-visual representation through an associative-and-adversarial training process. [51] uses a generator that contains three blocks, (i) an Identity Encoder, (ii) a Speech Encoder and (iii) a Face Decoder, and is trained adversarially with a visual quality discriminator and a pretrained network for lip-audio synchronization. [49, 50, 51] are limited to lip movements, whereas the proposed method uses multimodal adaptive normalization to generate the different facial action units of an expressive video. [52] uses an Asymmetric Mutual Information Estimator (AMIE) to better transfer the audio information into the generated video in talking face generation; AMIE captures mutual information to learn cross-modal coherence, whereas we use multimodal adaptive normalization to incorporate multimodal features into our architecture and generate expressive videos. [4] feeds deep speech features into a generator with spatially adaptive normalization layers, along with a lip frame discriminator, a temporal discriminator and a synchronization discriminator, to generate realistic videos. That method supports only limited blinks and lip synchronization, whereas the proposed method uses multimodal adaptive normalization to capture the mutual relation between audio and video and generate expressive videos.



