Some of these methods target rigged 3D characters or meshes with predefined mouth blend shapes that correspond to speech sounds [33, 34, 35, 36, 37, 38]. These works have primarily focused on mouth motion and cover only a finite set of emotions, blinks and facial action unit movements.

3.1.2   Deep learning techniques for video generation
CNN-based architectures for audio to video generation: A lot of work has been done with CNNs to generate realistic videos given an audio signal and a static image as input. Speech2Vid [39] uses an encoder-decoder architecture to generate realistic videos and is trained with an L1 loss between the synthesized image and the target image. Our approach instead uses multimodal adaptive normalization in a GAN-based architecture to generate realistic videos.
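For illustration, the listing below is a minimal PyTorch-style sketch of this family of models: an encoder-decoder generator that fuses a still image with an audio feature and is trained with an L1 loss between the synthesized frame and the target frame. The layer sizes, the audio_dim parameter and the fusion by concatenation are illustrative assumptions, not the Speech2Vid [39] implementation.

import torch
import torch.nn as nn

class AudioToFrameGenerator(nn.Module):
    """Minimal encoder-decoder sketch: a still image and an audio
    feature vector are encoded, fused and decoded into one frame."""
    def __init__(self, audio_dim=256):
        super().__init__()
        # Image encoder: 3x128x128 -> 256x16x16 (sizes are illustrative)
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.ReLU(),
        )
        # Audio encoder: projects an audio feature vector to a spatial map
        self.audio_encoder = nn.Sequential(
            nn.Linear(audio_dim, 256 * 16 * 16), nn.ReLU(),
        )
        # Decoder: fused 512x16x16 features -> 3x128x128 output frame
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(512, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )

    def forward(self, still_image, audio_feat):
        img_code = self.image_encoder(still_image)          # (B, 256, 16, 16)
        aud_code = self.audio_encoder(audio_feat)           # (B, 256*16*16)
        aud_code = aud_code.view(-1, 256, 16, 16)
        return self.decoder(torch.cat([img_code, aud_code], dim=1))

# Reconstruction objective used by this class of methods: an L1 loss
# between the synthesized frame and the ground-truth frame.
generator = AudioToFrameGenerator()
still = torch.randn(2, 3, 128, 128)     # identity image
audio = torch.randn(2, 256)             # per-frame audio feature
target = torch.randn(2, 3, 128, 128)    # ground-truth video frame
loss = nn.L1Loss()(generator(still, audio), target)
loss.backward()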
Synthesizing Obama: Learning Lip Sync from Audio [38] is able to generate high-quality videos of Obama speaking with accurate lip sync using an RNN-based architecture. However, it can generate videos of only a single person, whereas the proposed GAN-based model can generate videos for multiple input images.
GAN-based architectures for audio to video generation: Temporal GAN [40] and Generating Videos with Scene Dynamics [41] adapt GANs to video generation in a straightforward way by replacing 2D convolution layers with 3D convolution layers. Such methods are able to capture temporal dependencies but require fixed-length videos. The proposed model is able to generate videos of variable length with a low word error rate.
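The 2D-to-3D convolution adaptation mentioned above can be sketched as follows; this is an assumption-level illustration rather than the architecture of [40] or [41]. Because nn.Conv3d slides jointly over time, height and width, the discriminator judges a whole clip at once, which is also why a fixed clip length is baked into the input shape.

import torch
import torch.nn as nn

class VideoDiscriminator3D(nn.Module):
    """Clip-level discriminator: nn.Conv3d slides over (time, height,
    width), so temporal dependencies are captured, but every clip must
    have the same fixed length T."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            # input: (B, 3, T=16, 64, 64)
            nn.Conv3d(3, 64, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv3d(64, 128, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv3d(128, 256, kernel_size=4, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            # collapse the remaining (2, 8, 8) volume into one real/fake logit
            nn.Conv3d(256, 1, kernel_size=(2, 8, 8)),
        )

    def forward(self, clip):
        return self.net(clip).view(clip.size(0))  # one logit per clip

disc = VideoDiscriminator3D()
fake_clip = torch.randn(4, 3, 16, 64, 64)  # batch of fixed-length clips
print(disc(fake_clip).shape)               # torch.Size([4])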
Realistic Speech-Driven Facial Animation with GANs (RSDGAN) [42] uses a GAN-based approach to produce high-quality videos. It uses an identity encoder, a context encoder and a frame decoder to generate images, together with several discriminators that each handle a different aspect of video generation. The proposed method instead uses multimodal adaptive normalization, along with class activation layers, an optical flow predictor and a keypoint heatmap predictor, in a GAN-based setting to generate expressive videos.
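The general idea behind such conditional normalization can be sketched as follows. This is only an assumption-level illustration of an adaptive normalization layer whose scale and shift are predicted from fused multimodal features; the module names, feature dimensions and fusion by concatenation are placeholders and do not reproduce the authors' exact multimodal adaptive normalization.

import torch
import torch.nn as nn

class MultimodalAdaptiveNorm2d(nn.Module):
    """Instance-normalizes generator activations, then re-scales and
    re-shifts them with gamma/beta predicted from a fused multimodal
    feature vector (e.g. audio plus keypoint/optical-flow features)."""
    def __init__(self, num_channels, audio_dim=256, visual_dim=256):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_channels, affine=False)
        fused_dim = audio_dim + visual_dim
        self.to_gamma = nn.Linear(fused_dim, num_channels)
        self.to_beta = nn.Linear(fused_dim, num_channels)

    def forward(self, feat_map, audio_feat, visual_feat):
        # feat_map: (B, C, H, W) intermediate generator activations
        fused = torch.cat([audio_feat, visual_feat], dim=1)       # (B, A+V)
        gamma = self.to_gamma(fused).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        beta = self.to_beta(fused).unsqueeze(-1).unsqueeze(-1)
        return (1 + gamma) * self.norm(feat_map) + beta

layer = MultimodalAdaptiveNorm2d(num_channels=128)
x = torch.randn(2, 128, 32, 32)   # generator feature map
a = torch.randn(2, 256)           # audio embedding
v = torch.randn(2, 256)           # e.g. keypoint-heatmap / optical-flow embedding
print(layer(x, a, v).shape)       # torch.Size([2, 128, 32, 32])

In this formulation the normalization statistics remove the static style of the feature map, and the multimodal condition re-injects audio-driven scale and shift at every layer where such a block is placed.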
The X2face [43] model uses a GAN-based approach to generate videos given a driving audio or driving video and a source image as input. The model learns face embeddings of the source frame and driving vectors of the driving frames or audio, from which the videos are generated. X2face processes video at 1 fps, whereas the proposed model generates video at 25 fps, and its output quality is lower than that of our proposed method with audio as input.

MoCoGAN [44] uses an RNN-based generator with separate latent spaces for motion and content, and a sliding-window approach so that the discriminator can handle variable-length sequences. The model is trained to disentangle content and motion vectors so that it can generate videos with different emotions and content. Our approach uses multimodal adaptive normalization to generate expressive videos.
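The motion/content decomposition can be sketched as follows (a simplified illustration under assumed latent dimensions, not the published MoCoGAN implementation): a content code is sampled once per clip, a GRU turns per-frame noise into a motion trajectory, and the concatenated per-frame latent would be fed to an image generator.

import torch
import torch.nn as nn

class MotionContentLatents(nn.Module):
    """Samples one content vector per video and a sequence of motion
    vectors from a GRU driven by per-frame noise, as in MoCoGAN-style
    decompositions of the latent space."""
    def __init__(self, content_dim=50, motion_dim=10, noise_dim=10):
        super().__init__()
        self.content_dim = content_dim
        self.noise_dim = noise_dim
        self.rnn = nn.GRU(noise_dim, motion_dim, batch_first=True)

    def forward(self, batch_size, num_frames):
        # Content: fixed across all frames of a video.
        z_content = torch.randn(batch_size, self.content_dim)
        z_content = z_content.unsqueeze(1).expand(-1, num_frames, -1)
        # Motion: the RNN turns i.i.d. noise into a correlated trajectory,
        # so clips of any length can be produced by unrolling further.
        eps = torch.randn(batch_size, num_frames, self.noise_dim)
        z_motion, _ = self.rnn(eps)
        # Per-frame latent to be fed to an image generator (not shown here).
        return torch.cat([z_content, z_motion], dim=2)

latents = MotionContentLatents()
print(latents(batch_size=2, num_frames=16).shape)  # torch.Size([2, 16, 60])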
[45] extracts the expression and pose from an audio signal, and a 3D face is reconstructed on the target image. The model renders the 3D facial animation into video frames using the texture and lighting information obtained from the input video, and then fine-tunes these synthesized frames into realistic frames using a novel memory-augmented GAN module. The proposed approach instead uses multimodal adaptive normalization, with a predicted optical flow/keypoint heatmap as input, to learn the movements and facial expressions on the target image with audio as input. CascadedGAN [46] uses an L-GAN and a T-GAN for motion (landmark) and texture generation and a noise vector for blink generation; Model-Agnostic Meta-Learning (MAML) [47] is used to generate videos for an unseen person image. The proposed method uses multimodal adaptive normalization to generate realistic videos.

[48] uses an Audio Transformation network (AT-net) for audio-to-landmark generation and a visual generation network for facial generation. [49] uses audio and identity encoders and a three-stream GAN discriminator over audio, visual and optical flow streams to generate lip movement from input speech. [50] enables arbitrary-subject talking face generation by learning a disentangled audio-visual representation through an associative-and-adversarial training process. [51] uses a generator that contains three blocks, (i) an Identity Encoder, (ii) a Speech Encoder and (iii) a Face Decoder, and is trained adversarially with a visual quality discriminator and a pretrained network for lip-audio synchronization. [49, 50, 51] are limited to lip movements, whereas the proposed method uses multimodal adaptive normalization to generate the different facial action units of an expressive video. [52] uses an Asymmetric Mutual Information Estimator (AMIE) to better transfer the audio information into the generated video in talking face generation; AMIE captures mutual information to learn cross-modal coherence, whereas we use multimodal adaptive normalization to incorporate multimodal features into our architecture and generate expressive videos. [4] feeds deep speech features into a generator with spatially adaptive normalization layers, along with a lip frame discriminator, a temporal discriminator and a synchronization discriminator, to generate realistic videos. That method supports only limited blinks and lip synchronization, whereas the proposed method uses multimodal adaptive normalization to capture the mutual relation between audio and video and generate expressive videos.



