Page 45 - ITU Journal Future and evolving technologies Volume 2 (2021), Issue 4 – AI and machine learning solutions in 5G and future networks

face synthesis using machine learning techniques.
Machine learning methods could simplify the video gen‑
eration process by automatically producing it from the au‑
dio. Such methods could be applied in post‑production of
film making to achieve better lip synchronization. They
can be applied in the education sector to teach students
in a more realistic manner that can help reduce the cost
of teaching. Apart from that, such techniques can be used
to generate parts of the face that are occluded or missing
in a scene. This technology can improve band‑limited vi‑
sual telecommunications by either generating the entire
visual content based on the audio or filling in dropped frames.

2.2 Facial video

Facial video generation is a complex problem. Several properties make a generated video realistic.

• Semantic consistency - The facial features such as eyes, nose, lips, etc. should be consistent with each other.

• Temporal consistency - A video consists of several frames. Each frame should be temporally smooth with respect to its previous and next frames, so that there are no jitters, spikes or holes in the video.

• Expressiveness - This property makes the video more realistic and natural. Properties such as movement of facial action units, lip synchronization with the audio and the blinking of eyes make the video more realistic and visually appealing.

While generating the video from audio, the predicted videos should exhibit such properties. Optical flow and a keypoint heatmap help in making the video semantically and temporally consistent as well as more expressive.

2.2.1 Optical flow

Optical flow is the pattern of apparent motion of image objects between two consecutive frames caused by the movement of the object or the camera. It is a 2D vector field where each vector is a displacement vector showing the movement of points from the first frame to the second. Optical flow has many applications in areas such as structure from motion [5], video compression [6] and video generation. Optical flow helps in achieving temporally smoother videos. Fig. 2 shows the optical flow between two consecutive frames of a video. The optical flow gives temporal as well as spatial information based on the movement of the intensity values of the frames.

Fig. 2 – Top: Frames of the video. Bottom: Optical flow of the video.

2.2.2 Keypoint heatmap

Facial landmark detection is a well-studied topic in the field of computer vision with many applications such as face verification [7], face recognition [8], and facial attribute inference [9]. The high variability of shapes, poses, lighting conditions, and possible occlusions makes it a particularly challenging task even today. Such variabilities can be captured using the facial landmark keypoints. We detect the landmark keypoints around the cheeks, nose, eyes and lips using deep learning techniques to capture the movement of the face while speaking or making expressions. The heatmap of keypoints gives a coarser view of these keypoint locations. Such heatmaps help the model to focus on the regions around the lips, nose, eyes and cheeks such that it captures the expressiveness of the image. Fig. 3 shows the landmark points of the images in the upper part. The lower part shows the heatmap of the keypoints, which gives information about the expressiveness of the images.

Fig. 3 – Top: facial keypoints. Bottom: keypoint heatmap of the face.

2.3 Audio

Modeling audio is a complex problem. Several aspects of the synthesized speech, such as a speaker's voice, speaking style/prosody and noise, come into play when incorporating the audio into modeling. The range of prosody in the dialogue must encompass a large range of human conversation, from neutral expression to extremely emotional, while always sounding perfectly natural. Here, prosody refers to the variation of several speech-related phenomena such as intonation, stress, rhythm and style of the speech. Traditionally, prosody modeling is based on schematizing and labeling prosodic phenomena and developing rule-based systems or statistical models from
