ITU Journal on Future and Evolving Technologies, Volume 2 (2021), Issue 4 – AI and machine learning solutions in 5G and future networks






face synthesis using machine learning techniques. Machine learning methods could simplify video generation by producing the video automatically from the audio. Such methods could be applied in film post-production to achieve better lip synchronization. They could be applied in education to teach students in a more realistic manner, which can help reduce the cost of teaching. Such techniques can also be used to generate parts of the face that are occluded or missing in a scene. This technology can improve band-limited visual telecommunications by either generating the entire visual content from the audio or filling in dropped frames.

Fig. 2 – Top: Frames of the video. Bottom: Optical flow of the video.
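
As an illustration of the optical flow shown in Fig. 2, that is, the displacement of image points between two consecutive frames, the following minimal sketch estimates such a displacement from two frames. It is not the method used to produce Fig. 2, which is not specified here. It assumes a single Lucas-Kanade least-squares step over one patch, small sub-pixel motion, and synthetic frames. The names `gaussian_blob` and `lucas_kanade_patch` are hypothetical helpers introduced only for this example.

```python
import numpy as np


def gaussian_blob(size, cx, cy, sigma=6.0):
    """Synthetic grayscale frame (assumption): one smooth blob centred at (cx, cy)."""
    ys, xs = np.mgrid[0:size, 0:size]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))


def lucas_kanade_patch(frame1, frame2):
    """Estimate one displacement vector (u, v) for a whole patch.

    Solves the brightness-constancy equation Ix*u + Iy*v + It = 0
    in the least-squares sense over all pixels of the patch.
    """
    f1 = np.asarray(frame1, dtype=np.float64)
    f2 = np.asarray(frame2, dtype=np.float64)
    # Spatial gradients of the averaged frames; np.gradient returns (d/dy, d/dx).
    iy, ix = np.gradient(0.5 * (f1 + f2))
    # Temporal derivative: intensity change from the first frame to the second.
    it = f2 - f1
    a = np.stack([ix.ravel(), iy.ravel()], axis=1)
    b = -it.ravel()
    (u, v), *_ = np.linalg.lstsq(a, b, rcond=None)
    return float(u), float(v)


# Two consecutive synthetic frames: the blob moves by (+0.5, -0.3) pixels in (x, y).
frame1 = gaussian_blob(48, cx=24.0, cy=24.0)
frame2 = gaussian_blob(48, cx=24.5, cy=23.7)
u, v = lucas_kanade_patch(frame1, frame2)
print(f"estimated displacement: u={u:.3f}, v={v:.3f}")
```

Applying the same estimate to a small window around every pixel would give one displacement vector per pixel, i.e. a dense 2D vector field of the kind visualized in Fig. 2.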
2.2 Facial video

Facial video generation is a complex problem. Several properties make a generated video realistic:

  • Semantic consistency - The facial features such as eyes, nose, lips, etc. should be consistent with each other.

  • Temporal consistency - A video consists of several frames. Each frame should be temporally smooth with respect to its previous and next frames, so that there are no jitters, spikes or holes in the video.

  • Expressiveness - This property makes the video more realistic and natural. The movement of facial action units, lip synchronization with the audio, and the blinking of eyes make the video more realistic and visually appealing.

While generating the video from audio, the predicted videos should exhibit these properties. Optical flow and a keypoint heatmap help in making the video semantically and temporally consistent as well as more expressive.

2.2.1 Optical flow

Optical flow is the pattern of apparent motion of image objects between two consecutive frames caused by the movement of the object or the camera. It is a 2D vector field where each vector is a displacement vector showing the movement of points from the first frame to the second. Optical flow has many applications in areas such as structure from motion [5], video compression [6] and video generation. Optical flow helps in achieving temporally smoother videos. Fig. 2 shows the optical flow between consecutive frames of a video. The optical flow gives temporal as well as spatial information based on the movement of the intensity values of the frames.

2.2.2 Keypoint heatmap

Facial landmark detection is a well-studied topic in the field of computer vision with many applications such as face verification [7], face recognition [8], and facial attribute inference [9]. The high variability of shapes, poses, lighting conditions, and possible occlusions makes it a particularly challenging task even today. Such variability can be captured using the facial landmark keypoints. Using deep learning techniques, we detect the landmark keypoints around the cheeks, nose, eyes and lips to capture the movement of the face while speaking or making expressions. The heatmap of keypoints gives a coarser view of these keypoint locations. Such heatmaps help the model to focus on the regions around the lips, nose, eyes and cheeks so that it captures the expressiveness of the image. The upper part of Fig. 3 shows the landmark points of the images. The lower part shows the heatmap of the keypoints, which conveys the expressiveness of the images.

Fig. 3 – Top: facial keypoints. Bottom: keypoint heatmap of the face.

2.3 Audio

Modeling audio is a complex problem. Several aspects of the synthesized speech, such as a speaker's voice, speaking style/prosody and noise, come into play when incorporating the audio into the model. The range of prosody in the dialogue must encompass a large range of human conversation, from neutral expression to extremely emotional, while always sounding perfectly natural. Here, prosody refers to the variation of several speech-related phenomena such as intonation, stress, rhythm and style of the speech. Traditionally, prosody modeling is based on schematizing and labeling prosodic phenomena and developing rule-based systems or statistical models from




© International Telecommunication Union, 2021