Page 51 - ITU Journal Future and evolving technologies Volume 2 (2021), Issue 4 – AI and machine learning solutions in 5G and future networks
P. 51

ITU Journal on Future and Evolving Technologies, Volume 2 (2021), Issue 4























                      Fig. 10 – Architecture to calculate the af ine parameters from video features in multimodal adaptive normalization




















                      Fig. 11 – Architecture to calculate the af ine parameters from audio features in multimodal adaptive normalization


          and Audio Melspectrogram Features (AMF) from audio   resents the contour of a speci ic facial part, e.g., upper
          domain & static image and Optical Flow (OF)/facial Key‑  left eyelid and nose bridge. It captures the spatial con‑
          points Heatmap (KH) features from video domain. There   iguration of all landmarks, and hence it captures pose,
          are various feature extractor modules such as keypoint  expression and shape information. We have used a pre‑
                                                                           2
          heatmap predictor which predicts the keypoint heatmap  trained model to calculate the ground truth of heatmap
          having the information of various facial parts, e.g., up‑  and have applied mean square error loss between pre‑
          per left eyelid and nose bridge. Another module is Op‑  dicted heatmaps and ground truth. Fig. 12 shows the
          tical Flow Predictor which predicts the optical  low of the  architectural diagram the of the keypoint heatmap pre‑
          frame needed to maintain the temporal consistency in any  dictor which takes the previous 5 frames and melspectro‑
          video.                                               gram of audio signal and feed it to the model to predict
          Fig. 7 shows that block level diagram of how the vari‑  the heatmaps of the frames. Hourglass architecture [57]
          ous features of the audio and video domain go into the  is used after two layers of convolution which helps in gen‑
          multimodal adaptive normalisation through the af ine pa‑  erating better facial heatmaps.
          rameters i.e, scale,    and a shift,    .Fig. 10 shows that  In the experiments, we have used the 15 channel
          the video features such as predicted optical  low or key‑  heatmaps and input and output sizes are (15, 96, 96).
          point heatmaps and single image go into a few layers of  We have done the joint training of keypoint predictor ar‑
          2D convolution/2D partial convolution/2D convolution‑  chitecture along with the generator architecture and fed
          attention layers to generate the corresponding   ’s and   ’s  the output of keypoint predictor architecture in the multi‑
          and then go to the normalization layers of the generator.  modal adaptive normalization network to learn the af ine
          Fig. 11 shows that various features such as pitch, energy  parameters and have optimized it with mean square er‑
          and melspectrogram go to the various layers of 1D convo‑  ror loss with the output of a pretrained model. The input
          lution/LSTM to generate the corresponding   ’s and   ’s.  of the keypoint predictor model is the previous 5 frames
                                                               along with 256 audio mel spectrogram features which are
          4.3.1  Keypoint heatmap predictor                    concatenated along the channel axis. This is optimization
                                                               with mean square error loss with a pretrained facial key‑
          The predictor model is based on Hourglass architecture  point detection model.
          [57] that, from the input image, estimates K heatmaps      
          ∈ [0, 1]H×W, one for each keypoint, each of which rep‑  2 https://github.com/raymon-tian/hourglass-facekeypoints-detection




                                             © International Telecommunication Union, 2021                    35
   46   47   48   49   50   51   52   53   54   55   56