Page 51 - ITU Journal Future and evolving technologies Volume 2 (2021), Issue 4 – AI and machine learning solutions in 5G and future networks

P. 51

ITU Journal on Future and Evolving Technologies, Volume 2 (2021), Issue 4

Fig. 10 – Architecture to calculate the af ine parameters from video features in multimodal adaptive normalization

Fig. 11 – Architecture to calculate the af ine parameters from audio features in multimodal adaptive normalization

and Audio Melspectrogram Features (AMF) from audio resents the contour of a speci ic facial part, e.g., upper
domain & static image and Optical Flow (OF)/facial Key‑ left eyelid and nose bridge. It captures the spatial con‑
points Heatmap (KH) features from video domain. There iguration of all landmarks, and hence it captures pose,
are various feature extractor modules such as keypoint expression and shape information. We have used a pre‑
2
heatmap predictor which predicts the keypoint heatmap trained model to calculate the ground truth of heatmap
having the information of various facial parts, e.g., up‑ and have applied mean square error loss between pre‑
per left eyelid and nose bridge. Another module is Op‑ dicted heatmaps and ground truth. Fig. 12 shows the
tical Flow Predictor which predicts the optical low of the architectural diagram the of the keypoint heatmap pre‑
frame needed to maintain the temporal consistency in any dictor which takes the previous 5 frames and melspectro‑
video. gram of audio signal and feed it to the model to predict
Fig. 7 shows that block level diagram of how the vari‑ the heatmaps of the frames. Hourglass architecture [57]
ous features of the audio and video domain go into the is used after two layers of convolution which helps in gen‑
multimodal adaptive normalisation through the af ine pa‑ erating better facial heatmaps.
rameters i.e, scale, and a shift, .Fig. 10 shows that In the experiments, we have used the 15 channel
the video features such as predicted optical low or key‑ heatmaps and input and output sizes are (15, 96, 96).
point heatmaps and single image go into a few layers of We have done the joint training of keypoint predictor ar‑
2D convolution/2D partial convolution/2D convolution‑ chitecture along with the generator architecture and fed
attention layers to generate the corresponding ’s and ’s the output of keypoint predictor architecture in the multi‑
and then go to the normalization layers of the generator. modal adaptive normalization network to learn the af ine
Fig. 11 shows that various features such as pitch, energy parameters and have optimized it with mean square er‑
and melspectrogram go to the various layers of 1D convo‑ ror loss with the output of a pretrained model. The input
lution/LSTM to generate the corresponding ’s and ’s. of the keypoint predictor model is the previous 5 frames
along with 256 audio mel spectrogram features which are
4.3.1 Keypoint heatmap predictor concatenated along the channel axis. This is optimization
with mean square error loss with a pretrained facial key‑
The predictor model is based on Hourglass architecture point detection model.
[57] that, from the input image, estimates K heatmaps
∈ [0, 1]H×W, one for each keypoint, each of which rep‑ 2 https://github.com/raymon-tian/hourglass-facekeypoints-detection

46 47 48 49 50 51 52 53 54 55 56