Fig. 7 – Multimodal adaptive normalization residual architecture


Fig. 8 – Class activation map layer architecture in generator

Fig. 9 – Higher level architecture of multimodal adaptive normalization

(Fig. 8).

4.2 Multimodal adaptive normalization

Fig. 9 shows the higher-level architectural design of multimodal adaptive normalization. The affine parameters, i.e. a scale γ and a shift β, are typically used to learn the higher-order statistics of image features corresponding to style, texture, etc. and to generate the required output, as depicted in various previous works [13, 56, 18, 16, 19, 15]. We are the first to propose how affine parameters help to learn the higher-order statistics of multiple domains. The respective affine parameters γ and β are dynamically controlled by learnable parameters λ, whose sum is constrained to 1 by the softmax function (Equation (6)). The idea behind using multimodal adaptive normalization is that the various features in the multimodal domain are correlated, and multimodal adaptive normalization opens a non-trivial path to capturing the mutual dependence between the domains. Generally, various encoder architectures [42] are used to convert the features of the different domains into latent vectors, and the concatenated vectors are then fed to the decoder to model the mutual dependence and generate the required output. The proposed multimodal adaptive normalization reduces the number of model parameters required to incorporate this multimodal mutual dependence into the architecture.

Multimodal adaptive normalization uses the pitch, energy and Audio Melspectrogram Features (AMF) (Figure 11) from the audio domain, and the static image and Optical Flow (OF)/facial Keypoints Heatmap (KH) features from the video domain (Figure 10), to compute the different affine parameters in the normalization. Multimodal adaptive normalization also gives the flexibility of using various architectures, namely 2D convolution, partial convolution and attention models for the video-related features, and 1D convolution and LSTM layers for the audio features, as shown in Table 6.
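The concrete extractor modules are introduced in Section 4.3 and Table 6; purely as an illustration of the signals listed above, the sketch below computes a pitch contour, frame-wise energy, a mel spectrogram and dense optical flow with librosa and OpenCV. The file names, sampling rate and Farnebäck parameters are placeholder assumptions and not the authors' implementation.

```python
import cv2
import librosa

# Audio-side features: pitch contour, frame-wise energy and a mel spectrogram
# (a stand-in for the AMF). File names and parameters are placeholders.
wav, sr = librosa.load("speaker.wav", sr=16000)
pitch = librosa.yin(wav, fmin=65, fmax=400, sr=sr)              # f0 per frame
energy = librosa.feature.rms(y=wav)[0]                          # RMS energy per frame
mel = librosa.power_to_db(librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=80))

# Video-side feature: dense optical flow between two consecutive frames.
cap = cv2.VideoCapture("speaker.mp4")
_, frame1 = cap.read()
_, frame2 = cap.read()
gray1 = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
gray2 = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)
flow = cv2.calcOpticalFlowFarneback(gray1, gray2, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)   # (H, W, 2) motion field

print(pitch.shape, energy.shape, mel.shape, flow.shape)
```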
Equation (5) shows the combined equation of the multimodal adaptive normalized output, where x_IN is the instance-normalized input, with mean and variance calculated across batch and channel. The various γ's and β's are modeled and linearly combined, and the parameters λ are used to weight these terms (Equation (6)). The values of the λ's are constrained to the range [0, 1] by the softmax function (Equation (6)).

x_{MAN} = \lambda_1(\gamma_{AMF}\, x_{IN} + \beta_{AMF}) + \lambda_2(\gamma_{OF/KH}\, x_{IN} + \beta_{OF/KH}) + \lambda_3(\gamma_{energy}\, x_{IN} + \beta_{energy}) + \lambda_4(\gamma_{pitch}\, x_{IN} + \beta_{pitch}) + \lambda_5(\gamma_{img}\, x_{IN} + \beta_{img})    (5)

\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 + \lambda_5 = 1    (6)
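To make Equations (5) and (6) concrete, here is a minimal PyTorch sketch of a multimodal adaptive normalization layer, under the simplifying assumption that each modality's feature has already been encoded into a single vector and that one linear layer predicts its channel-wise γ and β; the paper instead uses the 2D/partial convolution, attention, 1D convolution and LSTM encoders listed in Table 6.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultimodalAdaptiveNorm(nn.Module):
    """Sketch of Eq. (5)-(6): instance-normalise x, then mix per-modality
    affine modulations (gamma, beta) with softmax-constrained weights lambda."""

    def __init__(self, num_channels: int, feat_dims: dict):
        super().__init__()
        # feat_dims maps a modality name (e.g. "amf", "pitch") to the size of
        # its already-encoded feature vector -- an assumption for this sketch.
        self.instance_norm = nn.InstanceNorm2d(num_channels, affine=False)
        self.affine = nn.ModuleDict({
            name: nn.Linear(dim, 2 * num_channels) for name, dim in feat_dims.items()
        })
        # unnormalised mixing weights; softmax keeps them in [0, 1], summing to 1
        self.lambda_logits = nn.Parameter(torch.zeros(len(feat_dims)))
        self.names = list(feat_dims.keys())

    def forward(self, x: torch.Tensor, feats: dict) -> torch.Tensor:
        # x: (B, C, H, W) activation map, feats[name]: (B, feat_dims[name])
        x_in = self.instance_norm(x)
        lam = F.softmax(self.lambda_logits, dim=0)            # Eq. (6)
        out = torch.zeros_like(x_in)
        for i, name in enumerate(self.names):
            gamma, beta = self.affine[name](feats[name]).chunk(2, dim=1)
            gamma = gamma[..., None, None]                    # (B, C, 1, 1)
            beta = beta[..., None, None]
            out = out + lam[i] * (gamma * x_in + beta)        # Eq. (5)
        return out

# toy usage with the five modalities named in the text (dimensions are made up)
dims = {"amf": 128, "of_kh": 256, "pitch": 16, "energy": 16, "image": 256}
man = MultimodalAdaptiveNorm(num_channels=64, feat_dims=dims)
x = torch.randn(2, 64, 32, 32)
feats = {name: torch.randn(2, d) for name, d in dims.items()}
y = man(x, feats)          # same shape as x: (2, 64, 32, 32)
```

Learning the λ logits directly and passing them through a softmax is one simple way to satisfy the sum-to-one constraint of Equation (6).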
4.3 Feature extractor modules

This section consists of various feature extractor modules which extract the various features such as pitch, energy