Page 47 - ITU Journal Future and evolving technologies Volume 2 (2021), Issue 4 – AI and machine learning solutions in 5G and future networks
Dropout [12] is typically used to reduce overfitting, but in a batch-normalized network it can be either removed or reduced in strength, and the network then generalizes better. Batch normalization also reduces the need for photometric distortions: because batch-normalized networks train faster and observe each training example fewer times, the trainer can focus on more "real" images by distorting them less.

Equation (4) gives the batch-normalized output: the input (x_1, ⋯, x_m) is used to calculate the mean (Equation (1)) and the variance (Equation (2)), which are in turn used to obtain the normalized output (x̂_1, ⋯, x̂_m) (Equation (3)). Normalization is needed because the distribution-invariance assumption is not satisfied at the local level; without normalization, the model has to run more steps for its parameters to adapt. The scale (γ) and shift (β) in Equation (4) give the flexibility to work with the normalized input, or with a scaled and shifted version of it if needed, thus increasing the representation power.

    μ = (1/m) ∑_{i=1}^{m} x_i                      (1)

    σ² = (1/m) ∑_{i=1}^{m} (x_i − μ)²              (2)

    x̂_i = (x_i − μ) / √(σ² + ε)                    (3)

    y_i = γ x̂_i + β                                (4)

2.5.2   Variants of normalization

Variants of normalization have been used to capture various kinds of information such as style, texture, shape, etc. Instance Normalization (IN) [13] is a representative approach, introduced to discard instance-specific contrast information from an image during style transfer. Inspired by this, adaptive instance normalization [14] provided a rational interpretation that IN performs a form of style normalization, showing that by simply adjusting the feature statistics, namely the mean and variance, of a generator network, one can control the style of the generated image. However, IN dilutes the information carried by the global statistics of feature responses while leaving only their spatial configuration, which can be undesirable depending on the task at hand and the information encoded by a feature map. To handle this, Batch-Instance Normalization (BIN) [15] normalizes styles adaptively to the task and selectively for individual feature maps: it learns to control how much of the style information is propagated through each channel of features by leveraging a learnable gate parameter. For style transfer across domains, UGATIT [16] has used adaptive instance and Layer Normalization (LN) [17], which adjusts the ratio of IN and LN to control how much style is transferred from one domain to another.

For style transfer tasks, a popular methodology is to tie the denormalization to a learned affine transformation that is parameterized by a separate input image (the style image). SPADE [18] makes this denormalization spatially sensitive: SPADE normalization boils down to "conditional batch normalization which varies on a per-pixel basis". World-consistent video-to-video synthesis [19] uses optical features and semantic maps in the normalization to learn the affine parameters and generate realistic, temporally smoother videos.

We have proposed multimodal adaptive normalization to incorporate the higher-order statistics of multimodal features (image and audio) through the affine parameters of normalization, i.e. the scale (γ) and shift (β).

3.   RELATED WORK

There have been many years of research on video codecs for various applications, such as the AV1 [20] and VVC [21] codecs. Researchers are working on improving these codecs using machine learning techniques, either through end-to-end approaches or by working on specific parts of video streaming pipelines.

In one of the approaches, face detection/mesh extraction [22, 23, 24, 25] and body pose tracking [26, 27, 28], covering both 3D and 2D meshes and generally based on neural networks, are used to encode the video streams, which are then sent over the data channel. The final video is reconstructed at the receiver side from the body pose along with the mesh, which keeps the video streaming pipeline usable in erratic network conditions.

There has also been work on video compression and reconstruction based on facial landmarks [29, 30], which is promising at extremely low bitrates but has not demonstrated real-time conferencing capabilities.

3.1   Audio to realistic video generation

The earliest methods for generating videos relied on Hidden Markov Models, which captured the dynamics of audio and video sequences. Simons and Cox [31] used the Viterbi algorithm to calculate the most likely sequence of mouth shapes given a particular utterance. Such methods cannot generate high-quality videos and lack emotional expression.

3.1.1   Phoneme and viseme generation of videos

Phoneme- and viseme-based approaches have also been used to generate videos. Real-Time Lip Sync for Live 2D Animation [32] used an LSTM-based approach to generate live lip synchronization for 2D character animation.
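The batch-normalization transform of Equations (1)-(4) can be illustrated with a short NumPy sketch (the function name and the value of ε are illustrative choices, not taken from the paper):

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch-normalize a mini-batch x of shape (m, features).

    Implements Equations (1)-(4): per-feature mean and variance are
    computed over the batch, each example is standardized, and the
    learnable scale (gamma) and shift (beta) are applied.
    """
    mu = x.mean(axis=0)                    # Equation (1): batch mean
    var = x.var(axis=0)                    # Equation (2): batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # Equation (3): normalize
    return gamma * x_hat + beta            # Equation (4): scale and shift

x = np.random.randn(8, 4) * 3.0 + 2.0      # mini-batch of 8 examples
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
# With gamma = 1 and beta = 0, the output has (approximately) zero mean
# and unit variance per feature, regardless of the input statistics.
```

With γ = √(σ² + ε) and β = μ, the transform can recover the identity mapping, which is exactly the representational flexibility the text describes.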
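As a concrete sketch of the adaptive instance normalization [14] idea from Section 2.5.2 (a paraphrase of the technique, not the authors' code): each content channel is standardized with its own statistics, then the affine parameters are taken from the statistics of a separate style input.

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive instance normalization (AdaIN) sketch.

    content, style: feature maps of shape (channels, height, width).
    Each content channel is instance-normalized with its own mean/std,
    then rescaled with the style channel's mean/std -- i.e. the affine
    parameters of the normalization come from the style input.
    """
    c_mu = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True)
    s_mu = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True)
    normalized = (content - c_mu) / (c_std + eps)  # instance-normalize content
    return s_std * normalized + s_mu               # adopt style statistics

content = np.random.randn(3, 16, 16)
style = np.random.randn(3, 16, 16) * 2.0 + 5.0
out = adain(content, style)
# The output keeps the content's spatial configuration but its per-channel
# mean and std now match those of the style features.
```

The multimodal adaptive normalization proposed in this paper follows the same pattern, except that the scale and shift are predicted from image and audio features rather than copied from a style image's statistics.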



                                             © International Telecommunication Union, 2021                    31