ITU Journal on Future and Evolving Technologies, Volume 2 (2021), Issue 4 – AI and machine learning solutions in 5G and future networks
Dropout [12] is typically used to reduce overfitting, but in a batch-normalized network it can be either removed or reduced in strength while still helping the network generalize better. Batch normalization also reduces the need for photometric distortions: because batch-normalized networks train faster and observe each training example fewer times, we let the trainer focus on more "real" images by distorting them less.

Equation (4) is the batch-normalized output, with input $(x_1, \dots, x_m)$ used to calculate the mean (Equation (1)) and variance (Equation (2)), which are in turn used to obtain the normalized output $(\hat{x}_1, \dots, \hat{x}_m)$ (Equation (3)). The need for normalization arises because the distribution-invariance assumption is not satisfied at the local level; without normalization, the model has to run more steps for its parameters to adapt. The scale ($\gamma$) and bias ($\beta$) in Equation (4) give the flexibility to work with the normalized input, or with a scaled and shifted version of it if needed, thus increasing the representation power.

$$\mu = \frac{1}{m}\sum_{i=1}^{m} x_i \qquad (1)$$

$$\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu)^2 \qquad (2)$$

$$\hat{x}_i = \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} \qquad (3)$$

$$y_i = \gamma \hat{x}_i + \beta \qquad (4)$$

2.5.2 Variants of normalization

Variants of normalization have been used to capture various information such as style, texture, shape, etc. Instance Normalization (IN) [13] is a representative approach which was introduced to discard instance-specific contrast information from an image during style transfer. Inspired by this, adaptive instance normalization [14] provided a rational interpretation that IN performs a form of style normalization, showing that by simply adjusting the feature statistics, namely the mean and variance of a generator network, one can control the style of the generated image. IN dilutes the information carried by the global statistics of feature responses while leaving only their spatial configuration, which can be undesirable depending on the task at hand and the information encoded by a feature map. To handle this, Batch-Instance Normalization (BIN) [15] normalizes the styles adaptively to the task and selectively to individual feature maps. It learns to control how much of the style information is propagated through each channel of features, leveraging a learnable gate parameter. For style transfer across domains, UGATIT [16] has used adaptive instance and Layer Normalization (LN) [17], which adjusts the ratio of IN and LN to control how much style transfers from one domain to other domains. For style transfer tasks, a popular methodology is tying the denormalization to a learned affine transformation that is parameterized by a separate input image (the style image). SPADE [18] makes this denormalization spatially sensitive: SPADE normalization boils down to "conditional batch normalization which varies on a per-pixel basis". In world-consistent video-to-video synthesis [19], optical features and semantic maps are used in the normalization to learn the affine parameters and generate realistic, temporally smoother videos.

We have proposed multimodal adaptive normalization to incorporate the higher-order statistics of multimodal features (image and audio) through the affine parameters of normalization, i.e. scale ($\gamma$) and shift ($\beta$).

3. RELATED WORK

There have been many years of research on video codecs for various applications, such as the AV1 [20] and VVC [21] codecs. Researchers are working on improving these codecs using machine learning techniques, either through end-to-end approaches or by working on specific parts of video streaming pipelines.

In one of the approaches, face detection/mesh extraction [22, 23, 24, 25] and body pose tracking [26, 27, 28], focusing on both 3D and 2D meshes and generally based on neural networks, are used to encode the video streams sent over the data channel. The final video is then reconstructed at the receiver side using the body pose along with the mesh, which makes the video streaming pipelines robust in erratic network conditions.

There has been some work on video compression and reconstruction based on facial landmarks [29, 30], which is promising at extremely low bitrates but did not demonstrate real-time conferencing capabilities.

3.1 Audio to realistic video generation

The earliest methods for generating videos relied on Hidden Markov Models, which captured the dynamics of audio and video sequences. Simons and Cox [31] used the Viterbi algorithm to calculate the most likely sequence of mouth shapes given particular utterances. Such methods are not capable of generating quality videos and lack emotions.

3.1.1 Phoneme and viseme generation of videos

Phoneme- and viseme-based approaches have been used to generate videos. Real-Time Lip Sync for Live 2D Animation [32] used an LSTM-based approach to generate live lip synchronization on 2D character animation.
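The batch normalization of Equations (1)-(4) above can be checked with a minimal NumPy sketch; the mini-batch shape, the value of $\epsilon$, and the function name are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch-normalize a mini-batch x of shape (m, features), following
    Equations (1)-(4): batch mean, batch variance, normalize, then
    scale by gamma and shift by beta."""
    mu = x.mean(axis=0)                    # Equation (1): batch mean
    var = x.var(axis=0)                    # Equation (2): batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # Equation (3): normalized output
    return gamma * x_hat + beta            # Equation (4): scale and shift

# Mini-batch whose statistics are deliberately far from zero mean / unit variance.
x = np.random.randn(32, 8) * 3.0 + 5.0
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
# With gamma = 1 and beta = 0, the output has (approximately) zero mean
# and unit variance per feature, whatever the input statistics were.
```

This illustrates why the learnable $\gamma$ and $\beta$ matter: with other values the network can recover any scaled and shifted version of the normalized activations.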
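The adaptive instance normalization idea of [14] discussed in Section 2.5.2 — normalizing content features with their own per-channel instance statistics, then re-scaling and re-shifting with the style features' statistics — can be sketched as follows (the tensor shapes and function name are illustrative assumptions):

```python
import numpy as np

def adain(content, style, eps=1e-5):
    """Adaptive instance normalization: the style features' per-channel
    mean and std play the role of (gamma, beta) in Equation (4).
    Both inputs have shape (channels, height, width)."""
    c_mu = content.mean(axis=(1, 2), keepdims=True)
    c_std = content.std(axis=(1, 2), keepdims=True)
    s_mu = style.mean(axis=(1, 2), keepdims=True)
    s_std = style.std(axis=(1, 2), keepdims=True)
    normalized = (content - c_mu) / (c_std + eps)  # instance normalization
    return s_std * normalized + s_mu               # denormalize with style stats

content = np.random.randn(4, 16, 16)
style = np.random.randn(4, 16, 16) * 2.0 + 1.0
out = adain(content, style)
# The output keeps the content's spatial configuration but its per-channel
# mean and std now match those of the style features.
```

This is exactly the sense in which "simply adjusting the feature statistics" controls the style of the generated image.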
© International Telecommunication Union, 2021 31