ITU Journal on Future and Evolving Technologies, Volume 2 (2021), Issue 4 – AI and machine learning solutions in 5G and future networks

ENHANCED SHARED EXPERIENCES IN HETEROGENEOUS NETWORK WITH GENERATIVE AI

Neeraj Kumar¹,², Ankur Narang¹, Brejesh Lall², Nitish Kumar Singh³
¹Hike Private Limited, India, ²IIT Delhi, India
NOTE: Corresponding author: Neeraj Kumar, neerajku@hike.in

Abstract – COVID-19 has made immersive experiences such as video conferencing and virtual reality/augmented reality
the most important modes of exchanging information. Despite many advances in network bandwidth and codec
techniques, current systems still suffer from glitches, lags and poor video quality, especially under unreliable network
conditions. In this paper, we propose a video streaming pipeline that provides better video quality under erratic
network conditions. We propose an environment where participants can interact with each other through video
conferencing by sending only the audio over the network. We propose a Multimodal Adaptive Normalization (MAN)-based
architecture to synthesize a talking-person video of arbitrary length using as input an audio signal and a single image
of a person. The architecture uses multimodal adaptive normalization, a keypoint heatmap predictor, an optical flow
predictor and class activation map-based layers to learn the movements of expressive facial components, and hence
generates a highly expressive talking-head video of the given person. We demonstrate the effectiveness of the proposed
streaming pipeline, which dynamically controls the Quality of Experience (QoE) as per the requirements.
Keywords – Audio to video generation, deep learning architecture, dynamic QoE control, GAN, multimodal adaptive
normalization, video streaming pipeline

1. INTRODUCTION

The ongoing COVID-19 pandemic has forced people to work, learn, and communicate remotely on an
unprecedented scale. With more people in quarantine and isolation, the demand for low-latency
applications, such as video streaming, online games, and teleconferencing, has soared to the point
that it has prompted some countries to look at ways to curb streaming data to avoid overwhelming
the Internet. Several large companies have already announced that this unintended pilot on remote
teleworking might become the norm.

Immersive media is likely to further exacerbate the issues related to bandwidth and latency (even
in the new generation 5G networks), since all next-generation media types, whether omnidirectional
(360 degree), multiview or three-dimensional, impose bandwidth and latency requirements that vastly
surpass those of traditional media.

With the emergence of 5G networks, an ultrafast, ultra-reliable, high-bandwidth edge becomes an
attractive option for media service developers. For immersive media, 5G is a crucial enabling
technology, since the targeted key performance indicators stipulated by its architecture documents
are essential to providing good Quality of Experience (QoE) for users. Even with a 5G network,
a videoconferencing pipeline under erratic conditions can still be challenging, and advancements
are still needed to lower latency and network bandwidth use and to provide a better user experience.

A lot of work has been done on the development and optimization of novel video codecs to enhance
the quality of video streaming. Various codecs have been developed to reduce the amount of streamed
data while maintaining as much information as possible in the network.

Fig. 1 – Top: Typical video streaming pipeline. In the typical system, the input video is encoded
using video codecs and sent to the receiver, which decodes it in the form of a lossy reconstruction
that preserves most of the video features at a pixel level. Bottom: Proposed streaming pipeline,
where the audio signal is sent through a general-purpose WebRTC DataChannel and, at the receiver
side, the proposed model converts the audio into the video signal.

In a typical system (Fig. 1), the data is first read from a video source and compressed. The
compressed data is sent over a network to the receiving end, where a decoding algorithm reconstructs
a representation of the original feed from the streamed data. Since most codecs are lossy, the
reconstruction process at the receiver end does not recreate the original feed but a version
sufficiently close to it, with some distortions. The compression techniques utilize the fact that not
all the information contained within a video frame is
