ITU Journal on Future and Evolving Technologies, Volume 2 (2021), Issue 4 – AI and machine learning solutions in 5G and future networks







            ENHANCED SHARED EXPERIENCES IN HETEROGENEOUS NETWORK WITH GENERATIVE AI


Neeraj Kumar 1,2, Ankur Narang 1, Brejesh Lall 2, Nitish Kumar Singh 3

1 Hike Private Limited, India; 2 IIT Delhi, India

NOTE: Corresponding author: Neeraj Kumar, neerajku@hike.in

          Abstract – COVID‑19 has made immersive experiences, such as video conferencing and virtual reality/augmented reality,
          the most important modes of exchanging information. Despite much advancement in network bandwidth and codec
          techniques, current systems still suffer from glitches, lags, and poor video quality, especially under unreliable network
          conditions. In this paper, we propose a video streaming pipeline that provides better video quality under erratic
          network conditions. We propose an environment where participants can interact with each other through video confer‑
          encing while sending only the audio over the network. We propose a Multimodal Adaptive Normalization (MAN)‑based architecture
          to synthesize a talking‑person video of arbitrary length using as input an audio signal and a single image of a person. The ar‑
          chitecture uses multimodal adaptive normalization, a keypoint heatmap predictor, an optical flow predictor, and class activation
          map‑based layers to learn the movements of expressive facial components, and hence generates a highly expressive talking‑head
          video of the given person. We demonstrate the effectiveness of the proposed streaming pipeline, which dynamically controls the Quality of
          Experience (QoE) as per the requirements.
          Keywords – Audio to video generation, deep learning architecture, dynamic QoE control, GAN, multimodal adaptive
          normalization, video streaming pipeline
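As a rough, hypothetical sketch (not the authors' implementation; all names, weights, and dimensions below are illustrative), multimodal adaptive normalization can be viewed as an instance normalization whose per‑channel scale and shift are predicted from the fused audio and image embeddings, so the conditioning modalities modulate the generator's feature maps:

```python
import numpy as np

def multimodal_adaptive_norm(feat, audio_emb, image_emb, w_gamma, w_beta, eps=1e-5):
    """Instance-normalize a (C, H, W) feature map, then re-scale and re-shift it
    per channel with parameters predicted from the audio and image embeddings."""
    mu = feat.mean(axis=(1, 2), keepdims=True)       # per-channel mean
    sigma = feat.std(axis=(1, 2), keepdims=True)     # per-channel std
    normed = (feat - mu) / (sigma + eps)             # zero-mean, unit-std features
    cond = np.concatenate([audio_emb, image_emb])    # fuse the two modalities
    gamma = w_gamma @ cond                           # (C,) predicted scale
    beta = w_beta @ cond                             # (C,) predicted shift
    return gamma[:, None, None] * normed + beta[:, None, None]

# Toy inputs standing in for a generator feature map and modality embeddings.
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 16, 16))
audio_emb = rng.standard_normal(32)
image_emb = rng.standard_normal(32)
w_gamma = 0.1 * rng.standard_normal((8, 64))
w_beta = 0.1 * rng.standard_normal((8, 64))
out = multimodal_adaptive_norm(feat, audio_emb, image_emb, w_gamma, w_beta)
```

In a full architecture the `w_gamma`/`w_beta` projections would be learned layers and the modulation applied at several resolutions, but the core idea is the same: the audio stream steers the normalization statistics that shape the synthesized face.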

          1.  INTRODUCTION

          The ongoing COVID‑19 pandemic has forced people to work, learn, and communicate remotely on an unprecedented scale. With more people in quarantine and isolation, the demand for low‑latency applications, such as video streaming, online games, and teleconferencing, has soared to the point that it has prompted some countries to look at ways to curb streaming data to avoid overwhelming the Internet. Several large companies have already announced that this unintended pilot of remote teleworking might become the norm.

          Immersive media is likely to further exacerbate the issues related to bandwidth and latency (even in the new‑generation 5G networks), since all next‑generation media types, whether omnidirectional (360 degree), multiview, or three‑dimensional, impose bandwidth and latency requirements that vastly surpass those of traditional media.

          With the emergence of 5G networks, an ultrafast, ultra‑reliable, high‑bandwidth‑capable edge becomes an attractive option for media service developers. For immersive media, 5G is a crucial enabling technology, since the key performance indicators stipulated by its architecture documents are essential to providing good Quality of Experience (QoE) for the users. Even with the 5G network, a videoconferencing pipeline under erratic conditions can still be challenging, and advancements will be needed to lower the latency and network bandwidth and provide a better user experience.

          A lot of work has been done on the development and optimization of novel video codecs to enhance the quality of video streaming. Various codecs have been developed to reduce the amount of streamed data while maintaining as much information as possible in the network.

          Fig. 1 – Top: Typical video streaming pipeline. In the typical system, the input video is encoded using video codecs and sent to the receiver, which decodes it in the form of a lossy reconstruction that preserves most of the video features at a pixel level. Bottom: Proposed streaming pipeline, where the audio signal is sent through a general‑purpose WebRTC DataChannel and, at the receiver side, the proposed model converts the audio into the video signal.

          In a typical system (Fig. 1), the data is first read from a video source and compressed. The compressed data is sent over a network to the receiving end, where a decoding algorithm reconstructs a representation of the original feed from the streamed data. Since most of the codecs are lossy, the reconstruction process at the receiver end does not recreate the original feed, but comes sufficiently close to the original, with some distortions. The compression techniques utilize the fact that not all the information contained within a video frame is





                                             © International Telecommunication Union, 2021