4.3  Capture & Extraction function

Kirari! for Arena processes the captured video to isolate (extract) only the objects from an arbitrary background in real time, creating the video stream that is the source of the virtual image. For this purpose, the Kirari! Integrated Real-time Image Extraction System (KIRIE) was developed [1]. KIRIE can finely extract objects from 4K video with arbitrary background images in real time. The performance of the extraction process is illustrated in Fig. 7.

Fig. 7 – Extraction process

First, KIRIE classifies the input images from the camera into binary values by background difference as the initial extraction.
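As an illustration of this step, the following is a minimal sketch of background-difference binarization, assuming NumPy images; the Euclidean color distance and the threshold value are illustrative assumptions, not KIRIE's actual parameters.

import numpy as np

def background_difference(frame: np.ndarray,
                          background: np.ndarray,
                          threshold: float = 30.0) -> np.ndarray:
    """Initial extraction: classify each pixel as foreground (True)
    or background (False) by its color distance to a pre-captured
    background image. frame, background: H x W x 3 uint8 RGB.
    threshold is a hypothetical cutoff, not KIRIE's actual value."""
    diff = frame.astype(np.float32) - background.astype(np.float32)
    dist = np.sqrt((diff ** 2).sum(axis=2))  # per-pixel RGB distance
    return dist > threshold                  # H x W boolean mask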
Artificial Intelligence (AI) is used to correct the extraction errors of the initial extraction [26]. More specifically, the pixels are classified by a Neural Network (NN). The NN is a 10-layer convolutional network with a 6-dimensional input (the RGB values of the foreground and background pixels) and a 2-dimensional output (the posterior probabilities of being foreground and background). The NN is trained in advance on pixel pairs of the foreground and background images at each coordinate position. To avoid repeatedly evaluating the NN during the error correction process, a Look-Up Table (LUT) was generated by quantizing each of the RGB values to 6 bits. The LUT considerably reduced the computation load and helped to achieve real-time processing.
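The LUT idea can be sketched as follows. The trained network is represented by an assumed callable nn_posterior (not KIRIE's actual model), and the table is filled lazily (memoized) rather than fully precomputed, which is an implementation assumption: with 6-bit quantization each channel takes 64 values, so a full table over a (foreground, background) RGB pair would have 64^6 entries, whereas a lazy cache only materializes the combinations that actually occur.

def quantize6(rgb) -> tuple:
    """Quantize 8-bit RGB values to 6 bits (drop the 2 low bits)."""
    return tuple(int(v) >> 2 for v in rgb)

class PixelClassifierLUT:
    """Caches NN decisions for quantized (foreground RGB, background
    RGB) pairs so each distinct quantized input is evaluated once."""

    def __init__(self, nn_posterior):
        self.nn_posterior = nn_posterior  # assumed: 6-tuple -> P(fg)
        self.lut = {}                     # 6-tuple -> bool

    def is_foreground(self, fg_rgb, bg_rgb) -> bool:
        key = quantize6(fg_rgb) + quantize6(bg_rgb)
        if key not in self.lut:
            self.lut[key] = self.nn_posterior(key) >= 0.5
        return self.lut[key]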


Then, to refine the contours of the objects, KIRIE reclassifies pixels as the correct foreground or background by referring to the colors and labels in their vicinity [27][28]. Finally, the object regions are extracted from the original color image by masking the other regions in accordance with the classification result.
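A rough sketch of the refinement and the final masking is given below; the majority vote among color-similar 3 x 3 neighbours is a generic stand-in for the methods of [27][28], and color_tol is a hypothetical parameter.

import numpy as np

def refine_contours(frame: np.ndarray, mask: np.ndarray,
                    color_tol: float = 20.0) -> np.ndarray:
    """Reclassify pixels on the mask boundary by the majority label of
    color-similar pixels in their 3x3 neighbourhood (a stand-in for
    the refinement of [27][28], not the actual algorithm)."""
    h, w = mask.shape
    refined = mask.copy()
    frame_f = frame.astype(np.float32)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            nb = mask[y-1:y+2, x-1:x+2]
            if nb.all() or (~nb).all():
                continue  # neighbourhood is uniform: not on a contour
            colors = frame_f[y-1:y+2, x-1:x+2].reshape(9, 3)
            similar = np.linalg.norm(colors - frame_f[y, x], axis=1) <= color_tol
            labels = nb.reshape(9)[similar]
            refined[y, x] = labels.mean() >= 0.5  # majority vote
    return refined

def extract_object(frame: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Final step: keep the object regions, zero out the rest."""
    return frame * mask[:, :, None].astype(frame.dtype)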
Compared with the conventional method, the extraction error of the object extraction has improved by 29–48% [26].

4.4  Measurement & Tracking function

This function measures the 3D positions of the objects at the event site and tracks their positions as they move. It uses a Laser imaging Detection And Ranging (LiDAR) device to measure the positions [29][30], track the objects, and output the object labels along with the 3D position information.
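References [29][30] define the actual measurement and tracking pipeline; purely as a self-contained illustration, the sketch below keeps object labels attached to 3D positions across frames with a simple nearest-neighbour association, where max_jump is a hypothetical gating parameter.

import numpy as np

class NearestNeighborTracker:
    """Illustrative tracker: each new LiDAR measurement is associated
    with the closest previously tracked object, or starts a new track.
    (Generic stand-in, not the algorithm of [29][30].)"""

    def __init__(self, max_jump: float = 1.0):
        self.max_jump = max_jump  # assumed max movement per frame (m)
        self.tracks = {}          # label -> (x, y, z) position
        self._next_label = 0

    def update(self, detections) -> dict:
        """detections: iterable of (x, y, z) positions for one frame.
        Returns the updated {label: position} mapping."""
        updated, remaining = {}, dict(self.tracks)
        for pos in detections:
            pos = np.asarray(pos, dtype=float)
            if remaining:
                label, prev = min(remaining.items(),
                                  key=lambda kv: np.linalg.norm(kv[1] - pos))
                if np.linalg.norm(prev - pos) <= self.max_jump:
                    updated[label] = pos
                    del remaining[label]
                    continue
            updated[self._next_label] = pos  # new object enters the scene
            self._next_label += 1
        self.tracks = updated  # unmatched tracks are dropped
        return updated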
4.5  Information Integration & Transport function

This function integrates information such as the object extraction result from the Capture & Extraction function and the 3D position information of the objects from the Measurement & Tracking function. The integrated data is transported to the Depth Expression & Presentation function with time synchronization and low latency. This realizes a real-time synchronized display of video from four directions.

Position information of objects is transported by a special profile of the MMT protocol tailored for ILE. This profile is designed to transport metadata such as the position information of objects and lighting control signals [31][32]. The video and audio, encoded and decoded by a low-latency HEVC/MMT codec system [33], are also transported synchronously with the metadata by MMT. A video of 3840 × 2160 pixels at 59.94 fps was encoded by HEVC at a bitrate of 10 Mbps. The syntax of the metadata is defined by ITU-T H.430.4 [5].
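The metadata syntax itself is specified in ITU-T H.430.4 [5] and is not reproduced here. Purely to illustrate the kind of payload involved, the sketch below frames a timestamped (label, position) record for a byte stream; JSON and the length-prefix framing are arbitrary choices, not the standard's format.

import json
import struct
import time

def pack_position_metadata(object_label: int,
                           position_xyz: tuple,
                           timestamp: float = None) -> bytes:
    """Serialize one object-position metadata unit (illustrative only;
    NOT the ITU-T H.430.4 syntax)."""
    payload = json.dumps({
        "label": object_label,
        "timestamp": timestamp if timestamp is not None else time.time(),
        "position": list(position_xyz),
    }).encode("utf-8")
    # Length-prefixed framing so units can be concatenated in a stream.
    return struct.pack("!I", len(payload)) + payload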
4.6  Depth Expression & Presentation function

This function adds depth expression to the extracted object images in accordance with the 3D position information, and the processed images are displayed on the four-sided display device.
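As a loose illustration of a depth cue driven by the 3D position information (an assumption, not the system's actual depth-expression method), an extracted object can be drawn smaller as its measured distance grows, with reference_m a hypothetical calibration constant:

def depth_scaled_size(base_px: int, distance_m: float,
                      reference_m: float = 10.0) -> int:
    """Perspective-style depth cue: an object at the reference distance
    keeps its captured size; farther objects are drawn smaller.
    (Illustrative only; reference_m is a hypothetical constant.)"""
    return max(1, round(base_px * reference_m / max(distance_m, 1e-6)))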
In general, when shooting large events such as sports competitions and music concerts for live public screening, the cameras are placed so as not to interfere with the audience, usually at a different height (Fig. 8, left). Therefore, if the camera images are displayed without any processing to match the viewpoint of the audience, the objects are



