4.3 Capture & Extraction function

Kirari! for Arena processes the captured video to isolate (extract) only the object in real time from an arbitrary background, creating the video stream that is the source of the virtual image. For this purpose, the Kirari! Integrated Real-time Image Extraction System (KIRIE) was developed [1]. KIRIE can finely extract objects in real time from 4K video with arbitrary background images. The extraction process is illustrated in Fig. 7.

Fig. 7 – Extraction process

First, as the initial extraction, KIRIE classifies the pixels of the input camera images into binary foreground/background values by background difference.
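The listing below is a minimal sketch of this initial step: a per-pixel color difference against a pre-captured background plate, thresholded into a binary mask. The function name and threshold value are illustrative assumptions; the paper does not specify them.

    import numpy as np

    def initial_extraction(frame: np.ndarray, background: np.ndarray,
                           threshold: float = 30.0) -> np.ndarray:
        """Initial foreground mask by background difference.

        frame, background: H x W x 3 uint8 RGB images of equal size.
        threshold: assumed RGB-distance threshold (not given in the paper).
        Returns a boolean H x W mask, True where a pixel is foreground.
        """
        diff = frame.astype(np.float32) - background.astype(np.float32)
        dist = np.sqrt((diff ** 2).sum(axis=2))  # per-pixel RGB distance
        return dist > threshold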
Artificial Intelligence (AI) is then used to correct the errors of the initial extraction [26]. More specifically, the pixels are classified by a Neural Network (NN): a 10-layer convolutional network with a 6-dimensional input (the RGB values of the foreground and background pixels) and a 2-dimensional output (the posterior probabilities of being foreground and background). The NN is trained in advance on the pixel pairs of the foreground and background images at each coordinate position. To avoid repeatedly evaluating the NN during the error correction process, a Look-Up Table (LUT) was generated with a quantization of 6 bits for each of the RGB values. The LUT considerably reduces the computation load and helps to achieve real-time processing.
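As a rough illustration of the LUT idea, the sketch below precomputes the network's foreground posterior for every quantized (input RGB, background RGB) pair, so that per-frame classification reduces to table lookups. Here nn_posterior is a placeholder for the trained 10-layer network, which is not reproduced in the paper; note that the full 6-bit depth implies a table with (2**6)**6 entries, so a coarser depth may be preferred for small-scale experiments.

    import numpy as np

    def build_lut(nn_posterior, bits: int = 6) -> np.ndarray:
        """Tabulate foreground posteriors over all quantized 6-D inputs.

        nn_posterior: stand-in for the trained network; maps an (N, 6)
                      array in [0, 1] (input RGB + background RGB) to
                      N foreground posteriors.
        bits: quantization depth per channel (the system uses 6 bits,
              which yields a very large table; use fewer bits to test).
        """
        levels = 1 << bits
        grids = np.meshgrid(*[np.arange(levels)] * 6, indexing="ij")
        pairs = np.stack([g.ravel() for g in grids], axis=1) / (levels - 1)
        return nn_posterior(pairs).reshape((levels,) * 6)

    def correct_extraction(lut: np.ndarray, frame: np.ndarray,
                           background: np.ndarray, bits: int = 6) -> np.ndarray:
        """Classify each pixel by LUT lookup instead of an NN forward pass."""
        shift = 8 - bits                    # quantize 8-bit channels to `bits`
        f = (frame >> shift).astype(np.intp)
        b = (background >> shift).astype(np.intp)
        post = lut[f[..., 0], f[..., 1], f[..., 2],
                   b[..., 0], b[..., 1], b[..., 2]]
        return post > 0.5                   # True where foreground

For example, build_lut(nn, bits=3) gives an 8**6-entry table (about 2 MB as float64) that can be probed per pixel in fully vectorized form.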
Then, to refine the contours of the objects, KIRIE reclassifies the pixels as the correct foreground or background by referring to the colors and labels in their vicinity [27][28]. Finally, the object regions are extracted from the original color image by masking the other regions in accordance with the classification result.

Compared with the conventional method, the extraction error is reduced by 29–48% [26].

4.4 Measurement & Tracking function

This function measures the 3D positions of the objects at the event site and tracks their positions as they move. It uses a Laser imaging Detection And Ranging (LiDAR) device to measure the positions [29][30], track the objects, and output the object labels along with the 3D position information.
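The exact output format of this function is not specified in the text; the record below is a minimal sketch of what it could emit per object per measurement, with all field names assumed.

    from dataclasses import dataclass

    @dataclass
    class TrackedObject:
        """One output record of the Measurement & Tracking function:
        an object label plus its 3D position (field names are assumptions)."""
        label: str         # e.g. "player_07"
        x_m: float         # position in metres, in the event-site frame
        y_m: float
        z_m: float
        timestamp_us: int  # measurement time, for downstream synchronization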
4.5 Information Integration & Transport function

This function integrates information such as the object extraction result from the Capture & Extraction function and the 3D position information of the object from the Measurement & Tracking function. The integrated data is transported to the Depth Expression & Presentation function with time synchronization and low latency. This realizes a real-time synchronized display of video from four directions.

The position information of the objects is transported by a special profile of the MMT protocol tailored for ILE. This profile is designed to transport metadata such as position information of objects and lighting control signals [31][32]. The video and audio, encoded and decoded by a low-latency HEVC/MMT codec system [33], are also transported by MMT in synchronization with the metadata. A video of 3840 × 2160 pixels at 59.94 fps was encoded by HEVC at a bit rate of 10 Mbps. The syntax of the metadata is defined in ITU-T H.430.4 [5].
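The metadata syntax itself is specified in ITU-T H.430.4 and is not reproduced in this paper; the snippet below is only a hypothetical illustration of packing one object-position sample into a binary payload that an MMT asset could carry.

    import struct

    # Hypothetical record layout: 2-byte object id, three float32
    # coordinates (metres), 64-bit microsecond timestamp, big-endian.
    # The actual syntax is defined in ITU-T H.430.4.
    _FMT = ">H3fQ"

    def pack_position(object_id: int, x_m: float, y_m: float,
                      z_m: float, timestamp_us: int) -> bytes:
        """Serialize one position sample for transport as MMT metadata."""
        return struct.pack(_FMT, object_id, x_m, y_m, z_m, timestamp_us)

    def unpack_position(payload: bytes):
        """Inverse of pack_position."""
        return struct.unpack(_FMT, payload)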
4.6 Depth Expression & Presentation function

This function adds the depth expression to the extracted object images in accordance with the 3D position information, and the processed images are displayed on the four-sided display device.
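The text does not detail how the depth expression is computed; one simple reading, sketched below under a pinhole-projection assumption, is to scale each extracted object image in inverse proportion to its distance from the virtual viewpoint.

    def depth_scale(object_distance_m: float,
                    reference_distance_m: float) -> float:
        """Scale factor for an extracted object image under a pinhole
        model: objects beyond the reference distance shrink on screen,
        nearer ones grow. This model is an assumption; the paper only
        states that depth expression follows the 3D position information."""
        return reference_distance_m / object_distance_m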
In general, when shooting large events such as sports competitions and music concerts for live public screening, the cameras are placed so as not to interfere with the audience, usually at a different height (Fig. 8 left). Therefore, if the camera images are displayed without any processing to match the viewpoint of the audience, the objects are