Overlays have been researched in recent years in several application areas but, to the best of the authors' knowledge, minimally in the area of 360-degree video. In [6], the authors present an image overlay system to aid procedures in computerized tomography. The system in [7] shows a way to display dynamic image overlays during surgical operations using a stereo camera and augmented reality visualization. The authors of [8] describe a method for real-time overlay insertion at a predefined location of a pre-encoded ITU-T H.264/AVC video sequence. The detection and extraction of text overlaid on top of complex video scenes and news broadcasts are studied in [9, 10]. Augmented reality-based image overlays on optical see-through displays mounted on the front glass of a car have been studied in [11]. Similar research, but performed using a Virtual Reality (VR) HMD with a video-based see-through display, is presented in [12]. Here the authors use an accelerometer to compensate for the motion-to-photon delay between the image overlay and the reality displayed on the HMD's screen, and give a method to improve the registration between the two. An implementation of a system that displays overlays in unmanned aerial vehicles is presented in [13]. An intelligent video overlay system for text advertisements is presented in [14]. Here the overlays are not placed in fixed positions; instead, they are positioned to minimize intrusiveness to the user by detecting faces, text and salient areas in the video.

Multi-viewpoint 360-degree video streaming is a relatively new area. For traditional mobile 2D video, multi-camera video remixing has been extensively researched by some of the authors of this paper; see [15, 16, 17], for example. The work in [18] presents streaming from multiple 360-degree viewpoints that capture the same scene from different angles. A challenge described by the authors is viewpoint switching and how to minimize disruption after a switch. The authors also emphasize the importance of switching prediction in order to minimize the impact on the Quality of Experience (QoE). The research in [19] focuses on low-latency multi-viewpoint 360-degree interactive video. The authors use multimodal learning and a deep reinforcement learning technique to detect events (visual, audio, text) and to predict future bandwidths, head rotation and viewpoint selection, improving media quality and reducing latency.

The present paper focuses on two of the main new features included in the second edition of the MPEG OMAF standard, namely overlays and multi-viewpoints. The newly enabled use cases are also introduced.

The structure of the remaining parts of this paper is as follows. Section 2 presents the OMAF system architecture. Section 3 focuses on the overlay capabilities and functionalities in OMAFv2. Section 4 describes how multiple omnidirectional viewpoints can be utilized in OMAFv2. Finally, section 5 concludes the paper.

2. MPEG OMAF SYSTEM ARCHITECTURE

This section introduces the general MPEG OMAF system architecture, which is depicted in Fig. 1, extracted from the draft OMAFv2 standard specification [5]. The figure shows the end-to-end content flow, from acquisition up to display/playback, for live and on-demand streaming use cases. The specification applies to projected omnidirectional video (equirectangular and cube map) as well as to fisheye video. It defines media storage and metadata signaling in the ISO Base Media File Format (ISOBMFF) [25] (i.e., interfaces F and F' in Fig. 1). It also defines media encapsulation and signaling in DASH and MMT.

OMAF also specifies audio, video, image and timed text media profiles, i.e., the interfaces E'a, E'v and E'i. All other interfaces depicted in the figure are not normatively specified. Additionally, OMAF defines different presentation profiles for viewport-independent and viewport-dependent streaming. For further details on these two concepts, the reader may refer to [22].
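As a rough illustration of this normative boundary, the sketch below collects the interfaces named above into a small lookup table. The dictionary and the helper function are this article's own hypothetical illustration, not a data structure or API defined by the OMAF specification.

```python
# Illustrative only: the OMAF interfaces named in the text above.
# Neither this table nor is_normative() comes from the OMAF spec.
OMAF_INTERFACES = {
    "F":   "ISOBMFF file produced by the content-authoring side",
    "F'":  "ISOBMFF file consumed by the OMAF player",
    "E'a": "audio media profile (decoder input)",
    "E'v": "video media profile (decoder input)",
    "E'i": "image media profile (decoder input)",
}

def is_normative(interface: str) -> bool:
    # Interfaces absent from the table (the remaining ones in
    # Fig. 1) are not normatively specified by OMAF.
    return interface in OMAF_INTERFACES

assert is_normative("F'") and not is_normative("B")
```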
Following Fig. 1, media content is initially captured. Audio is encoded using the MPEG-H 3D Audio [26] Low Complexity profile at level 1, 2 or 3, or the MPEG-4 High Efficiency AACv2 codec at Level 4 [27]. Visual content is first stitched, possibly rotated, projected and packed. Subsequently, it is encoded using the MPEG High Efficiency Video Coding (HEVC) codec, Main 10 profile at Level 5.1 [28], or the MPEG Advanced Video Coding (AVC) codec, Progressive/High profile at Level 5.1 [29]. Images are encoded using the HEVC image profile, Main 10 at Level 5.1, or as Joint Photographic Experts Group (JPEG) images [30]. The encoded streams are then placed into an ISOBMFF file for storage or encapsulated into media segments for streaming. The segments are delivered to the receiver via the DASH or MMT protocols. At the receiver side (player), the media is decapsulated, decoded with the respective decoder(s), and subsequently rendered on a display (e.g., an HMD) or loudspeakers. The head/eye tracking orientation/viewport metadata determine the user viewing orientation within
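To make the ordering of these stages concrete, the following sketch walks one video track and one audio track through the content flow of Fig. 1. All class and function names are hypothetical placeholders chosen for this illustration; each stage is reduced to a label, so the snippet shows the pipeline ordering rather than real media processing.

```python
# A schematic sketch of the OMAF end-to-end content flow (Fig. 1).
# Hypothetical names throughout; not an OMAF-defined API.
from dataclasses import dataclass, field

@dataclass
class MediaTrack:
    kind: str                       # "video" or "audio"
    codec: str                      # e.g., "HEVC Main 10 @ L5.1"
    stages: list = field(default_factory=list)

    def apply(self, stage: str) -> "MediaTrack":
        self.stages.append(stage)
        return self

def author_content() -> list:
    """Content-authoring side: acquisition up to interface F."""
    video = MediaTrack("video", "HEVC Main 10 @ L5.1")
    for stage in ("capture", "stitch", "rotate", "project", "pack", "encode"):
        video.apply(stage)
    audio = MediaTrack("audio", "MPEG-H 3D Audio LC")
    audio.apply("capture").apply("encode")
    # Interface F: ISOBMFF file for storage, or media segments for streaming.
    return [t.apply("encapsulate (ISOBMFF)").apply("segment")
            for t in (video, audio)]

def deliver(tracks: list, protocol: str = "DASH") -> list:
    assert protocol in ("DASH", "MMT")  # the two delivery options in OMAF
    return [t.apply(f"deliver ({protocol})") for t in tracks]

def play(tracks: list, viewing_orientation: tuple) -> None:
    """Receiver side (interface F'): decapsulate, decode, render.
    The head/eye-tracking metadata (here a yaw/pitch/roll tuple)
    selects the rendered viewport."""
    for track in tracks:
        track.apply("decapsulate").apply("decode").apply("render")
        print(track.kind, "->", " -> ".join(track.stages))
    print("viewport centred at yaw/pitch/roll:", viewing_orientation)

play(deliver(author_content()), viewing_orientation=(30.0, 0.0, 0.0))
```

Running the sketch prints each track's stage chain, ending with the render step driven by the head/eye-tracking orientation.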