Page 87 - ITU Journal, ICT Discoveries, Volume 3, No. 1, June 2020 Special issue: The future of video and immersive media
A STUDY OF THE EXTENDED PERCEPTUALLY WEIGHTED PEAK SIGNAL-TO-NOISE RATIO
(XPSNR) FOR VIDEO COMPRESSION WITH DIFFERENT RESOLUTIONS AND BIT DEPTHS
Christian R. Helmrich (1), Sebastian Bosse (1), Heiko Schwarz (1, 2), Detlev Marpe (1), and Thomas Wiegand (1, 3)
1 Video Coding and Analytics Department, Fraunhofer Heinrich Hertz Institute (HHI), Berlin, Germany
2 Institute for Computer Science, Free University of Berlin, Germany
3 Image Communication Group, Technical University of Berlin, Germany
Abstract – Fast and accurate estimation of the visual quality of compressed video content, particularly for
quality-of-experience (QoE) monitoring in video broadcasting and streaming, has become important. Given
the relatively poor performance of the well-known peak signal-to-noise ratio (PSNR) for such tasks, several
video quality assessment (VQA) methods have been developed. In this study, the authors’ own recent work
on an extension of the perceptually weighted PSNR, termed XPSNR, is analyzed in terms of its suitability for
objectively predicting the subjective quality of videos with different resolutions (up to UHD) and bit depths
(up to 10 bits/sample). Performance evaluations on various subjective-MOS annotated video databases and
investigations of the computational complexity in comparison with state-of-the-art VQA solutions like VMAF
and (MS-)SSIM confirm the merit of the XPSNR approach. The use of XPSNR as a reference model for visually
motivated control of the bit allocation in modern video encoders for, e.g., HEVC and VVC is outlined as well.
Keywords – Data compression, HD, HEVC, PSNR, QoE, SSIM, UHD, video coding, VMAF, VQA, VVC, WPSNR
1. INTRODUCTION

The consumption of compressed digital video content via over-the-air broadcasting or Internet Protocol (IP) based streaming services is steadily increasing. This, in turn, leads to a rapid increase in the amount of content distributed using these services. Thus, it is desirable to make use of schemes for automated monitoring of the instantaneous fidelity of the distributed audio-visual signals in order to maintain a certain degree of quality of service (QoS) or, as pursued more recently, quality of experience (QoE) [1]. With regard to the video signal part, such monitoring is realized by way of automated video quality assessment (VQA) algorithms which analyze each distributed moving-picture sequence frame-by-frame with the objective of providing a frame-wise or scene-wise estimate of the subjective visual quality of the tested video, as it would be perceived by a group of human observers. Full-reference VQA methods are generally employed, which means that the distributed video (here, the coded and decoded signal) is evaluated in relation to the spatio-temporally synchronized, uncoded reference video. In other words, the reference video represents the input sequence to the video encoder while the distributed video is the output sequence of the video decoder, as illustrated in Fig. 1.

[Block diagram; labels: Distributor Side, Consumer Side, in, out, Video Encoder, Video Decoder (twice), VQA System]
Fig. 1 – Location of automatic VQA on the video distribution side.

Given the well-known inaccuracy of the peak signal-to-noise ratio (PSNR) in predicting an average subjective judgment of perceptual coding quality [2] for a specific codec (coder-decoder) c and image or video stimulus (or simply, signal) s, various better performing models have been devised in recent years. The most widely employed are the structural similarity measure (SSIM) [3] and its multiscale extension, MS-SSIM [4], as well as a more recently proposed video multimethod assessment fusion (VMAF) approach combining several other measures using a support vector machine [5]. Further VQA metrics worth noting are [6]–[9], which account for frequency dependence in the human visual system.

Although VMAF was found to be a feasible tool for the evaluation of video coding technology [10], [11], its use for direct encoder control is challenging since it is not differentiable [12]. Furthermore, VMAF currently does not allow for local quality prediction below frame level and, owing to its reliance on several other VQA calculations, is quite complex computationally. The aspect of relatively high complexity is shared by the approach of [6]–[8], utilizing block-wise 8×8 DCTs. However, low-complexity reliable metrics which avoid the use of DCT or multiscale algorithms and which can easily be integrated into video encoders as a control model for visually optimized bit allocation purposes, as is the case with PSNR and SSIM based approaches, are highly desirable.

1.1 Prior work by the authors

In JVET-H0047 [13] the authors proposed a block-wise perceptually weighted distortion model as an improve-
© International Telecommunication Union, 2020