based on several variants of the function F_i that represent different degrees of complexity will be discussed.
3. NEURAL-NETWORK-BASED INTRA PREDICTORS
In our CfP response, each function F_i from (1) was given by a fully connected neural network with three hidden layers [21, 22]. For each rectangular block of width W and height H, with W and H being integer powers of two between 4 and 32, n prediction modes were supported.
The number n is equal to 35 for max(W, H) < 32 and is equal to 11 otherwise.

The neural-network-based intra-prediction modes are illustrated in Figures 3 and 4. The input for the prediction consists of the d = 2(W + H + 2) reconstructed samples r on the two lines left of and above the block, as well as the 2 × 2 corner on the top left. The dimension of the three hidden layers is equal to d. In order to improve the training and to reduce the number of parameters needed, these layers are shared by all prediction modes F_i. Their output can be interpreted as a set of features ftr ∈ R^d of the surrounding samples. In the last layer, these features are affine-linearly combined, where this combination depends on the prediction mode i.
Fig. 3 – Overview of NN-based intra prediction.

Fig. 4 – Construction of the prediction signal pred at the decoder using neural networks. A hidden layer maps an input vector x to σ(A_hid x + b_hid), with σ being the exponential linear unit function [22]. The mode layer maps its input x to A_mode x + b_mode, which represents, up to normalization, the logarithms of the discrete probability distribution of the modes [22]. The output of hidden layer 3 on the right is the feature vector ftr ∈ R^d. The i-th output layer maps its input x to A_i x + b_i, which represents the W × H-dimensional prediction signal.
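To make the architecture concrete, the following is a minimal sketch in PyTorch. It is not the authors' implementation; the class and layer names, the absence of any input normalization and the flattened output are our own assumptions. It shows the three shared hidden layers of dimension d with exponential-linear-unit activations and one affine output layer per prediction mode, as in Fig. 4.

```python
import torch
import torch.nn as nn

class NNIntraPredictor(nn.Module):
    """Sketch of the prediction network: shared trunk + per-mode affine heads."""
    def __init__(self, W: int, H: int):
        super().__init__()
        d = 2 * (W + H + 2)               # two reference lines left/above plus the 2x2 corner
        n = 35 if max(W, H) < 32 else 11  # number of supported prediction modes
        self.shared = nn.Sequential(      # three hidden layers shared by all modes F_i
            nn.Linear(d, d), nn.ELU(),
            nn.Linear(d, d), nn.ELU(),
            nn.Linear(d, d), nn.ELU(),
        )
        self.heads = nn.ModuleList(       # i-th output layer: x -> A_i x + b_i
            [nn.Linear(d, W * H) for _ in range(n)]
        )

    def forward(self, r: torch.Tensor, mode: int) -> torch.Tensor:
        ftr = self.shared(r)              # feature vector ftr in R^d
        return self.heads[mode](ftr)      # W*H-dimensional prediction signal
```

For an 8 × 8 block, NNIntraPredictor(8, 8) maps the d = 36 reconstructed samples to 64 predicted samples; sharing the trunk across all modes is what keeps the parameter count close to that of a single-mode network.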
For the signalization of the mode index i ∈ {0, . . . , n − 1}, we used a second neural network whose input is the same vector of reconstructed samples r as above and whose output is a conditional probability mass function p over the n modes, given the reconstructed samples r. When one of the intra-prediction modes is to be applied at the decoder, a number index ∈ {0, . . . , n − 1} needs to be parsed and the probability mass function p needs to be computed. Then the index-th most probable mode with respect to p has to be used; see Fig. 4. Here, the binarization of index is such that small values of index require fewer bins.
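The signalization network can be sketched in the same style. Its internal shape below is an assumption on our part; the paper specifies only that the input is r and that the mode layer outputs, up to normalization, the logarithms of the mode probabilities.

```python
import torch
import torch.nn as nn

class ModeSignalization(nn.Module):
    """Sketch of the second network: reconstructed samples r -> p(mode | r)."""
    def __init__(self, d: int, n: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d, d), nn.ELU(),
            nn.Linear(d, n),   # mode layer: A_mode x + b_mode (log-probabilities up to normalization)
        )

    def forward(self, r: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.net(r), dim=-1)  # conditional probability mass function p

def mode_from_index(p: torch.Tensor, index: int) -> int:
    # The decoder parses `index` and applies the index-th most probable
    # mode w.r.t. p; small indices get short binarizations (fewer bins).
    order = torch.argsort(p, descending=True)
    return int(order[index])
```

Since the parsed symbol is a rank into the sorted probabilities rather than a probability itself, the bins of index can be entropy-decoded without evaluating the network; p is needed only afterwards, to map index to a mode.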
Our signalling approach shares similarities with some of the machine-learning-based image compression approaches mentioned in the introduction. In [4], [18], [31], an arithmetic coding engine is used whose conditional probabilities are computed on the fly by a neural network from reconstructed symbols. In our approach, however, the conditional probability p is not fed directly into the arithmetic coding engine, in order to avoid a parsing dependency.
The parameters θ_i of the prediction modes, which correspond to the matrix coefficients and the offset-vector entries of the neural network, were determined by attempting to minimize a specific loss function over a large set of training data. This loss function was defined as a linear combination of the ℓ1 norm of the DCT-II transform coefficients of the prediction residual and of a sigmoid term on these coefficients. The sigmoid term has a steep slope in some neighborhood of zero, and its slope becomes smaller farther away from zero. In this way, during training by gradient descent, the prediction modes are steered towards modes for which the energy of the prediction residual is concentrated in very few transform coefficients, while most of the transform coefficients will be quantized to zero. This reflects the well-known fact that in the transform coding design of modern hybrid video codecs, it is highly beneficial in terms of rate saving if a transform coefficient can be quantized to zero; see, for example, [28].
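The following is a sketch of a loss of this type; the orthonormal DCT-II is written as a matrix product, and the weights alpha and beta as well as the sigmoid steepness t are placeholders rather than the values used in [14].

```python
import math
import torch

def dct2_matrix(N: int) -> torch.Tensor:
    # Orthonormal DCT-II transform matrix of size N x N.
    k = torch.arange(N).unsqueeze(1).float()
    m = torch.arange(N).unsqueeze(0).float()
    C = torch.cos(math.pi * (m + 0.5) * k / N)
    C[0, :] *= math.sqrt(1.0 / N)
    C[1:, :] *= math.sqrt(2.0 / N)
    return C

def sparsity_loss(pred: torch.Tensor, target: torch.Tensor,
                  alpha: float = 1.0, beta: float = 1.0, t: float = 0.1) -> torch.Tensor:
    res = target - pred                             # prediction residual, shape (H, W)
    Ch = dct2_matrix(res.shape[0])
    Cw = dct2_matrix(res.shape[1])
    coeff = Ch @ res @ Cw.T                         # separable 2-D DCT-II coefficients
    l1 = coeff.abs().sum()                          # l1 norm of the transform coefficients
    sigmoid = torch.sigmoid(coeff.abs() / t).sum()  # steep near zero, flat far away
    return alpha * l1 + beta * sigmoid
```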
In the training algorithm, all prediction modes over all block shapes were trained jointly. The parameters of the neural network used in the mode signalization were also determined in that algorithm. In the optimization, a stochastic gradient descent approach with the Adam optimizer [15] was applied. For more details on the training algorithm, we refer to [14].
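Putting the pieces together, one plausible form of the joint training loop, reusing the sketches above, is shown below. The rule by which a block is assigned to a mode during training and the cross-entropy term for the signalization network are our assumptions; the actual procedure is described in [14].

```python
import torch

W = H = 8
n = 35                                    # max(W, H) < 32
predictor = NNIntraPredictor(W, H)        # from the sketches above
signaler = ModeSignalization(d=2 * (W + H + 2), n=n)
opt = torch.optim.Adam(
    list(predictor.parameters()) + list(signaler.parameters()), lr=1e-4
)

# Dummy stand-in data: pairs of (reconstructed context r, original block).
training_pairs = [(torch.randn(2 * (W + H + 2)), torch.randn(H, W)) for _ in range(4)]

for r, block in training_pairs:
    # Assumption: each block is assigned to the mode with the smallest loss,
    # and the signalization network is trained to predict that mode.
    losses = torch.stack(
        [sparsity_loss(predictor(r, i).view(H, W), block) for i in range(n)]
    )
    best = int(losses.argmin())
    loss = losses[best] - torch.log(signaler(r)[best])  # prediction + signalization cost
    opt.zero_grad()
    loss.backward()
    opt.step()
```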
The neural-network-based intra-prediction modes were integrated into a software base equivalent to the HEVC reference software anchor, extended so that it also supported non-square partitions [22]. They were added as complementary to the HEVC intra-prediction modes. In the all-intra configuration, they gave a compression benefit of −3.01%; see [22, Table 1]. Here and in the sequel, all objective results report luma Bjøntegaard delta (BD) rates according to [5], [6]. Moreover, the standard QP parameters 22, 27, 32 and 37 are used, and the simulations are performed following the JVET common test conditions [7]. For the test sequences of [22], the neural-network prediction modes were used for approximately 50% of all intra blocks.

As reported in [22], the measured average decoding time was 248%. Throughout the paper, a conventional CPU-