based on several variants of the function F_i that represent different degrees of complexity will be discussed.

3.  NEURAL-NETWORK-BASED INTRA PREDICTORS
In our CfP response, each function F_i from (1) was given by a fully connected neural-network with three hidden layers [21, 22]. For each rectangular block of width W and height H, with W and H being integer powers of two between 4 and 32, n prediction modes were supported. The number n is equal to 35 for max(W, H) < 32 and is equal to 11 otherwise.
The neural-network-based intra-prediction modes are illustrated in Figures 3 and 4. The inputs for the prediction are the d = 2(W + H + 2) reconstructed samples r on the two lines left of and above the block, as well as the 2 × 2 corner on the top left. The dimension of the three hidden layers is equal to d. In order to improve the training and to reduce the number of parameters needed, these layers are shared by all prediction modes F_i. Their output can be interpreted as a set of features ftr ∈ R^d of the surrounding samples. In the last layer, these features are affine-linearly combined, where this combination depends on the prediction mode i.
Fig. 3 – Overview of NN-based intra-prediction

Fig. 4 – Construction of the prediction signal pred at the decoder using neural-networks. A hidden layer maps an input vector x to σ(A_hid x + b_hid), with σ being the exponential linear unit function [22]. The mode layer maps its input x to A_mode x + b_mode, which represents, up to normalization, the logarithms of the discrete probability distribution of the modes [22]. The output of hidden layer 3 on the right is the feature vector ftr ∈ R^d. The i-th output layer maps its input x to A_i x + b_i, which represents the W × H-dimensional prediction signal.
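To make the layer structure concrete, the following sketch shows how a prediction signal could be computed from the boundary samples r: three shared hidden layers of dimension d with exponential-linear-unit activations, followed by a mode-dependent affine output layer. This is a minimal NumPy illustration rather than the CfP software; the containers shared_layers and output_layers, the helper names and the function num_modes are placeholders that merely restate the dimensions and the mode-count rule given above.

    import numpy as np

    def elu(x, alpha=1.0):
        # Exponential linear unit, the activation of the hidden layers [22].
        return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

    def predict_block(r, shared_layers, output_layers, mode, W, H):
        # r             : the d = 2*(W + H + 2) reconstructed boundary samples
        # shared_layers : three (A_hid, b_hid) pairs shared by all modes
        # output_layers : per-mode (A_i, b_i) pairs, i.e. the parameters theta_i
        x = r
        for A_hid, b_hid in shared_layers:      # three hidden layers, each of dimension d
            x = elu(A_hid @ x + b_hid)
        ftr = x                                 # feature vector of the surrounding samples
        A_i, b_i = output_layers[mode]          # affine-linear combination for mode i
        return (A_i @ ftr + b_i).reshape(H, W)  # W x H prediction signal

    def num_modes(W, H):
        # 35 modes for max(W, H) < 32, and 11 otherwise.
        return 35 if max(W, H) < 32 else 11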
For the signalization of the mode index i ∈ {0, . . . , n − 1}, we used a second neural-network whose input is the same vector of reconstructed samples r as above and whose output is a conditional probability mass function p over the n modes, given the reconstructed samples r. When one of the intra-prediction modes is to be applied at the decoder, a number index ∈ {0, . . . , n − 1} needs to be parsed and the probability mass function p needs to be computed. Then the index-th most probable mode with respect to p has to be used; see Fig. 4. Here, the binarization of index is such that small values of index require fewer bins.
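The decoder-side mode derivation can be pictured with a similar sketch. Whether the mode network reuses the shared hidden layers or carries its own weights is a detail of Fig. 4 that this illustration glosses over, and the use of np.argsort to rank the modes is likewise an assumption; the essential point is only that the parsed value index selects the index-th most probable mode under p.

    import numpy as np

    def elu(x):
        # Exponential linear unit, as in the sketch above.
        return np.where(x > 0, x, np.exp(x) - 1.0)

    def mode_from_index(r, hidden_layers, A_mode, b_mode, index):
        # The mode layer yields, up to normalization, the logarithms of the
        # conditional probability mass function p over the n modes [22].
        x = r
        for A_hid, b_hid in hidden_layers:
            x = elu(A_hid @ x + b_hid)
        logits = A_mode @ x + b_mode
        ranking = np.argsort(-logits)      # modes ordered from most to least probable
        return int(ranking[index])         # the index-th most probable mode w.r.t. p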
Our signalling approach shares similarities with some of the machine-learning-based image compression approaches mentioned in the introduction. In [4], [18], [31], an arithmetic coding engine is used whose conditional probabilities are computed on the fly by a neural-network from reconstructed symbols. In our approach, however, the conditional probability p is not fed directly into the arithmetic coding engine, in order to avoid a parsing dependency.
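One way to see why there is no parsing dependency is the following sketch: the bin string of index can be written and parsed without evaluating any network, since its length depends only on index itself (small values need fewer bins), and p enters only afterwards, when index is mapped to a mode. The truncated-unary binarization shown here is an illustrative assumption; the binarization actually used in the CfP response may differ.

    def binarize_index(index, n):
        # Truncated-unary style binarization (illustrative): small values of
        # index produce fewer bins, and none of the bins depend on p.
        bins = [1] * index
        if index < n - 1:
            bins.append(0)                 # terminating bin, dropped for the last index
        return bins

    def parse_index(read_bin, n):
        # Parsing only consumes bins; no neural-network evaluation is needed.
        index = 0
        while index < n - 1 and read_bin() == 1:
            index += 1
        return index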
The parameters θ_i of the prediction modes, which correspond to the matrix coefficients and the offset-vector entries of the neural-network, were determined by attempting to minimize a specific loss function over a large set of training data. This loss function was defined as a linear combination of the ℓ_1 norm of the DCT-II transform coefficients of the prediction residual and of a sigmoid term on these coefficients. The sigmoid term has a steep slope in some neighborhood of zero, and its slope becomes smaller the farther one moves away from zero. In this way, during training by gradient descent, the prediction modes are steered towards modes for which the energy of the prediction residual is concentrated in very few transform coefficients, while most of the transform coefficients will be quantized to zero. This reflects the well-known fact that in the transform coding design of modern hybrid video codecs, it is highly beneficial in terms of rate saving if a transform coefficient can be quantized to zero; see for example [28]. In the training algorithm, all prediction modes over all block shapes were trained jointly. The parameters of the neural-network used in the mode signalization were also determined in that algorithm. In the optimization, a stochastic gradient descent approach with the Adam optimizer [15] was applied. For more details on the training algorithm, we refer to [14].
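The training objective can be illustrated by the following per-block sketch. The relative weight lam, the sigmoid scale, the shifted form of the sigmoid term and the use of SciPy's dctn for the DCT-II are illustrative assumptions; the actual loss and training procedure are described in [14].

    import numpy as np
    from scipy.fft import dctn

    def prediction_loss(block, pred, lam=1.0, scale=10.0):
        # Linear combination of the l1 norm of the DCT-II coefficients of the
        # prediction residual and a sigmoid term on those coefficients. The
        # sigmoid term is steep near zero and flattens farther away, which
        # pushes the residual energy into few transform coefficients.
        residual = block - pred
        coeffs = dctn(residual, type=2, norm='ortho')
        l1_term = np.abs(coeffs).sum()
        sigmoid_term = (1.0 / (1.0 + np.exp(-scale * np.abs(coeffs))) - 0.5).sum()
        return l1_term + lam * sigmoid_term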
The neural-network-based intra-prediction modes were integrated into software that was equivalent to the HEVC reference software anchor, with the extension that it also supported non-square partitions [22]. They were added as complementary to the HEVC intra-prediction modes. In the all-intra configuration, they gave a compression benefit of −3.01%; see [22, Table 1]. Here and in the sequel, all objective results report luma Bjøntegaard delta (BD) rates according to [5], [6]. Moreover, the standard QP parameters 22, 27, 32 and 37 are used, and the simulations are performed following the JVET common test conditions [7]. For the test sequences of [22], the neural-network prediction modes were used for approximately 50% of all intra blocks.

As reported in [22], the measured average decoding time was 248%. Throughout the paper, a conventional CPU-



