



based cluster was used for measuring decoder runtimes. No SIMD- or GPU-based optimization was applied. According to the architecture of the neural networks used, the total number of parameters needed for the prediction modes described in this section is about 5.5 million. Since all parameters were stored in 16-bit precision, this corresponds to a memory requirement of about 11 Megabyte. Our method should be compared to the method of [16], where intra-prediction modes based on fully connected layers are also trained and integrated into HEVC. While the compression benefit reported in [16] is similar to ours, its decoder runtime is significantly higher; see Table III of [16].
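As a quick, illustrative cross-check of these figures (not code from the codec), 5.5 million parameters at 16 bits each indeed occupy roughly 11 Megabyte:

```python
# Illustrative check: 5.5 million parameters stored at 16-bit precision.
num_params = 5.5e6
bytes_per_param = 2                        # 16 bits
print(num_params * bytes_per_param / 1e6)  # -> 11.0 (Megabyte)
```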
4.  PREDICTION INTO THE TRANSFORM DOMAIN
The complexity of the neural-network-based intra-prediction modes from the previous section increases with the block sizes W and H. This is particularly true for the last layer of the prediction modes, where for each output sample of the final prediction, 2 · (W + H + 2) multiplications have to be carried out and a (W · H) × (2 · (W + H + 2)) matrix has to be stored for each prediction mode.
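To make this growth concrete, the short sketch below simply evaluates the formulas stated above for a few block sizes; the sizes are chosen for illustration only:

```python
# Per-sample multiplications and last-layer matrix size, as stated above:
# 2*(W+H+2) multiplications per output sample, a (W*H) x (2*(W+H+2)) matrix per mode.
for W, H in [(8, 8), (16, 16), (32, 32)]:
    mults_per_sample = 2 * (W + H + 2)
    matrix_entries = (W * H) * mults_per_sample
    print(f"{W}x{H}: {mults_per_sample} mults/sample, {matrix_entries} entries/mode")
# 8x8:   36 mults/sample,   2304 entries/mode
# 16x16: 68 mults/sample,  17408 entries/mode
# 32x32: 132 mults/sample, 135168 entries/mode
```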
Thus, instead of predicting into the sample domain, in subsequent work [21, 14] we transformed our predictors such that they predict into the frequency domain of the discrete cosine transform DCT-II. Hence, if T is the matrix representing the DCT-II, the i-th neural-network predictor from the previous section predicts a signal pred_{i,tr} such that the final prediction signal is given as

    pred_i = T^{−1} · pred_{i,tr}.
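The relation can be illustrated with a small NumPy sketch. The orthonormal floating-point DCT-II below and the random pred_{i,tr} are stand-ins chosen for illustration; the codec itself uses integer transforms and trained predictors:

```python
import numpy as np

def dct2_matrix(n):
    """Orthonormal DCT-II matrix of size n x n (illustrative, floating point)."""
    k = np.arange(n)[:, None]
    x = np.arange(n)[None, :]
    T = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * k / (2 * n))
    T[0, :] /= np.sqrt(2.0)
    return T

W = H = 8
# One possible matrix representation T of the 2-D DCT-II acting on a vectorized W x H block.
T = np.kron(dct2_matrix(H), dct2_matrix(W))

# Stand-in for the frequency-domain output of the i-th predictor.
pred_i_tr = np.random.randn(W * H)

# Final prediction: pred_i = T^{-1} * pred_{i,tr}; for an orthonormal T, T^{-1} = T^T.
pred_i = T.T @ pred_i_tr
```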
The key point is that each prediction mode has to follow a fixed sparsity pattern: for a lot of frequency components, pred_{i,tr} is constrained to zero in that component, independent of the input. In other words, if A_{i,tr} is the matrix used in the last layer for the generation of pred_{i,tr}, then for each such frequency component, the row of the matrix A_{i,tr} corresponding to that component consists only of zeros. Thus, the entries of that row do not need to be stored and no multiplications need to be carried out in the matrix-vector product A_{i,tr} · f_tr for that row. The whole process of predicting into the frequency domain is illustrated in Figure 5.
Fig. 5 – Intra-prediction into the DCT-II domain. The white samples in the output pred_{i,tr} denote the DCT coefficients which are constrained to zero. The pattern depends on the mode i.
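A rough sketch of the resulting savings; the sparsity pattern, block size, and feature dimension below are invented for illustration and do not correspond to the trained modes:

```python
import numpy as np

W = H = 8
num_features = 2 * (W + H + 2)          # illustrative input size of the last layer

# Hypothetical sparsity pattern of mode i: only the first 20 DCT coefficients may be
# non-zero; all remaining rows of A_{i,tr} are identically zero and are not stored.
nonzero = np.arange(W * H) < 20
A_i_tr_stored = np.random.randn(int(nonzero.sum()), num_features)
f_tr = np.random.randn(num_features)

# Matrix-vector product restricted to the stored rows; the zero rows cost nothing.
pred_i_tr = np.zeros(W * H)
pred_i_tr[nonzero] = A_i_tr_stored @ f_tr

print(A_i_tr_stored.size, "stored weights instead of", W * H * num_features)
```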
In the underlying codec, the inverse transform T^{−1} is already applied to the transform coefficients c of the prediction residual res. Thus, at the decoder, one can replace the computation of T^{−1}(c) by the computation of T^{−1}(c + pred_{i,tr}). Consequently, as long as the prediction residual is non-zero, no extra inverse transform needs to be executed when passing from pred_{i,tr} to pred_i.
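The reason no extra inverse transform is needed is the linearity of T^{−1}; the self-contained sketch below uses a random orthonormal matrix as a stand-in for the inverse transform and checks the identity T^{−1}(c) + T^{−1}(pred_{i,tr}) = T^{−1}(c + pred_{i,tr}):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 64                                                 # e.g. a vectorized 8x8 block
T_inv = np.linalg.qr(rng.standard_normal((n, n)))[0]   # stand-in inverse transform

c = rng.standard_normal(n)                    # residual transform coefficients
pred_i_tr = rng.standard_normal(n)            # frequency-domain prediction

separate = T_inv @ c + T_inv @ pred_i_tr      # two inverse transforms
folded = T_inv @ (c + pred_i_tr)              # a single inverse transform
assert np.allclose(separate, folded)
```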
The weights θ_i of the neural networks involved were obtained in two steps. First, the same training algorithm as in the previous section was applied and the predictors were transformed to predict into the frequency domain. Then, using again a large set of training data, it was determined for each predictor which of its frequency components could be set to zero without significantly changing its quality on natural image content. For more details, we refer to [14].
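The second step can be pictured roughly as follows; the quality measure, the threshold, and the per-component greedy test are invented for illustration and are not the exact procedure of [14]:

```python
import numpy as np

def choose_zero_components(pred_freq, target_freq, tol=1e-3):
    """Mark frequency components whose zeroing barely changes prediction quality.

    pred_freq, target_freq: arrays of shape (num_training_blocks, num_coeffs)
    holding frequency-domain predictions and targets (illustrative setup).
    """
    base = np.mean((pred_freq - target_freq) ** 2)
    zero_mask = np.zeros(pred_freq.shape[1], dtype=bool)
    for k in range(pred_freq.shape[1]):
        trial = pred_freq.copy()
        trial[:, k] = 0.0                     # tentatively constrain component k to zero
        if np.mean((trial - target_freq) ** 2) - base < tol:
            zero_mask[k] = True
    return zero_mask
```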
As a further development, for the signalization of conventional intra-prediction modes, a mapping from neural-network-based intra-prediction modes to conventional intra-prediction modes was implemented. Via this mapping, whenever a conventional intra-prediction mode is used on a given block, neighboring blocks which use the neural-network-based prediction mode can be used for the generation of the list of most probable modes on the given block. For further details, we refer to [14].
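A simplified sketch of how such a mapping can be used when building a most-probable-mode (MPM) list; the mapping table, mode numbers, and list construction are placeholders, not the actual scheme of [14]:

```python
# Placeholder mapping from neural-network-based modes to conventional intra modes.
NN_TO_CONVENTIONAL = {0: 0, 1: 1, 2: 18, 3: 50}   # e.g. planar, DC, horizontal, vertical

def contributed_mode(neighbor):
    """Conventional mode a neighboring block contributes to the MPM list."""
    if neighbor is None:
        return 0                                   # default (e.g. planar)
    if neighbor["uses_nn_mode"]:
        return NN_TO_CONVENTIONAL[neighbor["mode"]]
    return neighbor["mode"]

def build_mpm_list(left, above, size=3):
    """Simplified MPM list: deduplicated neighbor modes padded with defaults."""
    mpm = []
    for m in [contributed_mode(left), contributed_mode(above), 0, 1, 50]:
        if m not in mpm:
            mpm.append(m)
        if len(mpm) == size:
            break
    return mpm

print(build_mpm_list({"uses_nn_mode": True, "mode": 3}, {"uses_nn_mode": False, "mode": 18}))
# -> [50, 18, 0]
```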
In an experimental setup similar to the one of the previous section, the intra-prediction modes of the present section gave a compression benefit of −3.76% luma BD-rate gain; see [14, Table 2]. Compared to the results of the previous section, these results should be interpreted as saying that the prediction into the transform domain with the associated reduction of the last layer does not yield any significant coding loss and that the mapping from neural-network-based intra-prediction modes to conventional intra-prediction modes additionally improves the compression efficiency. As reported in [14], the measured decoder runtime overhead is 147% and the measured encoder runtime overhead is 284%. Thus, from a decoder perspective, the complexity of the method has been significantly reduced. The memory requirement of the method was also reduced significantly: in the architecture from Figure 5, approximately 1 Megabyte of weights needs to be stored.
5.  MATRIX-BASED INTRA-PREDICTION MODES
In the further course of the standardization, the data-driven intra-prediction modes were again simplified, leading to matrix-based intra-prediction (MIP) modes [23, 26]. These modes were adopted into the VVC standard at the 14th JVET meeting in Geneva [9]. The complexity of the MIP modes can be described as follows. First, the number of multiplications per sample required by each MIP prediction mode is at most four and thus not higher than for the conventional intra-prediction modes, which require four multiplications per sample either due to the four-tap interpolation filter for fractional angle positions or due to PDPC.
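The multiplication bound can be illustrated with simplified dimensions (they only roughly follow the VVC MIP configuration and ignore boundary downsampling and offset handling): a reduced prediction is obtained by one matrix-vector product from a reduced boundary and then upsampled to the full block, so the count per final sample stays at or below four.

```python
# Illustrative per-sample multiplication count of a MIP-like mode:
# a reduced prediction of red_w x red_h samples is computed from red_bdry
# boundary samples, then upsampled to the full W x H block (the upsampling
# step is ignored in this simplified count).
cases = [
    # (W, H, red_w, red_h, red_bdry) -- simplified, not the exact VVC dimensions
    (4, 4, 4, 4, 4),
    (8, 8, 4, 4, 8),
    (16, 16, 8, 8, 8),
]
for W, H, red_w, red_h, red_bdry in cases:
    mults = red_w * red_h * red_bdry          # one matrix-vector product
    print(f"{W}x{H}: {mults / (W * H):.1f} multiplications per sample")
# 4x4: 4.0, 8x8: 2.0, 16x16: 2.0  -- all at most four
```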
Second, the memory requirement of the method is strongly reduced. Namely, the memory to store




