Fig. 20 – Top: Actual frames of a speaker from the GRID data set. Middle: Predicted frames from the proposed method, with keypoints obtained from the keypoint predictor. Bottom: Predicted frames from the FOMM method [80]

7.3.1  Network analysis in multimodal adaptive normalization
We performed an ablation study on three architectures for extracting video features, namely 2D convolution, partial 2D convolution [81, 82] and 2D convolution + Efficient Channel Attention (ECA) [83], and on two architectures for extracting audio features, namely 1D convolution and LSTM, as shown in Fig. 10 and Fig. 11, to study their effect on multimodal adaptive normalization with the optical flow predictor in the proposed method. Table 6 shows that 2DConv+ECA+LSTM improves the reconstruction metrics (SSIM, PSNR and CPBD) as well as the word error rate and blinks/sec compared to the other networks. Image quality drops with the use of partial 2D convolution, which indicates that, since the predicted optical flow is dense, holes in the optical flow have spatial relations to other regions that the other networks capture better.
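For concreteness, the following is a minimal PyTorch sketch of the ECA block [83] used in the 2DConv+ECA video feature extractor. It follows the published ECA-Net design (global average pooling followed by a 1D convolution across channels); its exact placement inside the encoder of Fig. 10 is an assumption here, not taken from the paper.

import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: channel gates from a 1D conv over pooled features."""
    def __init__(self, kernel_size=3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size,
                              padding=(kernel_size - 1) // 2, bias=False)

    def forward(self, x):               # x: (B, C, H, W) video feature map
        y = self.pool(x)                # (B, C, 1, 1) per-channel descriptor
        y = self.conv(y.squeeze(-1).transpose(-1, -2))        # conv across channels: (B, 1, C)
        y = torch.sigmoid(y).transpose(-1, -2).unsqueeze(-1)  # (B, C, 1, 1) gates
        return x * y                    # reweight channels of the feature map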
Table 6 – Ablation study of different networks for multimodal adaptive normalization on the GRID data set

Method                 SSIM↑  PSNR↑  CPBD↑  blinks/sec  WER↓
2DConv+1DConv          0.875  28.65  0.261    0.35      25.6
Partial2DConv+1DConv   0.803  28.12  0.256    0.15      29.4
2DConv+ECA+1DConv      0.880  29.11  0.263    0.42      23.9
2DConv+LSTM            0.896  29.25  0.086    0.260     24.1
Partial2DConv+LSTM     0.823  28.12  0.258    0.12      28.3
2DConv+ECA+LSTM        0.908  29.78  0.272    0.45      23.7
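As a reference point for the reconstruction metrics in Tables 6 and 7, SSIM and PSNR can be computed per frame and averaged over a video, e.g. with scikit-image; a minimal sketch follows (CPBD, blinks/sec and WER require dedicated estimators and are not shown).

import numpy as np
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def video_ssim_psnr(real_frames, fake_frames):
    """Mean SSIM/PSNR over aligned (H, W, 3) uint8 frame pairs."""
    scores = [
        (structural_similarity(r, f, channel_axis=-1),
         peak_signal_noise_ratio(r, f))
        for r, f in zip(real_frames, fake_frames)
    ]
    return np.mean(scores, axis=0)  # (mean SSIM, mean PSNR)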


7.3.2  Incremental effect of multimodal adaptive normalization

We study the incremental effect of multimodal adaptive normalization in the proposed model with the Optical Flow Predictor (OFP) and the 2DConv+ECA+LSTM combination on the GRID data set. Table 7 shows the impact of successively adding the predicted optical flow and melspectrogram features, then pitch, and then energy to multimodal adaptive normalization. The base model consists of the generator and discriminator architecture with a static image in the adaptive normalization.

Table 7 – Incremental study of multimodal adaptive normalization on the GRID data set

Method                   SSIM↑  PSNR↑  CPBD↑  blinks/sec  WER↓
Base Model (BM)          0.776  27.99  0.213    0.02      57.9
BM+OFP+mel               0.878  28.43  0.244    0.38      27.4
BM+OFP+mel+pitch         0.881  28.57  0.264    0.41      24.1
BM+OFP+mel+pitch+energy  0.908  29.78  0.272    0.45      23.7
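While the exact layer design is given in the paper's architecture figures, the core idea of multimodal adaptive normalization can be sketched as a conditional normalization layer whose per-channel scale and shift are regressed from a fused code of the modalities above (melspectrogram, pitch, energy and predicted optical flow). The PyTorch sketch below is an illustrative assumption, not the paper's exact implementation; how the per-modality encoders are fused into cond is left to the caller.

import torch
import torch.nn as nn

class MultimodalAdaptiveNorm(nn.Module):
    """Conditional normalization: scale/shift of normalized activations
    are predicted from a fused multimodal conditioning code."""
    def __init__(self, num_features, cond_dim):
        super().__init__()
        self.norm = nn.InstanceNorm2d(num_features, affine=False)
        self.gamma = nn.Linear(cond_dim, num_features)  # per-channel scale
        self.beta = nn.Linear(cond_dim, num_features)   # per-channel shift

    def forward(self, x, cond):
        # x: (B, C, H, W) generator activations
        # cond: (B, cond_dim) fused code from the audio (mel/pitch/energy)
        #       and motion (optical flow) encoders -- a hypothetical layout
        h = self.norm(x)
        g = self.gamma(cond)[..., None, None]
        b = self.beta(cond)[..., None, None]
        return h * (1 + g) + b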
8.   PSYCHOPHYSICAL ASSESSMENT

Results are visually rated (on a scale of 5) individually by 25 persons on three aspects: lip synchronization, eye blinks and eyebrow raises, and video quality, on the GRID data set. The subjects were shown anonymous videos at the same time for the different audio clips for side-by-side comparison. Table 8 clearly shows that the proposed MAN-based architecture performs significantly better in quality and lip synchronization, which are of prime importance in videos.

Table 8 – Psychophysical evaluation (in percentages) based on user ratings on the GRID data set

Method           Lip-Sync↑  Eye-blink↑  Quality↑
MAN                91.8        90.5       79.6
OneShotA2V [4]     90.8        88.5       76.2
RSDGAN [42]        92.8        90.2       74.3
Speech2Vid [39]    90.7        87.7       72.2
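The paper does not state how the 5-point ratings map to the percentages in Table 8; one plausible aggregation, shown purely as an illustrative assumption, averages the ratings per aspect and rescales them to 0-100.

import numpy as np

def ratings_to_percent(ratings):
    """ratings: (num_raters, num_aspects) array of 1-5 scores.
    Returns a 0-100 score per aspect (hypothetical mapping)."""
    return 100.0 * np.asarray(ratings).mean(axis=0) / 5.0

# e.g. 25 raters x 3 aspects (lip-sync, eye-blink, quality):
# percent = ratings_to_percent(scores)  # values comparable in scale to Table 8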




9.   TURING TEST

To test the naturalism of the generated videos, we conducted an online Turing test on the GRID data set⁴. Each test consists of 20 questions with 10 fake and 10 real videos. The user is asked to label a video as real or fake based on the aesthetics and naturalism of the video. Responses were collected from approximately 300 users, and the distribution of their scores for spotting fake videos is shown in Fig. 21.

Fig. 21 – Distribution of user scores for the online Turing test

4 https://forms.gle/DM1DRcTToQFvUpTa7