ITU Journal on Future and Evolving Technologies, Volume 2 (2021), Issue 4 – AI and machine learning solutions in 5G and future networks




          Table 1 – Comparison of the proposed methods (MAN‑keypoint and MAN‑optical) with previous works on the GRID data set

          Method            SSIM↑   PSNR↑    CPBD↑   WER↓    ACD‑C↓   ACD‑E↓       blinks/sec   LMD↓
          FOMM[80]          0.833   26.72    0.214   38.21   0.004    0.088        0.56         0.718
          OneShotA2V[4]     0.881   28.571   0.262   27.5    0.005    0.09         0.15         0.91
          RSDGAN[42]        0.818   27.100   0.268   23.1    ‑        1.47×10⁻⁴    0.39         ‑
          Speech2Vid[39]    0.720   22.662   0.255   58.2    0.007    1.48×10⁻⁴    ‑            ‑
          ATVGnet[48]       0.83    32.15    ‑       ‑       ‑        ‑            ‑            1.29
          X2face[43]        0.80    29.39    ‑       ‑       ‑        ‑            ‑            1.48
          CascadedGAN[46]   0.81    27.1     0.26    23.1    ‑        1.47×10⁻⁴    0.45         ‑
          MAN‑optical       0.908   29.78    0.272   23.7    0.005    1.41×10⁻⁴    0.45         0.77
          MAN‑keypoint      0.887   29.01    0.269   25.2    0.006    1.41×10⁻⁴    0.48         0.80

          Table 2 – Comparison of the proposed methods (MAN‑keypoint and MAN‑optical) with previous works on the CREMA‑D data set

          Method            SSIM↑   PSNR↑    CPBD↑   WER↓    ACD‑C↓   ACD‑E↓       blinks/sec   LMD↓
          FOMM[80]          0.654   20.74    0.186   NA      0.007    0.12         ‑            1.041
          OneShotA2V[4]     0.773   24.057   0.184   NA      0.006    0.96         ‑            0.632
          RSDGAN[42]        0.700   23.565   0.216   NA      ‑        1.40×10⁻⁴    ‑            ‑
          Speech2Vid[39]    0.700   22.190   0.217   NA      0.008    1.73×10⁻⁴    ‑            ‑
          MAN‑optical       0.826   27.723   0.224   NA      0.004    1.62×10⁻⁴    ‑            0.592
          MAN‑keypoint      0.841   28.01    0.228   NA      0.003    1.38×10⁻⁴    ‑            0.51

          Table 3 – Comparison of the proposed methods (MAN‑keypoint and MAN‑optical) with previous works on the GRID Lombard data set

          Method            SSIM↑   PSNR↑    CPBD↑   WER↓    ACD‑C↓   ACD‑E↓   blinks/sec   LMD↓
          FOMM[80]          0.804   22.97    0.381   NA      0.003    0.078    0.37         1.09
          OneShotA2V[4]     0.922   28.978   0.453   NA      0.002    0.064    0.1          0.61
          Speech2Vid[39]    0.782   26.784   0.406   NA      0.004    0.069    ‑            0.581
          MAN‑optical       0.895   26.94    0.43    NA      0.001    0.048    0.21         0.588
          MAN‑keypoint      0.931   29.62    0.492   NA      0.001    0.046    0.31         0.563

          Table 4 – Comparison of the proposed methods (MAN‑keypoint and MAN‑optical) with previous works on the VoxCeleb2 data set

          Method            SSIM↑   PSNR↑    CPBD↑   WER↓    ACD‑C↓   ACD‑E↓   blinks/sec   LMD↓
          OneShotA2V[4]     0.698   20.921   0.103   NA      0.011    0.096    0.05         0.72
          MAN‑optical       0.714   21.94    0.118   NA      0.008    0.067    0.21         0.65
          MAN‑keypoint      0.732   22.41    0.126   NA      0.004    0.058    0.28         0.47



          Table 5 – Average QoE of the proposed methods

          Method                         QoE↑
          MAN‑optical (GRID)             1.232
          MAN‑keypoint (GRID)            1.074
          MAN‑optical (GRID‑Lombard)     0.624
          MAN‑keypoint (GRID‑Lombard)    1.20
          MAN‑optical (CREMA‑D)          0.797
          MAN‑keypoint (CREMA‑D)         0.860
          MAN‑optical (VoxCeleb2)        -0.595
          MAN‑keypoint (VoxCeleb2)       -0.472

          7.2.3   Qualitative results

          Expressive aspect: Fig. 16 displays the lip‑synchronized frames of a speaker saying the words 'bin' and 'please', as well as the blinking of the eyes. Fig. 17 shows the comparison of the proposed model with previous work [52], where the proposed model shows better image reconstruction and lip synchronization. The generated videos are available online³.

          Fig. 16 – Top: the speaker saying the word 'bin'. Middle: the speaker saying the word 'please'. Bottom: the speaker blinking his eyes.

          ³ https://sites.google.com/view/itu2021




          © International Telecommunication Union, 2021