
1. PSNR ‑ Peak Signal to Noise Ratio:  It computes the peak signal to noise ratio between two images. The higher the PSNR, the better the quality of the reconstructed image.


2. SSIM ‑ Structural Similarity Index:  It is a perceptual metric that quantifies image quality degradation. The larger the value, the better the quality of the reconstructed image.

3. CPBD ‑ Cumulative Probability of Blur Detection:  It is a perceptual, no‑reference objective image sharpness metric based on the cumulative probability of detecting blur in the image.
4. WER ‑ Word Error Rate:  It is a metric to evaluate the performance of speech recognition on a given video. We have used the LipNet architecture [74], which is pretrained on the GRID data set, to evaluate the WER. On the GRID data set, LipNet achieves 95.2 percent accuracy, surpassing experienced human lip readers.

5. ACD ‑ Average Content Distance ([44]):  It is used for the identification of speakers from the generated frames using OpenPose [73]. We have calculated the cosine distance and the Euclidean distance between the OpenPose representations of the generated image and the actual image (computed as in the sketch following Equation (17) below). The distance threshold for the OpenPose model should be 0.02 for the cosine distance and 0.20 for the Euclidean distance [77]. The smaller the distances, the more similar the generated and actual images.
6. LMD ‑ Landmark Distance ([76]):  We use this metric to ensure realistic and accurate lip movement and, in turn, good performance on speech recognition. We calculate the landmark points [78] on both the real and the generated images at a scale of 256×256 and use the lip‑region points, i.e., points 49‑68, calling them LR and LF respectively. LR refers to the lip region of the ground‑truth image and LF corresponds to the lip region of the generated/fake image. T is the number of frames. Then, we calculate the Euclidean distance between each corresponding pair of landmarks on LR and LF (see the computation sketch after Equation (17) below). The LMD is defined as:

LMD = (1/T) · (1/P) · Σ_{t=1}^{T} Σ_{p=1}^{P} ||LR_{t,p} − LF_{t,p}||        (16)

where P is the number of lip‑region landmark points.

7. Blinks/sec:  To capture the blinks in the video, we calculate the blinks per second so that we can better understand the quality of the animated videos. Fig. 15 shows the 6 eye points which are used to calculate the Eye Aspect Ratio (EAR) given in Equation (17). We have used an SVM and eye landmarks along with the Eye Aspect Ratio (EAR), as in Real‑Time Eye Blink Detection using Facial Landmarks [75], to detect the blinks in a video.

Fig. 15 – Description of the 6 eye points

EAR = ( ||p2 − p6|| + ||p3 − p5|| ) / ||p1 − p4||        (17)
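To make the distance‑based metrics above concrete, the following is a minimal NumPy sketch of the ACD distances, the LMD of Equation (16) and the EAR of Equation (17). It assumes that landmark points and OpenPose feature vectors have already been extracted; the function and argument names are illustrative and not part of the proposed system.

import numpy as np

def content_distances(real_feat, fake_feat):
    # Cosine and Euclidean distances between the OpenPose representations
    # of the actual and generated images (ACD check; thresholds 0.02 and
    # 0.20 respectively, as noted above).
    r = np.asarray(real_feat, dtype=np.float64).ravel()
    f = np.asarray(fake_feat, dtype=np.float64).ravel()
    cosine = 1.0 - np.dot(r, f) / (np.linalg.norm(r) * np.linalg.norm(f))
    euclidean = np.linalg.norm(r - f)
    return cosine, euclidean

def lmd(lr_landmarks, lf_landmarks):
    # Landmark Distance, Equation (16). Inputs have shape (T, P, 2): the
    # lip-region points (49-68 of a 68-point face model) for the ground
    # truth (LR) and generated (LF) frames at 256x256 scale.
    lr = np.asarray(lr_landmarks, dtype=np.float64)
    lf = np.asarray(lf_landmarks, dtype=np.float64)
    T, P = lr.shape[0], lr.shape[1]
    dists = np.linalg.norm(lr - lf, axis=-1)   # per-pair Euclidean distance, shape (T, P)
    return dists.sum() / (T * P)

def eye_aspect_ratio(eye_points):
    # Eye Aspect Ratio, Equation (17), from the 6 eye points p1..p6 of Fig. 15.
    p1, p2, p3, p4, p5, p6 = (np.asarray(p, dtype=np.float64) for p in eye_points)
    return (np.linalg.norm(p2 - p6) + np.linalg.norm(p3 - p5)) / np.linalg.norm(p1 - p4)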
7.1.4   Training and inference

Our model is implemented in PyTorch and takes approximately 7 days to train on 4 Nvidia V100 GPUs. In the training stage, the model is trained with a multiscale frame discriminator using the adversarial loss (Equation (6)), the class activation map‑based loss (Equation (14)) and the feature matching loss (Equation (12)). The generator is trained with the adversarial loss (Equation (6)), the class activation map‑based loss (Equation (14)), the reconstruction loss (Equation (11)), the perceptual loss (Equation 5.4) and the key‑point predictor/optical‑flow‑based mean square error loss, which together ensure the generation of natural‑looking, high‑quality frames.
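As a rough illustration of how these generator loss terms can be combined in a single update, the following is a minimal PyTorch sketch under stated assumptions: the call signatures of gen, disc and perceptual, the binary cross‑entropy form of the adversarial term, and the loss weights are placeholders for illustration only and are not taken from the paper.

import torch
import torch.nn.functional as F

# Illustrative loss weights; the exact values are not stated in the paper.
LAMBDA = {"adv": 1.0, "rec": 10.0, "perc": 10.0}

def generator_step(gen, disc, perceptual, audio, identity_frame, real_frames, optimizer):
    # gen (generator), disc (multiscale frame discriminator) and perceptual
    # (feature-extraction network) are assumed torch.nn.Module instances.
    fake_frames = gen(audio, identity_frame)

    # Adversarial term (Equation (6)); a standard GAN loss is used here as
    # a stand-in for the paper's exact formulation.
    scores = disc(fake_frames)
    adv = F.binary_cross_entropy_with_logits(scores, torch.ones_like(scores))

    # Reconstruction term (Equation (11)): pixel-wise L1 against the real frames.
    rec = F.l1_loss(fake_frames, real_frames)

    # Perceptual term: L1 distance in a feature space.
    perc = F.l1_loss(perceptual(fake_frames), perceptual(real_frames))

    # The CAM-based loss (Equation (14)) and the key-point predictor /
    # optical-flow MSE loss would be weighted and added here in the same way.
    loss = LAMBDA["adv"] * adv + LAMBDA["rec"] * rec + LAMBDA["perc"] * perc

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()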
We have used the Adam optimizer [79] with learning rate = 0.002, β1 = 0.0 and β2 = 0.90 for the generator and the discriminators.
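This setting maps directly onto torch.optim.Adam; a minimal sketch, assuming the generator and discriminator modules have already been constructed:

import torch

# Adam [79] with learning rate 0.002, beta1 = 0.0 and beta2 = 0.90, as above.
# generator and discriminator are assumed to be torch.nn.Module instances.
g_optimizer = torch.optim.Adam(generator.parameters(), lr=0.002, betas=(0.0, 0.90))
d_optimizer = torch.optim.Adam(discriminator.parameters(), lr=0.002, betas=(0.0, 0.90))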
7.2  Implementation results

7.2.1   Quantitative results

Tables 1, 2, 3 and 4 compare the proposed method with its competitors and show better SSIM, PSNR, CPBD, Word Error Rate (WER), blinks/sec and LMD on the GRID [68], Crema‑D [70], GRID‑Lombard [69] and VoxCeleb2 [71] data sets, suggesting highly expressive and realistic video synthesis. The proposed method shows superior results on most of the metrics on all the mentioned data sets.

7.2.2   QoE metric

We have computed the QoE metric for the various data sets using Equation (15). For our experiments we have used 25 fps for synthesizing the video. Table 5 shows the QoE metric for the various data sets when synthesizing video from audio using the proposed method. The higher the QoE metric, the better the model. We can dynamically control the QoE based on the needs of the video conferencing session and during erratic network conditions.


