
2024 ITU Kaleidoscope Academic Conference




3.  IMPLEMENTATION WITH EXPERIMENTAL DETAILS

3.1   Dataset and preprocessing

The ICVL Hand Gesture Dataset, comprising depth images from Intel with annotations for 16 hand joints, was used for training and testing. It provides 22069 training frames and 1600 testing frames. A second dataset, the Hand Gesture Recognition Dataset [10], containing a total of 24000 images of 20 different gestures, was also employed. Each gesture directory holds 900 training images and 300 testing images (20 × 1200 = 24000). Table 1 shows sample images from the dataset.

Table 1 – Dataset Sample Images

In the preprocessing stage for gestures, the first step is to resize the input images to a fixed size. This standardization ensures uniformity in the input data fed into the model and reduces computational complexity and memory requirements during both training and inference. Next, normalization is applied to scale the pixel values of the images to a standardized range, typically [0, 1]. Normalization stabilizes the training process and aids convergence by ensuring that the input data has a consistent scale. Noise reduction is then applied to mitigate the impact of noise and artifacts present in the input images: a median filter is used to smooth out irregularities and enhance image clarity. In a typical home environment, noise reduction improves the quality of the input data, making it easier for the model to extract the features and patterns associated with different hand gestures.

Finally, data augmentation is applied to increase the diversity and robustness of the training dataset. Augmentation methods such as rotation, scaling, translation, and flipping introduce variations to the training data, enabling the model to generalize better to unseen variations in hand gestures and environmental conditions. Additionally, cropping the Region of Interest (ROI), such as the palm of the hand, from the input images helps focus the model's attention on the area relevant to gesture classification. By sequentially applying these preprocessing methods, the input images are prepared for feature extraction and classification by the corresponding modules of the gesture recognition system.

3.2   Attention-based CNN model

The model configuration parameters, including learning rate, batch size, and number of epochs, were set first. The data was organized into separate training and testing sets. To ensure uniform sequence lengths, padding or truncation was applied. Following this, a 3D attention-based CNN model was applied to extract the feature tensors. The model produced a set of sixteen 3D vectors describing how the hand was positioned in space. These vectors encode where each part of the hand is, such as the fingers and the center of the palm, and how the parts are moving.

The array of 3D vectors served as the input to the subsequent model, which was designed to classify the hand gestures. By examining the small variations and spatial dynamics encoded in these vectors, the model was able to identify patterns and correlations unique to each gesture. It analyzed the interrelation between different joints and their spatial orientations, allowing accurate identification and classification of diverse hand gestures. The model distinguished subtle differences between gestures by considering the relative positioning, movement trajectories, and spatial interactions of the hand joints.

At the core of the enhanced architecture lies a feature extraction layer with an attention mechanism that integrates Convolutional 2D, Batch Normalization, ReLU, Residual, and MaxPooling 2D operations. This combination is designed to capture spatial-temporal patterns and to provide a robust foundation for subsequent processing. The layer uses Convolutional 2D operations to detect hierarchical features, Batch Normalization to stabilize and accelerate training, ReLU to introduce non-linearity, Residual connections to overcome vanishing-gradient issues, and MaxPooling 2D to down-sample while preserving essential information. Together, these operations enhance the model's capacity to distinguish fine differences between gestures.

Recognizing the need for a more comprehensive dataset containing both images and corresponding 16-point hand representations, the work transitioned to the Leap Gesture Dataset. This dataset enriches the training data with crucial visual information, forming the cornerstone for a robust hand gesture recognition model. The decision to
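The preprocessing steps of Section 3.1 (resizing, normalization to [0, 1], median filtering, ROI cropping, and simple augmentation) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the 64 × 64 target size, the 3 × 3 filter window, the nearest-neighbour resizing, and the choice of flip and 90-degree rotation as augmentations are all assumptions made for illustration.

```python
import numpy as np


def resize_nearest(img, out_h, out_w):
    """Resize a 2D image to (out_h, out_w) by nearest-neighbour sampling."""
    in_h, in_w = img.shape
    rows = np.arange(out_h) * in_h // out_h
    cols = np.arange(out_w) * in_w // out_w
    return img[rows[:, None], cols[None, :]]


def normalize(img):
    """Scale pixel values linearly to the range [0, 1]."""
    img = img.astype(np.float32)
    lo, hi = img.min(), img.max()
    if hi == lo:  # constant image: avoid division by zero
        return np.zeros_like(img)
    return (img - lo) / (hi - lo)


def median_filter3(img):
    """Apply a 3x3 median filter; borders use edge replication."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    # Stack the nine shifted views and take the per-pixel median.
    windows = [padded[r:r + h, c:c + w] for r in range(3) for c in range(3)]
    return np.median(np.stack(windows), axis=0)


def crop_roi(img, top, left, height, width):
    """Cut out a rectangular region of interest (for example, the palm)."""
    return img[top:top + height, left:left + width]


def augment(img):
    """Return the image plus two simple variants (assumed augmentations)."""
    return [img, np.fliplr(img), np.rot90(img)]


def preprocess(img, size=(64, 64)):
    """Resize, normalize, then denoise, in the order given in the text."""
    img = resize_nearest(img, *size)
    img = normalize(img)
    return median_filter3(img)
```

Calling `preprocess(frame)` on a 2D depth frame returns a 64 × 64 array with values in [0, 1]; `crop_roi` and `augment` are kept separate because the text does not fix where in the sequence they are applied.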

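The padding or truncation step mentioned in Section 3.2 can be sketched as follows. The text does not state where padding is added or which value is used, so end-padding with a caller-supplied filler element is an assumption, and `pad_or_truncate` is a hypothetical helper name.

```python
def pad_or_truncate(seq, target_len, pad_value=0.0):
    """Return a list of exactly target_len items.

    Longer sequences are cut at the end; shorter ones are padded at the
    end with pad_value (assumed convention, not taken from the paper).
    """
    if len(seq) >= target_len:
        return list(seq[:target_len])
    return list(seq) + [pad_value] * (target_len - len(seq))
```

For a sequence of frames, `pad_value` would be a placeholder frame (for example, an all-zero set of joint coordinates) rather than a scalar.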

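The attention, residual, ReLU, and max-pooling stages of the feature extraction layer in Section 3.2 can be illustrated with a small NumPy sketch. The paper does not specify the form of the attention mechanism, so a parameter-free channel attention (sigmoid of each channel's global average) is assumed here; the convolution and batch normalization stages are omitted for brevity, and all function names are hypothetical.

```python
import numpy as np


def channel_attention(x):
    """Weight each channel by the sigmoid of its global average.

    x has shape (channels, height, width). This attention form is an
    assumption; the source does not describe the one actually used.
    """
    pooled = x.mean(axis=(1, 2))             # one value per channel
    weights = 1.0 / (1.0 + np.exp(-pooled))  # sigmoid, values in (0, 1)
    return x * weights[:, None, None]


def max_pool2x2(x):
    """2x2 max pooling with stride 2; an odd trailing row/column is dropped."""
    c, h, w = x.shape
    x = x[:, :h - h % 2, :w - w % 2]
    return x.reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))


def attention_block(x):
    """Attention-weighted features plus residual, then ReLU and pooling."""
    out = channel_attention(x) + x  # residual connection
    out = np.maximum(out, 0.0)      # ReLU
    return max_pool2x2(out)
```

A feature map of shape (C, H, W) comes out with shape (C, H // 2, W // 2); in a trained network the attention weights would normally be produced by learned layers rather than computed directly from the channel means.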


