
2024 ITU Kaleidoscope Academic Conference




which improves the accuracy and reduces latency to a greater extent.

The ITU-T Recommendation J.1612 [6] outlines technical specifications for efficient smart home device management within IoT ecosystems. The protocols and standards for device discovery, configuration, and maintenance ensure seamless integration and interoperability. Apart from the security measures, the system development needs to address scalability and adaptability, allowing for the integration of new devices and services over time. A standardized framework for smart home device management promotes efficiency, reliability, and security in IoT-driven home automation environments. The proposed work conforms to the ITU-T Recommendations J.1611 [7] and J.1612 [6], while taking device management and IoT into consideration. Further, in a standardized system development approach, the gateway hardware with a driver and operating system serves as a basic software platform to manage all hardware resources. In the proposed solution, a machine-learning-based method deployed on a minicomputer such as a Raspberry Pi that supports IoT devices facilitates understanding and executing the end user’s commands in real time.

The proposed work introduces a novel solution by employing a dedicated CNN model with self and other attention mechanisms, specifically tailored for processing 3D tensors derived from the images. Beyond the core classification challenge, the solution addresses additional complexities associated with real-time gesture recognition devices. Integrating the entire pipeline, from real-time image capture to 3D tensor generation and classification, requires careful consideration of computational efficiency and system responsiveness. The inherent complexity of human gestures information [8], allowing for precise 3D spatial structure capture and accurate regression of hand poses, poses a challenge to conventional image-based classification systems. Moreover, issues such as lighting conditions, background noise, and varying user positions [9] add to the work’s intricacy. Building the system on an existing machine learning model, such as the transfer learning model ResNet with a dynamic learning rate, aims to enhance the accuracy and robustness of gesture recognition systems, enabling seamless and natural interactions between end users and IoT-enabled devices.

2. PROPOSED WORK

The proposed work entails a comprehensive approach to hand gesture recognition for smart home appliance control, as shown in Figure 1. Initially, the camera feed undergoes preprocessing to enhance the quality of the frames extracted for analysis. This preprocessing step includes noise reduction, image normalization, and potential background subtraction to isolate the hand region, which is crucial for gesture recognition. Following preprocessing, the region of interest, typically the hand palm, is segmented from the background. This segmentation step is pivotal for focusing the analysis on relevant features for gesture classification. By isolating the hand region, the subsequent models concentrate on discerning the nuances of hand movements with greater accuracy. The segmented hand palm region is then passed through two distinct modules for gesture classification. The first module employs an attention-based CNN model, which dynamically focuses on salient features within the hand region. This attention mechanism enhances the model's ability to capture subtle variations in hand gestures, improving classification performance.

In the development of our attention mechanism, the Self-Attention Input (SAI) layer plays a crucial role by decomposing the feature representations into Value, Key, and Query components, operating in the format of (batch size, channel number, height, width). This layer employs batch matrix multiplication to compute attention scores, which enables the model to selectively focus on relevant spatial-temporal features. The attention-driven approach allows for a more nuanced understanding by emphasizing key features within the spatial-temporal context of the hand gestures, thus improving the accuracy of gesture recognition.

Following the attention score computation, the system refines the attended features. Attention UV (Att UV) and Attention Others (Att Others) process UV and other points extracted from the Self-Attention Output (SAO) layer, ensuring that the system homes in on critical spatial-temporal features. This attention-driven refinement is pivotal in preparing the features for subsequent stages, facilitating a more precise and context-aware understanding of gestures. It establishes a robust foundation for the feature pooling stage, ensuring that the system captures and processes the intricate details essential for the classification module.

The hand region of interest is simultaneously processed by a transfer learning model with a dynamic learning rate. This model adjusts its learning rate dynamically based on the characteristics of the input data, optimizing the training process for improved performance. By incorporating dynamic learning rate mechanisms, the model effectively adapts to variations in gesture dynamics and environmental conditions. Unlike traditional approaches that extract positional parameters before inputting them into learning models, the proposed system directly utilizes the image data. This approach offers several advantages, including simplifying the preprocessing stage and reducing computational complexity. By feeding image data into the learning model, the system preserves the spatial information inherent in gestures, allowing for more accurate classification.

Finally, the results from both models (attention and transfer learning) are fused using a class probability fusion technique. This fusion process combines the outputs from the attention-based CNN model and the transfer learning model with dynamic learning rate to produce a more robust classification outcome. The merging of predictions from two parallel channels stands out as a critical component in the proposed gesture recognition process. After localizing the gestures, one channel processes the image data directly, while the other extracts positional
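
The class probability fusion step described in this section can be illustrated with a minimal sketch. The text does not specify the fusion rule, so the weighted average used here is an assumption, and the function name `fuse_class_probabilities` and the weight `w_attention` are hypothetical names introduced only for illustration.

```python
import numpy as np


def fuse_class_probabilities(p_attention, p_transfer, w_attention=0.5):
    """Late fusion of two per-class probability vectors.

    ASSUMPTION: a weighted average; the paper only names a
    "class probability fusion technique" without giving the rule.
    """
    p_attention = np.asarray(p_attention, dtype=float)
    p_transfer = np.asarray(p_transfer, dtype=float)
    if p_attention.shape != p_transfer.shape:
        raise ValueError("both models must score the same set of classes")
    if not 0.0 <= w_attention <= 1.0:
        raise ValueError("w_attention must lie in [0, 1]")

    # Convex combination of the two models' class probabilities.
    fused = w_attention * p_attention + (1.0 - w_attention) * p_transfer
    # Renormalize so each fused vector sums to one.
    fused = fused / fused.sum(axis=-1, keepdims=True)
    # The predicted gesture is the class with the highest fused probability.
    return fused, fused.argmax(axis=-1)


# Illustrative values only: three gesture classes, one sample.
fused, label = fuse_class_probabilities([0.6, 0.3, 0.1], [0.2, 0.7, 0.1])
```

With equal weights, the two models disagree on the top class in this example and the fused vector favors the class on which the transfer learning channel is more confident. In practice the weight could be tuned on validation data.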
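
The Query, Key, and Value decomposition over a (batch size, channel number, height, width) tensor, followed by batch matrix multiplication, can be sketched as follows. This is a generic scaled dot-product self-attention over spatial positions written in NumPy, not the authors' SAI/SAO layers: the scaling, the softmax, the projection matrices `w_q`, `w_k`, `w_v`, and the function name `spatial_self_attention` are all assumptions made for illustration.

```python
import numpy as np


def spatial_self_attention(x, w_q, w_k, w_v):
    """Self-attention over the H*W positions of a (B, C, H, W) tensor.

    ASSUMPTION: scaled dot-product attention with a softmax; the paper
    only states that batch matrix multiplication yields attention scores.
    x: (B, C, H, W); w_q, w_k: (C, D); w_v: (C, C).
    """
    b, c, h, w = x.shape
    # Flatten the spatial grid: each of the N = H*W positions is a token.
    tokens = x.reshape(b, c, h * w).transpose(0, 2, 1)        # (B, N, C)

    q = tokens @ w_q                                          # (B, N, D)
    k = tokens @ w_k                                          # (B, N, D)
    v = tokens @ w_v                                          # (B, N, C)

    # Batch matrix multiplication gives position-to-position scores.
    logits = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (B, N, N)
    logits = logits - logits.max(axis=-1, keepdims=True)      # stable softmax
    scores = np.exp(logits)
    scores = scores / scores.sum(axis=-1, keepdims=True)

    # Weight the values by the scores and restore the (B, C, H, W) layout.
    attended = scores @ v                                     # (B, N, C)
    out = attended.transpose(0, 2, 1).reshape(b, c, h, w)
    return out, scores


# Illustrative shapes only: batch of 2, 8 channels, 4x4 feature map.
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8, 4, 4))
w_q = rng.normal(size=(8, 4))
w_k = rng.normal(size=(8, 4))
w_v = rng.normal(size=(8, 8))
out, scores = spatial_self_attention(x, w_q, w_k, w_v)
```

The output keeps the input layout, so a block of this kind could sit between convolutional stages. Each row of `scores` is a distribution over spatial positions, which is what lets the model emphasize salient regions of the hand.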




