which improves the accuracy and reduces latency to a greater extent.
The ITU-T Recommendation J.1612 [6] outlines technical specifications for efficient smart home device management within IoT ecosystems. Its protocols and standards for device discovery, configuration, and maintenance ensure seamless integration and interoperability. Apart from security measures, system development needs to address scalability and adaptability, allowing new devices and services to be integrated over time. A standardized framework for smart home device management promotes efficiency, reliability, and security in IoT-driven home automation environments. The proposed work conforms to ITU-T Recommendations J.1611 [7] and J.1612 [6] while taking device management and IoT into consideration. Further, in a standardized system development approach, the gateway hardware with a driver and operating system serves as the basic software platform that manages all hardware resources. In the proposed solution, a machine learning based method deployed on a minicomputer such as a Raspberry Pi supporting IoT devices facilitates understanding and executing the end user's commands in real time.
The proposed work introduces a novel solution by employing a dedicated CNN model with self- and other-attention mechanisms, specifically tailored for processing 3D tensors derived from the images. Beyond the core classification challenge, the solution addresses additional complexities associated with real-time gesture recognition devices. Integrating the entire pipeline, from real-time image capture to 3D tensor generation and classification, requires careful consideration of computational efficiency and system responsiveness. The inherent complexity of human gesture information [8], which demands precise capture of 3D spatial structure and accurate regression of hand poses, poses a challenge to conventional image-based classification systems. Moreover, issues such as lighting conditions, background noise, and varying user positions [9] add to the work's intricacy. Developing the system with an existing machine learning model, such as a ResNet transfer learning model with a dynamic learning rate, aims to enhance the accuracy and robustness of gesture recognition, enabling seamless and natural interactions between end users and IoT-enabled devices.
2. PROPOSED WORK
The proposed work entails a comprehensive approach to hand gesture recognition for smart home appliance control, as shown in Figure 1. Initially, the camera feed undergoes preprocessing to enhance the quality of the frames extracted for analysis. This preprocessing step includes noise reduction, image normalization, and, where needed, background subtraction to isolate the hand region, which is crucial for gesture recognition. Following preprocessing, the region of interest, typically the hand palm, is segmented from the background. This segmentation step is pivotal for focusing the analysis on the features relevant to gesture classification. By isolating the hand region, the subsequent models concentrate on discerning the nuances of hand movements with greater accuracy. The segmented hand palm region is then passed through two distinct modules for gesture classification. The first module employs an attention-based CNN model, which dynamically focuses on salient features within the hand region. This attention mechanism enhances the model's ability to capture subtle variations in hand gestures, improving classification performance.
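As a concrete illustration of this front end, the sketch below uses OpenCV to denoise and normalize each frame and then crop the largest skin-colored blob as the palm region. The skin-tone thresholds, kernel sizes, and the choice of YCrCb thresholding itself are illustrative assumptions; the text does not prescribe a specific segmentation method.

```python
import cv2
import numpy as np

# Illustrative Cr/Cb skin-tone bounds; a real deployment would calibrate these.
SKIN_LOW = np.array([0, 133, 77], dtype=np.uint8)
SKIN_HIGH = np.array([255, 173, 127], dtype=np.uint8)

def preprocess_and_segment(frame):
    """Denoise and normalize a BGR frame, then crop the hand-palm ROI."""
    # Noise reduction and intensity normalization.
    frame = cv2.GaussianBlur(frame, (5, 5), 0)
    frame = cv2.normalize(frame, None, 0, 255, cv2.NORM_MINMAX)
    # Skin-color thresholding in YCrCb as a simple stand-in for
    # background subtraction / hand segmentation.
    ycrcb = cv2.cvtColor(frame, cv2.COLOR_BGR2YCrCb)
    mask = cv2.inRange(ycrcb, SKIN_LOW, SKIN_HIGH)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, np.ones((5, 5), np.uint8))
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None  # no hand found in this frame
    # Keep the largest skin-colored blob, assumed to be the palm.
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    return frame[y:y + h, x:x + w]
```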
In the development of our attention mechanism, the Self-Attention Input (SAI) layer plays a crucial role by decomposing the feature representations into Value, Key, and Query components, operating on features in the (batch size, channel number, height, width) format. This layer employs batch matrix multiplication to compute attention scores, which enables the model to selectively focus on relevant spatial-temporal features. The attention-driven approach allows for a more nuanced understanding by emphasizing key features within the spatial-temporal context of the hand gestures, thus improving the accuracy of gesture recognition.
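A minimal sketch of such a layer is given below, assuming 1x1 convolutions for the Query/Key/Value projections and a learnable residual weight gamma, details the text leaves open; the torch.bmm calls compute attention scores over the flattened spatial positions of a (batch, channel, height, width) feature map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """Sketch of an SAI-style block: Query/Key/Value projections over a
    (batch, channel, height, width) feature map, with batch matrix
    multiplication (torch.bmm) producing the attention scores."""

    def __init__(self, channels, reduction=8):
        super().__init__()
        self.query = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.key = nn.Conv2d(channels, channels // reduction, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable residual weight

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)  # (b, n, c'), n = h*w
        k = self.key(x).flatten(2)                    # (b, c', n)
        v = self.value(x).flatten(2)                  # (b, c, n)
        # Batch matrix multiplication yields pairwise attention scores
        # between all spatial positions.
        attn = F.softmax(torch.bmm(q, k), dim=-1)     # (b, n, n)
        out = torch.bmm(v, attn.transpose(1, 2))      # (b, c, n)
        return self.gamma * out.view(b, c, h, w) + x
```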
Following the attention score computation, the system applies a meticulous refinement process. Attention UV (Att UV) and Attention Others (Att Others) strategically process the UV points and the other points extracted from the Self-Attention Output (SAO) layer, ensuring that the system homes in on critical spatial-temporal features. This attention-driven refinement is pivotal in preparing the features for subsequent stages, facilitating a more precise and context-aware understanding of gestures. It establishes a robust foundation for the feature pooling stage, ensuring that the system captures and processes the intricate details essential for the classification module.
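The internal structure of Att UV and Att Others is not spelled out here; purely to make the data flow concrete, the sketch below assumes the SAO output is split along the channel axis into a UV stream and an "others" stream, each refined by a hypothetical sigmoid gate before being recombined for pooling. The split point uv_channels is an assumed hyperparameter.

```python
import torch
import torch.nn as nn

class AttUVOthers(nn.Module):
    """Hypothetical refinement stage: the first `uv_channels` feature maps
    of the SAO output are treated as UV points and the rest as other
    points, each emphasized by its own gating branch."""

    def __init__(self, channels, uv_channels):
        super().__init__()
        self.uv_channels = uv_channels
        self.att_uv = nn.Sequential(
            nn.Conv2d(uv_channels, uv_channels, 1), nn.Sigmoid())
        self.att_others = nn.Sequential(
            nn.Conv2d(channels - uv_channels, channels - uv_channels, 1),
            nn.Sigmoid())

    def forward(self, sao):
        uv, others = sao.split(
            [self.uv_channels, sao.size(1) - self.uv_channels], dim=1)
        # Element-wise gates emphasize the critical features in each stream.
        uv = uv * self.att_uv(uv)
        others = others * self.att_others(others)
        return torch.cat([uv, others], dim=1)  # recombined for feature pooling
```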
The hand region of interest is simultaneously processed by a transfer learning model with a dynamic learning rate. This model adjusts its learning rate dynamically based on the characteristics of the input data, optimizing the training process for improved performance. By incorporating dynamic learning rate mechanisms, the model effectively adapts to variations in gesture dynamics and environmental conditions. Unlike traditional approaches that extract positional parameters before inputting them into learning models, the proposed system directly utilizes the image data. This approach offers several advantages, including simplifying the preprocessing stage and reducing computational complexity. By feeding image data into the learning model, the system preserves the spatial information inherent in gestures, allowing for more accurate classification.
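One plausible realization of this channel, assuming a ResNet-18 backbone and PyTorch's ReduceLROnPlateau as the dynamic learning rate policy (the text names ResNet but not an exact scheduler), is sketched below. Here train_loader is assumed to yield raw hand-ROI crops with labels, and val_loss_fn to return the current validation loss.

```python
import torch
import torch.nn as nn
from torchvision import models

def build_gesture_model(num_classes):
    """Pre-trained ResNet-18 with its head replaced for the gesture classes."""
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    model.fc = nn.Linear(model.fc.in_features, num_classes)
    return model

def train(model, train_loader, val_loss_fn, epochs=20):
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9)
    # One common "dynamic learning rate": shrink the rate whenever the
    # validation loss stops improving.
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.1, patience=3)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in train_loader:  # raw image crops, no landmarks
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        scheduler.step(val_loss_fn(model))   # validation loss drives the rate
```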
Finally, the results from both models (attention and transfer learning) are fused using a class probability fusion technique. This fusion process combines the outputs of the attention-based CNN model and the transfer learning model with its dynamic learning rate to produce a more robust classification outcome. The merging of predictions from two parallel channels stands out as a critical component of the proposed gesture recognition process.
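The fusion rule itself is not detailed at this point; a weighted average of the two channels' softmax outputs, with an assumed mixing weight w_attn, is one standard instance of class probability fusion:

```python
import torch
import torch.nn.functional as F

def fuse_class_probabilities(logits_attn, logits_tl, w_attn=0.5):
    """Weighted average of the two channels' class probabilities.

    `w_attn` balances the attention CNN against the transfer learning
    model; 0.5 gives both channels equal say.
    """
    p_attn = F.softmax(logits_attn, dim=1)
    p_tl = F.softmax(logits_tl, dim=1)
    fused = w_attn * p_attn + (1.0 - w_attn) * p_tl
    return fused.argmax(dim=1)  # final gesture label per sample
```

Tuning w_attn on a validation split, rather than fixing it at 0.5, is a natural refinement of this scheme.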
After localizing the gestures, one channel processes the image data directly, while the other extracts positional