Page 224 - Kaleidoscope Academic Conference Proceedings 2024
2024 ITU Kaleidoscope Academic Conference
3. IMPLEMENTATION WITH EXPERIMENTAL DETAILS

3.1 Dataset and preprocessing

The ICVL Hand Gesture Dataset, comprising depth images from Intel with 22069 training frames and 1600 testing frames and focusing on 16 hand joints, was used for training and testing. A second dataset, the Hand Gesture Recognition Dataset [10], containing a total of 24000 images of 20 different gestures, was also employed. Each gesture directory holds 900 images for training and 300 images for testing. Table 1 shows sample images from the dataset.

Table 1 – Dataset Sample Images

In the preprocessing stage for gestures, the first step involves resizing the input images to a fixed size. This standardization ensures uniformity in the input data fed into the model, reducing computational complexity and memory requirements during both training and inference. Following this, the pixel values of the images are normalized to a standardized range, typically [0, 1]. This normalization stabilizes the training process and aids convergence by ensuring that the input data has a consistent scale. Next, noise reduction is applied to mitigate the impact of noise and artifacts present in the input images: a median filter is used to smooth out irregularities and enhance image clarity. In a typical home environment, noise reduction improves the quality of the input data, making it easier for the model to extract relevant features and patterns associated with different hand gestures.

Finally, data augmentation techniques are applied to increase the diversity and robustness of the training dataset. Augmentation methods such as rotation, scaling, translation, and flipping introduce variations to the training data, enabling the model to generalize better to unseen variations in hand gestures and environmental conditions. Additionally, cropping the Region of Interest (ROI), such as the hand palm, from the input images helps focus the model's attention on the relevant area for gesture classification. By sequentially applying these preprocessing methods, the input images are prepared for feature extraction and classification by the corresponding modules of the gesture recognition system.

3.2 Attention based CNN model

The model configuration parameters, including learning rate, batch size, and number of epochs, were established. The data was organized into separate training and testing sets. To ensure uniform sequence lengths, padding or truncation was applied. Following this, a 3D attention-based CNN model was applied to extract the feature tensors. After processing the data, the 3D attention-based CNN model produced a set of 16 3D vectors describing how the hand was positioned in space. These vectors contained information about where each part of the hand was, such as the fingers and the center of the palm, and how they were moving.

The array of 3D vectors served as the input for the subsequent model, which was designed to classify the hand gestures. By examining the small variations and spatial dynamics encoded within these vectors, the model was able to identify patterns and correlations unique to each gesture. It analyzed the interrelation between different joints and their spatial orientations, allowing for the accurate identification and classification of diverse hand gestures. The model distinguished between similar gestures by considering the relative positioning, movement trajectories, and spatial interactions between the hand joints.

At the core of the enhanced architecture lies a feature extraction layer with an attention mechanism, integrating Convolutional 2D, Batch Normalization, ReLU, Residual, and MaxPooling 2D operations. This combination is designed to capture spatial-temporal patterns, providing a foundation for subsequent processing. The layer uses Convolutional 2D operations to detect hierarchical features, Batch Normalization to stabilize and accelerate training, ReLU to introduce non-linearity, Residual connections to overcome vanishing gradient issues, and MaxPooling 2D to down-sample while preserving essential information. The combined effect of these operations enhances the model's capacity to distinguish fine differences between gestures.

Recognizing the need for a more comprehensive dataset that provides both images and corresponding 16-point hand representations, the work transitioned to the Leap Gesture Dataset. This dataset enriches the training data with crucial visual information, forming the cornerstone for a robust hand gesture recognition model. The decision to
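The preprocessing steps described in Section 3.1 (resizing, normalization to [0, 1], median filtering, ROI cropping, and flipping as one augmentation) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 64 x 64 target size, the 3 x 3 filter window, the nearest-neighbour resizing, and the 8-bit pixel range are all assumptions chosen for illustration, and only the Python standard library is used so that the example runs anywhere. A practical pipeline would more likely rely on an array or image library.

```python
from statistics import median


def resize_nearest(img, out_h, out_w):
    """Resize a 2D image (list of rows) using nearest-neighbour sampling."""
    in_h, in_w = len(img), len(img[0])
    return [[img[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
            for r in range(out_h)]


def normalize(img, max_value=255.0):
    """Scale pixel values to [0, 1]; max_value=255 assumes 8-bit input."""
    return [[p / max_value for p in row] for row in img]


def median_filter(img, k=3):
    """Apply a k x k median filter; at the borders the window is clipped."""
    h, w, r = len(img), len(img[0]), k // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [img[j][i]
                      for j in range(max(0, y - r), min(h, y + r + 1))
                      for i in range(max(0, x - r), min(w, x + r + 1))]
            row.append(median(window))
        out.append(row)
    return out


def crop_roi(img, top, left, height, width):
    """Cut out a rectangular region of interest, e.g. around the palm."""
    return [row[left:left + width] for row in img[top:top + height]]


def flip_horizontal(img):
    """Mirror the image left to right (one simple augmentation)."""
    return [row[::-1] for row in img]


def preprocess(img, size=(64, 64)):
    """Resize, normalize, then denoise, in the order given in the text."""
    resized = resize_nearest(img, *size)
    scaled = normalize(resized)
    return median_filter(scaled)
```

In this sketch an isolated bright pixel on a uniform background is removed by the median filter, which is the kind of impulse noise such a filter handles well. Augmentation and ROI cropping are kept as separate helpers because the text applies augmentation to the training data only.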
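Section 3.2 states that padding or truncation was used to obtain uniform sequence lengths, without giving details. The following sketch shows one common way to do this for sequences of frames, each holding 16 joints as 3D vectors. The function name, zero padding at the end of the sequence, and truncation from the end are assumptions made for illustration, not choices confirmed by the paper.

```python
NUM_JOINTS = 16  # from the text: each frame is described by 16 3D vectors


def pad_or_truncate(frames, target_len, pad_value=0.0):
    """Return exactly target_len frames.

    Longer sequences are cut at the end; shorter ones are extended with
    frames whose joints are all (pad_value, pad_value, pad_value).
    Both choices are illustrative assumptions.
    """
    if target_len < 0:
        raise ValueError("target_len must be non-negative")
    result = list(frames[:target_len])
    pad_frame = [(pad_value, pad_value, pad_value)] * NUM_JOINTS
    # Append independent copies of the padding frame until the length matches.
    result.extend(list(pad_frame) for _ in range(target_len - len(result)))
    return result
```

For example, a two-frame sequence padded to length four gains two all-zero frames, while a six-frame sequence is cut to its first four frames. Whether padding frames should later be masked out by the model is a separate design decision that the text does not address.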