Page 359 - Kaleidoscope Academic Conference Proceedings 2024


Innovation and Digital Transformation for a Sustainable World

approach for 2-D multi-individual human posture assessment. This could not achieve accuracy beyond 76.4%. Convolutional Neural Networks did achieve an accuracy beyond 80%, reaching 82.84% by using several layers of neurons along with techniques such as Batch Normalization and Dropout. This accuracy did not prove sufficient for effective training of the agent, but CNNs did become the basis of many state-of-the-art learning techniques. EfficientNet combines a CNN with a compound scaling method, which uniformly scales network width, depth, and resolution with a set of fixed scaling coefficients. It also transfers well to other problems and has been shown to achieve 91.7% accuracy on CIFAR-100 and 98.8% on Flowers, among other datasets. However, when this approach was applied to yoga poses, it achieved an accuracy of 85%. Our next approach was to explore Deep Residual Learning. Since deeper neural networks are more difficult to train, the residual learning framework makes the model easier to train and optimize, while also gaining accuracy from the increased number of hidden layers. This approach provided an accuracy of 70% on the Yoga dataset.

PoseNet: PoseNet is a deep learning structure that detects human postures by distinguishing joint areas in a human body. However, even with this structure, accuracy kept fluctuating between 0.5 and 0.9, depending on the pose.
MoveNet: Using MoveNet, the model detects 17 landmarks on the image of the human body. After detecting the landmarks, the model estimates the pose using the distance of each landmark from the centre point of the image. The average of the individual landmark scores is taken as the score of that image.

PoseNet: PoseNet is an older-generation pose estimation model released in 2017. It is trained on the standard COCO dataset and provides single-pose and multi-pose estimation variants.
MoveNet: MoveNet is the latest-generation pose estimation model, released in 2021. It is an ultra-fast and accurate model that detects 17 key-points of a body. The model is offered on TF Hub in two variants, Lightning and Thunder. Lightning is intended for latency-critical applications, while Thunder is intended for applications that require high accuracy.

PoseNet: The single-pose variant can detect only one person in an image/video, while the multi-pose variant can detect multiple persons in an image/video. Both variants have their own set of parameters and methodology. Single-pose estimation is simpler and faster but requires a single person in the image/video; otherwise, key-points from multiple persons will likely be estimated as being part of a single subject.
MoveNet: Both variants run faster than real time (30+ FPS) on most modern desktops, laptops, and phones, which is crucial for live fitness, health, and wellness applications. MoveNet is a bottom-up estimation model that uses heat-maps to accurately localize human key-points.

PoseNet: PoseNet also has two variants in terms of model architecture: MobileNetV1 and ResNet50. The MobileNetV1 variant is smaller and faster but less accurate, while the ResNet50 variant is larger and slower but more accurate. Both variants support single-pose and multi-person pose estimation. The model returns the coordinates of the 17 key-points along with a confidence score.
MoveNet: MoveNet is a TensorFlow project that provides two variants, Thunder and Lightning. Since our project requires high accuracy, we used Thunder. For latency-critical applications, Lightning is considered the better option. Using this model, the system is capable of achieving an accuracy between 0.87 and 0.90.

Table 1 – PoseNet and MoveNet

7. METHODOLOGY/TECHNIQUES

The MoveNet architecture consists of two components: a feature extractor and a set of prediction heads. The prediction scheme loosely follows CenterNet, with notable changes that improve both speed and accuracy. All models are trained using the TensorFlow Object Detection API.

The feature extractor in MoveNet is MobileNetV2 with an attached feature pyramid network (FPN), which produces a high-resolution (output stride 4), semantically rich feature map. Four prediction heads are attached to the feature extractor, responsible for densely predicting:

Person center heat-map: predicts the geometric center of each person.

Landmark regression field: predicts the full set of key-points for a person, used for grouping key-points into instances.

Person key-point heat-map: predicts the location of all key-points, independent of person instances.

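The per-image scoring described for MoveNet in Table 1 can be sketched as follows. This sketch assumes each landmark is given as (y, x, score), with coordinates normalized to [0, 1], which matches the layout of MoveNet's single-pose output. The helper names `landmark_distances` and `image_score` are hypothetical. The text does not specify how the distances are then used to classify a pose, so the sketch only computes them.

```python
import math
from typing import List, Sequence, Tuple

# Assumed landmark layout: (y, x, score), with y and x normalized to [0, 1],
# as in MoveNet's single-pose output of 17 key-points.
Landmark = Tuple[float, float, float]


def landmark_distances(landmarks: Sequence[Landmark],
                       center: Tuple[float, float] = (0.5, 0.5)) -> List[float]:
    """Euclidean distance of each landmark from the centre point of the image."""
    cy, cx = center
    return [math.hypot(y - cy, x - cx) for y, x, _ in landmarks]


def image_score(landmarks: Sequence[Landmark]) -> float:
    """Average of the individual landmark scores, taken as the score of the image."""
    if not landmarks:
        raise ValueError("no landmarks detected")
    return sum(score for _, _, score in landmarks) / len(landmarks)


if __name__ == "__main__":
    # Three illustrative landmarks instead of the full 17.
    pts = [(0.5, 0.5, 1.0), (0.5, 0.8, 0.5), (0.1, 0.5, 0.9)]
    print(landmark_distances(pts))  # approximately [0.0, 0.3, 0.4]
    print(image_score(pts))         # approximately 0.8
```

Averaging gives a single confidence value per image. As a result, frames where several landmarks are poorly detected pull the image score down even when the remaining landmarks are confident.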