Page 211 - Kaleidoscope Academic Conference Proceedings 2020

Industry-driven digital transformation

The predicted joints in the 2D pose are the peak locations on these heat maps. This heat-map representation is convenient because it can be concatenated with the other deep-layer feature maps. The 2D joint heat maps, together with the intermediate feature representations within the 2D module, form the input to the depth regression module. These features, which capture semantic information at multiple levels for 2D pose estimation, provide additional cues for pose recovery.

2.1 Stacked Hourglass Network

The idea behind stacking multiple hourglass (HG) modules, instead of forming one giant encoder-decoder network, is that each HG module produces a full heat map for joint prediction [9]. In general, an HG module is an encoder-decoder architecture in which the features are first downsampled and then upsampled to recover the information and form a heat map. Each encoder layer has a connection to its decoder counterpart, and layers can be stacked as needed.

The hourglass network splits the output into two paths, as shown in Figure 2. The top path applies some additional convolutions to further process the features and then feeds the next HG module. The output of that convolution layer is used as an intermediate heat map result (red box), and the loss is calculated between this intermediate heat map and the ground-truth heat map.

[Figure 2 diagram labels: Input, Encoder-Decoder, Convolution layers, Encoder-Decoder, Heat Map, 2D Skeleton]

Figure 2 - Stacked hourglass architecture

2.2 Skeleton Processing

The skeletal data from the hourglass network is processed to obtain the following features. Two transformations are applied to the raw skeleton data: (i) the hip transformation and (ii) the theta transformation.

2.2.1 Hip Transformation

To make the skeletons invariant to the location of the subjects, the origin of the coordinate system is moved to the location of the hip center joint of the skeleton. For each joint j in every pose, we apply this transformation as

$$ (x'_j,\; y'_j) = (x_j - x_{\text{hipcenter}},\; y_j - y_{\text{hipcenter}}) \tag{1} $$

where $x_{\text{hipcenter}}$ and $y_{\text{hipcenter}}$ represent the hip center of the input skeleton.

2.2.2 Theta Transformation

To make the poses rotation invariant, a rotation is applied to the joints relative to the camera view angle θ. This transformation ensures that the projection on the ground plane of the vector from the left hip $(x_{\text{left\_hip}}, y_{\text{left\_hip}})$ to the right hip $(x_{\text{right\_hip}}, y_{\text{right\_hip}})$ stays parallel to the x-axis in real-world coordinates. The rotation angle is computed by

$$ \theta = \tan^{-1}\left( \frac{y_{\text{right\_hip}} - y_{\text{left\_hip}}}{x_{\text{right\_hip}} - x_{\text{left\_hip}}} \right) \tag{2} $$

After the deviation angle θ is obtained, each skeleton joint in the corresponding frame is rotated around the y-axis in a counterclockwise fashion as

$$
\begin{bmatrix} x' \\ y' \\ 1 \end{bmatrix} =
\begin{bmatrix} \cos\theta & \sin\theta & 0 \\ -\sin\theta & \cos\theta & 0 \\ 0 & 0 & 1 \end{bmatrix}
\begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \tag{3}
$$

Moreover, to make the skeletons scale invariant, a reference skeleton is chosen randomly from the training set, and the limbs of the remaining skeletons are rescaled to the size of the corresponding limbs in the reference skeleton while preserving the original angles between the joints.

3. SYSTEM PARAMETERS AND ALGORITHMS

3.1 Feature Extraction

In deep-learning-based feature extraction, we treat the pre-trained network as an arbitrary feature extractor: the input frame propagates forward and stops at a pre-specified layer, and the outputs of that layer are taken as features.

A sliding window of size N aggregates the skeleton data of the first N frames. This skeleton data is preprocessed and used for feature extraction, and the resulting features are fed into a classifier to obtain the final recognition result. In a video-streaming recognition framework, the window is slid frame by frame along the time dimension of the video and outputs a label for each video frame. Here the window size N is set to 5 during the experiment, which corresponds to a length of 0.5 seconds.

The following features are extracted from the concatenated frame information. $X_s$ is a direct concatenation of the joint positions of the frames. The dimension of the $X_s$ vector is 13 joints multiplied by 2 positions per joint for N frames, giving a 130-dimensional vector. The next feature is the average height of the skeleton over the previous N frames. This height equals the length from Neck to Thigh and is used for normalizing all features.
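As a rough illustration of the skeleton preprocessing (hip and theta transformations) and the windowed feature vector described in Sections 2.2 and 3.1, the following Python sketch may help. It is not the authors' implementation. The joint indices `HIP_CENTER`, `LEFT_HIP`, and `RIGHT_HIP` and all function names are hypothetical, since the paper does not list the joint order. `math.atan2` is used in place of a plain arctangent of the ratio, an assumption made here to avoid division by zero when both hips share the same x-coordinate.

```python
import math

# Hypothetical joint indices: the paper uses 13 joints but does not give their order.
HIP_CENTER, LEFT_HIP, RIGHT_HIP = 0, 1, 2


def hip_transform(joints):
    """Move the origin to the hip-center joint (hip transformation)."""
    cx, cy = joints[HIP_CENTER]
    return [(x - cx, y - cy) for x, y in joints]


def theta_transform(joints):
    """Rotate all joints so the left-hip -> right-hip vector lies along the x-axis."""
    lx, ly = joints[LEFT_HIP]
    rx, ry = joints[RIGHT_HIP]
    # Assumption: atan2 replaces tan^-1(dy / dx) to handle dx == 0 safely.
    theta = math.atan2(ry - ly, rx - lx)
    c, s = math.cos(theta), math.sin(theta)
    # 2x2 block of the rotation matrix: [[cos, sin], [-sin, cos]].
    return [(c * x + s * y, -s * x + c * y) for x, y in joints]


def window_features(frames):
    """Concatenate the (x, y) joint positions of every frame in the window."""
    return [coord for joints in frames for xy in joints for coord in xy]


# Toy pose with 13 joints, used only to exercise the functions.
pose = [(3.0 + i, 4.0 + 2.0 * i) for i in range(13)]
processed = theta_transform(hip_transform(pose))
features = window_features([processed] * 5)  # window size N = 5
```

With 13 joints, 2 coordinates per joint, and N = 5 frames, `features` has 13 × 2 × 5 = 130 entries, matching the dimension stated for the concatenated joint-position vector. The scale normalization and the height feature are left out because they depend on limb definitions not given in this section.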
