Page 211 - Kaleidoscope Academic Conference Proceedings 2020

Industry-driven digital transformation




The predicted joints in the 2D pose are the peak locations on these heat maps. This heat-map representation is convenient as it can be concatenated with the other deep layer feature maps. The 2D joint heat maps and the intermediate feature representations within the 2D module form an input to the depth regression module. These features, which extract semantic information at multiple levels for 2D pose estimation, provide additional cues for pose recovery.

2.1    Stacked Hourglass Network

The idea behind stacking multiple hourglass (HG) modules instead of forming a giant encoder and decoder network is that each HG module will produce a full heat map for joint prediction [9]. In general, an HG module is an encoder-decoder architecture, where the features are first downsampled and then upsampled to recover the information and form a heat map. Each encoder layer has a connection to its decoder counterpart, and layers can be stacked as needed.

The hourglass network splits the output into two paths as shown in Figure 2. The top path includes some more convolutions to further process the features and then goes to the next HG module. Here the output of that convolution layer is used as an intermediate heat map result (red box), and the loss is then calculated between this intermediate heat map and the ground-truth heat map.

[Figure: Input, Encoder-Decoder, Convolution layers, Heat Map, Encoder-Decoder, 2D Skeleton]

Figure 2 - Stacked hourglass architecture

2.2    Skeleton Processing

The skeletal data from the hourglass network is processed to obtain the following features. Two transformations, (i) hip transformation and (ii) theta transformation, are applied on the raw skeleton data.

2.2.1   Hip Transformation

To make the skeletons invariant to the location of the subjects, the origin of the coordinate system is transformed to the location of the hip center joint of the skeleton. For each joint j in every pose we apply this transformation as

$$(x'_j,\ y'_j) = (x_j - x_{hipcenter},\ y_j - y_{hipcenter}) \tag{1}$$

where $x_{hipcenter}$ and $y_{hipcenter}$ represent the hip center of the input skeleton.

2.2.2   Theta Transformation

To make the poses rotation invariant, a rotation operation is applied on the joints relative to the camera view angle θ. This transformation makes sure that the projection of the vector passing from the left hip $(x_{left\_hip}, y_{left\_hip})$ to the right hip $(x_{right\_hip}, y_{right\_hip})$ on the ground plane stays parallel with the x-axis in the real-world coordinates, where the rotation angle is computed by

$$\theta = \tan^{-1}\left(\frac{y_{right\_hip} - y_{left\_hip}}{x_{right\_hip} - x_{left\_hip}}\right) \tag{2}$$

After obtaining the deviation angle (θ) for each skeleton joint in the corresponding frame, the rotation around the y-axis in a counterclockwise fashion is performed as

$$\begin{pmatrix} x' \\ y' \\ 1 \end{pmatrix} = \begin{pmatrix} \cos\theta & \sin\theta & 1 \\ -\sin\theta & \cos\theta & 1 \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} x \\ y \\ 1 \end{pmatrix} \tag{3}$$

Moreover, to make the skeletons scale invariant, a reference skeleton is chosen randomly from a training set and the limbs of the remaining skeletons are rescaled to the same size as the limbs in the reference skeleton, while preserving the original angles between the joints.

3.     SYSTEM PARAMETERS AND ALGORITHMS

3.1    Feature Extraction

In deep-learning-based feature extraction, we treat the pre-trained network as an arbitrary feature extractor, allowing the input frame to propagate forward, stopping at a pre-specified layer, and taking the outputs of that layer as features.

A sliding window of size N aggregates the skeleton data of the first N frames. This skeleton data is preprocessed and used for feature extraction, and the resulting features are then fed into a classifier to obtain the final recognition. In a video streaming recognition framework, the window is slid frame by frame along the time dimension of the video and outputs a label for each video frame. Here the window size N is set to 5 during the experiment, which corresponds to a length of 0.5 seconds.

The following features are extracted from the concatenated frame information. Xs is a direct concatenation of the joint positions of the frames. The dimension of the Xs vector is 13 joints multiplied by 2 coordinates per joint for N frames, giving a 130-dimensional vector (13 × 2 × 5 = 130). The next feature is the average height of the skeleton over the previous N frames. This height equals the length from Neck to Thigh. It is used for normalizing all features.
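The preprocessing and windowing steps described on this page can be sketched in a few lines of NumPy. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the joint indices `NECK`, `L_HIP` and `R_HIP` are hypothetical (the joint ordering is not given here), the hip center is assumed to be the midpoint of the two hips, the neck-to-hip-center distance stands in for the "Neck to Thigh" height, only the 2×2 rotation part of Eq. (3) is applied, and `arctan2` is used in place of the plain inverse tangent of Eq. (2) to avoid a division by zero.

```python
import numpy as np

N_JOINTS = 13  # joints per skeleton, as stated in the text
WINDOW = 5     # window size N used in the experiment

# Hypothetical joint indices; the joint ordering is an assumption.
NECK, L_HIP, R_HIP = 0, 7, 10


def hip_center(pose):
    """Midpoint of the two hips (assumed definition of the hip center)."""
    return 0.5 * (pose[L_HIP] + pose[R_HIP])


def hip_transform(pose):
    """Eq. (1): move the origin of a (13, 2) pose to the hip center."""
    return pose - hip_center(pose)


def theta_transform(pose):
    """Rotate the pose so the left-hip -> right-hip vector lies along x."""
    dx, dy = pose[R_HIP] - pose[L_HIP]
    theta = np.arctan2(dy, dx)  # safe when dx == 0, unlike dy / dx
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, s], [-s, c]])  # 2x2 rotation part of Eq. (3)
    return pose @ rot.T  # applies rot to every joint (row) of the pose


def window_features(frames):
    """Concatenate N preprocessed poses and normalize by the mean height."""
    poses = np.stack([theta_transform(hip_transform(f)) for f in frames])
    # Assumed height measure: distance from the neck to the hip center.
    heights = [np.linalg.norm(p[NECK] - hip_center(p)) for p in poses]
    mean_height = float(np.mean(heights))
    xs = poses.reshape(-1)  # N * 13 * 2 values
    return xs / mean_height
```

In a streaming setting, `window_features` would be called once per incoming frame on the most recent `WINDOW` poses, and the returned vector passed to the classifier.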



