2020 ITU Kaleidoscope Academic Conference

Methods based on dense trajectories [6], which employ a Gaussian mixture model (GMM) for codebook generation and Fisher vector encoding for action recognition, have shown better performance. Although a motion trajectory that describes delicate motion represents both the dynamics and the appearance of an action in a scene, it is a low-level descriptor and is therefore not sufficient on its own for action recognition. This is due to the absence of action semantics at the global level.

The proposed system uses deep machine-learning techniques to improve action recognition accuracy (e.g., wave, punch, kick, jump) over existing approaches. The contribution of this work is twofold: a skeleton generator that generates the skeleton joint points for the human subject, and an action detector that considers the sequence of feature vectors. Our models are designed to leverage deep-learning techniques while complying with the criteria set by Recommendation ITU-T H.626.5. The system models have been developed to meet the requirements listed in ITU-T H.626.5, “Architecture for intelligent visual surveillance systems” [7], and ITU-T F.743, “Requirements and service description for video surveillance” [8]. In our system, target recognition and association are achieved by combining DNN and HGN to recognize the action performed by the human.

The remainder of the paper is organized as follows. Section 2 provides the architectural details of the proposed system, and Section 3 describes the theory behind system development. The implementation details for performance evaluation and the experimental results are presented in Section 4, while Section 5 concludes the paper.

2. PROPOSED SYSTEM

The outline of the proposed action recognition system is shown in Figure 1. In this system model, frames of the video stream are used to generate the skeleton of a human subject. The estimated skeleton is transformed using hip and theta transformations to remove the occlusion effect caused in the frames by differences in viewpoint and camera angle. From the skeleton sequence, the features of the joints are extracted, encoded using Fisher vectors and reduced using PCA. This optimal feature code sequence is used to train the classifier model, which is then used to identify the action classes. Skeleton tracking is used to obtain the estimated skeleton of the human subject. This is performed by the pose estimation method, which provides a set of points that represent the joint coordinates of the human skeleton. The method consists of a two-dimensional (2D) pose estimation module and a depth regression module, which predict the 2D joint locations and the depth values, respectively, and it is implemented using an hourglass network architecture. The network output is a set of low-resolution heat maps. Each map represents a 2D probability distribution of one joint.
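To illustrate the heat-map output described in Section 2, the following minimal sketch reads one 2D joint location from each heat map by taking the position of its maximum response. The function name `decode_heatmaps`, the array layout (joints, height, width) and the use of a plain argmax are assumptions made for illustration; the paper does not state how joint coordinates are read from the heat maps.

```python
import numpy as np

def decode_heatmaps(heatmaps):
    """Return (x, y) peak locations and peak scores for heat maps of shape (J, H, W).

    Hypothetical helper: a plain argmax decoder, assumed for illustration only.
    """
    heatmaps = np.asarray(heatmaps, dtype=float)
    num_joints, height, width = heatmaps.shape
    flat = heatmaps.reshape(num_joints, -1)
    idx = flat.argmax(axis=1)                         # peak index in each flattened map
    ys, xs = np.unravel_index(idx, (height, width))   # back to row/column coordinates
    scores = flat[np.arange(num_joints), idx]         # heat-map value at each peak
    return np.stack([xs, ys], axis=1), scores

# Toy example: two joints on a 4 x 5 grid.
maps = np.zeros((2, 4, 5))
maps[0, 1, 3] = 0.9   # joint 0 peaks at row 1, column 3
maps[1, 2, 0] = 0.7   # joint 1 peaks at row 2, column 0
joints, scores = decode_heatmaps(maps)
```

Because the heat maps are low resolution, the decoded coordinates would still need to be rescaled to the input frame size; that step is omitted here.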

[Figure 1: block diagram. Panel titles: Pose Estimation, Preprocessing, Feature Extraction, Action Classification. Blocks: Video stream, Joint Prediction, Heat Map, Estimated skeleton, Hip Transformation, Theta Transformation, Joint Vector, Body height, Normalized JV, Body Disp, Joint Disp, Fisher Vector, Dimensionality Reduction-PCA, Class, Add new person label, Update the action of the person.]

Figure 1 – The architecture of the proposed model
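The Hip Transformation and Theta Transformation blocks of Figure 1 are not defined in this section. One plausible reading, sketched below under stated assumptions, is that the hip transformation translates the skeleton so that a hip (root) joint becomes the origin, and the theta transformation rotates it about the vertical axis so that the line between the two hips lies along the x-axis. The function names, the joint indices and the (x, y, depth) coordinate layout are hypothetical and may differ from the authors' formulation.

```python
import numpy as np

def hip_transform(joints, root):
    """Translate joints of shape (J, 3) so that the root (hip) joint is the origin."""
    joints = np.asarray(joints, dtype=float)
    return joints - joints[root]

def theta_transform(joints, left_hip, right_hip):
    """Rotate about the vertical (y) axis so the hip line has zero depth component.

    Assumed interpretation of the theta transformation, for illustration only.
    """
    joints = np.asarray(joints, dtype=float)
    across = joints[right_hip] - joints[left_hip]
    theta = np.arctan2(across[2], across[0])   # angle of the hip line in the x-z plane
    c, s = np.cos(theta), np.sin(theta)
    rot_y = np.array([[c, 0.0, s],
                      [0.0, 1.0, 0.0],
                      [-s, 0.0, c]])           # maps (cos t, y, sin t) to (1, y, 0)
    return joints @ rot_y.T

# Toy skeleton: root, left hip, right hip, with coordinates (x, y, depth).
skeleton = np.array([[1.0, 1.0, 1.0],
                     [0.0, 1.0, 0.0],
                     [2.0, 1.0, 2.0]])
centered = hip_transform(skeleton, root=0)
aligned = theta_transform(centered, left_hip=1, right_hip=2)
```

After both steps the root sits at the origin and the two hips differ only in their x-coordinate, so skeletons captured from different camera angles become directly comparable.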
