Page 210 - Kaleidoscope Academic Conference Proceedings 2020
2020 ITU Kaleidoscope Academic Conference
The methods based on dense trajectories [6], which employ a Gaussian mixture model (GMM) for codebook generation and Fisher vector encoding for action recognition, have shown better performance. Although a motion trajectory that describes fine motion captures both the dynamics and the appearance of an action in a scene, it is a low-level descriptor and is not sufficient on its own for action recognition. This is because it lacks action semantics at the global level.

The proposed system uses deep machine-learning techniques to improve accuracy in action recognition (e.g., wave, punch, kick, jump) over existing approaches. The major contribution of this work is twofold: a skeleton generator that generates the skeleton joint points for the human object, and an action detector that considers the sequence of feature vectors. Our models are designed to leverage deep-learning techniques while complying with the criteria set by Recommendation ITU-T H.626.5. The system models have been developed to meet the requirements listed in ITU-T H.626.5, "Architecture for intelligent visual surveillance systems" [7], and ITU-T F.743, "Requirements and service description for video surveillance" [8]. In our system, target recognition and association are achieved with a combination of DNN and HGN to recognize the action performed by the human.

The remainder of the paper is organized as follows. Section 2 provides the architectural details of the proposed system and Section 3 describes the theory behind system development. The implementation details for performance evaluation and the experimental results are presented in Section 4, while Section 5 concludes the paper.

2. PROPOSED SYSTEM

The outline of the proposed action recognition system is shown in Figure 1. In this system model, successive frames of the video stream are used to generate the skeleton of a human subject. The estimated skeleton is transformed using hip and theta transformations to remove the occlusion effect on the frames caused by differences in viewpoint and camera angle. From the skeleton sequence, the features of the joints are extracted, encoded using a Fisher vector and reduced using PCA. The resulting feature code sequence is used to train the classifier model, which is then used to identify the action classes.

Skeleton tracking is used to obtain the estimated skeleton of the human subject. This is performed by the pose estimation method, which provides a set of points representing the joint coordinates of the human skeleton. It consists of a two-dimensional (2D) pose estimation module and a depth regression module, which predict the 2D joint locations and the depth values, respectively, and it is implemented using an hourglass network architecture. The network output is a set of low-resolution heat maps. Each map represents a 2D probability distribution of one joint.
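To illustrate how a set of per-joint heat maps can be turned into joint coordinates, the following minimal sketch takes the peak of each map as the joint location. It is not the authors' implementation: the function names, the toy 3x3 maps and the `stride` parameter (a hypothetical down-sampling factor between the input frame and the low-resolution heat map) are assumptions made for illustration only.

```python
from typing import List, Tuple

HeatMap = List[List[float]]  # rows x cols grid of per-pixel joint scores


def decode_joint(heat_map: HeatMap, stride: int = 4) -> Tuple[int, int, float]:
    """Return (x, y, confidence) for one joint from its heat map.

    The cell with the highest score is taken as the joint location and
    scaled by `stride` (assumed down-sampling factor, hypothetical).
    """
    best_row, best_col, best_val = 0, 0, float("-inf")
    for r, row in enumerate(heat_map):
        for c, val in enumerate(row):
            if val > best_val:
                best_row, best_col, best_val = r, c, val
    # x follows the column index, y follows the row index.
    return best_col * stride, best_row * stride, best_val


def decode_skeleton(heat_maps: List[HeatMap], stride: int = 4):
    """Decode one (x, y, confidence) triple per joint heat map."""
    return [decode_joint(h, stride) for h in heat_maps]


# Two toy 3x3 heat maps, one per joint (illustrative values only).
toy_maps = [
    [[0.0, 0.1, 0.0], [0.0, 0.2, 0.9], [0.0, 0.0, 0.1]],
    [[0.8, 0.1, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.0]],
]
skeleton = decode_skeleton(toy_maps)
```

Taking the arg-max is the simplest decoding rule; a sub-pixel refinement or an expectation over the distribution could be substituted without changing the interface.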
[Figure 1: block diagram. Blocks shown: video stream; Pose Estimation (joint prediction, heat map); estimated skeleton; Preprocessing (hip transformation, theta transformation); Feature Extraction (joint vector, body height, normalized JV, body disp, joint disp); Fisher vector; dimensionality reduction (PCA); action classification; class; add new person label; update the action of the person.]
Figure 1 – The architecture of the proposed model
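The preprocessing stage in Figure 1 can be sketched as follows. This section does not define the hip and theta transformations, so the sketch assumes one plausible reading: the hip transformation translates the skeleton so that the hip joint is the origin, and the theta transformation rotates it about the vertical axis so that the line between the two hips lies along the x axis. The joint indices, function names and toy skeleton are hypothetical.

```python
import math
from typing import List, Tuple

Joint = Tuple[float, float, float]  # (x, y, z), with y as the vertical axis


def hip_transform(joints: List[Joint], hip_index: int = 0) -> List[Joint]:
    """Translate the skeleton so the hip joint sits at the origin (assumed)."""
    hx, hy, hz = joints[hip_index]
    return [(x - hx, y - hy, z - hz) for x, y, z in joints]


def theta_transform(joints: List[Joint], left_hip: int = 1,
                    right_hip: int = 2) -> List[Joint]:
    """Rotate about the y axis so the left-to-right hip line points along +x."""
    lx, _, lz = joints[left_hip]
    rx, _, rz = joints[right_hip]
    theta = math.atan2(rz - lz, rx - lx)  # current heading of the hip line
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Rotate every joint by -theta in the x-z plane; y is unchanged.
    return [(x * cos_t + z * sin_t, y, -x * sin_t + z * cos_t)
            for x, y, z in joints]


# Toy skeleton: hip centre, left hip, right hip, head (illustrative only).
raw = [(1.0, 1.0, 1.0), (1.0, 1.0, 0.0), (1.0, 1.0, 2.0), (1.0, 2.0, 1.0)]
normalized = theta_transform(hip_transform(raw))
```

After both steps the toy subject, originally facing sideways to the camera, is centred at the hip with the hips aligned to the x axis, so skeletons captured from different viewpoints become directly comparable.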