Page 130 - Kaleidoscope Academic Conference Proceedings 2021

P. 130

2021 ITU Kaleidoscope Academic Conference

temporal annotations during training at the frame level and 3. In Section 4, the implementation detail for performance
assume the number of future frames and predicts labels for evaluation and experimental results are discussed; and this is
future frames. followed by the conclusion in Section 5.

In traditional methods, extracted features are influenced by 2. PROPOSED SYSTEM
noisy data and pattern-based human activity recognition
methods extract problem-specific features only. A The architecture of the proposed action recognition system
combination of a Convolutional Neural Network (CNN) and is shown in Figure 1. The skeleton of a human subject is
LSTM solves a cumbersome tuning process but loses some generated from the streaming video. A centroid method is
information when the input sequence is long. Therefore, used to differentiate the key points of each individual in the
multiple feature fusions based on CNN and LSTM along video frame. The skeleton sequence is used to extract the
with an attention mechanism [6] are used to obtain more features of the joints and body displacement, and it is
information avoiding the influence of noise data. However, optimized using a Linear Discriminant Analysis (LDA)
this method involves a large calculation and implementing a technique for the dimensionality reduction. The classifier
recognition system on the streaming data is more difficult to model is trained using an optimal feature code sequence
achieve the desired results considering real-time constraints. which identifies the action classes. If the videos pauses due
to the large delay in communication network, an activity
The proposed system utilizes deep learning-based techniques forecasting module helps to predict the future pose and
to improve performance accuracy for the recognition of motion.
abnormal activities in an indoor environment, The major
contribution of the work is skeleton activity forecasting for The estimated skeleton of the human obtained using skeleton
predicting the future pose and motion of the individual, and tracking is carried out by the pose estimation method. The
classifying activities as normal, abnormal or suspicious joint coordinates of the human skeleton is provided by the
activities on a streaming video. Our system model designed pose estimation method as a set of points. This method
to leverage deep-learning techniques has been developed to consists of a depth regression module and a 2-D pose
meet the requirements specified in Recommendation ITU-T estimation module. It predicts the depth values and 2-D joint
H.627 - “Signaling and protocols for a video surveillance locations. The heat map provides the maximum probable
system” [6]. point for each joint from all the predicted values. Each map
signifies a 2-D probability distribution of one joint. Using the
The rest of the paper is organized as follows. The heat map all the joints of the human subject are estimated and
architectural details of the proposed system is presented in these joint values are used for further processing.
Section 2, and algorithm development is described in Section

Spatio-temporal key
Human Pose
Video Estimation points Linear
Stream (Skeleton Discriminant
Generation) Body displacement Analysis
(Feature Selection)

Feature Extraction

Normal (walk, sit, stand)

Pose Prediction

Multi-class Abnormal (fight, chase,

Motion Prediction classification crowd running)

Suspicious (loitering,
Activity Forecasting hiding face)

Figure 1 – System architecture

– 68 –

125 126 127 128 129 130 131 132 133 134 135