VISUAL ACTION RECOGNITION USING DEEP LEARNING IN VIDEO SURVEILLANCE SYSTEMS

Dhananjay Kumar¹; Priyanka T¹; Aishwarya Murugesh¹; Ved P. Kafle²

1 Department of Information Technology, Anna University, MIT Campus, Chennai, India
2 National Institute of Information and Communications Technology, Tokyo, Japan
ABSTRACT

The skeleton tracking technique allows the skeleton information of human-like objects to be used for action recognition. The major challenge in action recognition in a video surveillance system is the large variability across and within subjects. In this paper, we propose a novel deep-learning-based framework to recognize human actions using skeleton estimation. The main component of the framework consists of pose estimation using a stacked hourglass network (HGN). The pose estimation module provides the skeleton joint points of humans. Since the position of the skeleton varies according to the point of view, we apply transformations on the skeleton points to make them invariant to rotation and position. The skeleton joint positions are identified using HGN-based deep neural networks (HGN-DNN), and feature extraction and classification are carried out to obtain the action class. The skeleton action sequence is encoded using a Fisher Vector before classification. The proposed system complies with Recommendation ITU-T H.626.5 "Architecture for intelligent visual surveillance systems", and has been evaluated over benchmarked human action recognition data sets. The evaluation results show that the system achieves a precision of 85% and an accuracy of 95.6% in recognizing actions such as wave, punch, and kick. The HGN-DNN model meets the requirements and service description specified in Recommendation ITU-T F.743.

Keywords – Action recognition, CNN, deep learning, feature extraction, skeleton processing, video stream
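The exact transformation that makes the skeleton points invariant to rotation and position is not given in this section. Purely as an illustration of the kind of normalization the abstract describes, the following Python sketch centres the joints on a root joint and aligns a reference bone with the vertical axis; the joint indices, the use of 2D coordinates, and the optional scale step are assumptions.

import numpy as np

def normalize_skeleton(joints, root_idx=0, ref_idx=1):
    """Translate and rotate 2D joint coordinates so that the pose is
    (approximately) invariant to image position and in-plane rotation.

    joints   : array of shape (num_joints, 2) from the pose estimator.
    root_idx : index of the joint used as the origin (e.g., hip centre).
    ref_idx  : index of a reference joint (e.g., neck) fixing orientation.
    """
    joints = np.asarray(joints, dtype=float)

    # 1) Remove position: translate so the root joint sits at the origin.
    centred = joints - joints[root_idx]

    # 2) Remove in-plane rotation: rotate so the root->reference bone
    #    aligns with the vertical axis.
    dx, dy = centred[ref_idx]
    angle = np.arctan2(dx, dy)              # angle of the reference bone
    c, s = np.cos(angle), np.sin(angle)
    rotation = np.array([[c, -s], [s, c]])
    aligned = centred @ rotation.T

    # 3) Optionally remove scale so subjects at different distances compare.
    scale = np.linalg.norm(aligned[ref_idx]) or 1.0
    return aligned / scale

Normalizing in this way means the downstream feature extraction sees the same skeleton regardless of where in the frame, and at what angle, the person appears.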
1. INTRODUCTION

A video stream carries a large amount of media data with multiple modalities (e.g., frames, motion, audio), making action recognition very complex and challenging. In a machine-learning approach, the spatio-temporal attention network for action recognition works on video segments represented by multiple modalities, where each modality could be modeled as a single stream [1]. The representations of each video segment on the different modalities are concatenated and sequentially fed into a neural network (e.g., a long short-term memory network) to learn the temporal attention. The temporal attention needs to be exploited further using a higher order of learning to predict the action in the video.
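Purely as an illustration of this kind of architecture (not the specific network of [1]), the following PyTorch-style sketch concatenates per-segment features from several modalities and pools the LSTM outputs with a learned temporal attention; the class name, layer sizes, and number of classes are placeholder assumptions.

import torch
import torch.nn as nn

class TemporalAttentionNet(nn.Module):
    """Toy multi-modal sequence classifier: per-segment features from each
    modality are concatenated, passed through an LSTM, and pooled with a
    learned temporal attention before classification."""

    def __init__(self, modality_dims=(512, 256, 128), hidden=256, num_classes=10):
        super().__init__()
        self.lstm = nn.LSTM(sum(modality_dims), hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)            # one score per time step
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, modality_feats):
        # modality_feats: list of tensors, each of shape (batch, time, dim_m)
        x = torch.cat(modality_feats, dim=-1)        # (batch, time, sum of dims)
        h, _ = self.lstm(x)                          # (batch, time, hidden)
        weights = torch.softmax(self.attn(h), dim=1) # temporal attention weights
        pooled = (weights * h).sum(dim=1)            # attention-weighted summary
        return self.classifier(pooled)

The attention weights give each time step an importance score, so segments that carry little motion information contribute less to the pooled representation used for classification.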
A spatio-temporal approach considering different distributions of interest points can provide an efficient method for action classification [2]. However, the appearance of the spatio-temporal points can influence the performance. A hierarchical spatio-temporal model [3] for recognizing the actions of a single person, as well as for identifying activities with interactions, needs to consider spatial constraints along with temporal constraints. The technique requires the spatial similarity and the temporal similarity of the monitored activities to be computed together to provide a superior classification result. The activities of a single person in a well-defined scenario can be classified with high accuracy. However, a suitable learning algorithm that trains all parameters efficiently and effectively can improve the classification ability by jointly estimating the spatio-temporal similarity of activities. Furthermore, it can offer a unified framework for modeling both one-person actions and multi-person activities.
action recognition works on video segments represented by frames are applied to a hierarchical decomposition.
multiple modalities, where each modality could be modeled Furthermore, the PCA is applied on every hierarchy level to
as a single stream [1]. The representations of each video reduce the dimensions. Although an overlapping window
segment on different modalities are concatenated and can be used in selecting video, in order to improve efficiency
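As a rough sketch of this kind of pipeline (not the exact method of [4]), pairwise joint distances and an entropy term could serve as frame features, with PCA for dimensionality reduction and a multi-class support vector machine for classification. The feature construction, histogram binning, and retained-variance threshold below are assumptions.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

def frame_features(joint_positions):
    """Rough stand-in for Euclidean-distance / joint-entropy features:
    pairwise distances between joints plus the entropy of their histogram."""
    dists = np.linalg.norm(
        joint_positions[:, None, :] - joint_positions[None, :, :], axis=-1)
    pair_dists = dists[np.triu_indices(len(joint_positions), k=1)]
    counts, _ = np.histogram(pair_dists, bins=16)
    p = counts / counts.sum()
    p = p[p > 0]
    entropy = -np.sum(p * np.log(p))
    return np.append(pair_dists, entropy)

# X: one feature vector per segmented frame region, y: action class labels.
# PCA keeps the components explaining 95% of the variance; an RBF SVC
# performs the multi-class classification (one-vs-one by default).
model = make_pipeline(StandardScaler(), PCA(n_components=0.95), SVC(kernel="rbf"))
# model.fit(X_train, y_train); predictions = model.predict(X_test)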
The salient features for each frame can be extracted using convolutional neural networks (CNN) [5] and then mapped onto codes. In order to minimize the computational requirements, key frames are selected based on changes in the code. The video snippets consisting of consecutive key frames are subjected to a hierarchical decomposition. Furthermore, PCA is applied on every hierarchy level to reduce the dimensions. Although an overlapping window can be used in selecting video, key frame selection and its binary code are used to improve efficiency, provided the snippet has sufficient information on the motion representing an action.
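The coding scheme of [5] is not reproduced here; as a minimal sketch of key-frame selection driven by changes in a frame's binary code, per-frame CNN features could be projected to short codes and a frame retained whenever its code differs sufficiently from the previous key frame's code. The random projection, code length, and Hamming threshold are illustrative assumptions.

import numpy as np

def binary_code(feature, projection):
    """Map a CNN frame feature to a short binary code by random projection."""
    return (feature @ projection > 0).astype(np.uint8)

def select_key_frames(frame_features, code_bits=64, min_hamming=8, seed=0):
    """Keep a frame as a key frame when its binary code differs from the
    previous key frame's code by at least `min_hamming` bits.

    frame_features: array of shape (num_frames, feature_dim).
    """
    rng = np.random.default_rng(seed)
    projection = rng.standard_normal((frame_features.shape[1], code_bits))
    key_frames, last_code = [], None
    for idx, feat in enumerate(frame_features):
        code = binary_code(feat, projection)
        if last_code is None or np.count_nonzero(code != last_code) >= min_hamming:
            key_frames.append(idx)
            last_code = code
    return key_frames

In this form the Hamming threshold directly trades off the number of key frames kept against the risk of missing short motions.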