Page 129 - Kaleidoscope Academic Conference Proceedings 2021

ABNORMAL ACTIVITY RECOGNITION USING DEEP LEARNING IN STREAMING
VIDEO FOR INDOOR APPLICATION

Dhananjay Kumar and Srinivasan Ramapriya Sailaja

Department of Information Technology, Anna University, MIT Campus, Chennai, India

ABSTRACT

Human activity recognition has emerged as a challenging research domain for video analysis. The major issue for abnormal activity recognition in a streaming video is the presence of large spatio-temporal data along with the constraints of communication networks affecting the quality of received data for analysis. In this paper, we propose a deep learning-based system to identify abnormal human activities using a combination of Skeleton Activity Forecasting (SAF) and a Bi-LSTM network. The generated skeleton joint points of a human subject are used for pose estimation. The skeleton tracking and regions of interest points are estimated on a streaming video from an IP networked camera. The extracted interest points and their corresponding features are optimized and used to classify them as normal, abnormal or suspicious actions. The proposed system complies with Recommendation ITU-T H.627 “Signalling and protocols for a video surveillance system” and has been tested and evaluated over benchmarked data sets for the recognition of human actions. The system attains a precision of 85.6% and an accuracy of 97.2% in recognizing different actions.

Keywords – Action recognition, activity forecasting, deep learning, human skeleton, video stream

1. INTRODUCTION

According to a global market research report by MarketsandMarkets [1], the worldwide video surveillance market is projected to grow from $45.5 billion in 2020 to $74.6 billion by 2025. The growing concern about home/office safety and security, and a rise in affordability of IP-based camera systems are the main reasons behind the explosive growth in video surveillance systems. The automation of human action recognition in video streaming systems will lead to a new level of user experience in creating a peripheral, as well as indoor, security system, as an IP-based distributed networked system solution allows anytime, anywhere access to services.

Human activity recognition is a challenging time series classification task that aims to detect simple or complex activities in the real world. It is developed in the framework of continual surveillance of human behavior [2]. The complexity of the problem remains for many reasons, such as distance from the camera and changes in viewpoint, the complexity of the background, and sometimes discontinuity in the streaming video feed, despite the important growth in sensing and capturing capability in visual surveillance systems.

The state-of-the-art system in human recognition lacks sufficient intelligence to handle a large number of activities resulting from the motions of human subjects that are hard to capture and represent in terms of frames. When annotated data is sparse and hand-crafted features are hard to obtain, deep-learning models can be adopted [3]. The patterns of dynamics of local motions are required to be learned, and for local atomic action patterns, dense trajectories help to extract spatio-temporal patterns. However, for high-level actions, Long Short-Term Memory (LSTM) neural networks are desirable.

Although the application of deep-learning techniques in visual action recognition helps to enhance the required machine intelligence, the deployment of the algorithms limits their usage in real-time applications. Human joint key point-based approaches are efficient as they deal with selected temporal-spatial features. If some joint points are occluded, prediction is required to sustain the detection and recognition process. The joints could be predicted using key points to obtain a heat map and are connected using a bipartite graph [4]. However, use of temporal information improves the results of pose estimation for a multi-person video stream.

In streaming video, a prediction mechanism needs to be incorporated in the data analysis process to cope with the absence of input data in time sequence due to the prevailing constraints of the communication networks. Predictions are reliable if the model observes more data and unreliable if it predicts far into the future. A weakly supervised model generates pseudo-representations for future frames, which are forecast to future symbolic action sequences using attention mechanisms without any assumption about the length of the sequence. When predicting future actions, it emits an end-of-sequence token and relies on decoders to generate future action labels [5]. Some methods require precise

978-92-61-33881-7/CFP2168P @ ITU 2021 – 67 – Kaleidoscope
