Page 129 - Kaleidoscope Academic Conference Proceedings 2021

ABNORMAL ACTIVITY RECOGNITION USING DEEP LEARNING IN STREAMING
                                        VIDEO FOR INDOOR APPLICATION




                                       Dhananjay Kumar and Srinivasan Ramapriya Sailaja

                         Department of Information Technology, Anna University, MIT Campus, Chennai, India



ABSTRACT

Human activity recognition has emerged as a challenging research domain for
video analysis. The major issue for abnormal activity recognition in streaming
video is the large volume of spatio-temporal data, along with the constraints
of communication networks that affect the quality of the data received for
analysis. In this paper, we propose a deep learning-based system to identify
abnormal human activities using a combination of Skeleton Activity Forecasting
(SAF) and a Bi-LSTM network. The generated skeleton joint points of a human
subject are used for pose estimation. The skeleton tracking and
region-of-interest points are estimated on streaming video from an IP networked
camera. The extracted interest points and their corresponding features are
optimized and used to classify actions as normal, abnormal or suspicious. The
proposed system complies with Recommendation ITU-T H.627 “Signalling and
protocols for a video surveillance system” and has been tested and evaluated on
benchmark data sets for the recognition of human actions. The system attains a
precision of 85.6% and an accuracy of 97.2% in recognizing different actions.

Keywords – Action recognition, activity forecasting, deep learning, human
skeleton, video stream

1.  INTRODUCTION

According to a global market research report by MarketsandMarkets [1], the
worldwide video surveillance market is projected to grow from $45.5 billion in
2020 to $74.6 billion by 2025. The growing concern about home/office safety and
security, and a rise in the affordability of IP-based camera systems are the
main reasons behind the explosive growth in video surveillance systems. The
automation of human action recognition in video streaming systems will lead to
a new level of user experience in creating a peripheral, as well as indoor,
security system, as an IP-based distributed networked system solution allows
anytime, anywhere access to services.

Human activity recognition is a challenging time series classification task
that aims to detect simple or complex activities in the real world. It is
developed in the framework of continual surveillance of human behavior [2]. The
problem remains complex for many reasons, such as distance from the camera,
changes in viewpoint, the complexity of the background, and sometimes
discontinuity in the streaming video feed, despite the significant growth in
sensing and capturing capability in visual surveillance systems.

The state-of-the-art system in human recognition lacks sufficient intelligence
to handle a large number of activities resulting from the motions of human
subjects that are hard to capture and represent in terms of frames. When
annotated data is sparse and hand-crafted features are hard to obtain,
deep-learning models can be adopted [3]. The patterns of dynamics of local
motions need to be learned, and for local atomic action patterns, dense
trajectories help to extract spatio-temporal patterns. However, for high-level
actions, Long Short-Term Memory (LSTM) neural networks are desirable.

Although the application of deep-learning techniques in visual action
recognition helps to enhance the required machine intelligence, the deployment
of these algorithms limits their usage in real-time applications. Approaches
based on human joint key points are efficient, as they deal with selected
spatio-temporal features. If some joint points are occluded, prediction is
required to sustain the detection and recognition process. The joints can be
predicted using key points to obtain a heat map and are connected using a
bipartite graph [4]. However, the use of temporal information improves the
results of pose estimation for multi-person video streams.

In streaming video, a prediction mechanism needs to be incorporated in the data
analysis process to cope with the absence of input data in the time sequence
due to the prevailing constraints of the communication networks. Predictions
are reliable if the model observes more data and unreliable if it predicts far
into the future. A weakly supervised model generates pseudo-representations for
future frames, which are forecast into future symbolic action sequences using
attention mechanisms without any assumption about the length of the sequence.
When predicting future actions, it emits an end-of-sequence token and relies on
decoders to generate future action labels [5]. Some methods require precise





           978-92-61-33881-7/CFP2168P @ ITU 2021           – 67 –                                   Kaleidoscope