
VISUAL ACTION RECOGNITION USING DEEP LEARNING IN VIDEO SURVEILLANCE SYSTEMS




Dhananjay Kumar¹; Priyanka T¹; Aishwarya Murugesh¹; Ved P. Kafle²

¹ Department of Information Technology, Anna University, MIT Campus, Chennai, India
² National Institute of Information and Communications Technology, Tokyo, Japan

ABSTRACT

The skeleton tracking technique allows the skeleton information of human-like objects to be used for action recognition. The major challenge in action recognition in a video surveillance system is the large variability across and within subjects. In this paper, we propose a novel deep-learning-based framework to recognize human actions using skeleton estimation. The main component of the framework is pose estimation using a stacked hourglass network (HGN). The pose estimation module provides the skeleton joint points of humans. Since the position of the skeleton varies with the point of view, we apply transformations on the skeleton points to make them invariant to rotation and position. The skeleton joint positions are identified using HGN-based deep neural networks (HGN-DNN), and feature extraction and classification are carried out to obtain the action class. The skeleton action sequence is encoded using a Fisher Vector before classification. The proposed system complies with Recommendation ITU-T H.626.5, "Architecture for intelligent visual surveillance systems", and has been evaluated over benchmark human action recognition data sets. The evaluation results show that the system achieves a precision of 85% and an accuracy of 95.6% in recognizing actions such as wave, punch, and kick. The HGN-DNN model meets the requirements and service description specified in Recommendation ITU-T F.743.

Keywords – Action recognition, CNN, deep learning, feature extraction, skeleton processing, video stream

1. INTRODUCTION

A video stream carries a large amount of media data with multiple modalities (e.g., frame, motion, audio), making action recognition very complex and challenging. In a machine-learning approach, a spatio-temporal attention network for action recognition works on video segments represented by multiple modalities, where each modality can be modeled as a single stream [1]. The representations of each video segment on the different modalities are concatenated and sequentially fed into a neural network (e.g., long short-term memory) to learn the temporal attention. The temporal attention needs to be exploited further, using a higher order of learning, to predict the action in the video.
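As a rough illustration of this kind of architecture (a sketch under our own assumptions, not the model of [1]), the following Python/PyTorch fragment feeds concatenated per-segment modality features to an LSTM and applies a soft temporal attention over its outputs before classification; the segment count, the 512-dimensional concatenated feature, and the class count are placeholders.

# Illustrative sketch only: LSTM over per-segment multimodal features with a
# soft attention over time steps. Dimensions and inputs are placeholders.
import torch
import torch.nn as nn

class TemporalAttentionNet(nn.Module):
    def __init__(self, feat_dim=512, hidden=256, num_classes=10):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.attn = nn.Linear(hidden, 1)          # scores each time step
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, x):                          # x: (batch, segments, feat_dim)
        h, _ = self.lstm(x)                        # (batch, segments, hidden)
        weights = torch.softmax(self.attn(h), dim=1)   # temporal attention weights
        context = (weights * h).sum(dim=1)         # attention-weighted summary
        return self.classifier(context)

# Placeholder input: 4 videos, 8 segments each, with frame, motion and audio
# features assumed to be already concatenated into one 512-dim vector per segment.
logits = TemporalAttentionNet()(torch.randn(4, 8, 512))
print(logits.shape)                                # torch.Size([4, 10])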
The spatio-temporal approach, considering different distributions of interest points, can provide an efficient method for action classification [2]. However, the appearance of the spatio-temporal points can influence the performance. A hierarchical spatio-temporal model [3] for action recognition of a single person, as well as for identification of activities involving interactions, needs to consider spatial constraints along with temporal constraints. The technique requires the spatial and temporal similarities of the monitored activities to be computed together to provide a superior classification result. The activities of a single person in a well-defined scenario can be classified with high accuracy. However, a suitable learning algorithm that trains all parameters efficiently and effectively can improve the classification ability by jointly estimating the spatio-temporal similarity of activities. Furthermore, it can offer a unified framework for modeling both one-person actions and multi-person activities.

A commonly used framework for human detection and action recognition in a video stream encompasses uniform segmentation and a combination of Euclidean distance and joint entropy features [4]. Feature selection by the Euclidean distance and joint entropy-PCA (principal component analysis) based method, followed by classification using a multi-class support vector machine, requires a higher level of learning. The method first needs to intensify the frames to extract the moving objects and then classify the region frames based on the feature vectors.
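The following sketch illustrates only the general shape of such a pipeline, assuming the Euclidean-distance and joint-entropy features have already been computed per video segment (the feature values below are placeholders): the features are reduced with PCA and classified with a multi-class SVM using scikit-learn.

# Illustrative sketch only: PCA-based dimensionality reduction followed by a
# multi-class SVM, in the spirit of the pipeline in [4]. The inputs are
# random placeholders, not the actual features used in that work.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 128))    # placeholder feature vectors per segment
y_train = rng.integers(0, 5, size=200)   # placeholder action labels (5 classes)

# Reduce dimensionality, then classify with a one-vs-rest multi-class SVM.
model = make_pipeline(PCA(n_components=32),
                      SVC(kernel="rbf", decision_function_shape="ovr"))
model.fit(X_train, y_train)

X_test = rng.normal(size=(10, 128))
print(model.predict(X_test))             # predicted action class per segment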
Salient features can be extracted for each frame using convolutional neural networks (CNN) [5] and then mapped onto codes. In order to minimize the computational requirements, key frames are selected based on changes in the code. The video snippets consisting of consecutive key frames are subjected to a hierarchical decomposition. Furthermore, PCA is applied at every hierarchy level to reduce the dimensions. Although an overlapping window can be used to select the video snippets, key frame selection based on the binary codes is used to improve efficiency, provided the snippet carries sufficient motion information to represent an action.
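A minimal sketch of this key-frame selection step, assuming a binary code per frame is already available (for example, from thresholded CNN features): a frame becomes a new key frame whenever its code differs from the last key frame's code by more than a Hamming-distance threshold. The code length and threshold below are illustrative assumptions.

# Illustrative sketch only: select key frames where the per-frame binary code
# changes by more than a Hamming-distance threshold. Frame codes are assumed
# to be given (e.g., derived from CNN features); values here are placeholders.
import numpy as np

def select_key_frames(codes: np.ndarray, threshold: int = 4) -> list[int]:
    """codes: (num_frames, code_length) array of 0/1 values."""
    key_frames = [0]                       # always keep the first frame
    last = codes[0]
    for i in range(1, len(codes)):
        if np.count_nonzero(codes[i] != last) > threshold:
            key_frames.append(i)           # code changed enough: new key frame
            last = codes[i]
    return key_frames

rng = np.random.default_rng(0)
frame_codes = rng.integers(0, 2, size=(300, 64))   # placeholder binary codes
print(select_key_frames(frame_codes))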




