
2021 ITU Kaleidoscope Academic Conference




temporal annotations during training at the frame level and assume the number of future frames and predicts labels for future frames.

In traditional methods, extracted features are influenced by noisy data, and pattern-based human activity recognition methods extract only problem-specific features. A combination of a Convolutional Neural Network (CNN) and LSTM avoids a cumbersome tuning process but loses some information when the input sequence is long. Therefore, multiple feature fusions based on CNN and LSTM, along with an attention mechanism [6], are used to obtain more information while avoiding the influence of noisy data. However, this method is computationally expensive, and implementing a recognition system on streaming data makes it more difficult to achieve the desired results under real-time constraints.

The proposed system uses deep learning-based techniques to improve the accuracy of recognizing abnormal activities in an indoor environment. The major contribution of the work is skeleton activity forecasting, which predicts the future pose and motion of the individual, and the classification of activities in a streaming video as normal, abnormal or suspicious. Our system model, designed to leverage deep-learning techniques, has been developed to meet the requirements specified in Recommendation ITU-T H.627 - “Signaling and protocols for a video surveillance system” [6].

The rest of the paper is organized as follows. The architectural details of the proposed system are presented in Section 2, and algorithm development is described in Section 3. In Section 4, the implementation details for performance evaluation and the experimental results are discussed; this is followed by the conclusion in Section 5.

2.  PROPOSED SYSTEM

The architecture of the proposed action recognition system is shown in Figure 1. The skeleton of a human subject is generated from the streaming video. A centroid method is used to differentiate the key points of each individual in the video frame. The skeleton sequence is used to extract the features of the joints and body displacement, and the feature set is optimized using a Linear Discriminant Analysis (LDA) technique for dimensionality reduction. The classifier model is trained using an optimal feature code sequence which identifies the action classes. If the video pauses due to a large delay in the communication network, an activity forecasting module helps to predict the future pose and motion.

The skeleton of the human subject is estimated and tracked by the pose estimation method, which provides the joint coordinates of the human skeleton as a set of points. This method consists of a depth regression module and a 2-D pose estimation module, and it predicts the depth values and 2-D joint locations. Each heat map represents a 2-D probability distribution for one joint and provides the most probable point for that joint among all the predicted values. Using the heat maps, all the joints of the human subject are estimated, and these joint values are used for further processing.
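
The heat-map step of the pose estimation method can be illustrated with a minimal sketch. It assumes each joint has its own 2-D heat map stored as a nested list of probabilities and takes the location of the maximum value as the joint position. The function names and the toy values are hypothetical and are not taken from the paper; the depth regression module is not modelled here.

```python
from typing import List, Tuple

Heatmap = List[List[float]]  # rows x cols grid of per-pixel joint probabilities


def joint_from_heatmap(heatmap: Heatmap) -> Tuple[int, int, float]:
    """Return (x, y, score) of the most probable location in one joint's heat map."""
    best_x, best_y, best_score = 0, 0, float("-inf")
    for y, row in enumerate(heatmap):
        for x, score in enumerate(row):
            if score > best_score:
                best_x, best_y, best_score = x, y, score
    return best_x, best_y, best_score


def skeleton_from_heatmaps(heatmaps: List[Heatmap]) -> List[Tuple[int, int, float]]:
    """Estimate one 2-D point per joint, given one heat map per joint."""
    return [joint_from_heatmap(h) for h in heatmaps]


# Two toy 3x3 heat maps standing in for two joints (values are made up).
toy_heatmaps = [
    [[0.0, 0.1, 0.0], [0.1, 0.7, 0.1], [0.0, 0.0, 0.0]],
    [[0.0, 0.0, 0.0], [0.0, 0.1, 0.2], [0.0, 0.1, 0.6]],
]
skeleton = skeleton_from_heatmaps(toy_heatmaps)
```

In this sketch the returned score could serve as a confidence value for discarding weakly detected joints, although the paper does not state whether such filtering is applied.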


[Block diagram: Video Stream → Human Pose Estimation (Skeleton Generation) → Feature Extraction (spatio-temporal key points, body displacement) → Linear Discriminant Analysis (Feature Selection) → Multi-class classification → Normal (walk, sit, stand); Abnormal (fight, chase, crowd running); Suspicious (loitering, hiding face). Activity Forecasting block: Pose Prediction, Motion Prediction.]

Figure 1 – System architecture
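
The centroid and body-displacement stages of the architecture can be sketched as follows. This is only an illustration under stated assumptions: the centroid is taken as the mean of a person's 2-D joint coordinates, individuals are matched across frames by nearest centroid, and body displacement is the distance the centroid moves between frames. The paper does not specify these details, and all names and toy coordinates below are hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Skeleton = List[Point]  # 2-D joint coordinates of one person in one frame


def centroid(skeleton: Skeleton) -> Point:
    """Mean of the joint coordinates, used as one reference point per person."""
    n = len(skeleton)
    return (sum(x for x, _ in skeleton) / n, sum(y for _, y in skeleton) / n)


def match_by_centroid(prev: List[Skeleton], curr: List[Skeleton]) -> List[int]:
    """For each skeleton in `curr`, index of the nearest-centroid skeleton in `prev`.

    Simplification: two current skeletons may map to the same previous one.
    """
    prev_centroids = [centroid(s) for s in prev]
    matches = []
    for s in curr:
        c = centroid(s)
        distances = [math.dist(c, p) for p in prev_centroids]
        matches.append(distances.index(min(distances)))
    return matches


def body_displacement(prev: Skeleton, curr: Skeleton) -> float:
    """Distance travelled by the body centroid between two frames."""
    return math.dist(centroid(prev), centroid(curr))


# Toy data: two people, three joints each, detected in a different order per frame.
frame0 = [[(0, 0), (2, 0), (1, 3)], [(10, 10), (12, 10), (11, 13)]]
frame1 = [[(10, 11), (12, 11), (11, 14)], [(3, 4), (5, 4), (4, 7)]]

matches = match_by_centroid(frame0, frame1)
displacements = [body_displacement(frame0[m], s) for m, s in zip(matches, frame1)]
```

Here `matches` links each skeleton in the second frame to a person in the first frame, and `displacements` holds one body-displacement value per person, which could be appended to the joint features before dimensionality reduction.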



