Page 126 - ITU KALEIDOSCOPE, ATLANTA 2019

2019 ITU Kaleidoscope Academic Conference




tracking generic objects rather than specific objects, which may lead to a reduction in the efficiency of the system.

Detection-based tracking algorithms first identify the target object to be tracked and then find the object in each frame of the video. Unique object identification can be achieved with the help of the salient features the object possesses. Many methods have exploited object appearance as an important feature for representing the object numerically. Appearance features such as the histogram of oriented gradients (HoG), the scale-invariant feature transform and the local binary pattern have significantly improved the overall accuracy of object detection and tracking.

Some of the difficulties of existing feature extraction methods have been overcome by CNN in more complex object segmentation-cum-detection processes [12-13]. Further extension of generalized object detection using CNN [14] allows specific object tracking to be performed with high precision. Long Short-Term Memory (LSTM) helps in inferring deeper features from time-series data, which makes it a potential technique to be coupled with CNN for multiple object tracking. Siamese CNN helps in finding similarities in consecutive frames, due to its identical sub-network components [15].

The ITU-T Focus Group on Artificial Intelligence for Health (FG-AI4H) has considered “Falls among the elderly” [16] as one of the key areas that needs to be addressed for better healthcare. Although curvelet coefficient-based fall detection techniques [7] are invariant to translation and scaling, their detection accuracy suffers in the presence of complex backgrounds and moving objects. A machine learning-based approach [8] can handle complex detection scenarios, but training a generic CNN-based network is inefficient and also makes it difficult to achieve high fall-detection accuracy in real-time environments.

To address the above limitations, we propose a system that utilizes machine-learning techniques to improve its accuracy. The major contribution of this work is twofold: a person tracker that considers both appearance and motion features for target association, and a fall detector that considers the sequence of person orientations. Our models are designed to leverage deep-learning techniques while complying with the criteria set by the ITU FG-AI4H. Both models have been developed as per Recommendation ITU-T F.743.1, “Requirements for intelligent visual surveillance”. In our system, target recognition and association are achieved with a combination of CNN and LSTM to uniquely distinguish persons from other objects. The core of the system is HoG-based feature extraction coupled with an LSTM-based model for fall detection, a promising candidate for standardization in ITU.

The remainder of the paper is organized as follows. Section 2 provides a background overview and section 3 describes the proposed system. The implementation details for performance evaluation and the experimental results are presented in section 4, while section 5 concludes the paper.

2.  BACKGROUND

2.1    HoG Feature Descriptor

HoG-based feature extraction [17] uses edge orientations for object detection. It operates on a grayscale image and its workflow is as follows. First, the gradient is computed for each pixel in the image by placing a mask on the image with that pixel at its center and performing element-wise multiplication followed by summation. The orientations of these gradients are then computed and a histogram of orientations is created for each block. The histograms are then subjected to both local and global normalization to produce the required feature descriptor.

2.2   CNN Based Feature Extraction

CNN [12] utilizes kernels for automatic deep feature extraction, classification or detection. A CNN model commonly consists of two important segments: automatic feature extraction and dimensionality reduction. Feature extraction is achieved with the help of convolutional layers. A convolutional layer consists of different kernels, which learn different features through backpropagation. Stacking convolutional layers allows deeper features to be learned. Dimensionality reduction is achieved with the help of pooling layers and dense layers. A combination of all these layers allows the construction of a CNN model that can be utilized to solve problems in different domains.

2.3   Long Short-Term Memory Network

A long short-term memory (LSTM) network [18] is a recurrent neural network that performs well at extracting temporal features in time-series analysis. The LSTM network is made of stacks of cells in order to represent sequential data better. An LSTM cell consists of an input gate, an output gate, and a forget gate. The input gate allows new information to enter the cell, while the forget gate helps the cell retain only the important information about the input data, which improves performance. The LSTM cell uses a sigmoid activation function to restrict the information flow within it and a tanh function to retain relevant features.

3.  PROPOSED SYSTEM

The architecture of the proposed system, as shown in Figure 1, consists of three components: client, server and cloud service. The overall workflow in the proposed system is as follows. A client, typically a hospital room or elderly home, is configured to stream video to the receiver (medical center/caretakers) over HTTP using ITU-T H.264 encoding. Frame processing and video analysis are carried out at the server end, where both the MFPD and VPFD models are executed to track persons and detect occurrences of human falls. The detected person’s location and image are stored in the cloud along with the fall occurrence status. An alert is generated at the concerned client end, either a hospital or a
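The HoG workflow outlined in section 2.1 (per-pixel gradients from a centered mask, per-block orientation histograms, local then global normalization) can be illustrated with a minimal pure-Python sketch. It is not the implementation used in this paper: the [-1, 0, 1] mask, the 4x4 non-overlapping blocks, the nine unsigned orientation bins and all function names are assumptions chosen for brevity.

```python
import math

def gradients(img):
    """Per-pixel gradient magnitude and unsigned orientation (degrees).

    Assumes a centered [-1, 0, 1] mask in x and y; border pixels are clamped.
    """
    h, w = len(img), len(img[0])
    mag = [[0.0] * w for _ in range(h)]
    ang = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            mag[y][x] = math.hypot(gx, gy)
            ang[y][x] = math.degrees(math.atan2(gy, gx)) % 180.0
    return mag, ang

def block_histogram(mag, ang, y0, x0, block, bins):
    """Magnitude-weighted histogram of orientations for one block."""
    hist = [0.0] * bins
    width = 180.0 / bins
    for y in range(y0, y0 + block):
        for x in range(x0, x0 + block):
            b = min(int(ang[y][x] / width), bins - 1)
            hist[b] += mag[y][x]
    return hist

def l2_normalize(v, eps=1e-6):
    """Scale a vector to (approximately) unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v) + eps * eps)
    return [x / norm for x in v]

def hog_descriptor(img, block=4, bins=9):
    """Concatenate locally normalized block histograms, then normalize globally."""
    mag, ang = gradients(img)
    h, w = len(img), len(img[0])
    features = []
    for y0 in range(0, h - block + 1, block):
        for x0 in range(0, w - block + 1, block):
            hist = block_histogram(mag, ang, y0, x0, block, bins)
            features.extend(l2_normalize(hist))   # local normalization
    return l2_normalize(features)                 # global normalization

# Usage: an 8x8 grayscale image with a single vertical edge.
image = [[0] * 4 + [255] * 4 for _ in range(8)]
descriptor = hog_descriptor(image)
print(len(descriptor))  # 36 values: 2x2 blocks of 9 bins each
```

Production HoG implementations usually group small cells into overlapping blocks and interpolate votes between neighbouring bins; the sketch omits both to keep the three steps visible.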

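The gate behaviour described in section 2.3 can be made concrete with a single-unit LSTM cell step. This is a sketch of the standard LSTM formulation, not necessarily the variant used in this paper; the scalar weights, the `params` layout and the function names are illustrative assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_cell_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM cell (scalar input and state).

    params maps each of "input", "forget", "output" and "candidate" to a
    hypothetical (w, u, b) triple: input weight, recurrent weight, bias.
    """
    def unit(name, activation):
        w, u, b = params[name]
        return activation(w * x + u * h_prev + b)

    i = unit("input", sigmoid)        # how much new information enters the cell
    f = unit("forget", sigmoid)       # how much of the old cell state is kept
    o = unit("output", sigmoid)       # how much of the cell state is exposed
    g = unit("candidate", math.tanh)  # candidate features, squashed to (-1, 1)

    c = f * c_prev + i * g            # updated cell state
    h = o * math.tanh(c)              # updated hidden state
    return h, c

# Usage: run a short sequence through the cell with arbitrary weights.
params = {
    "input": (0.5, 0.1, 0.0),
    "forget": (0.3, 0.2, 1.0),
    "output": (0.4, 0.1, 0.0),
    "candidate": (0.8, 0.3, 0.0),
}
h, c = 0.0, 0.0
for x_t in [0.2, -0.1, 0.7]:
    h, c = lstm_cell_step(x_t, h, c, params)
print(h, c)
```

The sigmoid outputs lie in (0, 1) and act as soft switches on the information flow, while tanh bounds the candidate and output values; this is the division of roles the text attributes to the two activation functions.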


