Page 126 - ITU KALEIDOSCOPE, ATLANTA 2019

2019 ITU Kaleidoscope Academic Conference

tracking generic objects rather than specific objects, which may lead to a reduction in the efficiency of the system.

Detection-based tracking algorithms first identify the target object to be tracked and then find the object in each frame of the video. Unique object identification can be achieved with the help of salient features the object possesses. Many methods have exploited the object appearance as an important feature to represent it in a numerical way. Appearance features such as histogram of oriented gradients, scale invariant feature transform and local binary pattern have significantly improved the overall accuracy of object detection and tracking.

Some of the difficulties of existing feature extraction methods have been overcome by CNN in more complex object segmentation-cum-detection processes [12-13]. Further extension of the generalized object detection using CNN [14] allows specific object tracking to be performed with high precision. The usage of Long Short-Term Memory (LSTM) helps in inferring deeper features from time-series data, making it a potential technique to be coupled with CNN for multiple object tracking. Siamese CNN helps in finding similarities in consecutive frames, due to its identical sub-network components [15].

The ITU-T Focus Group on Artificial Intelligence for Health (FG-AI4H) has considered “Falls among the elderly” [16] as one of the key areas that needs to be addressed for better healthcare. Although curvelet coefficient-based fall detection techniques [7] have translation and scaling invariant properties, detection accuracy suffers in the presence of complex backgrounds and moving objects. A machine learning-based approach [8] can handle complex scenarios of detection, but training a CNN-based generic network is not only inefficient but also makes it difficult to achieve a high accuracy of fall detection in real-time environments.

To address the above limitations, we propose a system that utilizes machine-learning techniques to improve its performance accuracy. The major contribution of this work is twofold: a person tracker that considers both appearance and motion features for target association, and a fall detector that considers the sequence of person orientations. Our models are designed to leverage deep-learning techniques while complying with the criteria set by the ITU FG-AI4H. Both models have been developed as per Recommendation ITU-T F.743.1 – “Requirements for intelligent visual surveillance”. In our system, target recognition and association are achieved with the combination of CNN and LSTM to uniquely distinguish persons from other objects. The core of the system combines HoG for feature extraction with an LSTM-based model for fall detection, a promising candidate for standardization in ITU.

The remainder of the paper is organized as follows. Section 2 provides a background overview and section 3 describes the proposed system. The implementation detail for performance evaluation and experimental results are presented in section 4, while section 5 concludes the paper.

2. BACKGROUND

2.1 HOG Feature Descriptor

HoG-based feature extraction [17] uses edge orientations for object detection. It operates on a grayscale image and its workflow is as follows: Initially, gradient computation is carried out for each pixel in the image, by placing a mask on the image with the pixel as its center and performing element-wise multiplication. The orientations of these gradients are then computed and a histogram of orientations is created for each block. This is then subjected to both local and global normalization to finally produce the required feature descriptor.

2.2 CNN Based Feature Extraction

CNN [12] utilizes kernels for automatic deep feature extraction, classification or detection. The CNN model commonly consists of two important segments: automatic feature extraction and dimensionality reduction. Feature extraction is achieved with the help of convolutional layers. A convolutional layer consists of different kernels which learn different features through backpropagation. The stacking of different convolutional layers allows learning of deeper features. Dimensionality reduction is achieved with the help of pooling layers and dense layers. A combination of all these layers allows the construction of a CNN model that can be utilized to solve different domain problems.

2.3 Long Short-Term Memory Network

A long short-term memory (LSTM) network [18] is a recurrent neural network that performs well for time-series based analysis in extracting temporal features. The LSTM network is made of stacks of cells in order to represent the sequential data better. An LSTM cell consists of an input gate, an output gate, and a forget gate. The input gate allows new information to enter into the cell, while the forget gate helps in remembering only the important information regarding the input data in achieving higher performance. The LSTM cell incorporates a sigmoid activation function to restrict the information flow within it and a tanh function in order to remember relevant features.

3. PROPOSED SYSTEM

The architecture of the proposed system, as shown in Figure 1, consists of three components: client, server and cloud service. The overall workflow in the proposed system is as follows. A client, typically a hospital room or elderly home, is configured to stream videos to the receiver (medical center/care takers) over HTTP using ITU-T H.264 encoding. The frame processing and video analysis are carried out on the server end where both MFPD and VPFD models are executed to track persons and detect occurrences of human falls. The detected person’s location and image are stored in the cloud along with the fall occurrence status. An alert is generated at the concerned client end, either a hospital or a
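The HoG workflow described in Section 2.1 can be illustrated with a minimal sketch in plain Python. It is not the implementation used in this work: the [-1, 0, 1] gradient mask, the 9 unsigned orientation bins, the L2 normalization and the function names `gradients` and `block_histogram` are all assumptions chosen for illustration, and only a single block is processed, so the global normalization step is omitted.

```python
import math

def gradients(img):
    """Per-pixel gradient magnitude and orientation.

    Uses the 1-D mask [-1, 0, 1] along x and y (an assumed, common choice).
    """
    h, w = len(img), len(img[0])
    mag = [[0.0] * w for _ in range(h)]
    ang = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Clamp indices at the border so every pixel has two neighbours.
            gx = img[y][min(x + 1, w - 1)] - img[y][max(x - 1, 0)]
            gy = img[min(y + 1, h - 1)][x] - img[max(y - 1, 0)][x]
            mag[y][x] = math.hypot(gx, gy)
            # Unsigned orientation folded into [0, 180) degrees.
            ang[y][x] = math.degrees(math.atan2(gy, gx)) % 180.0
    return mag, ang

def block_histogram(mag, ang, bins=9):
    """Magnitude-weighted orientation histogram of one block, L2-normalised."""
    hist = [0.0] * bins
    width = 180.0 / bins
    for row_m, row_a in zip(mag, ang):
        for m, a in zip(row_m, row_a):
            hist[int(a // width) % bins] += m
    norm = math.sqrt(sum(v * v for v in hist)) + 1e-6
    return [v / norm for v in hist]

# Toy 4x4 grayscale block with a vertical edge (intensity changes along x only).
block = [[0, 0, 255, 255]] * 4
mag, ang = gradients(block)
hog = block_histogram(mag, ang)
```

For this toy block all gradient energy falls into the first orientation bin, because the intensity changes only in the horizontal direction. A full descriptor would concatenate such histograms over all blocks of the image.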
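The two segments described in Section 2.2, feature extraction by convolution and dimensionality reduction by pooling, can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the kernel is set by hand rather than learned through backpropagation, a single kernel and a single layer are used, the dense layers are omitted, and the helper names `conv2d`, `relu` and `max_pool` are hypothetical.

```python
def conv2d(img, kernel):
    """Valid (no padding) 2-D sliding-window product of img with kernel."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(img) - kh + 1, len(img[0]) - kw + 1
    return [[sum(img[y + i][x + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for x in range(out_w)]
            for y in range(out_h)]

def relu(fmap):
    """Element-wise rectifier: negative responses are set to zero."""
    return [[max(0.0, v) for v in row] for row in fmap]

def max_pool(fmap, size=2):
    """Non-overlapping max pooling; halves each dimension for size=2."""
    return [[max(fmap[y + i][x + j] for i in range(size) for j in range(size))
             for x in range(0, len(fmap[0]) - size + 1, size)]
            for y in range(0, len(fmap) - size + 1, size)]

# Toy 5x5 image with a vertical edge between columns 1 and 2.
image = [[0, 0, 1, 1, 1]] * 5
# Hand-set kernel that responds to dark-to-bright transitions along x.
kernel = [[-1, 1],
          [-1, 1]]

feature_map = relu(conv2d(image, kernel))   # 4x4 map, peaks at the edge
pooled = max_pool(feature_map)              # 2x2 map after pooling
```

The 5x5 input is reduced to a 2x2 map whose left column retains the edge response. Stacking several such convolution and pooling stages is what allows deeper features to be learned.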
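The gate structure described in Section 2.3 can be made concrete with a minimal sketch of one LSTM cell step in plain Python. It follows the commonly used LSTM formulation, which is assumed here and not confirmed as the exact variant used in this work. The state is scalar for readability, and the parameter values and the names `lstm_step` and `params` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lstm_step(x, h_prev, c_prev, p):
    """One time step of a scalar LSTM cell.

    p maps each gate name to a tuple (input weight, recurrent weight, bias).
    """
    def pre(gate):
        w_x, w_h, b = p[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))        # how much of the old cell state to keep
    i = sigmoid(pre("input"))         # how much new information to admit
    o = sigmoid(pre("output"))        # how much of the cell state to expose
    g = math.tanh(pre("candidate"))   # candidate new information
    c = f * c_prev + i * g            # updated cell state
    h = o * math.tanh(c)              # updated hidden state (cell output)
    return h, c

# Illustrative, hand-picked parameters (in practice these are learned).
params = {
    "forget":    (0.5, 0.1, 1.0),
    "input":     (0.8, 0.2, 0.0),
    "output":    (0.3, 0.4, 0.0),
    "candidate": (1.0, 0.5, 0.0),
}

h, c = 0.0, 0.0
outputs = []
for x in [0.5, -1.0, 2.0]:            # a short toy input sequence
    h, c = lstm_step(x, h, c, params)
    outputs.append(h)
```

The sigmoid gates take values between 0 and 1 and therefore scale the information flow, while tanh bounds the candidate and output values between -1 and 1.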
