have been used to extract the person's image from the frame. The previously stored images of persons are then compared with the extracted image. The feature extraction layers of this CNN are shared between the two input images, making the network vertically symmetrical. The resultant features of both images are merged by computing their element-wise squared difference, which is then fed into fully-connected layers for dimensionality reduction and, finally, to obtain the similarity score.
Figure 2 - Image similarity using Siamese CNN (two symmetrical branches of convolutional and max-pooling layers, followed by feature concatenation, dense layers and a similarity score)
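For illustration, the architecture of Figure 2 could be sketched in TensorFlow/Keras as follows. Only the shared (symmetrical) feature extractor, the element-wise squared-difference merge and the dense scoring head follow the description above; the input size, filter counts and layer widths are assumptions made for the example.

import tensorflow as tf
from tensorflow.keras import layers, Model

def build_feature_extractor(input_shape=(128, 64, 3)):
    # One branch of Figure 2: convolutional and max-pooling layers.
    inp = layers.Input(shape=input_shape)
    x = layers.Conv2D(32, 3, activation="relu")(inp)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    return Model(inp, x)

def build_siamese(input_shape=(128, 64, 3)):
    # The same extractor instance processes both images (shared
    # weights), which makes the network vertically symmetrical.
    extractor = build_feature_extractor(input_shape)
    img_a = layers.Input(shape=input_shape)
    img_b = layers.Input(shape=input_shape)
    feat_a = extractor(img_a)
    feat_b = extractor(img_b)
    # Merge the two feature vectors via element-wise squared difference.
    merged = layers.Lambda(lambda t: tf.square(t[0] - t[1]))([feat_a, feat_b])
    # Dense layers reduce dimensionality and yield the similarity score.
    x = layers.Dense(128, activation="relu")(merged)
    score = layers.Dense(1, activation="sigmoid")(x)
    return Model(inputs=[img_a, img_b], outputs=score)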
A custom threshold is employed: scores greater than the threshold value are considered similar, and scores below it are considered dissimilar (Algorithm 1).
Algorithm 1: Image similarity
-------------------------------------
Input: Pprev – Previously detected persons
       Pcurr – Currently detected persons
Output: S – appearance similarity matrix
Model_app = Trained Siamese CNN model
M = count(Pcurr)
N = count(Pprev)
S = Null matrix of dimensions M*N
for i = {0, 1, ..., M-1}
    img_curr = Pcurr[i].img
    for j = {0, 1, ..., N-1}
        img_prev = Pprev[j].img
        input = (img_curr, img_prev)
        score = Model_app(input)
        if score > threshold:
            S[i][j] = score
        end if
    end for
end for
return S
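A direct Python transcription of Algorithm 1 might look as follows, assuming that model_app is a callable taking an image pair and returning a scalar score (for instance the Siamese model sketched above) and that each detected person carries its cropped image in an img attribute; both names are illustrative.

import numpy as np

def appearance_similarity(p_prev, p_curr, model_app, threshold=0.5):
    # Builds the M x N appearance similarity matrix of Algorithm 1.
    M, N = len(p_curr), len(p_prev)
    S = np.zeros((M, N))
    for i in range(M):
        img_curr = p_curr[i].img
        for j in range(N):
            img_prev = p_prev[j].img
            score = float(model_app((img_curr, img_prev)))
            # Keep only scores above the custom threshold.
            if score > threshold:
                S[i, j] = score
    return S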
3.5 Motion Similarity

Motion prediction has been used in the proposed system in order to associate objects based on their recent movements. It has been implemented using an LSTM on the basis of the previous 12 center coordinates of each stored person. This sequence of centers is fed as input to the motion LSTM model, which performs temporal processing and predicts a new center for each stored person; this center indicates the person's next possible position. The predicted centers are compared with the centers of the currently detected persons via Euclidean distance, and the inverse of this distance is taken as the overall motion score (Algorithm 2).

Algorithm 2: Motion similarity
-------------------------------------
Input: Pprev – Previously detected persons
       Pcurr – Currently detected persons
Output: S – motion similarity matrix
Model_motion = Trained motion LSTM model
M = count(Pcurr)
N = count(Pprev)
S = Null matrix of dimensions M*N
for i = {0, 1, ..., M-1}
    center_curr = Pcurr[i].center
    for j = {0, 1, ..., N-1}
        center_seq_prev = Pprev[j].center[{1, 2, ..., 12}]
        center_pred = Model_motion(center_seq_prev)
        dist = Euclid_Dist(center_curr, center_pred)
        score = 1 / dist
        if score > threshold:
            S[i][j] = score
        end if
    end for
end for
return S
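As a sketch of the motion model and score described above, the following Keras LSTM consumes the previous 12 (x, y) centers and regresses the next center, and the motion score is the inverse Euclidean distance of Algorithm 2; the layer sizes are assumptions made for the example.

import numpy as np
from tensorflow.keras import layers, Model

def build_motion_lstm():
    # Consumes the previous 12 (x, y) centers, regresses the next center.
    seq = layers.Input(shape=(12, 2))
    x = layers.LSTM(32)(seq)        # temporal processing
    center = layers.Dense(2)(x)     # predicted (x, y) center
    return Model(seq, center)

def motion_score(center_curr, center_pred):
    # Inverse Euclidean distance between predicted and observed centers,
    # as in Algorithm 2 (guarding against division by zero).
    dist = np.linalg.norm(np.asarray(center_curr) - np.asarray(center_pred))
    return 1.0 / dist if dist > 0 else float("inf")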
3.6 Object Mapping

The mapping of previous to current persons is achieved by finding the best candidate for each currently detected person from the appearance and motion similarity matrices. A map data structure helps to store the persons detected in each video frame efficiently. The map stores information in distinct key-value pairs: a unique ID is assigned to every person appearing in any part of the video, and each person detected in a frame is stored with this ID as the key and, as the value, the bounding box coordinates, the frame number in which he/she was detected and the list of previous centers. During the retrieval of a person for target association, only the candidates whose frame number is less than the current frame number are considered, in order
to avoid mismatching. The best candidate for the current
person is selected by choosing the person with the highest
appearance similarity and with a motion similarity greater
than a specific threshold. The best candidate values are
updated to match the current target. If no such match is found
for the target, then the target is newly added into the map.
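An illustrative sketch of this map data structure and the best-candidate selection is given below. The TrackedPerson record and all field names are assumptions; the text only specifies key-value storage from a unique ID to the bounding box, frame number and previous centers.

from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class TrackedPerson:
    bbox: Tuple[int, int, int, int]      # bounding box coordinates
    frame_no: int                        # frame in which last detected
    centers: List[Tuple[float, float]] = field(default_factory=list)

tracks: Dict[int, TrackedPerson] = {}    # unique ID -> person record

def best_candidate(i, ids, S_app, S_motion, frame_no, motion_thresh=0.1):
    # For current detection i, pick the stored ID with the highest
    # appearance similarity whose motion similarity exceeds the
    # threshold; only candidates from earlier frames are considered.
    best_id, best_score = None, 0.0
    for j, pid in enumerate(ids):
        if tracks[pid].frame_no >= frame_no:   # avoid mismatching
            continue
        if S_motion[i][j] > motion_thresh and S_app[i][j] > best_score:
            best_id, best_score = pid, S_app[i][j]
    return best_id   # None => no match found

A detection for which no candidate is returned would then be inserted into tracks under a fresh ID, as described above.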