have been used to extract the person's image from the frame. The previously stored images of persons are then compared with the extracted image. The feature extraction layers of this CNN are shared between the two images, making the network vertically symmetrical. The resulting features of both images are merged by computing their element-wise squared difference. The merged vector is then fed into fully-connected layers for dimensionality reduction and, finally, to obtain the similarity score.

Figure 2 - Image similarity using Siamese CNN (two symmetrical branches of convolutional and max-pooling layers, followed by feature concatenation, dense layers, and the similarity score)
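As an illustrative sketch, the shared-branch design of Figure 2 could be expressed in PyTorch as follows; the layer counts, channel widths, and 128-dimensional embedding are assumed values rather than the authors' configuration, and the merge follows the element-wise squared difference described above.

import torch
import torch.nn as nn

class SiameseSimilarity(nn.Module):
    # One convolutional branch applied to both images, so the weights
    # (and therefore the extracted features) are shared.
    def __init__(self):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Flatten(),
            nn.LazyLinear(128),              # 128-d embedding (assumed size)
        )
        self.head = nn.Sequential(           # dense layers -> similarity score
            nn.Linear(128, 32), nn.ReLU(),
            nn.Linear(32, 1), nn.Sigmoid(),
        )

    def forward(self, img_a, img_b):
        f_a = self.branch(img_a)             # features of the current image
        f_b = self.branch(img_b)             # features of the stored image
        merged = (f_a - f_b) ** 2            # element-wise squared difference
        return self.head(merged)             # similarity score in [0, 1]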
A custom threshold is employed: scores greater than the threshold value are considered similar, and scores below it are considered dissimilar (Algorithm 1).
Algorithm 1: Image similarity
-------------------------------------
Input:  Pprev – Previously detected persons
        Pcurr – Currently detected persons
Output: S – appearance similarity matrix
Model_app = Trained Siamese CNN model
M = count(Pcurr)
N = count(Pprev)
S = Null matrix of dimensions M*N
for i = {0, 1, ..., M-1}
    img_curr = Pcurr[i].img
    for j = {0, 1, ..., N-1}
        img_prev = Pprev[j].img
        input = (img_curr, img_prev)
        score = Model_app(input)
        if score > threshold:
            S[i][j] = score
        end if
    end for
end for
return S
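A direct Python rendering of Algorithm 1 might look like the following; model_app stands for the trained Siamese model, and the person objects and threshold value are placeholders.

import numpy as np

def appearance_similarity(p_prev, p_curr, model_app, threshold):
    # Builds the M*N appearance similarity matrix of Algorithm 1.
    M, N = len(p_curr), len(p_prev)
    S = np.zeros((M, N))
    for i in range(M):
        for j in range(N):
            score = model_app(p_curr[i].img, p_prev[j].img)
            if score > threshold:        # keep only confident matches
                S[i, j] = score
    return S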
3.5   Motion Similarity

Motion prediction has been used in the proposed system in order to associate objects based on their recent movements. It has been implemented with an LSTM operating on the previous 12 center coordinates of each stored person. This sequence is fed as input to the Motion LSTM model, which performs temporal processing and predicts the new center for each stored person; this center indicates the person's next possible position. The predicted centers are compared with the centers of the currently detected persons via Euclidean distance, and the inverse of this distance is taken as the overall motion score (Algorithm 2).
Algorithm 2: Motion similarity
-------------------------------------
Input:  Pprev – Previously detected persons
        Pcurr – Currently detected persons
Output: S – motion similarity matrix
Model_motion = Trained motion LSTM model
M = count(Pcurr)
N = count(Pprev)
S = Null matrix of dimensions M*N
for i = {0, 1, ..., M-1}
    center_curr = Pcurr[i].center
    for j = {0, 1, ..., N-1}
        center_seq_prev = Pprev[j].center[{1, 2, ..., 12}]
        center_pred = Model_motion(center_seq_prev)
        dist = Euclid_Dist(center_curr, center_pred)
        score = 1/dist
        if score > threshold:
            S[i][j] = score
        end if
    end for
end for
return S
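One possible realization of the motion branch and the scoring step of Algorithm 2 is sketched below; the hidden size of 64 and the small epsilon guarding against division by zero are assumptions not stated in the paper.

import torch
import torch.nn as nn

class MotionLSTM(nn.Module):
    # Predicts the next center from the previous 12 center coordinates.
    def __init__(self, hidden_size=64):      # hidden size is an assumed value
        super().__init__()
        self.lstm = nn.LSTM(input_size=2, hidden_size=hidden_size,
                            batch_first=True)
        self.fc = nn.Linear(hidden_size, 2)  # last hidden state -> (x, y)

    def forward(self, center_seq):           # center_seq: (batch, 12, 2)
        out, _ = self.lstm(center_seq)
        return self.fc(out[:, -1, :])        # predicted next center

def motion_score(center_curr, center_pred, eps=1e-6):
    # Inverse Euclidean distance, as in Algorithm 2; eps is an added
    # guard against division by zero.
    dist = torch.linalg.norm(center_curr - center_pred)
    return 1.0 / (dist + eps)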
3.6   Object Mapping

The mapping of previous to current persons is achieved by finding the best candidate for each currently detected person from the appearance and motion similarity matrices. A map data structure helps store the detected persons in each video frame efficiently. The map stores information in distinct key-value pairs. A unique ID is assigned to every person appearing at any point in the video. Each person detected in a frame is stored in the map with the ID as the key and, as the value, the bounding box coordinates, the frame number in which he/she was detected, and the list of previous centers. When retrieving persons for target association, only candidates whose frame number is less than the current frame number are considered, in order to avoid mismatching. The best candidate for the current person is the stored person with the highest appearance similarity whose motion similarity also exceeds a specific threshold. The best candidate's values are updated to match the current target. If no such match is found for the target, then the target is newly added into the map
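A minimal sketch of the map data structure and the association rule described above, assuming Python dictionaries; the field names, the 12-entry center history, and the associate helper are illustrative rather than taken from the paper.

from collections import deque
from dataclasses import dataclass, field

@dataclass
class TrackedPerson:
    bbox: tuple                  # bounding box coordinates
    frame_num: int               # frame in which the person was last detected
    centers: deque = field(default_factory=lambda: deque(maxlen=12))

tracks = {}                      # unique ID -> TrackedPerson

def associate(i, candidate_ids, S_app, S_motion, motion_threshold):
    # Pick the stored person with the highest appearance similarity whose
    # motion similarity also clears the threshold; candidate_ids contains
    # only persons detected in earlier frames, to avoid mismatching.
    best_id, best_app = None, 0.0
    for j, pid in enumerate(candidate_ids):
        if S_motion[i][j] > motion_threshold and S_app[i][j] > best_app:
            best_id, best_app = pid, S_app[i][j]
    return best_id               # None -> store the target under a new ID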


