Page 128 - Kaleidoscope Academic Conference Proceedings 2020

2020 ITU Kaleidoscope Academic Conference




being stemmed, the stop words removed and formatted for classification.

5.  EXPERIMENTS AND RESULTS

Several experiments were conducted to verify the performance of the proposed technique. This section presents the details of the experiments and discusses the results.

Experiment: The performance of the semantic similarity technique proposed in this work vs. other similarity techniques, such as cosine similarity, in finding the best similarity score for grouping messages.

In this experiment, cosine similarity was used as the benchmark for evaluating the performance of the proposed similarity algorithm for chat messages. The results show that the proposed algorithm performs much better than cosine similarity. Cosine similarity also performs very poorly when the two messages being compared do not share the same set of words. This is a major drawback, because the same words are rarely repeated in online chat owing to the short length of the messages. Hence, it is not advisable to employ cosine similarity for extracting the similarity between two chat messages [12]. The similarity results of the tested algorithms are shown in Figure 4.

Figure 4 – Similarity scores for different similarity algorithms

Figure 4 shows the similarity scores generated by the different algorithms when similar messages were processed. The proposed algorithm was compared with cosine similarity and spaCy for evaluation. The results in Figure 4 show that cosine similarity suffers from a major drawback in processing chat messages, whereas the algorithm proposed in this work performs much better within the domain of chat messages. The four similarity algorithms that were tried and tested are explained in detail below:

•  The algorithm proposed in this work takes the disambiguated synset of each semantic component of each message, computes its similarity against that of the other message, and returns a similarity score.

•  Cosine similarity is a straightforward technique that applies the cosine similarity algorithm to the vectors built from both messages. The results obtained were not very encouraging, as a person would very rarely use the same words in multiple messages in a chat. Hence, cosine similarity is ineffective for computing the similarity between two chat messages.

•  Cosine similarity using disambiguated synsets – unlike the above cosine similarity test, where both full messages are considered, in this case only the disambiguated synsets were considered. The results obtained were even worse than those of cosine similarity applied to complete messages.

•  Phrase similarity from spaCy – This is also a standard similarity computing technique for words, phrases and sentences. This technique works with some accuracy when finding the similarity between two messages, but its performance is poorer than that of the other algorithm used for comparing messages [13].

5.1   Accuracy of the classifier

The model selected for classification in this work is an SVM classifier, chosen for its advantages in classifying text compared with other techniques. In this work, the data are classified into two classes. The accuracy of the module proposed in this work against other similar techniques is given in Figure 5 and Figure 6.

Figure 5 – Comparison of accuracy with similar techniques

Figure 5 shows the accuracy of the proposed module against that of other similar techniques. The best accuracy of 92% was achieved by the technique proposed by Dong et al. in [3]. Dong et al. used more than 33,000 sets of data to train their model, and the proposed technique was able to equal that accuracy with a much smaller set of data. The other observation made during this experiment was that the accuracy of the proposed technique gradually increases with the size of the training data. Figure 6 shows the increase in accuracy of the technique with data size.
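
Returning to the similarity experiment, the weakness of cosine similarity on short chat messages can be illustrated with a minimal sketch. The code below is not the implementation used in this work. It assumes plain word-count vectors and whitespace tokenisation, since the exact vectorisation is not specified here, and the example messages and function names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(message: str) -> Counter:
    """Lower-case the message and count whitespace-separated tokens (assumed tokenisation)."""
    return Counter(message.lower().split())


def cosine_similarity(a: str, b: str) -> float:
    """Cosine of the angle between the word-count vectors of two messages."""
    va, vb = bag_of_words(a), bag_of_words(b)
    # Only words present in both messages contribute to the dot product.
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty message has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical messages with the same intent but no words in common.
paraphrase_score = cosine_similarity("when are you coming home", "what time will u arrive")

# Hypothetical messages that happen to reuse three of their four words.
overlap_score = cosine_similarity("are you coming home", "are you coming tonight")

print(paraphrase_score, overlap_score)
```

Under these assumptions the paraphrased pair scores zero because the two messages share no tokens, while the pair with overlapping wording scores high. This matches the drawback described above: the score tracks shared vocabulary, not shared meaning.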





