being stemmed, the stop words removed, and formatted for classification.

5. EXPERIMENTS AND RESULTS

Several experiments were conducted to verify the performance of the proposed technique. This section presents the details of the experiments and a discussion of the results in the following paragraphs.

Experiment: the performance of the semantic similarity technique proposed in this work was compared against other similarity techniques, such as cosine similarity, to find the best similarity score for grouping messages.

In this experiment, cosine similarity was used as the benchmark for evaluating the performance of the proposed similarity algorithm for chat messages. From the results obtained, it was evident that the proposed algorithm performs much better than cosine similarity. It was also found that cosine similarity performs very poorly when the same set of words is not present in both messages being compared. This is a major drawback, as the same words are rarely repeated in online chatting due to the short nature of the messages. Hence, it is not advisable to employ cosine similarity for extracting the similarity between two chat messages [12]. The similarity results of the tested algorithms are shown in Figure 4.
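To make this drawback concrete, the following minimal sketch computes bag-of-words cosine similarity for two chat messages that share an intent but no vocabulary. The example messages and the whitespace tokenization are invented for illustration and are not taken from the paper's data set:

```python
from collections import Counter
import math

def cosine_similarity(msg1: str, msg2: str) -> float:
    """Bag-of-words cosine similarity between two messages."""
    v1, v2 = Counter(msg1.lower().split()), Counter(msg2.lower().split())
    dot = sum(v1[w] * v2[w] for w in v1)        # Counter returns 0 for absent words
    norm1 = math.sqrt(sum(c * c for c in v1.values()))
    norm2 = math.sqrt(sum(c * c for c in v2.values()))
    return dot / (norm1 * norm2) if norm1 and norm2 else 0.0

# Same intent, disjoint vocabulary: the dot product is zero,
# so the score is 0.0 no matter how close the meanings are.
print(cosine_similarity("wanna grab lunch", "shall we eat now"))  # 0.0
```

Because the two messages share no tokens, the score is exactly zero regardless of their semantic closeness, which is precisely the failure mode described above.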
Figure 4 – Similarity scores for different similarity algorithms

Figure 4 shows the similarity scores generated by the different algorithms when similar messages were processed. The proposed algorithm was compared with cosine similarity and Spacy for the evaluation. From the results shown in Figure 4, it is clear that cosine similarity suffers from a major drawback in processing chat messages; the algorithm proposed in this work, on the other hand, performs much better within the domain of chat messages. The four similarity algorithms that were tried and tested are explained in detail below (an illustrative sketch of the synset-based and Spacy approaches follows the list):

• The algorithm proposed in this work takes the disambiguated synset of each semantic component of each message, computes the similarity against those of the other message, and returns a similarity score.

• Cosine similarity is a straightforward technique that applies the cosine similarity measure to the word vectors found in both messages. The results obtained were not very encouraging, as a person very rarely uses the same words in multiple messages of a chat. Hence, cosine similarity is ineffective for computing the similarity between two chat messages.

• Cosine similarity using disambiguated synsets – unlike the cosine similarity test above, where the full messages are considered, in this case only the disambiguated synsets were considered. The results obtained were even worse than those of cosine similarity applied to the complete messages.

• Phrase similarity from Spacy – this is also a standard technique for computing the similarity of words, phrases and sentences. It works with some accuracy when it comes to finding the similarity between two messages, but its performance is poorer than that of the other algorithms used for comparing messages [13].
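The paper does not spell out the exact synset-similarity computation, so the sketch below is only a plausible reconstruction of the general idea. It assumes NLTK's Lesk algorithm for word-sense disambiguation, Wu-Palmer scores between synset pairs, and a greedy best-match average as the aggregation; none of these specific choices, nor the message texts, are confirmed by the paper. The last lines show the off-the-shelf Spacy baseline for comparison:

```python
# One-time setup: pip install nltk spacy
#   python -m nltk.downloader punkt wordnet omw-1.4
#   python -m spacy download en_core_web_md
from nltk.tokenize import word_tokenize
from nltk.wsd import lesk
import spacy

def disambiguated_synsets(message):
    """Approximate the paper's 'disambiguated synsets': one Lesk-chosen
    WordNet synset per token that has any synset at all."""
    tokens = word_tokenize(message.lower())
    return [s for s in (lesk(tokens, t) for t in tokens) if s is not None]

def synset_similarity(msg1, msg2):
    """Average best-match Wu-Palmer similarity between the two synset
    lists (this aggregation scheme is an assumption, not the paper's)."""
    s1, s2 = disambiguated_synsets(msg1), disambiguated_synsets(msg2)
    if not s1 or not s2:
        return 0.0
    # wup_similarity can return None across parts of speech; treat as 0.
    best = [max((a.wup_similarity(b) or 0.0) for b in s2) for a in s1]
    return sum(best) / len(best)

a, b = "wanna grab lunch", "shall we eat now"
print("synset-based:", synset_similarity(a, b))

# Spacy phrase-similarity baseline (cosine over averaged word vectors).
nlp = spacy.load("en_core_web_md")
print("spacy:", nlp(a).similarity(nlp(b)))
```

Unlike plain cosine similarity, both the synset route and the Spacy vector route can assign a nonzero score to paraphrases with disjoint vocabularies, which is consistent with the behaviour reported in Figure 4.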
5.1 Accuracy of the classifier

The model selected for classification in this work is the SVM classifier, chosen for its advantages in classifying text compared to other techniques. In this work, the data is to be classified into two classes. The accuracy of the module proposed in this work against that of other similar techniques is given in Figure 5 and Figure 6.
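The paper does not specify the SVM's feature representation or kernel, so the following is a minimal two-class sketch assuming scikit-learn's LinearSVC over TF-IDF features; the library choice, features, and toy training data are all illustrative assumptions:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy two-class chat data (invented; the paper's corpus is far larger).
messages = [
    "meeting moved to 3pm tomorrow",
    "please review the attached report",
    "lol that movie was hilarious",
    "wanna grab lunch later",
]
labels = ["work", "work", "casual", "casual"]

# TF-IDF features feeding a linear-kernel SVM, a common setup for
# short-text classification.
clf = make_pipeline(TfidfVectorizer(stop_words="english"), LinearSVC())
clf.fit(messages, labels)

print(clf.predict(["can we reschedule the meeting"]))
# Likely ['work'] on this toy data: "meeting" occurs only in a work example.
```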
Figure 5 – Comparison of accuracy with similar techniques

Figure 5 shows the accuracy of the proposed module against that of other similar techniques. The best accuracy of 92% was achieved by the technique proposed by Dong et al. in [3]. Dong et al. used more than 33,000 sets of data to train their model, and the proposed technique was also able to equal that accuracy with a much smaller set of data. The other observation made during this experiment was that the accuracy of the proposed technique gradually increases with the size of the training data. Figure 6 shows the increase in accuracy of the technique with data size.