Page 127 - Kaleidoscope Academic Conference Proceedings 2020

Industry-driven digital transformation

4.3 Grouping semantically related messages

This is one of the most important steps in the overall process,
as amalgamating similar messages overcomes the brevity of
individual instant messages. Messages with similar semantics
are grouped together, making them semantically richer. The
semantically rich messages are then forwarded to the
classifier stage for classification and prediction. Figure 2
shows the logical flow of steps within the grouping stage.

Figure 3 – Similarity algorithm

The above algorithm takes in the synset lists of the two
messages that are to be compared. Each synset from the first
message is taken in turn, and the similarity between that
synset and each of the synsets in the second message is
calculated. This process is repeated until all the synsets in
the first message have been processed. The accumulated
similarity score is then divided by the product of the number
of synsets in each message and multiplied by four. This
similarity algorithm uses the path similarity measure of the
WordNet interface, as it allows the similarity between words
of different classes to be found [9].
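
The scoring procedure described above can be sketched in Python
as follows. This is a minimal illustration under stated
assumptions, not the authors' implementation: the names
`message_similarity`, `pair_similarity` and `toy_similarity` are
hypothetical, pairs with no defined similarity are assumed to
contribute zero, and an empty synset list is assumed to give a
score of zero. The source specifies neither of these two cases.

```python
from typing import Callable, Optional, Sequence, TypeVar

S = TypeVar("S")  # stands in for a WordNet synset


def message_similarity(
    synsets_a: Sequence[S],
    synsets_b: Sequence[S],
    pair_similarity: Callable[[S, S], Optional[float]],
) -> float:
    """Sum pairwise similarities, divide by |A| * |B|, multiply by four."""
    if not synsets_a or not synsets_b:
        return 0.0  # assumption: nothing to compare means no similarity

    total = 0.0
    for a in synsets_a:          # each synset of the first message
        for b in synsets_b:      # against each synset of the second
            score = pair_similarity(a, b)
            if score is not None:  # assumption: undefined pairs add nothing
                total += score

    return 4.0 * total / (len(synsets_a) * len(synsets_b))


# Toy stand-in for path similarity, used only to make the sketch runnable:
# 1.0 for identical items, 0.25 for anything else.
def toy_similarity(a: str, b: str) -> float:
    return 1.0 if a == b else 0.25


if __name__ == "__main__":
    print(message_similarity(["dog", "park"], ["dog", "walk"], toy_similarity))
```

If the WordNet interface in use is the one shipped with NLTK
(an assumption), `pair_similarity` could be supplied as
`lambda a, b: a.path_similarity(b)`, which may return `None`
for some synset pairs and is handled by the check above.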

Figure 2 – Logic of the message grouper

In this stage, the semantically rich components of each message
are first identified and disambiguated to find the best
matching synset. The messages are then grouped according to
the novel algorithm proposed in this work. The semantically
rich words contained in each message are used to represent the
whole message. Nouns, verbs and adjectives are considered
semantically rich words. Hence, each message is replaced with
a list of the meaningful nouns, verbs and adjectives it
contains, in order of appearance. Each of these nouns, verbs
and adjectives is then disambiguated to identify the best
sense of the selected word.

After disambiguation has found the best suited synset for each
word, a synset list representing the word list of each message
is available. The synset lists are then compared to identify
similar messages and group them together. This grouping is
essential for enriching the input to the classification step.
For identifying the similarity between two different messages
or synset lists, cosine similarity is used. Figure 3 lists the
algorithm used to group messages in this work.

4.4 Identifying the intention of the sender

The final step after grouping the messages is classification.
The grouped messages from the previous stage are provided to
the classification stage as input. In this stage, a method
based on machine learning is proposed to classify messages as
either appropriate and child safe or inappropriate for young
children. An SVM classifier is trained to classify the grouped
messages based on the intention of the users. In this
research, only the intention to be sexually fraudulent or not
is considered for classification. Hence, only two classes of
messages, appropriate and inappropriate, are produced at the
end of this stage.

This is a straightforward technique. An SVM classifier is
trained to classify the intention of the message sender based
on a dataset collected from the Facebook messages of a few
users and from sex chatting websites. During the training of
the classification model, the data was first labeled and
stemmed, and the stop words were removed. Then, as the
features, a Term Frequency – Inverse Document Frequency
(TFIDF) vector with a quadra-gram was selected. From that, the
best 5000 features (or words) were identified to represent
each data item in the dataset, and the classification model
was trained on them [10, 11].

For a given grouped message, all the messages in the group are
checked for swear words using a swear-word dictionary. If any
swear word is found in any message, the entire group is marked
as inappropriate without sending it to the classifier. If no
swear word is found in any of the messages in a group, it is
then forwarded to the classifier after
