Page 127 - Kaleidoscope Academic Conference Proceedings 2020

Industry-driven digital transformation




4.3   Grouping semantically related messages

This is one of the most important steps in the overall process, because amalgamating similar messages compensates for the shortness of individual instant messages. Messages with similar semantics are grouped together, which makes each group semantically richer than its individual members. The semantically rich message groups are then forwarded to the classifier stage for classification and prediction. Figure 2 shows the logical flow of the steps within the grouping stage.




Figure 3 – Similarity algorithm

The algorithm in Figure 3 takes the synset lists of the two messages to be compared. Each synset from the first message is taken in turn, and its similarity to each of the synsets in the second message is calculated. This process is repeated until all the synsets in the first message have been processed. The accumulated similarity score is then divided by the product of the numbers of synsets in the two messages and multiplied by four. The algorithm uses the path similarity of the WordNet interface, because it allows the similarity between words of different classes to be found [9].
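
The scoring rule described above can be sketched as follows. This is a minimal illustration rather than the algorithm of Figure 3 itself: the function name `message_similarity`, the pluggable `pair_similarity` argument, the toy scores, and the handling of empty lists and missing scores are all assumptions made for the example. With NLTK as the WordNet interface, `pair_similarity` could be `lambda a, b: a.path_similarity(b)`, which may return `None` when no path connects the two synsets.

```python
from typing import Any, Callable, Optional, Sequence

# Hypothetical signature for a synset-to-synset similarity function.
# It may return None when no similarity can be computed for a pair.
PairSimilarity = Callable[[Any, Any], Optional[float]]


def message_similarity(
    synsets_a: Sequence[Any],
    synsets_b: Sequence[Any],
    pair_similarity: PairSimilarity,
) -> float:
    """Sum all pairwise similarities, divide by |A| * |B|, multiply by 4."""
    # Assumption: an empty synset list gives a score of 0.0
    # (the text does not say how this case is handled).
    if not synsets_a or not synsets_b:
        return 0.0

    total = 0.0
    for a in synsets_a:          # each synset of the first message
        for b in synsets_b:      # against each synset of the second message
            score = pair_similarity(a, b)
            # Assumption: pairs without a score contribute nothing.
            if score is not None:
                total += score

    return 4.0 * total / (len(synsets_a) * len(synsets_b))


# Toy stand-in for WordNet path similarity; the value below is made up
# for illustration and is not a real WordNet score.
_TOY_SCORES = {frozenset(("dog.n.01", "cat.n.01")): 0.25}


def toy_similarity(a: str, b: str) -> Optional[float]:
    if a == b:
        return 1.0
    return _TOY_SCORES.get(frozenset((a, b)))


print(message_similarity(
    ["dog.n.01", "walk.v.01"],
    ["dog.n.01", "cat.n.01"],
    toy_similarity,
))  # 1.25
```

Passing the pairwise measure as an argument keeps the sketch independent of any particular WordNet library, so a different similarity measure can be substituted without changing the scoring loop.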

Figure 2 – Logic of the message grouper

In this stage, the semantically rich components of each message are first identified and disambiguated to find the best matching synset. The messages are then grouped according to the novel algorithm proposed in this work. The semantically rich words contained in each message are used to represent the whole message. Nouns, verbs and adjectives are considered semantically rich words. Hence, each message is replaced with a list of the meaningful nouns, verbs and adjectives it contains, in order of appearance. Each of these nouns, verbs and adjectives is then disambiguated to identify the best sense of the word.

After disambiguation, the best-suited synset is known for each word, so a synset list is available that represents the word list, which in turn represents the message. The synset lists representing the messages are then compared to identify similar messages and group them together. This grouping is essential for enriching the input to the classification step. Cosine similarity is used to identify the similarity between two different messages or synset lists. Figure 3 lists the algorithm used to group messages in this work.

4.4   Identifying the intention of the sender

The final step after grouping the messages is classification. The grouped messages from the previous stage are provided to the classification stage as input. In this stage, a machine learning method is proposed to classify messages as either appropriate and child safe or inappropriate for young children. An SVM classifier is trained to classify the grouped messages based on the intention of the users. In this research, only the intention to be sexually fraudulent or not is considered for classification. Hence, only two classes of messages, appropriate and inappropriate, are produced at the end of this stage.

This is a straightforward technique. An SVM classifier is trained to classify the intention of the message sender using a dataset collected from the Facebook messages of a few users and from sex chatting websites. To train the classification model, the data was first labeled and stemmed, and the stop words were removed. Term Frequency – Inverse Document Frequency (TFIDF) vectors with quadra-grams were then selected as the features. From these, the best 5000 features (or words) were identified to represent each data item in the dataset, and the classification model was trained on them [10, 11].

For a given grouped message, all the messages in the group are checked for swear words using a swear-word dictionary. If a swear word is found in any message, the entire group is marked as inappropriate without sending it to the classifier. If no swear word is found in any of the messages in a group, it is then forwarded to the classifier after




