Page 125 - Kaleidoscope Academic Conference Proceedings 2020

Industry-driven digital transformation




within a single conversation and extracting their intentions
independently.

3.3   Usage of non-text components

Non-text components such as hyperlinks, stickers, emoticons, files and
graphic content (images, videos, GIFs) included in chat messages also
contribute towards conveying the intention of the sender, adding meaning
to the conversation. Therefore, it is necessary to identify the meanings
conveyed by these non-text components in order to understand the
intention as a whole. The authors propose multiple preprocessing steps to
identify non-text components and to extract and incorporate their textual
meaning into the messages. Methods used for preprocessing some of the
non-text components are:

-  Hyperlinks – Extract the metadata description of the hyperlink and
   replace the hyperlink with that description.

-  Emoticons – An emoticon dictionary is used to obtain the textual
   meaning of emoticons. The Unicode value of the emoticon is identified
   to look up its meaning, and the emoticon is then replaced with the
   meaning found in the dictionary.

3.4   Complexity of language used

Language complexity is a part of almost all types of text. However, only
a few unique usages contribute towards the language complexity in chat
messages. In order to perform the intended task effectively, it is
important to identify these issues and neutralize or minimize their
effect. Some such complexities associated with chat message processing
include:

-  Use of multiple languages: When communicating via messages, people
   tend to use many of the languages they are conversant with. Hence, it
   is necessary to identify the different languages used in the message
   and translate them into one common language for further processing. In
   this research, only messages written in the English language are
   considered. Thus, handling complexities arising from the use of
   multiple languages is left for future research.

-  Use of abbreviations or short forms for phrases: Chat messages are
   generally limited by character count and intended to be very short.
   One way people overcome this limitation is to shorten commonly used
   words and use abbreviations. In order to handle this complexity, the
   long forms of shortened phrases are extracted with the aid of a
   dictionary and used to replace the shortened phrases within text
   messages.

-  Misspelled words and words with numbers: Sometimes numbers are used
   along with characters to shorten words. For example, forever is
   commonly spelt as 4ever. This kind of spelling makes it challenging to
   identify the actual word being transmitted. For handling these kinds
   of words, it is suggested to use a spell checker that is capable of
   checking spelling phonetically as well as semantically.

3.5   Other ambiguities found in the usage of languages

People commonly use synonyms, acronyms, polysemes, idioms, etc. when
preparing their short messages. These terms are briefly explained below:

-  Synonyms: Terms with similar meaning that can be used interchangeably.

-  Acronyms: Terms made up of the first letters of a set of words,
   similar to abbreviations.

-  Polysemes: Terms that have more than one interpretation or meaning.

-  Idioms: Groups of words established by usage as having a meaning not
   deducible from those of the individual words.

In order to handle the above issues, the best-matching word (synset) is
used. The synonyms were disambiguated and replaced by the disambiguated
synset [3].

4.  PROPOSED TECHNIQUE

In order to provide a feasible solution for the issue discussed above, an
efficient technique is developed and implemented that takes the text
messages in conversations as input and outputs the intention, or the
appropriateness of the intention, of the sender’s conversation. In this
work, only the sexually oriented intentions incorporated within messages
were considered. That is, whether the intention of the communicator is
sexually appropriate (good intention) or sexually fraudulent
(inappropriate or bad intention). The messages transmitted by the sender
of each conversation are analyzed to see whether any sender has a
sexually fraudulent intention. For classification, a machine learning
method is proposed in which a support vector machine (SVM) classifier is
trained to check the appropriateness of the intention. A major portion of
the work is related to enriching the content given as input to the
classification model. That is, chat messages are grouped based on
semantic similarity to form longer messages containing enhanced content.
Figure 1 shows the high-level logical flow of the proposed technique.

From Figure 1, it can be seen that as soon as the text messages are
supplied to the technique as input, it extracts chat features and
converts them into text with the aid of the appropriate dictionary and
then incorporates them into the messages. At the next stage, the messages
are preprocessed by tokenizing the messages and replacing the
abbreviations using an abbreviation dictionary. Then the message grouper
assigns the messages to different groups based on semantic similarity.
Finally, the classifier extracts the intention of the sender and
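
As an illustration only, the dictionary-based preprocessing described in Sections 3.3 and 3.4 (emoticon replacement, tokenizing, abbreviation expansion and number-word normalization) might be sketched in Python as follows. The three lookup tables, their entries and the function name `preprocess` are hypothetical and are not taken from this work. In particular, the small number-word table is only a stand-in for the phonetic and semantic spell checker suggested in Section 3.4.

```python
import re

# Hypothetical, deliberately tiny dictionaries (assumptions for illustration).
EMOTICONS = {"\U0001F600": "grinning face", "\U0001F622": "crying face"}
ABBREVIATIONS = {"brb": "be right back", "idk": "i do not know", "u": "you"}
# Stand-in for a phonetic/semantic spell checker, which is not implemented here.
NUMBER_WORDS = {"4ever": "forever", "gr8": "great", "2day": "today"}

# A token is either a run of word characters or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def preprocess(message):
    """Return a list of normalized tokens for one chat message."""
    # Step 1: replace each known emoticon with its textual meaning.
    for symbol, meaning in EMOTICONS.items():
        message = message.replace(symbol, f" {meaning} ")

    # Step 2: tokenize the lower-cased message.
    tokens = TOKEN_RE.findall(message.lower())

    # Step 3: normalize number-words, then expand abbreviations.
    normalized = []
    for token in tokens:
        token = NUMBER_WORDS.get(token, token)
        normalized.extend(ABBREVIATIONS.get(token, token).split())
    return normalized


if __name__ == "__main__":
    print(preprocess("brb, luv u 4ever \U0001F600"))
```

Emoticons are replaced before tokenizing so that their textual meaning passes through the same tokenizer as the rest of the message, which mirrors the order of stages described for Figure 1.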




