Page 126 - Kaleidoscope Academic Conference Proceedings 2020

2020 ITU Kaleidoscope Academic Conference




Figure 1 – High-level logical flow of the technique

labels it as appropriate or inappropriate with the aid of a special-purpose
swear-word dictionary. The work carried out within each step is explained in
detail below.

4.1   Extracting textual meaning from non-text components

All messages belonging to a conversation are identified and extracted, and
these messages are then input to the system. The first step of the process is
to identify the text and non-text components of these messages, such as
hyperlinks, emoticons, stickers, graphic content and files. Once the text and
non-text components are identified and separated, the non-text components are
processed and replaced with their textual meaning as follows:

1.  When a hyperlink is encountered, its metadata description is extracted.
    The hyperlink is then replaced with the extracted metadata description.

2.  A dictionary-based method is employed for handling emoticons. The
    dictionary maps each emoticon Unicode to its corresponding text meaning,
    and each emoticon has its own unique Unicode in the dictionary. Once the
    emoticon Unicode is identified, it is replaced by its text meaning within
    the message.

At the current stage of the research, files and graphic content encountered
in messages are simply ignored. After the above non-text components have been
processed and incorporated into the message, the message will contain only
text. The text-only message thus created is passed to the next stage as input.

4.2   Handling misspelled words

Misspelled words are introduced into chat messages intentionally as well as
unintentionally. Common types of spelling mistakes occurring in online
chatting are:

1.  Unintentional spelling mistakes: The sender not knowing the correct
    spelling, and typos (mistakes that occur while typing).

2.  Abbreviations: A shortened form of a word or phrase. Abbreviations are
    normally used to save space and time, to avoid repetition of long words
    and phrases, or simply to conform to conventional usage.

3.  Short words: The use of phonemes when messaging and the use of
    intentionally misspelled words.

When a message is input to Stage 2 of the process, it is broken into words.
Once a list of words is created, each word is checked to see if it is a valid
English word using the Enchant (pyenchant) Python library. When an invalid
word is found:

1.  The word is checked to see if it is an abbreviation from the dictionary.
    The dictionary contains the abbreviations that are common in online
    chatting and their long forms. If a word is identified as an
    abbreviation, it is replaced by its expanded form in the message.

2.  If the word is not found in the dictionary, it is treated as a spelling
    mistake. A phonological spell checker along with a disambiguation module
    is used to correct unintentional spelling mistakes. It is assumed that an
    unintentional spelling mistake does not deviate far from the correct
    spelling of the word. The misspelled word is corrected following these
    steps: 1) The phoneme sequence of the misspelled word is found first.
    2) Then, for the identified phoneme sequence, a long short-term memory
    (LSTM) network is used to obtain the letter sequence. Once the letter
    sequence is obtained, it is checked to see if it is a valid English word.
    If it is, the process moves to the next word in the list. Otherwise,
    suggestions for the letter sequence need to be considered to identify the
    correct word and disambiguate it. In this research, the WordNet Synset [9]
    and semantic similarity with the other words in the message were used for
    letter sequence suggestion and disambiguation respectively.

Once a message has been preprocessed as explained above, it will contain no
spelling errors. Once the messages have been corrected of all spelling
mistakes and all abbreviations have been expanded, they are transferred to
the next stage for the grouping of messages.
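
As an illustration of the replacement of non-text components described in
Section 4.1, a minimal Python sketch is given here. It is not the authors'
implementation: the dictionary entries, the URL pattern, and the names
EMOTICON_MEANINGS, replace_non_text and describe_link are all assumptions made
for this example. The metadata lookup is left to a caller-supplied function so
that the sketch does not depend on any particular web library.

```python
import re

# Hypothetical emoticon dictionary: Unicode character -> text meaning.
# The entries are illustrative; the paper does not list its dictionary.
EMOTICON_MEANINGS = {
    "\U0001F600": "grinning face",
    "\U0001F622": "crying face",
}

# Simplified pattern for hyperlinks (an assumption of this sketch).
URL_PATTERN = re.compile(r"https?://\S+")


def replace_non_text(message, describe_link):
    """Return a text-only version of `message`.

    `describe_link` is a caller-supplied function (hypothetical) that maps
    a URL string to its metadata description.
    """
    # Step 1: replace each hyperlink with its metadata description.
    text = URL_PATTERN.sub(lambda m: describe_link(m.group(0)), message)
    # Step 2: replace each known emoticon with its text meaning.
    for symbol, meaning in EMOTICON_MEANINGS.items():
        text = text.replace(symbol, " " + meaning + " ")
    # Collapse the extra whitespace introduced by the substitutions.
    return " ".join(text.split())
```

Files and graphic content are not handled here, in line with the statement
that they are ignored at the current stage of the research.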

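The word-by-word flow of Section 4.2 (validity check, abbreviation expansion,
then spelling correction) might be sketched as follows. This is a simplified
illustration under stated assumptions, not the authors' code: the
abbreviation entries and the names ABBREVIATIONS, handle_misspellings,
is_valid_word and correct_spelling are hypothetical, and the phonological
spell checker with its LSTM and disambiguation module is represented only by
a caller-supplied function.

```python
# Hypothetical abbreviation dictionary: chat abbreviation -> long form.
ABBREVIATIONS = {
    "brb": "be right back",
    "idk": "I do not know",
    "u": "you",
}


def handle_misspellings(message, is_valid_word, correct_spelling):
    """Expand abbreviations and correct misspelled words in `message`.

    `is_valid_word(word) -> bool` plays the role of the English-word check;
    with pyenchant installed, `enchant.Dict("en_US").check` could be passed.
    `correct_spelling(word, context) -> str` stands in for the phonological
    spell checker and disambiguation module, which are not implemented here.
    """
    # Whitespace splitting is a simplification; punctuation is not handled.
    words = message.split()
    result = []
    for word in words:
        if is_valid_word(word):
            # Valid English word: keep it unchanged.
            result.append(word)
        elif word.lower() in ABBREVIATIONS:
            # Known abbreviation: replace it with its long form.
            result.append(ABBREVIATIONS[word.lower()])
        else:
            # Otherwise treat it as a spelling mistake; the remaining words
            # are passed along as context for disambiguation.
            context = [w for w in words if w != word]
            result.append(correct_spelling(word, context))
    return " ".join(result)
```

Passing the checker and the corrector as arguments keeps the control flow
visible while leaving the choice of dictionary and correction model open.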







