Page 126 - Kaleidoscope Academic Conference Proceedings 2020


Figure 1 – High-level logical flow of the technique

labels it as appropriate or inappropriate with the aid of a
special-purpose swear-word dictionary. The work carried out
within each step is explained in detail below.

4.1 Extracting textual meaning from non-text components

All messages belonging to a conversation are identified and
extracted, and these messages are then input to the system.
The first step of the process is to identify the text and
non-text components of these messages, where non-text
components include hyperlinks, emoticons, stickers, graphic
content and files. Once the text and non-text components are
identified and separated, the non-text components are processed
and replaced with their textual meaning as follows:

1. When a hyperlink is encountered, its metadata description
is extracted. The hyperlink is then replaced with the
extracted metadata description.

2. A dictionary-based method is employed for handling
emoticons. The dictionary maps each emoticon's Unicode to
its corresponding text meaning, and each emoticon has its
own unique Unicode in the dictionary. Once the emoticon
Unicode is identified, it is replaced by its text meaning
within the message.

At the current stage of the research, files and graphic content
encountered with messages are ignored. After the above non-text
components have been processed and incorporated into the
message, the message contains only text. The text-only message
thus created is passed to the next stage as input.

4.2 Handling misspelled words

Misspelled words enter chat messages both intentionally and
unintentionally. Common types of spelling mistakes occurring
in online chatting are:

1. Unintentional spelling mistakes: The sender not knowing the
correct spelling, and typos (mistakes that occur while
typing).

2. Abbreviations: A shortened form of a word or phrase.
Abbreviations are normally used to save space and time,
to avoid repetition of long words and phrases, or simply
to conform to conventional usage.

3. Short words: The use of phonemes when messaging and
the use of intentionally misspelled words.

When a message is input to Stage 2 of the process, it is
broken into words. Once the list of words is created, each word
is checked to see if it is a valid English word using the
Enchant (pyenchant) Python library. When an invalid word is
found:

1. The word is checked to see if it is an abbreviation listed in
the dictionary. The dictionary contains the abbreviations
that are common in online chatting and their long forms.
If a word is identified as an abbreviation, it is
replaced by its expanded form in the message.

2. If the word is not found in the dictionary, it is treated as a
spelling mistake. A phonological spell checker along with
a disambiguation module is used to correct unintentional
spelling mistakes. It is assumed that an unintentional
spelling mistake does not deviate far from the correct
spelling of the word. The misspelled word is corrected
following these steps: 1) The phoneme sequence of the
misspelled word is found first. 2) Then, for the identified
phoneme sequence, a long short-term memory (LSTM) network
is used to obtain the letter sequence. Once the letter
sequence is obtained, it is checked to see if it is a valid
English word. If it is found to be a valid English word, the
process moves to the next word in the list. Otherwise,
suggestions for the letter sequence need to be considered to
identify the correct word and disambiguate it. In this
research, WordNet synsets [9] were used for letter sequence
suggestion, and semantic similarity with the other words in
the message was used for disambiguation.

Once a message has been preprocessed as explained above, it
will contain no spelling errors. Messages in which all spelling
mistakes have been corrected and all abbreviations expanded
are transferred to the next stage for the grouping of messages.
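
The preprocessing flow of Sections 4.1 and 4.2 can be illustrated with a minimal Python sketch. It is not the implementation used in this research, and several parts are assumptions made only for illustration: the dictionaries EMOTICON_MEANINGS, ABBREVIATIONS and VALID_WORDS are small hypothetical stand-ins, the link_descriptions argument stands in for real hyperlink metadata extraction, a set lookup stands in for the pyenchant validity check, and difflib string similarity stands in for the phonological spell checker, the LSTM and the WordNet-based disambiguation.

```python
import difflib
import re

# Hypothetical stand-in dictionaries; the paper's actual dictionaries are not given.
EMOTICON_MEANINGS = {"\U0001F600": "grinning face", "\U0001F622": "crying face"}
ABBREVIATIONS = {"brb": "be right back", "idk": "i do not know"}
# Stand-in for the pyenchant validity check (assumption: a tiny fixed vocabulary).
VALID_WORDS = {
    "i", "be", "right", "back", "do", "not", "know", "see", "you",
    "tomorrow", "grinning", "crying", "face", "example", "page",
}

URL_PATTERN = re.compile(r"https?://\S+")


def replace_non_text(message, link_descriptions):
    """Section 4.1: replace hyperlinks and emoticons with their textual meaning.

    link_descriptions is a hypothetical mapping from URL to metadata
    description; a real system would fetch this from the linked page.
    """
    def describe(match):
        # Assumption: a link with no known description is dropped.
        return link_descriptions.get(match.group(0), "")

    text = URL_PATTERN.sub(describe, message)
    for emoticon, meaning in EMOTICON_MEANINGS.items():
        text = text.replace(emoticon, f" {meaning} ")
    # Collapse the extra whitespace introduced by the replacements.
    return " ".join(text.split())


def correct_word(word):
    """Section 4.2: validity check, then abbreviation lookup, then correction."""
    lower = word.lower()
    if lower in VALID_WORDS:
        return word
    if lower in ABBREVIATIONS:
        return ABBREVIATIONS[lower]
    # Swapped-in technique: plain string similarity instead of the
    # phoneme-sequence, LSTM and semantic disambiguation steps.
    matches = difflib.get_close_matches(lower, sorted(VALID_WORDS), n=1, cutoff=0.6)
    # A word with no close match is left unchanged.
    return matches[0] if matches else word


def handle_misspellings(message):
    """Break the message into words and correct each invalid word."""
    return " ".join(correct_word(word) for word in message.split())


if __name__ == "__main__":
    raw = "idk see you tomorow \U0001F600 https://example.com/a"
    text_only = replace_non_text(raw, {"https://example.com/a": "example page"})
    print(handle_misspellings(text_only))
```

The sketch splits on whitespace only, so punctuation attached to a word would be treated as part of that word; a fuller version would need a proper tokenizer.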
