Page 125 - Kaleidoscope Academic Conference Proceedings 2020
Industry-driven digital transformation
within a single conversation and extracting their intentions independently.

3.3 Usage of non-text components

Non-text components such as hyperlinks, stickers, emoticons, files and graphic content (images, videos, GIFs) included in chat messages also help convey the intention of the sender and add meaning to the conversation. Therefore, it is necessary to identify the meanings conveyed by these non-text components in order to understand the intention as a whole. The authors propose multiple preprocessing steps to identify non-text components, extract their textual meaning, and incorporate it into the messages. Methods used for preprocessing some of the non-text components are:

- Hyperlinks: The metadata description of the hyperlink is extracted and used to replace the hyperlink.

- Emoticons: An emoticon dictionary is used to obtain the textual meaning of emoticons. The Unicode code point of the emoticon is identified and looked up in the dictionary, and the emoticon is then replaced with the meaning found there.

3.4 Complexity of language used

Language complexity is present in almost all types of text, but only a few distinctive usages contribute to it in chat messages. To perform the intended task effectively, it is important to identify these issues and neutralize or minimize their effect. Some of the complexities associated with chat message processing include:

- Use of multiple languages: When communicating via messages, people tend to use many of the languages they are conversant with. Hence, it is necessary to identify the different languages used in a message and translate them into one common language for further processing. In this research, only messages written in English are considered. Handling the complexities arising from the use of multiple languages is left for future research.

- Use of abbreviations or short forms for phrases: Chat messages are generally limited by character count and intended to be very short. One way people overcome this limitation is to shorten commonly used words and use abbreviations. To handle this complexity, the long forms of shortened phrases are extracted with the aid of a dictionary and used to replace the shortened phrases within text messages.

- Misspelled words and words with numbers: Sometimes numbers are used along with characters to shorten words. For example, forever is commonly spelt as 4ever. This kind of spelling makes it challenging to identify the actual word being transmitted. To handle such words, it is suggested to use a spell checker that is capable of checking spelling phonetically as well as semantically.

3.5 Other ambiguities found in the usage of languages

People commonly use synonyms, acronyms, polysemes, idioms, etc. when preparing their short messages. These terms are briefly explained below:

- Synonyms: Terms with similar meaning that can be used interchangeably.

- Acronyms: Terms formed from the first letters of a set of words, similar to abbreviations.

- Polysemes: Terms that have more than one interpretation or meaning.

- Idioms: Groups of words established by usage as having a meaning not deducible from those of the individual words.

To handle the above issue, the best word (synset) is used: the synonyms were disambiguated and replaced by the disambiguated synset [3].

4. PROPOSED TECHNIQUE

To provide a feasible solution to the issue discussed above, an efficient technique was developed and implemented that takes the text messages of a conversation as input and outputs the intention, or the appropriateness of the intention, of the sender. In this work, only sexually oriented intentions incorporated within messages were considered, that is, whether the intention of the communicator is sexually appropriate (good intention) or sexually fraudulent (inappropriate or bad intention). The messages transmitted by the sender of each conversation are analyzed to determine whether any sender has a sexually fraudulent intention. For classification, a machine learning method is proposed in which a support vector machine (SVM) classifier is trained to check the appropriateness of the intention. A major portion of the work relates to enriching the content given as input to the classification model: chat messages are grouped based on semantic similarity to form longer messages containing enhanced content. Figure 1 shows the high-level logical flow of the proposed technique.

From Figure 1, it can be seen that as soon as the text messages are supplied to the technique as input, it extracts chat features, converts them into text with the aid of the appropriate dictionary, and incorporates them into the messages. At the next stage, the messages are preprocessed by tokenizing them and replacing the abbreviations using an abbreviation dictionary. Then the message grouper assigns the messages to different groups based on semantic similarity. Finally, the classifier extracts the intention of the sender and
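
As a rough illustration of the dictionary-based preprocessing described in Sections 3.3 and 3.4 (emoticon replacement, abbreviation expansion, and normalization of words written with numbers), a minimal Python sketch is given here. It is not the authors' implementation: the three lookup tables, their entries, and the function names are hypothetical, and a plain lookup table stands in for the phonetic and semantic spell checker that the text suggests for words such as 4ever.

```python
import re

# Hypothetical lookup tables (assumptions for illustration only);
# the dictionaries actually used in the paper are not reproduced here.
EMOTICONS = {
    "\U0001F600": "grinning face",
    "\U0001F622": "crying face",
}
ABBREVIATIONS = {
    "brb": "be right back",
    "idk": "i do not know",
    "u": "you",
}
# Stand-in for a phonetic/semantic spell checker: a fixed table of
# number-containing spellings and their intended words.
NUMBER_WORDS = {
    "4ever": "forever",
    "gr8": "great",
    "2day": "today",
}

# A token is either a run of word characters or a single punctuation mark.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def replace_emoticons(text):
    """Replace each known emoticon with its textual meaning."""
    for symbol, meaning in EMOTICONS.items():
        text = text.replace(symbol, f" {meaning} ")
    return text


def preprocess(message):
    """Convert emoticons to text, tokenize, and expand short forms."""
    text = replace_emoticons(message).lower()
    expanded = []
    for token in TOKEN_RE.findall(text):
        # Normalize number-containing spellings first, then expand
        # abbreviations; an expansion may yield several tokens.
        token = NUMBER_WORDS.get(token, token)
        expanded.extend(ABBREVIATIONS.get(token, token).split())
    return expanded


if __name__ == "__main__":
    print(preprocess("idk, luv u 4ever \U0001F600"))
```

The order of the steps mirrors the flow described in the text: non-text features are converted to text first, and tokenization and abbreviation replacement follow. Words absent from the tables (such as "luv" in the sample message) pass through unchanged, which is one reason a real spell checker would be preferable to a fixed table.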