Page 123 - Kaleidoscope Academic Conference Proceedings 2024
Innovation and Digital Transformation for a Sustainable World
4.1 Dataset Generation

A dataset is a collection of ordered data or information, gathered through study, analysis, observation, and measurement. The data can be generated from various e-dictionaries. We collect English, Hindi, and Gujarati words, along with social media comments and posts written in code-mixed language. In the dataset, each row represents a record and each column corresponds to a field (variable); the dataset is therefore structured. For this research we gather posts and comments from social media platforms such as Twitter and Facebook and use them in the subsequent steps. This step yields two datasets: 1) positive and negative words, and 2) a list of sentences in code-mixed language.

4.2 POS Text Preparation

Long paragraphs, sometimes referred to as chunks of text, are broken up into tokens during tokenization; these tokens are essentially sentences, which can be further divided into individual words. Take the statement "कल शाम को बाजार में I met to my teacher" ("Yesterday evening in the market, I met my teacher") as an illustration; tokenization separates it into: {"कल", "शाम", "को", "बाजार", "में", "I", "met", "to", "my", "teacher"}.

A basic goal in Natural Language Processing (NLP) is to assign a grammatical category (noun, verb, adjective, adverb, etc.) to every word in a document. This enables computers to analyze and interpret human language more accurately by improving their grasp of phrase structure and semantics. Part-of-speech (POS) tagging refers to the act of labeling each word in a text with a particular part of speech (adverb, adjective, verb, etc.) or grammatical category. POS tagging is helpful in NLP applications such as data extraction, machine translation, and named entity recognition, among other things.

Stop word removal is one of the preprocessing methods most commonly used in NLP applications. The idea is straightforward: remove words that occur frequently in every document in the corpus. Stop words are typically articles and pronouns. These terms are not highly discriminative, so they contribute little to NLP tasks such as classification and information retrieval. When indexing and retrieving entries for a search query, search engines are programmed to discard terms designated as stop words; examples include "the," "a," "an," and "in."

In this work, tokenization and POS tagging are used for text processing, and stop word removal is accomplished using the Python NLP package NLTK.

4.3 Sentiment Detection and Feature Extraction

4.3.1 TF-IDF

Term frequency-inverse document frequency (TF-IDF) is a frequently used technique for determining a word's relevance in a text. The term frequency TF(t) is determined by dividing the number of occurrences of term t in a document by the document's total word count. The inverse document frequency (IDF) measures how informative a term is: certain words, such as "is," "an," and "and," occur frequently but carry little meaning. Given N, the total number of documents, and DF, the number of documents that contain the term t, IDF is computed as IDF(t) = log(N/DF). TF-IDF is an effective way to convert textual information into vector space model (VSM) representations. For example, if a document has 250 words and the word "Laptop" occurs 10 times in it, the term frequency is 10/250 = 0.04. Similarly, suppose that out of 50,000 documents only 500 mention the word "Laptop"; then IDF(Laptop) = log10(50000/500) = log10(100) = 2, and TF-IDF(Laptop) = 0.04 × 2 = 0.08.

4.3.2 N-Grams

N-grams form the textual features for supervised machine learning algorithms. N-grams are consecutive sequences of words or symbols; technically, they can be identified as contiguous groups of elements in a document. They become important when handling text data in an NLP (Natural Language Processing) project [37]. Among the models in machine learning for language, the N-gram model is likely the most fundamental. A group of a specific number of words is called an N-gram: the four-word comment "A Facebook comment post" is a 4-gram, "Facebook comment post" is a 3-gram (trigram), and "Facebook comment" is a 2-gram (bigram). But first, we need to look at the probabilities used in N-grams [38]. An accurate N-gram model can predict the next word of a sentence, p(w|h), the probability of word w given history h. A unigram model considers only the frequency with which a word occurs, without considering any words that came before it; a bigram model predicts the current word using only the single preceding word; and a trigram model considers the two preceding words. Generalizing, the approximation can be written as:

P(w_n | w_{1:n-1}) ≈ P(w_n | w_{n-N+1:n-1}) [39]

4.4 Polarity Opinion Classifier

4.4.1 Support Vector Machine (SVM)

SVM is a supervised ("feed-me") machine learning procedure that can be used to solve regression or classification problems. Regression is the prediction of a continuous value, whereas classification is the prediction of a label or class. For classification, SVM locates the hyperplanes that differentiate the classes plotted in n-dimensional space. The functioning for classification in support vector machines as
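The tokenization step of Section 4.2 can be sketched in Python. This minimal version splits on whitespace only, which is an assumption for illustration; the NLTK tokenizer used in this work also handles punctuation.

```python
# Minimal whitespace tokenizer for code-mixed text (illustrative sketch;
# NLTK's word_tokenize, used in this work, is more thorough).
def tokenize(sentence: str) -> list[str]:
    """Split a sentence into word tokens on whitespace."""
    return sentence.split()

tokens = tokenize("कल शाम को बाजार में I met to my teacher")
print(tokens)
# ['कल', 'शाम', 'को', 'बाजार', 'में', 'I', 'met', 'to', 'my', 'teacher']
```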
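POS tagging as described in Section 4.2 can be sketched with a toy lexicon lookup. Real taggers such as NLTK's pos_tag use trained statistical models; the mini-lexicon and the noun fallback below are purely illustrative assumptions.

```python
# Hypothetical mini-lexicon mapping words to Penn Treebank-style tags;
# a real tagger learns these assignments from annotated corpora.
TAG_LEXICON = {"I": "PRP", "met": "VBD", "my": "PRP$", "teacher": "NN"}

def pos_tag(tokens: list[str]) -> list[tuple[str, str]]:
    """Tag each token via lexicon lookup, defaulting to 'NN' (noun)."""
    return [(t, TAG_LEXICON.get(t, "NN")) for t in tokens]

print(pos_tag(["I", "met", "my", "teacher"]))
# [('I', 'PRP'), ('met', 'VBD'), ('my', 'PRP$'), ('teacher', 'NN')]
```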
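Stop word removal can likewise be sketched with a small hand-picked stop list. NLTK (used in this work) ships a much fuller multilingual list via nltk.corpus.stopwords; the set below is only illustrative.

```python
# Illustrative English stop-word set; NLTK's stopwords corpus is larger.
STOP_WORDS = {"the", "a", "an", "in", "is", "and", "to", "my"}

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Keep only tokens that are not designated stop words."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["I", "met", "to", "my", "teacher"]))
# ['I', 'met', 'teacher']
```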
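The TF-IDF arithmetic of Section 4.3.1 (the "Laptop" example) follows directly from the definitions TF(t) = occurrences / word count and IDF(t) = log(N/DF), taking the logarithm base 10:

```python
import math

def tf(term_count: int, doc_word_count: int) -> float:
    """Term frequency: occurrences of the term / total words in the document."""
    return term_count / doc_word_count

def idf(total_docs: int, docs_with_term: int) -> float:
    """Inverse document frequency: log10(N / DF)."""
    return math.log10(total_docs / docs_with_term)

tf_val = tf(10, 250)        # 10/250 = 0.04
idf_val = idf(50_000, 500)  # log10(100) = 2.0
print(tf_val * idf_val)     # TF-IDF(Laptop) = 0.08
```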
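Extracting the N-grams described in Section 4.3.2 is a short sliding-window operation over the token list, shown here on the paper's "A Facebook comment post" example:

```python
def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return every consecutive window of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

words = "A Facebook comment post".split()
print(ngrams(words, 2))  # bigrams
# [('A', 'Facebook'), ('Facebook', 'comment'), ('comment', 'post')]
print(ngrams(words, 3))  # trigrams
# [('A', 'Facebook', 'comment'), ('Facebook', 'comment', 'post')]
```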
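The approximation P(w_n | w_{1:n-1}) ≈ P(w_n | w_{n-N+1:n-1}) can be illustrated for N = 2 with maximum-likelihood bigram estimates, P(w | prev) = count(prev, w) / count(prev). The tiny corpus below is invented purely for illustration:

```python
from collections import Counter

def bigram_prob(corpus: list[list[str]], prev: str, word: str) -> float:
    """MLE bigram probability: count(prev, word) / count(prev)."""
    bigram_counts = Counter()
    unigram_counts = Counter()
    for sentence in corpus:
        for a, b in zip(sentence, sentence[1:]):
            bigram_counts[(a, b)] += 1
        unigram_counts.update(sentence)
    return bigram_counts[(prev, word)] / unigram_counts[prev]

# Hypothetical toy corpus of tokenized comments.
corpus = [
    ["good", "phone"],
    ["good", "battery"],
    ["good", "phone", "overall"],
]
print(bigram_prob(corpus, "good", "phone"))  # count('good phone')/count('good') = 2/3
```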
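The hyperplane idea behind SVM classification in Section 4.4.1 can be sketched as a linear decision function, sign(w·x + b). The weights below are hypothetical, standing in for values a trained SVM (e.g., scikit-learn's SVC with a linear kernel) would learn from the labeled data:

```python
def svm_predict(w: list[float], b: float, x: list[float]) -> int:
    """Classify x by which side of the hyperplane w.x + b = 0 it falls on."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Hypothetical 2-D hyperplane separating positive (+1) from negative (-1).
w, b = [1.0, -1.0], 0.0
print(svm_predict(w, b, [2.0, 1.0]))  # 1  (score = 2 - 1 = 1 >= 0)
print(svm_predict(w, b, [0.5, 3.0]))  # -1 (score = 0.5 - 3 = -2.5 < 0)
```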