Page 123 - Kaleidoscope Academic Conference Proceedings 2024

Innovation and Digital Transformation for a Sustainable World




4.1   Dataset Generation

A dataset is a collection of ordered data or information, gathered through study, analysis, observation, and measurement. For this work the data come from various e-dictionaries, from which English, Hindi, and Gujarati words were collected, and from social media comments and posts written in code-mixed language. In the dataset, each row represents a record and every column corresponds to a field (variable), so the dataset is structured. For this research we gather posts and comments from social media platforms such as Twitter and Facebook and use them in the subsequent steps. This step yields two datasets: 1) positive and negative words, and 2) a list of sentences in code-mixed language.

4.2   POS Text Preparation

In the tokenization process, long paragraphs (sometimes referred to as chunks of text) are broken up into tokens, which are essentially sentences; these statements can be divided further into individual words. Take the statement "कल शाम को बाजार में I met to my teacher" ("Yesterday evening in the market I met my teacher") as an illustration; tokenization yields: {"कल", "शाम", "को", "बाजार", "में", "I", "met", "to", "my", "teacher"}.

A basic goal in Natural Language Processing (NLP) is to assign a grammatical category (noun, verb, adjective, adverb, etc.) to every word in a document. This enables computers to analyze and interpret human language more accurately by improving their grasp of phrase structure and semantics. Part-of-speech (POS) tagging refers to the act of labeling each word in a text with a particular part of speech (adverb, adjective, verb, etc.) or grammatical category. POS tagging is helpful in NLP applications such as data extraction, machine translation, and named entity recognition, among others.

Stop-word removal is one of the preprocessing methods most commonly used in NLP applications. The idea is straightforward: remove words that occur frequently in every document in the corpus. Stop words are typically articles and pronouns. These terms are not highly discriminative because they are irrelevant for NLP tasks such as classification and information retrieval. When indexing and retrieving entries for a search query, search engines are programmed to discard terms designated as stop words. Examples of such words include "the," "a," "an," and "in."

In this work, tokenization and POS tagging are used for text processing, and stop-word removal is accomplished with the Python NLP package NLTK.

4.3   Sentiment Detection and Feature Extraction

4.3.1   TF-IDF

Term frequency-inverse document frequency (TF-IDF) is a frequently used technique for determining a word's relevance in a text. The term frequency (TF) of a term t is its number of occurrences in a document divided by the document's word count. Inverse document frequency (IDF) measures how informative a term is: certain words, such as "is," "an," and "and," occur frequently but carry little meaning. With N the total number of documents and DF the number of documents that contain the term t, the formula is IDF(t) = log(N/DF). TF-IDF is an effective way to convert textual information into vector space model (VSM) representations. For example, if a text has 250 words and the word "Laptop" occurs 10 times in it, the term frequency is 10/250 = 0.04. Similarly, suppose that only 500 out of 50,000 documents mention "Laptop"; then IDF(Laptop) = log(50000/500) = log(100) = 2, and TF-IDF(Laptop) = 0.04 × 2 = 0.08.

4.3.2   N-Grams

N-grams form the textual features for supervised machine learning algorithms. N-grams are consecutive strings of words or symbols; technically, they can be identified as the successive groups of elements in a document. They become important when handling text data in an NLP (Natural Language Processing) project [37]. Among the concepts in machine learning, the n-gram model is likely the most fundamental. A group of a specific number of words is called an n-gram: "A Facebook comment post" is a 4-gram, "Facebook comment post" is a 3-gram (trigram), and "Facebook comment" is a 2-gram (bigram). But first, we need to look at the probabilities that are used in n-grams [38]. An accurate n-gram model can predict the next word of a sentence, p(w|h). A unigram model considers only the frequency with which a word occurs, without considering any words that came before it; a bigram model predicts the current word using only the previous word; and a trigram model considers the two preceding words. Generalizing, the probability can be approximated as:

P(w_n | w_1:n-1) ≈ P(w_n | w_n-N+1:n-1) [39]

4.4   Polarity Opinion Classifier

4.4.1   Support Vector Machine (SVM)

SVM is a supervised machine learning algorithm that can be used to solve regression or classification problems. Regression is the prediction of a continuous value, whereas classification is the prediction of a label or class. For classification, SVM locates the hyperplane that best separates the classes plotted in n-dimensional space. The functioning of classification in support vector machines is as
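The tokenization and stop-word removal steps of Section 4.2 can be sketched in a few lines. The paper uses NLTK (e.g. nltk.word_tokenize, nltk.pos_tag, and nltk.corpus.stopwords); this self-contained sketch instead swaps in whitespace tokenization and a small hand-picked stop-word list so it runs without NLTK's data downloads — both substitutions are simplifications, not the paper's exact pipeline.

```python
# Illustrative stand-in for nltk.corpus.stopwords.words("english").
STOP_WORDS = {"the", "a", "an", "in", "is", "and", "to", "my"}

def tokenize(text: str) -> list[str]:
    """Split a (possibly code-mixed) sentence into word tokens
    on whitespace, which keeps Devanagari words intact."""
    return text.split()

def remove_stop_words(tokens: list[str]) -> list[str]:
    """Drop English stop words; Hindi tokens pass through unchanged."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

sentence = "कल शाम को बाजार में I met to my teacher"
tokens = tokenize(sentence)
print(tokens)
# → ['कल', 'शाम', 'को', 'बाजार', 'में', 'I', 'met', 'to', 'my', 'teacher']
print(remove_stop_words(tokens))
# → ['कल', 'शाम', 'को', 'बाजार', 'में', 'I', 'met', 'teacher']
```

Note that only the English stop words are filtered here; a code-mixed pipeline would also need Hindi/Gujarati stop-word lists.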

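The TF-IDF arithmetic of Section 4.3.1 can be checked directly; following the stated formula IDF(t) = log(N/DF), the logarithm is taken base 10 here (an assumption, since the paper does not name the base).

```python
import math

def term_frequency(term_count: int, doc_length: int) -> float:
    """TF: occurrences of the term divided by the document's word count."""
    return term_count / doc_length

def inverse_document_frequency(n_docs: int, docs_with_term: int) -> float:
    """IDF(t) = log10(N / DF)."""
    return math.log10(n_docs / docs_with_term)

# "Laptop" occurs 10 times in a 250-word document,
# and in 500 of 50,000 documents overall.
tf = term_frequency(10, 250)
idf = inverse_document_frequency(50_000, 500)
print(tf, idf, tf * idf)
# → 0.04 2.0 0.08
```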

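The n-gram extraction of Section 4.3.2 is a sliding window over the token list, and the bigram case of the conditional probability p(w|h) can be estimated from counts; this is a generic maximum-likelihood sketch, not code from the paper.

```python
from collections import Counter

def ngrams(tokens: list[str], n: int) -> list[tuple[str, ...]]:
    """Return the successive groups of n elements in the token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bigram_prob(tokens: list[str], w_prev: str, w: str) -> float:
    """MLE estimate P(w | w_prev) = count(w_prev, w) / count(w_prev)."""
    bigrams = Counter(ngrams(tokens, 2))
    unigrams = Counter(tokens)
    return bigrams[(w_prev, w)] / unigrams[w_prev]

tokens = "A Facebook comment post".split()
print(ngrams(tokens, 4))  # → [('A', 'Facebook', 'comment', 'post')]
print(ngrams(tokens, 3))  # → [('A', 'Facebook', 'comment'), ('Facebook', 'comment', 'post')]
print(ngrams(tokens, 2))  # → [('A', 'Facebook'), ('Facebook', 'comment'), ('comment', 'post')]

corpus = "the cat sat on the mat".split()
print(bigram_prob(corpus, "the", "cat"))  # → 0.5 ("the" occurs twice, once followed by "cat")
```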

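A minimal polarity classifier in the style of Sections 4.3.1 and 4.4.1 — TF-IDF features feeding a linear SVM — can be sketched as follows, assuming scikit-learn is installed. The four code-mixed comments are invented stand-ins; the paper's actual dataset is the one described in Section 4.1.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Tiny made-up stand-in for the collected code-mixed dataset.
comments = [
    "this movie bahut accha hai i loved it",
    "kitna kharab experience very bad service",
    "great food yaar totally recommend",
    "worst app ever bilkul bekar",
]
labels = ["positive", "negative", "positive", "negative"]

# TF-IDF vectorization followed by a linear SVM classifier.
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(comments, labels)

print(model.predict(["accha food i loved it"]))
```

With real data, the n-gram features of Section 4.3.2 could be added via TfidfVectorizer's ngram_range parameter.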



