11.  Update the individual's health condition in the networked database.

4.  IMPLEMENTATION AND EXPERIMENTATION

This section describes the implementation of the proposed system architecture for health anomaly detection.

4.1  Real-time voice emotion recognition system

The proposed system utilizes a Convolutional Neural Network (CNN) followed by max-pooling layers to extract relevant features from the input voice data. Batch normalization and dropout layers are incorporated to enhance model stability and prevent overfitting. The CNN architecture is designed to capture complex patterns and variations in the audio signals, facilitating accurate emotion recognition. Subsequent max-pooling layers further downsample the features, focusing on the most salient information while reducing computational complexity. Following the convolutional layers, Long Short-Term Memory (LSTM) units are employed to capture temporal dependencies and sequential patterns in the voice data. The LSTM layers, coupled with dropout regularization, enable the model to effectively learn and represent the dynamic nature of emotional expressions over time. The final dense layers perform classification, mapping the learned features to different emotional states. By leveraging a combination of CNN and LSTM layers, the proposed system achieves robust and accurate emotion recognition from voice inputs.
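The paper does not specify layer dimensions, so the following Keras sketch only illustrates the described stack (Conv1D blocks with batch normalization, max-pooling and dropout, followed by LSTM layers and dense classification layers). The filter counts, kernel sizes, LSTM widths, the (2376, 1) input layout, and the seven-class softmax output are illustrative assumptions, not the authors' exact configuration.

```python
# Illustrative sketch of the CNN + LSTM emotion classifier described above.
# Layer sizes, kernel widths and LSTM units are assumptions; only the overall
# structure (Conv1D -> BatchNorm -> MaxPool -> Dropout, then LSTM layers,
# then dense classification) follows the text.
from tensorflow.keras import layers, models

NUM_FEATURES = 2376   # ZCR + RMSE + MFCC values per clip (Section 4.1)
NUM_CLASSES = 7       # assumed number of emotional states

def build_emotion_model():
    model = models.Sequential([
        layers.Input(shape=(NUM_FEATURES, 1)),           # assumed (features, 1) layout
        layers.Conv1D(64, kernel_size=5, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling1D(pool_size=4),
        layers.Dropout(0.2),
        layers.Conv1D(128, kernel_size=5, padding="same", activation="relu"),
        layers.BatchNormalization(),
        layers.MaxPooling1D(pool_size=4),                 # further downsampling
        layers.Dropout(0.2),
        layers.LSTM(128, return_sequences=True),          # temporal dependencies
        layers.Dropout(0.3),
        layers.LSTM(64),
        layers.Dropout(0.3),
        layers.Dense(64, activation="relu"),
        layers.Dense(NUM_CLASSES, activation="softmax"),  # emotion probabilities
    ])
    model.compile(optimizer="adam",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```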
The dataset used in this module includes a diverse range of sources, each contributing distinct attributes and characteristics to the training process. The Surrey Audio-Visual Expressed Emotion (SAVEE) [10] dataset features recordings from four male actors, expressing a total of seven different emotions across 480 British English utterances. These sentences, meticulously chosen from the standard TIMIT corpus, ensure phonetic balance for each emotion. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [11] presents a multimodal collection of emotional speech and song recordings. With contributions from 24 professional actors vocalizing lexically matched statements in a neutral North American accent, RAVDESS encompasses a vast repository of 7,356 files covering seven distinct emotions. The Toronto Emotional Speech Set (TESS) [12] contributes stimuli for emotional speech research, featuring 200 target words spoken by two actresses across various emotional states. Further, the Indian Emotional Speech Corpora (IESC) [13] was used for training and testing the system for emotion classification in speech. With 600 speech samples recorded from eight speakers, each uttering two sentences in five emotions, IESC provides a rich source of English-language data. Collectively, these datasets offer a comprehensive foundation for training and validating the real-time voice emotion recognition system, enabling robust and accurate emotion detection in elderly individuals.

In addition to the established datasets, the proposed work also incorporates Indian regional audio clips sourced from various social media platforms. These clips, representing a diverse range of linguistic and cultural backgrounds, offer valuable insights into emotional expressions. To ensure compatibility with the existing datasets, extensive normalization techniques have been employed. This normalization process involves standardizing the format, quality, and linguistic characteristics of the collected audio clips so that they align seamlessly with the established datasets, thereby facilitating integration and enhancing the diversity of emotional expressions represented in the training data. By incorporating these additional regional audio clips, our dataset becomes more comprehensive and reflective of the diverse emotional expressions prevalent among elderly individuals. This augmentation not only enriches the training process but also enhances the generalizability and effectiveness of the real-time voice emotion recognition system in accurately detecting and interpreting emotional states across different cultural and linguistic contexts.
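The text does not detail the normalization pipeline for the social-media clips. The snippet below is a minimal sketch that assumes "standardizing format and quality" means converting each clip to mono, resampling to a common rate, trimming silence, and peak-normalizing amplitude; the target sample rate and the use of librosa/soundfile are assumptions for illustration.

```python
# Hypothetical pre-processing for the regional social-media clips.
# The target sample rate, mono conversion, silence trimming and peak
# normalization are assumptions; the paper only states that format,
# quality and linguistic characteristics were standardized.
import librosa
import numpy as np
import soundfile as sf

TARGET_SR = 22050  # assumed common sample rate shared with the other datasets

def normalize_clip(in_path: str, out_path: str) -> None:
    # Load as mono and resample to the common rate.
    y, _ = librosa.load(in_path, sr=TARGET_SR, mono=True)
    # Trim leading/trailing silence.
    y, _ = librosa.effects.trim(y, top_db=30)
    # Peak-normalize so that all clips share a comparable amplitude range.
    peak = np.max(np.abs(y))
    if peak > 0:
        y = y / peak
    sf.write(out_path, y, TARGET_SR)

# Example usage (file names are placeholders):
# normalize_clip("regional_clip.mp3", "regional_clip_norm.wav")
```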
The finalized input features for the current model encompass a comprehensive set of acoustic characteristics that are instrumental in capturing the nuances of emotional speech. These features are the Zero Crossing Rate (ZCR), Root Mean Square Energy (RMSE), and Mel-frequency cepstral coefficients (MFCCs). ZCR measures the rate at which the signal changes its sign, providing insight into the frequency content and periodicity of the audio signal. With 108 ZCR values calculated across the dataset, this feature offers valuable information regarding the temporal characteristics of the speech signals.

Root Mean Square Energy (RMSE) serves as a measure of the amplitude variation in the audio signal, quantifying the energy distribution across the signal's time domain. Like ZCR, RMSE is computed 108 times across the dataset, capturing variations in signal intensity and dynamics. Mel-frequency cepstral coefficients (MFCCs) represent a powerful feature set widely used in speech processing tasks. Comprising 2160 coefficients computed across the dataset, MFCCs capture the spectral characteristics of the speech signal, providing insight into the frequency distribution and phonetic content.

The chosen parameters for feature extraction are a hop length of 512 and a frame length of 2048, ensuring efficient processing while capturing the relevant temporal and spectral information. In total, the dataset comprises 2376 input features (108 ZCR + 108 RMSE + 2160 MFCC values), which collectively provide a rich representation of the acoustic properties of the emotional speech signals. These features serve as the foundation for training the real-time voice emotion recognition system, enabling accurate detection and interpretation of emotional states in elderly individuals.
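The reported counts are consistent with roughly 108 analysis frames per clip and 20 MFCC coefficients per frame (108 + 108 + 20 × 108 = 2376), although the paper states neither the frame count nor the MFCC order explicitly. The librosa-based sketch below combines these assumed values with the stated hop length of 512 and frame length of 2048.

```python
# Hypothetical feature-extraction sketch matching the stated parameters
# (hop length 512, frame length 2048). The sample rate, clip duration and
# the choice of 20 MFCC coefficients are assumptions chosen so that the
# flattened vector has the 2376 values reported in the text.
import librosa
import numpy as np

FRAME_LENGTH = 2048
HOP_LENGTH = 512
N_MFCC = 20  # assumed; 20 coefficients x 108 frames = 2160 MFCC values

def extract_features(path: str, sr: int = 22050) -> np.ndarray:
    y, sr = librosa.load(path, sr=sr)

    zcr = librosa.feature.zero_crossing_rate(
        y, frame_length=FRAME_LENGTH, hop_length=HOP_LENGTH)    # shape (1, frames)
    rms = librosa.feature.rms(
        y=y, frame_length=FRAME_LENGTH, hop_length=HOP_LENGTH)  # shape (1, frames)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=N_MFCC,
        n_fft=FRAME_LENGTH, hop_length=HOP_LENGTH)              # shape (n_mfcc, frames)

    # Flatten and concatenate into a single per-clip feature vector
    # (108 ZCR + 108 RMSE + 2160 MFCC values = 2376 for ~2.5 s of audio).
    return np.concatenate([zcr.ravel(), rms.ravel(), mfcc.ravel()])
```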
The proposed system underwent training for a total of 38 epochs, with early stopping mechanisms employed to prevent overfitting and optimize model performance. Early stopping allows the training process to halt when the model's




