11. Update on individual's health condition in networked database.

4. IMPLEMENTATION AND EXPERIMENTATION

This section describes the implementation of the proposed system architecture for health anomaly detection.

4.1 Real-time voice emotion recognition system
The proposed system utilizes a Convolutional Neural Network (CNN) followed by max-pooling layers to extract relevant features from the input voice data. Batch normalization and dropout layers are incorporated to enhance model stability and prevent overfitting. The CNN architecture is designed to capture complex patterns and variations in the audio signals, facilitating accurate emotion recognition. Subsequent max-pooling layers further downsample the features, focusing on the most salient information while reducing computational complexity. Following the convolutional layers, Long Short-Term Memory (LSTM) units are employed to capture temporal dependencies and sequential patterns in the voice data. The LSTM layers, coupled with dropout regularization, enable the model to effectively learn and represent the dynamic nature of emotional expressions over time. The final dense layers perform classification, mapping the learned features to different emotional states. By combining CNN and LSTM layers, the proposed system achieves robust and accurate emotion recognition from voice inputs.
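As a concrete illustration, a CNN-LSTM stack of this kind could be assembled in Keras as sketched below. The layer counts, filter sizes, LSTM width, and dropout rates are assumptions for illustration only; the paper specifies the layer types (convolution, max-pooling, batch normalization, dropout, LSTM, dense) but not these hyperparameters.

# Illustrative CNN-LSTM emotion classifier; hyperparameters are assumed,
# not taken from the paper.
from tensorflow.keras import layers, models

NUM_FEATURES = 2376   # ZCR + RMSE + MFCC values per clip, as described below
NUM_EMOTIONS = 7      # e.g., the seven emotion classes of SAVEE

model = models.Sequential([
    layers.Input(shape=(NUM_FEATURES, 1)),
    # 1-D convolutions extract local patterns from the feature sequence
    layers.Conv1D(64, kernel_size=5, activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling1D(pool_size=2),    # downsample to salient information
    layers.Conv1D(128, kernel_size=5, activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling1D(pool_size=2),
    layers.Dropout(0.3),                 # regularization against overfitting
    # LSTM captures temporal dependencies across the pooled sequence
    layers.LSTM(128),
    layers.Dropout(0.3),
    # Dense layers map the learned features to emotion classes
    layers.Dense(64, activation="relu"),
    layers.Dense(NUM_EMOTIONS, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy",
              metrics=["accuracy"])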
The dataset used in this module includes a diverse range of sources, each contributing distinct attributes and characteristics to the training process. The Surrey Audio-Visual Expressed Emotion (SAVEE) [10] dataset features recordings from four male actors expressing a total of seven different emotions across 480 British English utterances. These sentences, meticulously chosen from the standard TIMIT corpus, ensure phonetic balance for each emotion. The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) [11] presents a multimodal collection of emotional speech and song recordings. With contributions from 24 professional actors vocalizing lexically-matched statements in a neutral North American accent, RAVDESS encompasses a vast repository of 7,356 files covering seven distinct emotions. The Toronto Emotional Speech Set (TESS) [12] contributes stimuli for emotional speech research, featuring 200 target words spoken by two actresses across various emotional states. Further, the Indian Emotional Speech Corpora (IESC) [13] was used in training and testing the system for emotion classification in speech. With 600 speech samples recorded from eight speakers, each uttering two sentences in five emotions, IESC provides a rich source of English-language data. Collectively, these datasets offer a comprehensive foundation for training and validating the real-time voice emotion recognition system, enabling robust and accurate emotion detection in elderly individuals.
In addition to the established datasets, the proposed work also incorporates Indian regional audio clips sourced from various social media platforms. These clips, representing a diverse range of linguistic and cultural backgrounds, offer valuable insights into emotional expressions. To ensure compatibility with the existing datasets, extensive normalization techniques have been employed. This normalization process involves standardizing the format, quality, and linguistic characteristics of the collected audio clips to align seamlessly with the established datasets, thereby facilitating integration and enhancing the diversity of emotional expressions represented in the training data. By incorporating these additional regional audio clips, our dataset becomes more comprehensive and reflective of the diverse emotional expressions prevalent among elderly individuals. This augmentation not only enriches the training process but also enhances the generalizability and effectiveness of the real-time voice emotion recognition system in accurately detecting and interpreting emotional states across different cultural and linguistic contexts.
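The excerpt does not detail the normalization pipeline itself; the following minimal sketch shows one plausible realization, assuming clips are resampled to a common rate, mixed down to mono, and peak-normalized using the librosa and soundfile libraries. The target sampling rate is an assumption.

# Plausible normalization of collected clips (assumed pipeline; the paper
# only states that format, quality, and characteristics are standardized).
import librosa
import numpy as np
import soundfile as sf

TARGET_SR = 22050  # assumed common sampling rate

def normalize_clip(in_path: str, out_path: str) -> None:
    # Load as mono and resample to the shared target rate
    y, _ = librosa.load(in_path, sr=TARGET_SR, mono=True)
    # Peak-normalize so clips share a comparable amplitude range
    peak = np.max(np.abs(y))
    if peak > 0:
        y = y / peak
    sf.write(out_path, y, TARGET_SR)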
The finalized input features for the current model encompass a comprehensive set of acoustic characteristics that are instrumental in capturing the nuances of emotional speech. These features include Zero Crossing Rate (ZCR), Root Mean Squared Error (RMSE), and Mel-frequency cepstral coefficients (MFCCs). ZCR measures the rate at which the signal changes its sign, providing insights into the frequency content and periodicity of the audio signal. With 108 instances of ZCR calculated across the dataset, this feature offers valuable information regarding the temporal characteristics of the speech signals.

Root Mean Squared Error (RMSE) serves as a measure of the amplitude variation in the audio signal, quantifying the energy distribution across the signal's time domain. Similar to ZCR, RMSE is computed 108 times across the dataset, capturing variations in signal intensity and dynamics. Mel-frequency cepstral coefficients (MFCCs) represent a powerful feature set widely used in speech processing tasks. Comprising 2160 coefficients computed across the dataset, MFCCs capture the spectral characteristics of the speech signal, providing insights into the frequency distribution and phonetic content.
The chosen parameters for feature extraction include a hop length of 512 and a frame length of 2048, ensuring efficient processing while capturing relevant temporal and spectral information. In total, the dataset comprises 2376 input features, combining ZCR, RMSE, and MFCCs, which collectively provide a rich representation of the acoustic properties of the emotional speech signals. These features serve as the foundation for training the real-time voice emotion recognition system, enabling accurate detection and interpretation of emotional states in elderly individuals.
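Using the stated hop length of 512 and frame length of 2048, these features could be extracted with librosa as sketched below. The choice of 20 MFCC coefficients is an assumption, inferred only from the reported counts (20 coefficients x 108 frames = 2160 MFCC values, which together with 108 ZCR and 108 RMSE values gives the 2376 features mentioned above).

# Feature extraction sketch using the stated hop and frame lengths.
# n_mfcc=20 is an assumption consistent with the reported feature counts.
import librosa
import numpy as np

FRAME_LENGTH = 2048
HOP_LENGTH = 512

def extract_features(y: np.ndarray, sr: int) -> np.ndarray:
    zcr = librosa.feature.zero_crossing_rate(
        y, frame_length=FRAME_LENGTH, hop_length=HOP_LENGTH)    # (1, frames)
    rms = librosa.feature.rms(
        y=y, frame_length=FRAME_LENGTH, hop_length=HOP_LENGTH)  # (1, frames)
    mfcc = librosa.feature.mfcc(
        y=y, sr=sr, n_mfcc=20,
        n_fft=FRAME_LENGTH, hop_length=HOP_LENGTH)              # (20, frames)
    # Flatten and concatenate into a single input vector
    # (108 + 108 + 2160 = 2376 values for a clip of ~108 frames)
    return np.concatenate([zcr.ravel(), rms.ravel(), mfcc.ravel()])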
The proposed system underwent training for a total of 38 epochs, with early stopping mechanisms employed to prevent overfitting and optimize model performance. Early stopping allows the training process to halt when the model's validation performance stops improving.
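In Keras, for example, such an early stopping mechanism is typically configured as sketched below; the monitored metric and patience value are assumptions, as the excerpt does not state them.

# Early stopping configuration (monitored metric and patience are assumed).
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor="val_loss",         # halt when validation loss stops improving
    patience=5,                 # assumed patience; not stated in the paper
    restore_best_weights=True,  # roll back to the best-performing weights
)
# Passed to model.fit(..., epochs=38, callbacks=[early_stop]) during training.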