or low-vision individuals with sighted volunteers through a smartphone app to provide visual assistance for various tasks. This crowdsourced model has proven both successful and scalable, demonstrating the power of community-driven solutions.

Aira [5] is another innovative startup, offering a subscription service that connects blind users with trained agents via smart glasses or a smartphone app. These agents provide real-time assistance for navigation, reading, and other tasks.

Research also plays a critical role. Studies such as those from McKinsey & Co., in collaboration with Tilting the Lens [6], highlight the digital divide and emphasize the importance of digital accessibility for blind and low-vision consumers. Their research involves accessible digital surveys, online focus groups, and interviews, underscoring the need for inclusive design in digital products.

One notable study, "Explaining CLIP's Performance Disparities on Data from Blind/Low Vision Users" [7], evaluates large multi-modal models (LMMs), particularly CLIP, and highlights the necessity of datasets that accurately represent blind and low-vision users to improve model performance. Another study, "ImageAssist: Tools for Enhancing Touchscreen-Based Image Exploration Systems for Blind and Low Vision Users" [8], investigates touchscreen-based systems that enrich alt text to offer a deeper understanding of digital images through touch and audio feedback.

The paper "Artificial Intelligence for Visually Impaired" [9] reviews 178 studies on AI applications designed to aid visually impaired individuals, including deep learning methods for diagnosing eye diseases and enhancing accessibility technologies. "Towards Assisting Visually Impaired Individuals: A Review on Current Solutions" [10] provides a comprehensive overview of assistive technologies such as AI-driven navigation, object recognition, and reading assistance. Lastly, "Recent Trends in Computer Vision-Driven Scene Understanding for Visually Impaired" [11] systematically maps the current state of the art in computer vision-based assistive technologies, focusing on scene understanding and navigation solutions for visually impaired users.

3. RESEARCH METHODOLOGY

The proposed system distinguishes itself through a suite of innovative features designed to provide a comprehensive and accessible educational experience for visually impaired individuals. These features leverage advanced AI technologies to ensure inclusivity, adaptability, and enhanced user engagement. Below is a detailed exploration of our core features:

3.1 Speech and Text Processing

Speech Recognition: The system begins by capturing the user's voice input using the Web Speech API, which transcribes spoken language into text. This enables users to communicate with the application naturally, facilitating an intuitive and seamless interaction.
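A minimal sketch of this capture step is shown below, assuming a browser that exposes the SpeechRecognition interface (prefixed as webkitSpeechRecognition in Chromium-based browsers); the sendToServer callback is hypothetical and not part of the paper's implementation.

```typescript
// Minimal sketch: capture one utterance and forward the final transcript.
const SpeechRecognitionImpl =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;

const recognition = new SpeechRecognitionImpl();
recognition.lang = "en-US";         // recognition language
recognition.interimResults = false; // deliver only final transcripts
recognition.continuous = false;     // stop after a single utterance

recognition.onresult = (event: any) => {
  // The last result entry holds the final transcript for this utterance.
  const transcript = event.results[event.results.length - 1][0].transcript;
  sendToServer(transcript);
};

recognition.onerror = (event: any) => {
  console.error("Recognition error:", event.error);
};

recognition.start(); // begins listening via the microphone

// Hypothetical downstream handler; the paper does not specify this step.
function sendToServer(text: string): void {
  console.log("Transcript:", text);
}
```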
Emotion Recognition for Optimized Responses: The proposed system employs AI models capable of recognizing speech tonality and emotional cues using Mel-Frequency Cepstral Coefficients (MFCCs). By analyzing these cues, the platform adapts its responses to better suit the user's emotional state, making interactions more empathetic and contextually relevant. This ensures that users receive supportive feedback tailored to their current mood, enhancing their overall learning experience [12].
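The paper does not publish its feature pipeline, so the sketch below only illustrates the general shape of MFCC-based emotion analysis in a browser setting, here assuming the open-source Meyda audio feature extraction library; classifyEmotion is a purely hypothetical stand-in for whatever model the system actually uses.

```typescript
import Meyda from "meyda"; // assumes the open-source "meyda" npm package

// Sketch: stream microphone audio through Meyda, collect MFCC frames,
// then hand them to an emotion model.
async function analyzeEmotion(onEmotion: (label: string) => void): Promise<void> {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const audioContext = new AudioContext();
  const source = audioContext.createMediaStreamSource(stream);
  const mfccFrames: number[][] = [];

  const analyzer = Meyda.createMeydaAnalyzer({
    audioContext,
    source,
    bufferSize: 512,             // samples per analysis frame
    featureExtractors: ["mfcc"], // 13 coefficients per frame by default
    callback: (features: { mfcc: number[] }) => {
      mfccFrames.push(features.mfcc);
      if (mfccFrames.length >= 100) { // roughly one second of audio
        analyzer.stop();
        onEmotion(classifyEmotion(mfccFrames));
      }
    },
  });
  analyzer.start();
}

// Hypothetical placeholder: a real system would run a trained model here.
function classifyEmotion(frames: number[][]): string {
  return "neutral";
}
```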
Internet Search Integration: For information retrieval, the system employs the Google Custom Search JSON API. It processes the transcribed text as a search query and fetches relevant information from the web, ensuring that responses are informative and contextually relevant [13].
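The Custom Search JSON API is queried over HTTPS with an API key and a search engine ID (cx); the endpoint below is the documented one, while the key, engine ID, and result handling are illustrative placeholders.

```typescript
// Sketch: query the Google Custom Search JSON API with the transcript.
// API_KEY and SEARCH_ENGINE_ID are placeholders, not real credentials.
const API_KEY = "YOUR_API_KEY";
const SEARCH_ENGINE_ID = "YOUR_CX_ID";

async function searchWeb(query: string): Promise<string[]> {
  const url =
    "https://www.googleapis.com/customsearch/v1" +
    `?key=${API_KEY}&cx=${SEARCH_ENGINE_ID}&q=${encodeURIComponent(query)}`;
  const response = await fetch(url);
  if (!response.ok) throw new Error(`Search failed: ${response.status}`);
  const data = await response.json();
  // "items" holds the ranked results; keep the top three snippets.
  return (data.items ?? []).slice(0, 3).map((item: any) => item.snippet);
}
```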
Response Formulation: The system generates a response that incorporates both the informational content obtained from the web and the emotional context detected from the user's speech. This approach provides a personalized interaction, ensuring that the user receives accurate and contextually appropriate information [14].
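How the informational and emotional strands are merged is not specified in the paper; the template below is one simple way to condition the wording on a detected emotion label, with both the labels and the phrasing chosen only for illustration.

```typescript
// Sketch: wrap retrieved snippets in emotion-aware framing.
// The emotion labels and phrasings are illustrative assumptions only.
const EMPATHETIC_PREFIX: Record<string, string> = {
  frustrated: "No worries, let's take this step by step. ",
  sad: "I'm here to help. ",
  happy: "Great question! ",
  neutral: "",
};

function formulateResponse(snippets: string[], emotion: string): string {
  const prefix = EMPATHETIC_PREFIX[emotion] ?? "";
  return `${prefix}Here is what I found: ${snippets.join(" ")}`;
}
```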
Text-to-Speech Output: The final response is converted back into speech using advanced text-to-speech technology, allowing the system to communicate with the user audibly. This feature is crucial for accessibility, particularly for visually impaired users, enabling them to receive information in an auditory format [15].
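In a browser, the speech synthesis half of the Web Speech API covers this step; the rate and language settings below are illustrative defaults, not values taken from the paper.

```typescript
// Sketch: speak a response aloud with the browser's speechSynthesis API.
function speakResponse(text: string, onDone?: () => void): void {
  const utterance = new SpeechSynthesisUtterance(text);
  utterance.lang = "en-US";
  utterance.rate = 1.0;            // normal speaking rate (illustrative)
  utterance.onend = () => onDone?.();
  window.speechSynthesis.cancel(); // stop any earlier speech first
  window.speechSynthesis.speak(utterance);
}
```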
User Interface (UI): The UI displays the transcription and processed responses with visual cues indicating which parts of the response have been spoken. This not only aids the helper in following the response but also enhances comprehension. The interface is designed to be intuitive and user-friendly, ensuring that visually impaired users can navigate and interact with the platform effectively [16].
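One browser-native way to drive such "spoken so far" cues is the boundary event fired by SpeechSynthesisUtterance during playback; the element id and highlighting scheme below are hypothetical, sketched under the assumption that the response is rendered as plain text.

```typescript
// Sketch: highlight the already-spoken part of the response as it plays.
// "response-text" is a hypothetical element holding the response string.
function speakWithHighlight(text: string): void {
  const el = document.getElementById("response-text");
  if (el) el.textContent = text;

  const utterance = new SpeechSynthesisUtterance(text);
  utterance.onboundary = (event: SpeechSynthesisEvent) => {
    // charIndex marks the start of the word about to be spoken.
    const spoken = text.slice(0, event.charIndex);
    const pending = text.slice(event.charIndex);
    // Sketch only: real code should escape the text before using innerHTML.
    if (el) el.innerHTML = `<mark>${spoken}</mark>${pending}`;
  };
  window.speechSynthesis.speak(utterance);
}
```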
Figure 1 – UI for Speech Processing
System Flow:
The speech processing system operates through a structured interaction between the client and server sides. Initially, the user's speech is captured and sent via the client-side Speech-to-Text API to the server. The server processes the text through an Emotion Content Analyzer, which identifies emotional cues and context. This analysis informs the Adaptive Learning Model, which generates an appropriate response. The Text-to-Speech Service then converts the tailored response back into speech. Finally, the audio is played back to the user.
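Putting these stages together, the client-side loop could be orchestrated roughly as in the sketch below; the /analyze-emotion and /generate-response endpoints are hypothetical names standing in for the paper's server-side Emotion Content Analyzer and Adaptive Learning Model.

```typescript
// Sketch of the end-to-end loop; endpoint paths and payload shapes are
// hypothetical stand-ins for the paper's server-side components.
async function handleUtterance(transcript: string): Promise<void> {
  // 1. Emotion Content Analyzer (server side).
  const emotionRes = await fetch("/analyze-emotion", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ text: transcript }),
  });
  const { emotion } = await emotionRes.json();

  // 2. Adaptive Learning Model: search plus emotion-aware wording.
  const responseRes = await fetch("/generate-response", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query: transcript, emotion }),
  });
  const { reply } = await responseRes.json();

  // 3. Text-to-Speech Service: play the tailored reply back to the user.
  window.speechSynthesis.speak(new SpeechSynthesisUtterance(reply));
}
```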



