



transmitted back to the client side, where it is delivered to the user, completing the interaction cycle.

Figure 2 – Activity Diagram for Speech and Text Processing


3.2  Image and Text Integration

Image Scene Descriptor for the Visually Impaired: The proposed system integrates advanced image recognition and natural language processing (NLP) to create detailed audio descriptions of virtual scenes and images [18]. This feature allows blind users to understand visual content through auditory descriptions, significantly enhancing their ability to engage with educational materials that contain images [17]. It is important to note that the proposed system, although designed for visually impaired users, will still require assistance from sighted individuals for accurate image and text integration. The involvement of sighted assistants ensures that the descriptive content generated is accurate and contextually relevant, thereby enhancing the overall learning experience.
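As an illustration of how such a descriptor could be assembled, the sketch below pairs a pretrained image-captioning model with a text-to-speech service. The specific libraries (Hugging Face BLIP and gTTS) and the model checkpoint are assumptions made for this sketch; the paper does not prescribe a particular captioning backend.

    import torch
    from PIL import Image
    from transformers import BlipProcessor, BlipForConditionalGeneration
    from gtts import gTTS

    # Assumed captioning backend: BLIP base checkpoint (illustrative choice).
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    def describe_image(image_path: str, audio_path: str = "description.mp3") -> str:
        """Generate a short caption for an image and save it as speech."""
        image = Image.open(image_path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        with torch.no_grad():
            output_ids = model.generate(**inputs, max_new_tokens=40)
        caption = processor.decode(output_ids[0], skip_special_tokens=True)
        # Convert the textual description to audio for the visually impaired user.
        gTTS(caption).save(audio_path)
        return caption

    # Example: print(describe_image("lecture_slide.png"))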
Image Bounding Box for Learning Material: An algorithm within the proposed system differentiates between text and visual content in learning materials. Using Optical Character Recognition (OCR) and image-to-text models, the platform converts visual text into speech, making diagrams and other visual elements accessible. This ensures that all content, regardless of its original format, is available to visually impaired users.
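A minimal sketch of the OCR step is shown below, assuming Tesseract (via pytesseract) as the recognition engine and gTTS for the speech output; both library choices are illustrative, since the paper only specifies that OCR and image-to-text models are used.

    import pytesseract
    from PIL import Image
    from gtts import gTTS

    def read_material_aloud(image_path: str, audio_path: str = "material.mp3") -> str:
        """Extract embedded text from a learning-material image and speak it."""
        image = Image.open(image_path)
        # OCR pass: returns recognised text, or an empty string for purely visual regions.
        text = pytesseract.image_to_string(image)
        if text.strip():
            gTTS(text).save(audio_path)   # textual content -> speech
        return text

    # Word-level bounding boxes can also be recovered, e.g. to separate text
    # regions from diagrams before describing the non-text areas:
    # data = pytesseract.image_to_data(Image.open("page.png"),
    #                                  output_type=pytesseract.Output.DICT)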


Figure 3 – Input and Output to OCR

Handwriting to Text: The system provides auditory instructions for forming each letter of the alphabet. Users draw the letters based on these instructions using a touch-screen device. The system analyzes each drawn letter, compares it with predefined templates, and provides feedback on accuracy, guiding users to improve their handwriting.

Figure 4 – UI for Handwriting to Text Processing
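One simple way to realise the template comparison is sketched below: the drawn stroke image and a stored template are both binarised, resized to a common grid, and scored by pixel overlap. The similarity threshold and the template file names are illustrative assumptions; the paper does not detail the matching metric.

    import cv2
    import numpy as np

    def letter_accuracy(drawn_path: str, template_path: str, size: int = 64) -> float:
        """Return a 0-1 similarity score between a drawn letter and its template."""
        def to_binary(path: str) -> np.ndarray:
            img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
            img = cv2.resize(img, (size, size))
            _, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
            return binary > 0

        drawn, template = to_binary(drawn_path), to_binary(template_path)
        intersection = np.logical_and(drawn, template).sum()
        union = np.logical_or(drawn, template).sum()
        return float(intersection) / float(union) if union else 0.0

    # Example feedback rule (threshold chosen for illustration only):
    # score = letter_accuracy("user_A.png", "template_A.png")
    # print("Well formed!" if score > 0.6 else "Try tracing the letter again.")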
Face Detection: The face detection module employs the YOLOv5 object detection model to detect faces within an input image. Detected faces are outlined with rectangular bounding boxes, providing visual cues for their location and size. This capability is essential for user identification, interaction facilitation, and security enhancement [17].

Figure 5 – UI for Face Detection Model
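The sketch below shows how such a module might be wired up through the public YOLOv5 hub interface. The weights file face_weights.pt stands in for a model fine-tuned on a face dataset and is a hypothetical name; the stock COCO-trained YOLOv5 checkpoints do not include a dedicated face class.

    import cv2
    import torch

    # Load a YOLOv5 model fine-tuned for faces (weights file name is hypothetical).
    model = torch.hub.load("ultralytics/yolov5", "custom", path="face_weights.pt")

    def detect_faces(image_path: str, out_path: str = "faces.png"):
        """Draw rectangular bounding boxes around detected faces."""
        image = cv2.imread(image_path)
        results = model(image[..., ::-1])              # YOLOv5 expects RGB input
        boxes = results.xyxy[0].cpu().numpy()          # columns: x1, y1, x2, y2, conf, class
        for x1, y1, x2, y2, conf, _ in boxes:
            if conf < 0.5:                             # confidence cut-off (illustrative)
                continue
            cv2.rectangle(image, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 2)
        cv2.imwrite(out_path, image)
        return boxes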
Sign Language to Text: The system detects hand gestures using the HSV color space for skin-color detection. It extracts features such as the area, perimeter, and convexity defects of the hand contour, identifying extended fingers to interpret sign language. This feature allows real-time translation of sign language into text, facilitating communication for hearing-impaired users [17].

Figure 6 – UI for Sign Language Detection
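A condensed sketch of the gesture pipeline is given below, using OpenCV for HSV skin segmentation and convexity-defect counting. The HSV bounds and the defect-depth threshold are illustrative values that would need tuning, and the mapping from finger count to a sign is only a placeholder.

    import cv2
    import numpy as np

    SKIN_LOWER = np.array([0, 30, 60], dtype=np.uint8)     # illustrative HSV bounds
    SKIN_UPPER = np.array([25, 180, 255], dtype=np.uint8)

    def count_extended_fingers(frame) -> int:
        """Estimate the number of extended fingers in a BGR frame."""
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        mask = cv2.inRange(hsv, SKIN_LOWER, SKIN_UPPER)     # skin-colour segmentation
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        if not contours:
            return 0
        hand = max(contours, key=cv2.contourArea)           # largest blob = hand contour
        hull = cv2.convexHull(hand, returnPoints=False)
        if hull is None or len(hull) < 4:
            return 0
        defects = cv2.convexityDefects(hand, hull)
        if defects is None:
            return 0
        # Deep convexity defects correspond to the valleys between extended fingers.
        deep = sum(1 for d in defects[:, 0] if d[3] / 256.0 > 20)
        return deep + 1 if deep else 0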


System Flow: The image processing system begins with the user uploading or accessing an image via the client-side frontend interface. The image is sent to the server-side Image Processing API, which extracts text, identifies visual elements, and detects object locations. The OCR Service extracts any embedded text, while the Scene Description Model converts this text and scene data into verbal descriptions. Concurrently, Bounding Box Detection identifies and describes object locations. Finally, the Text-to-Speech Service converts these descriptions into speech, which is delivered back to the user through the client-side interface, completing the interaction.
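The end-to-end flow can be pictured as a single server-side endpoint that chains the OCR, description, and speech services. The route name /process-image, the helper function, and the use of Flask are assumptions made for this sketch; the paper does not specify the web framework or the API contract.

    from flask import Flask, request, send_file
    from PIL import Image
    import pytesseract
    from gtts import gTTS

    app = Flask(__name__)

    @app.route("/process-image", methods=["POST"])   # route name is illustrative
    def process_image():
        image = Image.open(request.files["image"].stream)
        extracted_text = pytesseract.image_to_string(image)    # OCR Service
        scene_text = describe_scene_placeholder(image)          # Scene Description Model (stub)
        description = " ".join(filter(None, [extracted_text.strip(), scene_text]))
        gTTS(description or "No readable content found.").save("reply.mp3")  # Text-to-Speech Service
        return send_file("reply.mp3", mimetype="audio/mpeg")

    def describe_scene_placeholder(image) -> str:
        """Stand-in for the captioning / bounding-box description step."""
        return ""

    if __name__ == "__main__":
        app.run()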
3.3 Voice to Action Commands

Integration with API Endpoints: The proposed system’s voice command recognition feature integrates seamlessly with web browsers and other tools used within the learning


