transmitted back to the client side, where it is delivered to the user, completing the interaction cycle.
Figure 2 – Activity Diagram for Speech and Text Processing
3.2 Image and Text Integration
Image Scene Descriptor for the Visually Impaired: The proposed system integrates advanced image recognition and natural language processing (NLP) to create detailed audio descriptions of virtual scenes and images [18]. This feature allows blind users to understand visual content through auditory descriptions, significantly enhancing their ability to engage with educational materials that contain images [17]. It is important to note that the proposed system, although designed for visually impaired users, will still require assistance from sighted individuals for accurate image and text integration. The involvement of sighted assistants ensures that the descriptive content generated is accurate and contextually relevant, thereby enhancing the overall learning experience.
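
For concreteness, a minimal sketch of such a captioning-to-audio pipeline is given below. The BLIP checkpoint and the gTTS library are illustrative assumptions, as the paper does not name its specific image recognition or speech synthesis components.

    from transformers import BlipProcessor, BlipForConditionalGeneration
    from PIL import Image
    from gtts import gTTS

    # Illustrative model choice, not the paper's confirmed stack.
    processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
    model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

    def describe_image(image_path, audio_path="description.mp3"):
        # Caption the image, then synthesize the caption as speech.
        image = Image.open(image_path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        ids = model.generate(**inputs, max_new_tokens=40)
        caption = processor.decode(ids[0], skip_special_tokens=True)
        gTTS(text=caption).save(audio_path)
        return caption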
Image Bounding Box for Learning Material: An algorithm within the proposed system differentiates between text and visual content in learning materials. Using Optical Character Recognition (OCR) and image-to-text models, the platform converts visual text into speech, making diagrams and other visual elements accessible. This ensures that all content, regardless of its original format, is available to visually impaired users.

Figure 3 – Input and Output to OCR
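
A compact sketch of this OCR-to-speech step is shown below; the Tesseract engine (via pytesseract) and gTTS are assumed stand-ins, since the paper does not name its OCR or speech components.

    import pytesseract
    from PIL import Image
    from gtts import gTTS

    def ocr_to_speech(image_path, audio_path="text.mp3"):
        # Extract any embedded text from the learning material.
        text = pytesseract.image_to_string(Image.open(image_path))
        if text.strip():
            gTTS(text=text).save(audio_path)  # speak only if text was found
        return text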
Handwriting to Text: The system provides auditory instructions for forming each letter of the alphabet. Users draw the letters based on these instructions using a touch-screen device. The system analyzes each drawn letter, compares it with predefined templates, and provides feedback on accuracy, guiding users to improve their handwriting.

Figure 4 – UI for Handwriting to Text Processing
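
The template comparison could be sketched as follows; the use of OpenCV contour matching and the acceptance threshold are illustrative assumptions rather than the paper's confirmed method.

    import cv2

    def grade_letter(drawn_path, template_path, threshold=0.15):
        # Load the user's drawing and the stored template as binary images.
        drawn = cv2.threshold(cv2.imread(drawn_path, cv2.IMREAD_GRAYSCALE),
                              127, 255, cv2.THRESH_BINARY)[1]
        templ = cv2.threshold(cv2.imread(template_path, cv2.IMREAD_GRAYSCALE),
                              127, 255, cv2.THRESH_BINARY)[1]
        dc, _ = cv2.findContours(drawn, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        tc, _ = cv2.findContours(templ, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
        # matchShapes gives a scale- and rotation-tolerant distance (0 = identical).
        dist = cv2.matchShapes(max(dc, key=cv2.contourArea),
                               max(tc, key=cv2.contourArea),
                               cv2.CONTOURS_MATCH_I1, 0.0)
        return dist < threshold  # True = acceptable letter formation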
Face Detection: The face detection module employs the YOLOv5 object detection model to detect faces within an input image. Detected faces are outlined with rectangular bounding boxes, providing visual cues for their location and size. This capability is essential for user identification, interaction facilitation, and security enhancement [17].

Figure 5 – UI for Face Detection Model
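
A minimal sketch of the YOLOv5 detection call is given below; the generic yolov5s checkpoint is a placeholder, as the paper does not specify which face-trained weights it uses.

    import torch

    # Generic COCO weights stand in here; a face-specific YOLOv5
    # checkpoint would be loaded the same way.
    model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

    def detect_faces(image_path):
        results = model(image_path)  # single forward pass
        # Each row: x1, y1, x2, y2, confidence, class index.
        return results.xyxy[0].tolist()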
Sign Language to Text: The system detects hand gestures using the HSV color space for skin-color detection. It extracts features such as the area, perimeter, and convexity defects of the hand contour, identifying extended fingers to interpret sign language. This feature allows real-time translation of sign language into text, facilitating communication for hearing-impaired users [17].

Figure 6 – UI for Sign Language Detection
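
The feature extraction described above maps directly onto standard OpenCV operations, as the following sketch shows; the HSV skin-tone range and the defect-depth threshold are illustrative values that would need tuning in practice.

    import cv2

    def hand_features(frame_bgr):
        hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
        # Approximate skin-tone band in HSV.
        mask = cv2.inRange(hsv, (0, 30, 60), (20, 150, 255))
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        if not contours:
            return None
        hand = max(contours, key=cv2.contourArea)  # largest blob = hand
        area = cv2.contourArea(hand)
        perimeter = cv2.arcLength(hand, True)
        hull = cv2.convexHull(hand, returnPoints=False)
        defects = cv2.convexityDefects(hand, hull)
        # Deep convexity defects are the valleys between extended fingers.
        valleys = 0
        if defects is not None:
            for start, end, far, depth in defects[:, 0]:
                if depth > 10000:  # depth is fixed point (1/256 pixel)
                    valleys += 1
        fingers = valleys + 1 if valleys else 0
        return area, perimeter, fingers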
System Flow: The image processing system begins with the user uploading or accessing an image via the client-side frontend interface. The image is sent to the server-side Image Processing API, which extracts text, identifies visual elements, and detects object locations. The OCR Service extracts any embedded text, while the Scene Description Model converts this text and scene data into verbal descriptions. Concurrently, Bounding Box Detection identifies and describes object locations. Finally, the Text-to-Speech Service converts these descriptions into speech, which is delivered back to the user through the client-side interface, completing the interaction.
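
A condensed sketch of this server-side flow is given below, assuming a Flask API; the /process-image route is hypothetical, and the OCR step stands in for the full set of services (the scene description and bounding box outputs from the earlier sketches would be concatenated at the marked point).

    from flask import Flask, request, send_file
    from gtts import gTTS
    import pytesseract
    from PIL import Image

    app = Flask(__name__)

    @app.route("/process-image", methods=["POST"])
    def process_image():
        upload = request.files["image"]  # sent by the client-side UI
        upload.save("upload.png")
        # OCR Service: pull any embedded text out of the image.
        text = pytesseract.image_to_string(Image.open("upload.png"))
        # Scene Description Model and Bounding Box Detection would append
        # their verbal output here; the OCR text stands in for both.
        description = text if text.strip() else "No readable text was found."
        # Text-to-Speech Service: convert the description to audio.
        gTTS(text=description).save("reply.mp3")
        return send_file("reply.mp3", mimetype="audio/mpeg")

    if __name__ == "__main__":
        app.run()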
3.3 Voice to Action Commands
Integration with API Endpoints: The proposed system's voice command recognition feature integrates seamlessly with web browsers and other tools used within the learning