Page 759 - AI for Good Innovate for Impact


transcription. Future applications include summarizing, categorizing, or translating responses
using other natural language processing tools. The specific objectives are:

• Collect humanitarian domain-specific audio data with transcription.
• Develop a language model with speech-to-text capabilities that can detect dialects and
produce high-quality transcriptions.
• Create a functioning prototype in KoboToolbox.
• Conduct user testing and human verification on a validation set to demonstrate the model's
accuracy.
• Collaborate with local academia and include underrepresented refugee communities.
• Document the process, lessons learned, and findings to create a replicable approach for
low-resource language models.
• Ensure data protection, data responsibility, human rights, and ethics through the inclusion
of UNHCR Iraq experts.
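
The validation objective above does not name an accuracy metric. A common choice for speech-to-text is word error rate (WER): the word-level edit distance between a human-verified reference transcript and the model output, divided by the number of reference words. The sketch below is illustrative only and is not taken from the project; the function names and the placeholder tokens are assumptions, and a real evaluation would also need text normalization suited to Sorani and Kurmanji.

```python
from typing import Iterable, Tuple


def word_edit_distance(ref: list, hyp: list) -> int:
    """Minimum number of word substitutions, insertions, and deletions turning ref into hyp."""
    # prev[j] holds the distance between the reference words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # reference word missing from the hypothesis
                curr[j - 1] + 1,     # extra word in the hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER for a single utterance (hypothetical helper, whitespace tokenization)."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    return word_edit_distance(ref, hypothesis.split()) / len(ref)


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """WER over a validation set: total word errors divided by total reference words."""
    errors = 0
    ref_words = 0
    for reference, hypothesis in pairs:
        ref = reference.split()
        errors += word_edit_distance(ref, hypothesis.split())
        ref_words += len(ref)
    if ref_words == 0:
        raise ValueError("validation set contains no reference words")
    return errors / ref_words


# Placeholder tokens stand in for human-verified and model-produced transcripts.
validation_pairs = [
    ("w1 w2 w3 w4", "w1 wX w3 w4"),  # one substitution
    ("w5 w6", "w5 w6"),              # exact match
]
print(corpus_wer(validation_pairs))
```

Pooling errors over the whole validation set, rather than averaging per-utterance scores, keeps short recordings from dominating the result.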

The project aims to create a template for low-resource languages in domain-specific fields by
fine-tuning existing speech-to-text models using publicly available and domain-specific data.
It addresses a gap in processing unstructured audio data by developing a language model
for Kurdish and its dialects (Sorani and Kurmanji). This approach enhances data processing
efficiency and contributes to industry innovation and infrastructure development. By integrating
the model into KoboToolbox, the project will facilitate the collection and transcription of audio
data, making the process more efficient and effective.

The expected impact and benefits of this solution include:

• Reduce manual processing of audio data in the humanitarian domain.
• Improve response times to critical issues in emergency situations.
• Serve as a template for addressing similar challenges in other low-resource languages.
• Enable cross-validation of audio data.
• Generate dashboards and access previously unprocessed insights.
• Improve decision-making for resource allocation.
• Ensure replicability for other languages and operations.

Use Case Status: In progress.

Partners: Kobo organization, University of Kurdistan - Hewler

2.2 Benefits of the use case

Industry, Innovation and Infrastructure

This project contributes to innovation by establishing a framework for enabling low-resource
languages to function in specialized domains, such as humanitarian audio data. Through the
development of a language model tailored for Kurdish and its major dialects (Sorani and
Kurmanji), it addresses a significant gap in digital infrastructure for processing unstructured
voice data. The project fine-tunes existing automatic speech recognition (ASR) models using
both publicly available and domain-specific data, improving the efficiency of data processing
and promoting technological advancement in underrepresented linguistic communities.
