
AI for Good Innovate for Impact



               Use Case 6: Fine-Tuning ASR Language Models for Underrepresented Low-Resource Languages

               4.9: Accessibility









               Organization: United Nations High Commissioner for Refugees (UNHCR)

               Country: Iraq

               Contact Person(s):

                    Roshna Abdulrahman, abdulrar@unhcr.org
                    Sofia Kyriazi, kyriazis@unhcr.org
                    Rebeca Moreno Jimenez, morenoji@unhcr.org


               1      Use Case Summary Table


                Item                  Details
                Category              Accessibility

                Problem Addressed     In the Kurdistan Region, UNHCR receives a large volume of community
                                      feedback as unstructured text and audio recordings in Kurdish. These
                                      must be processed and assessed manually, which is time-consuming,
                                      so the data is often not fully analyzed.

                Key Aspects of Solution  The project will produce a Kurdish speech-to-text (STT) language
                                      model integrated into KoboToolbox, a survey tool used by humanitarian
                                      organizations globally. With transcription automated, colleagues can
                                      spend their time analyzing the data rather than transcribing it.

                Technology Keyword    LLMs, GenAI, GPTs, Data Collection, Data Analysis, Low-resource
                                      languages, ASR, STT, NLP
                Data Availability     Public:
                                      •  A multi-language, open-source voice dataset for training
                                         speech-enabled applications [2]
                                      •  An experimental dataset of Sorani Kurdish that could be used for
                                         speech recognition with CMUSphinx [1]
                                      Private data collection:
                                      •  Synthetic Storytelling recordings
                                      •  Interviews with Kurdish Speakers
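
The workflow described under "Key Aspects of Solution" can be illustrated with a minimal sketch: audio answers attached to survey submissions pass through an STT step, so analysts receive text instead of recordings. Everything here is an assumption made for illustration. The `Submission` fields, `transcribe_submissions`, and `placeholder_stt` are hypothetical names; they are not KoboToolbox's schema or API, and the placeholder function merely stands in for a fine-tuned Kurdish model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Submission:
    """One survey response. Field names are hypothetical, not KoboToolbox's schema."""
    submission_id: str
    audio_path: str        # path to the recorded Kurdish audio answer
    transcript: str = ""   # filled in by the STT step


def transcribe_submissions(
    submissions: List[Submission],
    stt: Callable[[str], str],
) -> List[Submission]:
    """Attach a transcript to every submission that does not have one yet."""
    for sub in submissions:
        if not sub.transcript:
            sub.transcript = stt(sub.audio_path)
    return submissions


def placeholder_stt(audio_path: str) -> str:
    """Stand-in for a fine-tuned Kurdish STT model (assumption); returns a dummy string."""
    return f"<transcript of {audio_path}>"


if __name__ == "__main__":
    batch = [
        Submission("s1", "answers/s1.wav"),
        Submission("s2", "answers/s2.wav", transcript="already transcribed"),
    ]
    for sub in transcribe_submissions(batch, placeholder_stt):
        print(sub.submission_id, sub.transcript)
```

Passing the STT function as a parameter keeps the survey-side logic independent of whichever model is eventually used, and submissions that already carry a transcript are left untouched.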



















