Page 154 - AI for Good Innovate for Impact
                      2.3     Future Work

                      This project aims to build an AI-powered augmentative and alternative communication
                      (AAC) device to address communication challenges for individuals with speech
                      disabilities such as dysarthria.

                      The proposed methodology outlines a comprehensive approach that leverages a speech
                      encoder and a text encoder in an end-to-end voice conversion system [3]:

                      1    TTS System Training: A sequence-to-sequence (seq2seq) Text-to-Speech (TTS) system,
                           based on the Tacotron architecture, is first trained on transcribed normal speech.
                           This system comprises an encoder, an attention mechanism, and a decoder.
                      2    Cross-Modal Knowledge Distillation (KD) for Speech-Encoder Training: A teacher-
                           student framework is employed. The text encoder from the pre-trained TTS system acts
                           as the "teacher", while a separate speech encoder (the "student") is trained on
                           transcribed dysarthric speech. The goal is to train the speech encoder to generate
                           linguistic representations (spectral embeddings) from dysarthric speech that closely
                           match the representations (character embeddings) the text encoder produces from text.
                      3    End-to-end voice conversion: The trained speech encoder replaces the text encoder and
                           is concatenated with the attention and decoder modules from the original TTS system.
                           This forms the final end-to-end voice system. During conversion, this system takes the
                           dysarthric speech's spectral features as input, generates spectral embeddings via the
                           speech encoder, and then uses the TTS attention and decoder modules to predict a
                           normal mel-spectrogram. A WaveRNN vocoder synthesizes the final waveform.
                      4    Dataset: The UASpeech dataset will be used in the cross-modal knowledge distillation
                           phase to train the speech encoder. UASpeech is a corpus of dysarthric speech (i.e.,
                           speech impaired by motor disorders) developed by the University of Illinois at
                           Urbana-Champaign (UIUC). It contains speech from 15 dysarthric speakers and
                           13 control (non-dysarthric) speakers.
                      5    Fine-tuning: A paired speech-text dataset of dysarthric speech will be collected from
                           hospitals in Nigeria, for which affiliations will be pursued. This dataset will be
                           used to fine-tune the pre-trained voice conversion model.
                      6    Affiliations with Limi Hospital, Nigeria, or other hospitals in Nigeria will be actively pursued
                           in order to test and deploy the proposed solution.
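The distillation and conversion steps above (steps 2 and 3) can be sketched as follows. The single-layer linear "encoder" and "decoder" stand-ins, the dimensions, the learning rate, and the plain gradient-descent loop are all illustrative assumptions for clarity; they are not the Tacotron-based architecture or training procedure itself.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D_IN, D_EMB, D_MEL = 40, 80, 256, 80   # frames, spectral dim, embedding dim, mel bins

# Stand-ins for the pre-trained TTS text encoder ("teacher") and the
# speech encoder being trained ("student"): one linear layer each.
W_teacher = rng.normal(scale=0.1, size=(D_IN, D_EMB))
W_student = rng.normal(scale=0.1, size=(D_IN, D_EMB))

def encode(x, W):
    return x @ W                          # (T, D_IN) -> (T, D_EMB)

def kd_loss(student_emb, teacher_emb):
    # MSE between student spectral embeddings and teacher character
    # embeddings (assumed time-aligned to the same length T).
    return np.mean((student_emb - teacher_emb) ** 2)

# Step 2: distillation on one dummy utterance of spectral features.
x = rng.normal(size=(T, D_IN))            # dysarthric spectral features
teacher_emb = encode(x, W_teacher)        # teacher targets (kept fixed)
init_loss = kd_loss(encode(x, W_student), teacher_emb)
for _ in range(200):                      # plain gradient descent on the MSE
    student_emb = encode(x, W_student)
    grad = 2.0 / (T * D_EMB) * x.T @ (student_emb - teacher_emb)
    W_student -= 0.01 * grad
final_loss = kd_loss(encode(x, W_student), teacher_emb)

# Step 3: the trained student replaces the text encoder in front of the
# TTS attention/decoder stack (collapsed here into one linear map) to
# predict a "normal" mel-spectrogram; a vocoder such as WaveRNN would
# then synthesize the waveform from it.
W_decoder = rng.normal(scale=0.1, size=(D_EMB, D_MEL))
mel = encode(x, W_student) @ W_decoder    # (T, D_MEL) mel-spectrogram
print(init_loss, final_loss, mel.shape)
```

The key design point the sketch illustrates is that only the student encoder is updated during distillation; the teacher's embeddings and the downstream attention/decoder modules stay frozen, so the student learns to map dysarthric acoustics into the same embedding space the TTS system already decodes from.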


                      3      Use Case Requirements

                      •    REQ-01: It is critical that the system implements an end-to-end voice conversion model
                           capable of reconstructing unintelligible speech.
                      •    REQ-02: It is critical that the system features multimodal input, that is, audio and text.
                      •    REQ-03: It is critical that the system is trained on dysarthric or other non-standard
                           speech data, enabling effective speech conversion and reconstruction where an ASR
                           system trained only on standard speech would typically fail.
                      •    REQ-04: It is critical that the system implements data privacy and encryption measures
                           to protect the sensitivity of the data.
                      •    REQ-05: It is critical that the system strikes a balance between generalization and
                           personalization for its users.















