2.3 Future Works
This project aims to build an AI-powered AAC device to address communication challenges
for individuals with speech disabilities such as dysarthria.
The proposed methodology outlines a comprehensive approach that leverages a speech
encoder and a text encoder in an end-to-end voice conversion system [3]:
1 TTS System Training: A sequence-to-sequence (seq2seq) Text-to-Speech (TTS) system,
specifically based on the Tacotron architecture, is first trained on transcribed normal
speech. This system comprises an encoder, an attention mechanism, and a decoder (see
the TTS skeleton sketch after this list).
2 Cross-Modal KD for Speech-Encoder Training: A teacher-student framework is employed
for knowledge distillation (KD). The text encoder from the pre-trained TTS system acts as
the "teacher", and a separate speech encoder (the "student") is trained on transcribed
dysarthric speech. The goal is to train the speech encoder to generate linguistic
representations (spectral embeddings) from dysarthric speech that are similar to the
representations (character embeddings) the text encoder produces from text (see the
distillation sketch after this list).
3 End-to-end voice conversion: The trained speech encoder replaces the text encoder and
is concatenated with the attention and decoder modules from the original TTS system,
forming the final end-to-end voice conversion system. During conversion, this system
takes the dysarthric speech's spectral features as input, generates spectral embeddings
via the speech encoder, and then uses the TTS attention and decoder modules to predict
a normal mel-spectrogram. A WaveRNN vocoder synthesizes the final waveform (see the
conversion pipeline sketch after this list).
4 Dataset: The UASpeech dataset will be used in the cross-modal knowledge distillation
phase to train the speech encoder. UASpeech is a dataset of dysarthric speech (i.e.,
speech impaired by motor disorders) developed by the University of Illinois at
Urbana-Champaign (UIUC); it contains speech from 15 dysarthric speakers and 13 control
(non-dysarthric) speakers.
5 Fine-tuning: A speech-text dataset of dysarthric speech will be collected from people
with speech disabilities at hospitals in Nigeria with which affiliations will be pursued. This
dataset will be used to fine-tune the pre-trained voice conversion model (see the
fine-tuning sketch after this list).
6 Affiliations with Limi Hospital, Nigeria, or other hospitals in Nigeria will be actively pursued
to test and deploy the proposed solution.
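The sketch below illustrates, in PyTorch (an assumed framework, not specified in the text), the three Tacotron-style components named in step 1: a text encoder producing character embeddings, an attention mechanism, and a decoder predicting mel-spectrogram frames. All layer choices and sizes are illustrative simplifications; the real Tacotron adds prenet/CBHG modules and decodes autoregressively.

```python
import torch
import torch.nn as nn

class TinyTacotron(nn.Module):
    """Illustrative seq2seq TTS skeleton: encoder -> attention -> decoder."""
    def __init__(self, vocab=40, emb=256, n_mels=80):
        super().__init__()
        self.text_encoder = nn.Embedding(vocab, emb)   # character embeddings
        self.attention = nn.MultiheadAttention(emb, num_heads=4,
                                               batch_first=True)
        self.decoder = nn.GRU(emb, emb, batch_first=True)
        self.mel_out = nn.Linear(emb, n_mels)          # mel-spectrogram frames

    def forward(self, chars, n_frames):
        memory = self.text_encoder(chars)              # (B, T_text, emb)
        # Zero queries stand in for decoder states; the real Tacotron
        # attends autoregressively, one frame at a time.
        queries = torch.zeros(chars.size(0), n_frames, memory.size(-1))
        context, _ = self.attention(queries, memory, memory)
        hidden, _ = self.decoder(context)
        return self.mel_out(hidden)                    # (B, n_frames, n_mels)

tts = TinyTacotron()
mels = tts(torch.randint(0, 40, (2, 30)), n_frames=100)
print(mels.shape)  # torch.Size([2, 100, 80])
```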
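Step 2's cross-modal distillation can be sketched as follows. The SpeechEncoder layers, the MSE objective, and the assumption that student and teacher sequences have already been aligned to a common length are all illustrative choices, not details given in the text.

```python
import torch
import torch.nn as nn

EMB_DIM = 256  # shared embedding size for teacher and student (assumed)

class SpeechEncoder(nn.Module):
    """Student: maps dysarthric mel-spectrogram frames to embeddings."""
    def __init__(self, n_mels=80, emb_dim=EMB_DIM):
        super().__init__()
        self.rnn = nn.LSTM(n_mels, emb_dim, batch_first=True,
                           bidirectional=True)
        self.proj = nn.Linear(2 * emb_dim, emb_dim)

    def forward(self, mels):                  # mels: (B, T, n_mels)
        out, _ = self.rnn(mels)
        return self.proj(out)                 # (B, T, emb_dim)

def distillation_loss(student_emb, teacher_emb):
    """MSE between the student's spectral embeddings and the teacher's
    character embeddings; assumes both sequences were aligned to a
    common length beforehand (e.g. via forced alignment)."""
    return nn.functional.mse_loss(student_emb, teacher_emb)

student = SpeechEncoder()
optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)

# Dummy batch: in practice, mels come from UASpeech utterances and
# teacher_emb from the frozen TTS text encoder run on the transcripts.
mels = torch.randn(4, 120, 80)
teacher_emb = torch.randn(4, 120, EMB_DIM)

optimizer.zero_grad()
loss = distillation_loss(student(mels), teacher_emb)
loss.backward()
optimizer.step()
print(f"distillation loss: {loss.item():.4f}")
```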
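For step 3, a minimal sketch of how the converted system chains its modules at inference time. The nn.Identity stand-ins keep the example runnable; the real system would plug in the trained speech encoder, the Tacotron attention/decoder, and a WaveRNN vocoder.

```python
import torch
import torch.nn as nn

class VoiceConversionSystem(nn.Module):
    """Speech encoder + TTS attention/decoder + vocoder, chained."""
    def __init__(self, speech_encoder, attention_decoder, vocoder):
        super().__init__()
        self.speech_encoder = speech_encoder        # trained student
        self.attention_decoder = attention_decoder  # from pre-trained TTS
        self.vocoder = vocoder                      # WaveRNN-style vocoder

    @torch.no_grad()
    def convert(self, dysarthric_mels):
        # 1. Spectral features -> linguistic (spectral) embeddings.
        embeddings = self.speech_encoder(dysarthric_mels)
        # 2. Embeddings -> "normal" mel-spectrogram via TTS attention/decoder.
        normal_mels = self.attention_decoder(embeddings)
        # 3. Mel-spectrogram -> waveform.
        return self.vocoder(normal_mels)

system = VoiceConversionSystem(nn.Identity(), nn.Identity(), nn.Identity())
waveform = system.convert(torch.randn(1, 120, 80))
print(waveform.shape)  # placeholder; a real vocoder returns audio samples
```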
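For step 5, one plausible fine-tuning recipe (an assumption, not prescribed by the text) is to freeze the TTS attention/decoder and update only the speech encoder on the collected speech-text pairs at a small learning rate. The tiny modules below are placeholders standing in for the real pre-trained components.

```python
import torch
import torch.nn as nn

speech_encoder = nn.LSTM(80, 256, batch_first=True)  # stand-in student
attention_decoder = nn.Linear(256, 80)               # stand-in decoder

for p in attention_decoder.parameters():             # keep decoder fixed
    p.requires_grad = False

optimizer = torch.optim.Adam(speech_encoder.parameters(), lr=1e-5)
loss_fn = nn.L1Loss()  # e.g. L1 on predicted vs. target mel frames

# Dummy fine-tuning step; real batches would pair dysarthric mels with
# target "normal" mels derived from the collected clinical transcripts.
mels = torch.randn(2, 100, 80)
target_mels = torch.randn(2, 100, 80)

optimizer.zero_grad()
hidden, _ = speech_encoder(mels)
loss = loss_fn(attention_decoder(hidden), target_mels)
loss.backward()                                      # grads flow to encoder only
optimizer.step()
print(f"fine-tuning loss: {loss.item():.4f}")
```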
3 Use Case Requirements
• REQ-01: It is critical that the system implements an end-to-end voice conversion model
capable of reconstructing unintelligible speech.
• REQ-02: It is critical that the system features multimodal input, that is, audio and text.
• REQ-03: It is critical that the system is trained on dysarthric or other non-standard
speech data, enabling effective speech conversion and reconstruction where an ASR
system trained on standard speech would typically fail.
• REQ-04: It is critical that the system implements data privacy and encryption measures
to protect sensitive user data.
• REQ-05: It is critical that the system strikes a balance between generalization and
personalization for its users.