Improving State-of-the-Art ASR Systems for Speakers with Dysarthria

Günther, M.

Applying Low-Rank Adaptation Transfer Learning to Whisper

Bachelor thesis (2024)

Authors

M. Günther (Electrical Engineering, Mathematics and Computer Science)

Contributors

Z. Yue (mentor)

Y. Zhang (mentor)

T. Durieux, Software Engineering (graduation committee member)

Faculty

Electrical Engineering, Mathematics and Computer Science

Transfer Learning, Automatic Speech Recognition, Dysarthria, Low-Rank Adaptation, Whisper Model

To reference this document use:

http://resolver.tudelft.nl/uuid:9e54041d-1ecf-42ea-823c-203a0dd0b3b1

Published Date

27-06-2024

Language

English

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Dysarthria is a speech disorder that limits an individual’s ability to articulate clearly, caused by weakening of the muscles involved in speech. Despite recent advances in Automatic Speech Recognition (ASR), recognising dysarthric speech remains a significant challenge because of the limited availability of dysarthric speech data, high speaker variability, and the mismatch between typical and dysarthric speech patterns. This paper addresses these challenges by using transfer learning and Low-Rank Adaptation (LoRA) to improve the performance of the state-of-the-art ASR model Whisper on dysarthric speech. By fine-tuning Whisper on the TORGO dataset, this study aims to adapt the pre-trained models to better recognise dysarthric speech patterns, thereby reducing the Word Error Rate (WER) and improving accessibility for individuals with speech impairments. Experimental results indicate that this approach can improve speech recognition performance: the Large-V2 and Large-V3 models and their corresponding distilled models all achieved a lower WER after fine-tuning. The Large-V3 model achieved the greatest relative WER reduction, 22.65%.
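For readers unfamiliar with the metric, the following is a minimal sketch of how WER and a relative WER reduction are commonly computed. It is not the evaluation code used in the thesis; the function names and the example sentences and rates are hypothetical and chosen purely for illustration. It assumes the standard definition of WER as the word-level edit distance (substitutions, deletions and insertions) divided by the number of reference words.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(wer_before: float, wer_after: float) -> float:
    """Relative reduction in percent: (before - after) / before * 100."""
    if wer_before <= 0:
        raise ValueError("wer_before must be positive")
    return (wer_before - wer_after) / wer_before * 100.0


# Hypothetical example: one substitution ("sat" -> "sit") and one deletion ("the").
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")

# Hypothetical rates, not results from the thesis.
example_reduction = relative_wer_reduction(0.40, 0.30)
```

Note that a relative reduction is measured against the baseline WER, so a relative figure such as the one reported above is not the same as an absolute drop of that many percentage points.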

Files

Mirella-Gu_nther-Final-Paper.p... (pdf)

(pdf | 1.22 Mb)

Unknown license