Comparing data augmentation and training techniques to reduce bias against non-native accents in hybrid speech recognition systems

Zhang, Yixuan; Zhang, Y.; Patel, T.B.; Scharenborg, O.E.

Conference paper (2022)

Authors

Yixuan Zhang (Student)

Y. Zhang (Student)

T.B. Patel

O.E. Scharenborg

Transfer learning; Bias; Data augmentation; Automatic speech recognition

To reference this document use:

http://resolver.tudelft.nl/uuid:ae08180e-f5f7-4ec1-85a6-1998dc6445b3

Published Date

2022

Language

English

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

An important obstacle to the wide deployment of Automatic Speech Recognition (ASR) is bias: ASR systems tend to generate more accurate predictions for certain speaker groups while making more errors on speech from other groups. We aim to reduce bias against non-native speakers of Dutch compared to native Dutch speakers. We investigate three data augmentation techniques (speed perturbation, volume perturbation, and pitch shift) to increase the amount of non-native accented Dutch training data, and use the augmented data with two transfer learning techniques, model fine-tuning and multi-task learning, to reduce bias in a state-of-the-art hybrid HMM-DNN Kaldi-based ASR system. Experimental results on Dutch read speech and human-machine interaction (HMI) speech showed that although individual data augmentation techniques did not always improve recognition performance, the combination of all three did. Importantly, bias was reduced by more than 18% absolute compared to the baseline system for read speech when applying pitch shift and multi-task training, and by more than 7% for HMI speech when applying all three data augmentation techniques during fine-tuning, while recognition accuracy improved for both native and non-native Dutch speech.
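
To make two of the augmentation techniques named in the abstract concrete, the following is a minimal sketch of speed and volume perturbation on a raw waveform. It is not the authors' pipeline: the function names `speed_perturb` and `volume_perturb`, the linear-interpolation resampler, the clipping range, and the example factors are all assumptions chosen for illustration, and pitch shift is left out because it needs a time-stretching step that does not fit a short example.

```python
import math


def speed_perturb(samples, factor):
    """Resample by linear interpolation so playback is `factor` times faster.

    Hypothetical helper: like a simple "speed" effect, this changes
    duration and pitch together (factor > 1 shortens the signal).
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    n_out = max(1, round(len(samples) / factor))
    last = len(samples) - 1
    out = []
    for i in range(n_out):
        pos = i * factor              # fractional read position in the input
        lo = min(int(pos), last)      # sample at or before the position
        hi = min(lo + 1, last)        # next sample, clamped at the end
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


def volume_perturb(samples, gain):
    """Scale amplitude by `gain`, clipping to [-1.0, 1.0].

    Assumes float samples normalised to that range (an assumption of
    this sketch, not something stated in the abstract).
    """
    return [max(-1.0, min(1.0, s * gain)) for s in samples]


# One second of a 440 Hz tone at 16 kHz stands in for an utterance.
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]

# Illustrative factors only; the abstract does not give the values used.
augmented = [speed_perturb(tone, f) for f in (0.9, 1.1)]
augmented.append(volume_perturb(tone, 0.5))
```

Each augmented copy keeps the transcript of the original utterance, which is how perturbation of this kind increases the amount of labelled training data without new recordings.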

Files

Zhang22_s4sg-1.pdf

(pdf | 1.11 MB)

Unknown license