That Sounds Familiar

Żelasko, Piotr; Moro-Velazquez, Laureano; Hasegawa-Johnson, Mark; Scharenborg, O.E.; Dehak, Najim

doi:10.21437/Interspeech.2020-2513

That Sounds Familiar: an Analysis of Phonetic Representations Transfer Across Languages

Conference paper (2020)

Authors

Piotr Żelasko (Johns Hopkins University)

Laureano Moro-Velazquez (Johns Hopkins University)

Mark Hasegawa-Johnson (University of Illinois at Urbana-Champaign)

O.E. Scharenborg

Najim Dehak (Johns Hopkins University)

DOI: https://doi.org/10.21437/Interspeech.2020-2513

Speech recognition, Transfer learning, Crosslingual, Multilingual, Phone recognition, Zero-shot

To reference this document use:

http://resolver.tudelft.nl/uuid:ea68435f-57a5-4472-b522-f8c90dbd218d

Published Date

2020

Language

English

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Only a handful of the world’s languages have the abundant resources that enable practical applications of speech processing technologies. One way to overcome this problem is to use the resources available in other languages to train a multilingual automatic speech recognition (ASR) model, which, intuitively, should learn some universal phonetic representations. In this work, we focus on gaining a deeper understanding of how general these representations might be, and how recognition of individual phones improves in a multilingual setting. To that end, we select a phonetically diverse set of languages and perform a series of monolingual, multilingual, and crosslingual (zero-shot) experiments. The ASR model is trained to recognize International Phonetic Alphabet (IPA) token sequences. We observe significant improvements across all languages in the multilingual setting, and stark degradation in the crosslingual setting, where the model, among other errors, treats Javanese as a tone language. Notably, as little as 10 hours of target-language training data tremendously reduces ASR error rates. Our analysis shows that even phones unique to a single language can benefit greatly from adding training data from other languages, an encouraging result for the low-resource speech community.

Files

Piotr_etal.pdf (pdf | 0.672 Mb), license info not available

2513.pdf (pdf | 0.749 Mb), license info not available
