Unsupervised Subword Modeling Using Autoregressive Pretraining and Cross-Lingual Phone-Aware Modeling
Abstract
This study addresses unsupervised subword modeling, i.e.,
learning feature representations that can distinguish subword
units of a language. The proposed approach adopts a two-stage
bottleneck feature (BNF) learning framework, consisting of autoregressive
predictive coding (APC) as a front-end and a DNN-BNF
model as a back-end. APC pretrained features are used as
input to the DNN-BNF model. A language-mismatched
ASR system is used to provide cross-lingual phone labels for
DNN-BNF model training. Finally, BNFs are extracted as the
subword-discriminative feature representation. A second aim of
this work is to investigate how robust the effectiveness of our approach
is to the amount of training data. The results on
Libri-light and the ZeroSpeech 2017 databases show that APC
is effective in front-end feature pretraining. Our whole system
outperforms the state of the art on both databases. Cross-lingual
phone labels for English data generated by a Dutch ASR outperform those
generated by a Mandarin ASR, possibly because Dutch is more similar
to English than Mandarin is. Our system is less
sensitive to the amount of training data when more than
50 hours are available. APC pretraining reduces the amount of training
material needed from over 5,000 hours to around 200 hours with
little performance degradation.
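To make the two-stage pipeline concrete, the sketch below illustrates the overall flow: an APC model is pretrained to predict future frames, its hidden representations are fed to a DNN-BNF classifier trained on cross-lingual phone labels, and the bottleneck activations are taken as the subword-discriminative features. This is a minimal PyTorch sketch under assumed settings (layer sizes, a 3-frame prediction shift, a 39-dimensional bottleneck, 120 phone classes); it is not the paper's exact configuration.

```python
# Minimal sketch of the two-stage BNF learning pipeline (assumed hyperparameters).
import torch
import torch.nn as nn

class APC(nn.Module):
    """Autoregressive predictive coding: predict a future frame from past frames."""
    def __init__(self, feat_dim=40, hidden_dim=512, num_layers=3):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim, hidden_dim, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, feat_dim)

    def forward(self, x):
        h, _ = self.rnn(x)              # (batch, time, hidden_dim)
        return self.proj(h), h          # predicted frames and hidden features

class DNNBNF(nn.Module):
    """Back-end DNN with a low-dimensional bottleneck, trained on phone labels."""
    def __init__(self, in_dim=512, bn_dim=39, num_phones=120):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, bn_dim),    # bottleneck layer -> BNFs
        )
        self.classifier = nn.Linear(bn_dim, num_phones)

    def forward(self, x):
        bnf = self.encoder(x)
        return self.classifier(bnf), bnf

# Stage 1: pretrain APC self-supervisedly by predicting frames k steps ahead.
k = 3
apc = APC()
opt = torch.optim.Adam(apc.parameters(), lr=1e-3)
feats = torch.randn(8, 200, 40)                  # stand-in for log-Mel input features
pred, _ = apc(feats[:, :-k])
loss = nn.functional.l1_loss(pred, feats[:, k:])
loss.backward(); opt.step()

# Stage 2: feed frozen APC features to the DNN-BNF model; in the paper the targets
# are cross-lingual phone labels produced by a language-mismatched ASR system
# (placeholder random labels here).
with torch.no_grad():
    _, apc_feats = apc(feats)
dnn = DNNBNF()
phone_labels = torch.randint(0, 120, (8, 200))
logits, _ = dnn(apc_feats)
ce = nn.functional.cross_entropy(logits.reshape(-1, 120), phone_labels.reshape(-1))
ce.backward()

# Finally, the bottleneck activations serve as the subword-discriminative features.
_, bnfs = dnn(apc_feats)                         # (batch, time, 39)
```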