Analyzing and comparing different self-supervised learning speech pre-trained models from the perspective of phonetics
Abstract
In this thesis, we analyzed and compared speech representations extracted from different frozen self-supervised learning (SSL) speech pre-trained models on two counts: how well they capture articulatory feature (AF) information, and how well that information predicts phoneme recognition performance in within-language and cross-language scenarios. Specifically, we compared the speech representations of three SSL pre-trained models: CPC, wav2vec 2.0, and HuBERT. First, frame-level AF probing tasks were implemented to analyze the AF information captured by each speech representation. Subsequently, phone-level ASR systems were implemented to analyze the phoneme recognition performance of these representations. The results showed that performance on the frame-level AF probing task and accuracy on the phoneme recognition task were correlated. Compared to the conventional MFCC representation, all SSL pre-trained speech representations captured more AF information and achieved better phoneme recognition performance in both within-language and cross-language scenarios, with HuBERT performing best. The frame-level AF probing task is thus a good predictor of phoneme recognition performance, underscoring the importance of capturing AF information in speech representations. Compared with MFCC, in the within-language scenario the SSL pre-trained models achieved a maximum relative improvement of 34.4% on the AF probing tasks, with a corresponding lowest phoneme error rate (PER) of 10.2%; in the cross-language scenario, a maximum relative improvement of 26.7% corresponded to a lowest PER of 23.0%.
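
To make the probing setup concrete, the following is a minimal sketch of a frame-level AF probing task on frozen SSL representations, assuming PyTorch and torchaudio. The frozen encoder here is torchaudio's pretrained HuBERT BASE bundle and the probe is a single linear classifier over per-frame features; names such as af_labels (frame-aligned AF targets, e.g. voiced vs. voiceless) are hypothetical placeholders, not the thesis's exact pipeline.

    # Sketch: linear probing of a frozen SSL model for one articulatory feature.
    import torch
    import torchaudio

    bundle = torchaudio.pipelines.HUBERT_BASE
    ssl_model = bundle.get_model().eval()        # frozen SSL encoder
    for p in ssl_model.parameters():
        p.requires_grad = False

    probe = torch.nn.Linear(768, 2)              # e.g. voiced vs. voiceless
    optim = torch.optim.Adam(probe.parameters(), lr=1e-3)

    def probe_step(waveform: torch.Tensor, af_labels: torch.Tensor) -> float:
        """One probe training step.
        waveform: (1, T) mono audio at the bundle's sample rate (16 kHz).
        af_labels: (num_frames,) AF class indices aligned to the encoder's
        ~20 ms frame rate (the frame alignment step is assumed, not shown).
        """
        with torch.no_grad():
            feats, _ = ssl_model.extract_features(waveform)
            frames = feats[-1].squeeze(0)        # (num_frames, 768), last layer
        logits = probe(frames)
        loss = torch.nn.functional.cross_entropy(
            logits, af_labels[: frames.size(0)])
        optim.zero_grad()
        loss.backward()
        optim.step()
        return loss.item()

Under this setup, the cross-language scenario would reuse the same frozen encoder and probe recipe, only swapping in frame-aligned AF labels from a language unseen during pre-training.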