Comparison of Linguistic Language Classification based on Origin and Data Driven Language Classification using the IPA and Clustering

Abstract

Language similarity is very useful for enriching data in both Natural Language Processing (NLP) and Automatic Speech Recognition (ASR). A clustering algorithm could provide an efficient means to define language similarity in a data-driven way. This research investigates the relation between linguistic classification by origin and data-driven classification based on the pronunciation of languages, using k-means clustering with a focus on the Indo-European languages. The results show large variation in cluster results and, consequently, large variation in correspondence with the linguistic classification. This is caused by a relatively even spread of the data over the feature space. Still, the results indicate a significant relation between the two classification methods. Furthermore, this research serves as a foundation and a source of inspiration for future research.
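
The general idea of comparing a pronunciation-based clustering with a classification by origin can be illustrated with a minimal sketch. This is not the pipeline used in this research: the four "languages", their transcriptions, the six-symbol inventory, and the family labels are invented toy values, and the helper names (`phoneme_profile`, `kmeans`, `purity`) and the use of cluster purity as the agreement measure are assumptions made only for illustration.

```python
import math
from collections import Counter


def phoneme_profile(transcription, inventory):
    """Relative frequency of each inventory symbol in a transcription string."""
    counts = Counter(ch for ch in transcription if ch in inventory)
    total = sum(counts.values()) or 1  # avoid division by zero
    return [counts[symbol] / total for symbol in inventory]


def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm, seeded with the first k points (deterministic)."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def purity(labels, families):
    """Share of items that belong to the majority family of their cluster."""
    correct = 0
    for c in set(labels):
        in_cluster = Counter(f for lab, f in zip(labels, families) if lab == c)
        correct += in_cluster.most_common(1)[0][1]
    return correct / len(labels)


# Invented toy data: (name, transcription, family). Not real language data.
INVENTORY = "aiuptk"
samples = [
    ("X1", "papatapa", "F1"),
    ("Y1", "kikuiki", "F2"),
    ("X2", "tapapata", "F1"),
    ("Y2", "ukikiku", "F2"),
]

points = [phoneme_profile(text, INVENTORY) for _, text, _ in samples]
families = [family for _, _, family in samples]

labels = kmeans(points, k=2)
score = purity(labels, families)
print(labels)  # [0, 1, 0, 1]
print(score)   # 1.0
```

On this toy input the two clusters coincide with the two invented families, so purity is 1.0. With real data spread evenly over the feature space, as reported above, k-means assignments depend strongly on initialization, and an agreement score of this kind would be expected to vary between runs.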
