Characterization of Speech Similarity Between Australian Aboriginal and High-Resource Languages: A Case Study on Dharawal

Ting Dang, Trini Manoj Jeyaseelan, Eliathamby Ambikairajah, Vidhyasaharan Sethu

Published: 2025/9/1

Abstract

Australian Aboriginal languages are of significant cultural and linguistic value but remain severely underrepresented in modern speech AI systems. While state-of-the-art speech foundation models and automatic speech recognition excel in high-resource settings, they often struggle to generalize to low-resource languages, especially those lacking clean, annotated speech data. In this work, we collect and clean a speech dataset for Dharawal, a low-resource Australian Aboriginal language, by carefully sourcing and processing publicly available recordings. Using this dataset, we analyze the speech similarity between Dharawal and 107 high-resource languages using a pre-trained multilingual speech encoder. Our approach combines (1) misclassification rate analysis to assess language confusability, and (2) fine-grained similarity measurements using cosine similarity and Fr\'echet Inception Distance (FID) in the embedding space. Experimental results reveal that Dharawal shares strong speech similarity with languages such as Latin, M\=aori, Korean, Thai, and Welsh. These findings offer practical guidance for future transfer learning and model adaptation efforts, and underscore the importance of data collection and embedding-based analysis in supporting speech technologies for endangered language communities.

Read Full Paper (arXiv.org)