Computational Language Documentation by 2025 (CLD 2025)

  • Contact:

    Zhaolin Li, Jan Niehues

  • Funding:

    Deutsche Forschungsgemeinschaft (DFG)

There are approximately 7000 languages spoken today, but around two-thirds of them have fewer than 1000 speakers and are at risk of disappearing. For linguists, manually documenting endangered languages is labour-intensive and difficult to perform with the required level of consistency. Language technologies based on deep learning have shown the potential to help with such time-consuming and difficult tasks. Still, these technologies are little used in language documentation, and there are few case studies demonstrating their practical usefulness in low-resource settings. Therefore, the CLD 2025 project aims to put language technologies into practice through the co-construction of models and tools by field linguists and computational linguists, and through the development of interfaces and systems that field linguists can use in their real work.

In this project, we focus mostly on techniques for automatic speech transcription. We will work on time-alignment techniques that link transcriptions to timing information in the audio recordings. In addition, using the pronunciation dictionaries and variants provided by our partners, we will work on phone set discovery and phone recognition. We will also investigate how to leverage tone information in speech transcription. Specifically, we propose to design tone models for two language documentation tasks: unsupervised word segmentation, and phonemic and tonal transcription. For endangered languages, acquiring sufficient data is difficult and sometimes impossible. Therefore, we aim to develop and improve the above techniques for low-resource scenarios.