Efficiently Upgrading Large Language Models to Support New Languages

  • Subject: Large Language Models
  • Type: Master's thesis
  • Supervisor:

    Danni Liu

  • Add on:

     

    Keywords: continual learning; multilinguality; large language models

     

    Abstract: Large language models (LLMs; Brown et al., 2020; Scao et al., 2022; Touvron et al., 2023) have shown impressive zero- and few-shot learning capabilities on a wide variety of tasks. Despite their promising results in English, these models often suffer from performance degradation in non-English languages, especially low-resource ones (Robinson et al., 2023).

     

    The limited representation power for non-English languages is partly a matter of training data. For instance, nearly 90% of the training data for Llama (Touvron et al., 2023) is English. The other factor is model architecture. Like other NLP models, LLMs have a finite-size vocabulary. When processing languages whose writing systems are poorly covered by this vocabulary, the model often falls back to character- or byte-level representations. This creates both computational and modeling challenges. Computationally, the inputs become much longer token sequences, which increases the models' memory footprint. On the modeling side, longer sequences also make it harder to capture the long-range dependencies essential for learning languages.
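
    To make the sequence-length issue concrete, the following minimal sketch compares the tokenized length of an English sentence with that of a sentence in a script the subword vocabulary covers poorly. The tokenizer choice (gpt2, a byte-level BPE tokenizer) and the example sentences are illustrative assumptions.

    ```python
    # Minimal sketch: how poorly covered scripts inflate tokenized sequence length.
    # Tokenizer choice (gpt2) and example sentences are illustrative assumptions.
    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # byte-level BPE vocabulary

    examples = {
        "English": "The weather is very nice today.",
        # Amharic (Ge'ez script); a rough equivalent, for illustration only.
        "Amharic": "ዛሬ የአየር ሁኔታው በጣም ጥሩ ነው።",
    }

    for language, text in examples.items():
        token_ids = tokenizer(text)["input_ids"]
        # Scripts missing from the subword vocabulary decompose into many
        # byte-level tokens, yielding much longer sequences for similar content.
        print(f"{language}: {len(token_ids)} tokens")
    ```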

     

    To address these challenges, prior work has looked into vocabulary expansion (Wang et al., 2020; Ebrahimi & Kann, 2021). This approach enlarges the already parameter-heavy token embedding table, which accounts for a large share of the total model parameters, and is therefore far from ideal for both memory and speed. Furthermore, existing works (Ebrahimi & Kann, 2021; Yong et al., 2023) mostly focus on sequence classification rather than generation tasks. Generation tasks are much more difficult, however, as they involve a classification problem at every generation time step.
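
    For context, the sketch below shows what vocabulary expansion typically involves with a Hugging Face causal LM: every added token grows the embedding table (and the tied output projection). The model name and the added tokens are illustrative assumptions.

    ```python
    # Minimal sketch of vocabulary expansion; model and tokens are illustrative.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    # Register new surface forms for an under-covered language
    # (two Amharic words here, purely for illustration).
    num_added = tokenizer.add_tokens(["ዛሬ", "ሁኔታ"])

    # The input embedding table (and the tied LM head) must grow by num_added
    # rows, which is exactly the memory and speed cost discussed above.
    model.resize_token_embeddings(len(tokenizer))
    print(num_added, model.get_input_embeddings().weight.shape)
    ```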

     

    The goal of this thesis is to fill this gap in the literature by investigating methods for efficiently extending LLMs to new languages, with a focus on generation tasks for languages that are poorly covered by the existing model vocabulary.

     

    Possible research directions include, but are not limited to: visual text representations (Rust et al., 2022); efficiently reusing existing token embeddings via parameter-efficient finetuning (e.g., LoRA; Hu et al., 2022).
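
    As an example of the second direction, the sketch below wraps a causal LM with LoRA adapters via the PEFT library, so that only a small number of low-rank parameters are trained while the existing embeddings stay frozen. The base model, target modules, and hyperparameters are illustrative assumptions, not recommendations.

    ```python
    # Minimal LoRA sketch with the PEFT library (Hu et al., 2022).
    # Base model, target modules, and hyperparameters are illustrative choices.
    from transformers import AutoModelForCausalLM
    from peft import LoraConfig, get_peft_model

    model = AutoModelForCausalLM.from_pretrained("gpt2")

    lora_config = LoraConfig(
        r=8,                        # rank of the low-rank update matrices
        lora_alpha=16,              # scaling factor for the update
        target_modules=["c_attn"],  # attention projections in GPT-2
        lora_dropout=0.05,
        task_type="CAUSAL_LM",
    )

    model = get_peft_model(model, lora_config)
    # Only the small adapter matrices are trainable; the base model stays frozen,
    # including the existing token embeddings being reused.
    model.print_trainable_parameters()
    ```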

     

    Requirements:

    Basic requirements: strong programming and debugging skills; knowledge of Python and PyTorch; knowledge of machine learning

     

    Preferred requirement: experience working on remote servers

     

    Literature:

    Brown et al., 2020, Language Models are Few-Shot Learners. NeurIPS.

     

    Ebrahimi & Kann, 2021, How to Adapt Your Pretrained Multilingual Model to 1600 Languages. ACL.

     

    Hu et al., 2022, LoRA: Low-Rank Adaptation of Large Language Models. ICLR.

     

    Robinson et al., 2023, ChatGPT MT: Competitive for High-(but not Low-) Resource Languages. arXiv preprint.

     

    Rust et al., 2022, Language Modelling with Pixels. ICLR.

     

    Scao et al., 2022, BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. arXiv preprint.

     

    Touvron et al., 2023, Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv preprint.

     

    Wang et al., 2020, Extending Multilingual BERT to Low-Resource Languages. Findings of EMNLP.

     

    Yong et al., 2023, BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting. ACL.