Analyzing Multilingual Representations in Large Language Models

  • Research topic: Large Language Models
  • Type: Bachelor's thesis
  • Supervision:

    Danni Liu

  • Student: Ao Zuo
  • Additional information:

    Open for bachelor thesis (English only)

    Keywords: multilinguality; large language models; representation learning

    Abstract: Multilingual NLP models are known to suffer from the “curse of multilinguality”: adding more languages to a model eventually leads to capacity bottlenecks and performance degradation. This observation is consistent across various NLP tasks, including multilingual pretraining (Conneau et al., 2020) and machine translation (Aharoni et al., 2019). Recently, large language models (LLMs) have shown tremendous progress on a wide range of NLP tasks. However, how different languages are represented within these models is still not well understood, and whether the “curse of multilinguality” still applies remains to be validated. The focus of this thesis is to analyze the hidden representations of different languages (e.g., high-resource vs. low-resource) across different LLMs (e.g., multilingually trained models like BLOOM (Scao et al., 2022) vs. English-centric models like LLaMA (Touvron et al., 2023)). Candidate analysis techniques include, but are not limited to, SVCCA (Raghu et al., 2017) and probing (Adi et al., 2017); minimal sketches of both are given below.
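
    To make the proposed analysis concrete, here is a minimal sketch of one possible starting point, not a prescribed pipeline: it extracts per-sentence hidden states from a Hugging Face causal LM and compares two languages with SVCCA. The model name ("bigscience/bloom-560m"), the layer choice, mean pooling, and the helper names (hidden_states, svcca) are illustrative assumptions.

        # Sketch only: hidden-state extraction + SVCCA similarity (Raghu et al., 2017).
        import numpy as np
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        def hidden_states(model, tokenizer, sentences, layer=-1):
            """Mean-pooled hidden states of one layer; one row per sentence."""
            rows = []
            for s in sentences:
                inputs = tokenizer(s, return_tensors="pt")
                with torch.no_grad():
                    out = model(**inputs, output_hidden_states=True)
                rows.append(out.hidden_states[layer][0].mean(dim=0).numpy())
            return np.stack(rows)  # shape: (num_sentences, hidden_dim)

        def svcca(X, Y, keep=0.99):
            """SVCCA: SVD-reduce each view, then CCA; returns mean canonical correlation."""
            def svd_reduce(A):
                A = A - A.mean(axis=0)  # center before SVD
                U, S, _ = np.linalg.svd(A, full_matrices=False)
                k = int(np.searchsorted(np.cumsum(S**2) / np.sum(S**2), keep)) + 1
                return U[:, :k] * S[:k]  # keep directions covering `keep` of the variance
            # Canonical correlations are the singular values of Qx^T Qy,
            # where Qx, Qy are orthonormal bases of the two (centered) views.
            Qx, _ = np.linalg.qr(svd_reduce(X))
            Qy, _ = np.linalg.qr(svd_reduce(Y))
            rho = np.linalg.svd(Qx.T @ Qy, compute_uv=False)
            return float(rho.mean())  # in [0, 1]; higher = more similar subspaces

        # Toy usage; a real analysis would use hundreds of parallel sentences.
        tok = AutoTokenizer.from_pretrained("bigscience/bloom-560m")
        lm = AutoModelForCausalLM.from_pretrained("bigscience/bloom-560m")
        en = hidden_states(lm, tok, ["The cat sleeps.", "It rains today.", "I like tea."])
        de = hidden_states(lm, tok, ["Die Katze schläft.", "Es regnet heute.", "Ich mag Tee."])
        print(svcca(en, de))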
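
    Likewise, a minimal probing sketch in the spirit of Adi et al. (2017): a linear classifier is trained on frozen representations, and its held-out accuracy is read as evidence of how readily a property is encoded. The language-ID task and the reuse of `en`/`de` from the sketch above are illustrative assumptions; the thesis could probe any property of interest.

        # Sketch only: linear probe on frozen hidden states.
        import numpy as np
        from sklearn.linear_model import LogisticRegression
        from sklearn.metrics import accuracy_score
        from sklearn.model_selection import train_test_split

        def probe_accuracy(features, labels, seed=0):
            """Held-out accuracy of a logistic-regression probe."""
            X_tr, X_te, y_tr, y_te = train_test_split(
                features, labels, test_size=0.2, random_state=seed)
            clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
            return accuracy_score(y_te, clf.predict(X_te))

        # Toy usage: can a linear probe recover the sentence's language?
        # `en` / `de` come from the SVCCA sketch; real probes need far more data.
        features = np.concatenate([en, de])
        labels = [0] * len(en) + [1] * len(de)
        print(probe_accuracy(features, labels))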

    Literature:

    Alexis Conneau et al. 2020. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics.

    Roee Aharoni et al. 2019. Massively Multilingual Neural Machine Translation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.

    Teven Le Scao et al. 2022. BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. arXiv preprint arXiv:2211.05100.

    Hugo Touvron et al. 2023. LLaMA: Open and Efficient Foundation Language Models. arXiv preprint arXiv:2302.13971.

    Maithra Raghu et al. 2017. SVCCA: Singular Vector Canonical Correlation Analysis for Deep Learning Dynamics and Interpretability. In Advances in Neural Information Processing Systems 30.

    Yossi Adi et al. 2017. Fine-grained Analysis of Sentence Embeddings Using Auxiliary Prediction Tasks. In Proceedings of the 5th International Conference on Learning Representations.