Large Language Models & Multilinguality: The field of Large Language Models (LLMs) has witnessed substantial advancements, particularly through models such as GPT-3 [Brown et al., 2020] and BERT [Devlin et al., 2018], which have set new benchmarks in language understanding and generation. These models utilize vast amounts of data to learn complex patterns and generate coherent text, but their primary limitation has been a predominant focus on English-language data. In response to the need to support global linguistic diversity, research has expanded into multilingual LLMs. Pioneering works such as mBERT [Devlin et al., 2018] and XLM-R [Conneau et al., 2020] have demonstrated significant potential in learning representations that generalize across languages. However, these models often struggle to balance performance across languages, especially for those underrepresented in training data [Conneau et al., 2020]. Further, as the number of languages grows, the scalability and efficiency of these models often degrade, necessitating more specialized architectures to handle linguistic diversity effectively [Smith et al., 2021].
Neural Machine Translation: Neural Machine Translation (NMT) has been integral to progress in multilingual model performance. Early NMT systems were limited by the complexity of their architectures and the quality of their translations, especially for low-resource languages [Wu et al., 2019]. Recent studies have revisited the core challenges of machine translation in the context of advanced Large Language Models (LLMs). The work by Koehn and Knowles [2017] offers insights into the ongoing relevance of challenges such as domain mismatch, rare-word prediction, and translation of long sentences, even as LLMs have shown significant improvements in these areas. Additionally, a study by Son and Kim [2023] examined the translation performance of LLMs from the user's perspective, highlighting their potential to improve the translation of long sentences while also identifying persistent challenges around domain mismatch and rare-word prediction. The work by Wu et al. [2016] on Google's neural machine translation system has also served as a benchmark for progress in this field, narrowing the gap between human and machine translation. More recently, Costa-jussà et al. [2022] showed that the Mixture of Experts architecture can be used effectively for Neural Machine Translation, yielding considerable gains in translation performance across many low-resource languages.
Mixture of Experts: Mixture of Experts (MoE) has emerged as a promising architecture for managing the computational costs of scaling up large language models (LLMs). Recent studies have explored the benefits of MoE in this context. Zhou et al. [2022] proposed Mixture-of-Experts with Expert Choice Routing, in which experts select the tokens they process rather than tokens selecting experts, enabling dynamic allocation of data among experts, letting each expert specialize, and yielding model sparsity. Similarly, Zoph [2022] investigated the design of effective sparse expert models, highlighting the importance of carefully balancing the number and size of experts to optimize performance. Additionally, Ott et al. [2022] introduced the OPT family of open pre-trained transformer language models, which target substantial improvements in training efficiency relative to comparably sized models. Furthermore, Zheng et al. [2019] explored the application of MoE on Chinese idiom datasets, demonstrating the potential of this approach to enhance language understanding tasks. These studies collectively suggest that MoE can serve as an effective choice for building highly capable and computationally efficient LLMs.
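To make the routing idea concrete, below is a minimal PyTorch sketch of a sparsely gated MoE feed-forward layer with token-choice top-k routing. The module names, dimensions, and routing scheme are illustrative assumptions for exposition, not details of SUTRA or of any cited system.

```python
# Minimal sketch of a sparsely gated Mixture-of-Experts feed-forward layer
# with token-choice top-k routing. Names and sizes are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, num_experts, bias=False)  # router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model) -> flatten tokens for routing
        tokens = x.reshape(-1, x.size(-1))
        scores = F.softmax(self.gate(tokens), dim=-1)           # (tokens, experts)
        weights, indices = scores.topk(self.top_k, dim=-1)      # each token picks k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)   # renormalize gate weights
        out = torch.zeros_like(tokens)
        for e, expert in enumerate(self.experts):
            for k in range(self.top_k):
                mask = indices[:, k] == e                       # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape_as(x)


# Usage: only top_k of num_experts experts run per token, so total capacity grows
# with num_experts while per-token compute stays roughly constant.
layer = MoEFeedForward(d_model=512, d_hidden=2048, num_experts=8, top_k=2)
y = layer(torch.randn(2, 16, 512))
```

The sparsity illustrated here is what makes MoE attractive for scaling: adding experts increases parameters (and capacity for, e.g., different languages or domains) without a proportional increase in inference cost.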
Multimodal LLMs: Researchers have also explored the potential of multimodal Large Language Models that can process and generate content across different modalities, such as text, images, and video. For example, the work by Dai et al. [2019] has investigated the use of multimodal models for tasks like image captioning and visual question answering, demonstrating their ability to leverage cross-modal information to enhance performance. Similarly, the study by Nichols and Warnow [2008] has explored the application of multimodal models in the context of computational linguistic phylogeny, highlighting their potential to uncover insights from diverse data sources. Additionally, the recent advancements in the field of multimodal machine translation, as discussed by Birch [2021], have shown the benefits of integrating visual information into language models to improve translation quality.
Online LLMs: Modern Large Language Models such as Llama 2, GPT-3.5, and GPT-4 have been engineered as comprehensive, open-domain chatbots capable of engaging in extended dialogues on a variety of topics. Yet they face a significant limitation: their training data is time-locked, imposing a knowledge cutoff date. As a result, these models sometimes generate responses that are plausible yet factually incorrect, diminishing the reliability of their output, as noted by Vu et al. [2023] and Press et al. [2022]; such inaccuracies are often linked to outdated information embedded in the model's parameters. A detailed list of knowledge cutoff dates for major models is shown in Table 1. While this can be partially addressed through additional training with human feedback or by incorporating knowledge-intensive tasks, scaling these solutions to accommodate real-time updates, such as changes in stock prices, remains challenging [Komeili et al., 2021]. In-context learning presents a promising alternative, allowing real-time data to be incorporated directly into the model's prompts to guide response generation. Although there are ongoing efforts to enhance LLMs with internet search results, effectively leveraging this external data to improve the accuracy of LLM outputs is still under development. In this context, SUTRA stands out by presenting a structured approach to response augmentation, providing the ability to learn, reason, and interpret information from various knowledge sources.
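As a concrete illustration of prompt-level augmentation with real-time data, the following Python sketch injects retrieved, time-stamped snippets into a prompt before generation. The snippet schema, field names, and prompt wording are hypothetical and shown only to convey the general pattern; they are not SUTRA's implementation or any specific search API.

```python
# Minimal sketch of in-context augmentation: retrieved, time-stamped snippets
# (e.g. from a web-search or news API) are placed in the prompt so the model
# can ground its answer in fresh information. Snippet fields are hypothetical.
from datetime import date


def build_augmented_prompt(question: str, snippets: list[dict]) -> str:
    """Compose a prompt that pairs the user question with retrieved context."""
    context = "\n".join(
        f"- ({s['published']}) {s['source']}: {s['text']}" for s in snippets
    )
    return (
        f"Today is {date.today().isoformat()}.\n"
        "Use only the context below to answer; say if it is insufficient.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Usage with hypothetical search results:
snippets = [
    {"published": "2024-05-01", "source": "ExampleWire",
     "text": "Acme Corp shares closed at $123.45."},
]
prompt = build_augmented_prompt("What is Acme Corp's latest share price?", snippets)
```

The design point is that freshness comes from the retrieval step rather than from the model's parameters, which sidesteps the knowledge-cutoff problem but shifts the burden to retrieving and selecting trustworthy context.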