Meta researchers' goal with MobileLLM is ambitious: to show that, for smaller models, quality is not a direct product of how many billions of parameters they have, but rather the result of carefully designing their architecture. To prove their point, they coupled deep-and-thin architectures with embedding sharing and grouped-query attention to build four models of 125M, 350M, 600M, and 1B parameters that improve accuracy over prior state-of-the-art models.
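To give a sense of how one of these building blocks works, the snippet below is a minimal PyTorch sketch of grouped-query attention, in which groups of query heads share a single key/value head to cut the number of attention parameters. The class name and dimensions are illustrative, not MobileLLM's actual implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn

class GroupedQueryAttention(nn.Module):
    """Several query heads share one key/value head (illustrative sketch)."""
    def __init__(self, dim: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)  # fewer KV heads
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Repeat each KV head so every group of query heads attends to shared keys/values.
        rep = self.n_heads // self.n_kv_heads
        k, v = k.repeat_interleave(rep, dim=1), v.repeat_interleave(rep, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))

attn = GroupedQueryAttention(dim=576, n_heads=9, n_kv_heads=3)  # illustrative sizes
y = attn(torch.randn(1, 16, 576))
```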
MobileLLM shifts away from the generally accepted “scaling law” attributed to Kaplan et al., which relates improved performance to an increased number of parameters.
A prevalent belief (Kaplan et al., 2020) in the field suggests that the performance of transformer models is primarily determined by the number of parameters, the size of the training dataset, and the number of training iterations. […] Our experimental results, specifically for small models with limited model capacity, reveals that going deeper is more crucial than going wider for performance improvement.
Previously used for Meta TinyLlama, embedding sharing is a technique that consists of reusing the same weights for the input and output embedding layers, which reduces the overall number of weights and makes the model smaller. As Meta researchers explain, this technique is less effective for larger models, where input and output embeddings account for only a minimal portion of total parameters (e.g., 3.7% in LLaMA-70B). In contrast, for a 125M-parameter model, the embedding layers account for over 20% of parameters.
On a 30-layer 125M-parameter model,
sharing the input and output embeddings reduces the number of parameters by 16M, approximately 11.8% of total parameters with a 0.2 points drop in average accuracy. The marginal accuracy drop can be readily restored by reallocating the saved parameters to add more layers.
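Mechanically, embedding sharing is just weight tying. The following minimal PyTorch sketch, with illustrative vocabulary and hidden sizes rather than MobileLLM's actual configuration, shows how pointing the output projection at the input embedding's weight removes one of the two vocabulary-sized matrices from the parameter count:

```python
from torch import nn

class TiedEmbeddings(nn.Module):
    """Minimal sketch of input/output embedding sharing (weight tying)."""
    def __init__(self, vocab_size=32_000, dim=576, tie=True):  # illustrative sizes
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)           # tokens -> vectors
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)  # vectors -> logits
        if tie:
            # One weight matrix serves both layers, removing vocab_size * dim parameters.
            self.lm_head.weight = self.tok_emb.weight

def count_params(m: nn.Module) -> int:
    # nn.Module.parameters() counts a shared tensor only once
    return sum(p.numel() for p in m.parameters())

print(count_params(TiedEmbeddings(tie=False)), "->", count_params(TiedEmbeddings(tie=True)))
```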
Another technique aimed at maximizing weight utilization is immediate block-wise weight sharing, where weights are replicated between adjacent blocks. This increases effective depth, and thus accuracy, without increasing the model size and with only marginal latency overhead, since shared weights can stay in cache instead of being moved from memory; the researchers say this is especially relevant in scenarios where the main factor determining model latency is memory movement.
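A minimal sketch of the idea, assuming a stand-in feed-forward block in place of a real transformer decoder block, could look like this: each stored block is executed twice in a row, so the forward pass runs twice as deep as the number of unique weight sets.

```python
from torch import nn

class SharedBlockStack(nn.Module):
    """Sketch of immediate block-wise weight sharing: each block's weights are
    reused by the block immediately after it in the forward pass."""
    def __init__(self, block_factory, n_unique_blocks: int, repeats: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList([block_factory() for _ in range(n_unique_blocks)])
        self.repeats = repeats

    def forward(self, x):
        for block in self.blocks:
            for _ in range(self.repeats):  # same weights, adjacent "layers"
                x = block(x)
        return x

# Stand-in block for illustration; a real model would use transformer decoder blocks.
stack = SharedBlockStack(lambda: nn.Sequential(nn.Linear(576, 576), nn.GELU()),
                         n_unique_blocks=15)
```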
Leveraging these techniques and others, MobileLLM aims to define a strong baseline for designing optimized smaller models. Meta researchers ran a number of experiments comparing MobileLLM with previous state-of-the-art sub-billion-parameter models on several tasks, including zero-shot common-sense reasoning, question answering, and reading comprehension. For example, in zero-shot reasoning,
MobileLLM-LS-125M achieves comparable or even higher results than most previous 350M models. In the 350M model size category, MobileLLM surpasses previous state-of-the-art models by more than 4 points with comparable or smaller model sizes.
Analogous results hold in question answering and reading comprehension tasks.
Meta researchers say there is a growing need for large language models on mobile devices to reduce cloud costs and latency. They also highlight the increasing energy consumption and carbon-dioxide emissions of larger LLMs and argue for downsizing LLMs to make them more environmentally friendly. Shifting to on-device models, they say, may address these concerns while also improving model performance by cutting down on latency.
MobileLLM is available on Hugging Face.
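As a quick way to try it out, the snippet below loads one of the checkpoints with the transformers library; the model id facebook/MobileLLM-125M and the need for trust_remote_code=True are assumptions here, so check the model card on Hugging Face for the exact usage.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/MobileLLM-125M"  # assumed checkpoint name; see the Hugging Face model card
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

prompt = "Small language models can run on-device because"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```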