Authors:
(1) Soham De, Google DeepMind, with equal contributions;
(2) Samuel L. Smith, Google DeepMind, with equal contributions;
(3) Anushan Fernando, Google DeepMind, with equal contributions;
(4) Aleksandar Botev, Google DeepMind, with equal contributions;
(5) George Cristian-Muraru, Google DeepMind, with equal contributions;
(6) Albert Gu, Work done while at Google DeepMind;
(7) Ruba Haroun, Google DeepMind;
(8) Leonard Berrada, Google DeepMind;
(9) Yutian Chen, Google DeepMind;
(10) Srivatsan Srinivasan, Google DeepMind;
(11) Guillaume Desjardins, Google DeepMind;
(12) Arnaud Doucet, Google DeepMind;
(13) David Budden, Google DeepMind;
(14) Yee Whye Teh, Google DeepMind;
(15) Razvan Pascanu, Google DeepMind;
(16) Nando De Freitas, Google DeepMind;
(17) Caglar Gulcehre, Google DeepMind.
Table of Links
1. Introduction
2. Model Architecture
3. Recurrent Models Scale as Efficiently as Transformers
3.1. Scaling curves
3.2. Evaluation on downstream tasks
4. Training Recurrent Models Efficiently on Device and 4.1. Model parallelism for large scale training
4.2. Efficient linear recurrences on device
4.3. Training speed on longer sequences
5. Inference Speed
5.1. A simple model of the decode step
5.2. Results
6. Long Context Modeling and 6.1. Improving next token prediction with longer contexts
6.2. Copy and retrieval capabilities
7. Related Works
8. Conclusion, Acknowledgements, and References
A. RG-LRU Recurrence Gate
B. Complex-Gated Linear Recurrent Unit (CG-LRU)
C. Model Scale Hyper-Parameters
D. Efficient Linear Recurrences on Device
E. The Local Attention Window Size of Griffin
F. Inference Speeds
G. Improving Next Token Prediction with Longer Contexts: Additional Results
H. Additional Details of the Copy and Retrieval Tasks
3.2. Evaluation on downstream tasks
In order to compare to other models in the literature, we train all our models for 300B tokens before evaluating on downstream tasks. The two external baselines we compare against are Mamba-3B (Gu and Dao, 2023), the strongest small recurrent model reported in the literature to date, and Llama-2 (Touvron et al., 2023), a widely used open Transformer model. Both external baselines were trained on significantly more than 300B tokens: Mamba was trained on 600B tokens, twice as many, and Llama-2 was trained on 2T tokens, nearly seven times as many. We note, however, that both Mamba and Llama-2 were trained on different datasets and with different hyper-parameter tuning strategies, which may partially explain our strong relative performance. We therefore also include our own MQA Transformer baseline, trained on the same data and with the same hyper-parameter tuning budget as Hawk and Griffin.
We provide an evaluation on downstream tasks in Table 1. We find that both Hawk and Griffin achieve very strong performance. In line with other works, we report character-normalized accuracy on MMLU, HellaSwag, PIQA, ARC-E and ARC-C, while we report absolute accuracy on WinoGrande with partial scoring. The performance of Hawk improves significantly as we increase the model size, and Hawk-3B achieves stronger performance on downstream tasks than Mamba-3B, despite being trained on half as many tokens. Griffin-3B significantly outperforms Mamba-3B, and Griffin-7B and Griffin-14B achieve performance competitive with Llama-2, despite being trained on nearly seven times fewer tokens. Hawk is also competitive with our MQA Transformer baseline, while Griffin outperforms this baseline.
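To make the two scoring rules above concrete, the sketch below shows one way character-normalized and absolute log-likelihood scoring could be implemented for a single multiple-choice example. The `log_likelihood` helper is hypothetical; it stands in for any function that returns a model's total log-probability of a continuation given a context. This is an illustrative sketch under those assumptions, not the evaluation code used in the paper.

```python
# Minimal sketch of the two multiple-choice scoring rules described above.
# `log_likelihood(context, continuation)` is a hypothetical helper returning
# the model's total log-probability of `continuation` given `context`.

def pick_answer_char_normalized(context, choices, log_likelihood):
    """Pick the choice with the highest log-likelihood divided by its length
    in characters (the character-normalized rule used for tasks like MMLU,
    HellaSwag, PIQA, ARC-E and ARC-C)."""
    scores = [
        log_likelihood(context, choice) / len(choice)  # normalize by character count
        for choice in choices
    ]
    return max(range(len(choices)), key=lambda i: scores[i])


def pick_answer_absolute(context, choices, log_likelihood):
    """Pick the choice with the highest raw (unnormalized) log-likelihood,
    as in absolute-accuracy scoring."""
    scores = [log_likelihood(context, choice) for choice in choices]
    return max(range(len(choices)), key=lambda i: scores[i])
```

For an ARC-style question, `context` would be the question text and `choices` the candidate answer strings; character normalization prevents longer answers from being penalized simply for containing more tokens. For WinoGrande-style partial scoring, the context would contain the sentence with each option substituted in, and only the continuation following the filled-in blank would be scored.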