Authors:
(1) Jianhui Pang, University of Macau (work done while Jianhui Pang and Fanghua Ye were interning at Tencent AI Lab);
(2) Fanghua Ye, University College London (work done while Jianhui Pang and Fanghua Ye were interning at Tencent AI Lab);
(3) Derek F. Wong, University of Macau;
(4) Longyue Wang, Tencent AI Lab, and corresponding author.
Table of Links
Abstract and 1 Introduction
2 Related Work
3 Anchor-based Large Language Models
3.1 Background
3.2 Anchor-based Self-Attention Networks
3.3 Anchor-based Inference
4 Experiments and 4.1 Our Implementation
4.2 Data and Training Procedure
4.3 Evaluation
5 Results
6 Analysis
7 Conclusion, Limitations, Ethics Statement, and References
A More Experimental Results
B Data Settings
4.2 Data and Training Procedure
AnLLMs are expected to predict subsequent tokens using only the key/value hidden states of anchor tokens as context, which presents a significant challenge for existing open-source LLMs. To address this, we replace the self-attention networks with the anchor-based self-attention networks detailed in Section 3.2 and continually pre-train the Llama2 model on a publicly available corpus.
Data. We employ the RedPajama-Data-1T-Sample dataset (Computer, 2023) for continual pre-training.[2] This dataset comprises 850,000 samples with approximately 1 billion tokens, which have been right-truncated to fit the model context of 4,096 tokens.
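The right truncation described above can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code: the function name `right_truncate` and the constant `MAX_CONTEXT` are hypothetical, and samples are assumed to be already tokenized into lists of integer ids.

```python
from typing import List

# Context length stated in the text; the name MAX_CONTEXT is an assumption.
MAX_CONTEXT = 4096


def right_truncate(token_ids: List[int], max_len: int = MAX_CONTEXT) -> List[int]:
    """Keep the first `max_len` tokens and drop everything to the right.

    Samples that are already short enough are returned unchanged.
    """
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    return token_ids[:max_len]


def total_tokens(samples: List[List[int]], max_len: int = MAX_CONTEXT) -> int:
    """Count the tokens that remain in a corpus after right truncation."""
    return sum(len(right_truncate(sample, max_len)) for sample in samples)
```

For scale, roughly 1 billion tokens spread over 850,000 samples works out to an average of about 1,200 tokens per sample, so truncation at 4,096 only affects the longer samples.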
Training Loss and Perplexity. The left-hand side of Figure 3 depicts the training loss associated with our models. The loss curves for AnLLM-EP and AnLLM-AC consistently decline to approximately 1.9, with AnLLM-AC achieving a lower loss. This observation suggests that continually pre-training an LLM using anchor-based attention masks is indeed viable, enabling the LLM to effectively learn the process of compressing sequence information into anchor tokens.
The right-hand side of Figure 3 displays the perplexity of the models at varying context lengths. Full attention is used to assess the language modeling capabilities of all models. Following the settings of Chen et al. (2023), perplexity is evaluated on the test samples of the Proof-Pile dataset (Rae et al., 2020). The results demonstrate that both the AnLLM-EP and AnLLM-AC models maintain promising performance, exhibiting language modeling capacity comparable to that of the base model, Llama2-7B. Moreover, this finding suggests that AnLLMs remain compatible with full attention, as indicated by the minimal degradation in perplexity.
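For readers relating the loss curves to the perplexity plot, the standard definition of perplexity is the exponential of the mean per-token negative log-likelihood. The sketch below illustrates that definition only; it is not the evaluation script used for Figure 3. The function name `perplexity` is hypothetical, and the inputs are assumed to be natural-log token probabilities produced by some language model.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """Return exp(mean negative log-likelihood) over the scored tokens.

    `token_log_probs` holds the natural-log probability that the model
    assigned to each ground-truth token.
    """
    if len(token_log_probs) == 0:
        raise ValueError("need at least one scored token")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# Sanity check: a model that assigns probability 1/4 to every token
# has a perplexity of 4 (up to floating-point error).
uniform_ppl = perplexity([math.log(0.25)] * 8)

# Assuming the reported training loss is a mean cross-entropy in nats,
# a loss of 1.9 corresponds to a perplexity of exp(1.9), about 6.7,
# on the training data.
train_ppl = perplexity([-1.9] * 10)
```

Note that the training loss is measured on the pre-training corpus, whereas the perplexity in Figure 3 is measured on Proof-Pile, so the two numbers are not directly comparable.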
[2] https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T-Sample