Table of Links
Abstract and 1 Introduction
2 Related Work
3 Model and 3.1 Associative memories
3.2 Transformer blocks
4 A New Energy Function
4.1 The layered structure
5 Cross-Entropy Loss
6 Empirical Results and 6.1 Empirical evaluation of the radius
6.2 Training GPT-2
6.3 Training Vanilla Transformers
7 Conclusion and Acknowledgments
Appendix A. Deferred Tables
Appendix B. Some Properties of the Energy Functions
Appendix C. Deferred Proofs from Section 5
Appendix D. Transformer Details: Using GPT-2 as an Example
References
Scaling laws. As discussed in the introduction, there is consistent empirical evidence that model performance improves as both model size and the volume of training data scale up (Kaplan et al., 2020; Khandelwal et al., 2019; Rae et al., 2021; Chowdhery et al., 2023). Extensive experiments have also explored neural scaling laws under various conditions, including constraints on the computational budget (Hoffmann et al., 2022b) and on data (Muennighoff et al., 2024), and instances of over-training (Gadre et al., 2024). These analyses decompose the expected risk into additive terms, leading to the following fit, where N is the number of model parameters and D is the number of training tokens:

$$\hat{L}(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}.$$

For Chinchilla models, the fitted parameters are (Hoffmann et al., 2022a)

$$\alpha = 0.34, \quad \beta = 0.28, \quad A = 406.4, \quad B = 410.7, \quad E = 1.69.$$
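To make the fit concrete, the following is a minimal sketch that evaluates the parametric form E + A/N^α + B/D^β commonly attributed to Hoffmann et al. (2022). The constants are the values reported there for Chinchilla and should be treated as assumptions of this example; the function name `chinchilla_loss` and the model sizes in the demo loop are hypothetical and chosen only for illustration.

```python
# Assumed constants: the parametric fit reported by Hoffmann et al. (2022).
E = 1.69        # irreducible loss term
A = 406.4       # coefficient of the model-size term
B = 410.7       # coefficient of the data-size term
ALPHA = 0.34    # exponent on the number of parameters N
BETA = 0.28     # exponent on the number of training tokens D


def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Evaluate the fitted loss E + A / N^alpha + B / D^beta."""
    if n_params <= 0 or n_tokens <= 0:
        raise ValueError("n_params and n_tokens must be positive")
    return E + A / n_params ** ALPHA + B / n_tokens ** BETA


# Illustrative (hypothetical) model and dataset sizes.
for n, d in [(1e8, 2e9), (1e9, 2e10), (1e10, 2e11)]:
    print(f"N={n:.0e}, D={d:.0e}: predicted loss {chinchilla_loss(n, d):.3f}")
```

Under this form the predicted loss decreases monotonically in both N and D and approaches E from above, so it cannot by itself produce non-monotonic scaling behavior.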
A line of research concerns the generalization of over-parameterized neural networks (Belkin et al., 2019; Nakkiran et al., 2021; Power et al., 2022). Recent experiments show that over-trained transformers exhibit inverted U-shaped scaling behavior (Murty et al., 2023), which cannot be explained by the empirical scaling laws.
Hopfield models. Classical Hopfield networks (Amari, 1972; Hopfield, 1982) were introduced as paradigmatic examples of associative memory. The network’s update dynamics define an energy function, whose fixed points correspond to the stored memories. A key measure is the number of patterns that the model can memorize, known as the network’s storage capacity. Modifications to the energy function (Krotov and Hopfield, 2016; Demircigil et al., 2017) yield higher storage capacities (see Table 1 in Appendix A). The original model operates on binary variables. The modern continuous Hopfield network (MCHN) (Ramsauer et al., 2020) generalizes the Hopfield model to the continuous domain, making it an appealing tool for understanding the attention mechanism in Transformers, which likewise takes real-valued vector embeddings as inputs. Given an input (e.g., a prompt), the Hopfield layer retrieves a memory by converging to a local minimum of the energy landscape, and its update rule corresponds closely to the query-key-value mechanism in attention. Krotov (2021) proposes a Hierarchical Associative Memory (HAM) model that describes the neural network with a single global energy function, as opposed to separate energy functions for individual layers.
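The correspondence between Hopfield retrieval and attention can be illustrated with a small sketch of the MCHN update in the form given by Ramsauer et al. (2020): the new state is a softmax-weighted average of the stored patterns, with weights given by the scaled inner products between the patterns and the current state. This is an illustration under stated assumptions, not the model analyzed in this paper: the helper names (`softmax`, `hopfield_update`, `retrieve`), the toy patterns, and the inverse temperature `beta` are all hypothetical choices for this example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def hopfield_update(patterns, state, beta=1.0):
    """One MCHN step: softmax-weighted average of the stored patterns.

    The stored patterns act as both keys and values, and the current
    state acts as the query, which mirrors a single attention read.
    """
    scores = [beta * sum(p_j * s_j for p_j, s_j in zip(p, state))
              for p in patterns]
    weights = softmax(scores)
    dim = len(state)
    return [sum(w * p[j] for w, p in zip(weights, patterns))
            for j in range(dim)]


def retrieve(patterns, query, beta=1.0, steps=3):
    """Iterate the update a fixed number of times from the query."""
    state = list(query)
    for _ in range(steps):
        state = hopfield_update(patterns, state, beta)
    return state


# Toy example (assumed data): three stored patterns and a noisy query.
stored = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
noisy_query = [0.9, 0.2, -0.1]
print(retrieve(stored, noisy_query, beta=8.0))
```

With a large `beta` the softmax weights concentrate on the stored pattern most similar to the query, so the state moves toward that single pattern; with a small `beta` the update instead blends several patterns together.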