Table of Links
Abstract and 1 Introduction
2 Related Work
3 Model and 3.1 Associative memories
3.2 Transformer blocks
4 A New Energy Function
4.1 The layered structure
5 Cross-Entropy Loss
6 Empirical Results and 6.1 Empirical evaluation of the radius
6.2 Training GPT-2
6.3 Training Vanilla Transformers
7 Conclusion and Acknowledgments
Appendix A. Deferred Tables
Appendix B. Some Properties of the Energy Functions
Appendix C. Deferred Proofs from Section 5
Appendix D. Transformer Details: Using GPT-2 as an Example
References
5 Cross-Entropy Loss
We now analyze the loss of Transformer networks. The cross-entropy loss, which measures the discrepancy between the predicted probabilities and the actual labels, is commonly used for training Transformer models. The attention mechanism includes a softmax operation that outputs a probability distribution p ∈ ∆^n. In practice, the final softmax output is fed into a task-specific layer for downstream tasks such as prediction and classification. We therefore compare the last softmax output of the transformer blocks with the target distribution.
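As a concrete illustration (not the paper's implementation), the sketch below computes the cross-entropy between a model's final softmax output and a target distribution; the names `logits` and `target`, and the example values, are assumptions made for this snippet.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    expz = np.exp(z)
    return expz / expz.sum(axis=-1, keepdims=True)

def cross_entropy(target: np.ndarray, predicted: np.ndarray, eps: float = 1e-12) -> float:
    """H(target, predicted) = -sum_i target_i * log(predicted_i)."""
    return float(-(target * np.log(predicted + eps)).sum())

# Hypothetical example: final-layer logits over a vocabulary of size 5.
logits = np.array([2.0, 1.0, 0.5, -1.0, 0.0])
p = softmax(logits)                            # model's output distribution in the simplex
target = np.array([1.0, 0.0, 0.0, 0.0, 0.0])   # one-hot target label
print(cross_entropy(target, p))
```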
We have the following result regarding the cross-entropy loss.
Remark 2 The cross-entropy can be written as
When the model is severely over-parameterized, the energy function can closely approximate the energy of the sample distribution. In this case, the minimal cross-entropy equals the entropy of the training samples.
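To make this concrete, here is the standard identity for a generic energy-based model; the notation (energy E, partition function Z, data distribution q, model distribution p) is assumed for illustration and is not necessarily the paper's exact statement.

```latex
% Illustrative energy-based-model identity (generic notation):
% the model distribution induced by an energy E with partition function Z is
%   p(x) = e^{-E(x)} / Z,  where  Z = \sum_x e^{-E(x)}.
\begin{aligned}
  H(q, p) &= -\sum_x q(x) \log p(x)
           = \mathbb{E}_{x \sim q}\!\left[ E(x) \right] + \log Z, \\
  H(q, p) &= H(q) + D_{\mathrm{KL}}(q \,\Vert\, p) \;\ge\; H(q),
\end{aligned}
```

with equality exactly when p = q, so if the learned energy reproduces the sample distribution, the minimal cross-entropy equals the entropy of the training samples, consistent with the remark above.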
Next, we take a closer look at the layer partition function. We have
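As a small numerical illustration (the energies below are made-up values, and this is not the paper's layer partition function), the log of a partition function over a finite set of patterns can be computed stably with a log-sum-exp:

```python
import numpy as np

# Hypothetical energies E(x) over a small finite set of patterns.
energies = np.array([0.3, 1.2, 0.7, 2.5, 0.1])

# log Z, with Z = sum_x exp(-E(x)), computed stably via log-sum-exp.
neg_E = -energies
m = neg_E.max()
log_Z = m + np.log(np.exp(neg_E - m).sum())

# Boltzmann distribution induced by the energies; probabilities sum to 1.
probs = np.exp(neg_E - log_Z)
print(log_Z, probs.sum())
```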
In Table 2 in Appendix A, we compare the cross-entropy losses reported for various transformer-based models in the literature. Typically, a family of models spanning a range of sizes is reported, and we select the largest one from each family. We observe that similar cross-entropy losses are achieved across a wide range of architectural shapes (including depth, width, number of attention heads, feed-forward dimensions, and context lengths). Nevertheless, the losses all satisfy L > 1.
Remark 3 We note that some models add auxiliary regularization terms, such as the z-loss (Chowdhery et al., 2023; Yang et al., 2023), during training. In these cases, the scaling laws should take the additional terms into account. Moreover, modifications to the transformer blocks, such as additional layer normalization, may contribute to the lower bound of the cross-entropy.
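For reference, a minimal sketch of combining cross-entropy with a z-loss-style auxiliary term is given below; the coefficient `z_loss_weight` and the use of the squared log-normalizer follow the form reported for PaLM (Chowdhery et al., 2023), but the exact formulation in any given model may differ.

```python
import numpy as np

def loss_with_z_loss(logits: np.ndarray, target_index: int, z_loss_weight: float = 1e-4) -> float:
    """Cross-entropy plus a z-loss-style penalty encouraging log Z to stay near zero.

    logits: unnormalized scores over the vocabulary for one prediction.
    target_index: index of the correct token.
    """
    m = logits.max()
    log_Z = m + np.log(np.exp(logits - m).sum())   # softmax normalizer, computed stably
    cross_entropy = log_Z - logits[target_index]   # -log softmax(logits)[target_index]
    z_loss = z_loss_weight * log_Z ** 2            # auxiliary regularization term
    return float(cross_entropy + z_loss)

# Hypothetical usage with made-up logits and target.
print(loss_with_z_loss(np.array([2.0, -0.5, 1.0, 0.3]), target_index=0))
```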