Table of Links
Abstract and 1 Introduction
2 Related Work
3 Model and 3.1 Associative memories
3.2 Transformer blocks
4 A New Energy Function
4.1 The layered structure
5 Cross-Entropy Loss
6 Empirical Results and 6.1 Empirical evaluation of the radius
6.2 Training GPT-2
6.3 Training Vanilla Transformers
7 Conclusion and Acknowledgments
Appendix A. Deferred Tables
Appendix B. Some Properties of the Energy Functions
Appendix C. Deferred Proofs from Section 5
Appendix D. Transformer Details: Using GPT-2 as an Example
References
5 Cross-Entropy Loss
We now analyze the loss of Transformer networks. The cross-entropy loss, which measures the discrepancy between the predicted probabilities and the actual labels, is commonly used for training Transformer models. The attention mechanism includes a softmax operation that outputs a probability distribution p ∈ ∆^n. In practice, the final softmax output is fed into a task-specific layer for downstream tasks such as prediction and classification. We therefore compare the last softmax output of the transformer blocks with the target distribution.
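As a concrete illustration (not the paper's implementation), the sketch below computes the cross-entropy between a model's final softmax output and a target distribution; the names `logits` and `target`, and the example values, are assumptions made for this snippet.

```python
import numpy as np

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    expz = np.exp(z)
    return expz / expz.sum(axis=-1, keepdims=True)

def cross_entropy(target: np.ndarray, predicted: np.ndarray, eps: float = 1e-12) -> float:
    """H(target, predicted) = -sum_i target_i * log(predicted_i)."""
    return float(-(target * np.log(predicted + eps)).sum())

# Hypothetical example: final-layer logits over a vocabulary of size 5.
logits = np.array([2.0, 1.0, 0.5, -1.0, 0.0])
p = softmax(logits)                            # model's output distribution in the simplex
target = np.array([1.0, 0.0, 0.0, 0.0, 0.0])   # one-hot target label
print(cross_entropy(target, p))
```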
We have the following result regarding the cross-entropy loss.
Remark 2 The cross-entropy can be written as
When the model is severely over-parameterized, the energy function can closely approximate the energy of the sample distribution. In this case, the minimal cross-entropy equals the entropy of the training samples.
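To make this concrete, here is the standard identity for a generic energy-based model; the notation (energy E, partition function Z, data distribution q, model distribution p) is assumed for illustration and is not necessarily the paper's exact statement.

```latex
% Illustrative energy-based-model identity (generic notation):
% the model distribution induced by an energy E with partition function Z is
%   p(x) = e^{-E(x)} / Z,  where  Z = \sum_x e^{-E(x)}.
\begin{aligned}
  H(q, p) &= -\sum_x q(x) \log p(x)
           = \mathbb{E}_{x \sim q}\!\left[ E(x) \right] + \log Z, \\
  H(q, p) &= H(q) + D_{\mathrm{KL}}(q \,\Vert\, p) \;\ge\; H(q),
\end{aligned}
```

with equality exactly when p = q, so if the learned energy reproduces the sample distribution, the minimal cross-entropy equals the entropy of the training samples, consistent with the remark above.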
Next, we take a closer look at the layer partition function. We have
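As a small numerical illustration (the energies below are made-up values, and this is not the paper's layer partition function), the log of a partition function over a finite set of patterns can be computed stably with a log-sum-exp:

```python
import numpy as np

# Hypothetical energies E(x) over a small finite set of patterns.
energies = np.array([0.3, 1.2, 0.7, 2.5, 0.1])

# log Z, with Z = sum_x exp(-E(x)), computed stably via log-sum-exp.
neg_E = -energies
m = neg_E.max()
log_Z = m + np.log(np.exp(neg_E - m).sum())

# Boltzmann distribution induced by the energies; probabilities sum to 1.
probs = np.exp(neg_E - log_Z)
print(log_Z, probs.sum())
```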
In Table 2 in Appendix A, we compare the cross-entropy losses reported for various transformer-based models in the literature. Typically, a family of models spanning a range of sizes is reported, and we select the largest one from each family. We observe that similar cross-entropy losses are achieved across a wide range of architectural shapes (including depth, width, number of attention heads, feed-forward dimensions, and context lengths). Nevertheless, the losses all satisfy L > 1.
Remark 3 We note that some models add auxiliary regularization terms, such as the z-loss (Chowdhery et al., 2023; Yang et al., 2023), during training. In these cases, the scaling laws should take the additional terms into account. Moreover, modifications to the transformer blocks, such as additional layer normalization, may contribute to the lower bound of the cross-entropy.
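For reference, a minimal sketch of combining cross-entropy with a z-loss-style auxiliary term is given below; the coefficient `z_loss_weight` and the use of the squared log-normalizer follow the form reported for PaLM (Chowdhery et al., 2023), but the exact formulation in any given model may differ.

```python
import numpy as np

def loss_with_z_loss(logits: np.ndarray, target_index: int, z_loss_weight: float = 1e-4) -> float:
    """Cross-entropy plus a z-loss-style penalty encouraging log Z to stay near zero.

    logits: unnormalized scores over the vocabulary for one prediction.
    target_index: index of the correct token.
    """
    m = logits.max()
    log_Z = m + np.log(np.exp(logits - m).sum())   # softmax normalizer, computed stably
    cross_entropy = log_Z - logits[target_index]   # -log softmax(logits)[target_index]
    z_loss = z_loss_weight * log_Z ** 2            # auxiliary regularization term
    return float(cross_entropy + z_loss)

# Hypothetical usage with made-up logits and target.
print(loss_with_z_loss(np.array([2.0, -0.5, 1.0, 0.3]), target_index=0))
```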