DReLU Activation Function: Matching SwiGLU Performance With 90% Sparsity

Table of Links

Abstract and 1. Introduction

Related Work and Background
Analysis

3.1 Limitations about Existing ReLUfication

3.2 dReLU
Are Neurons in Expert still Sparsely Activated?
dReLU Sparsification
Experiments Results

6.1 Downstream Tasks Performance

6.2 Sparsity of Sparsified Models
Practical Inference Speedup Evaluation

7.1 Experiments Setting

7.2 Pure CPU Inference and 7.3 Hybrid GPU-CPU Inference

7.4 Deploy LLMs on mobile phones
Conclusion and References

A. Appendix / supplemental material

B. Limitation

C. Broader Impact

3.2 dReLU

We introduce a new activation function, named dReLU (Equation 2), where ReLU is applied after both the up- and gate-projection[1].
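The description above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: it assumes a standard gated MLP in which the input is multiplied by a gate matrix and an up matrix and the two results are combined elementwise. The names `w_gate`, `w_up`, `drelu_hidden`, and `swiglu_hidden` are hypothetical and chosen only for this example.

```python
import numpy as np


def relu(x):
    """Elementwise max(0, x)."""
    return np.maximum(x, 0.0)


def silu(x):
    """Elementwise x * sigmoid(x), the gate nonlinearity used by SwiGLU."""
    return x / (1.0 + np.exp(-x))


def drelu_hidden(x, w_gate, w_up):
    """Gated hidden state with ReLU applied after BOTH projections (assumed form)."""
    return relu(x @ w_gate) * relu(x @ w_up)


def swiglu_hidden(x, w_gate, w_up):
    """Gated hidden state with SiLU on the gate branch only, for comparison."""
    return silu(x @ w_gate) * (x @ w_up)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((8, 32))        # 8 tokens, hidden size 32 (toy sizes)
    w_gate = rng.standard_normal((32, 128))  # hypothetical gate projection
    w_up = rng.standard_normal((32, 128))    # hypothetical up projection

    h_drelu = drelu_hidden(x, w_gate, w_up)
    h_swiglu = swiglu_hidden(x, w_gate, w_up)

    # An entry of the dReLU hidden state is exactly zero whenever either
    # pre-activation is non-positive; SwiGLU entries are almost never exactly zero.
    print("dReLU zero fraction: ", np.mean(h_drelu == 0.0))
    print("SwiGLU zero fraction:", np.mean(h_swiglu == 0.0))
```

With random weights, the dReLU hidden state contains many exact zeros because either branch can switch a neuron off, whereas the SwiGLU hidden state is dense. Sparsity in a trained model depends on the learned weights, so the toy numbers here say nothing about the figures reported in the paper.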

To demonstrate the effectiveness of dReLU, we compared 300M-parameter decoder-only models using dReLU and SwiGLU, both pretrained on the FineWeb dataset [47] for 5B tokens. Refer to Appendix A.1 for the detailed model architecture hyperparameters. The evaluation results are shown in Table 2.

Figure 4: Training loss of small models with different activation functions.

Table 2: Validation and training loss on different activations.

Our findings reveal that models employing the dReLU structure converge similarly to those using the SwiGLU structure. We also evaluate the perplexity of both models on WikiText-2 [39], where the dReLU-based model performs slightly better.

Figure 4 illustrates the loss curves during training, showing that models with the dReLU activation function converge about as well as their SwiGLU counterparts. To further validate this observation, we evaluate the perplexity of these models on the WikiText-2 dataset. As shown in Table 2, although the SwiGLU-based model has lower training loss, the dReLU-based model has lower validation perplexity. These results provide strong evidence that adopting the dReLU structure does not compromise model performance. We evaluate on more downstream tasks in Appendix A.1.

Another question we need to address is the sparsity of the dReLU-based model. To investigate it, we propose a methodology for measuring a model’s performance under different sparsity levels. Our approach keeps only the top-k% of the values produced by dReLU or other activation functions, ranked by absolute magnitude, as described in Equations 3 and 4.
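The top-k% selection can be illustrated with a short NumPy sketch. It is an assumption-laden example rather than the paper's code: it assumes the selection is done per token (per row of the hidden state) and that entries outside the top-k% are set to zero before the down-projection. The function name `keep_top_k_percent` is hypothetical.

```python
import numpy as np


def keep_top_k_percent(activations, k_percent):
    """Zero every entry except the k% largest by absolute value in each row.

    Assumption: selection is per row (one row per token). Ties at the
    threshold are all kept, so slightly more than k% can survive.
    """
    a = np.asarray(activations, dtype=float)
    n = a.shape[-1]
    n_keep = int(np.ceil(n * k_percent / 100.0))
    if n_keep <= 0:
        return np.zeros_like(a)
    n_keep = min(n_keep, n)

    abs_a = np.abs(a)
    # After partitioning, index (n - n_keep) holds the n_keep-th largest magnitude.
    kth = n - n_keep
    threshold = np.partition(abs_a, kth, axis=-1)[..., kth:kth + 1]
    mask = abs_a >= threshold
    return np.where(mask, a, 0.0)


if __name__ == "__main__":
    hidden = np.array([[0.1, -3.0, 0.5, 2.0]])
    # Keep the top 50% by magnitude: only -3.0 and 2.0 survive.
    print(keep_top_k_percent(hidden, 50))
```

Sweeping `k_percent` and re-evaluating the model with the masked hidden state gives one point per sparsity level, which is the kind of performance-versus-sparsity measurement the paragraph describes.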