Table of Links
- Abstract and 1. Introduction
- Proposed Method: Quantized DyLoRA
- Experiments and Evaluation
- On the semi-sorted behavior of QDyLoRA
- Conclusion, Limitations, and References
- A. Supplementary Material
  - A.1. Hyperparameters
  - A.2. Generated Text Quality
2 Proposed Method: Quantized DyLoRA
Following QLoRA (Dettmers et al., 2023), we use 4-bit NormalFloat (NF4) to store the doubly-quantized pre-trained weights. Since all computations must be carried out in BFloat16 precision, DDequant-NF4 dequantizes the stored data on the fly. Similar to Dettmers et al. (2023), we have:
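A sketch of this computation in QLoRA's notation, extended with DyLoRA's rank truncation (the LoRA factor names $L_{dw}$ and $L_{up}$ and the sampled rank $b$ are our notation here, not necessarily the paper's):

$$\text{DDequant-NF4}\big(c_1^{\text{FP32}}, c_2^{\text{k-bit}}, W^{\text{NF4}}\big) = \text{dequant}\big(\text{dequant}(c_1^{\text{FP32}}, c_2^{\text{k-bit}}),\, W^{\text{NF4}}\big) = W^{\text{BF16}}$$

$$Y^{\text{BF16}} = X^{\text{BF16}}\,\text{DDequant-NF4}\big(c_1^{\text{FP32}}, c_2^{\text{k-bit}}, W^{\text{NF4}}\big) + X^{\text{BF16}}\, L_{dw}^{\text{BF16}}[:, :b]\, L_{up}^{\text{BF16}}[:b, :]$$

where $c_1$ and $c_2$ are the first- and second-level quantization constants and $b \in \{1, \dots, r_{\max}\}$ is the LoRA rank sampled at the current training step.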
Algorithm 1 describes the workflow of our proposed QDyLoRA in detail.
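As a rough illustration of this workflow, the minimal PyTorch sketch below samples a rank `b` at each training step and truncates the LoRA factors accordingly. The class and parameter names are ours, the `alpha / b` scaling is one plausible choice rather than the paper's confirmed one, and the NF4-quantized base weight is stood in for by a frozen BF16 tensor for brevity:

```python
import torch
import torch.nn as nn

class DynamicLoRALinear(nn.Module):
    """Frozen base weight plus a dynamic-rank LoRA branch.

    The base weight is a plain frozen BF16 tensor here; in the actual method it
    would be stored in NF4 and dequantized on the fly via DDequant-NF4.
    """

    def __init__(self, d_in: int, d_out: int, max_rank: int = 64, alpha: float = 16.0):
        super().__init__()
        self.weight = nn.Parameter(
            torch.randn(d_in, d_out, dtype=torch.bfloat16), requires_grad=False
        )
        self.lora_down = nn.Parameter(torch.randn(d_in, max_rank, dtype=torch.bfloat16) * 0.01)
        self.lora_up = nn.Parameter(torch.zeros(max_rank, d_out, dtype=torch.bfloat16))
        self.max_rank = max_rank
        self.alpha = alpha

    def forward(self, x: torch.Tensor, b: int) -> torch.Tensor:
        # Base path: x @ W (W would be the dequantized NF4 weight in the real method).
        y = x @ self.weight
        # LoRA path truncated to the sampled rank b: only the first b columns/rows
        # of the down/up factors participate in this step.
        return y + (self.alpha / b) * (x @ self.lora_down[:, :b]) @ self.lora_up[:b, :]

layer = DynamicLoRALinear(d_in=512, d_out=512)
opt = torch.optim.AdamW([layer.lora_down, layer.lora_up], lr=1e-4)

for step in range(3):
    x = torch.randn(4, 512, dtype=torch.bfloat16)
    b = int(torch.randint(1, layer.max_rank + 1, ()))  # sample a rank for this step
    loss = layer(x, b).float().pow(2).mean()           # stand-in loss for illustration
    opt.zero_grad(); loss.backward(); opt.step()
```

Because every rank up to the maximum is trained this way, the same adapter can later be queried at any rank $b \le 64$ without retraining, which is what enables the per-rank evaluations reported below.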
3 Experiments and Evaluation
This section evaluates the efficiency and efficacy of QDyLoRA on several instruction-tuning tasks. The first experiment compares QDyLoRA with QLoRA on the Massive Multitask Language Understanding (MMLU) benchmark (Hendrycks et al., 2020), which consists of more than 50 tasks spanning fundamental mathematics, U.S. history, computer science, law, and more. As shown in Table 1 [1], we fine-tune LLaMA-7b, LLaMA-13b, LLaMA2-13b, and Falcon-40b on several datasets, Alpaca (Taori et al., 2023), OASST1 (Köpf et al., 2023), Self-Instruct (Wang et al., 2022), and FLANv2 (Chung et al., 2022), using the QLoRA and QDyLoRA techniques. We use the same training budget and maximum LoRA rank [2] for each technique. The results consistently show that QDyLoRA achieves superior performance by finding the optimal rank.
The second experiment provides a more in-depth comparison between QLoRA and QDyLoRA. In particular, we fine-tuned Falcon-40b on the WebGLM (Liu et al., 2023) and GSM8k (Cobbe et al., 2021) benchmarks under identical settings and compared test performance across different ranks. As shown in Table 2, QDyLoRA attains superior performance, notably when employing its optimal ranks (rank 2 for WebGLM and rank 8 for GSM8k). Furthermore, QDyLoRA exhibits consistent superiority over QLoRA, particularly at lower ranks. These findings emphasize the adaptive nature of QDyLoRA, which dynamically adjusts its focus during fine-tuning and thereby achieves greater efficiency and efficacy than its static counterpart, QLoRA.

The third experiment compares DyLoRA, QDyLoRA, and QLoRA on GSM8k and TriviaQA (Joshi et al., 2017), using LLaMA2-13b and LLaMA-7b as the underlying LLMs. Table 3 reports the results. As the table illustrates, for the smaller model, LLaMA-7b, both DyLoRA and QDyLoRA outperform QLoRA. For the larger model, LLaMA2-13b, DyLoRA fails with an out-of-memory (OOM) error, while QDyLoRA performs best in this setting.
4 On the semi-sorted behavior of QDyLoRA
As shown in Table 2, QDyLoRA exhibits a semi-sorted performance pattern across ranks. We attribute this behavior to the limited fine-tuning budget. Under a limited budget, QDyLoRA updates its lower ranks more frequently than its higher ranks: whenever a higher rank is selected, all lower ranks are updated along with it, so lower ranks accumulate more updates over the course of training. Hence, lower ranks end up better tuned than higher ranks.
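To make the frequency argument concrete: if the active rank $b$ is drawn uniformly from $\{1, \dots, r\}$ at each step, row $i$ of the LoRA factors is updated exactly when $b \ge i$, so it participates in a fraction $(r - i + 1)/r$ of all updates. The back-of-the-envelope computation below assumes uniform sampling; the paper's actual sampling distribution may differ:

```python
# Fraction of training steps in which LoRA row i receives a gradient update,
# assuming the active rank b is sampled uniformly from {1, ..., r} each step:
# row i is updated whenever b >= i, i.e., with probability (r - i + 1) / r.
r = 64
for i in (1, 2, 8, 32, 64):
    print(i, (r - i + 1) / r)
# row 1 is updated on 100% of steps; row 64 on only ~1.6% of them
```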
This paper is available on arxiv under ATTRIBUTION-NONCOMMERCIAL-SHAREALIKE 4.0 INTERNATIONAL license.
[1] The same settings as the original QLoRA work are applied here.
[2] The maximum LoRA rank is fixed to 64. While QLoRA’s rank is always fixed, QDyLoRA can split training across ranks in the range 1 to 64.