QLoRA: Efficient Finetuning of Quantized LLMs

Abstract: We present QLoRA, an efficient finetuning approach that reduces memory usage enough to finetune a 65B parameter model on a single 48GB GPU while preserving full 16-bit finetuning task performance. QLoRA backpropagates gradients through a frozen, 4-bit quantized pretrained language model into Low Rank Adapters (LoRA). Our best model family, which we name Guanaco, outperforms all previous openly released models on the Vicuna benchmark, reaching 99.3% of the performance level of ChatGPT while only requiring 24 hours of finetuning on a single GPU. QLoRA introduces a number of innovations to save memory without sacrificing performance: (a) 4-bit NormalFloat (NF4), a new data type that is information theoretically optimal for normally distributed weights, (b) double quantization to reduce the average memory footprint by quantizing the quantization constants, and (c) paged optimizers to manage memory spikes. We use QLoRA to finetune more than 1,000 models, providing a detailed analysis of instruction following and chatbot performance across 8 instruction datasets, multiple model types (LLaMA, T5), and model scales that would be infeasible to run with regular finetuning (e.g. 33B and 65B parameter models). Our results show that QLoRA finetuning on a small high-quality dataset leads to state-of-the-art results, even when using smaller models than the previous SoTA. We provide a detailed analysis of chatbot performance based on both human and GPT-4 evaluations showing that GPT-4 evaluations are a cheap and reasonable alternative to human evaluation. Furthermore, we find that current chatbot benchmarks are not trustworthy to accurately evaluate the performance levels of chatbots. A lemon-picked analysis demonstrates where Guanaco fails compared to ChatGPT. We release all of our models and code, including CUDA kernels for 4-bit training.

Overview

Keywords: QLoRA, Finetuning, Quantized LLMs, Low-Rank Adapters, Memory Efficiency
Objective: Introduce QLoRA, an efficient finetuning method for quantized large language models (LLMs) that significantly reduces memory usage while maintaining performance.
Hypothesis: QLoRA can finetune a 4-bit quantized large language model without degrading task performance relative to full 16-bit finetuning.
Innovation: Introduction of 4-bit NormalFloat quantization, double quantization, and paged optimizers to enable efficient finetuning of large models on limited hardware.
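To make the memory claim concrete, here is a back-of-the-envelope sketch of the storage needed for the weights alone. It assumes 1 GB = 1e9 bytes and deliberately ignores activations, adapter gradients, optimizer state, and quantization constants, so it is a lower bound rather than a measurement; the function name is hypothetical.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Memory for the weights alone, in gigabytes (assuming 1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Compare 16-bit and 4-bit storage for a 65B-parameter model.
for bits in (16, 4):
    print(f"65B parameters at {bits}-bit: {weight_memory_gb(65e9, bits):.1f} GB")
```

Under these assumptions the 16-bit weights alone (130 GB) exceed a 48GB GPU, while the 4-bit weights (32.5 GB) leave headroom for everything else.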

Background

Preliminary Theories:
- Quantization: The process of reducing the precision of the model weights to lower bit representations, which helps in reducing memory usage.
- Low-Rank Adapters (LoRA): A technique that adds a small number of trainable parameters to a frozen model, allowing for efficient finetuning without updating the entire model.
- Gradient Checkpointing: A method to save memory during training by storing only a subset of intermediate activations and recomputing others as needed.
- Instruction Tuning: The process of finetuning models to follow specific instructions more effectively, often using datasets designed for this purpose.
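The LoRA idea in the list above can be sketched in a few lines. This is a minimal illustration, assuming the commonly used formulation h = Wx + (alpha / r) * B(Ax) with A initialized randomly and B initialized to zero; the names `lora_forward`, `init_adapter`, and `trainable_params`, and the default `alpha`, are hypothetical and not taken from this summary.

```python
import random


def matvec(matrix, vector):
    """Plain matrix-vector product on nested lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def init_adapter(d_in, d_out, r, seed=0):
    """A is r x d_in with small random values; B is d_out x r and starts at zero."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
    B = [[0.0] * r for _ in range(d_out)]
    return A, B


def lora_forward(W, A, B, x, alpha=16):
    """h = W x + (alpha / r) * B (A x). W stays frozen; only A and B would be trained."""
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def trainable_params(d_in, d_out, r):
    """Number of adapter parameters for one d_out x d_in weight matrix."""
    return r * (d_in + d_out)
```

Because B starts at zero, the adapted layer initially reproduces the frozen layer exactly. For an assumed 4096 x 4096 matrix with r = 16, the adapter has 131,072 trainable parameters against 16,777,216 frozen ones, which is under 1% of the matrix.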
Prior Research:
- 2021: Introduction of LoRA as a parameter-efficient finetuning method.
- 2021: Development of quantization techniques focused on inference, with limited application during training.
- 2022: Advancements in memory-efficient training methods, highlighting the challenges of finetuning large models.
- 2023: Emergence of various instruction-tuning datasets and benchmarks, setting the stage for more targeted model evaluations.

Methodology

Key Ideas:
- 4-bit NormalFloat (NF4): A new 4-bit data type that is information-theoretically optimal for normally distributed weights.
- Double Quantization: A technique that quantizes the quantization constants themselves, reducing the average memory footprint per parameter.
- Paged Optimizers: A method to manage memory spikes during training, so that large models can be finetuned on a single memory-limited GPU.
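The first two ideas can be illustrated with a small sketch. It assumes block-wise absmax scaling and a 16-level codebook built from evenly spaced standard-normal quantiles; this is only in the spirit of NF4 and is not its exact definition (for one thing, this codebook has no exact zero level). The block sizes and constant widths used for the overhead arithmetic are assumptions not stated in this summary, and all function names are hypothetical.

```python
from statistics import NormalDist


def quantile_codebook(bits=4):
    """Build 2**bits levels from standard-normal quantiles, rescaled to [-1, 1]."""
    k = 2 ** bits
    normal = NormalDist()
    # Midpoint probabilities avoid the infinite quantiles at 0 and 1.
    raw = [normal.inv_cdf((i + 0.5) / k) for i in range(k)]
    scale = max(abs(v) for v in raw)
    return [v / scale for v in raw]


def quantize_block(block, codebook):
    """Return (absmax, indices): one scaling constant plus one code index per weight."""
    absmax = max(abs(w) for w in block) or 1.0  # guard against an all-zero block
    indices = [
        min(range(len(codebook)), key=lambda j: abs(codebook[j] - w / absmax))
        for w in block
    ]
    return absmax, indices


def dequantize_block(absmax, indices, codebook):
    """Map code indices back to approximate weights."""
    return [codebook[j] * absmax for j in indices]


def constant_overhead_bits(block_size, const_bits=32,
                           second_block_size=None, second_const_bits=32):
    """Bits per weight spent on quantization constants.

    Without double quantization: one const_bits constant per block of weights.
    With it: each constant is stored in const_bits bits, plus one
    second_const_bits constant per second_block_size first-level constants.
    """
    first_level = const_bits / block_size
    if second_block_size is None:
        return first_level
    return first_level + second_const_bits / (block_size * second_block_size)
```

With an assumed block size of 64, 32-bit constants cost 0.5 bits per weight. Storing those constants in 8 bits and regrouping them in blocks of 256 with a 32-bit second-level constant brings the overhead to about 0.127 bits per weight.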
Experiments:
- Model Training: Finetuning of LLaMA models ranging from 7B to 65B parameters using QLoRA across multiple instruction-following datasets.
- Benchmarks: Evaluation on the Vicuna benchmark and comparison with existing models like ChatGPT and Vicuna.
- Data Analysis: Examination of the impact of dataset quality versus size on model performance.
Implications: The design of QLoRA allows for the finetuning of large models on consumer hardware, democratizing access to advanced NLP technologies.

Findings

Outcomes:
- QLoRA enables finetuning of 65B parameter models on a single 48GB GPU without performance degradation compared to 16-bit methods.
- The Guanaco model family reached 99.3% of ChatGPT's performance level on the Vicuna benchmark after 24 hours of finetuning on a single GPU, demonstrating the effectiveness of QLoRA.
- Data quality was found to be more critical than dataset size for effective instruction-following performance.
Significance: QLoRA challenges the belief that high memory requirements are necessary for effective finetuning of large models, providing a scalable solution for researchers with limited resources.
Future Work: Exploration of alternative quantization methods, further evaluation of different adapter techniques, and broader assessments of model biases and performance across various benchmarks.
Potential Impact: If pursued, these avenues could lead to even more efficient training methods, making advanced language models accessible for a wider range of applications, including mobile deployment and privacy-preserving AI solutions.
