LoRA and PEFT

What is LoRA?

Parameter-Efficient Fine-Tuning (PEFT) methods adapt large pre-trained language models by training only a small fraction of their parameters, or a small set of added ones, instead of updating every weight. Low-Rank Adaptation (LoRA) is one of the most widely used PEFT techniques.

LoRA freezes the pre-trained model weights and injects trainable rank-decomposition matrices into the layers of the Transformer architecture, greatly reducing the number of trainable parameters for downstream tasks.

The Encyclopedia Analogy

Imagine the pre-trained Large Language Model (LLM) as a massive, multi-volume encyclopedia. It contains vast amounts of general knowledge but might lack specific, up-to-date information on a highly specialized topic (like your company’s proprietary codebase).

Full fine-tuning is like trying to rewrite the entire encyclopedia to include your new information. It is incredibly time-consuming, expensive, and risks messing up the existing knowledge.

LoRA, on the other hand, is like adding a thin booklet of sticky notes to the back of the encyclopedia. You don’t touch the original text (the weights are frozen). Instead, you just refer to the sticky notes (the LoRA matrices) whenever you need the specialized information. When performing an operation, the model reads the original encyclopedia and applies the adjustments from the sticky notes simultaneously.

LoRA Decomposition: Worked Example

The example below, for d = 4096 and rank r = 8, shows how the rank affects the number of trainable parameters compared to full fine-tuning.

Pre-trained weights W (frozen): d × d, 16,777,216 parameters

LoRA update ΔW = BA: B is d × r and A is r × d, 65,536 trainable parameters

Parameter reduction: 99.61%

Final computation: h = Wx + BAx
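The counts shown above follow from simple arithmetic: a full d × d matrix has d² entries, while the two LoRA factors together have 2dr. Here is a minimal sketch of that calculation, assuming a square d × d layer; the helper name lora_param_counts is made up for illustration.

```python
def lora_param_counts(d: int, r: int) -> tuple[int, int, float]:
    """Return (full, lora, reduction_percent) for a square d x d weight matrix."""
    full = d * d            # parameters in the frozen matrix W
    lora = d * r + r * d    # parameters in B (d x r) plus A (r x d)
    reduction = 100.0 * (1.0 - lora / full)
    return full, lora, reduction


if __name__ == "__main__":
    # Sweep a few ranks for d = 4096 to see how slowly the adapter grows.
    for r in (1, 4, 8, 64):
        full, lora, reduction = lora_param_counts(4096, r)
        print(f"r={r:>2}: full={full:,} trainable={lora:,} reduction={reduction:.2f}%")
```

Because the trainable count grows linearly in r while the full count is fixed at d², even a fairly generous rank leaves the adapter a small fraction of the original matrix.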

The Mathematics of Intrinsic Rank

The core premise of LoRA is the concept of intrinsic rank. Although neural network weight matrices are very large (e.g., 4096 × 4096), LoRA hypothesizes that the update needed to adapt them during fine-tuning has a low intrinsic rank, so it can be represented with far fewer parameters than the full matrix.

Let W₀ ∈ ℝ^{d × d} be a pre-trained weight matrix. The weight update is ΔW. In LoRA, we constrain this update by representing it with a low-rank decomposition:

ΔW = BA

where B ∈ ℝ^{d × r}, A ∈ ℝ^{r × d}, and the rank r ≪ d. The forward pass becomes:

h = W₀x + ΔWx = W₀x + BAx
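The forward-pass identity can be checked numerically. The sketch below uses small illustrative sizes and random values for B and A (not a trained adapter). It confirms that applying the factors separately gives the same result as adding BA into the weight first, and that BA has rank at most r.

```python
import torch

torch.manual_seed(0)
d, r = 64, 4  # small illustrative sizes; real layers use d in the thousands

W0 = torch.randn(d, d, dtype=torch.float64)  # stands in for a frozen pre-trained weight
B = torch.randn(d, r, dtype=torch.float64)   # d x r
A = torch.randn(r, d, dtype=torch.float64)   # r x d
x = torch.randn(d, dtype=torch.float64)      # a single input vector

delta_W = B @ A                              # d x d matrix, but rank at most r

# Two equivalent ways to compute the forward pass
h_separate = W0 @ x + B @ (A @ x)            # never materializes the d x d update
h_merged = (W0 + delta_W) @ x                # folds the update into the weight first

rank_delta = torch.linalg.matrix_rank(delta_W).item()
print(torch.allclose(h_separate, h_merged), rank_delta)
```

Computing B(Ax) costs about 2dr multiplications instead of d², which is why the separate form is the one used during training.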

During training, W₀ is frozen and does not receive gradient updates. Only the matrices A and B contain trainable parameters. Matrix A is typically initialized with random Gaussian numbers, while B is initialized to zero, ensuring that at the beginning of training, ΔW = 0.
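The effect of this initialization can be seen in a few lines. The sketch below assumes small illustrative sizes and an arbitrary stand-in loss. It checks that the adapted layer starts out identical to the frozen one and that only A and B are candidates for gradient updates.

```python
import math
import torch

torch.manual_seed(0)
d, r = 32, 4

W0 = torch.randn(d, d)                                     # frozen: plain tensor, no gradient tracking
A = torch.nn.Parameter(torch.randn(r, d) / math.sqrt(d))   # Gaussian init
B = torch.nn.Parameter(torch.zeros(d, r))                  # zero init, so BA = 0

x = torch.randn(8, d)                                      # batch of 8 inputs stored as rows
h = x @ W0.T + x @ A.T @ B.T                               # h = W0 x + B A x in row-vector form

starts_at_base = torch.allclose(h, x @ W0.T)               # adapter is a no-op at step 0

loss = h.pow(2).mean()                                     # arbitrary stand-in loss
loss.backward()

grad_A_is_zero = bool((A.grad == 0).all())                 # A's gradient passes through B, which is zero
grad_B_is_nonzero = bool((B.grad != 0).any())              # B receives a gradient immediately
w0_has_grad = W0.grad is not None                          # W0 is never updated
print(starts_at_base, grad_A_is_zero, grad_B_is_nonzero, w0_has_grad)
```

On the very first step only B moves, because the gradient reaching A is multiplied by B, which is still zero. Once B is nonzero, both factors train.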

The Need for Parameter Efficiency

As models like GPT-3 reached 175B parameters, full fine-tuning became prohibitively expensive. LoRA substantially lowers the hardware barrier to fine-tuning while often matching the quality of full fine-tuning.

Crucial Hardware Fact

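For GPT-3 175B, the LoRA authors report that VRAM consumption during training drops from about 1.2 TB with full fine-tuning to about 350 GB with LoRA, roughly a 3× reduction. The much smaller figure often quoted, about 35 MB, is the size of the saved adapter checkpoint (compared with about 350 GB for a full copy of the model), not the training memory. Fine-tuning a model of this size still requires multiple data-center GPUs; a single consumer GPU is realistic only for much smaller models.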
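A rough back-of-the-envelope estimate shows where the savings in stored weights come from. The sketch below rests on assumptions not stated in this article: 16-bit storage (2 bytes per parameter), the published GPT-3 175B shape of 96 layers with hidden size 12288, and an adapter of rank 4 applied only to the query and value projections. It covers checkpoint size only, not training memory, which also includes activations, gradients, and optimizer state.

```python
BYTES_PER_PARAM = 2  # assumption: 16-bit (fp16/bf16) storage

# Full model: every parameter is stored in a fine-tuned checkpoint.
full_params = 175e9
full_checkpoint_gb = full_params * BYTES_PER_PARAM / 1e9

# LoRA adapter. Assumed setup: 96 layers, d_model = 12288,
# two adapted matrices per layer (query and value), rank 4.
n_layers, d_model, adapted_per_layer, rank = 96, 12288, 2, 4
lora_params = n_layers * adapted_per_layer * (2 * d_model * rank)  # B and A per matrix
lora_checkpoint_mb = lora_params * BYTES_PER_PARAM / 1e6

print(f"full checkpoint : {full_checkpoint_gb:.0f} GB")
print(f"LoRA parameters : {lora_params:,}")
print(f"LoRA checkpoint : {lora_checkpoint_mb:.1f} MB")
```

Under these assumptions the adapter is on the order of tens of megabytes, so many task-specific adapters can be stored alongside a single shared copy of the base model.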

Code Implementation

Here is a conceptual implementation of LoRA in PyTorch. The LoRALayer demonstrates how the standard linear projection is augmented with matrices A and B.

Python

import torch
import torch.nn as nn
import math

class LoRALayer(nn.Module):
    def __init__(self, d_model, rank=8, alpha=16):
        super().__init__()
        self.d_model = d_model
        self.rank = rank
        # The low-rank update is scaled by alpha / rank
        self.scaling = alpha / rank

        # A matrix: rank x d_model, initialized from a Gaussian distribution
        self.lora_A = nn.Parameter(torch.randn(rank, d_model) / math.sqrt(d_model))
        # B matrix: d_model x rank, initialized to zero so that BA = 0 at the start
        self.lora_B = nn.Parameter(torch.zeros(d_model, rank))

    def forward(self, x):
        # Returns only the LoRA term of h = Wx + BAx; the frozen Wx comes from
        # the original layer and is added by the caller (W is not shown here).
        # x holds inputs as rows (..., d_model), so BAx becomes x @ A^T @ B^T.
        lora_update = (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
        return lora_update
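To show the full computation h = Wx + BAx end to end, here is one possible way to wrap a frozen nn.Linear with a low-rank update and then fold the update back into a single weight matrix. The class name LoRALinear and the method merged_weight are hypothetical names chosen for this sketch, not part of any library. The random perturbation of B merely stands in for training.

```python
import math
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Hypothetical wrapper: a frozen nn.Linear plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze W (and the bias, if present)
        self.scaling = alpha / rank
        # Shapes follow the text: A is r x d_in, B is d_out x r
        self.lora_A = nn.Parameter(
            torch.randn(rank, base.in_features) / math.sqrt(base.in_features)
        )
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # h = Wx + scaling * BAx, with inputs stored as rows
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling

    @torch.no_grad()
    def merged_weight(self) -> torch.Tensor:
        # W + scaling * BA: a single matrix for inference with no extra matmuls
        return self.base.weight + self.scaling * (self.lora_B @ self.lora_A)


torch.manual_seed(0)
layer = LoRALinear(nn.Linear(64, 64, bias=False), rank=8)
x = torch.randn(4, 64)

trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
out_initial = layer(x)                       # equals the frozen layer's output, since B = 0

with torch.no_grad():
    layer.lora_B.normal_()                   # stand-in for training moving B away from zero
out_adapter = layer(x)                       # base path plus low-rank path
out_merged = x @ layer.merged_weight().T     # single merged matrix
print(trainable, total, torch.allclose(out_adapter, out_merged, atol=1e-4))
```

Merging is a design choice for deployment: once training is done, the adapter can be added into the base weight so that inference runs at the same cost as the original layer.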