Review & Cheat Sheet

[!IMPORTANT] In this review, you will consolidate:

🧠 Rapid Recall Flashcards: Cards covering SFT, Instruction Tuning, LoRA, and RLHF concepts.

📝 Module Cheat Sheet: Quick reference tables and conceptual summaries for parameter-efficient fine-tuning and alignment.

1. Module Mastery Overview

You’ve completed the “Fine-Tuning” module. This module focused on bridging the gap between raw foundation models and helpful, aligned assistants.

SFT & Instruction Tuning: Adapting models to follow formats and instructions.
PEFT & LoRA: Fine-tuning large models at a fraction of the memory cost by training small low-rank matrices while the original weights stay frozen.
RLHF: Aligning models to human preferences using a Reward Model and Proximal Policy Optimization (PPO).

2. Interactive: Module 04 Flashcards

Test your recall. Click a card to reveal the “Senior Engineer” answer.

What is the primary goal of Supervised Fine-Tuning (SFT)?

To adapt a pre-trained base model to specific tasks or formats (like Q&A) using high-quality prompt-response pairs.
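
As a minimal sketch of how one prompt-response pair might become a training example, the snippet below concatenates the two parts and computes the loss only on the response tokens. Masking the prompt is a common convention and an assumption here, not something this module prescribes. `toy_tokenize` and `build_sft_example` are hypothetical helpers written for illustration.

```python
IGNORE_INDEX = -100  # label value that PyTorch's CrossEntropyLoss skips by default


def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: one integer id per character."""
    return [ord(ch) for ch in text]


def build_sft_example(prompt, response):
    """Concatenate prompt and response; mask prompt positions in the labels."""
    prompt_ids = toy_tokenize(prompt)
    response_ids = toy_tokenize(response)
    input_ids = prompt_ids + response_ids
    # Assumption: only response tokens contribute to the loss.
    labels = [IGNORE_INDEX] * len(prompt_ids) + response_ids
    return {"input_ids": input_ids, "labels": labels}


example = build_sft_example("Q: What is 2+2?\nA: ", "4")
print(example["labels"])
```

With a real tokenizer the structure stays the same: the model sees the full sequence, but the gradient comes only from the positions it is supposed to learn to produce.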

What is "Catastrophic Forgetting"?

When a model aggressively updates its weights for a narrow task and loses previously learned general knowledge.

How does LoRA reduce trainable parameters?

By freezing original weights and injecting trainable low-rank decomposition matrices (A and B) into Transformer layers.
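
For reference, the adapted layer can be written as follows. The dimension names d and k and the scaling factor α follow the usual LoRA notation and are not defined in this module, so treat them as assumptions.

```latex
h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r}\, B A x,
\qquad
W_0 \in \mathbb{R}^{d \times k},\;
B \in \mathbb{R}^{d \times r},\;
A \in \mathbb{R}^{r \times k},\;
r \ll \min(d, k)
```

Only A and B receive gradients; W_0 stays frozen.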

What are the three core steps of RLHF?

1. Supervised Fine-Tuning (SFT)
2. Reward Model (RM) training
3. Proximal Policy Optimization (PPO) loop

Why do we need a Reward Model in RLHF?

Because humans are too slow to score millions of model outputs directly during reinforcement learning. The RM acts as a proxy for human preferences.
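
A hedged sketch of how such a proxy is commonly trained: given a human-preferred ("chosen") and a less-preferred ("rejected") response, a pairwise loss pushes the RM to score the chosen one higher. The module does not specify the loss, so the Bradley-Terry style formulation and the function name `pairwise_preference_loss` are assumptions.

```python
import math


def pairwise_preference_loss(score_chosen, score_rejected):
    """Return -log(sigmoid(score_chosen - score_rejected)), computed stably."""
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is small when the RM already ranks the pair correctly...
print(pairwise_preference_loss(2.0, 0.0))
# ...and large when it ranks the pair the wrong way round.
print(pairwise_preference_loss(0.0, 2.0))
```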

What is the KL Divergence Penalty in PPO?

It discourages "reward hacking" (generating repetitive, nonsensical text that exploits the Reward Model) by penalizing the RL policy whenever its output distribution drifts away from that of the original SFT model.
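
One common way to write the penalized objective is shown below. The symbols are assumptions for illustration: π_θ is the policy being trained, π_SFT the frozen SFT model, r_φ the Reward Model, and β the penalty strength.

```latex
\max_{\theta}\;
\mathbb{E}_{x \sim \mathcal{D}}
\Big[
  \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)} \big[ r_\phi(x, y) \big]
  \;-\; \beta\, \mathrm{KL}\big( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{SFT}}(\cdot \mid x) \big)
\Big]
```

A larger β keeps the policy closer to the SFT model; a smaller β lets it chase reward more aggressively.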

3. Module Cheat Sheet

3.1 Fine-Tuning Paradigms

Paradigm	Objective	Typical Use Case	Resource Cost
Full SFT	Update all model weights.	Domain adaptation, drastic behavior shifts.	Extremely High (Compute & Memory)
PEFT (LoRA)	Update small, low-rank matrices.	Task-specific adaptation, custom voices.	Low (Consumer GPUs)
RLHF	Maximize human preference reward.	Safety alignment, helpfulness, reducing toxicity.	Very High (Multiple Models)

3.2 The 3 Steps of RLHF (Summary)

Step 1: Supervised Fine-Tuning (SFT): Train the base model on high-quality demonstration data to establish a baseline capability.
Step 2: Reward Model (RM): Train a secondary model on human-ranked comparison data (Response A vs. Response B for the same prompt) so that it can score how “good” a response is.
Step 3: Policy Optimization (PPO): Run reinforcement learning in which a policy initialized from the SFT model generates responses, the RM scores them, and PPO updates the policy to maximize that score while a KL penalty keeps it close to the SFT model.
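
To make Step 3 concrete, here is a minimal sketch of the reward signal the optimizer might see for a single sampled response: the RM score minus a penalty based on how much more likely the policy finds its own tokens than the SFT reference does. The function name `kl_shaped_reward`, the default `beta`, and the sample-based estimate of the KL term are assumptions, not the module's prescribed implementation.

```python
def kl_shaped_reward(rm_score, policy_logprobs, reference_logprobs, beta=0.1):
    """Combine an RM score with a KL-style penalty for one sampled response.

    policy_logprobs / reference_logprobs hold the per-token log-probabilities
    that the RL policy and the frozen SFT model assign to the sampled tokens.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("both models must score the same tokens")
    # Single-sample estimate of the KL term: sum of per-token log-ratios.
    kl_estimate = sum(p - q for p, q in zip(policy_logprobs, reference_logprobs))
    return rm_score - beta * kl_estimate


# The policy is more confident than the reference on every token,
# so part of the RM score is taken back as a penalty.
reward = kl_shaped_reward(
    rm_score=2.0,
    policy_logprobs=[-0.1, -0.2, -0.3],
    reference_logprobs=[-0.5, -0.6, -0.7],
)
print(reward)
```

When the policy and the reference assign identical log-probabilities, the penalty vanishes and the reward equals the raw RM score.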

3.3 Key Vocabulary

Instruction Tuning: A form of SFT in which the dataset consists of instructions (“Summarize this”, “Translate that”) paired with the desired responses, which helps the model generalize to unseen tasks.
Alignment Tax: The phenomenon where models explicitly aligned via RLHF sometimes show a slight degradation in standard benchmark performance (like coding or math).
Rank (r): In LoRA, the hyperparameter controlling the size of the injected matrices. Higher rank means more expressiveness but more parameters.
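
The effect of rank on parameter count can be checked with a few lines of arithmetic. The sketch below compares a full weight matrix with its LoRA adapter and runs a tiny pure-Python forward pass; the 4096 x 4096 layer size, `alpha=16`, and all function names are illustrative assumptions. Initializing B to zeros follows common LoRA practice, so the adapter starts as a no-op.

```python
import random


def lora_param_counts(d_out, d_in, rank):
    """Return (full-matrix params, LoRA adapter params) for one linear layer."""
    full = d_out * d_in
    lora = rank * (d_out + d_in)  # B is d_out x rank, A is rank x d_in
    return full, lora


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(W0, A, B, x, alpha, rank):
    """Compute W0 x + (alpha / rank) * B (A x) with W0 treated as frozen."""
    base = matvec(W0, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


full, lora = lora_param_counts(4096, 4096, rank=8)
print(full, lora, lora / full)  # 16777216 65536 0.00390625

random.seed(0)
d_out, d_in, rank = 3, 4, 2
W0 = [[random.gauss(0, 1) for _ in range(d_in)] for _ in range(d_out)]
A = [[random.gauss(0, 0.01) for _ in range(d_in)] for _ in range(rank)]
B = [[0.0] * rank for _ in range(d_out)]  # zero init: output starts unchanged
x = [1.0, 2.0, 3.0, 4.0]
print(lora_forward(W0, A, B, x, alpha=16, rank=rank) == matvec(W0, x))  # True
```

Doubling the rank doubles the adapter's parameter count, which is the trade-off between expressiveness and size described above.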

4. Next Steps

You now understand how to adapt and align models. In the next module, we will explore techniques to ground the model with external data and build autonomous agents.
