Module Review: ML Foundations
[!NOTE] This review chapter consolidates your learning on the foundational concepts of Machine Learning. Use the key takeaways, interactive flashcards, and cheat sheet to ensure you deeply understand linear models and optimization.
Key Takeaways
- Linear Regression predicts continuous values by fitting a line to the data. It minimizes the Mean Squared Error (MSE).
- Logistic Regression predicts probabilities for binary classification by passing a linear equation through a Sigmoid function. It minimizes Log Loss (Cross-Entropy).
- Cost Functions quantify how “wrong” a model’s predictions are. They are the landscapes we navigate to find optimal parameters.
- Gradient Descent is an iterative optimization algorithm that finds the minimum of a cost function by taking steps in the direction of the negative gradient.
- The Learning Rate (α) controls the step size in Gradient Descent. It is one of the most important hyperparameters to tune: too small and training crawls, too large and the cost can diverge.
- Batch vs. Stochastic vs. Mini-Batch defines how much data is used to compute the gradient before each parameter update. Mini-batch is the industry standard.
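The takeaways above can be seen together in a minimal sketch: batch Gradient Descent fitting a line by minimizing MSE. The data here is synthetic (made up for illustration), generated from y = 2x + 1 with no noise.

```python
import numpy as np

# Synthetic, noiseless data: y = 2x + 1 (illustrative values).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 1))
y = 2 * X[:, 0] + 1

# Prepend a bias column so theta = [intercept, slope].
Xb = np.hstack([np.ones((100, 1)), X])
theta = np.zeros(2)
alpha = 0.5  # learning rate (α)

for _ in range(2000):
    pred = Xb @ theta
    grad = (2 / len(y)) * Xb.T @ (pred - y)  # gradient of the MSE cost
    theta -= alpha * grad                    # step in the negative gradient direction

print(theta)  # ≈ [1.0, 2.0] — recovers intercept and slope
```

Try changing `alpha` to 2.0 and the parameters blow up instead of converging, which is the divergence behavior described above.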
Interactive Flashcards
Test your recall of key ML Foundations concepts.
What is the purpose of the Sigmoid function in Logistic Regression?
It maps the unbounded output of a linear equation to a value strictly between 0 and 1, representing a probability.
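A quick sketch of that squashing behavior, using only the standard library:

```python
import math

def sigmoid(z):
    # σ(z) = 1 / (1 + e^(-z)): maps any real z into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))    # 0.5 — the decision boundary
print(sigmoid(10))   # ≈ 0.99995 — near-certain positive class
print(sigmoid(-10))  # ≈ 0.00005 — near-certain negative class
```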
Why don't we use Mean Squared Error (MSE) as the cost function for Logistic Regression?
Because the non-linear Sigmoid function makes the MSE cost surface non-convex (wavy), Gradient Descent could get stuck in a local minimum. We use Log Loss instead, which gives a convex cost surface.
What happens if the learning rate (α) in Gradient Descent is too large?
The algorithm might overshoot the minimum, bounce back and forth across the valley, and actually diverge, causing the cost to increase.
What is Mini-Batch Gradient Descent?
An optimization method that calculates the gradient using a small subset (e.g., 32 or 64) of the training examples. It balances the stability of Batch GD and the speed of Stochastic GD.
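A sketch of that mini-batch loop on synthetic data (illustrative values, generated from y = 3x − 2 with no noise): each parameter update uses a random batch of 32 examples rather than the full dataset.

```python
import numpy as np

# Synthetic, noiseless data: y = 3x - 2 (illustrative values).
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 1))
y = 3 * X[:, 0] - 2

Xb = np.hstack([np.ones((500, 1)), X])  # bias column + feature
theta = np.zeros(2)
alpha, batch_size = 0.1, 32

for epoch in range(100):
    order = rng.permutation(len(y))            # reshuffle every epoch
    for start in range(0, len(y), batch_size):
        idx = order[start:start + batch_size]  # one mini-batch of indices
        pred = Xb[idx] @ theta
        grad = (2 / len(idx)) * Xb[idx].T @ (pred - y[idx])
        theta -= alpha * grad

print(theta)  # ≈ [-2.0, 3.0]
```

Setting `batch_size = len(y)` turns this into Batch GD; setting it to 1 turns it into Stochastic GD.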
Cheat Sheet
| Concept | Mathematical Equation / Update Rule | Use Case |
|---|---|---|
| Linear Regression | h_θ(x) = θᵀx | Predicting continuous values (e.g., house prices). |
| Sigmoid Function | σ(z) = 1 / (1 + e⁻ᶻ) | Squashing values to (0, 1) for probabilities. |
| Logistic Regression | h_θ(x) = σ(θᵀx) | Binary classification (e.g., Spam vs. Not Spam). |
| Gradient Descent Update | θⱼ := θⱼ − α · (∂/∂θⱼ) J(θ) | Iteratively updating weights to minimize cost. |
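The first three rows of the cheat sheet compose into a single prediction. A sketch with made-up weights θ (illustrative values, not a trained model):

```python
import math

theta = [-1.0, 2.0]  # [bias, weight] — hypothetical, untrained values
x = [1.0, 1.5]       # [1 for the bias term, one feature]

# θᵀx, then squash through the Sigmoid: h_θ(x) = σ(θᵀx)
z = sum(t * xi for t, xi in zip(theta, x))  # -1 + 2 * 1.5 = 2.0
prob = 1.0 / (1.0 + math.exp(-z))           # σ(2.0) ≈ 0.88

print(prob > 0.5)  # True → predict the positive class
```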
Quick Revision
- Cost Function of Linear Regression: Mean Squared Error (MSE).
- Cost Function of Logistic Regression: Log Loss (Cross-Entropy).
- Learning Rate (α): Determines the size of the steps taken during Gradient Descent.
- Convexity: A convex function has a single global minimum and no other local minima, so Gradient Descent (with a suitable learning rate) converges to the optimal solution.
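The Log Loss named above can be computed for a single example in a few lines (assuming labels y ∈ {0, 1} and a predicted probability p):

```python
import math

def log_loss(y, p):
    # -[y·log(p) + (1 - y)·log(1 - p)] for one example
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

print(log_loss(1, 0.9))  # ≈ 0.105 — confident and correct: low loss
print(log_loss(1, 0.1))  # ≈ 2.303 — confident and wrong: heavily penalized
```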
Glossary
For a complete list of terms and definitions, visit the Machine Learning Glossary.
Next Steps
Now that you understand how a simple linear model is trained using Gradient Descent, you are ready to explore more complex, non-linear algorithms. Move on to the next module: Classical ML.