# Module Review: Classical ML
> [!NOTE]
> This review chapter consolidates your learning on Classical ML. Use the key takeaways, interactive flashcards, and cheat sheet to ensure you deeply understand tree-based algorithms and ensemble methods.
## Key Takeaways
- Decision Trees partition the feature space to make predictions. They are interpretable and fast but prone to overfitting.
- Impurity Measures like Gini Impurity and Entropy evaluate the quality of a candidate split. The tree-growing algorithm chooses the split that maximizes Information Gain, i.e. the reduction in impurity from the parent node to its children.
- Ensemble Learning combines multiple models to create a stronger predictor. It relies on the “Wisdom of Crowds”: models whose errors are largely independent tend to cancel each other out when combined.
- Bagging (Bootstrap Aggregating) reduces variance by training independent models on random subsets of data (with replacement). Random Forests use Bagging plus feature randomness.
- Gradient Boosting reduces bias by training models sequentially. Each new model attempts to predict the residual errors (pseudo-residuals) of the combined previous models.
- OOB Error in Random Forests provides a free, built-in validation metric: each tree's bootstrap sample leaves out roughly 36.8% (≈ 1/e) of the training rows, so those rows can score that tree as unseen data.
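The split-selection idea from the takeaways can be sketched in a few lines: compute the parent node's entropy, then scan candidate thresholds for the split with the largest Information Gain. The function names and the tiny dataset below are illustrative, not from any particular library:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(xs, ys):
    """Scan candidate thresholds on one feature and return the
    (threshold, information_gain) pair with the largest gain."""
    parent = entropy(ys)
    best_t, best_gain = None, 0.0
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue  # degenerate split: one side is empty
        w = len(left) / len(ys)
        gain = parent - (w * entropy(left) + (1 - w) * entropy(right))
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain

# A perfectly separable toy feature: splitting at x <= 2 yields pure children.
print(best_split([1, 2, 3, 4], [0, 0, 1, 1]))  # (2, 1.0)
```

A real implementation would repeat this scan over every feature and recurse into the children, but the core "maximize the impurity drop" step is exactly this loop.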
## Interactive Flashcards
Test your recall of key Classical ML concepts.
**What is Gini Impurity?**
A metric used in Decision Trees to measure how often a randomly chosen element from a set would be incorrectly labeled if it were randomly labeled according to the distribution of labels in the subset.
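That definition translates directly into code; this small helper (an illustrative sketch, not a library function) computes it from a list of labels:

```python
from collections import Counter

def gini_impurity(labels):
    """1 - sum(p_i^2): the probability that two draws (with replacement)
    from this node disagree on the label."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini_impurity(["a", "a", "b", "b"]))  # 0.5  (worst case for two classes)
print(gini_impurity(["a", "a", "a", "a"]))  # 0.0  (pure node)
```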
**How does a Random Forest inject randomness?**
Through two methods: 1) Bagging (training each tree on a bootstrapped sample of data), and 2) Feature Randomness (considering only a random subset of features for each split).
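Both injection points can be sketched with the standard library's `random` module (a toy illustration; in scikit-learn these correspond to the `bootstrap` and `max_features` parameters of `RandomForestClassifier`):

```python
import random

random.seed(0)  # for reproducibility of the sketch

def draw_tree_randomness(n_rows, n_features, max_features):
    """The two sources of randomness behind one Random Forest tree:
    a bootstrap sample of rows (with replacement) and, at each split,
    a random subset of features (without replacement)."""
    bootstrap_rows = [random.randrange(n_rows) for _ in range(n_rows)]
    split_features = random.sample(range(n_features), max_features)
    return bootstrap_rows, split_features

rows, feats = draw_tree_randomness(n_rows=10, n_features=8, max_features=3)
print(len(rows), sorted(feats))
```

Note the asymmetry: rows are drawn *with* replacement (so some repeat and some are missed, which is what makes OOB error possible), while the feature subset for a split is drawn *without* replacement.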
**What is the main difference between Bagging and Boosting?**
Bagging builds independent models in parallel to reduce variance (overfitting). Boosting builds sequential models, where each tries to correct the errors of its predecessors, to reduce bias (underfitting).
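The structural difference also shows up in how predictions are combined: bagging averages independent models, while boosting sums scaled sequential corrections. A minimal sketch (the model functions below are hypothetical stand-ins for trained trees):

```python
def bagging_predict(models, x):
    """Average the predictions of independently trained models."""
    return sum(m(x) for m in models) / len(models)

def boosting_predict(models, x, lr=0.1):
    """Sum each sequential model's correction, scaled by the learning rate."""
    return sum(lr * m(x) for m in models)

# Stand-in "models" that ignore x and return constants:
trees = [lambda x: 1.0, lambda x: 3.0]
print(bagging_predict(trees, None))        # 2.0
print(boosting_predict(trees, None, 0.5))  # 2.0
```

The averaging in bagging is what shrinks variance; the scaled summing in boosting is what lets each later model chip away at the remaining bias.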
**What is the purpose of the learning rate (shrinkage) in Gradient Boosting?**
It scales the contribution of each new tree. A smaller learning rate prevents the model from overfitting too quickly to the residuals, usually leading to better generalization.
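The shrinkage effect can be seen in a deliberately tiny sketch where each "weak learner" is just the mean of the current residuals (the pseudo-residuals for squared loss). This is an illustration of the mechanism, not a real boosted-tree implementation:

```python
def boost_constant(y, n_rounds, lr):
    """Toy gradient boosting for squared error: every weak learner
    predicts the mean of the current residuals, and lr scales how
    much of that correction is actually applied."""
    pred = 0.0
    for _ in range(n_rounds):
        residual_mean = sum(yi - pred for yi in y) / len(y)
        pred += lr * residual_mean  # shrinkage step
    return pred

y = [2.0, 4.0, 6.0]  # target mean is 4.0
print(boost_constant(y, n_rounds=1, lr=1.0))   # 4.0: one greedy step fits the residuals immediately
print(boost_constant(y, n_rounds=10, lr=0.1))  # ≈ 2.61: small steps creep toward 4.0 gradually
```

With `lr = 1.0` the model jumps straight onto the residuals in one round; with a small learning rate it needs many rounds, which in a real ensemble is exactly the slower, better-generalizing fitting described above.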
## Cheat Sheet
| Algorithm | Goal | How it Works | Key Hyperparameters |
|---|---|---|---|
| Decision Tree | Predict target | Recursively splits data to maximize purity. | max_depth, min_samples_split |
| Random Forest | Reduce Variance | Averages predictions from many independent trees trained on bootstrapped data using random feature subsets. | n_estimators, max_features |
| Gradient Boosting | Reduce Bias | Trains trees sequentially. Each tree predicts the negative gradient (pseudo-residual) of the loss function. | learning_rate, n_estimators, max_depth |
## Quick Revision
- Decision Tree Inference Time: O(depth) comparisons per sample
- Random Forest Primary Effect: Fixes high variance (overfitting).
- Gradient Boosting Primary Effect: Fixes high bias (underfitting).
- OOB Coverage: Roughly 63.2% of data is used to train a given tree; 36.8% is Out-of-Bag.
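The 63.2% / 36.8% figures follow directly from bootstrap sampling: in n draws with replacement, a given row is missed with probability (1 − 1/n)^n, which tends to 1/e as n grows. A quick numerical check:

```python
import math

def oob_fraction(n):
    """Probability that a given row appears in none of n draws
    with replacement, i.e. ends up Out-of-Bag."""
    return (1 - 1 / n) ** n

print(oob_fraction(10))      # 0.348..., already close to the limit
print(oob_fraction(10_000))  # 0.3678..., essentially 1/e
print(math.exp(-1))          # 0.36787944...
```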
## Glossary
For a complete list of terms and definitions, visit the Machine Learning Glossary.
## Next Steps
Now that you have mastered the classical tabular algorithms, you are ready to dive into the next module: Model Evaluation. We will learn how to properly cross-validate these models and evaluate them using metrics beyond simple accuracy.