Random Forests
[!NOTE] This chapter elevates the humble Decision Tree by introducing the power of Ensemble Learning. We will explore how combining many weak, noisy models can create a strong, robust predictor. We dive into the mechanics of Bagging and feature randomness, and explain why Random Forests are incredibly resilient to overfitting.
1. Intuition: The Wisdom of Crowds
Imagine trying to guess the exact weight of a cow at a state fair. If you ask one person, their guess might be wildly inaccurate. However, if you ask 1,000 independent people and average their guesses, the result is often astonishingly close to the true weight. This phenomenon is known as the “Wisdom of Crowds.”
In machine learning, this translates to Ensemble Learning. A single, fully grown Decision Tree is prone to overfitting: it fits the training data almost perfectly but fails to generalize. It has low bias but high variance.
A Random Forest addresses this by building a “crowd” of Decision Trees. By ensuring that each tree in the forest is slightly different (independent) and then aggregating their predictions, we drastically reduce the overall variance without significantly increasing the bias.
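The variance-reduction claim is easy to verify numerically. The sketch below simulates the cow-guessing game with made-up numbers (a "true" weight of 600 and guess noise of 50 are illustrative assumptions, not data from the original anecdote):

```python
import numpy as np

rng = np.random.default_rng(0)
true_weight = 600.0  # hypothetical true value being estimated

# 1,000 independent noisy guesses (standard deviation 50)
guesses = true_weight + rng.normal(0, 50, size=1000)

single_error = abs(guesses[0] - true_weight)
crowd_error = abs(guesses.mean() - true_weight)

print(f"single guess error: {single_error:.1f}")
print(f"crowd average error: {crowd_error:.1f}")
```

Averaging 1,000 independent estimates shrinks the standard deviation of the error by a factor of √1000 ≈ 32, which is exactly the effect a Random Forest exploits.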
1.1 The Two Pillars of Randomness
For the wisdom of crowds to work, the “people” (trees) must make independent errors. If every tree is identical, averaging them does nothing. Random Forests inject randomness in two specific ways:
- Bagging (Bootstrap Aggregating): Each tree is trained on a different random subset of the data, drawn with replacement.
- Feature Randomness: When deciding how to split a node, the tree is only allowed to consider a random subset of the available features.
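The two pillars can be sketched in a few lines of NumPy (the dataset sizes and the √K rule for the feature subset are illustrative; √K is a common default for classification):

```python
import numpy as np

rng = np.random.default_rng(42)
N, n_features = 100, 10

# Pillar 1 - Bagging: draw N row indices with replacement
boot_idx = rng.integers(0, N, size=N)

# Pillar 2 - Feature randomness: at each split, consider only a
# random subset of features (sqrt of the feature count is a common default)
k = int(np.sqrt(n_features))
feat_idx = rng.choice(n_features, size=k, replace=False)

print(f"unique rows in bootstrap sample: {len(np.unique(boot_idx))} / {N}")
print(f"features considered at this split: {sorted(feat_idx)}")
```

Note that the bootstrap sample contains roughly 63% unique rows; the rest are duplicates, which is exactly what makes each tree see a different slice of the data.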
2. Interactive Analysis: Ensemble Voting
Let’s visualize how aggregation smooths out the noise. The interactive demo below simulates a classification problem. The actual decision boundary is complex. We train several “weak” Decision Trees. Notice how an individual tree’s boundary is jagged and overfits the noise, but the “Ensemble Vote” creates a much smoother, more robust boundary.
*Interactive demo: compare individual tree boundaries vs. the ensemble average.*
3. The Rigor: Bagging and OOB Error
Let’s look mathematically at how Bagging works and a unique advantage it provides.
3.1 Bootstrap Aggregating (Bagging)
Given a standard training set $D$ of size $N$, Bagging generates $M$ new training sets $D_i$, each of size $N$, by sampling from $D$ uniformly and with replacement.
Because we sample with replacement, some observations will be repeated in $D_i$, and some will be left out completely.
- Training: We train a separate, unpruned Decision Tree on each of the M datasets.
- Prediction (Regression): The final prediction is the average of the predictions from all $M$ trees: $\hat{y} = \frac{1}{M} \sum_{i=1}^{M} f_i(x)$
- Prediction (Classification): The final prediction is the majority vote (mode) across all M trees.
3.2 Out-of-Bag (OOB) Error
Because each tree is trained on a bootstrapped sample, approximately 36.8% of the original data is left out of the training set for any specific tree.
Proof: The probability of a specific sample not being chosen in a single draw is $1 - \frac{1}{N}$. The probability of it never being chosen in $N$ independent draws is $\left(1 - \frac{1}{N}\right)^N$. As $N \to \infty$, this converges to $1/e \approx 0.368$.
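A quick numeric check of this limit:

```python
import math

# (1 - 1/N)^N approaches 1/e as N grows
for N in (10, 100, 1000, 100000):
    p_oob = (1 - 1 / N) ** N
    print(f"N = {N:>6}: P(left out) = {p_oob:.4f}")

print(f"1/e        = {1 / math.e:.4f}")
```

Even for modest $N$ the probability is already close to 0.368, which is why the "roughly one third left out" rule of thumb holds for realistic dataset sizes.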
These left-out samples are called Out-of-Bag (OOB) samples. We can use them as a free, built-in validation set!
To evaluate the Random Forest’s performance:
- For every sample in the original dataset, find the trees where this sample was OOB (not used for training).
- Aggregate the predictions only from those specific trees.
- Compare the aggregated prediction to the true label to calculate the OOB Error.
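The three steps above can be sketched by hand on a synthetic dataset (a minimal sketch, assuming scikit-learn's `DecisionTreeClassifier`; the dataset and the choice of 50 trees are illustrative):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
N, M = len(X), 50

trees, oob_masks = [], []
for _ in range(M):
    idx = rng.integers(0, N, size=N)   # bootstrap sample
    oob = np.ones(N, dtype=bool)
    oob[idx] = False                   # rows never drawn are OOB for this tree
    trees.append(DecisionTreeClassifier(max_features="sqrt").fit(X[idx], y[idx]))
    oob_masks.append(oob)

# Step 1 + 2: for each sample, aggregate votes only from trees where it was OOB
votes = np.zeros((N, 2))
for tree, oob in zip(trees, oob_masks):
    votes[oob] += tree.predict_proba(X[oob])

# Step 3: compare aggregated predictions to the true labels
has_vote = votes.sum(axis=1) > 0
oob_pred = votes[has_vote].argmax(axis=1)
oob_error = (oob_pred != y[has_vote]).mean()
print(f"OOB error: {oob_error:.3f}")
```

With 50 trees, virtually every sample is OOB for some tree, so the `has_vote` guard rarely excludes anything.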
[!TIP] The OOB Error is a remarkably accurate estimate of the model’s test error, meaning you often don’t need a separate validation hold-out set when tuning a Random Forest.
4. Feature Importance
Random Forests naturally provide a way to calculate how important each feature is for predicting the target variable.
Gini Importance (Mean Decrease in Impurity): Every time a split is made on a variable X, the weighted Gini impurity of the child nodes is lower than that of the parent node. We can calculate the total decrease in node impurity from all splits on that variable, averaged over all trees. Features that frequently produce large purity gains are considered more important.
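In scikit-learn these scores are exposed as `feature_importances_`, normalized to sum to 1. A short example on the Iris dataset (the hyperparameters are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

data = load_iris()
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(data.data, data.target)

# Mean decrease in impurity per feature, normalized to sum to 1
for name, imp in zip(data.feature_names, forest.feature_importances_):
    print(f"{name:<20} {imp:.3f}")
```

One caveat worth knowing: impurity-based importance tends to inflate the scores of high-cardinality features, so it should be read as a rough ranking rather than an exact attribution.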
5. Implementation: Scikit-Learn
While building a Random Forest from scratch with NumPy is a great exercise, in practice you should use a highly optimized library such as scikit-learn.
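A minimal end-to-end example with `RandomForestClassifier` on a synthetic dataset (the hyperparameter values are illustrative defaults, not recommendations):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

forest = RandomForestClassifier(
    n_estimators=200,      # M: number of trees
    max_features="sqrt",   # features considered at each split
    oob_score=True,        # compute the built-in OOB estimate
    n_jobs=-1,             # train trees on all CPU cores in parallel
    random_state=42,
).fit(X_train, y_train)

print(f"OOB accuracy:  {forest.oob_score_:.3f}")
print(f"Test accuracy: {forest.score(X_test, y_test):.3f}")
```

Note how close `oob_score_` tracks the held-out test accuracy, which is the point made in the tip above: the OOB estimate often makes a separate validation split unnecessary.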
6. Complexity Analysis
| Phase | Time Complexity | Space Complexity | Notes |
|---|---|---|---|
| Training | O(M × N log(N) × K) | O(M × Depth) | M trees, N samples, K features randomly evaluated per split. Easily parallelizable across CPU cores. |
| Prediction | O(M × Depth) | O(1) extra per query | Must traverse all M trees. Slower than a single Decision Tree, but highly parallelizable. |
Random Forests are a powerful “out-of-the-box” algorithm that perform well with minimal tuning. In the next chapter, we look at Gradient Boosting, an ensemble method that builds trees sequentially to focus on correcting errors.