Decision Trees

[!NOTE] This chapter unpacks Decision Trees, the fundamental building block of many powerful machine learning algorithms. We will explore the intuition behind “asking the right questions,” dive into the mathematics of impurity measures (Gini and Entropy), and implement a tree from first principles.

1. Intuition: The Game of 20 Questions

Imagine playing “20 Questions.” Your goal is to identify a secret object using as few yes/no questions as possible. You wouldn’t start by asking, “Is it a specific brand of left-handed scissors?” Instead, you ask broad questions that split the possibilities roughly in half: “Is it alive?” or “Is it bigger than a microwave?”

A Decision Tree operates on exactly this principle. It is a hierarchical model that recursively partitions the feature space into distinct regions. At each node, the tree asks a “question” (e.g., “Is Feature X1 > 5.2?”) and splits the data based on the answer. This process continues until it reaches a leaf node, which represents the final prediction.

Unlike linear models that try to draw a single straight line through the data, Decision Trees can capture complex, non-linear relationships by combining many simple, orthogonal splits.

1.1 The Hardware Reality

When considering algorithmic performance, Big-O notation often abstracts away hardware realities. However, understanding how Decision Trees map to memory is crucial:

  • Prediction is Fast: Making a prediction with a Decision Tree involves a sequence of simple if-else comparisons. This translates to very few CPU instructions. The depth of the tree (typically O(log N) for balanced trees) dictates the maximum number of comparisons.
  • Branch Misprediction: CPU branch predictors try to guess the outcome of if-else statements. Highly unpredictable splits in a tree might cause pipeline flushes, slightly degrading performance. However, for most tabular data, the sheer speed of simple comparisons outweighs this.
  • Memory Locality: The tree structure itself can be scattered in memory. Traversing a deep tree might involve pointer chasing, which can lead to L1/L2 cache misses if the nodes aren’t allocated contiguously. Modern implementations often use flattened array representations to improve cache locality.
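The flattened-array idea can be sketched with plain Python lists. The node layout below (parallel arrays indexed by node id, with -1 marking a leaf) is an illustrative assumption for this sketch, not the layout of any particular library:

```python
# Hypothetical tree flattened into parallel arrays (index = node id).
# feature[i] == -1 marks a leaf; `value` then holds the predicted class.
feature   = [0,  -1,   1,  -1, -1]     # feature index tested at each node
threshold = [5.2, 0.0, 1.0, 0.0, 0.0]  # split threshold at internal nodes
left      = [1,  -1,   3,  -1, -1]     # index of the left child
right     = [2,  -1,   4,  -1, -1]     # index of the right child
value     = [-1,  0,  -1,   0,  1]     # class prediction at leaf nodes

def predict_one(x):
    """Walk the array-based tree from the root (node 0) down to a leaf."""
    node = 0
    while feature[node] != -1:              # internal node: test and descend
        if x[feature[node]] <= threshold[node]:
            node = left[node]
        else:
            node = right[node]
    return value[node]

print(predict_one([3.0, 2.0]))   # x1 <= 5.2 -> leaf node 1 -> class 0
```

Because every node lives in the same few contiguous arrays, a traversal touches memory that is far more cache-friendly than chasing pointers between individually allocated node objects.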

2. Interactive Analysis: Splitting the Data

Let’s visualize how a Decision Tree finds the best split. In the interactive diagram below, we have a dataset with two features (X1 and X2) and two classes (Blue and Orange). The algorithm evaluates different split points to maximize the “purity” of the resulting child nodes.

[Interactive diagram placeholder: clicking the canvas evaluates a horizontal (X2) or vertical (X1) split and reports the split feature and value, the parent impurity (0.500 for this evenly mixed dataset), the left and right child impurities, and the resulting Information Gain.]

3. The Rigor: Impurity and Information Gain

How does an algorithm choose the best question? It needs a mathematical metric to evaluate how well a split separates the classes. This is where impurity measures come in.

The goal is to move from a state of high uncertainty (a node with mixed classes) to states of low uncertainty (pure child nodes).

3.1 Impurity Metrics

Let p_i be the probability (or fraction) of items belonging to class i in a given node, for classes i = 1, …, C.

1. Gini Impurity (CART algorithm)

Gini impurity measures the probability of incorrectly classifying a randomly chosen element from the set if it were randomly labeled according to the distribution of labels in the subset.

Gini = 1 - Σ_{i=1}^{C} p_i^2

  • Minimum (0.0): Perfect purity (all elements belong to one class).
  • Maximum (0.5 for binary): Maximum uncertainty (elements are evenly distributed among classes).
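A quick numeric spot-check of these extremes, applying the formula above to hand-picked probability vectors (a minimal sketch, not a full implementation):

```python
import numpy as np

def gini(p):
    """Gini impurity from a vector of class probabilities p_i."""
    p = np.asarray(p, dtype=float)
    return float(1.0 - np.sum(p ** 2))

print(gini([1.0, 0.0]))    # pure node -> 0.0
print(gini([0.5, 0.5]))    # 50/50 binary node -> 0.5
print(gini([0.25] * 4))    # 4 evenly mixed classes -> 0.75
```

Note that with more than two classes the maximum rises above 0.5: an evenly mixed C-class node has impurity 1 - 1/C.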

2. Entropy (ID3, C4.5 algorithms)

Entropy is a concept borrowed from Information Theory, representing the expected “surprise” or uncertainty in the node.

Entropy = -Σ_{i=1}^{C} p_i · log2(p_i)

  • Minimum (0.0): Perfect purity.
  • Maximum (1.0 for binary): Maximum uncertainty.
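The same spot-check for entropy (a sketch; it uses the standard convention that 0 · log2(0) = 0, handled here by dropping zero probabilities):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits from a vector of class probabilities p_i."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                         # convention: 0 * log2(0) = 0
    return float(-np.sum(p * np.log2(p)))

print(entropy([0.5, 0.5]))    # maximum binary uncertainty -> 1.0
print(entropy([0.9, 0.1]))    # mostly pure node: low surprise, well below 1
```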

[!TIP] In practice, Gini and Entropy yield very similar trees. Gini is often preferred because it avoids the computationally expensive log2 operation, making tree construction slightly faster.

3.2 Information Gain

Once we have a way to measure impurity, we evaluate a potential split by calculating the Information Gain (IG). This is the difference between the parent node’s impurity and the weighted average impurity of the resulting child nodes.

IG(D, A) = I(D) - Σ_{v ∈ Values(A)} (|D_v| / |D|) · I(D_v)

Where:

  • D is the parent dataset.
  • A is the feature being evaluated for the split.
  • I() is the impurity function (Gini or Entropy).
  • D_v is the subset of D where feature A takes value v, and |D_v| / |D| is the fraction of the parent's samples that fall into it. (For a numeric threshold split, Values(A) is simply {left, right}.)

The algorithm evaluates every possible split point for every feature and greedily selects the split that maximizes Information Gain.
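A worked example on a made-up split, using Gini as the impurity function I: a 50/50 parent divided into two 4-to-1 children.

```python
import numpy as np

def gini_impurity(y):
    """Gini impurity of an array of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return float(1.0 - np.sum(p ** 2))

# Parent node: 5 Blue (0) and 5 Orange (1) samples -> impurity 0.5.
parent = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
left   = np.array([0, 0, 0, 0, 1])   # candidate split: 4 Blue, 1 Orange
right  = np.array([0, 1, 1, 1, 1])   #                  1 Blue, 4 Orange

# Weighted average child impurity: each child holds 5/10 of the samples,
# and each has Gini = 1 - (0.8^2 + 0.2^2) = 0.32.
weighted_children = (len(left) / len(parent)) * gini_impurity(left) \
                  + (len(right) / len(parent)) * gini_impurity(right)
ig = gini_impurity(parent) - weighted_children
print(round(ig, 3))   # IG = 0.5 - 0.32 = 0.18
```

The split is imperfect (each child still holds one minority sample), yet it already removes 0.18 of the parent's 0.5 impurity; the greedy search simply picks whichever candidate removes the most.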

4. Implementation: Building the Tree

Let’s look at how we might implement the core logic for finding the best split in Python using NumPy.

Python (NumPy)
```python
import numpy as np


def gini_impurity(y):
    """Calculates Gini impurity for an array of labels."""
    m = len(y)
    if m == 0:
        return 0.0
    # Calculate probabilities for each class
    _, counts = np.unique(y, return_counts=True)
    probabilities = counts / m
    # Gini = 1 - sum(p_i^2)
    return 1.0 - np.sum(probabilities ** 2)


def find_best_split(X, y, feature_index):
    """Finds the optimal threshold for a single feature."""
    m = len(y)
    if m <= 1:
        return None, 0.0

    parent_gini = gini_impurity(y)
    best_ig = 0.0
    best_threshold = None

    # Get unique sorted values for the feature
    thresholds = np.unique(X[:, feature_index])

    # Evaluate midpoints between consecutive unique values as potential thresholds
    for i in range(len(thresholds) - 1):
        threshold = (thresholds[i] + thresholds[i + 1]) / 2.0

        # Create masks for left and right splits
        left_mask = X[:, feature_index] <= threshold
        right_mask = ~left_mask
        y_left, y_right = y[left_mask], y[right_mask]

        if len(y_left) == 0 or len(y_right) == 0:
            continue

        # Calculate weighted child impurity
        weight_left = len(y_left) / m
        weight_right = len(y_right) / m
        child_gini = (weight_left * gini_impurity(y_left)
                      + weight_right * gini_impurity(y_right))

        # Calculate Information Gain
        ig = parent_gini - child_gini
        if ig > best_ig:
            best_ig = ig
            best_threshold = threshold

    return best_threshold, best_ig
```

[!WARNING] The greedy nature of Decision Trees means they find the locally optimal split at each step. This does not guarantee finding the globally optimal tree structure, which is an NP-complete problem.

5. Overfitting and Regularization

A Decision Tree grown to its maximum depth (until every leaf is perfectly pure) will almost certainly overfit the training data. It will learn the noise and memorize every outlier, leading to poor generalization on unseen data.

We must constrain (regularize) the tree’s growth. Common hyperparameter strategies include:

  1. Maximum Depth (max_depth): Hard limit on how deep the tree can grow.
  2. Minimum Samples for Split (min_samples_split): A node must have at least this many samples to be considered for a split.
  3. Minimum Samples in Leaf (min_samples_leaf): A split is only valid if it leaves at least this many samples in both resulting child nodes.
  4. Maximum Features (max_features): At each split, only consider a random subset of features. (This is a core component of Random Forests).
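To make the stopping rules concrete, here is a minimal sketch of a recursive builder showing where max_depth, min_samples_split, and min_samples_leaf plug in. It is an illustration under simplified assumptions (raw feature values as thresholds, dict-based nodes), not a production implementation:

```python
import numpy as np

def gini(y):
    """Gini impurity of an array of class labels."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / len(y)
    return float(1.0 - np.sum(p ** 2))

def build_tree(X, y, depth=0, max_depth=3, min_samples_split=2, min_samples_leaf=1):
    """Recursive builder with the three stopping hyperparameters wired in."""
    # Stop: depth limit reached, too few samples to split, or node already pure.
    if depth >= max_depth or len(y) < min_samples_split or gini(y) == 0.0:
        return {"leaf": True, "prediction": int(np.bincount(y).argmax())}
    best = None
    for f in range(X.shape[1]):
        for t in np.unique(X[:, f])[:-1]:        # candidate thresholds
            mask = X[:, f] <= t
            # Reject splits that leave a child below min_samples_leaf.
            if mask.sum() < min_samples_leaf or (~mask).sum() < min_samples_leaf:
                continue
            w = mask.mean()
            ig = gini(y) - (w * gini(y[mask]) + (1 - w) * gini(y[~mask]))
            if best is None or ig > best[0]:
                best = (ig, f, float(t), mask)
    if best is None or best[0] <= 0.0:           # no useful split found
        return {"leaf": True, "prediction": int(np.bincount(y).argmax())}
    _, f, t, mask = best
    grow = lambda Xs, ys: build_tree(Xs, ys, depth + 1, max_depth,
                                     min_samples_split, min_samples_leaf)
    return {"leaf": False, "feature": f, "threshold": t,
            "left": grow(X[mask], y[mask]), "right": grow(X[~mask], y[~mask])}

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
print(build_tree(X, y, max_depth=0)["leaf"])     # depth limit 0 forces a leaf -> True
print(build_tree(X, y)["threshold"])             # best split between the classes -> 2.0
```

In practice you would reach for a library implementation instead, e.g. scikit-learn's DecisionTreeClassifier, which accepts max_depth, min_samples_split, min_samples_leaf, and max_features under exactly these names.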

6. Complexity Analysis

| Phase | Time Complexity | Space Complexity | Notes |
| --- | --- | --- | --- |
| Training (Building) | O(N × M × log(N)) | O(N + Depth) | For N samples and M features. Sorting features dominates. |
| Prediction (Inference) | O(Depth) | O(1) | Typically O(log N) for a balanced tree. Extremely fast. |

In the next chapter, we will see how combining many Decision Trees into an ensemble (Random Forests) overcomes the inherent instability and overfitting tendencies of a single tree.