Residual Analysis & Diagnostics

[!IMPORTANT] Fitting a model is easy; validating it is hard. If your residuals show a pattern, your model is missing something.

The Ordinary Least Squares (OLS) method is powerful, but it relies on strict assumptions. If these are violated, your p-values, confidence intervals, and predictions may be garbage. We check these assumptions by analyzing the residuals (e = y - ŷ).

1. Intuition: The Signal and the Noise

Why do we want our residuals to look like random static (white noise)?

Think about Information Theory.

  • Data = Model + Residuals
  • Information = Pattern + Noise

If your residuals show a pattern (e.g., a curve, a wave), it means there is still information left in the data that your model failed to capture.

  • Ideal Model: Extracts all the pattern. The leftovers (residuals) are pure, uninformative noise (Maximum Entropy).
  • Bad Model: Leaves patterns behind. You left “money on the table.”
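
As a minimal sketch of "information left behind", the snippet below fits a straight line to data that is actually quadratic. The data (`y = x**2` on the integers 0 to 10) and the helper `fit_line` are made up for illustration and are not from any library. The fit looks strong by R-squared, yet the residual signs form a clear U.

```python
def fit_line(x, y):
    """Closed-form simple OLS for one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Illustrative data: the true relationship is quadratic, not linear.
x = list(range(11))
y = [xi ** 2 for xi in x]

intercept, slope = fit_line(x, y)
fitted = [intercept + slope * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]

# R-squared = 1 - SS_res / SS_tot
mean_y = sum(y) / len(y)
ss_res = sum(r ** 2 for r in residuals)
ss_tot = sum((yi - mean_y) ** 2 for yi in y)
r_squared = 1 - ss_res / ss_tot

# Sign of each residual, left to right: a run of +, then -, then + is a "U".
signs = "".join("+" if r > 0 else "-" for r in residuals)

print(f"R^2 = {r_squared:.3f}")  # R^2 = 0.928
print(signs)                     # ++-------++
```

Random noise would scatter the plus and minus signs; the long unbroken run of minus signs in the middle is the pattern the straight line failed to capture.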

2. The Four Assumptions (L.I.H.N.)

We remember the assumptions using the acronym LIHN:

  1. Linearity: The relationship is actually linear.
    • Check: Residuals vs. Fitted plot should show no clear pattern (just a random cloud).
  2. Independence: Errors are independent (no serial correlation).
    • Check: Durbin-Watson test (values near 2 are good). Critical for time-series data.
  3. Homoscedasticity: Constant variance of errors.
    • Check: Residuals vs. Fitted plot should show a band of constant width. No “funnel” shapes.
  4. Normality: Errors are normally distributed.
    • Check: Q-Q plot of the standardized residuals should follow the 45-degree line.
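
For the Independence check, the Durbin-Watson statistic is simple enough to compute by hand: it is the sum of squared differences between consecutive residuals divided by the sum of squared residuals, and it ranges from 0 to 4. The sketch below uses three made-up residual sequences purely for illustration. In practice, statsmodels ships a `durbin_watson` function in `statsmodels.stats.stattools`.

```python
def durbin_watson(residuals):
    """DW = sum((e[t] - e[t-1])^2) / sum(e[t]^2), for residuals in time order."""
    num = sum((curr - prev) ** 2 for prev, curr in zip(residuals, residuals[1:]))
    den = sum(e ** 2 for e in residuals)
    return num / den


# Made-up residual sequences (time-ordered), chosen to show the three regimes.
drifting = [1, 1, 1, -1, -1, -1]          # neighbours tend to agree
balanced = [1, -1, -1, 1, 1, -1, -1, 1]   # agree and disagree about equally
alternating = [1, -1, 1, -1, 1, -1]       # neighbours tend to disagree

for name, resid in [("drifting", drifting),
                    ("balanced", balanced),
                    ("alternating", alternating)]:
    print(f"{name:12s} DW = {durbin_watson(resid):.2f}")
# drifting     DW = 0.67
# balanced     DW = 2.00
# alternating  DW = 3.33
```

Values well below 2 point to positive serial correlation, values well above 2 to negative serial correlation.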

3. Hardware Reality: The Autocorrelation Trap

In system design and finance, the Independence assumption is the most dangerous one to violate.

  • Scenario: You measure CPU usage every second.
  • The Trap: High CPU usage at t likely means high CPU usage at t+1. The errors are correlated (Autocorrelation).
  • The Consequence: OLS assumes every data point adds new information. If points are positively correlated, they are partly “echoes” of each other, so OLS behaves as if you had a larger sample than you effectively do.
  • Result: Your Standard Errors (SE) are underestimated and your t-statistics are inflated. You get P < 0.00001 and think you found a breakthrough, when the effect may be nothing more than noise.
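
The trap can be seen in a small simulation. The sketch below makes two assumptions that are not from the text above: the errors follow an AR(1) process with coefficient `phi`, and the "model" is the simplest possible one, an intercept only (the sample mean). It compares the naive standard error `s / sqrt(n)` with how much the estimate actually varies across repeated runs. The function names and parameter values are illustrative.

```python
import random
import statistics


def ar1_series(n, phi, rng):
    """Simulate n steps of e[t] = phi * e[t-1] + noise, starting from 0."""
    values, prev = [], 0.0
    for _ in range(n):
        prev = phi * prev + rng.gauss(0.0, 1.0)
        values.append(prev)
    return values


def naive_vs_actual_se(phi, n=200, reps=500, seed=0):
    """Return (average naive SE of the mean, observed SD of the mean across runs)."""
    rng = random.Random(seed)
    means, naive_ses = [], []
    for _ in range(reps):
        series = ar1_series(n, phi, rng)
        means.append(statistics.fmean(series))
        # The textbook formula s / sqrt(n) assumes independent observations.
        naive_ses.append(statistics.stdev(series) / n ** 0.5)
    return statistics.fmean(naive_ses), statistics.stdev(means)


for phi in (0.0, 0.9):
    naive, actual = naive_vs_actual_se(phi)
    print(f"phi={phi}: naive SE={naive:.3f}, actual SE={actual:.3f}, "
          f"understated by {actual / naive:.1f}x")
```

With `phi=0.0` the two numbers should roughly agree. With `phi=0.9` the naive figure should come out several times too small, which is exactly how inflated t-statistics arise.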

4. The Pattern Matcher

Learn to spot the violations by comparing Residuals vs. Fitted plots for different datasets: truly linear data gives a random cloud, data with a missing curved term gives a “U” shape, and data whose noise grows with the fitted value gives a funnel.
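
The patterns can also be told apart numerically. The sketch below simulates three datasets (linear, curved, funnel) and computes two rough scores from the residuals of a straight-line fit. Everything here is an illustrative assumption rather than a standard test: the coefficients, the noise levels, and the two scores themselves (`curve` is the correlation of residuals with squared distance from the centre of the fitted values, `funnel` is the correlation of absolute residuals with fitted values).

```python
import random


def fit_line(x, y):
    """Closed-form simple OLS for one predictor: returns (intercept, slope)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def corr(a, b):
    """Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5


def pattern_scores(x, y):
    """Return (curve score, funnel score) for the residuals of a line fit."""
    intercept, slope = fit_line(x, y)
    fitted = [intercept + slope * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    center = sum(fitted) / len(fitted)
    curve = corr(resid, [(fi - center) ** 2 for fi in fitted])
    funnel = corr([abs(r) for r in resid], fitted)
    return curve, funnel


rng = random.Random(42)
x = [i / 10 for i in range(1, 201)]  # 0.1, 0.2, ..., 20.0
datasets = {
    "linear": [2 + 3 * xi + rng.gauss(0, 2) for xi in x],
    "curved": [2 + 0.5 * xi ** 2 + rng.gauss(0, 2) for xi in x],
    "funnel": [2 + 3 * xi + rng.gauss(0, 0.5 * xi) for xi in x],
}

scores = {name: pattern_scores(x, y) for name, y in datasets.items()}
for name, (curve, funnel) in scores.items():
    print(f"{name:7s} curve score = {curve:+.2f}, funnel score = {funnel:+.2f}")
```

A healthy fit should score near zero on both. A curve score near 1 suggests a missing non-linear term, and a clearly positive funnel score suggests variance that grows with the fitted value. These scores are a supplement to the plot, not a replacement for looking at it.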


5. Implementation: Calculating Residuals

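Here is how to calculate residuals and their Root Mean Square Error (RMSE) in Go and Java. RMSE measures the typical size of a residual; it matches their standard deviation when the residuals average to zero, as they do for an OLS fit with an intercept. The Python example then builds the diagnostic plots with statsmodels and matplotlib. The most critical plot is Residuals vs Fitted.

Go: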

package main

import (
  "fmt"
  "math"
)

func main() {
  yTrue := []float64{2, 4, 5, 4, 5}
  yPred := []float64{1.8, 3.9, 5.2, 4.1, 4.8}

  var sumSquaredResiduals float64
  for i := 0; i < len(yTrue); i++ {
    residual := yTrue[i] - yPred[i]
    sumSquaredResiduals += residual * residual
    fmt.Printf("Obs %d: Residual = %.2f\n", i, residual)
  }

  rmse := math.Sqrt(sumSquaredResiduals / float64(len(yTrue)))
  fmt.Printf("RMSE: %.4f\n", rmse)
}

Java:

public class ResidualsExample {
  public static void main(String[] args) {
    double[] yTrue = {2, 4, 5, 4, 5};
    double[] yPred = {1.8, 3.9, 5.2, 4.1, 4.8};

    double sumSquaredResiduals = 0;
    for (int i = 0; i < yTrue.length; i++) {
      double residual = yTrue[i] - yPred[i];
      sumSquaredResiduals += residual * residual;
      System.out.printf("Obs %d: Residual = %.2f\n", i, residual);
    }

    double rmse = Math.sqrt(sumSquaredResiduals / yTrue.length);
    System.out.printf("RMSE: %.4f\n", rmse);
  }
}

Python:

import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

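# 1. Fit the model
# Synthetic placeholder data so the script runs end to end;
# replace X and y with your own predictor(s) and response.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=200)
y = 2.0 + 3.0 * X + rng.normal(0, 1.5, size=200)
model = sm.OLS(y, sm.add_constant(X))
results = model.fit()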

# 2. Get Residuals and Fitted Values
residuals = results.resid
fitted_vals = results.fittedvalues

# 3. Create a 2x2 Diagnostic Plot
fig, ax = plt.subplots(2, 2, figsize=(12, 10))

# A. Residuals vs Fitted (Check Linearity & Homoscedasticity)
ax[0, 0].scatter(fitted_vals, residuals, alpha=0.5)
ax[0, 0].axhline(0, color='red', linestyle='--')
ax[0, 0].set_xlabel('Fitted Values')
ax[0, 0].set_ylabel('Residuals')
ax[0, 0].set_title('Residuals vs Fitted')

# B. Q-Q Plot (Check Normality)
# fit=True standardizes the residuals first, so the 45-degree line is a valid reference
sm.qqplot(residuals, fit=True, line='45', ax=ax[0, 1])
ax[0, 1].set_title('Normal Q-Q')

# C. Scale-Location (Check Homoscedasticity)
# Square root of the absolute standardized residuals vs fitted
sqrt_abs_std_resid = np.sqrt(np.abs(results.get_influence().resid_studentized_internal))
ax[1, 0].scatter(fitted_vals, sqrt_abs_std_resid, alpha=0.5)
ax[1, 0].set_xlabel('Fitted Values')
ax[1, 0].set_ylabel('Sqrt(|Standardized Residuals|)')
ax[1, 0].set_title('Scale-Location')

# D. Histogram of Residuals (Check Normality)
ax[1, 1].hist(residuals, bins=15, edgecolor='black', alpha=0.7)
ax[1, 1].set_title('Histogram of Residuals')

plt.tight_layout()
plt.show()

Interpretation Guide

| Plot | What to look for | Violation Sign | Solution |
|---|---|---|---|
| Residuals vs Fitted | Random scatter around 0 | Curved “U” shape | Add polynomial terms (x²) |
| Residuals vs Fitted | Constant width band | Funnel shape | Log-transform y or Weighted Least Squares (WLS) |
| Normal Q-Q | Points on the red line | Points deviating at tails | Robust regression or check for outliers |
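
As a rough sketch of the WLS remedy for a funnel, the snippet below simulates data whose noise standard deviation grows in proportion to x, fits plain OLS, then refits with weights 1/x². The weight choice, the simulated coefficients, and the helpers `fit_weighted_line` and `spread_ratio` are all assumptions made for illustration. With statsmodels, the equivalent fit would go through `sm.WLS` and its `weights` argument.

```python
import random
import statistics


def fit_weighted_line(x, y, w):
    """Weighted least squares for one predictor; equal weights give plain OLS."""
    total = sum(w)
    mean_x = sum(wi * xi for wi, xi in zip(w, x)) / total
    mean_y = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxx = sum(wi * (xi - mean_x) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - mean_x) * (yi - mean_y) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def spread_ratio(values):
    """SD of the second half divided by SD of the first half (x is sorted)."""
    half = len(values) // 2
    return statistics.stdev(values[half:]) / statistics.stdev(values[:half])


rng = random.Random(7)
n = 400
x = [1 + 19 * i / (n - 1) for i in range(n)]           # 1 .. 20, sorted
y = [2 + 3 * xi + rng.gauss(0, 0.5 * xi) for xi in x]  # noise SD grows with x

# Plain OLS: residual spread widens from left to right (the funnel).
b0, b1 = fit_weighted_line(x, y, [1.0] * n)
ols_resid = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# WLS with weights 1 / x^2 (assumes the error SD is proportional to x).
weights = [1 / xi ** 2 for xi in x]
w0, w1 = fit_weighted_line(x, y, weights)
# Residuals scaled by sqrt(weight) = 1 / x should have roughly constant spread.
wls_scaled = [(yi - (w0 + w1 * xi)) / xi for xi, yi in zip(x, y)]

print(f"OLS residual spread ratio (right/left): {spread_ratio(ols_resid):.2f}")
print(f"WLS scaled-residual spread ratio:       {spread_ratio(wls_scaled):.2f}")
```

A spread ratio well above 1 is the funnel; a ratio close to 1 after weighting suggests the chosen weights match how the variance actually grows. If the variance grows differently in your data, these weights would be the wrong choice.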

[!TIP] Heteroscedasticity often occurs when analyzing financial data (e.g., income vs. spending). High-income earners have more variance in spending than low-income earners.

6. Summary

  • Always plot your residuals. R-squared alone is deceptive.
  • Ideal Residuals: Random noise, constant variance, normally distributed.
  • Violations:
    • Curve: Model is underfitting (needs non-linear terms).
    • Funnel: Heteroscedasticity (needs transformation).
    • Outliers: Data quality issues or special cases.
