Residual Analysis & Diagnostics
[!IMPORTANT] Fitting a model is easy; validating it is hard. If your residuals show a pattern, your model is missing something.
The Ordinary Least Squares (OLS) method is powerful, but it relies on strict assumptions. If these are violated, your p-values, confidence intervals, and predictions may be garbage. We check these assumptions by analyzing the residuals (e = y - ŷ).
1. Intuition: The Signal and the Noise
Why do we want our residuals to look like random static (white noise)?
Think about Information Theory.
- Data = Model + Residuals
- Information = Pattern + Noise
If your residuals show a pattern (e.g., a curve, a wave), it means there is still information left in the data that your model failed to capture.
- Ideal Model: Extracts all the pattern. The leftovers (residuals) are pure, uninformative noise (Maximum Entropy).
- Bad Model: Leaves patterns behind. You left “money on the table.”
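A minimal sketch of this idea on synthetic data. The data-generating process, the seed, and the helper names `ols_residuals` and `r_squared` are assumptions made for illustration. A straight line is fitted to data that actually follow a quadratic, and the residuals are then regressed on x². A high R² in that second regression means the residuals still carry pattern; once the x² term is in the model, essentially nothing is left to explain.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice
x = np.linspace(-3, 3, 200)
# Assumed true process: quadratic signal plus unit-variance noise
y = 1.0 + 2.0 * x + 1.5 * x**2 + rng.normal(0, 1.0, x.size)

def ols_residuals(design, target):
    """Least-squares fit of target on design; returns the residuals."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return target - design @ coef

def r_squared(design, target):
    """Share of the variance of target explained by design (intercept included)."""
    resid = ols_residuals(design, target)
    return 1.0 - resid.var() / target.var()

ones = np.ones_like(x)
resid_bad = ols_residuals(np.column_stack([ones, x]), y)         # straight line only
resid_good = ols_residuals(np.column_stack([ones, x, x**2]), y)  # with the x^2 term

# How much of each residual vector can still be explained by x^2?
pattern_design = np.column_stack([ones, x**2])
leftover_bad = r_squared(pattern_design, resid_bad)
leftover_good = r_squared(pattern_design, resid_good)

print(f"Pattern left in residuals (linear model):    R^2 = {leftover_bad:.3f}")
print(f"Pattern left in residuals (quadratic model): R^2 = {abs(leftover_good):.3f}")
```

The second value is zero up to rounding by construction: OLS residuals are orthogonal to every column already in the model, so a pattern can only survive in directions the model does not include.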
2. The Four Assumptions (L.I.H.N.)
We remember the assumptions using the acronym LIHN:
- Linearity: The mean of y is a linear function of the predictors.
  - Check: Residuals vs. Fitted plot should show no clear pattern (just a random cloud).
- Independence: Errors are independent (no serial correlation).
  - Check: Durbin-Watson test (values near 2 are good; values toward 0 or 4 signal positive or negative autocorrelation). Critical for time-series data.
- Homoscedasticity: Constant variance of errors.
  - Check: Residuals vs. Fitted plot should show a band of constant width. No “funnel” shapes.
- Normality: Errors are normally distributed.
  - Check: Q-Q plot of the standardized residuals should follow the 45-degree line.
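Two of these checks reduce to simple statistics. The sketch below uses NumPy and the standard library only. The function names `durbin_watson_stat` and `qq_correlation` are illustrative, not a library API, and the simulated "residuals" are assumptions. It computes the Durbin-Watson statistic, DW = Σ(e_t − e_{t−1})² / Σ e_t², and the correlation between the sorted residuals and theoretical normal quantiles, which is a rough numeric stand-in for eyeballing the Q-Q plot.

```python
import numpy as np
from statistics import NormalDist

def durbin_watson_stat(resid):
    """Sum of squared successive differences over the sum of squares.
    Near 2: no lag-1 correlation; toward 0: positive; toward 4: negative."""
    resid = np.asarray(resid, dtype=float)
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

def qq_correlation(resid):
    """Correlation between sorted residuals and standard normal quantiles.
    Values very close to 1 mean the Q-Q points lie near a straight line."""
    resid = np.sort(np.asarray(resid, dtype=float))
    n = resid.size
    probs = (np.arange(1, n + 1) - 0.5) / n  # plotting positions in (0, 1)
    theo = np.array([NormalDist().inv_cdf(p) for p in probs])
    return np.corrcoef(theo, resid)[0, 1]

rng = np.random.default_rng(1)
white = rng.normal(size=500)               # independent, normal "residuals"
ar1 = np.zeros(500)                        # serially correlated "residuals"
for t in range(1, 500):
    ar1[t] = 0.8 * ar1[t - 1] + rng.normal()
skewed = rng.exponential(size=500) - 1.0   # independent but clearly non-normal

dw_white, dw_ar1 = durbin_watson_stat(white), durbin_watson_stat(ar1)
qq_white, qq_skewed = qq_correlation(white), qq_correlation(skewed)
print(f"Durbin-Watson    white noise: {dw_white:.2f}   AR(1): {dw_ar1:.2f}")
print(f"Q-Q correlation  normal: {qq_white:.3f}   skewed: {qq_skewed:.3f}")
```

For a fitted statsmodels model, `statsmodels.stats.stattools.durbin_watson` applied to the residuals should give the same Durbin-Watson value as the helper above.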
3. Hardware Reality: The Autocorrelation Trap
In system design and finance, the Independence assumption is the most dangerous one to violate.
- Scenario: You measure CPU usage every second.
- The Trap: High CPU usage at time t likely means high CPU usage at t+1. The errors are correlated (Autocorrelation).
- The Consequence: OLS assumes every data point adds new information. If points are correlated, they are “echoes” of each other. OLS thinks you have more sample size than you actually do.
- Result: Your Standard Errors (SE) become tiny. Your T-statistics explode. You get P < 0.00001 and think you found a breakthrough, but you actually found nothing.
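The trap can be reproduced with a small simulation. This is a sketch under stated assumptions: both series are AR(1) processes with a hypothetical persistence of 0.9, generated independently of each other, and the helper names are illustrative. Because x and y are unrelated, a 5% test should flag roughly 5% of the trials. The textbook OLS standard error, which assumes independent errors, flags far more.

```python
import numpy as np

def ar1_series(n, phi, rng):
    """Generate an AR(1) series: s[t] = phi * s[t-1] + white noise."""
    s = np.zeros(n)
    noise = rng.normal(size=n)
    for t in range(1, n):
        s[t] = phi * s[t - 1] + noise[t]
    return s

def slope_t_stat(x, y):
    """Textbook OLS t-statistic for the slope in y = a + b*x + e."""
    n = x.size
    xc = x - x.mean()
    b = np.dot(xc, y) / np.dot(xc, xc)
    a = y.mean() - b * x.mean()
    resid = y - (a + b * x)
    sigma2 = np.dot(resid, resid) / (n - 2)   # residual variance
    se_b = np.sqrt(sigma2 / np.dot(xc, xc))   # SE that assumes independent errors
    return b / se_b

def false_positive_rate(phi, n=200, trials=500, seed=0):
    """Share of trials with |t| > 1.96 although x and y are unrelated."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        x = ar1_series(n, phi, rng)
        y = ar1_series(n, phi, rng)   # generated independently of x
        hits += int(abs(slope_t_stat(x, y)) > 1.96)
    return hits / trials

rate_iid = false_positive_rate(phi=0.0)
rate_ar1 = false_positive_rate(phi=0.9)
print(f"False positives, independent samples: {rate_iid:.1%}")
print(f"False positives, AR(1) with phi=0.9:  {rate_ar1:.1%}")
```

Setting `phi=0.0` turns both series into white noise, which serves as the control case where the nominal 5% rate should roughly hold.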
4. Interactive: The Pattern Matcher
Learn to spot the violations: simulate datasets that each break one assumption and compare what their Residuals vs. Fitted plots look like.
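One way to run such a simulation without a plotting window is to summarize the Residuals vs. Fitted plot numerically. In the sketch below, the three data-generating processes, their coefficients, and the helper `binned_residual_summary` are all assumptions chosen for illustration. Each dataset is fitted with a straight line, the residuals are split into three bins by fitted value, and the mean and standard deviation are reported per bin. Bin means that swing away from zero suggest a curve; bin standard deviations that grow suggest a funnel.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 600
x = rng.uniform(1, 10, n)

# Three assumed data-generating processes, each fitted with a straight line
datasets = {
    "good": 3 + 2 * x + rng.normal(0, 1.0, n),                           # linear, constant noise
    "curve": 3 + 2 * x + 0.5 * (x - 5.5) ** 2 + rng.normal(0, 1.0, n),   # missing x^2 term
    "funnel": 3 + 2 * x + rng.normal(0, 0.4 * x, n),                     # noise grows with x
}

def binned_residual_summary(x, y, bins=3):
    """Fit y = a + b*x, then return mean and std of residuals per fitted-value bin."""
    b, a = np.polyfit(x, y, 1)   # polyfit returns the highest degree first
    fitted = a + b * x
    resid = y - fitted
    order = np.argsort(fitted)
    groups = np.array_split(resid[order], bins)
    means = np.array([g.mean() for g in groups])
    stds = np.array([g.std() for g in groups])
    return means, stds

summary = {}
for name, y in datasets.items():
    means, stds = binned_residual_summary(x, y)
    summary[name] = (means, stds)
    print(f"{name:>6}  bin means: {np.round(means, 2)}  bin stds: {np.round(stds, 2)}")
```

To see the pictures instead of the numbers, the same `fitted` and `resid` arrays can be passed to a scatter plot with a horizontal line at zero.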
5. Implementation: Calculating Residuals
Here is how to calculate residuals and summarize their typical size with the Root Mean Square Error (RMSE), starting with Go and Java. The Python script at the end of this section generates the standard diagnostic plots; the most critical one is Residuals vs Fitted.
package main

import (
	"fmt"
	"math"
)

func main() {
	yTrue := []float64{2, 4, 5, 4, 5}
	yPred := []float64{1.8, 3.9, 5.2, 4.1, 4.8}

	// Accumulate squared residuals (e = y - ŷ) while printing each one
	var sumSquaredResiduals float64
	for i := range yTrue {
		residual := yTrue[i] - yPred[i]
		sumSquaredResiduals += residual * residual
		fmt.Printf("Obs %d: Residual = %.2f\n", i, residual)
	}

	// RMSE: square root of the mean squared residual
	rmse := math.Sqrt(sumSquaredResiduals / float64(len(yTrue)))
	fmt.Printf("RMSE: %.4f\n", rmse)
}
public class ResidualsExample {
    public static void main(String[] args) {
        double[] yTrue = {2, 4, 5, 4, 5};
        double[] yPred = {1.8, 3.9, 5.2, 4.1, 4.8};

        // Accumulate squared residuals (e = y - ŷ) while printing each one
        double sumSquaredResiduals = 0;
        for (int i = 0; i < yTrue.length; i++) {
            double residual = yTrue[i] - yPred[i];
            sumSquaredResiduals += residual * residual;
            System.out.printf("Obs %d: Residual = %.2f\n", i, residual);
        }

        // RMSE: square root of the mean squared residual
        double rmse = Math.sqrt(sumSquaredResiduals / yTrue.length);
        System.out.printf("RMSE: %.4f\n", rmse);
    }
}
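For comparison, the same calculation can be sketched in Python with NumPy, using the same toy numbers as the Go and Java versions. The variable names are illustrative.

```python
import numpy as np

y_true = np.array([2, 4, 5, 4, 5], dtype=float)
y_pred = np.array([1.8, 3.9, 5.2, 4.1, 4.8])

residuals = y_true - y_pred               # e = y - y_hat
rmse = np.sqrt(np.mean(residuals ** 2))   # divides by n, like the Go and Java versions

for i, r in enumerate(residuals):
    print(f"Obs {i}: Residual = {r:.2f}")
print(f"RMSE: {rmse:.4f}")                # prints "RMSE: 0.1673"
```

Note that this RMSE divides by n. The residual standard error reported by regression software divides by the residual degrees of freedom instead, so the two values differ slightly on small samples.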
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
# 1. Fit the model
# (Assuming X and y are already defined)
model = sm.OLS(y, sm.add_constant(X))
results = model.fit()
# 2. Get Residuals and Fitted Values
residuals = results.resid
fitted_vals = results.fittedvalues
# 3. Create a 2x2 Diagnostic Plot
fig, ax = plt.subplots(2, 2, figsize=(12, 10))
# A. Residuals vs Fitted (Check Linearity & Homoscedasticity)
ax[0, 0].scatter(fitted_vals, residuals, alpha=0.5)
ax[0, 0].axhline(0, color='red', linestyle='--')
ax[0, 0].set_xlabel('Fitted Values')
ax[0, 0].set_ylabel('Residuals')
ax[0, 0].set_title('Residuals vs Fitted')
# B. Q-Q Plot (Check Normality)
# fit=True standardizes the residuals so the 45-degree reference line applies
sm.qqplot(residuals, line='45', fit=True, ax=ax[0, 1])
ax[0, 1].set_title('Normal Q-Q')
# C. Scale-Location (Check Homoscedasticity)
# Square root of |standardized (internally studentized) residuals| vs fitted
sqrt_abs_std_resid = np.sqrt(np.abs(results.get_influence().resid_studentized_internal))
ax[1, 0].scatter(fitted_vals, sqrt_abs_std_resid, alpha=0.5)
ax[1, 0].set_xlabel('Fitted Values')
ax[1, 0].set_ylabel('Sqrt(|Standardized Residuals|)')
ax[1, 0].set_title('Scale-Location')
# D. Histogram of Residuals (Check Normality)
ax[1, 1].hist(residuals, bins=15, edgecolor='black', alpha=0.7)
ax[1, 1].set_title('Histogram of Residuals')
plt.tight_layout()
plt.show()
Interpretation Guide
| Plot | What to look for | Violation Sign | Solution |
|---|---|---|---|
| Residuals vs Fitted | Random scatter around 0 | Curved “U” shape | Add polynomial terms (x<sup>2</sup>) |
| Residuals vs Fitted | Constant width band | Funnel shape | Log-transform y or Weighted Least Squares (WLS) |
| Normal Q-Q | Points on the red line | Points deviating at tails | Robust regression or check for outliers |
[!TIP] Heteroscedasticity often occurs when analyzing financial data (e.g., income vs. spending). High-income earners have more variance in spending than low-income earners.
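The log-transform remedy from the table can be checked on simulated data. This is a sketch under assumptions: the income range, the 0.6 spending share, and the multiplicative noise model are invented for illustration, and the helper `spread_ratio` is not a library function. It compares the residual spread in the top third of fitted values with the bottom third, before and after taking logs. Both variables are logged here, a choice made so that the relationship stays linear under this particular noise model.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 900
income = rng.uniform(20, 200, n)   # hypothetical income, arbitrary units
# Assumed process: multiplicative noise, so the spread of spending scales with income
spending = 0.6 * income * np.exp(rng.normal(0, 0.25, n))

def spread_ratio(x, y):
    """Fit y = a + b*x; return residual std in the top third of fitted values
    divided by residual std in the bottom third."""
    b, a = np.polyfit(x, y, 1)
    fitted = a + b * x
    resid = y - fitted
    order = np.argsort(fitted)
    low, _, high = np.array_split(resid[order], 3)
    return high.std() / low.std()

ratio_raw = spread_ratio(income, spending)                   # raw scale
ratio_log = spread_ratio(np.log(income), np.log(spending))   # log-log scale

print(f"Residual spread ratio (top/bottom third), raw scale: {ratio_raw:.2f}")
print(f"Residual spread ratio (top/bottom third), log scale: {ratio_log:.2f}")
```

A ratio well above 1 corresponds to the funnel shape; a ratio near 1 corresponds to the constant-width band. If the noise were additive rather than multiplicative, a log transform would not be expected to help, and Weighted Least Squares would be the more natural option.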
6. Summary
- Always plot your residuals. R-squared alone can be deceptive.
- Ideal Residuals: Random noise, constant variance, normally distributed.
- Violations:
  - Curve: Model is underfitting (needs non-linear terms).
  - Funnel: Heteroscedasticity (needs a transformation or Weighted Least Squares).
  - Outliers: Data quality issues or special cases.