Linear Regression

[!NOTE] This chapter covers Linear Regression, the simplest and most fundamental algorithm in Machine Learning. We will explore the intuition behind drawing a line through data, derive the mathematical loss function, and see how it works under the hood.

1. Intuition: Drawing a Line

Imagine you have a scatter plot showing house sizes versus their prices. You want to predict the price of a house you haven’t seen before based on its size. The simplest approach is to draw a straight line that “best fits” the data points.

This is Linear Regression. It assumes a linear relationship between the input variables (features) and the single output variable (target).

The Formula

The equation of a line is:

y = mx + b

In Machine Learning notation, this is often written as:

hθ(x) = θ₀ + θ₁x₁ + θ₂x₂ + … + θₙxₙ

  • hθ(x): The hypothesis (our prediction).
  • θ₀: The y-intercept (bias term).
  • θ₁…θₙ: The weights (coefficients) for each feature.
  • x₁…xₙ: The feature values.
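With a leading 1 prepended to the feature vector for the bias term, the hypothesis is just a dot product between the parameters and the features. A minimal sketch with made-up numbers:

```python
import numpy as np

# Hypothetical parameters and features, purely for illustration.
theta = np.array([1.0, 0.5, 2.0])  # [theta_0, theta_1, theta_2]
x = np.array([1.0, 3.0, 4.0])      # leading 1 multiplies the bias term theta_0

# h_theta(x) = theta_0 * 1 + theta_1 * x_1 + theta_2 * x_2
prediction = theta @ x
print(prediction)  # 1.0 + 0.5*3.0 + 2.0*4.0 = 10.5
```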

2. Interactive Analysis: Finding the Best Fit

Try adjusting the slope and intercept and observing how the line fits the data points. The Mean Squared Error (MSE) changes as the line moves; the goal is to minimize the MSE.

3. The Rigor: Mean Squared Error (MSE)

How do we mathematically define the “best fit”? We need a metric that measures how far off our predictions are from the actual values. This is our Cost Function (or Loss Function).

For Linear Regression, the standard cost function is the Mean Squared Error (MSE).

J(θ) = (1 / 2m) × Σᵢ₌₁ᵐ (hθ(x⁽ⁱ⁾) − y⁽ⁱ⁾)²

Where:

  • J(θ) is the cost given the parameters θ.
  • m is the number of training examples.
  • hθ(x⁽ⁱ⁾) is the predicted value for the i-th example.
  • y⁽ⁱ⁾ is the actual target value for the i-th example.

We square the errors so that negative and positive errors don’t cancel each other out, and so that large errors are penalized more heavily. The 1/(2m) term is a mathematical convenience that makes the derivative cleaner later on.
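The cost function translates directly into NumPy. A minimal sketch; the `cost` helper and the toy data are our own, assuming X already includes a column of 1s for the bias:

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = (1 / 2m) * sum of squared errors; X includes a bias column of 1s."""
    m = len(y)
    errors = X @ theta - y
    return (errors @ errors) / (2 * m)

# Toy data: y = 2x exactly, so theta = [0, 2] gives zero cost.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([2.0, 4.0, 6.0])
print(cost(np.array([0.0, 2.0]), X, y))  # 0.0 (perfect fit)
print(cost(np.array([0.0, 1.0]), X, y))  # nonzero: the line y = x underpredicts
```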

4. Ordinary Least Squares (OLS)

In linear regression, if the dataset is small enough, we don’t even need to “search” for the best parameters iteratively. We can solve for the optimal weights algebraically using the Normal Equation.

θ = (XᵀX)⁻¹ Xᵀy

Where:

  • X is the matrix of feature values (with a column of 1s added for the intercept term).
  • y is the vector of target values.
  • θ is the vector of optimal weights.
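The Normal Equation is a one-liner in NumPy. A sketch with made-up data; note that `np.linalg.solve` evaluates the same expression without forming the inverse explicitly, which is the numerically safer choice:

```python
import numpy as np

# Toy data: the first column of 1s provides the intercept term.
# The targets follow y = 1 + 2x exactly.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])

# theta = (X^T X)^{-1} X^T y, solved as a linear system instead of inverting.
theta = np.linalg.solve(X.T @ X, X.T @ y)
print(theta)  # approximately [1., 2.]
```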

Hardware Reality

While mathematically elegant, computing the inverse of (XᵀX) has a time complexity of approximately O(n³), where n is the number of features. If you have hundreds of thousands of features, this becomes computationally infeasible, and you must use iterative methods like Gradient Descent instead.
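As a rough sketch of that iterative alternative, here is batch gradient descent on the MSE cost. The learning rate and iteration count below are arbitrary choices for this toy data, not recommended defaults:

```python
import numpy as np

def gradient_descent(X, y, lr=0.05, iters=10000):
    """Minimize MSE iteratively; the gradient of J is (1/m) * X^T (X @ theta - y)."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        theta -= lr * (X.T @ (X @ theta - y)) / m
    return theta

# Same toy data as before: y = 1 + 2x, with a bias column of 1s.
X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]])
y = np.array([3.0, 5.0, 7.0, 9.0])
print(gradient_descent(X, y))  # converges toward [1., 2.]
```

Each iteration costs O(N × M), so the total is O(K × N × M) for K iterations, never requiring the O(M³) matrix inversion.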

5. Implementation: Scikit-Learn

Python (Scikit-Learn)
```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# 1. Prepare Data
# X must be a 2D array: (n_samples, n_features)
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2.1, 4.0, 6.2, 8.1, 9.9])

# 2. Initialize Model
model = LinearRegression()

# 3. Fit the model (this uses the OLS algebraic solution under the hood)
model.fit(X, y)

# 4. View Parameters
print(f"Intercept: {model.intercept_:.2f}")
print(f"Coefficient (Slope): {model.coef_[0]:.2f}")

# 5. Make Predictions
predictions = model.predict(X)
mse = mean_squared_error(y, predictions)
print(f"MSE: {mse:.4f}")
```

6. Complexity Analysis

| Phase | Time Complexity | Space Complexity | Notes |
| --- | --- | --- | --- |
| Training (OLS Normal Equation) | O(N × M² + M³) | O(N × M) | N = samples, M = features. Matrix inversion is O(M³). |
| Training (Gradient Descent) | O(K × N × M) | O(N × M) | K = iterations. Better for a huge number of features. |
| Prediction | O(M) | O(1) | Extremely fast; just a dot product. |