Famous Architectures: From LeNet to ResNet

The history of Deep Learning is written in the architecture of Neural Networks. From the early days of digit recognition to beating humans at image classification, each new architecture introduced a key innovation that pushed the field forward.

In this chapter, we will trace the evolution of the “Big Five” architectures: LeNet, AlexNet, VGG, Inception, and ResNet.


1. The Classics: LeNet and AlexNet

LeNet-5 (1998)

Yann LeCun’s pioneering network designed for reading handwritten digits (MNIST).

  • Innovation: Introduced the Convolution → Pooling → Convolution pattern.
  • Structure: 2 Convolutional layers, 2 Pooling layers, 3 Fully Connected layers.
  • Impact: Proved that backpropagation could train convolutional networks end-to-end on a real-world task.
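The Conv → Pool → Conv → FC pattern above can be sketched in a few lines of PyTorch. This is a minimal sketch, not LeCun's exact network: it assumes 32×32 grayscale inputs (as in the original paper) and uses the classic layer sizes (6 and 16 feature maps, then 120 → 84 → 10 dense units).

```python
import torch
import torch.nn as nn

class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        # 2 Convolutional layers + 2 Pooling layers
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 32x32 -> 28x28
            nn.Tanh(),
            nn.AvgPool2d(2),                  # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),  # 14x14 -> 10x10
            nn.Tanh(),
            nn.AvgPool2d(2),                  # 10x10 -> 5x5
        )
        # 3 Fully Connected layers
        self.classifier = nn.Sequential(
            nn.Linear(16 * 5 * 5, 120),
            nn.Tanh(),
            nn.Linear(120, 84),
            nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)  # flatten all dims except batch
        return self.classifier(x)

x = torch.randn(1, 1, 32, 32)  # one fake grayscale image
print(LeNet5()(x).shape)       # torch.Size([1, 10])
```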

AlexNet (2012)

The network that started the Deep Learning boom by winning the ImageNet challenge (ILSVRC) by a massive margin.

  • Innovation: Used ReLU activation (instead of Sigmoid/Tanh) and Dropout to prevent overfitting. Trained on GPUs.
  • Structure: 5 Convolutional layers, 3 Fully Connected layers.
  • Impact: Reduced top-5 error rate from 26% to 15.3%.
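AlexNet's two training tricks are easy to see in isolation: ReLU avoids the saturation of Sigmoid/Tanh, and Dropout randomly zeroes activations during training only (it is disabled in evaluation mode). A tiny sketch:

```python
import torch
import torch.nn as nn

# ReLU + Dropout, the combination AlexNet popularized
layer = nn.Sequential(nn.ReLU(), nn.Dropout(p=0.5))

x = torch.ones(1, 8)
layer.train()    # training mode: roughly half the values are zeroed
print(layer(x))  # survivors are scaled by 1/(1-p) = 2 to keep the expected value
layer.eval()     # evaluation mode: Dropout is a no-op
print(layer(x))  # tensor of ones
```

This is also why the predict() function later in this chapter calls model.eval() before inference.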

2. Going Deeper: VGG (2014)

The Visual Geometry Group (VGG) at Oxford showed that depth matters.

  • Innovation: Replaced large kernels (11×11, 5×5) with stacks of small 3×3 kernels.
  • Logic: Two 3×3 convolutions have the same receptive field as one 5×5, but with fewer parameters and more non-linearity.
  • Drawback: Extremely heavy in parameters (138 Million) due to large dense layers at the end.
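The parameter savings in the "Logic" bullet can be verified with quick arithmetic. Ignoring biases, a convolution with kernel size k, C input channels, and C output channels has k·k·C·C weights:

```python
C = 256  # example channel count; any value gives the same ratio

one_5x5 = 5 * 5 * C * C          # a single 5x5 convolution
two_3x3 = 2 * (3 * 3 * C * C)    # a stack of two 3x3 convolutions

# Same 5x5 receptive field, but 18*C^2 vs 25*C^2 weights: 28% fewer,
# plus an extra non-linearity between the two 3x3 layers.
print(one_5x5, two_3x3)
```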

3. The Residual Revolution: ResNet (2015)

As networks got deeper (20+ layers), they became harder to train, largely due to the Vanishing Gradient Problem: gradients shrink toward zero as they are backpropagated through many layers, so the early layers barely learn.

Microsoft Research introduced ResNet (Residual Network).

  • Innovation: Skip Connections (or Residual Connections).
  • Mechanism: Instead of learning the mapping H(x), the network learns the residual F(x) = H(x) - x. The output is F(x) + x.
  • Analogy: It’s like giving the gradient a “highway” to flow backward without hitting traffic lights (layers).
  • Result: Allowed training of networks with 100+ layers (ResNet-152).
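The mechanism above can be sketched as a minimal residual block. This assumes an identity shortcut with matching channel counts; real ResNet blocks also use Batch Normalization and a 1×1 projection on the shortcut when shapes change.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))  # F(x), the residual
        return self.relu(out + x)                   # H(x) = F(x) + x: the skip connection

x = torch.randn(1, 64, 8, 8)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 8, 8])
```

The "+ x" is the gradient highway: during backpropagation, the gradient flows through the addition untouched, no matter what the convolutions do.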

4. Going Wider: Inception (GoogLeNet, 2014)

Google asked: “Why choose between a 3×3 filter or a 5×5 filter? Why not use both?”

  • Innovation: The Inception Module. It applies multiple filter sizes (1×1, 3×3, 5×5) in parallel and concatenates the outputs.
  • 1×1 Convolutions: Used to reduce dimensionality (depth) before expensive operations, acting as a “bottleneck” to save computation.
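A simplified Inception module sketch: three parallel branches (1×1, 3×3, 5×5) whose outputs are concatenated along the channel dimension, with 1×1 bottleneck convolutions shrinking the depth before the expensive 3×3 and 5×5 operations. The branch widths here are illustrative, not the paper's exact numbers, and the original module also includes a pooling branch.

```python
import torch
import torch.nn as nn

class InceptionModule(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 32, 1)         # plain 1x1 branch
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, 16, 1),              # 1x1 bottleneck: reduce depth first
            nn.Conv2d(16, 32, 3, padding=1),
        )
        self.b5 = nn.Sequential(
            nn.Conv2d(in_ch, 8, 1),               # 1x1 bottleneck
            nn.Conv2d(8, 16, 5, padding=2),
        )

    def forward(self, x):
        # padding keeps spatial sizes equal, so the branches can be concatenated
        return torch.cat([self.b1(x), self.b3(x), self.b5(x)], dim=1)

x = torch.randn(1, 64, 8, 8)
print(InceptionModule(64)(x).shape)  # torch.Size([1, 80, 8, 8]) -> 32+32+16 channels
```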

5. Interactive Architecture Explorer

[Interactive widget: compares the depth, parameter count, and Top-1 Accuracy (ImageNet) of LeNet, AlexNet, VGG-16, Inception, and ResNet-50.]

6. PyTorch Implementation

You don’t need to build these from scratch. torchvision comes with pretrained models.

import torch
import torchvision.models as models

# 1. Load VGG16
# 'weights=VGG16_Weights.DEFAULT' loads the best available weights trained on ImageNet
vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)

# 2. Load ResNet50
resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)

# Inspect the architecture
print(resnet)

# Example: Using a pretrained model for prediction
def predict(image_tensor, model):
  model.eval() # Set to evaluation mode (turns off dropout, etc.)
  with torch.no_grad():
    output = model(image_tensor)
    probabilities = torch.nn.functional.softmax(output[0], dim=0)
  return probabilities

# Note: ImageNet has 1000 classes. The output shape will be [Batch, 1000].

[!NOTE] ResNet50 is widely considered the default choice for general-purpose computer vision tasks today due to its balance of accuracy and efficiency.


7. Summary

  • LeNet: The proof of concept.
  • AlexNet: The breakthrough (ReLU, Dropout, GPU).
  • VGG: Small filters (3×3) are better. Depth matters.
  • ResNet: Residual connections make very deep networks (100+ layers) trainable.
  • Inception: Parallel filters (width) and 1×1 bottlenecks.

These architectures form the basis for Transfer Learning, which we will explore in the next chapter.