Memory & Cache Optimization

Code that pleases the hardware. Understand Cache Lines, False Sharing, Data Locality, Struct Padding, and how it impacts time complexity and optimization.

1. The Hook: The CPU is Fast, RAM is Slow

Your CPU operates at 4 GHz (0.25 ns per cycle). Reading from RAM takes ~100 ns. That is 400 CPU cycles waiting for data. This is why Cache Layout matters more than instruction count for high-performance systems.

War Story: The LMAX Disruptor In high-frequency trading (HFT), milliseconds cost millions. The LMAX exchange built the “Disruptor,” a concurrency framework that processes 6 million orders per second on a single thread. Their secret? They didn’t write a magical new lock. They simply optimized their data structures (Ring Buffers) to fit perfectly within CPU Cache Lines and ruthlessly eliminated False Sharing. They stopped fighting the hardware and started working with it.


2. The Concepts

A. Spatial Locality & Cache Lines

CPUs fetch memory in chunks of 64 bytes (a Cache Line).

  • Good: Iterating an int[] array. (Next int is in the same cache line).
  • Bad: Iterating a LinkedList. (Next node is at random address → Cache Miss).

Interactive Simulation: Memory Locality

Watch how quickly the CPU can process data when it's contiguous (Array) vs. scattered (Linked List).

B. Struct Padding & Alignment

CPUs read words aligned to 4 or 8 bytes.

struct Bad {
  char c;   // 1 byte
  // 7 bytes wasted padding!
  long x;   // 8 bytes
};

Optimization: Reorder fields from largest to smallest to minimize padding wastage.

C. False Sharing

If two threads modify different variables that sit on the same cache line, the cores force each other to invalidate the cache line repeatedly. This silent performance killer can make multi-threaded code run slower than single-threaded code.

The Problem Scenario:

class Counter {
  // Thread 1 increments x, Thread 2 increments y.
  // Because they are adjacent, they sit on the SAME 64-byte cache line.
  volatile long x;
  volatile long y;
}

Solution (Padding): Pad variables to ensure they live on different 64-byte lines.

class PaddedCounter {
  volatile long x;
  // 56 bytes of padding to push 'y' to the next Cache Line
  long p1, p2, p3, p4, p5, p6, p7;
  volatile long y;
}
// Note: Modern Java provides @Contended to handle this automatically.

3. Data-Oriented Design (DoD)

OOP (Object Oriented Programming): Array of Structures (AoS)

class Ball {
  float x, y; // 8 bytes
  Color c;    // 4 bytes + padding
}
Ball[] balls = new Ball[1000];
  • Scatter memory: Iterating over balls to update positions means loading the Color data into the cache even though it’s never used in the physics calculation. This wastes precious cache bandwidth.

DoD (Data-Oriented Design): Structure of Arrays (SoA)

class BallManager {
  float[] xs; // Contiguous block of X positions
  float[] ys; // Contiguous block of Y positions
  Color[] cs; // Contiguous block of Colors
}
  • Benefit: If you only need to update positions, you iterate xs and ys. You never load cs into cache. Every byte pulled into the L1 cache is strictly utilized. This maximizes cache utilization and throughput.

4. Deep Dive Strategy Lab: Memory

Intuition Through Analogy

RAM is a Warehouse. L1 Cache is your Backpack.

  • Linked List: You run to the warehouse, grab one item, come back. Run back for the next.
  • Array: You take a forklift and bring a whole pallet (Cache Line) to your desk. You work on 16 items instantly before needing to go back.

Mathematical Anchor: Latency Numbers

Jeff Dean’s famous numbers (approximate):

  • L1 Cache Ref: 0.5 ns
  • L2 Cache Ref: 7 ns
  • Main Memory Ref: 100 ns
  • SSD Random Read: 150,000 ns
  • Network Round Trip: 150,000,000 ns

Optimization is shifting work up this list.