Kafka (Wait-Free Architecture)

[!TIP] Dumb Broker, Smart Consumer: Traditional queues (RabbitMQ) track “who read what” on the server. Kafka puts that burden on the client. The broker is just a high-speed append-only log. This design choice allows Kafka to handle massive throughput.

1. Event Streaming vs Message Queues

Kafka is often confused with Message Queues like RabbitMQ or ActiveMQ.

  • Message Queue (RabbitMQ): “Smart Broker, Dumb Consumer”. The broker tracks state. Messages are deleted once consumed. Good for task processing (job queues).
  • Event Streaming (Kafka): “Dumb Broker, Smart Consumer”. The broker is a storage engine. Messages are retained based on time (e.g., 7 days). Consumers track their own position (Offset). Good for replayability, event sourcing, and high throughput.

2. The Core Abstraction: The Log

A Log is the simplest storage abstraction:

  • Append-Only: New data goes to the end. No overwrites.
  • Ordered: Event A happened before Event B.
  • Immutable: You can’t change history.
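
To make the abstraction concrete, here is a toy in-memory sketch (a hypothetical `AppendOnlyLog` class, not Kafka code): appends go to the end and return a monotonically increasing offset, and reads never mutate history.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a log: append-only, ordered, immutable.
public class AppendOnlyLog {
    private final List<String> records = new ArrayList<>();

    // Append to the end only; the returned offset increases monotonically.
    public synchronized long append(String record) {
        records.add(record);
        return records.size() - 1; // offset of the record just written
    }

    // Reads never modify the log; callers track their own position (offset).
    public synchronized List<String> readFrom(long offset, int maxRecords) {
        int from = (int) Math.min(offset, records.size());
        int to = Math.min(from + maxRecords, records.size());
        return List.copyOf(records.subList(from, to));
    }

    public static void main(String[] args) {
        AppendOnlyLog log = new AppendOnlyLog();
        log.append("user-42 clicked");
        log.append("user-42 purchased");
        System.out.println(log.readFrom(0, 10)); // a consumer replaying from offset 0
    }
}
```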

Topics & Partitions

Kafka scales by splitting a Topic (e.g., “UserClicks”) into Partitions (Shards).

  • Ordering Guarantee: Kafka guarantees order within a partition, but NOT across the whole topic.
  • Partitioning Strategy: hash(UserID) % NumPartitions. This ensures all events for User A go to the same partition (and stay ordered).
  • Offset: A unique, monotonically increasing integer ID for every message in a partition. Offsets are not necessarily contiguous (compaction and transaction control records can leave gaps).
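
A minimal producer sketch showing how keying by UserID pins a user's events to one partition. The broker address and topic name are assumptions; the comment on the default partitioner (murmur2 hash of the key modulo partition count) reflects the stock Java client.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ClickProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key => same partition (default partitioner: murmur2(key) % numPartitions),
            // so all events for user-42 stay ordered relative to each other.
            producer.send(new ProducerRecord<>("UserClicks", "user-42", "clicked:home"));
            producer.send(new ProducerRecord<>("UserClicks", "user-42", "clicked:checkout"));
            // A record with no key is spread across partitions instead.
            producer.send(new ProducerRecord<>("UserClicks", null, "anonymous:pageview"));
        }
    }
}
```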

3. Storage Internals (Segments & Indexes)

Kafka achieves high-performance lookups and transfers by combining Sparse Indexing with Zero-Copy data paths, as visualized in the storage engine diagram below.

  1. The .log file: Stores the raw message batches.
  2. The .index file: A Sparse Index that maps logical offsets to physical byte positions on disk.
    • Efficiency: Instead of mapping every offset, it adds one entry per chunk of data (e.g., every 4KB).
    • The Search: To find a specific message, Kafka binary-searches the index for the nearest preceding offset, jumps to that byte position in the log, and scans forward sequentially.
[Diagram: Storage Engine — Zero-Copy & Sparse Index. The sparse .index maps offsets to byte positions in the 000000.log segment (e.g., 0→0, 1000→4096, 2000→8192, 3000→12288). On a fetch, the broker (1) finds the starting byte position, (2) reads the block into the OS Page Cache, and (3) ships it to the consumer via sendfile(), bypassing the JVM user buffer on the Zero-Copy path to the NIC.]
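
A toy sketch of the sparse-index lookup (hypothetical data structures, not Kafka's segment code): binary-search the index for the nearest preceding entry, seek to that byte position, then scan forward.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model: a sparse index maps an offset to the byte position where its chunk starts.
public class SparseIndexLookup {
    private final TreeMap<Long, Long> offsetToBytePos = new TreeMap<>();

    public SparseIndexLookup() {
        // One entry roughly every 4 KB of log data, not one per message.
        offsetToBytePos.put(0L, 0L);
        offsetToBytePos.put(1000L, 4096L);
        offsetToBytePos.put(2000L, 8192L);
        offsetToBytePos.put(3000L, 12288L);
    }

    // Find where in the .log file to start reading for a target offset.
    public long startPositionFor(long targetOffset) {
        Map.Entry<Long, Long> floor = offsetToBytePos.floorEntry(targetOffset);
        // Jump to the nearest indexed position at or before the target,
        // then scan the log sequentially until the exact offset is found.
        return floor == null ? 0L : floor.getValue();
    }

    public static void main(String[] args) {
        SparseIndexLookup index = new SparseIndexLookup();
        System.out.println(index.startPositionFor(2500)); // 8192: start of the enclosing chunk
    }
}
```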

4. Why is Kafka so Fast?

Kafka pushes millions of messages per second on spinning disks. The secret is doing Less Work.

  1. Sequential I/O: Random disk access is slow (the head has to seek). Sequential access is extremely fast; in some published benchmarks, sequential disk throughput even beats random access to RAM.
  2. Zero Copy:
    • Standard I/O: The kernel reads data from disk to Page Cache. Then it copies it to User Space (Application). Then the Application copies it back to Kernel Space (Socket Buffer). Finally, to the NIC. (4 Copies, 4 Context Switches).
    • Zero Copy (sendfile): The kernel transfers data directly from the Page Cache to the NIC Buffer (via DMA). The data never enters the Application (JVM) memory.
    • Result: CPU overhead drops sharply (reductions on the order of 60% are commonly cited), and because the data never touches the JVM heap, the transfer adds no garbage-collection pressure.
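
Kafka's broker does this through Java's `FileChannel.transferTo`, which maps to `sendfile()` on Linux. A minimal standalone sketch of the same call; the segment file name and destination address are placeholders.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySend {
    public static void main(String[] args) throws IOException {
        // Hypothetical segment file and destination; adjust for a real environment.
        try (FileChannel segment = FileChannel.open(
                     Path.of("00000000000000000000.log"), StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9000))) {
            long position = 0;
            long remaining = segment.size();
            while (remaining > 0) {
                // transferTo -> sendfile(): page cache straight to the NIC, no copy into the JVM heap.
                long sent = segment.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```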

Interactive Demo: Zero Copy Racer

Visualize the difference in CPU cycles and context switches.

  • Top Lane: Standard I/O (The “Taxed” Path).
  • Bottom Lane: Zero Copy (The “Gold” Path).
  • Standard I/O (4 copies): Disk → Kernel (Page Cache) → User Buffer → Kernel (Socket Buffer) → NIC.
  • Zero Copy (2 copies): Disk → Kernel (Page Cache) → NIC.

5. Reliability: ISR and High Watermark

How do we ensure data durability without sacrificing speed?

  • Leader: Handles writes.
  • Follower: Passively replicates data.
  • ISR (In-Sync Replicas): Replicas that are “caught up” with the leader.
  • High Watermark (HW): The offset of the last message successfully replicated to all ISRs.
    • Safety: Consumers can only read up to the HW. This prevents “Ghost Reads” (reading data that gets lost if the leader crashes).
  • Acks:
    • acks=0: Fire and forget. Fast, unsafe.
    • acks=1: Leader ack. Balanced.
    • acks=all: All ISRs ack. Slow, safe.
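
A sketch of the producer-side durability knobs (the broker address and topic are assumptions). Note that `acks=all` only guarantees durability when the topic also sets a sensible `min.insync.replicas`.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DurableProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // "0" = fire and forget, "1" = leader ack, "all" = wait for every in-sync replica.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE); // retry transient broker errors
        // Pair acks=all with a topic-level min.insync.replicas (e.g., 2) for real durability.

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("billing-events", "invoice-7", "charged:42.00"));
        }
    }
}
```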

Interactive Demo: The High Watermark

Visualize how data becomes “Visible” to consumers only after replication.

  1. Produce: Send messages to Leader.
  2. Replicate: Follower pulls data.
  3. Commit: HW moves. Consumers can now read.
[Demo: Leader (P0) and Follower (P0) each track a LEO (Log End Offset — last message written to disk); the HW (High Watermark) marks the last committed offset. The Consumer View stays empty (“No Data Visible”) until the HW advances past the data.]

6. Consumer Groups (The Secret Weapon)

This is how Kafka scales reads horizontally.

  • Consumer Group: A logical application (e.g., “BillingService”).
  • Rule: Each partition is consumed by exactly one consumer in the group.
  • Example: If you have 4 partitions and 2 consumers, each gets 2 partitions. If you have 5 consumers, 1 is idle!
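
A minimal consumer-group member (broker address and topic are assumptions). Run two copies of this program with the same `group.id` and the four partitions are split between them; run five and one instance receives nothing.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class BillingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "BillingService"); // the logical application
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("UserClicks"));
            while (true) {
                // Each group member is assigned a disjoint subset of the partitions.
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    System.out.printf("p%d@%d %s=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```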

Rebalancing

When a consumer joins or leaves, Kafka must redistribute partitions.

  • Stop-the-World (Eager): All consumers stop reading, give up partitions, and rejoin. (Latency spike).
  • Cooperative (Incremental): Only moving partitions are revoked. Consumers keep reading untouched partitions.
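
Opting into incremental rebalancing is a single consumer setting; a sketch assuming the consumer setup from the previous example:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;

public class CooperativeRebalanceConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Eager strategies revoke every partition on every rebalance;
        // the cooperative-sticky assignor revokes only the partitions that actually move.
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                CooperativeStickyAssignor.class.getName());
        System.out.println(props); // merge into the consumer Properties from the previous sketch
    }
}
```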

Interactive Demo: Consumer Rebalancing

Visualize how partitions are reassigned when consumers join or leave.

  • Goal: Ensure every partition has a consumer.
  • Idle: If Consumers > Partitions, some consumers are idle.
[Demo: Topic “UserClicks” with partitions P0–P3. Partitions are redistributed whenever a consumer joins or leaves; with more than four consumers, the extras sit idle.]

7. Exactly-Once Semantics (EOS)

“Exactly-Once” is the holy grail. Kafka supports it via:

  1. Idempotent Producer: The broker assigns the producer a Producer ID, and each batch carries a per-partition Sequence Number. If the broker sees a duplicate sequence number, it discards the batch instead of appending it again.
  2. Transactions: You can write to multiple partitions atomically. “All or Nothing”. This is critical for “Consume-Process-Produce” loops (e.g., KStreams).
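
A condensed consume-process-produce loop using transactions (topic names, group ID, and transactional ID are placeholders): output records and consumer offsets commit atomically, and downstream readers in `read_committed` mode never see aborted data.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ExactlyOnceLoop {
    public static void main(String[] args) {
        Properties cProps = new Properties();
        cProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        cProps.put(ConsumerConfig.GROUP_ID_CONFIG, "enricher");
        cProps.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        cProps.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed"); // skip aborted data
        cProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        cProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Properties pProps = new Properties();
        pProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        pProps.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "enricher-1"); // implies idempotence
        pProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        pProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pProps)) {
            consumer.subscribe(List.of("raw-events"));
            producer.initTransactions();
            while (true) {
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(500));
                if (batch.isEmpty()) continue;
                producer.beginTransaction();
                try {
                    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                    for (ConsumerRecord<String, String> r : batch) {
                        producer.send(new ProducerRecord<>("enriched-events", r.key(), r.value().toUpperCase()));
                        offsets.put(new TopicPartition(r.topic(), r.partition()),
                                new OffsetAndMetadata(r.offset() + 1));
                    }
                    // Offsets and output are committed atomically: all or nothing.
                    producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                    producer.commitTransaction();
                } catch (Exception e) {
                    producer.abortTransaction(); // output and offsets both roll back
                }
            }
        }
    }
}
```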

8. Log Compaction

Normally, Kafka deletes old data after 7 days (Retention Policy). Log Compaction is different. It retains the latest value for every key.

  • Use Case: Restoring a database state (CDC).
  • How: If Key A has values v1, v2, v3, compaction deletes v1 and v2. The log effectively becomes a Snapshot of the final state.
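
A sketch of creating a compacted topic via the Admin client (broker address, topic name, partition and replica counts are all assumptions):

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            NewTopic topic = new NewTopic("user-profile-changelog", 4, (short) 3)
                    // Keep at least the latest value per key instead of deleting by age.
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```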

9. Case Study: Uber’s Trillion Message Pipeline

Uber operates one of the largest Kafka deployments in the world, powering everything from dynamic pricing to fraud detection.

1. Scenario

  • Scale: Trillions of messages per day across thousands of topics.
  • Users: Driver App, Rider App, Eats, Freight.
  • Requirement: Real-time dispatching, Active-Active redundancy between regions, and zero data loss for billing.

2. The Challenge

  • Head-of-Line Blocking: One bad message (a “poison pill”) can halt processing for an entire partition.
  • Replication Lag: Synchronizing state between US-West and US-East in real-time.
  • Thundering Herd: If a cluster fails, reconnecting thousands of microservices simultaneously can DDOS the backup cluster.

3. The Decision

Why did Uber choose Kafka over RabbitMQ or a REST-based architecture?

| Feature | REST / HTTP | RabbitMQ | Kafka (Chosen) |
|---|---|---|---|
| Throughput | Low (sync overhead) | Medium | Extreme (sequential I/O) |
| Persistence | None (ephemeral) | Short-term | Long-term (replayable) |
| Ordering | No | Weak | Strict (per partition) |
| Backpressure | Circuit breakers needed | Limited by RAM | Native (pull-based) |

4. The Decision Visualizer

Kafka acted as the “Universal Buffer” allowing producers to write at any speed, independent of consumer capacity.

5. Architecture

Uber employs a Federated Kafka architecture to solve global scale.

  • Regional Clusters: “Edge” clusters in each data center accept local writes from services.
  • uReplicator: A custom replication service (Uber’s replacement for stock MirrorMaker, which struggled at this scale) that bridges Regional Clusters to the Core Aggregation Cluster.
  • Chaperone: An audit service that verifies every message produced is eventually consumed (completeness check).

6. Core Abstraction

The Log is the single source of truth.

  • Immutable History: Every trip request, GPS ping, and transaction is an event in the log.
  • Materialized Views: Downstream databases (Cassandra/MySQL) are just projections of the Kafka log.

7. Deep Dive: Dead Letter Queues (DLQ)

Handling “Poison Pills” (malformed messages that crash consumers) without stopping the world.

  1. Fast Lane: The main topic. 99.9% of messages flow here.
  2. Retry Lane: If consumption fails, republish the message to a “Retry Topic” with a delay.
  3. DLQ: If retries fail N times, move to a Dead Letter Queue for manual inspection.
    • Outcome: The partition never blocks.
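
A condensed retry/DLQ loop (topic names and the `attempts` header are hypothetical; real implementations also add delays between retries). Failed records are forwarded instead of reprocessed in place, so the partition keeps flowing.

```java
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;

public class DlqConsumerLoop {
    static final int MAX_ATTEMPTS = 3;

    // Consumer/producer construction omitted; see the earlier sketches for the Properties.
    static void run(KafkaConsumer<String, String> consumer, KafkaProducer<String, String> producer) {
        consumer.subscribe(List.of("orders", "orders.retry"));
        while (true) {
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                try {
                    process(record.value()); // may throw on a poison pill
                } catch (Exception e) {
                    int attempts = attemptsSoFar(record) + 1;
                    String target = attempts < MAX_ATTEMPTS ? "orders.retry" : "orders.dlq";
                    ProducerRecord<String, String> forward =
                            new ProducerRecord<>(target, record.key(), record.value());
                    forward.headers().add("attempts",
                            Integer.toString(attempts).getBytes(StandardCharsets.UTF_8));
                    producer.send(forward); // the main partition never blocks
                }
            }
        }
    }

    static int attemptsSoFar(ConsumerRecord<String, String> record) {
        Header h = record.headers().lastHeader("attempts");
        return h == null ? 0 : Integer.parseInt(new String(h.value(), StandardCharsets.UTF_8));
    }

    static void process(String value) { /* business logic; throws on malformed input */ }
}
```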

8. Trade-offs

  • Latency vs. Durability: Uber configures acks=all for Billing events (Zero Loss, higher latency) but acks=1 for GPS pings (Speed over safety).
  • Complexity: Managing thousands of topics requires automated tooling (uReplicator) which adds operational overhead.

9. Results

  • 99.99% Uptime: Even during regional outages, the buffer allows services to queue data until systems recover.
  • Decoupling: New teams can build “Fraud Detection” features by simply subscribing to existing topics without asking the Driver Team for API access.

10. Future

  • Tiered Storage: Offloading old segments to S3 to save costs while keeping data queryable.
  • Kafka on Kubernetes: Moving from bare metal to K8s for better resource utilization.

10. Observability & Tracing

Kafka is complex. Visibility is mandatory.

RED Method

  1. Rate: Messages/sec (In/Out).
  2. Errors: Request Timed Out, Not Leader For Partition.
  3. Duration: Request Latency (P99).

Critical Kafka Metrics

  • Consumer Lag: Max over partitions of (LogEndOffset − CommittedConsumerOffset). The most important metric. If it keeps growing, consumers are falling behind.
  • Under-Replicated Partitions (URP): The number of partitions whose ISR count is below the replication factor. If > 0, you are at risk of data loss.
  • ISR Shrink/Expand: Flapping ISRs indicate network issues or slow disks.
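
A sketch of measuring consumer lag from outside the application (broker address and group ID are assumptions): compare each partition's log end offset with the group's committed offset.

```java
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerLagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // Where the group currently is (committed offsets)...
            Map<TopicPartition, OffsetAndMetadata> committed = admin
                    .listConsumerGroupOffsets("BillingService")
                    .partitionsToOffsetAndMetadata().get();
            // ...versus where each partition's log ends (log end offsets).
            Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> ends =
                    admin.listOffsets(latestSpec).all().get();
            committed.forEach((tp, offset) -> {
                long lag = ends.get(tp).offset() - offset.offset();
                System.out.printf("%s lag=%d%n", tp, lag); // alert if the max lag keeps growing
            });
        }
    }
}
```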

11. Deployment Strategy

Rolling Restarts

  • Why? Config changes, upgrades, OS patches.
  • Procedure: Restart one broker at a time.
    1. Controlled Shutdown: Broker sends signal to Controller to migrate leaderships before stopping. Reduces “Not Leader” errors.
    2. Restart: Broker comes back.
    3. Catch Up: Wait for it to rejoin ISR.
    4. Next: Move to next broker.

Topic Config Changes

  • Dynamic: Retention, Partition Count (Increase only).
  • Static: Message Format Version (requires restart).
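
Dynamic changes go through the Admin API without a broker restart; a sketch (topic name and values are assumptions) that changes retention and increases the partition count:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.clients.admin.NewPartitions;
import org.apache.kafka.common.config.ConfigResource;

public class DynamicTopicChanges {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // 1. Change retention on a live topic (no restart needed).
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "UserClicks");
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("retention.ms", "259200000"), AlterConfigOp.OpType.SET); // 3 days
            admin.incrementalAlterConfigs(Map.of(topic, List.of(setRetention))).all().get();

            // 2. Increase (never decrease) the partition count; existing keys may re-hash.
            admin.createPartitions(Map.of("UserClicks", NewPartitions.increaseTo(8))).all().get();
        }
    }
}
```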

12. Requirements Traceability Matrix

| Requirement | Architectural Solution |
|---|---|
| Extreme Throughput | Sequential I/O + Zero Copy (sendfile) |
| Scalability | Partitioning (horizontal scale) + Consumer Groups |
| Durability | Replication (ISR) + WAL |
| Decoupling | Log abstraction (broker doesn’t know the consumer) |
| Replayability | Time-based retention (e.g., 7 days) |

13. Interview Gauntlet

I. Internals

  • Explain Zero Copy. It avoids copying data to User Space. Disk -> PageCache -> NIC.
  • What is the High Watermark? The offset of the last message replicated to all ISRs. Consumers cannot read past it.
  • Pull vs Push? Kafka is Pull. Consumers control the rate (Backpressure).

II. Scalability

  • How to handle a Slow Consumer? Add more partitions and consumers. Or optimize the consumer processing logic.
  • Can you decrease partitions? No. This would break data ordering and hashing. You must create a new topic.

III. Failure Modes

  • What if Zookeeper dies? Older Kafka can keep serving existing partition leaders for a while, but no controller (and therefore no new partition leader) can be elected, so broker failures go unhandled. New Kafka (KRaft) removes the ZK dependency.
  • Message Duplication? Use Idempotent Producer (enable.idempotence=true).

14. Summary: The Whiteboard Strategy

If asked to design Kafka, draw this 4-Quadrant Layout:

1. Requirements

  • Func: Pub/Sub, Replay, Order.
  • Non-Func: High Throughput, Durability.
  • Scale: Trillions/day.

2. Architecture

[Producers] -> [Broker Cluster]
(Topic -> Partitions)
|
[Consumer Groups]

* Broker: Dumb Log Storage.
* ZK/KRaft: Metadata.

3. Storage

Partition: [Seg 1] [Seg 2] (Append Only)
Index: Sparse (Offset -> Byte)
Message: Key, Value, Timestamp

4. Performance

  • Zero Copy: `sendfile` skips the user-space copies; data moves from the page cache to the NIC via DMA.
  • Sequential I/O: Disk seeks are the enemy; append-only writes avoid them.
  • Batching: Group messages into batches and compress them as a unit.

Next, we look at Chubby, the system that coordinates all this madness: Chubby (Distributed Lock).