Kafka (Wait-Free Architecture)

[!TIP] Dumb Broker, Smart Consumer: Traditional queues (RabbitMQ) track “who read what” on the server. Kafka puts that burden on the client. The broker is just a high-speed append-only log. This design choice allows Kafka to handle massive throughput.

1. Event Streaming vs Message Queues

Kafka is often confused with Message Queues like RabbitMQ or ActiveMQ.

  • Message Queue (RabbitMQ): “Smart Broker, Dumb Consumer”. The broker tracks state. Messages are deleted once consumed. Good for task processing (job queues).
  • Event Streaming (Kafka): “Dumb Broker, Smart Consumer”. The broker is a storage engine. Messages are retained based on time (e.g., 7 days). Consumers track their own position (Offset). Good for replayability, event sourcing, and high throughput.

2. The Core Abstraction: The Log

A Log is the simplest storage abstraction:

  • Append-Only: New data goes to the end. No overwrites.
  • Ordered: Event A happened before Event B.
  • Immutable: You can’t change history.
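
To make the abstraction concrete, here is a toy in-memory sketch (a hypothetical `AppendOnlyLog` class, not Kafka code): appends go to the end and return a monotonically increasing offset, and reads never mutate history.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of a log: append-only, ordered, immutable.
public class AppendOnlyLog {
    private final List<String> records = new ArrayList<>();

    // Append to the end only; the returned offset increases monotonically.
    public synchronized long append(String record) {
        records.add(record);
        return records.size() - 1; // offset of the record just written
    }

    // Reads never modify the log; callers track their own position (offset).
    public synchronized List<String> readFrom(long offset, int maxRecords) {
        int from = (int) Math.min(offset, records.size());
        int to = Math.min(from + maxRecords, records.size());
        return List.copyOf(records.subList(from, to));
    }

    public static void main(String[] args) {
        AppendOnlyLog log = new AppendOnlyLog();
        log.append("user-42 clicked");
        log.append("user-42 purchased");
        System.out.println(log.readFrom(0, 10)); // a consumer replaying from offset 0
    }
}
```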

Topics & Partitions

Kafka scales by splitting a Topic (e.g., “UserClicks”) into Partitions (Shards).

  • Ordering Guarantee: Kafka guarantees order within a partition, but NOT across the whole topic.
  • Partitioning Strategy: hash(UserID) % NumPartitions. This ensures all events for User A go to the same partition (and stay ordered).
  • Offset: A unique, monotonically increasing integer ID for every message in a partition. Offsets are not necessarily contiguous (compaction and transaction control records can leave gaps).
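
A minimal producer sketch showing how keying by UserID pins a user's events to one partition. The broker address and topic name are assumptions; the comment on the default partitioner (murmur2 hash of the key modulo partition count) reflects the stock Java client.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ClickProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // assumed address
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key => same partition (default partitioner: murmur2(key) % numPartitions),
            // so all events for user-42 stay ordered relative to each other.
            producer.send(new ProducerRecord<>("UserClicks", "user-42", "clicked:home"));
            producer.send(new ProducerRecord<>("UserClicks", "user-42", "clicked:checkout"));
            // A record with no key is spread across partitions instead.
            producer.send(new ProducerRecord<>("UserClicks", null, "anonymous:pageview"));
        }
    }
}
```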

3. Storage Internals (Segments & Indexes)

Kafka achieves high-performance lookups and transfers by combining Sparse Indexing with Zero-Copy data paths, as visualized in the storage engine diagram below.

  1. The .log file: Stores the raw message batches.
  2. The .index file: A Sparse Index that maps logical offsets to physical byte positions on disk.
    • Efficiency: Instead of mapping every offset, it adds one entry per chunk of data (e.g., every 4KB).
    • The Search: To find a specific message, Kafka binary-searches the index for the nearest preceding offset, jumps to that byte position in the log, and scans forward sequentially.
[Diagram: Storage Engine — Zero-Copy & Sparse Index. The sparse .index maps offsets to byte positions in the 000000.log segment (e.g., 0→0, 1000→4096, 2000→8192, 3000→12288). On a fetch, the broker (1) finds the starting byte position, (2) reads the block into the OS Page Cache, and (3) ships it to the consumer via sendfile(), bypassing the JVM user buffer on the Zero-Copy path to the NIC.]
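
A toy sketch of the sparse-index lookup (hypothetical data structures, not Kafka's segment code): binary-search the index for the nearest preceding entry, seek to that byte position, then scan forward.

```java
import java.util.Map;
import java.util.TreeMap;

// Toy model: a sparse index maps an offset to the byte position where its chunk starts.
public class SparseIndexLookup {
    private final TreeMap<Long, Long> offsetToBytePos = new TreeMap<>();

    public SparseIndexLookup() {
        // One entry roughly every 4 KB of log data, not one per message.
        offsetToBytePos.put(0L, 0L);
        offsetToBytePos.put(1000L, 4096L);
        offsetToBytePos.put(2000L, 8192L);
        offsetToBytePos.put(3000L, 12288L);
    }

    // Find where in the .log file to start reading for a target offset.
    public long startPositionFor(long targetOffset) {
        Map.Entry<Long, Long> floor = offsetToBytePos.floorEntry(targetOffset);
        // Jump to the nearest indexed position at or before the target,
        // then scan the log sequentially until the exact offset is found.
        return floor == null ? 0L : floor.getValue();
    }

    public static void main(String[] args) {
        SparseIndexLookup index = new SparseIndexLookup();
        System.out.println(index.startPositionFor(2500)); // 8192: start of the enclosing chunk
    }
}
```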

4. Why is Kafka so Fast?

Kafka pushes millions of messages per second on spinning disks. The secret is doing Less Work.

  1. Sequential I/O: Random disk access is slow (the head has to seek). Sequential access is extremely fast; in some published benchmarks, sequential disk throughput even beats random access to RAM.
  2. Zero Copy:
    • Standard I/O: The kernel reads data from disk to Page Cache. Then it copies it to User Space (Application). Then the Application copies it back to Kernel Space (Socket Buffer). Finally, to the NIC. (4 Copies, 4 Context Switches).
    • Zero Copy (sendfile): The kernel transfers data directly from the Page Cache to the NIC Buffer (via DMA). The data never enters the Application (JVM) memory.
    • Result: CPU overhead drops sharply (reductions on the order of 60% are commonly cited), and because the data never touches the JVM heap, the transfer adds no garbage-collection pressure.
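
Kafka's broker does this through Java's `FileChannel.transferTo`, which maps to `sendfile()` on Linux. A minimal standalone sketch of the same call; the segment file name and destination address are placeholders.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.FileChannel;
import java.nio.channels.SocketChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopySend {
    public static void main(String[] args) throws IOException {
        // Hypothetical segment file and destination; adjust for a real environment.
        try (FileChannel segment = FileChannel.open(
                     Path.of("00000000000000000000.log"), StandardOpenOption.READ);
             SocketChannel socket = SocketChannel.open(new InetSocketAddress("localhost", 9000))) {
            long position = 0;
            long remaining = segment.size();
            while (remaining > 0) {
                // transferTo -> sendfile(): page cache straight to the NIC, no copy into the JVM heap.
                long sent = segment.transferTo(position, remaining, socket);
                position += sent;
                remaining -= sent;
            }
        }
    }
}
```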

Interactive Demo: Zero Copy Racer

Visualize the difference in CPU cycles and context switches.

  • Top Lane: Standard I/O (The “Taxed” Path).
  • Bottom Lane: Zero Copy (The “Gold” Path).
  • Standard I/O (4 copies): Disk → Kernel (Page Cache) → User Buffer → Kernel (Socket Buffer) → NIC.
  • Zero Copy (2 copies): Disk → Kernel (Page Cache) → NIC.

5. Reliability: ISR and High Watermark

How do we ensure data durability without sacrificing speed?

  • Leader: Handles writes.
  • Follower: Passively replicates data.
  • ISR (In-Sync Replicas): Replicas that are “caught up” with the leader.
  • High Watermark (HW): The offset of the last message successfully replicated to all ISRs.
    • Safety: Consumers can only read up to the HW. This prevents “Ghost Reads” (reading data that gets lost if the leader crashes).
  • Acks:
    • acks=0: Fire and forget. Fast, unsafe.
    • acks=1: Leader ack. Balanced.
    • acks=all: All ISRs ack. Slow, safe.
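
A sketch of the producer-side durability knobs (the broker address and topic are assumptions). Note that `acks=all` only guarantees durability when the topic also sets a sensible `min.insync.replicas`.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class DurableProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // "0" = fire and forget, "1" = leader ack, "all" = wait for every in-sync replica.
        props.put(ProducerConfig.ACKS_CONFIG, "all");
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.MAX_VALUE); // retry transient broker errors
        // Pair acks=all with a topic-level min.insync.replicas (e.g., 2) for real durability.

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("billing-events", "invoice-7", "charged:42.00"));
        }
    }
}
```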

Interactive Demo: The High Watermark

Visualize how data becomes “Visible” to consumers only after replication.

  1. Produce: Send messages to Leader.
  2. Replicate: Follower pulls data.
  3. Commit: HW moves. Consumers can now read.
[Demo: Leader (P0) and Follower (P0) each track a LEO (Log End Offset — last message written to disk); the HW (High Watermark) marks the last committed offset. The Consumer View stays empty (“No Data Visible”) until the HW advances past the data.]

6. Consumer Groups (The Secret Weapon)

This is how Kafka scales reads horizontally.

  • Consumer Group: A logical application (e.g., “BillingService”).
  • Rule: Each partition is consumed by exactly one consumer in the group.
  • Example: If you have 4 partitions and 2 consumers, each gets 2 partitions. If you have 5 consumers, 1 is idle!
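
A minimal consumer-group member (broker address and topic are assumptions). Run two copies of this program with the same `group.id` and the four partitions are split between them; run five and one instance receives nothing.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class BillingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "BillingService"); // the logical application
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("UserClicks"));
            while (true) {
                // Each group member is assigned a disjoint subset of the partitions.
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    System.out.printf("p%d@%d %s=%s%n",
                            record.partition(), record.offset(), record.key(), record.value());
                }
            }
        }
    }
}
```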

Rebalancing

When a consumer joins or leaves, Kafka must redistribute partitions.

  • Stop-the-World (Eager): All consumers stop reading, give up partitions, and rejoin. (Latency spike).
  • Cooperative (Incremental): Only moving partitions are revoked. Consumers keep reading untouched partitions.
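
Opting into incremental rebalancing is a single consumer setting; a sketch assuming the consumer setup from the previous example:

```java
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.CooperativeStickyAssignor;

public class CooperativeRebalanceConfig {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Eager strategies revoke every partition on every rebalance;
        // the cooperative-sticky assignor revokes only the partitions that actually move.
        props.put(ConsumerConfig.PARTITION_ASSIGNMENT_STRATEGY_CONFIG,
                CooperativeStickyAssignor.class.getName());
        System.out.println(props); // merge into the consumer Properties from the previous sketch
    }
}
```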

Interactive Demo: Consumer Rebalancing

Visualize how partitions are reassigned when consumers join or leave.

  • Goal: Ensure every partition has a consumer.
  • Idle: If Consumers > Partitions, some consumers are idle.
[Demo: Topic “UserClicks” with partitions P0–P3. Partitions are redistributed whenever a consumer joins or leaves; with more than four consumers, the extras sit idle.]

7. Exactly-Once Semantics (EOS)

“Exactly-Once” is the holy grail. Kafka supports it via:

  1. Idempotent Producer: The broker assigns the producer a Producer ID, and each batch carries a per-partition Sequence Number. If the broker sees a duplicate sequence number, it discards the batch instead of appending it again.
  2. Transactions: You can write to multiple partitions atomically. “All or Nothing”. This is critical for “Consume-Process-Produce” loops (e.g., KStreams).
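
A condensed consume-process-produce loop using transactions (topic names, group ID, and transactional ID are placeholders): output records and consumer offsets commit atomically, and downstream readers in `read_committed` mode never see aborted data.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.clients.producer.*;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ExactlyOnceLoop {
    public static void main(String[] args) {
        Properties cProps = new Properties();
        cProps.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        cProps.put(ConsumerConfig.GROUP_ID_CONFIG, "enricher");
        cProps.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "false");
        cProps.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed"); // skip aborted data
        cProps.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        cProps.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        Properties pProps = new Properties();
        pProps.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        pProps.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "enricher-1"); // implies idempotence
        pProps.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        pProps.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pProps)) {
            consumer.subscribe(List.of("raw-events"));
            producer.initTransactions();
            while (true) {
                ConsumerRecords<String, String> batch = consumer.poll(Duration.ofMillis(500));
                if (batch.isEmpty()) continue;
                producer.beginTransaction();
                try {
                    Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();
                    for (ConsumerRecord<String, String> r : batch) {
                        producer.send(new ProducerRecord<>("enriched-events", r.key(), r.value().toUpperCase()));
                        offsets.put(new TopicPartition(r.topic(), r.partition()),
                                new OffsetAndMetadata(r.offset() + 1));
                    }
                    // Offsets and output are committed atomically: all or nothing.
                    producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
                    producer.commitTransaction();
                } catch (Exception e) {
                    producer.abortTransaction(); // output and offsets both roll back
                }
            }
        }
    }
}
```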

8. Log Compaction

Normally, Kafka deletes old data after 7 days (Retention Policy). Log Compaction is different. It retains the latest value for every key.

  • Use Case: Restoring a database state (CDC).
  • How: If Key A has values v1, v2, v3, compaction deletes v1 and v2. The log effectively becomes a Snapshot of the final state.
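
A sketch of creating a compacted topic via the Admin client (broker address, topic name, partition and replica counts are all assumptions):

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.common.config.TopicConfig;

public class CreateCompactedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            NewTopic topic = new NewTopic("user-profile-changelog", 4, (short) 3)
                    // Keep at least the latest value per key instead of deleting by age.
                    .configs(Map.of(TopicConfig.CLEANUP_POLICY_CONFIG, TopicConfig.CLEANUP_POLICY_COMPACT));
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```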

9. Case Study: Uber’s Trillion Message Pipeline

Uber operates one of the largest Kafka deployments in the world, powering everything from dynamic pricing to fraud detection.

1. Scenario

  • Scale: Trillions of messages per day across thousands of topics.
  • Users: Driver App, Rider App, Eats, Freight.
  • Requirement: Real-time dispatching, Active-Active redundancy between regions, and zero data loss for billing.

2. The Challenge

  • Head-of-Line Blocking: One bad message (a “poison pill”) can halt processing for an entire partition.
  • Replication Lag: Synchronizing state between US-West and US-East in real-time.
  • Thundering Herd: If a cluster fails, reconnecting thousands of microservices simultaneously can DDOS the backup cluster.

3. The Decision

Why did Uber choose Kafka over RabbitMQ or a REST-based architecture?

| Feature | REST / HTTP | RabbitMQ | Kafka (Chosen) |
|---|---|---|---|
| Throughput | Low (sync overhead) | Medium | Extreme (sequential I/O) |
| Persistence | None (ephemeral) | Short-term | Long-term (replayable) |
| Ordering | No | Weak | Strict (per partition) |
| Backpressure | Circuit breakers needed | Limited by RAM | Native (pull-based) |

4. The Decision Visualizer

Kafka acted as the “Universal Buffer” allowing producers to write at any speed, independent of consumer capacity.

5. Architecture

Uber employs a Federated Kafka architecture to solve global scale.

  • Regional Clusters: “Edge” clusters in each data center accept local writes from services.
  • uReplicator: A custom replication service (Uber’s replacement for stock MirrorMaker, which struggled at this scale) that bridges Regional Clusters to the Core Aggregation Cluster.
  • Chaperone: An audit service that verifies every message produced is eventually consumed (completeness check).

6. Core Abstraction

The Log is the single source of truth.

  • Immutable History: Every trip request, GPS ping, and transaction is an event in the log.
  • Materialized Views: Downstream databases (Cassandra/MySQL) are just projections of the Kafka log.

7. Deep Dive: Dead Letter Queues (DLQ)

Handling “Poison Pills” (malformed messages that crash consumers) without stopping the world.

  1. Fast Lane: The main topic. 99.9% of messages flow here.
  2. Retry Lane: If consumption fails, republish the message to a “Retry Topic” with a delay.
  3. DLQ: If retries fail N times, move to a Dead Letter Queue for manual inspection.
    • Outcome: The partition never blocks.
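
A condensed retry/DLQ loop (topic names and the `attempts` header are hypothetical; real implementations also add delays between retries). Failed records are forwarded instead of reprocessed in place, so the partition keeps flowing.

```java
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;

public class DlqConsumerLoop {
    static final int MAX_ATTEMPTS = 3;

    // Consumer/producer construction omitted; see the earlier sketches for the Properties.
    static void run(KafkaConsumer<String, String> consumer, KafkaProducer<String, String> producer) {
        consumer.subscribe(List.of("orders", "orders.retry"));
        while (true) {
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                try {
                    process(record.value()); // may throw on a poison pill
                } catch (Exception e) {
                    int attempts = attemptsSoFar(record) + 1;
                    String target = attempts < MAX_ATTEMPTS ? "orders.retry" : "orders.dlq";
                    ProducerRecord<String, String> forward =
                            new ProducerRecord<>(target, record.key(), record.value());
                    forward.headers().add("attempts",
                            Integer.toString(attempts).getBytes(StandardCharsets.UTF_8));
                    producer.send(forward); // the main partition never blocks
                }
            }
        }
    }

    static int attemptsSoFar(ConsumerRecord<String, String> record) {
        Header h = record.headers().lastHeader("attempts");
        return h == null ? 0 : Integer.parseInt(new String(h.value(), StandardCharsets.UTF_8));
    }

    static void process(String value) { /* business logic; throws on malformed input */ }
}
```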

8. Trade-offs

  • Latency vs. Durability: Uber configures acks=all for Billing events (Zero Loss, higher latency) but acks=1 for GPS pings (Speed over safety).
  • Complexity: Managing thousands of topics requires automated tooling (uReplicator) which adds operational overhead.

9. Results

  • 99.99% Uptime: Even during regional outages, the buffer allows services to queue data until systems recover.
  • Decoupling: New teams can build “Fraud Detection” features by simply subscribing to existing topics without asking the Driver Team for API access.

10. Future

  • Tiered Storage: Offloading old segments to S3 to save costs while keeping data queryable.
  • Kafka on Kubernetes: Moving from bare metal to K8s for better resource utilization.

10. Observability & Tracing

Kafka is complex. Visibility is mandatory.

RED Method

  1. Rate: Messages/sec (In/Out).
  2. Errors: Request Timed Out, Not Leader For Partition.
  3. Duration: Request Latency (P99).

Critical Kafka Metrics

  • Consumer Lag: Max over partitions of (LogEndOffset − CommittedConsumerOffset). The most important metric. If it keeps growing, consumers are falling behind.
  • Under-Replicated Partitions (URP): The number of partitions whose ISR count is below the replication factor. If > 0, you are at risk of data loss.
  • ISR Shrink/Expand: Flapping ISRs indicate network issues or slow disks.
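
A sketch of measuring consumer lag from outside the application (broker address and group ID are assumptions): compare each partition's log end offset with the group's committed offset.

```java
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.ListOffsetsResult;
import org.apache.kafka.clients.admin.OffsetSpec;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ConsumerLagCheck {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // Where the group currently is (committed offsets)...
            Map<TopicPartition, OffsetAndMetadata> committed = admin
                    .listConsumerGroupOffsets("BillingService")
                    .partitionsToOffsetAndMetadata().get();
            // ...versus where each partition's log ends (log end offsets).
            Map<TopicPartition, OffsetSpec> latestSpec = committed.keySet().stream()
                    .collect(Collectors.toMap(tp -> tp, tp -> OffsetSpec.latest()));
            Map<TopicPartition, ListOffsetsResult.ListOffsetsResultInfo> ends =
                    admin.listOffsets(latestSpec).all().get();
            committed.forEach((tp, offset) -> {
                long lag = ends.get(tp).offset() - offset.offset();
                System.out.printf("%s lag=%d%n", tp, lag); // alert if the max lag keeps growing
            });
        }
    }
}
```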

11. Deployment Strategy

Rolling Restarts

  • Why? Config changes, upgrades, OS patches.
  • Procedure: Restart one broker at a time.
    1. Controlled Shutdown: Broker sends signal to Controller to migrate leaderships before stopping. Reduces “Not Leader” errors.
    2. Restart: Broker comes back.
    3. Catch Up: Wait for it to rejoin ISR.
    4. Next: Move to next broker.

Topic Config Changes

  • Dynamic: Retention, Partition Count (Increase only).
  • Static: Message Format Version (requires restart).
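
Dynamic changes go through the Admin API without a broker restart; a sketch (topic name and values are assumptions) that changes retention and increases the partition count:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.clients.admin.NewPartitions;
import org.apache.kafka.common.config.ConfigResource;

public class DynamicTopicChanges {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        try (Admin admin = Admin.create(props)) {
            // 1. Change retention on a live topic (no restart needed).
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "UserClicks");
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("retention.ms", "259200000"), AlterConfigOp.OpType.SET); // 3 days
            admin.incrementalAlterConfigs(Map.of(topic, List.of(setRetention))).all().get();

            // 2. Increase (never decrease) the partition count; existing keys may re-hash.
            admin.createPartitions(Map.of("UserClicks", NewPartitions.increaseTo(8))).all().get();
        }
    }
}
```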

12. Requirements Traceability Matrix

| Requirement | Architectural Solution |
|---|---|
| Extreme Throughput | Sequential I/O + Zero Copy (sendfile) |
| Scalability | Partitioning (horizontal scale) + Consumer Groups |
| Durability | Replication (ISR) + WAL |
| Decoupling | Log abstraction (broker doesn’t know the consumer) |
| Replayability | Time-based retention (e.g., 7 days) |

13. Interview Gauntlet

I. Internals

  • Explain Zero Copy. It avoids copying data to User Space. Disk -> PageCache -> NIC.
  • What is the High Watermark? The offset of the last message replicated to all ISRs. Consumers cannot read past it.
  • Pull vs Push? Kafka is Pull. Consumers control the rate (Backpressure).

II. Scalability

  • How to handle a Slow Consumer? Add more partitions and consumers. Or optimize the consumer processing logic.
  • Can you decrease partitions? No. This would break data ordering and hashing. You must create a new topic.

III. Failure Modes

  • What if Zookeeper dies? Older Kafka can keep serving existing partition leaders for a while, but no controller (and therefore no new partition leader) can be elected, so broker failures go unhandled. New Kafka (KRaft) removes the ZK dependency.
  • Message Duplication? Use Idempotent Producer (enable.idempotence=true).

14. Summary: The Whiteboard Strategy

If asked to design Kafka, draw this 4-Quadrant Layout:

1. Requirements

  • Func: Pub/Sub, Replay, Order.
  • Non-Func: High Throughput, Durability.
  • Scale: Trillions/day.

2. Architecture

[Producers] -> [Broker Cluster]
(Topic -> Partitions)
|
[Consumer Groups]

* Broker: Dumb Log Storage.
* ZK/KRaft: Metadata.

3. Storage

Partition: [Seg 1] [Seg 2] (Append Only)
Index: Sparse (Offset -> Byte)
Message: Key, Value, Timestamp

4. Performance

  • Zero Copy: `sendfile` skips the user-space copies; data moves from the page cache to the NIC via DMA.
  • Sequential I/O: Disk seeks are the enemy; append-only writes avoid them.
  • Batching: Group messages into batches and compress them as a unit.

Next, we look at Chubby, the system that coordinates all this madness: Chubby (Distributed Lock).