Design an API Rate Limiter
[!IMPORTANT] In this lesson, you will master:
- 1. What is a Rate Limiter?: Building intuition for what a rate limiter does and why services need one.
- 2. Requirements & Goals: Defining the functional and non-functional requirements.
- 3. Capacity Estimation: Sizing the traffic and memory footprint.
1. What is a Rate Limiter?
A Rate Limiter restricts the number of requests a user can make to a service within a given time window. Think of it like a Bouncer at a nightclub: the bouncer ensures the club doesn’t get dangerously overcrowded by only letting a specific number of people in per hour, and blocking the rest until space frees up. It is the first line of defense against DoS attacks, Resource Starvation, and Cost Overruns.
Real-World Examples:
- Twitter API: “You have exceeded your limit of 300 tweets per 3 hours.”
- Login Forms: “Too many failed attempts. Try again in 15 minutes.”
[!TIP] Throttling vs Rate Limiting:
- Rate Limiting: “You can make 10 requests per minute.” (Hard Limit, returns 429).
- Throttling: “You are making too many requests, I will slow you down.” (Soft Limit, queues requests). In interviews, these terms are often used interchangeably, but usually refer to the “Hard Limit” scenario.
2. Requirements & Goals (P - Problem & Requirements)
Functional Requirements
- Limit: Accurately limit requests (e.g., 10 req/sec) based on rules.
- Feedback: Return clear HTTP headers (`X-Ratelimit-Remaining`, `Retry-After`) and an error code (HTTP 429 Too Many Requests).
- Granularity: Limit by User ID, IP Address, or API Key.
Non-Functional Requirements
- Low Latency: The check must be super fast (< 2ms). It is in the hot path of every request.
- Distributed: Works across multiple servers.
- Fail-Open: If the rate limiter crashes, it should allow traffic (don’t break the site) rather than blocking legitimate users.
3. Capacity Estimation (E - Estimation)
If we have 1 Million Active Users and each makes 100 requests/day:
- Total Requests: 100M/day ≈ 1,157 QPS on average (100,000,000 / 86,400 seconds).
- Peak QPS might be 10k+.
- Storage: We need to store counters in memory.
- UserID (8 bytes) + Count (4 bytes) + Timestamp (4 bytes) ≈ 20 bytes per user (16 bytes of data plus overhead).
- For 1M users, memory usage is tiny (~20MB). We can easily fit this in Redis.
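The arithmetic above can be checked in a few lines (values taken from the estimates in this section):

```python
# Back-of-envelope check of the capacity estimates above.
users = 1_000_000
requests_per_user_per_day = 100

total_per_day = users * requests_per_user_per_day   # 100M requests/day
avg_qps = total_per_day / 86_400                    # seconds in a day

bytes_per_counter = 20                              # ~8B id + 4B count + 4B ts + overhead
memory_mb = users * bytes_per_counter / 1_000_000

print(round(avg_qps))   # ≈ 1157 QPS average
print(memory_mb)        # 20.0 MB -> fits easily in Redis
```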
4. System Interface (D - Data Model)
The Rate Limiter usually sits as Middleware (like a mandatory security checkpoint), e.g., inside the API Gateway shown in the diagram, or as a Sidecar.
Response Headers:
```
HTTP/1.1 429 Too Many Requests
X-Ratelimit-Limit: 100
X-Ratelimit-Remaining: 5
X-Ratelimit-Reset: 1640995200
```
(`X-Ratelimit-Reset` is a Unix timestamp.)
5. Database Design: Redis Counters (D - Data Model)
We don’t use a SQL database for rate limiting because disk I/O is too slow. We use an In-Memory Store (Redis).
Data Model (Hash)
- Key: `rate_limit:{user_id}`
- Fields:
  - `tokens`: Integer (current tokens left)
  - `last_refill`: Timestamp (last time we added tokens)
Alternative: Sorted Sets (For Sliding Window Log)
- Key: `rate_limit:{user_id}`
- Score: Timestamp
- Member: Request ID
- Note: This approach is memory-heavy and typically avoided for high-scale systems.
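The sorted-set idea can be sketched in-process with a deque standing in for the sorted set (class name and parameters are illustrative; in Redis the eviction is `ZREMRANGEBYSCORE` and the count check is `ZCARD`):

```python
from collections import deque

class SlidingWindowLog:
    """In-process analogue of the Redis sorted-set approach:
    keep one timestamp per request, evict entries older than the window."""
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.log = deque()  # timestamps, oldest first (sorted-set scores in Redis)

    def allow(self, now):
        # Evict timestamps that fell out of the window (ZREMRANGEBYSCORE).
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:   # ZCARD check
            self.log.append(now)         # ZADD
            return True
        return False

rl = SlidingWindowLog(limit=3, window_seconds=60)
print([rl.allow(t) for t in (0, 1, 2, 3)])   # [True, True, True, False]
print(rl.allow(61))                          # True (old entries were evicted)
```

Note that the log stores one entry per request, which is exactly why this approach is memory-heavy at scale.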
6. High-Level Design (Distributed) (A - Architecture)
Interview-Friendly High-Level Diagram
This is the simplified version of the architecture you should draw on the whiteboard.
```mermaid
graph TD
    User([User]) --> LB[Load Balancer]
    subgraph Edge Layer
        LB --> Gateway[API Gateway / Sidecar]
        Gateway --> Limiter[Rate Limiter Middleware]
    end
    subgraph Storage Layer
        Limiter -- Lua Script --> Redis[(Redis Cluster)]
    end
    subgraph Service Layer
        Limiter -- Allow --> AppSvc[Backend Microservices]
        Limiter -- Deny --> 429[429 Too Many Requests]
    end
```
The middleware (API Gateway/Sidecar) intercepts requests before they hit the backend.
7. Component Design: Algorithms & Lua (L - Localized Details)
This is the core interview question. Which algorithm do you choose?
Algorithms
A. Token Bucket (Amazon/Stripe)
- Concept: A bucket holds `N` tokens. Tokens refill at rate `R`. Each request consumes 1 token. Think of it like a subway turnstile with a bucket of tokens: as long as you have a token, you can burst through; once empty, you must wait for the bucket to refill.
- Pros: Allows Bursts (if the bucket is full, you can send `N` requests instantly). Memory efficient.
- Cons: Two parameters to tune (Bucket Size + Refill Rate).
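A minimal in-process sketch of the token-bucket logic (names and parameters are illustrative; the distributed version lives in Redis, as shown later in this lesson):

```python
class TokenBucket:
    """Token bucket: capacity N, refilling at `rate` tokens per second."""
    def __init__(self, capacity, rate):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity      # start with a full bucket
        self.last_refill = 0.0

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

tb = TokenBucket(capacity=3, rate=1)       # 3-token burst, 1 token/sec refill
print([tb.allow(0) for _ in range(4)])     # [True, True, True, False] - burst of 3, then empty
print(tb.allow(2))                         # True (2 tokens refilled after 2 seconds)
```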
B. Leaky Bucket (Shopify/Nginx)
- Concept: Requests enter a queue (bucket). They are processed (leak) at a constant rate. If queue full, discard. Think of it like pouring water into a funnel: no matter how fast you pour water in (bursts), it drips out of the bottom at a steady, predictable rate.
- Pros: Smooths traffic. Stable outflow prevents overloading downstream services.
- Cons: Bursts fill the queue and cause valid requests to drop.
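The leaky bucket can be sketched as a bounded queue plus a fixed-rate drain (class and method names are illustrative):

```python
from collections import deque

class LeakyBucket:
    """Leaky bucket as a bounded queue: requests queue up, a worker
    drains them at a constant rate; overflow is dropped."""
    def __init__(self, queue_size):
        self.queue = deque()
        self.queue_size = queue_size

    def offer(self, request):
        if len(self.queue) >= self.queue_size:
            return False             # bucket full -> discard the request
        self.queue.append(request)
        return True

    def leak(self):
        # Called once per tick by a fixed-rate worker (the constant "drip").
        return self.queue.popleft() if self.queue else None

lb = LeakyBucket(queue_size=2)
print([lb.offer(i) for i in range(3)])   # [True, True, False] - the burst overflows
print(lb.leak(), lb.leak(), lb.leak())   # 0 1 None - steady outflow
```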
C. Fixed Window Counter
- Concept: Count requests in the window `12:00-12:01`; reset the counter at `12:01`.
- Pros: A simple `INCR` operation.
- Cons: Edge Case (Spike). If I send 100 requests at `12:00:59` and 100 at `12:01:01`, I sent 200 in 2 seconds, violating the limit.
D. Sliding Window Counter (The Winner)
- Concept: Hybrid approach. Weighted average of the previous window count and current window count. It acts as a rolling summary to smooth out sudden spikes at the edge of time windows.
- Formula: `Requests = (Requests in Current Window) + (Requests in Previous Window × Overlap Percentage)`
- Pros: Accurate enough, memory efficient.
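A worked example of the formula (the numbers are illustrative): with a 100/min limit, 80 requests in the previous minute, 30 in the current one, and the window 25% elapsed, the rolling estimate is 30 + 80 × 0.75 = 90, so the request is allowed.

```python
def sliding_window_estimate(prev_count, curr_count, elapsed_fraction):
    """Weighted estimate: current count plus the share of the previous
    window that still overlaps the rolling one-window lookback."""
    overlap = 1.0 - elapsed_fraction
    return curr_count + prev_count * overlap

# 25% into the current minute: 75% of the previous window still "counts".
est = sliding_window_estimate(prev_count=80, curr_count=30, elapsed_fraction=0.25)
print(est)          # 90.0
print(est <= 100)   # True -> allowed under a 100/min limit
```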
The Race Condition Problem (Distributed)
If two requests come at the exact same time (Concurrency), a classic Read-Modify-Write race occurs:
1. Request A reads `tokens = 1`.
2. Request B reads `tokens = 1`.
3. Request A decrements and writes `0`.
4. Request B decrements and writes `0`.

Result: Both requests passed, but only one token was consumed. We allowed more traffic than the limit.
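The interleaving can be replayed deterministically with a plain dict standing in for Redis; no real threads are needed to see the bug:

```python
# Deterministic replay of the read-modify-write race, with a dict
# standing in for Redis. Only 1 token exists, yet both requests pass.
store = {"tokens": 1}

# Both requests READ before either WRITES:
a_read = store["tokens"]       # Request A sees 1
b_read = store["tokens"]       # Request B sees 1

store["tokens"] = a_read - 1   # A writes 0 -> A is allowed
store["tokens"] = b_read - 1   # B writes 0 -> B is also allowed

allowed = (a_read >= 1) + (b_read >= 1)
print(allowed)                 # 2 requests admitted
print(store["tokens"])         # but only 1 token was ever consumed
```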
Solution: Redis Lua Scripts (Atomicity) We use Lua scripts in Redis. A Lua script executes as a single Atomic Transaction (like a person entering a locked room to do work—no one else can interrupt or see what they are doing until they leave); no other command can run in between its steps.
Rate Limiter Lua Script (Token Bucket): (This script runs inside the Redis Cluster node)
```lua
-- KEYS[1]: User Token Key
-- ARGV[1]: Refill Rate
-- ARGV[2]: Capacity
-- ARGV[3]: Current Timestamp (Redis Time)
local key = KEYS[1]
local rate = tonumber(ARGV[1])
local capacity = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

-- Get current tokens and last refill time
local data = redis.call("HMGET", key, "tokens", "last_refill")
local tokens = tonumber(data[1]) or capacity
local last_refill = tonumber(data[2]) or now

-- Refill Logic
local delta = math.max(0, now - last_refill)
local filled_tokens = math.min(capacity, tokens + (delta * rate))

if filled_tokens >= 1 then
  -- Consumption allowed
  redis.call("HMSET", key, "tokens", filled_tokens - 1, "last_refill", now)
  return 1 -- Allowed
else
  return 0 -- Rejected (429)
end
```
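A pure-Python transcription of the script's refill/consume math can be handy for unit-testing limit configurations before deploying the Lua script. This is an illustrative sketch, not the production path; the dict stands in for the Redis hash:

```python
def token_bucket_check(state, rate, capacity, now):
    """Mirrors the Lua script: refill based on elapsed time, then try
    to consume one token. `state` plays the role of the Redis hash
    {"tokens": ..., "last_refill": ...}."""
    tokens = state.get("tokens", capacity)          # default: full bucket
    last_refill = state.get("last_refill", now)
    delta = max(0, now - last_refill)
    filled = min(capacity, tokens + delta * rate)
    if filled >= 1:
        state["tokens"] = filled - 1                # HMSET on the allowed path
        state["last_refill"] = now
        return 1   # Allowed
    return 0       # Rejected -> 429

state = {}
print([token_bucket_check(state, rate=1, capacity=2, now=0) for _ in range(3)])  # [1, 1, 0]
print(token_bucket_check(state, rate=1, capacity=2, now=1))                      # 1 (one token refilled)
```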
8. Data Partitioning (S - Scale)
When we scale to millions of users, a single Redis instance cannot hold all keys. We shard the data like splitting a giant address book into smaller volumes based on the first letter of the last name.
Redis Cluster (Sharding)
- Strategy: Shard by User ID (or API Key).
- Slot Calculation: `CRC16(user_id) % 16384`.
- Result: Each user’s counter lives on a specific Redis node. Operations for a single user are atomic on that node.
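The slot calculation can be sketched as follows. Redis Cluster uses the CRC16-CCITT (XModem) checksum; note also that in a real cluster a key containing braces, like `rate_limit:{user_id}`, is hash-tagged: only the part between `{` and `}` is hashed, which conveniently pins all of a user's keys to one slot. The sketch below hashes the whole key for simplicity:

```python
def crc16(data: bytes) -> int:
    """CRC16-CCITT (XModem), the checksum Redis Cluster uses for key hashing."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def hash_slot(key: str) -> int:
    # Simplified: a real cluster hashes only the {hash tag} portion if present.
    return crc16(key.encode()) % 16384

print(hex(crc16(b"123456789")))     # 0x31c3 - the standard XModem test vector
print(hash_slot("rate_limit:42"))   # a fixed slot in [0, 16383]
```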
Multi-Region Partitioning
If you have users in US and EU:
- Option A: Centralized Redis Cluster in US. (High Latency for EU users).
- Option B: Local Redis in each region. (Limits are not shared globally. A user can hit 100 reqs in US + 100 reqs in EU).
- Option C: CRDTs (Conflict-free Replicated Data Types) or Geo-Replicated Redis (Enterprise).
- Recommendation: For most rate limiters, Option B (Local) is acceptable. It is better to allow slight over-usage than to add 100ms latency to every request.
9. Reliability & Fault Tolerance
Fail-Open Strategy
What if Redis crashes?
- Fail-Closed: Reject all requests. (Bad UX - Site goes down).
- Fail-Open: Allow all requests. (Risky - Backend might get overloaded).
- Decision: Fail-Open. It is better to let some spam through than to block legitimate users during an outage.
Clock Skew (NTP)
In the Lua script above, we pass ARGV[3]: Current Timestamp.
- Issue: If Application Server A is 5 seconds behind Server B, users might get extra tokens or be blocked unfairly.
- Solution: Use Redis Time (the `TIME` command) inside the Lua script instead of passing it from the App Server. This ensures all servers rely on the “Redis Clock” as the single source of truth.
10. Decision Visualizer: Token vs Leaky Bucket
Under a traffic burst, the two algorithms behave differently: the Token Bucket admits the burst immediately (bursty output), while the Leaky Bucket drains requests at a constant rate (smooth output).
11. System Walkthrough (Dry Run)
Scenario: User exceeds the limit
- Request: User sends request #11 in the same second (Limit is 10/s).
- Gateway: Intercepts request, calls Rate Limiter Middleware.
- Redis Check: Middleware executes Lua script on Redis.
  - Reads `tokens` (0).
  - Reads `last_refill`.
  - Calculates refill (0 tokens added).
- Result: Script returns `0` (Rejected).
- Response: Gateway returns HTTP 429 Too Many Requests with header `Retry-After: 1` (wait 1 second).
12. Interview Gauntlet
- Why use Redis? Why not a local counter in the App Server?
- Ans: If you have 10 App Servers, a local counter only limits per server. A user could hit 10 different servers and get 10x the limit. We need a shared, distributed counter.
- How do you handle race conditions?
- Ans: Redis Lua scripts. They guarantee atomicity for the Read-Refill-Decrement logic.
- What if Redis goes down?
- Ans: Fail-Open. We bypass the limiter to ensure availability, even if it means risking some overload.
- How to handle Clock Skew?
- Ans: Do not trust App Server time. Use `redis.call('TIME')` in the Lua script to use the Redis server’s clock as the source of truth.
- How do you test a rate limiter?
- Ans: Integration tests with a mock Redis. Simulate bursts using parallel threads and verify that exactly N requests pass.
- How do you handle IP spoofing?
- Ans: Limit by authenticated `user_id` or API Key when possible. For unauthenticated routes, use rate limiting in conjunction with WAF/Cloudflare CAPTCHA challenges.
- What if one specific user is maliciously causing rate limit logs to fill up?
- Ans: Drop their requests at the Gateway/Edge layer before it even hits the Rate Limiter if they exceed the limit by a massive margin (e.g., 100x the limit).
13. Whiteboard Summary (4 Quadrant)
1. P - Core Requirements
- Limit: 1000/s
- Low Latency: < 2ms
- Distributed & Accurate
2. A - Key Architecture
- Middleware: In Gateway/Sidecar
- Store: Redis (In-Memory)
- Logic: Lua Script (Atomic)
3. D - Data Models
- Hash: `tokens`, `last_refill`
- Key: `lim:{user_id}`
- Sharding: By User ID
4. S - Scale & Bottlenecks
- Race Condition: Lua Scripts.
- Latency: Local Redis in each region.
- Failure: Fail-Open strategy.