Design an API Rate Limiter
[!IMPORTANT] In this lesson, you will master:
- 1. What is a Rate Limiter?: Building intuition for what a rate limiter does and why services need one.
- 2. Requirements & Goals: Defining the functional and non-functional requirements.
- 3. Capacity Estimation: Sizing the traffic and memory footprint.
1. What is a Rate Limiter?
A Rate Limiter restricts the number of requests a user can make to a service within a given time window. Think of it like a Bouncer at a nightclub: the bouncer ensures the club doesn’t get dangerously overcrowded by only letting a specific number of people in per hour, and blocking the rest until space frees up. It is the first line of defense against DoS attacks, Resource Starvation, and Cost Overruns.
Real-World Examples:
- Twitter API: “You have exceeded your limit of 300 tweets per 3 hours.”
- Login Forms: “Too many failed attempts. Try again in 15 minutes.”
[!TIP] Throttling vs Rate Limiting:
- Rate Limiting: “You can make 10 requests per minute.” (Hard Limit, returns 429).
- Throttling: “You are making too many requests, I will slow you down.” (Soft Limit, queues requests). In interviews, these terms are often used interchangeably, but usually refer to the “Hard Limit” scenario.
2. Requirements & Goals (P - Problem & Requirements)
Functional Requirements
- Limit: Accurately limit requests (e.g., 10 req/sec) based on rules.
- Feedback: Return clear HTTP headers (`X-Ratelimit-Remaining`, `Retry-After`) and an error code (HTTP 429 Too Many Requests).
- Granularity: Limit by User ID, IP Address, or API Key.
Non-Functional Requirements
- Low Latency: The check must be super fast (< 2ms). It is in the hot path of every request.
- Distributed: Works across multiple servers.
- Fail-Open: If the rate limiter crashes, it should allow traffic (don’t break the site) rather than blocking legitimate users.
3. Capacity Estimation (E - Estimation)
If we have 1 Million Active Users and each makes 100 requests/day:
- Total Requests: 100M/day ≈ 1,157 QPS on average (100,000,000 / 86,400 seconds).
- Peak QPS might be 10k+.
- Storage: We need to store counters in memory.
- UserID (8 bytes) + Count (4 bytes) + Timestamp (4 bytes) ≈ 20 bytes per user (16 bytes of data plus overhead).
- For 1M users, memory usage is tiny (~20MB). We can easily fit this in Redis.
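The arithmetic above can be checked in a few lines (values taken from the estimates in this section):

```python
# Back-of-envelope check of the capacity estimates above.
users = 1_000_000
requests_per_user_per_day = 100

total_per_day = users * requests_per_user_per_day   # 100M requests/day
avg_qps = total_per_day / 86_400                    # seconds in a day

bytes_per_counter = 20                              # ~8B id + 4B count + 4B ts + overhead
memory_mb = users * bytes_per_counter / 1_000_000

print(round(avg_qps))   # ≈ 1157 QPS average
print(memory_mb)        # 20.0 MB -> fits easily in Redis
```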
4. System Interface (D - Data Model)
The Rate Limiter usually sits as Middleware (like a mandatory security checkpoint), e.g., inside the API Gateway shown in the diagram, or as a Sidecar.
Response Headers:
```
HTTP/1.1 429 Too Many Requests
X-Ratelimit-Limit: 100
X-Ratelimit-Remaining: 5
X-Ratelimit-Reset: 1640995200
```
(`X-Ratelimit-Reset` is a Unix timestamp.)
5. Database Design: Redis Counters (D - Data Model)
We don’t use a SQL database for rate limiting because disk I/O is too slow. We use an In-Memory Store (Redis).
Data Model (Hash)
- Key: `rate_limit:{user_id}`
- Fields:
  - `tokens`: Integer (current tokens left)
  - `last_refill`: Timestamp (last time we added tokens)
Alternative: Sorted Sets (For Sliding Window Log)
- Key: `rate_limit:{user_id}`
- Score: Timestamp
- Member: Request ID
- Note: This approach is memory-heavy and typically avoided for high-scale systems.
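The sorted-set idea can be sketched in-process with a deque standing in for the sorted set (class name and parameters are illustrative; in Redis the eviction is `ZREMRANGEBYSCORE` and the count check is `ZCARD`):

```python
from collections import deque

class SlidingWindowLog:
    """In-process analogue of the Redis sorted-set approach:
    keep one timestamp per request, evict entries older than the window."""
    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.log = deque()  # timestamps, oldest first (sorted-set scores in Redis)

    def allow(self, now):
        # Evict timestamps that fell out of the window (ZREMRANGEBYSCORE).
        while self.log and self.log[0] <= now - self.window:
            self.log.popleft()
        if len(self.log) < self.limit:   # ZCARD check
            self.log.append(now)         # ZADD
            return True
        return False

rl = SlidingWindowLog(limit=3, window_seconds=60)
print([rl.allow(t) for t in (0, 1, 2, 3)])   # [True, True, True, False]
print(rl.allow(61))                          # True (old entries were evicted)
```

Note that the log stores one entry per request, which is exactly why this approach is memory-heavy at scale.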
6. High-Level Design (Distributed) (A - Architecture)
Interview-Friendly High-Level Diagram
This is the simplified version of the architecture you should draw on the whiteboard.
```mermaid
graph TD
    User([User]) --> LB[Load Balancer]
    subgraph Edge Layer
        LB --> Gateway[API Gateway / Sidecar]
        Gateway --> Limiter[Rate Limiter Middleware]
    end
    subgraph Storage Layer
        Limiter -- Lua Script --> Redis[(Redis Cluster)]
    end
    subgraph Service Layer
        Limiter -- Allow --> AppSvc[Backend Microservices]
        Limiter -- Deny --> 429[429 Too Many Requests]
    end
```
The middleware (API Gateway/Sidecar) intercepts requests before they hit the backend.
7. Component Design: Algorithms & Lua (L - Localized Details)
This is the core interview question. Which algorithm do you choose?
Algorithms
A. Token Bucket (Amazon/Stripe)
- Concept: A bucket holds `N` tokens. Tokens refill at rate `R`. Each request consumes 1 token. Think of it like a subway turnstile with a bucket of tokens: as long as you have a token, you can burst through; once empty, you must wait for the bucket to refill.
- Pros: Allows Bursts (if the bucket is full, you can send `N` requests instantly). Memory efficient.
- Cons: Two parameters to tune (Bucket Size + Refill Rate).
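A minimal in-process sketch of the token-bucket logic (names and parameters are illustrative; the distributed version lives in Redis, as shown later in this lesson):

```python
class TokenBucket:
    """Token bucket: capacity N, refilling at `rate` tokens per second."""
    def __init__(self, capacity, rate):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity      # start with a full bucket
        self.last_refill = 0.0

    def allow(self, now):
        # Refill proportionally to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

tb = TokenBucket(capacity=3, rate=1)       # 3-token burst, 1 token/sec refill
print([tb.allow(0) for _ in range(4)])     # [True, True, True, False] - burst of 3, then empty
print(tb.allow(2))                         # True (2 tokens refilled after 2 seconds)
```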
B. Leaky Bucket (Shopify/Nginx)
- Concept: Requests enter a queue (bucket). They are processed (leak) at a constant rate. If queue full, discard. Think of it like pouring water into a funnel: no matter how fast you pour water in (bursts), it drips out of the bottom at a steady, predictable rate.
- Pros: Smooths traffic. Stable outflow prevents overloading downstream services.
- Cons: Bursts fill the queue and cause valid requests to drop.
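The leaky bucket can be sketched as a bounded queue plus a fixed-rate drain (class and method names are illustrative):

```python
from collections import deque

class LeakyBucket:
    """Leaky bucket as a bounded queue: requests queue up, a worker
    drains them at a constant rate; overflow is dropped."""
    def __init__(self, queue_size):
        self.queue = deque()
        self.queue_size = queue_size

    def offer(self, request):
        if len(self.queue) >= self.queue_size:
            return False             # bucket full -> discard the request
        self.queue.append(request)
        return True

    def leak(self):
        # Called once per tick by a fixed-rate worker (the constant "drip").
        return self.queue.popleft() if self.queue else None

lb = LeakyBucket(queue_size=2)
print([lb.offer(i) for i in range(3)])   # [True, True, False] - the burst overflows
print(lb.leak(), lb.leak(), lb.leak())   # 0 1 None - steady outflow
```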
C. Fixed Window Counter
- Concept: Count requests in the window `12:00-12:01`; reset the counter at `12:01`.
- Pros: A simple `INCR` operation.
- Cons: Edge Case (Spike). If I send 100 requests at `12:00:59` and 100 at `12:01:01`, I sent 200 in 2 seconds, violating the limit.
D. Sliding Window Counter (The Winner)
- Concept: Hybrid approach. Weighted average of the previous window count and current window count. It acts as a rolling summary to smooth out sudden spikes at the edge of time windows.
- Formula: `Requests = (Requests in Current Window) + (Requests in Previous Window × Overlap Percentage)`
- Pros: Accurate enough, memory efficient.
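A worked example of the formula (the numbers are illustrative): with a 100/min limit, 80 requests in the previous minute, 30 in the current one, and the window 25% elapsed, the rolling estimate is 30 + 80 × 0.75 = 90, so the request is allowed.

```python
def sliding_window_estimate(prev_count, curr_count, elapsed_fraction):
    """Weighted estimate: current count plus the share of the previous
    window that still overlaps the rolling one-window lookback."""
    overlap = 1.0 - elapsed_fraction
    return curr_count + prev_count * overlap

# 25% into the current minute: 75% of the previous window still "counts".
est = sliding_window_estimate(prev_count=80, curr_count=30, elapsed_fraction=0.25)
print(est)          # 90.0
print(est <= 100)   # True -> allowed under a 100/min limit
```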
The Race Condition Problem (Distributed)
If two requests come at the exact same time (Concurrency), a classic Read-Modify-Write race occurs:
1. Request A reads `tokens = 1`.
2. Request B reads `tokens = 1`.
3. Request A decrements and writes `0`.
4. Request B decrements and writes `0`.

Result: Both requests passed, but only one token was consumed. We allowed more traffic than the limit.
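The interleaving can be replayed deterministically with a plain dict standing in for Redis; no real threads are needed to see the bug:

```python
# Deterministic replay of the read-modify-write race, with a dict
# standing in for Redis. Only 1 token exists, yet both requests pass.
store = {"tokens": 1}

# Both requests READ before either WRITES:
a_read = store["tokens"]       # Request A sees 1
b_read = store["tokens"]       # Request B sees 1

store["tokens"] = a_read - 1   # A writes 0 -> A is allowed
store["tokens"] = b_read - 1   # B writes 0 -> B is also allowed

allowed = (a_read >= 1) + (b_read >= 1)
print(allowed)                 # 2 requests admitted
print(store["tokens"])         # but only 1 token was ever consumed
```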
Solution: Redis Lua Scripts (Atomicity) We use Lua scripts in Redis. A Lua script executes as a single Atomic Transaction (like a person entering a locked room to do work—no one else can interrupt or see what they are doing until they leave); no other command can run in between its steps.
Rate Limiter Lua Script (Token Bucket): (This script runs inside the Redis Cluster node)
```lua
-- KEYS[1]: User Token Key
-- ARGV[1]: Refill Rate
-- ARGV[2]: Capacity
-- ARGV[3]: Current Timestamp (Redis Time)
local key = KEYS[1]
local rate = tonumber(ARGV[1])
local capacity = tonumber(ARGV[2])
local now = tonumber(ARGV[3])

-- Get current tokens and last refill time
local data = redis.call("HMGET", key, "tokens", "last_refill")
local tokens = tonumber(data[1]) or capacity
local last_refill = tonumber(data[2]) or now

-- Refill Logic
local delta = math.max(0, now - last_refill)
local filled_tokens = math.min(capacity, tokens + (delta * rate))

if filled_tokens >= 1 then
  -- Consumption allowed
  redis.call("HMSET", key, "tokens", filled_tokens - 1, "last_refill", now)
  return 1 -- Allowed
else
  return 0 -- Rejected (429)
end
```
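A pure-Python transcription of the script's refill/consume math can be handy for unit-testing limit configurations before deploying the Lua script. This is an illustrative sketch, not the production path; the dict stands in for the Redis hash:

```python
def token_bucket_check(state, rate, capacity, now):
    """Mirrors the Lua script: refill based on elapsed time, then try
    to consume one token. `state` plays the role of the Redis hash
    {"tokens": ..., "last_refill": ...}."""
    tokens = state.get("tokens", capacity)          # default: full bucket
    last_refill = state.get("last_refill", now)
    delta = max(0, now - last_refill)
    filled = min(capacity, tokens + delta * rate)
    if filled >= 1:
        state["tokens"] = filled - 1                # HMSET on the allowed path
        state["last_refill"] = now
        return 1   # Allowed
    return 0       # Rejected -> 429

state = {}
print([token_bucket_check(state, rate=1, capacity=2, now=0) for _ in range(3)])  # [1, 1, 0]
print(token_bucket_check(state, rate=1, capacity=2, now=1))                      # 1 (one token refilled)
```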
8. Data Partitioning (S - Scale)
When we scale to millions of users, a single Redis instance cannot hold all keys. We shard the data like splitting a giant address book into smaller volumes based on the first letter of the last name.
Redis Cluster (Sharding)
- Strategy: Shard by User ID (or API Key).
- Slot Calculation: `CRC16(user_id) % 16384`.
- Result: Each user’s counter lives on a specific Redis node. Operations for a single user are atomic on that node.
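The slot calculation can be sketched as follows. Redis Cluster uses the CRC16-CCITT (XModem) checksum; note also that in a real cluster a key containing braces, like `rate_limit:{user_id}`, is hash-tagged: only the part between `{` and `}` is hashed, which conveniently pins all of a user's keys to one slot. The sketch below hashes the whole key for simplicity:

```python
def crc16(data: bytes) -> int:
    """CRC16-CCITT (XModem), the checksum Redis Cluster uses for key hashing."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ 0x1021) if crc & 0x8000 else (crc << 1)
            crc &= 0xFFFF
    return crc

def hash_slot(key: str) -> int:
    # Simplified: a real cluster hashes only the {hash tag} portion if present.
    return crc16(key.encode()) % 16384

print(hex(crc16(b"123456789")))     # 0x31c3 - the standard XModem test vector
print(hash_slot("rate_limit:42"))   # a fixed slot in [0, 16383]
```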
Multi-Region Partitioning
If you have users in US and EU:
- Option A: Centralized Redis Cluster in US. (High Latency for EU users).
- Option B: Local Redis in each region. (Limits are not shared globally. A user can hit 100 reqs in US + 100 reqs in EU).
- Option C: CRDTs (Conflict-free Replicated Data Types) or Geo-Replicated Redis (Enterprise).
- Recommendation: For most rate limiters, Option B (Local) is acceptable. It is better to allow slight over-usage than to add 100ms latency to every request.
9. Reliability & Fault Tolerance
Fail-Open Strategy
What if Redis crashes?
- Fail-Closed: Reject all requests. (Bad UX - Site goes down).
- Fail-Open: Allow all requests. (Risky - Backend might get overloaded).
- Decision: Fail-Open. It is better to let some spam through than to block legitimate users during an outage.
Clock Skew (NTP)
In the Lua script above, we pass ARGV[3]: Current Timestamp.
- Issue: If Application Server A is 5 seconds behind Server B, users might get extra tokens or be blocked unfairly.
- Solution: Use Redis Time (the `TIME` command) inside the Lua script instead of passing it from the App Server. This ensures all servers rely on the “Redis Clock” as the single source of truth.
10. Decision Visualizer: Token vs Leaky Bucket
Under a traffic burst, the two algorithms behave differently: the Token Bucket admits the burst immediately (bursty output), while the Leaky Bucket drains requests at a constant rate (smooth output).
11. System Walkthrough (Dry Run)
Scenario: User exceeds the limit
- Request: User sends request #11 in the same second (Limit is 10/s).
- Gateway: Intercepts request, calls Rate Limiter Middleware.
- Redis Check: Middleware executes Lua script on Redis.
  - Reads `tokens` (0).
  - Reads `last_refill`.
  - Calculates refill (0 tokens added).
- Result: Script returns `0` (Rejected).
- Response: Gateway returns HTTP 429 Too Many Requests with header `Retry-After: 1` (wait 1 second).
12. Interview Gauntlet
- Why use Redis? Why not a local counter in the App Server?
- Ans: If you have 10 App Servers, a local counter only limits per server. A user could hit 10 different servers and get 10x the limit. We need a shared, distributed counter.
- How do you handle race conditions?
- Ans: Redis Lua scripts. They guarantee atomicity for the Read-Refill-Decrement logic.
- What if Redis goes down?
- Ans: Fail-Open. We bypass the limiter to ensure availability, even if it means risking some overload.
- How to handle Clock Skew?
- Ans: Do not trust App Server time. Use `redis.call('TIME')` in the Lua script to use the Redis server’s clock as the source of truth.
- How do you test a rate limiter?
- Ans: Integration tests with a mock Redis. Simulate bursts using parallel threads and verify that exactly N requests pass.
- How do you handle IP spoofing?
- Ans: Limit by authenticated `user_id` or API Key when possible. For unauthenticated routes, use rate limiting in conjunction with WAF/Cloudflare CAPTCHA challenges.
- What if one specific user is maliciously causing rate limit logs to fill up?
- Ans: Drop their requests at the Gateway/Edge layer before it even hits the Rate Limiter if they exceed the limit by a massive margin (e.g., 100x the limit).
13. Whiteboard Summary (4 Quadrant)
1. P - Core Requirements
- Limit: 1000/s
- Low Latency: < 2ms
- Distributed & Accurate
2. A - Key Architecture
- Middleware: In Gateway/Sidecar
- Store: Redis (In-Memory)
- Logic: Lua Script (Atomic)
3. D - Data Models
- Hash: `tokens`, `last_refill`
- Key: `lim:{user_id}`
- Sharding: By User ID
4. S - Scale & Bottlenecks
- Race Condition: Lua Scripts.
- Latency: Local Redis in each region.
- Failure: Fail-Open strategy.