Design Uber / Yelp (Geospatial Index)

1. What is a Geospatial System?

A Geospatial System (or Location-Based Service - LBS) is the engine behind modern on-demand apps. Whether it’s Uber matching riders with drivers, Yelp finding nearby sushi, or Tinder finding matches within 5 miles, the core problem is Proximity Search.

[!TIP] The Core Challenge: Standard databases (B-Trees) are excellent for 1D data (e.g., salary > 50000). They are terrible for 2D data (e.g., x and y near me). We need specialized Spatial Indexing structures to map 2D coordinates into 1D keys.

Real-World Examples

  • Uber/Lyft: Real-time driver tracking and matching.
  • Google Maps: Points of Interest (POI) search.
  • Pokemon Go: Spawning entities based on player location.

2. Requirements & Goals

Functional Requirements

  1. Driver Location Updates: Drivers ping their location (lat/long) every 5 seconds.
  2. Nearby Search: Passengers can see available drivers within a radius (e.g., 2km).
  3. ETA Calculation: Estimate arrival time based on road traffic.
  4. Trip History: Store trip data for billing and analytics.

Non-Functional Requirements

  1. Low Latency: Search results must return in < 200ms.
  2. High Throughput: Handle millions of driver updates per second globally.
  3. Availability: The system must be always online (CAP: AP over CP).
  4. Accuracy: Minimal lag between actual driver position and system state.

3. Capacity Estimation

Let’s design for Uber-scale.

  • Total Users: 500 Million.
  • Daily Active Users (DAU): 20 Million.
  • Total Drivers: 5 Million.
  • Active Drivers: 1 Million (online at any given time).

Throughput (QPS)

  • Driver Updates: Each active driver pings every 5 seconds.
    • $1,000,000 \text{ drivers} / 5 \text{ sec} = 200,000 \text{ QPS}$ (Write heavy).
  • Passenger Searches: Assume 50M searches per day.
    • $50,000,000 / 86400 \approx 600 \text{ QPS}$ (Read is much lower than Write).

Storage

  • Location data is ephemeral. We only need the current location for the hot path.
  • 1 Million active drivers $\times$ 100 bytes (ID, lat, long, status) $\approx$ 100 MB. This fits easily in Memory (Redis).
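The same estimates as a quick script (the record size and ping interval are the assumptions stated above):

```python
# Back-of-the-envelope check of the numbers above.
ACTIVE_DRIVERS = 1_000_000
PING_INTERVAL_S = 5
SEARCHES_PER_DAY = 50_000_000
BYTES_PER_DRIVER = 100            # id, lat, long, status (assumed record size)

write_qps = ACTIVE_DRIVERS / PING_INTERVAL_S              # 200,000 writes/sec
read_qps = SEARCHES_PER_DAY / 86_400                      # ~580 reads/sec
hot_set_mb = ACTIVE_DRIVERS * BYTES_PER_DRIVER / 1e6      # ~100 MB, fits in RAM

print(f"writes: {write_qps:,.0f} QPS | reads: {read_qps:,.0f} QPS | hot set: {hot_set_mb:.0f} MB")
```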

4. System APIs

1. Update Driver Location

POST /v1/driver/location
Authorization: Bearer <token>
{
  "lat": 37.7749,
  "long": -122.4194,
  "status": "AVAILABLE" // or BUSY
}

2. Find Nearby Drivers

GET /v1/passenger/nearby?lat=37.7749&long=-122.4194&radius=2000
Response:
{
  "drivers": [
    { "id": "d1", "lat": 37.7750, "long": -122.4195 },
    { "id": "d2", "lat": 37.7751, "long": -122.4193 }
  ]
}

5. Database Design & Spatial Indexing

Why not just use SQL?

SELECT * FROM drivers
WHERE lat BETWEEN my_lat - R AND my_lat + R
AND long BETWEEN my_long - R AND my_long + R

A B-Tree index can only narrow one dimension at a time: the query still scans every row inside the latitude band before filtering on longitude. At millions of rows, that is far too slow.

The Solution: Grid-Based Indexing

We need to map 2D coordinates into a 1D value that preserves locality.

Comparison: Geohash vs QuadTree vs Google S2

| Feature | Geohash (String) | QuadTree (Tree) | Google S2 (Integer) |
| --- | --- | --- | --- |
| Data Structure | Base-32 string (bit interleaving) | Recursive 4-ary tree | Hilbert curve on a cube |
| Locality | OK (Z-order curve) | Good (adaptive) | Excellent (Hilbert curve) |
| Query Logic | Prefix search (`LIKE '9q8%'`) | Tree traversal | Integer range scan |
| Precision | Fixed grid sizes | Variable / adaptive | Fixed cell levels (1-30) |
| Edge Cases | Distortion at the poles | Complex to implement | Handles sphere math correctly |
| Best For | Simple string-based DBs | Dense/sparse data variance | High scale (Uber/Tinder) |

Option C: Google S2 (The Industry Standard)

  • Maps the sphere onto a cube, then uses a Hilbert Space-Filling Curve to map 2D cells to 1D integers (Cell IDs).
  • Why Uber uses S2:
    • Uniformity: S2 cells are roughly square everywhere on Earth. Geohash rectangles get very thin at the poles.
    • Math: Hilbert curves preserve locality better than Z-order curves (Geohash).
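
As a concrete illustration, the snippet below maps a lat/long to an S2 Cell ID using the `s2sphere` Python port of the S2 library; the package choice and the level are assumptions for this example, not a prescribed setup.

```python
# Sketch: lat/long -> S2 Cell ID, assuming the s2sphere Python port of S2.
import s2sphere

lat, lng = 37.7749, -122.4194                      # San Francisco
point = s2sphere.LatLng.from_degrees(lat, lng)
leaf = s2sphere.CellId.from_lat_lng(point)         # level-30 leaf cell
cell = leaf.parent(12)                             # coarsen to level 12 (a couple of km across)

print(cell.id())        # 64-bit integer Cell ID -- the 1D index / shard key
print(cell.to_token())  # compact hex token form of the same cell
```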

Z-Order vs Hilbert Curve

Both curves map 2D points onto a 1D line, but they differ in how well they preserve locality:

  • Z-Order (Geohash): Often makes large jumps (e.g., from cell 7 to 8). Points that are close in ID might be far in space.
  • Hilbert (Google S2): Preserves locality much better. Adjacent 1D values are almost always adjacent in 2D space.

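To make the difference concrete, here is a self-contained sketch that computes a Z-order (Morton) index and a Hilbert index for the cells of a small grid, then walks each curve and measures how far consecutive cells are in 2D. It is an illustration of the two curves, not the Geohash or S2 encoding itself.

```python
# Self-contained sketch: Z-order (Morton) vs Hilbert index on a small grid.

def morton_index(x: int, y: int, bits: int = 8) -> int:
    """Interleave the bits of x and y -- the Z-order curve used by Geohash."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)        # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)    # y bits go to odd positions
    return z

def hilbert_index(x: int, y: int, order: int = 8) -> int:
    """Map (x, y) on a 2**order x 2**order grid to its Hilbert-curve distance."""
    n = 1 << order
    d, s = 0, n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                            # rotate/flip so the sub-curve connects
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d

# Walk each curve in 1D order and measure how far consecutive cells are in 2D.
ORDER = 4                                      # 16 x 16 grid
cells = [(x, y) for x in range(1 << ORDER) for y in range(1 << ORDER)]
for name, index_fn in [("z-order", morton_index), ("hilbert", hilbert_index)]:
    path = sorted(cells, key=lambda c: index_fn(c[0], c[1], ORDER))
    jumps = [abs(a[0] - b[0]) + abs(a[1] - b[1]) for a, b in zip(path, path[1:])]
    # Expect: Hilbert always steps to an adjacent cell; Z-order makes long jumps.
    print(f"{name:8s} max step = {max(jumps)}  avg step = {sum(jumps) / len(jumps):.2f}")
```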

6. High-Level Design

High-Level Architecture: Real-Time Geospatial Flow — Driver Pings | S2 Cell Sharding | Hot/Cold Data Split.

The flow spans four layers: Edge Clients (drivers on WebSocket, passengers searching), the Ingress Layer (WS Gateway for stateful connections, API Gateway for REST/search), Core Geospatial Services (Location Service with an S2 Resolver mapping points to cells, and a Search Service for radius queries), and the Storage Layer (a Redis Hot Index sharded by S2 cell range, e.g. shards for `9q` and `9v`, plus Kafka feeding Cassandra for trip history and analytics). Three paths cut across it: the Driver Ping fast path (location_ping → ZADD by Cell ID), the Nearby Search read path (GET /nearby → range search), and the Durable History cold path (history events via Kafka to Cassandra).

  1. Driver App: Sends real-time location pings via WS Gateway (WebSocket) to maintain long-lived stateful connections.
  2. Location Service: Processes incoming pings.
    • Hot Path: Uses the S2 Resolver to map lat/long coordinates to Google S2 Cell IDs. It then updates the Redis Hot Index using ZADD.
    • Cold Path: Simultaneously pushes a “History Event” to Kafka, which is consumed by workers and persisted in Cassandra for billing and trip history.
  3. Passenger App: Requests nearby drivers via the Search Service.
  4. Redis Cluster: Stores active driver locations, sharded by S2 Cell ID range to enable fast, distributed $O(\log N)$ range queries.

[!NOTE] See Load Balancing for how to handle the ingress traffic.
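
One way to realize the "ZADD by Cell ID" hot index from the diagram is a sorted set per S2 cell, with the driver as member and the last-ping timestamp as score. The sketch below uses the `redis-py` client; the key scheme and the grid-snapping `resolve_cell` stand-in are illustrative assumptions, not the production schema.

```python
# Sketch of the Redis Hot Index: one sorted set per S2 cell.
import time
import redis

r = redis.Redis(host="localhost", port=6379)   # single node stands in for the sharded cluster
STALE_AFTER_S = 30

def resolve_cell(lat: float, lng: float) -> str:
    # Stand-in for a real S2 lookup: snap to a coarse lat/long grid.
    return f"{round(lat, 2)}:{round(lng, 2)}"

def record_ping(driver_id: str, lat: float, lng: float) -> None:
    cell = resolve_cell(lat, lng)
    now = time.time()
    # Member = driver ID, score = last-ping timestamp; ZADD overwrites old scores.
    r.zadd(f"cell:{cell}", {driver_id: now})
    # Sweep members that have not pinged recently ("ghost driver" cleanup).
    r.zremrangebyscore(f"cell:{cell}", "-inf", now - STALE_AFTER_S)
```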


7. Deep Dive: QuadTree Implementation

A QuadTree is a tree data structure in which each internal node has exactly four children: NW, NE, SW, SE.

Algorithm

  1. Insert(Point):
    • Start at Root.
    • If leaf node and points < Capacity: Add point.
    • If leaf node and points == Capacity: Split into 4 children. Distribute existing points to children.
    • If internal node: Recurse into the correct quadrant.
  2. Search(Region):
    • Start at Root.
    • If node overlaps with search region:
      • Check points in this node.
      • Recurse into children.

This creates a map where dense areas are granular, and sparse areas are broad.
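
The following is a minimal, self-contained sketch of that algorithm; the capacity, coordinates, and payloads are made up for illustration.

```python
# Minimal point QuadTree: insert splits full leaves; search prunes any node
# whose bounding box does not overlap the query rectangle.
from dataclasses import dataclass, field

CAPACITY = 4  # max points per leaf before splitting (assumed)

@dataclass
class Node:
    x: float; y: float; w: float; h: float        # bounding box (origin + size)
    points: list = field(default_factory=list)    # [(px, py, payload), ...]
    children: list = field(default_factory=list)  # NW, NE, SW, SE when split

    def contains(self, px, py):
        return self.x <= px < self.x + self.w and self.y <= py < self.y + self.h

    def overlaps(self, qx, qy, qw, qh):
        return not (qx + qw < self.x or self.x + self.w < qx or
                    qy + qh < self.y or self.y + self.h < qy)

    def insert(self, px, py, payload):
        if not self.contains(px, py):
            return False
        if not self.children and len(self.points) < CAPACITY:
            self.points.append((px, py, payload))
            return True
        if not self.children:                      # split into NW, NE, SW, SE
            hw, hh = self.w / 2, self.h / 2
            self.children = [Node(self.x, self.y + hh, hw, hh),       # NW
                             Node(self.x + hw, self.y + hh, hw, hh),  # NE
                             Node(self.x, self.y, hw, hh),            # SW
                             Node(self.x + hw, self.y, hw, hh)]       # SE
            old, self.points = self.points, []
            for op in old:                         # redistribute existing points
                self.insert(*op)
        return any(c.insert(px, py, payload) for c in self.children)

    def search(self, qx, qy, qw, qh, found=None):
        found = [] if found is None else found
        if not self.overlaps(qx, qy, qw, qh):
            return found                           # prune this whole subtree
        for px, py, payload in self.points:
            if qx <= px <= qx + qw and qy <= py <= qy + qh:
                found.append(payload)
        for c in self.children:
            c.search(qx, qy, qw, qh, found)
        return found

root = Node(0, 0, 100, 100)
for p in [(10, 10, "driver_1"), (12, 11, "driver_2"), (80, 80, "driver_3")]:
    root.insert(*p)
print(root.search(5, 5, 10, 10))   # -> ['driver_1', 'driver_2']
```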


8. Data Partitioning (Sharding)

With 200k updates/sec, a single Redis instance will melt. For detailed scaling strategies, see Module 07: Data Scaling.

Sharding Strategies

  1. By City/Region:
    • “San Francisco” server, “New York” server.
    • Problem: “Hot Cities”. NY might overload while “Montana” is idle.
    • Problem: Boundary issues (Driver moving from SF to Daly City).
  2. By Geohash / S2 Cell ID (Consistent Hashing):
    • As shown in the Redis Hot Index shards in our diagram, we hash the S2 Cell ID to map it to a specific Redis node.
    • This ensures that nearby drivers (who share cell ID prefixes) are more likely to be on the same shard, or predictable adjacent shards.
    • Preferred Approach.
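
As a sketch of the preferred approach, the snippet below hashes a coarse prefix of the S2 Cell ID onto a consistent-hash ring of Redis nodes, so nearby cells tend to land on the same shard. The node names, virtual-node count, and prefix width are illustrative assumptions.

```python
# Sketch: route an S2 Cell ID to a Redis shard via consistent hashing.
import bisect
import hashlib

NODES = ["redis-shard-1", "redis-shard-2", "redis-shard-3"]   # assumed shard names
VNODES = 64          # virtual nodes per shard to smooth out load
COARSE_SHIFT = 40    # crude stand-in for "use a coarser parent cell" so nearby cells co-locate

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

# Build the ring once: many virtual points per physical node.
ring = sorted((_hash(f"{node}#{i}"), node) for node in NODES for i in range(VNODES))
ring_hashes = [h for h, _ in ring]

def shard_for_cell(cell_id: int) -> str:
    """All fine-grained cells under the same coarse prefix map to one shard."""
    region = cell_id >> COARSE_SHIFT
    idx = bisect.bisect(ring_hashes, _hash(str(region))) % len(ring)
    return ring[idx][1]

example_cell_id = 0x1234567890ABCDEF   # placeholder 64-bit cell ID
print(shard_for_cell(example_cell_id))
```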

9. Reliability & Trade-offs

The “Ghost Driver” Problem (Accuracy)

When a driver app crashes or loses internet, they might appear “stuck” on the map.

  • Solution: Attach a TTL (Time-To-Live) to every driver's presence in the Hot Index.
  • Redis TTLs apply to whole keys, not to individual members of a geo/sorted set, so in practice each driver gets a small heartbeat key that is refreshed on every ping: SET driver:123:alive 1 EX 30 (sketched below).
  • If no update arrives within 30 seconds, the heartbeat expires and the driver disappears from search results.
  • This is a form of Passive Expiration common in Caching Strategies.
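
A minimal sketch of that heartbeat pattern with the `redis-py` client; the key name is an illustrative assumption.

```python
# Sketch: per-driver heartbeat key with a 30-second TTL (passive expiration).
import redis

r = redis.Redis()

def on_ping(driver_id: str) -> None:
    # Refreshed on every location ping; expires on its own if pings stop.
    r.set(f"driver:{driver_id}:alive", 1, ex=30)

def is_ghost(driver_id: str) -> bool:
    # Drivers returned by the geo index but with no live heartbeat are filtered out.
    return not r.exists(f"driver:{driver_id}:alive")
```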

CAP Theorem Trade-off

  • We choose AP (Availability + Partition Tolerance).
  • Eventual Consistency: It’s acceptable if a driver appears 10 meters away from their actual spot. The map is always an approximation of reality.
  • Liveness over Correctness: It is unacceptable for the user to see a “Service Unavailable” screen just because one Redis node is syncing.

10. QuadTree Range Search in Practice

Picture the map as nested boxes.

  1. Add Drivers: Each new point lands in a single leaf; once a leaf exceeds its capacity it splits into four smaller boxes, so dense areas get finer boxes.
  2. Range Search: Only nodes whose boxes overlap the query region are visited; every other subtree is ignored.

The search inspects a handful of overlapping boxes instead of scanning the whole world.


11. System Walkthrough: The Life of a Ride

Let’s trace the exact flow of data when a Driver goes online and a Passenger searches for them.

Scenario A: Driver Pings Location (Write Path)

  1. Driver App (User 123) sends a ping via WebSocket.
    {
      "event": "update_location",
      "lat": 37.7749,
      "long": -122.4194,
      "status": "AVAILABLE"
    }
    
  2. Location Service receives the message.
    • Converts Lat/Long to S2 Cell ID (Level 12).
    • Cell ID: 808580ff... (Hex).
  3. Redis Update:
    • The service identifies the correct Redis Shard (e.g., Shard 5 for San Francisco).
    • Executes GEOADD (or ZADD with an encoded integer) plus a heartbeat refresh, since individual members of a geo set cannot carry their own TTL.
      GEOADD drivers:available -122.4194 37.7749 "driver_123"
      SET driver:123:alive 1 EX 30  # refresh the heartbeat TTL
      
  4. Kafka Publish (Async):
    • Publishes event to driver-locations topic for historical tracking.
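
A sketch of the asynchronous publish, assuming the `kafka-python` client; the broker address and serializer are placeholders, and the send is fire-and-forget so it never blocks the hot path.

```python
# Sketch: fire-and-forget publish of the location event to Kafka (cold path).
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",                          # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks=1,                                                  # don't wait for full replication on the hot path
)

def publish_location(driver_id: str, lat: float, lng: float, status: str) -> None:
    event = {"driver_id": driver_id, "lat": lat, "long": lng, "status": status}
    producer.send("driver-locations", event)                 # async; no blocking .get()
```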

Scenario B: Passenger Searches for Ride (Read Path)

  1. Passenger App requests nearby drivers.
    GET /nearby?lat=37.7750&long=-122.4190&radius=500m
    
  2. Search Service:
    • Calculates the S2 Cell ID for the passenger.
    • Determines neighboring cells to query (usually the center cell + 8 neighbors).
  3. Redis Query:
    • Queries the Redis Shard for all drivers in those cells.
    • GEORADIUS drivers:available -122.4190 37.7750 500 m WITHDIST
  4. Response:
    {
      "drivers": [
        { "id": "driver_123", "lat": 37.7749, "long": -122.4194, "dist_m": 15.2 }
      ]
    }
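
Putting the read path together, here is a self-contained sketch that unions candidates from the passenger's cell and its 8 neighbors, then filters by exact haversine distance. The `cells`, `drivers_in_cell`, and `locations` inputs stand in for the S2 neighbor lookup and the Redis hot index.

```python
# Sketch: nearby search = union of 9 cells, then an exact haversine filter.
import math

EARTH_RADIUS_M = 6_371_000

def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two lat/long points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp, dl = math.radians(lat2 - lat1), math.radians(lng2 - lng1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def nearby_drivers(lat, lng, radius_m, cells, drivers_in_cell, locations):
    """cells: center cell + 8 neighbors; drivers_in_cell / locations are lookups
    backed by the Redis hot index in the real system (plain dicts here)."""
    results = []
    for cell in cells:                            # only 9 cells, never the whole world
        for driver_id in drivers_in_cell.get(cell, []):
            d_lat, d_lng = locations[driver_id]
            dist = haversine_m(lat, lng, d_lat, d_lng)
            if dist <= radius_m:
                results.append({"id": driver_id, "dist_m": round(dist, 1)})
    return sorted(results, key=lambda r: r["dist_m"])
```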
    

12. Requirements Traceability Matrix

| Requirement | Architectural Solution |
| --- | --- |
| Real-Time Updates | WebSocket Gateway (stateful) + in-memory Redis |
| Low Latency (<200ms) | Google S2 index (fast math) + sharded Redis Cluster |
| Accuracy | 30s TTL (Time-to-Live) ensures stale drivers disappear |
| Scalability (200k QPS) | Database sharding by S2 Cell ID (geospatial partitioning) |
| Availability (AP) | Redis Cluster with replicas; eventual consistency accepted |
| Historical Data | Async Kafka pipeline to Cassandra (write-optimized) |

13. Follow-Up Questions: The Interview Gauntlet

This section covers 50 rapid-fire questions to test your depth.

I. Geospatial Indexing

  • Why S2 over Geohash? S2 cells are square-like, while Geohash rectangles distort heavily near poles. S2 uses Hilbert curves for better locality.
  • Why not Postgres PostGIS? PostGIS uses R-Tree indexes (disk-based). While powerful, it cannot absorb 200k location writes per second; an in-memory store (Redis) is required for the hot path.
  • How does QuadTree rebalancing work? If a node exceeds capacity (e.g., 500 points), split it into 4 children. If children become sparse, merge them.
  • What is the “Edge Case” problem? A user standing on a grid line might not see a driver 1 meter away in the adjacent cell. We solve this by querying neighbor cells (all 9 blocks).

II. Scalability & Performance

  • How to shard Redis? Consistent Hashing on the S2 Cell ID. This keeps nearby drivers on the same shard.
  • Hot Spot Problem: What if Times Square has 10k drivers? The shard melts. Solution: Further split hot cells into smaller child cells (Dynamic Sharding) or use Read Replicas.
  • WebSocket vs Polling: WebSocket reduces overhead (headers) and latency. Essential when drivers push updates every few seconds.
  • Bandwidth Usage: 200k * 100 bytes = 20 MB/s. Manageable, but use Protobuf instead of JSON to reduce size by 60%.

III. Reliability & Failure Modes

  • Redis Crash: If Redis memory is wiped, we lose current locations. This is acceptable. Drivers will ping again in 5 seconds, self-healing the state.
  • Network Partition: If SF cannot talk to NY, it’s fine. Geospatial data is naturally partitioned by location.
  • Ghost Drivers: Handled by Redis TTL. If the app crashes, the key expires automatically.

14. Summary: The Whiteboard Strategy

If asked to design Uber, draw this 4-Quadrant Layout:

1. Requirements

  • Func: Driver Ping (Write), Passenger Search (Read).
  • Non-Func: Low Latency, High Throughput (200k QPS).
  • Scale: 500M Users, 1M Active Drivers.

2. Architecture

[Driver] --(WS)--> [Location Svc] --async--> [Kafka] --> [Cassandra]
                         |
                         v
                [Redis Cluster (S2)]
                         ^
                         |
[Passenger] --(HTTP)--> [Search Svc]

* S2 Index: Maps 2D to 1D.
* Redis: Handles high write throughput.

3. Data & API

Redis: GEOADD key long lat member
API: POST /location, GET /nearby
S2: Level 12 (cells roughly 2 km across)

4. Trade-offs

  • QuadTree vs S2: S2 has better math/locality. QuadTree is harder to balance distributedly.
  • CAP: AP (Availability). It's okay if map is slightly stale.
  • Protocol: WebSocket for drivers (push), HTTP for passengers (pull).

Return to Specialized Systems