Deep Dive: Trading Order System (RFQ)

1. Introduction: The High-Stakes World of RFQ

In the world of cryptocurrency exchanges, there are two primary models for trading: CLOB (Central Limit Order Book) and RFQ (Request for Quote). While CLOBs (like Nasdaq or Binance) match buyers and sellers continuously, RFQ systems are designed for institutional clients or “Convert” features where a user asks: “How much for 10 BTC right now?” and the system replies: “$50,000.00. Valid for 7 seconds.”

This creates a unique distributed systems challenge: Ephemeral State Management. You must guarantee a price for a short window, managing the risk that the market might crash during those seconds. If your system is too slow, you lose money. If your system is inconsistent, you lose user funds.

This guide assumes no prior knowledge of financial systems. We will build a production-grade, globally scalable RFQ system from the ground up, dissecting every component from the Load Balancer to the Database Kernel.

2. Requirements & Constraints

2.1 Functional Requirements

Authentication & Authorization: Users must log in to the system and access only the features and data they are authorized to see (role-based access control).
Price Discovery: Users can check real-time prices for all coins provided by the trading system.
Get Quote (RFQ): Users request a buy/sell price for a crypto pair (e.g., BTC-USD). The system must return a firm price valid for 7 seconds.
Execute Order: Users can accept the quote only if:
- The quote has not expired (within 7-second window)
- The user has sufficient balance in their wallet
- The quote has not been used before (replay protection)
Wallet Management: Users must be able to manage their balances (deposits, withdrawals, internal transfers).
Transaction History: Users can view their past trades and current portfolio value.
Settlement: Funds must be exchanged atomically. No partial states (e.g., money deducted but crypto not received).

2.2 Non-Functional Requirements

Reliability: 99.999% (Five Nines). In finance, downtime = reputation death.
Latency:
- Quote Generation: < 50ms (99th percentile). Stale quotes are risky.
- Order Execution: < 100ms (99th percentile). Fast feedback is crucial.
Consistency: Strong Consistency (ACID) for all financial transactions. Eventual consistency is not allowed for balances.
Throughput: Support 10,000 Quotes/sec and 500 Trades/sec initially, scalable to 10x via sharding.
Availability: 24/7. Crypto markets never sleep.
Security: Strict Authentication (mTLS/JWT), Rate Limiting, and Audit Logs.

2.3 Compliance & Regulations

[!IMPORTANT] Problem Statement: “System should be compliant with country regulations”

KYC (Know Your Customer): Users must be verified before trading large amounts.
- Tier 1 (Bronze): Email verification only → Max $1,000/trade
- Tier 2 (Silver): Photo ID + Proof of Address → Max $10,000/trade
- Tier 3 (Gold): Enhanced Due Diligence (source of funds) → Max $100,000/trade
AML (Anti-Money Laundering): Large or suspicious transactions must be flagged.
- Threshold: Transactions >$10,000 require enhanced monitoring
- SAR (Suspicious Activity Report): Automated flagging for patterns like structuring (many $9,999 trades to avoid $10k threshold)
Auditability: Every state change must be immutable and traceable for 7 years.
- All trades, wallet changes, and failed attempts logged to append-only tables
Geo-Fencing: Restrict trading based on jurisdiction.
- US: Privacy coins (Monero, Zcash) banned → Return 403 FORBIDDEN
- China: All crypto trading banned → Geo-block entire country
- EU (MiCA Regulation): Mandatory disclosure of energy consumption for PoW coins
Data Residency: Store user data in their home jurisdiction.
- GDPR (EU): User data must stay in EU data centers, with Right to Erasure (“Right to be Forgotten”)
- FinCEN (US): Transaction records must be available for 5 years for law enforcement requests
Transaction Limits: Per-user caps based on verification tier (enforced at API Gateway).

3. Capacity Planning & Estimation

Before writing code, we must understand the hardware requirements.

3.1 Traffic Analysis

DAU (Daily Active Users): 1,000,000.
Quote-to-Trade Ratio: 20:1. Users check prices frequently but trade less often.
Quotes: 20 requests/user/day → 20M quotes/day.
- Average QPS: 20,000,000 / 86,400 ≈ 230 QPS.
- Peak QPS: Financial markets are uniquely volatile. A single tweet (e.g., from Elon Musk) can trigger massive spikes. We must design for 10x to 50x bursts → 2,300 - 11,500 QPS.
- Why design for peak? If your system fails during a rally, users can’t trade, and you lose the most profitable moments.
Trades: 1M trades/day.
- Average TPS: 1,000,000 / 86,400 ≈ 12 TPS.
- Peak TPS: 120 TPS.

3.2 Bandwidth & Network

Quote Payload: ~1KB (JSON with prices, metadata, signature).
- Bandwidth: 2,300 QPS × 1 KB ≈ 2.3 MB/s. Trivial for modern networks (1Gbps links).
Order Payload: ~500 Bytes.
- Bandwidth: Negligible.

3.3 Storage (The Ledger)

Trades Table: The primary source of truth.
- Row Size: ~500 Bytes (IDs, timestamps, prices, fees).
- Daily Growth: 1,000,000 trades × 500 B = 500 MB/day.
- Yearly Growth: 500 MB × 365 ≈ 180 GB.
- 5-Year Retention: 180 GB × 5 ≈ 900 GB.
Conclusion: A single master database could hold this volume, but for IOPS (Input/Output Operations Per Second) and concurrency, we will need Sharding.

3.4 Memory (Redis Cache)

Active Quotes: Only valid for 7 seconds.
- At peak 2,300 QPS × 7s = ~16,100 active quotes in memory.
- Size: 16,100 × 1 KB ≈ 16 MB.
- Conclusion: Redis memory usage is tiny. We are CPU/Network bound, not Memory bound.

4. High-Level Architecture & The “Excalidraw” Diagram

We will use a Microservices Architecture to separate concerns: reliability (Orders) vs speed (Quotes).

The diagram below represents a “clean” architecture you can draw during an interview. It highlights the separation between the Synchronous Flow (User actions) and Asynchronous Flow (Settlement/Analytics).

System Architecture: RFQ Trading Platform

High-Frequency Quote & Order Execution Flow

Synchronous (block)

Asynchronous (non-block)

Client Zone

Ingress / DMZ

Private VPC (K8s)

Persistence Layer

Trader

Load Balancer

API Gateway (Kong)

Quote Svc Go

x20 Pods

• Stateless
• Median Quotes
• Signatures

Order Svc Java

x10 Pods

• Stateful (ACID)
• Validation
• Idempotency

Audit Consumer

Writes to Cold Storage / BigQuery

📡 External Liquidity Providers (Binance, CME)

Redis Cluster (Cache)

🔑 Quotes: TTL 7s

⚡️ Metadata: LPs, Pairs

            [Primary] [Replica] [Replica]

            [Primary] [Replica] [Replica]

Postgres (Shard 1: A-M)

Primary (RW)

Replica (RO)

Postgres (Shard 2: N-Z)

Primary (RW)

Replica (RO)

Kafka Cluster

Topic: orders_events

5. Detailed Component Design & Trade-offs

5.1 API Gateway (The Doorman)

The Gateway (e.g., Kong, Nginx, or AWS API Gateway) is the single entry point.

Authentication:
- Uses JWT (JSON Web Tokens) for lightweight, stateless auth.
- The Gateway verifies the signature (RSA-256) and claims (Expiration, Scopes) before passing the request downstream.
Rate Limiting:
- Algorithm: Token Bucket.
- Why Token Bucket over Leaky Bucket? Token Bucket allows for short bursts of traffic (e.g., a bot reacting to a sudden market move), which is desirable for trading. Leaky Bucket enforces a rigid rate, which might punish legitimate high-frequency users.
- Scope: Per User ID and Per IP.
- Config: 100 req/min for Quotes, 20 req/min for Orders.
Optimization:
- SSL Termination: Decrypt HTTPS at the edge to offload CPU from microservices.
- Keep-Alive: Maintain persistent connections to upstream services to avoid TCP Handshake overhead.

5.2 The Quote Service (The Sprinter)

This service must be incredibly fast. It is Stateless and Read-Heavy.

Role:
1. Fetch real-time prices from Liquidity Providers (LPs) via WebSocket.
2. Apply a “Spread” (e.g., +0.5% markup for profit).
3. Generate a cryptographically signed quote.
4. Cache the quote in Redis with TTL=7s.
Optimizations:
- Fan-out: Query 3 LPs in parallel, take the median price (to avoid outliers/manipulation).
- Zero-Allocation: In Go/Rust, reuse memory buffers to avoid Garbage Collection pauses during high load.

5.3 The Order Service (The Vault)

This service must be incredibly safe. It is Stateful and Transactional.

Role:
1. Atomically claim the quote (prevent replay attacks).
2. Check user balance (DB lock).
3. Execute the trade atomically (Update DB).
Idempotency:
- Crucial for network failures. If a client sends an order but doesn’t get a response (timeout), they will retry.
- Mechanism: The client sends a unique idempotency_key (UUID). The server checks a unique index in the DB. INSERT INTO orders (id, ...) VALUES ... ON CONFLICT DO NOTHING.

5.3.1 Quote Replay Protection (CRITICAL)

[!CAUTION] The #1 Interview Gotcha: Without atomic claim, your system is vulnerable to double-fill attacks.

The Attack Scenario:

User receives quote: quote_id=abc123, valid for 7 seconds
User opens two browser tabs
Tab A: Submit order with quote_id=abc123
Tab B: Submit order with quote_id=abc123 (1ms later)
Both requests race:
- Request A: Check quote:abc123 exists? → ✅ YES
- Request B: Check quote:abc123 exists? → ✅ YES (race!)
- Request A: Debit $50,000, credit 1 BTC
- Request B: Debit $50,000, credit 1 BTC
Result: User gets 2 BTC for the price of 1 → You lose $50,000

The Solution: Atomic Check-and-Claim

We use a Redis Lua script to atomically check if a quote exists AND mark it as claimed in a single operation:

-- quote_claim.lua
-- This script ensures a quote can only be claimed ONCE
local quote_key = KEYS[1]       -- "quote:abc123"
local claim_key = KEYS[2]       -- "quote:abc123:claimed"
local order_id = ARGV[1]        -- "order_xyz789"
local remaining_ttl = ARGV[2]   -- Remaining seconds for quote validity

-- Atomic check: Does quote exist AND is it not claimed?
if redis.call('EXISTS', quote_key) == 1 and redis.call('EXISTS', claim_key) == 0 then
    -- Claim it for this order_id
    redis.call('SET', claim_key, order_id, 'EX', remaining_ttl)
    -- Return the quote payload
    return redis.call('GET', quote_key)
else
    -- Quote expired or already claimed
    return nil
end

Order Service Execution Flow (Python):

import redis
import time
import hmac
import hashlib
import json

redis_client = redis.Redis()

# Load Lua script once on startup (cached on Redis server)
QUOTE_CLAIM_SCRIPT = redis_client.script_load(open('quote_claim.lua').read())

def execute_order(user_id, quote_payload_signed, idempotency_key):
    # Step 1: Verify HMAC signature (prevent tampering)
    quote_id = quote_payload_signed['id']
    received_hmac = quote_payload_signed['signature']
    payload = json.dumps(quote_payload_signed['data'], sort_keys=True)
    expected_hmac = hmac.new(SERVER_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    
    if not hmac.compare_digest(received_hmac, expected_hmac):
        return {"error": "INVALID_SIGNATURE"}
    
    # Step 2: Check quote freshness (server-side clock with grace period)
    quote_data = quote_payload_signed['data']
    quote_created_at = quote_data['created_at']  # Unix timestamp from server
    current_time = time.time()
    
    # 7-second validity + 200ms grace period for network latency
    if current_time - quote_created_at > 7.2:
        return {"error": "QUOTE_EXPIRED"}
    
    # Step 3: Atomic claim using Lua script
    remaining_ttl = int(7.2 - (current_time - quote_created_at))
    quote_key = f"quote:{quote_id}"
    claim_key = f"quote:{quote_id}:claimed"
    
    quote_payload = redis_client.evalsha(
        QUOTE_CLAIM_SCRIPT,
        2,  # Number of KEYS
        quote_key, claim_key,
        idempotency_key, remaining_ttl
    )
    
    if quote_payload is None:
        return {"error": "QUOTE_EXPIRED_OR_ALREADY_USED"}
    
    # Step 4: Database transaction with balance check
    with db.begin():
        # Lock user's wallet row (prevents concurrent trades from same user)
        wallet = db.execute(
            "SELECT balance FROM wallets WHERE user_id = %s AND currency = %s FOR UPDATE",
            (user_id, quote_data['base_currency'])
        ).fetchone()
        
        if wallet['balance'] < quote_data['total_cost']:
            return {"error": "INSUFFICIENT_BALANCE"}
        
        # Debit and credit atomically
        db.execute(
            "UPDATE wallets SET balance = balance - %s WHERE user_id = %s AND currency = %s",
            (quote_data['total_cost'], user_id, quote_data['base_currency'])
        )
        db.execute(
            "UPDATE wallets SET balance = balance + %s WHERE user_id = %s AND currency = %s",
            (quote_data['amount'], user_id, quote_data['quote_currency'])
        )
        
        # Insert order record (idempotency_key ensures uniqueness)
        db.execute(
            "INSERT INTO orders (order_id, user_id, quote_id, pair, amount, price, status) "
            "VALUES (%s, %s, %s, %s, %s, %s, 'FILLED') "
            "ON CONFLICT (idempotency_key) DO NOTHING",
            (idempotency_key, user_id, quote_id, ...)
        )
    
    return {"status": "SUCCESS", "order_id": idempotency_key}

Why This Works:

Lua Atomicity: Redis executes the entire script as a single atomic operation. No race conditions.
Claim Key: Once set, the claim key prevents the quote from being used again (even if the original quote key still exists).
TTL Inheritance: The claim key expires at the same time as the quote, preventing leaked state.
Grace Period: The 200ms buffer accounts for network delays, ensuring valid requests at t=6.9s are not rejected.

5.4 Redis Architecture

We use Redis not just for caching, but for temporary state (Quotes).

Cluster Mode:
- Data is sharded across multiple nodes (16384 slots).
- This allows horizontal scaling of memory and throughput.
Eviction Policy: volatile-ttl.
- Why? We only want to evict keys with an expiry set (quotes). If we used allkeys-lru, we might accidentally evict persistent configuration keys or session data. See Module 06: Caching.
Persistence:
- RDB (Snapshot): Every 15 minutes.
- AOF (Append Only File): Disabled or set to everysec.
- Why? If Redis crashes, losing the last 1 second of quotes is acceptable. Users will just request a new quote. Performance > Durability for quotes.

5.5 Database Architecture (The Ledger)

We choose PostgreSQL for its robust ACID compliance.

Why PostgreSQL over NewSQL (CockroachDB/TiDB)?

Maturity: PostgreSQL has decades of battle-testing in financial systems. NewSQL databases are powerful but introduce complexity in deployment and debugging.
Complexity: For our scale (1M users), sharded Postgres is sufficient and well-understood. Distributed SQL adds network overhead (Raft consensus) to every write, increasing latency, which we want to avoid for order execution.

Schema Design

[!CAUTION] Never use FLOAT or DOUBLE for money. Floating-point arithmetic causes rounding errors due to IEEE 754 representation. Example: 0.1 + 0.2 != 0.3 in binary floating point. Always use fixed-point types (NUMERIC/DECIMAL) or store integer minor units (BIGINT for cents/wei).

CREATE TABLE wallets (
    user_id UUID,
    currency VARCHAR(10),
    -- NUMERIC(36, 18) supports crypto tokens with 18 decimals (e.g., ETH)
    -- Many ERC-20 tokens use 18 decimal places (1 ETH = 10^18 wei)
    balance NUMERIC(36, 18) NOT NULL DEFAULT 0,
    updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    PRIMARY KEY (user_id, currency)
);

CREATE TABLE orders (
    order_id UUID PRIMARY KEY,
    idempotency_key UUID UNIQUE NOT NULL,  -- Prevent duplicate submissions
    user_id UUID NOT NULL,
    quote_id VARCHAR(64) NOT NULL,
    pair VARCHAR(20) NOT NULL,  -- 'BTC-USD', 'ETH-USDT'
    side VARCHAR(4) NOT NULL,   -- 'BUY', 'SELL'
    amount NUMERIC(36, 18) NOT NULL,       -- Quantity of crypto
    price NUMERIC(36, 18) NOT NULL,        -- Price per unit
    total_cost NUMERIC(36, 18) NOT NULL,   -- amount * price
    fee NUMERIC(36, 18) NOT NULL DEFAULT 0,
    status VARCHAR(20) NOT NULL,           -- 'FILLED', 'FAILED', 'CANCELLED'
    created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    settled_at TIMESTAMPTZ,
    
    INDEX idx_user_created (user_id, created_at DESC)  -- For transaction history
);

Sharding Strategy & Hot Partitions

As the user base grows, a single Postgres node will hit write limits (IOPS).

Sharding Key: user_id.
Why?: All data for a specific user (Wallets, Orders) lives on the same shard. This allows us to use Local ACID Transactions rather than complex, slow Two-Phase Commit (2PC) protocols across shards.
Consistent Hashing: Used to map user_id to Shard_N.
The “Hot Partition” Problem:
- Scenario: A “Whale” user (e.g., an institutional bot) executes 100 trades/sec.
- Impact: That user’s shard (physical node) becomes overloaded, slowing down other users on the same node.
- Solution: Virtual Buckets. Instead of mapping Users directly to Nodes, we map Users to 1000 Virtual Buckets, and Buckets to Nodes.
- Mechanism: Hash(User_ID) % 1000 → Bucket_ID. Bucket_ID → Physical_Node_IP.
- Visual Example: In the diagram, we simplified this to a Range-Based split (Users A-M on Shard 1, N-Z on Shard 2) for clarity, but in production, Consistent Hashing is preferred to avoid unbalanced shards (e.g., if everyone is named “Alice”).
- Benefit: If a bucket gets hot, we can migrate just that bucket to a dedicated high-performance node without resharding the entire database.

5.6 Asynchronous History & Auditing

While the order execution is synchronous (ACID), reporting is asynchronous.

Change Data Capture (CDC): We use the Outbox Pattern or simple application-level publishing.
Kafka Topic: After the DB commit succeeds, the Order Service publishes an event to the orders_events topic.
```
{ "event": "ORDER_FILLED", "order_id": "...", "price": 50250, "ts": 169876... }
```
Audit Service: Consumes these events and writes them to:
- Elasticsearch: For user-facing search (“Show my trades last month”).
- Cold Storage (S3/Glacier): For 7-year regulatory retention.
- BigQuery/Redshift: For Data Science (Analytics).

5.7 Authentication & Authorization

[!IMPORTANT] Problem Statement Requirement: “Users should be able to log in to the system and access only the features and data they are authorized to see.”

5.7.1 Authentication Flow

User Registration →

Email Verification → KYC (Know Your Customer) → Account Activated

# Simplified Auth Flow
def login(email, password, totp_code):
    # Step 1: Verify password (bcrypt with cost factor 12)
    user = db.query("SELECT * FROM users WHERE email = ?", email)
    if not bcrypt.verify(password, user.password_hash):
        return {"error": "INVALID_CREDENTIALS"}
    
    # Step 2: Verify 2FA (Time-based One-Time Password)
    secret = user.totp_secret
    if not pyotp.TOTP(secret).verify(totp_code):
        return {"error": "INVALID_2FA"}
    
    # Step 3: Generate JWT with user role/tier embedded
    jwt_payload = {
        "user_id": user.id,
        "role": user.role,          # 'retail', 'vip', 'institutional'
        "tier": user.kyc_tier,       # 'bronze', 'silver', 'gold'
        "exp": time.time() + 900     # 15 minutes
    }
    access_token = jwt.encode(jwt_payload, JWT_SECRET, algorithm="HS256")
    refresh_token = generate_refresh_token(user.id)  # 7-day expiry
    
    return {"access_token": access_token, "refresh_token": refresh_token}

5.7.2 Role-Based Access Control (RBAC)

The API Gateway enforces trading limits based on user tier:

Tier	Max Trade Size	Daily Limit	Rate Limit (Quotes/sec)
Bronze (Unverified)	$1,000	$5,000	5
Silver (KYC Verified)	$10,000	$50,000	10
Gold (Enhanced DD)	$100,000	$500,000	20
Institutional	Custom	Custom	Custom

API Gateway Middleware (Kong/Envoy):

-- kong_auth.lua
function check_trade_limit(jwt_payload, trade_amount)
    local tier = jwt_payload.tier
    local limits = {
        bronze = 1000,
        silver = 10000,
        gold = 100000
    }
    
    if trade_amount > limits[tier] then
        return 403, "TRADE_LIMIT_EXCEEDED"
    end
    
    return 200, "OK"
end

5.7.3 Security Best Practices

Password Storage: bcrypt with cost factor 12 (2^12 iterations)
2FA: TOTP (Google Authenticator) mandatory for withdrawals
Session Management:
- Access Token: 15 minutes (short-lived)
- Refresh Token: 7 days (stored in HttpOnly cookie)
- Refresh rotation: New refresh token issued on each use

Rate Limiting: Per-user sliding window in Redis

key = f"rate_limit:{user_id}:quote"
redis.incr(key)
redis.expire(key, 1)  # 1-second window
if redis.get(key) > 10:  # Max 10 quotes/sec
    return "RATE_LIMIT_EXCEEDED"

5.8 Multi-Region Deployment

[!IMPORTANT] Problem Statement Requirements:

“Available in multiple regions”

“Provide good user experience to users who do not reside in residential country”

“System should be compliant with country regulations”

5.8.1 The Multi-Region Challenge

Requirements:

Low Latency: Users in Asia shouldn’t hit US servers (adds 200ms+ RTT)
Data Residency: EU users’ data must stay in EU (GDPR compliance)
Disaster Recovery: If US-East fails, failover to US-West < 1 minute

Architecture: Active-Active for Quotes, Active-Passive for Execution

5.8.2 Regional Deployment Strategy

┌─────────────────────────────────────────────────────────────┐
│                       GeoDNS (Route 53)                     │
│  Routes user to nearest region based on IP geolocation     │
└──────────────┬──────────────────────┬──────────────────────┘
              │                      │                       │
       ┌──────▼──────┐       ┌───────▼──────┐      ┌────────▼────────┐
       │  US-EAST-1  │       │  EU-WEST-1   │      │  AP-SOUTH-1     │
       │             │       │              │      │                 │
       │ Quote Svc   │       │ Quote Svc    │      │ Quote Svc       │
       │ (Stateless) │       │ (Stateless)  │      │ (Stateless)     │
       │ Redis Cache │       │ Redis Cache  │      │ Redis Cache     │
       └─────────────┘       └──────────────┘      └─────────────────┘
              │                      │                       │
       ┌──────▼──────────────────────▼───────────────────────▼──────┐
       │           Order Service (Stateful)                          │
       │  Pinned by "Home Region" per user                           │
       │  EU users → EU-WEST-1 DB     US users → US-EAST-1 DB       │
       └─────────────────────────────────────────────────────────────┘

5.8.3 Quote Service: Active-Active (Stateless)

Deployment: Each region has its own Quote Service + Redis cluster
No Cross-Region Sync: Quotes are ephemeral (7s TTL), no need to replicate
User Routing: GeoDNS routes user to nearest region for quote requests
Failover: If a region fails, GeoDNS automatically routes to next-nearest region

Trade-off: A user in Europe might get slightly different quotes from eu-west-1 vs us-east-1 due to different LP feeds, but the 7-second window is too short for meaningful arbitrage.

5.8.4 Order Service: Active-Passive (Stateful)

Home Region Pinning:

Each user is assigned a “home region” based on signup location
All trades for that user ALWAYS route to their home region’s DB shard
Stored in user profile: {"user_id": "abc", "home_region": "us-east-1"}

Why Pinning?

ACID Guarantee: Executing a trade requires updating wallet balances atomically. If data is split across regions, we’d need distributed transactions (2PC) which are slow and complex.
Compliance: EU user data stays in EU, US user data stays in US (GDPR/data residency laws).

Cross-Region Execution Flow:

EU user traveling in Asia requests quote from ap-south-1 (fast, uses local cache)
User clicks “Buy” → API Gateway checks home_region → Redirects to eu-west-1
Order executes on EU database (single-region ACID)

Disaster Recovery:

Warm Standby: Each region has a hot replica in a secondary AZ
Streaming Replication: Postgres uses WAL (Write-Ahead Log) shipping (RPO < 1 minute)
Promotion: If primary fails, standby promoted to primary within 30 seconds
Trade-off: 99.999% (five nines) = ~5 min/year downtime. A 30s failover twice/year = 1 minute, leaving 4 minutes budget.

5.8.5 Data Residency & Geo-Fencing

Regulatory Compliance:

Region	Storage Location	Allowed Trading Pairs	Special Rules
EU	Frankfurt (eu-central-1)	All except privacy coins	GDPR: Right to erasure, Data portability
US	Virginia (us-east-1)	SEC-approved only	FinCEN: SAR reporting for >$10k
APAC	Singapore (ap-southeast-1)	All	MAS: Mandatory cooling-off period for new users

Geo-Fencing Implementation:

# Middleware at API Gateway
def check_jurisdiction(user_location, trading_pair):
    if user_location == "US" and trading_pair == "XMR-USD":  # Monero (privacy coin)
        return 403, "RESTRICTED_IN_YOUR_JURISDICTION"
    
    if user_location == "CN":  # China ban
        return 403, "SERVICE_NOT_AVAILABLE_IN_YOUR_COUNTRY"
    
    return 200, "OK"

6. Deep Dive: Reliability & Consistency

6.1 The “Double Spend” Attack

Scenario: Malicious User “Eve” has $100. She sends two requests simultaneously:

Buy $100 of BTC.
Buy $100 of ETH. If processed in parallel without locking, both might check balance > 100 (True) and succeed. Eve spends $200 but only had $100.

Solution: Pessimistic Locking We use SELECT ... FOR UPDATE to lock the wallet row.

BEGIN;
  -- This line blocks other transactions trying to read Alice's USD wallet
  SELECT balance FROM wallets WHERE user_id = 'Alice' AND currency = 'USD' FOR UPDATE;
  
  IF balance >= 100 THEN
    UPDATE wallets SET balance = balance - 100 ...;
    INSERT INTO orders ...;
    COMMIT;
  ELSE
    ROLLBACK;
  END IF;
END;

Trade-off: Locking reduces concurrency for a single user, but prevents fraud. Since we shard by user, this doesn’t block other users.
Why not Optimistic Locking? Optimistic locking (using version columns) works well for low-contention. In high-frequency trading, retrying failed transactions due to version conflicts adds latency and complexity to the client. Pessimistic locking guarantees the order executes (or fails) immediately.

6.2 The 7-Second Race Condition

Scenario:

T=0.0s: Quote Generated (Valid until T=7.0s).
T=6.9s: User clicks “Buy”.
T=7.1s: Request reaches Server.
Strict Logic: Reject (Expired).
User Experience: “I clicked in time! Your system sucks.”

Solution: The Grace Period We add a server-side buffer (e.g., 500ms). The validator checks: if (CurrentTime < QuoteExpiry + GracePeriod)

Risk: The market crashes in that 500ms.
Mitigation: The “Spread” (profit margin) we added to the quote covers these small slippage risks.

Interactive Simulator

Test your latency luck. Can you click before the quote expires?

RFQ Simulator

Can you beat the network lag?

BTC / USD

$54,320.50

7.00s

Network Lag: 200ms

        > System Ready.
    

6.3 System Walkthrough: The Life of a Trade

To solidify our understanding, let’s trace a single transaction through the entire stack, examining the exact API payloads, Redis keys, and Database queries.

Scenario A: The Happy Path (Quote → Buy → Success)

Step 1: User requests a quote for 1 BTC.

Request: POST /quotes

{
  "pair": "BTC-USD",
  "side": "BUY",
  "amount": 1.00000000
}

Quote Service Action:
1. Fetches price from LP: $50,000.
2. Adds spread (+0.5%): $50,250.
3. Generates quote_id: q_123.
4. Signs the payload: HMAC_SHA256(price + expiry, secret).

Redis State:

SET quote:q_123 "{\"price\": 50250, \"expiry\": 1698765432}" EX 7

Response: 200 OK

{
  "quote_id": "q_123",
  "price": 50250.00,
  "expiry": 1698765432,
  "signature": "a1b2c3d4..."
}

Step 2: User accepts the quote (within 7 seconds).

Request: POST /orders

{
  "quote_id": "q_123",
  "order_id": "o_999",
  "user_id": "u_alice"
}

(Note: order_id is a client-generated UUID for Idempotency)

Order Service Action:

Validate Quote: GET quote:q_123. (If exists, proceed).
Validate Signature: Recompute HMAC. (Prevents tampering with price).

DB Transaction:

BEGIN;
-- 1. Lock Wallet
SELECT balance FROM wallets 
WHERE user_id = 'u_alice' AND currency = 'USD' FOR UPDATE;
-- (Result: 100,000.00)
        
-- 2. Update Balance
UPDATE wallets SET balance = balance - 50250.00 
WHERE user_id = 'u_alice';
        
-- 3. Insert Order
INSERT INTO orders (order_id, user_id, quote_id, status) 
VALUES ('o_999', 'u_alice', 'q_123', 'FILLED');
        
COMMIT;

Response: 200 OK

{
  "status": "FILLED",
  "tx_id": "tx_555"
}

Scenario B: The “Double Spend” Attempt

Imagine “Eve” has $100. She sends two requests simultaneously:

Req A: Buy $100 BTC.
Req B: Buy $100 ETH.

Timeline:

T=0.00s: Req A reaches DB. Starts Transaction.

[Tx A] SELECT balance ... FOR UPDATE; -- Locks Row

T=0.01s: Req B reaches DB. Starts Transaction.

[Tx B] SELECT balance ... FOR UPDATE; -- BLOCKED! Waits for Tx A.

T=0.05s: Req A updates balance ($100 → $0) and COMMIT.
- Row Lock released.
T=0.06s: Req B unblocks and reads the new balance.
- balance is now $0.
- Logic: IF balance < 100 THEN ROLLBACK.
Result: Req A succeeds (200 OK). Req B fails (402 Payment Required).

Scenario C: The “Expired Quote” Race Condition

Step 1: User attempts to trade at T=7.1s (0.1s too late).

Request: POST /orders { quote_id: "q_123" }
Order Service Action:
1. Check Redis: GET quote:q_123
2. Result: (nil) (Key evicted by Redis).
3. Fallback Check: Even if the key existed (e.g., due to lag), check payload.expiry < Now().

Response: 400 Bad Request

{
  "error": "QUOTE_EXPIRED",
  "message": "Quote q_123 is no longer valid. Please request a new price."
}

7. Alternative Solutions (Trade-offs)

7.1 RFQ vs. CLOB (Central Limit Order Book)

CLOB (e.g., Nasdaq, Binance):
- Mechanism: Continuous matching of limit orders.
- Pros: Transparent pricing, high liquidity discovery.
- Cons: Extremely complex to engineer (matching engine), high computational cost.
RFQ (Our System):
- Mechanism: Guaranteed price on demand.
- Pros: Simpler architecture, better UX for large trades (no slippage).
- Cons: Platform takes market risk.

7.2 SQL vs. NoSQL

NoSQL (DynamoDB):
- Pros: Infinite scaling.
- Cons: Lack of multi-row ACID transactions. Implementing a ledger in NoSQL requires complex application-level locking (e.g., optimistic locking with version numbers), which is error-prone for financial data.
SQL (PostgreSQL):
- Pros: Native ACID, referential integrity.
- Cons: Harder to scale writes.
- Decision: SQL wins because financial correctness > raw write speed. Sharding solves the scale issue.

7.3 Event Sourcing

Concept: Store every transaction as an immutable event (Deposited, Bought, Sold). Calculate balance by replaying events.
Pros: Perfect audit trail, easy debugging.
Cons: Replaying millions of events to get a balance is slow. Requires “Snapshots”.
Our Choice: Hybrid. We use a standard SQL table for current balance (fast) but log every change to an audit_logs table (immutable).

8. Low-Level Optimizations (The “Boom” Factor)

To squeeze every millisecond out of the system:

Kernel Tuning:
- Increase TCP Buffer sizes (net.ipv4.tcp_rmem, net.ipv4.tcp_wmem) to handle high-throughput bursts.
- Enable TCP Fast Open (TFO) to reduce handshake latency by 1 RTT.
Connection Pooling:
- Database connections are expensive. Use PgBouncer to maintain a pool of warm connections, reducing overhead.
Garbage Collection (GC):
- For the Quote Service (Golang), tune GOGC to trade memory for CPU.
- For the Order Service (Java), use ZGC or Shenandoah for sub-millisecond pause times.
Network:
- Place Quote Services in the same Availability Zone (AZ) as the Liquidity Providers if possible (AWS us-east-1).
- Use Kernel Bypass (DPDK) is likely overkill for 230 QPS, but worth mentioning for HFT requiring microsecond latency.

9. Requirements Traceability Matrix

Requirement	Architectural Solution
Get Quote (7s)	`Quote Service` + `Redis` (TTL 7s) + `WebSocket` to LPs.
Place Order	`Order Service` with HMAC validation + Idempotency keys.
Balance Check	`PostgreSQL` with `SELECT FOR UPDATE` (Pessimistic Locking).
Reliability (99.999%)	Active-Passive DB Failover + Stateless Services + Kubernetes Auto-healing.
Latency (<50ms)	In-memory processing (Redis) + Connection Pooling + Geolocation.
Consistency	Database Sharding by `user_id` allows local ACID transactions.
Scalability	Horizontal scaling of services + DB Sharding + Redis Cluster.
Security	API Gateway (JWT, Rate Limit) + Private Subnets + mTLS.
Compliance	Async KYC pipeline + Audit Logs (Event Sourcing lite).

10. Observability & Tracing

You cannot fix what you cannot see. For a system moving millions of dollars, we need total visibility.

10.1 The RED Method (Metrics)

We instrument every service to emit these three golden signals:

Rate: Request counts per second.
- Metric: http_requests_total{service="quote_svc", status="200"}
- Use: Detect traffic spikes or DDoS.
Errors: Failed requests.
- Metric: order_failed_total{reason="insufficient_funds"}
- Use: Alert if order failures exceed 1% of total traffic.
Duration: Latency distributions.
- Metric: quote_generation_seconds_bucket (Histogram)
- Use: Alert if P99 latency > 100ms.

10.2 Distributed Tracing

A single order touches 4 systems: Gateway -> Order Service -> Redis -> DB. If an order is slow, Distributed Tracing tells us exactly where.

Trace ID: Generated at the Gateway (e.g., x-trace-id: 12345). Passed via HTTP headers to every downstream service.
Spans: Each service logs a “Span” with start/end timestamps.
Visualization (e.g., Jaeger/Zipkin):

[Gateway]   |-------------------------------------------|  205ms
            [Order SVC] |-----------------------------|    180ms
                        [Redis] |----|                     10ms
                        [DB Lock]      |-----------|       150ms (Bottleneck!)

In this example, the DB Lock took 150ms, indicating database contention.

10.3 Structured Logging

Forget plain text logs. Use JSON for machine-readability (ELK Stack).

{
  "level": "INFO",
  "timestamp": "2023-10-27T10:00:00Z",
  "service": "order-service",
  "trace_id": "a1b2c3d4",
  "user_id": "u_999",
  "event": "order_placed",
  "amount": 100.00,
  "currency": "USD"
}

Audit Logs: Separate, immutable logs for compliance. Every balance change must be recorded here and archived to WORM (Write Once Read Many) storage (e.g., S3 Object Lock).

10.4 Alerting Strategy

P1 (Critical - Wake up on-call):
- Order Success Rate < 99.5%.
- Database connection pool saturation > 90%.
- Redis Cluster state “FAIL”.
P2 (Warning - Ticket for tomorrow):
- Latency P99 > 150ms (SLA breach warning).
- Disk usage > 80%.

11. Deployment & Operations

11.1 Deployment Strategy: Blue/Green

For a financial system, we cannot risk a “bad deploy” corrupting the database.

Blue (Active): Serving 100% traffic.
Green (Staging): Deploy new version. Run integration tests.
Switch: Update the Load Balancer to route 1% traffic to Green (Canary).
Monitor: Watch for HTTP 500 or latency spikes.
Rollout: If safe, route 100% to Green.
Rollback: If 1% fails, instantly revert LB to Blue. Users see errors for only 5 seconds.

11.2 Database Schema Evolution

Problem: Adding a column locks the table.
Solution: Expand-Contract Pattern.
1. Expand: Add nullable column new_col (Zero downtime).
2. Code: Update app to write to both old_col and new_col.
3. Backfill: Run a background job to copy data old → new.
4. Contract: Update code to read only from new_col. Drop old_col.

12. Follow-Up Questions: The Interview Gauntlet

This section covers 50 rapid-fire questions to test the depth of your design.

I. Database & Data Consistency (The Core)

Why PostgreSQL over NewSQL? Sharded Postgres is more mature and sufficient for 1M users. Distributed SQL (CockroachDB) adds consensus latency to writes.
Handling Hot Shards: If a “Whale” hits 10k TPS, we use Virtual Buckets to migrate that user to a dedicated physical node.
Isolation Levels: We use READ COMMITTED for performance. SERIALIZABLE prevents race conditions but causes too many transaction aborts/retries in high-concurrency environments.
Lock Contention: If SELECT ... FOR UPDATE hangs, connection pools exhaust. We set NOWAIT or short timeouts (e.g., 2s) to fail fast.
Replication Lag: Users reading from replicas might see old balances. We implement “Sticky Sessions” or force reads from Primary for critical wallet views.
Schema Migrations: Use pg_repack or similar tools to add columns without locking tables.
Archival Strategy: Move data > 1 year old to S3 (Parquet format) and delete from Postgres to keep indices small.
Double Booking: Without row locking, two parallel transactions read the same balance, subtract funds, and overwrite each other.
Database Failover: Postgres Automatic Failover (PAF) takes ~30s. Writes fail during this window; users see errors.
Data Corruption: External reconciliation (Nightly Jobs) sums all wallet balances vs. total deposits to detect drift.

II. Scalability & Performance

Redis Eviction: volatile-ttl ensures we only drop expired quotes, never persistent configs.
Connection Pooling: Use PgBouncer sidecar. If scaling exceeds DB limits, we must shard further.
Load Balancing: Least Outstanding Requests handles varying service times better than Round Robin.
Global Latency: Users in Australia will have higher latency. We can’t cheat physics; we must extend quote expiry or use Edge locations.
CDN Caching: We generally cannot cache prices as they change every second. WebSocket is preferred.
Write-Heavy Spikes: During crashes, we implement Queue-based Load Leveling (Kafka) to smooth out DB writes.
Serialization: JSON is fine for this scale. Protobuf saves bandwidth but adds debugging complexity.
Autoscaling Triggers: Scale on CPU Usage (>70%) and Request Queue Depth.
Cache Penetration: Use Bloom Filters to block requests for non-existent symbols (“FAKE-COIN”).

III. Reliability & Fault Tolerance

Redis Persistence: If Redis dies, quotes are lost. This is acceptable; users just request a new quote.
Circuit Breaker: Threshold based on Error Rate (e.g., >50% failures in 10s).
Bulkhead Pattern: Isolate thread pools for “Notifications” vs “Orders” so one slow dependency doesn’t crash the app.
Retry Storms: Add Exponential Backoff and Jitter to client retries.
Idempotency Storage: If Redis evicts keys, we fallback to a check in the persistent DB (slower but safer).
Graceful Degradation: If History Service fails, the “Trade” button still works.
Clock Skew: We use NTP on servers. Tolerance is built into the 7s expiry window.
Zonal Failures: Deployment across 3 AZs ensures only ~33% capacity loss, which autoscaling covers.

IV. Architecture & Microservices

Saga Pattern: We don’t use Sagas for the core trade (too slow). We use local ACID via sharding.
Service Discovery: Kubernetes (CoreDNS) handles service IP resolution.
Gateway vs Mesh: Gateway handles Edge concerns (Auth, Rate Limit); Mesh handles inter-service concerns (mTLS, Retries).
Configuration: Use a dynamic config server (e.g., Consul/Etcd) with watchers to update expiry_seconds hot.
Data Ownership: Order Service cannot access Wallet Table directly. Must call Wallet Service API to decouple schemas.
Event Ordering: Kafka Partition Key = user_id ensures events for one user are sequential.

V. Security & Compliance

Insider Trading: Admin actions require Multi-Party Approval and are logged to immutable audit trails.
API Key Security: Scoped keys (Read-Only vs Trade). Automated rotation.
DDoS Protection: Rate limiting at the Edge (Cloudflare) + Gateway (Token Bucket).
Audit Immutability: Write logs to S3 with Object Lock (Governance Mode).
PII Data: “Crypto-shredding”: Delete the encryption key for a user’s data to effectively “erase” it without modifying immutable logs.
Internal Auth: mTLS (Mutual TLS) ensures only authorized services can talk to the Wallet Service.

VI. Operations & Observability

Metric Cardinality: Do not tag metrics with user_id. Use logs for high-cardinality debugging.
Distributed Tracing: Inject x-trace-id at the Gateway and propagate it everywhere.
Deployment: Canary Deployment. Roll out v2 to 1% of users, monitor error rates, then expand.
Chaos Engineering: Randomly kill pods (Chaos Monkey) during staging to test recovery.
Alert Fatigue: Group related alerts. Use “Symptoms” (User can’t trade) rather than “Causes” (CPU high) for paging.
Capacity Planning: Linear regression on past 6 months of data to forecast storage/compute needs.

VII. Business Logic & Edge Cases

Negative Balance: Should be impossible with ACID. If it happens, freeze account and trigger manual investigation.
Partial Fills: Requires DB schema change (filled_amount vs requested_amount).
Market Halted: A global “Kill Switch” in Redis that the Order Service checks before every trade.
Rounding Errors: Always use Integers (Micros/Satoshis) or BigDecimal. Never float or double.
Settlement Failure: Platform takes the risk. If LP fails, we still owe the user the crypto.

VIII. Advanced Architecture (The 99.999% Club)

Event Sourcing vs. CRUD: Event sourcing is better for auditability but harder to query. We use a hybrid approach (CRUD for current state, Events for history).
CQRS (Command Query Responsibility Segregation): Use separate models for Writes (Order Service) and Reads (History Service). This allows scaling reads independently via Read Replicas.
LMAX Disruptor: A high-performance inter-thread messaging library. Used in HFT to avoid lock contention. Overkill for 500 TPS but good for 500k TPS.
Kernel Bypass (DPDK/Solarflare): Bypassing the Linux kernel networking stack to write directly to the NIC. Reduces latency from 10us to 1us.
Clock Synchronization: NTP is not enough for sub-millisecond precision. Use PTP (Precision Time Protocol) with hardware timestamping.
Garbage Collection Tuning: For Java, use ZGC to keep pauses < 1ms. For Go, use GOGC=off and manual memory management if needed (extreme case).
False Sharing: CPU cache line contention. Pad data structures to 64 bytes to prevent cores from invalidating each other’s caches.

IX. Failure Modes & Disaster Recovery

Partial Partition: What if the Order Service can reach Redis but not the DB? Answer: Fail the request safely.
Zombie Processes: A service that thinks it’s the leader but isn’t. Use Fencing Tokens (epoch numbers) to reject writes from zombies.
Thundering Herd: If Redis clears, 10k users hit the DB. Use Request Coalescing (Singleflight) to merge identical requests.
Split Brain: If the cluster partitions, do we accept writes on both sides? Answer: No. Pause writes (CP system) to preserve consistency.
Corrupted WAL: If Postgres WAL is corrupted, replay from the last snapshot and accept data loss (RPO > 0).
Region Failure: Failover to DR region. RTO (Recovery Time Objective) ~15 mins. DNS switch.

X. Market Microstructure

Slippage: The difference between the quoted price and executed price. In RFQ, the platform absorbs slippage (the “Spread”).
Order Types:
- FOK (Fill or Kill): Execute fully or not at all.
- IOC (Immediate or Cancel): Execute what you can, cancel the rest.
- GTC (Good Till Cancelled): Standard limit orders (not used in RFQ).
Spread Capture: The primary revenue model. We buy at $50,000 and sell to users at $50,250.
Hedging: When a user buys 1 BTC, we immediately buy 1 BTC from an LP to neutralize our inventory risk.

13. Summary: The Whiteboard Strategy

If you are asked to design this in 45 minutes, draw this 4-Quadrant Layout:

1. Requirements & Core Math

Func: Quote (7s), Order, Wallet.
Non-Func: 99.999%, <50ms Latency, ACID.
Scale: 230 QPS (Quotes), 12 TPS (Orders).
Traffic: Read-heavy (20:1).

2. Architecture

[Client]
↓
[Gateway (Auth/Rate Limit)]
↙ ↘
[Quote SVC] -- [Redis Cluster]
[Order SVC] -- [Sharded DB]

* Separation of Concerns: Fast (Quotes) vs Safe (Orders).
* Sharding: By User ID for local ACID.

3. Data & API

            POST /quotes -> { price, expiry, sig }

            POST /orders -> { order_id, status }

            Wallets: (user_id, currency, balance)

            Orders: (order_id, status, quote_id)

4. Trade-offs & Deep Dives

Concurrency: Pessimistic Locking (`FOR UPDATE`) prevents double-spend.
Latency: Redis `volatile-ttl` + Connection Pooling + Geo-routing.
Reliability: Grace Period for network jitter.
Observability: Distributed Tracing + Audit Logs.

Return to Specialized Systems