Deep Dive: Trading Order System (RFQ)
1. Introduction: The High-Stakes World of RFQ
In the world of cryptocurrency exchanges, there are two primary models for trading: CLOB (Central Limit Order Book) and RFQ (Request for Quote). While CLOBs (like Nasdaq or Binance) match buyers and sellers continuously, RFQ systems are designed for institutional clients or “Convert” features where a user asks: “How much for 10 BTC right now?” and the system replies: “$50,000.00. Valid for 7 seconds.”
This creates a unique distributed systems challenge: Ephemeral State Management. You must guarantee a price for a short window, managing the risk that the market might crash during those seconds. If your system is too slow, you lose money. If your system is inconsistent, you lose user funds.
This guide assumes no prior knowledge of financial systems. We will build a production-grade, globally scalable RFQ system from the ground up, dissecting every component from the Load Balancer to the Database Kernel.
2. Requirements & Constraints
2.1 Functional Requirements
- Authentication & Authorization: Users must log in to the system and access only the features and data they are authorized to see (role-based access control).
- Price Discovery: Users can check real-time prices for all coins provided by the trading system.
- Get Quote (RFQ): Users request a buy/sell price for a crypto pair (e.g., BTC-USD). The system must return a firm price valid for 7 seconds.
- Execute Order: Users can accept the quote only if:
- The quote has not expired (within 7-second window)
- The user has sufficient balance in their wallet
- The quote has not been used before (replay protection)
- Wallet Management: Users must be able to manage their balances (deposits, withdrawals, internal transfers).
- Transaction History: Users can view their past trades and current portfolio value.
- Settlement: Funds must be exchanged atomically. No partial states (e.g., money deducted but crypto not received).
2.2 Non-Functional Requirements
- Reliability: 99.999% (Five Nines). In finance, downtime = reputation death.
- Latency:
- Quote Generation: < 50ms (99th percentile). Stale quotes are risky.
- Order Execution: < 100ms (99th percentile). Fast feedback is crucial.
- Consistency: Strong Consistency (ACID) for all financial transactions. Eventual consistency is not allowed for balances.
- Throughput: Support 10,000 Quotes/sec and 500 Trades/sec initially, scalable to 10x via sharding.
- Availability: 24/7. Crypto markets never sleep.
- Security: Strict Authentication (mTLS/JWT), Rate Limiting, and Audit Logs.
2.3 Compliance & Regulations
[!IMPORTANT] Problem Statement: “System should be compliant with country regulations”
- KYC (Know Your Customer): Users must be verified before trading large amounts.
- Tier 1 (Bronze): Email verification only → Max $1,000/trade
- Tier 2 (Silver): Photo ID + Proof of Address → Max $10,000/trade
- Tier 3 (Gold): Enhanced Due Diligence (source of funds) → Max $100,000/trade
- AML (Anti-Money Laundering): Large or suspicious transactions must be flagged.
- Threshold: Transactions >$10,000 require enhanced monitoring
- SAR (Suspicious Activity Report): Automated flagging for patterns like structuring (many $9,999 trades to avoid $10k threshold)
- Auditability: Every state change must be immutable and traceable for 7 years.
- All trades, wallet changes, and failed attempts logged to append-only tables
- Geo-Fencing: Restrict trading based on jurisdiction.
- US: Privacy coins (Monero, Zcash) banned → Return `403 FORBIDDEN`
- China: All crypto trading banned → Geo-block entire country
- EU (MiCA Regulation): Mandatory disclosure of energy consumption for PoW coins
- Data Residency: Store user data in their home jurisdiction.
- GDPR (EU): User data must stay in EU data centers, with Right to Erasure (“Right to be Forgotten”)
- FinCEN (US): Transaction records must be available for 5 years for law enforcement requests
- Transaction Limits: Per-user caps based on verification tier (enforced at API Gateway).
3. Capacity Planning & Estimation
Before writing code, we must understand the hardware requirements.
3.1 Traffic Analysis
- DAU (Daily Active Users): 1,000,000.
- Quote-to-Trade Ratio: 20:1. Users check prices frequently but trade less often.
- Quotes: 20 requests/user/day → 20M quotes/day.
- Average QPS: 20,000,000 / 86,400 ≈ 230 QPS.
- Peak QPS: Financial markets are uniquely volatile. A single tweet (e.g., from Elon Musk) can trigger massive spikes. We must design for 10x to 50x bursts → 2,300 - 11,500 QPS.
- Why design for peak? If your system fails during a rally, users can’t trade, and you lose the most profitable moments.
- Trades: 1M trades/day.
- Average TPS: 1,000,000 / 86,400 ≈ 12 TPS.
- Peak TPS: 120 TPS.
3.2 Bandwidth & Network
- Quote Payload: ~1KB (JSON with prices, metadata, signature).
- Bandwidth: 2,300 QPS × 1 KB ≈ 2.3 MB/s. Trivial for modern networks (1Gbps links).
- Order Payload: ~500 Bytes.
- Bandwidth: Negligible.
3.3 Storage (The Ledger)
- Trades Table: The primary source of truth.
- Row Size: ~500 Bytes (IDs, timestamps, prices, fees).
- Daily Growth: 1,000,000 trades × 500 B = 500 MB/day.
- Yearly Growth: 500 MB × 365 ≈ 180 GB.
- 5-Year Retention: 180 GB × 5 ≈ 900 GB.
- Conclusion: A single master database could hold this volume, but for IOPS (Input/Output Operations Per Second) and concurrency, we will need Sharding.
3.4 Memory (Redis Cache)
- Active Quotes: Only valid for 7 seconds.
- At peak 2,300 QPS × 7s = ~16,100 active quotes in memory.
- Size: 16,100 × 1 KB ≈ 16 MB.
- Conclusion: Redis memory usage is tiny. We are CPU/Network bound, not Memory bound.
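These back-of-the-envelope numbers are easy to sanity-check. A quick sketch using the assumptions above (decimal units, ~1 KB quote payloads, ~500 B trade rows):

```python
DAU = 1_000_000
QUOTES_PER_USER = 20
SECONDS_PER_DAY = 86_400

avg_qps = DAU * QUOTES_PER_USER / SECONDS_PER_DAY      # ≈ 231 QPS
peak_qps = avg_qps * 10                                # ≈ 2,300 QPS (10x burst)

active_quotes = peak_qps * 7                           # 7s TTL → ≈ 16,000 quotes in flight
redis_mb = active_quotes * 1_000 / 1e6                 # ~1 KB per quote → ≈ 16 MB

trades_per_day = 1_000_000
ledger_gb_per_year = trades_per_day * 500 * 365 / 1e9  # ~500 B per row → ≈ 180 GB/year

print(round(avg_qps), round(peak_qps), round(redis_mb), round(ledger_gb_per_year))
```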
4. High-Level Architecture & The “Excalidraw” Diagram
We will use a Microservices Architecture to separate concerns: reliability (Orders) vs speed (Quotes).
The diagram below represents a “clean” architecture you can draw during an interview. It highlights the separation between the Synchronous Flow (User actions) and Asynchronous Flow (Settlement/Analytics).
(Diagram: Quote path handles median quotes and signatures; Order path handles validation and idempotency; the database layer is one [Primary] with two [Replica] nodes.)
5. Detailed Component Design & Trade-offs
5.1 API Gateway (The Doorman)
The Gateway (e.g., Kong, Nginx, or AWS API Gateway) is the single entry point.
- Authentication:
- Uses JWT (JSON Web Tokens) for lightweight, stateless auth.
- The Gateway verifies the signature (RSA-256) and claims (Expiration, Scopes) before passing the request downstream.
- Rate Limiting:
- Algorithm: Token Bucket.
- Why Token Bucket over Leaky Bucket? Token Bucket allows for short bursts of traffic (e.g., a bot reacting to a sudden market move), which is desirable for trading. Leaky Bucket enforces a rigid rate, which might punish legitimate high-frequency users.
- Scope: Per User ID and Per IP.
- Config: `100 req/min` for Quotes, `20 req/min` for Orders (see the Token Bucket sketch after this list).
- Optimization:
- SSL Termination: Decrypt HTTPS at the edge to offload CPU from microservices.
- Keep-Alive: Maintain persistent connections to upstream services to avoid TCP Handshake overhead.
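A minimal in-memory Token Bucket sketch to illustrate why short bursts are allowed. Production gateways such as Kong back this with Redis-based counters; the rates and capacity below are illustrative, not prescribed values.

```python
import time

class TokenBucket:
    """Allows short bursts up to `capacity`, refilling at `rate` tokens per second."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at bucket capacity
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# 100 quote requests per minute, with bursts of up to 20 back-to-back requests
quote_limiter = TokenBucket(rate=100 / 60, capacity=20)
```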
5.2 The Quote Service (The Sprinter)
This service must be incredibly fast. It is Stateless and Read-Heavy.
- Role:
- Fetch real-time prices from Liquidity Providers (LPs) via WebSocket.
- Apply a “Spread” (e.g., +0.5% markup for profit).
- Generate a cryptographically signed quote.
- Cache the quote in Redis with `TTL=7s`.
- Optimizations:
- Fan-out: Query 3 LPs in parallel, take the median price (to avoid outliers/manipulation).
- Zero-Allocation: In Go/Rust, reuse memory buffers to avoid Garbage Collection pauses during high load.
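A rough sketch of the fan-out step, assuming async LP client objects that expose a `get_price` coroutine (an illustrative interface, not a specific library):

```python
import asyncio
import statistics

async def fetch_median_price(lps, pair: str, timeout: float = 0.03) -> float:
    """Query all LPs in parallel and take the median to dampen outliers/manipulation."""
    tasks = [asyncio.create_task(lp.get_price(pair)) for lp in lps]
    done, pending = await asyncio.wait(tasks, timeout=timeout)
    for task in pending:
        task.cancel()  # Don't let one slow LP blow the 50ms latency budget
    prices = [t.result() for t in done if not t.exception()]
    if not prices:
        raise RuntimeError("NO_LIQUIDITY")
    return statistics.median(prices)

# Usage: quote_price = await fetch_median_price([lp_a, lp_b, lp_c], "BTC-USD")
```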
5.3 The Order Service (The Vault)
This service must be incredibly safe. It is Stateful and Transactional.
- Role:
- Atomically claim the quote (prevent replay attacks).
- Check user balance (DB lock).
- Execute the trade atomically (Update DB).
- Idempotency:
- Crucial for network failures. If a client sends an order but doesn’t get a response (timeout), they will retry.
- Mechanism: The client sends a unique `idempotency_key` (UUID). The server enforces a unique index in the DB: `INSERT INTO orders (id, ...) VALUES ... ON CONFLICT DO NOTHING`.
5.3.1 Quote Replay Protection (CRITICAL)
[!CAUTION] The #1 Interview Gotcha: Without atomic claim, your system is vulnerable to double-fill attacks.
The Attack Scenario:
- User receives quote: `quote_id=abc123`, valid for 7 seconds
- User opens two browser tabs
- Tab A: Submit order with `quote_id=abc123`
- Tab B: Submit order with `quote_id=abc123` (1ms later)
- Both requests race:
  - Request A: Check `quote:abc123` exists? → ✅ YES
  - Request B: Check `quote:abc123` exists? → ✅ YES (race!)
  - Request A: Debit $50,000, credit 1 BTC
  - Request B: Debit $50,000, credit 1 BTC
- Result: User gets 2 BTC for the price of 1 → You lose $50,000
The Solution: Atomic Check-and-Claim
We use a Redis Lua script to atomically check if a quote exists AND mark it as claimed in a single operation:
-- quote_claim.lua
-- This script ensures a quote can only be claimed ONCE
local quote_key = KEYS[1] -- "quote:abc123"
local claim_key = KEYS[2] -- "quote:abc123:claimed"
local order_id = ARGV[1] -- "order_xyz789"
local remaining_ttl = ARGV[2] -- Remaining seconds for quote validity
-- Atomic check: Does quote exist AND is it not claimed?
if redis.call('EXISTS', quote_key) == 1 and redis.call('EXISTS', claim_key) == 0 then
    -- Claim it for this order_id
    redis.call('SET', claim_key, order_id, 'EX', remaining_ttl)
    -- Return the quote payload
    return redis.call('GET', quote_key)
else
    -- Quote expired or already claimed
    return nil
end
Order Service Execution Flow (Python):
import redis
import time
import hmac
import hashlib
import json
redis_client = redis.Redis()
# Load Lua script once on startup (cached on Redis server)
QUOTE_CLAIM_SCRIPT = redis_client.script_load(open('quote_claim.lua').read())
def execute_order(user_id, quote_payload_signed, idempotency_key):
    # Step 1: Verify HMAC signature (prevent tampering)
    quote_id = quote_payload_signed['id']
    received_hmac = quote_payload_signed['signature']
    payload = json.dumps(quote_payload_signed['data'], sort_keys=True)
    expected_hmac = hmac.new(SERVER_SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(received_hmac, expected_hmac):
        return {"error": "INVALID_SIGNATURE"}

    # Step 2: Check quote freshness (server-side clock with grace period)
    quote_data = quote_payload_signed['data']
    quote_created_at = quote_data['created_at']  # Unix timestamp from server
    current_time = time.time()
    # 7-second validity + 200ms grace period for network latency
    if current_time - quote_created_at > 7.2:
        return {"error": "QUOTE_EXPIRED"}

    # Step 3: Atomic claim using Lua script
    remaining_ttl = max(1, int(7.2 - (current_time - quote_created_at)))  # EX requires a positive integer
    quote_key = f"quote:{quote_id}"
    claim_key = f"quote:{quote_id}:claimed"
    quote_payload = redis_client.evalsha(
        QUOTE_CLAIM_SCRIPT,
        2,  # Number of KEYS
        quote_key, claim_key,
        idempotency_key, remaining_ttl
    )
    if quote_payload is None:
        return {"error": "QUOTE_EXPIRED_OR_ALREADY_USED"}

    # Step 4: Database transaction with balance check
    with db.begin():
        # Lock user's wallet row (prevents concurrent trades from same user)
        wallet = db.execute(
            "SELECT balance FROM wallets WHERE user_id = %s AND currency = %s FOR UPDATE",
            (user_id, quote_data['base_currency'])
        ).fetchone()
        if wallet['balance'] < quote_data['total_cost']:
            return {"error": "INSUFFICIENT_BALANCE"}

        # Debit and credit atomically
        db.execute(
            "UPDATE wallets SET balance = balance - %s WHERE user_id = %s AND currency = %s",
            (quote_data['total_cost'], user_id, quote_data['base_currency'])
        )
        db.execute(
            "UPDATE wallets SET balance = balance + %s WHERE user_id = %s AND currency = %s",
            (quote_data['amount'], user_id, quote_data['quote_currency'])
        )

        # Insert order record (unique idempotency_key makes duplicate submissions no-ops)
        db.execute(
            "INSERT INTO orders (order_id, idempotency_key, user_id, quote_id, pair, amount, price, status) "
            "VALUES (%s, %s, %s, %s, %s, %s, %s, 'FILLED') "
            "ON CONFLICT (idempotency_key) DO NOTHING",
            (idempotency_key, idempotency_key, user_id, quote_id, ...)
        )

    return {"status": "SUCCESS", "order_id": idempotency_key}
Why This Works:
- Lua Atomicity: Redis executes the entire script as a single atomic operation. No race conditions.
- Claim Key: Once set, the claim key prevents the quote from being used again (even if the original quote key still exists).
- TTL Inheritance: The claim key expires at the same time as the quote, preventing leaked state.
- Grace Period: The 200ms buffer accounts for network delays, ensuring valid requests at `t=6.9s` are not rejected.
5.4 Redis Architecture
We use Redis not just for caching, but for temporary state (Quotes).
- Cluster Mode:
- Data is sharded across multiple nodes (16384 slots).
- This allows horizontal scaling of memory and throughput.
- Eviction Policy: `volatile-ttl`.
  - Why? We only want to evict keys with an expiry set (quotes). If we used `allkeys-lru`, we might accidentally evict persistent configuration keys or session data. See Module 06: Caching.
- Persistence:
- RDB (Snapshot): Every 15 minutes.
- AOF (Append Only File): Disabled or set to `everysec`.
- Why? If Redis crashes, losing the last 1 second of quotes is acceptable. Users will just request a new quote. Performance > Durability for quotes.
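A minimal sketch of applying these settings with redis-py (`maxmemory` value and node addressing are illustrative; in Cluster Mode each node is configured individually):

```python
import redis

r = redis.Redis(host="localhost", port=6379)

# Evict only keys that carry a TTL (quotes), never persistent config/session keys
r.config_set("maxmemory-policy", "volatile-ttl")
r.config_set("maxmemory", "2gb")

# RDB snapshot roughly every 15 minutes (900s) if at least one key changed; AOF fsync once per second
r.config_set("save", "900 1")
r.config_set("appendfsync", "everysec")
```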
5.5 Database Architecture (The Ledger)
We choose PostgreSQL for its robust ACID compliance.
Why PostgreSQL over NewSQL (CockroachDB/TiDB)?
- Maturity: PostgreSQL has decades of battle-testing in financial systems. NewSQL databases are powerful but introduce complexity in deployment and debugging.
- Complexity: For our scale (1M users), sharded Postgres is sufficient and well-understood. Distributed SQL adds network overhead (Raft consensus) to every write, increasing latency, which we want to avoid for order execution.
Schema Design
[!CAUTION] Never use `FLOAT` or `DOUBLE` for money. Floating-point arithmetic causes rounding errors due to IEEE 754 representation. Example: `0.1 + 0.2 != 0.3` in binary floating point. Always use fixed-point types (`NUMERIC`/`DECIMAL`) or store integer minor units (`BIGINT` for cents/wei).
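A quick demonstration of the rounding problem, and the fixed-point fix, using Python's `decimal` module:

```python
from decimal import Decimal

print(0.1 + 0.2 == 0.3)                                    # False: binary floats cannot represent 0.1 exactly
print(0.1 + 0.2)                                           # 0.30000000000000004
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))   # True: exact decimal (fixed-point) arithmetic
```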
CREATE TABLE wallets (
user_id UUID,
currency VARCHAR(10),
-- NUMERIC(36, 18) supports crypto tokens with 18 decimals (e.g., ETH)
-- Many ERC-20 tokens use 18 decimal places (1 ETH = 10^18 wei)
balance NUMERIC(36, 18) NOT NULL DEFAULT 0,
updated_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
PRIMARY KEY (user_id, currency)
);
CREATE TABLE orders (
order_id UUID PRIMARY KEY,
idempotency_key UUID UNIQUE NOT NULL, -- Prevent duplicate submissions
user_id UUID NOT NULL,
quote_id VARCHAR(64) NOT NULL,
pair VARCHAR(20) NOT NULL, -- 'BTC-USD', 'ETH-USDT'
side VARCHAR(4) NOT NULL, -- 'BUY', 'SELL'
amount NUMERIC(36, 18) NOT NULL, -- Quantity of crypto
price NUMERIC(36, 18) NOT NULL, -- Price per unit
total_cost NUMERIC(36, 18) NOT NULL, -- amount * price
fee NUMERIC(36, 18) NOT NULL DEFAULT 0,
status VARCHAR(20) NOT NULL, -- 'FILLED', 'FAILED', 'CANCELLED'
created_at TIMESTAMPTZ NOT NULL DEFAULT NOW(),
settled_at TIMESTAMPTZ
);
-- PostgreSQL does not support inline INDEX definitions inside CREATE TABLE; create it separately
CREATE INDEX idx_user_created ON orders (user_id, created_at DESC); -- For transaction history
Sharding Strategy & Hot Partitions
As the user base grows, a single Postgres node will hit write limits (IOPS).
- Sharding Key: `user_id`.
- Why? All data for a specific user (Wallets, Orders) lives on the same shard. This allows us to use Local ACID Transactions rather than complex, slow Two-Phase Commit (2PC) protocols across shards.
- Consistent Hashing: Used to map `user_id` to `Shard_N`.
- The “Hot Partition” Problem:
- Scenario: A “Whale” user (e.g., an institutional bot) executes 100 trades/sec.
- Impact: That user’s shard (physical node) becomes overloaded, slowing down other users on the same node.
- Solution: Virtual Buckets. Instead of mapping Users directly to Nodes, we map Users to 1000 Virtual Buckets, and Buckets to Nodes.
- Mechanism: `Hash(User_ID) % 1000` → `Bucket_ID`; `Bucket_ID` → `Physical_Node_IP`.
- Visual Example: In the diagram, we simplified this to a Range-Based split (Users A-M on Shard 1, N-Z on Shard 2) for clarity, but in production, Consistent Hashing is preferred to avoid unbalanced shards (e.g., if everyone is named “Alice”).
- Benefit: If a bucket gets hot, we can migrate just that bucket to a dedicated high-performance node without resharding the entire database.
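A minimal sketch of the bucket-to-node lookup. The bucket map would normally live in a config store (Consul/etcd) and be updated when a hot bucket migrates; the node names here are illustrative.

```python
import hashlib

NUM_BUCKETS = 1000

# bucket_id -> physical node; loaded from a config store in production
BUCKET_MAP = {b: f"pg-shard-{b % 4}.internal" for b in range(NUM_BUCKETS)}

def bucket_for(user_id: str) -> int:
    # Stable hash (don't use Python's built-in hash(): it is salted per process)
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    return int(digest, 16) % NUM_BUCKETS

def node_for(user_id: str) -> str:
    return BUCKET_MAP[bucket_for(user_id)]

# Migrating a hot bucket only requires repointing that one entry:
# BUCKET_MAP[bucket_for("whale_user")] = "pg-shard-dedicated.internal"
```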
5.6 Asynchronous History & Auditing
While the order execution is synchronous (ACID), reporting is asynchronous.
- Change Data Capture (CDC): We use the Outbox Pattern or simple application-level publishing.
- Kafka Topic: After the DB commit succeeds, the Order Service publishes an event to the `orders_events` topic:
  `{ "event": "ORDER_FILLED", "order_id": "...", "price": 50250, "ts": 169876... }`
- Audit Service: Consumes these events and writes them to:
- Elasticsearch: For user-facing search (“Show my trades last month”).
- Cold Storage (S3/Glacier): For 7-year regulatory retention.
- BigQuery/Redshift: For Data Science (Analytics).
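A minimal sketch of the Outbox Pattern under these assumptions: a hypothetical `outbox_events` table written in the same transaction as the order, and a background relay (or CDC connector) publishing to Kafka via kafka-python. Table and column names are illustrative.

```python
import json
import uuid
from kafka import KafkaProducer  # kafka-python; illustrative client choice

producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda v: json.dumps(v).encode(),
)

def record_outbox_event(db, order):
    # Written inside the SAME transaction as the order row, so the event
    # cannot be lost if the commit fails (Outbox Pattern).
    db.execute(
        "INSERT INTO outbox_events (event_id, topic, payload) VALUES (%s, %s, %s)",
        (str(uuid.uuid4()), "orders_events", json.dumps({
            "event": "ORDER_FILLED",
            "order_id": order["order_id"],
            "price": str(order["price"]),
            "ts": order["created_at"],
        })),
    )

def relay_outbox(db):
    # A background relay drains the outbox and publishes to Kafka.
    rows = db.execute("SELECT event_id, topic, payload FROM outbox_events ORDER BY event_id LIMIT 100")
    for event_id, topic, payload in rows:
        producer.send(topic, json.loads(payload), key=event_id.encode())
        db.execute("DELETE FROM outbox_events WHERE event_id = %s", (event_id,))
    producer.flush()
```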
5.7 Authentication & Authorization
[!IMPORTANT] Problem Statement Requirement: “Users should be able to log in to the system and access only the features and data they are authorized to see.”
5.7.1 Authentication Flow
User Registration → Email Verification → KYC (Know Your Customer) → Account Activated
# Simplified Auth Flow
import time
import jwt     # PyJWT
import pyotp   # TOTP (RFC 6238) one-time passwords
from passlib.hash import bcrypt  # bcrypt password hashing

def login(email, password, totp_code):
    # Step 1: Verify password (bcrypt with cost factor 12)
    user = db.query("SELECT * FROM users WHERE email = ?", email)
    if not bcrypt.verify(password, user.password_hash):
        return {"error": "INVALID_CREDENTIALS"}

    # Step 2: Verify 2FA (Time-based One-Time Password)
    if not pyotp.TOTP(user.totp_secret).verify(totp_code):
        return {"error": "INVALID_2FA"}

    # Step 3: Generate JWT with user role/tier embedded
    jwt_payload = {
        "user_id": user.id,
        "role": user.role,        # 'retail', 'vip', 'institutional'
        "tier": user.kyc_tier,    # 'bronze', 'silver', 'gold'
        "exp": time.time() + 900  # 15 minutes
    }
    access_token = jwt.encode(jwt_payload, JWT_SECRET, algorithm="HS256")
    refresh_token = generate_refresh_token(user.id)  # 7-day expiry
    return {"access_token": access_token, "refresh_token": refresh_token}
5.7.2 Role-Based Access Control (RBAC)
The API Gateway enforces trading limits based on user tier:
| Tier | Max Trade Size | Daily Limit | Rate Limit (Quotes/sec) |
|---|---|---|---|
| Bronze (Unverified) | $1,000 | $5,000 | 5 |
| Silver (KYC Verified) | $10,000 | $50,000 | 10 |
| Gold (Enhanced DD) | $100,000 | $500,000 | 20 |
| Institutional | Custom | Custom | Custom |
API Gateway Middleware (Kong/Envoy):
-- kong_auth.lua
function check_trade_limit(jwt_payload, trade_amount)
    local tier = jwt_payload.tier
    local limits = {
        bronze = 1000,
        silver = 10000,
        gold = 100000
    }
    if trade_amount > limits[tier] then
        return 403, "TRADE_LIMIT_EXCEEDED"
    end
    return 200, "OK"
end
5.7.3 Security Best Practices
- Password Storage: `bcrypt` with cost factor 12 (2^12 iterations)
- 2FA: TOTP (Google Authenticator) mandatory for withdrawals
- Session Management:
- Access Token: 15 minutes (short-lived)
- Refresh Token: 7 days (stored in HttpOnly cookie)
- Refresh rotation: New refresh token issued on each use
- Rate Limiting: Per-user sliding window in Redis (shown below as a simple fixed 1-second counter for brevity)

key = f"rate_limit:{user_id}:quote"
count = redis.incr(key)   # Atomic increment; first caller creates the key
redis.expire(key, 1)      # 1-second window
if count > 10:            # Max 10 quotes/sec
    return "RATE_LIMIT_EXCEEDED"
5.8 Multi-Region Deployment
[!IMPORTANT] Problem Statement Requirements:
- “Available in multiple regions”
- “Provide good user experience to users who do not reside in residential country”
- “System should be compliant with country regulations”
5.8.1 The Multi-Region Challenge
Requirements:
- Low Latency: Users in Asia shouldn’t hit US servers (adds 200ms+ RTT)
- Data Residency: EU users’ data must stay in EU (GDPR compliance)
- Disaster Recovery: If US-East fails, failover to US-West < 1 minute
Architecture: Active-Active for Quotes, Active-Passive for Execution
5.8.2 Regional Deployment Strategy
┌─────────────────────────────────────────────────────────────┐
│ GeoDNS (Route 53) │
│ Routes user to nearest region based on IP geolocation │
└──────────────┬──────────────────────┬──────────────────────┘
│ │ │
┌──────▼──────┐ ┌───────▼──────┐ ┌────────▼────────┐
│ US-EAST-1 │ │ EU-WEST-1 │ │ AP-SOUTH-1 │
│ │ │ │ │ │
│ Quote Svc │ │ Quote Svc │ │ Quote Svc │
│ (Stateless) │ │ (Stateless) │ │ (Stateless) │
│ Redis Cache │ │ Redis Cache │ │ Redis Cache │
└─────────────┘ └──────────────┘ └─────────────────┘
│ │ │
┌──────▼──────────────────────▼───────────────────────▼──────┐
│ Order Service (Stateful) │
│ Pinned by "Home Region" per user │
│ EU users → EU-WEST-1 DB US users → US-EAST-1 DB │
└─────────────────────────────────────────────────────────────┘
5.8.3 Quote Service: Active-Active (Stateless)
- Deployment: Each region has its own Quote Service + Redis cluster
- No Cross-Region Sync: Quotes are ephemeral (7s TTL), no need to replicate
- User Routing: GeoDNS routes user to nearest region for quote requests
- Failover: If a region fails, GeoDNS automatically routes to next-nearest region
Trade-off: A user in Europe might get slightly different quotes from eu-west-1 vs us-east-1 due to different LP feeds, but the 7-second window is too short for meaningful arbitrage.
5.8.4 Order Service: Active-Passive (Stateful)
Home Region Pinning:
- Each user is assigned a “home region” based on signup location
- All trades for that user ALWAYS route to their home region’s DB shard
- Stored in user profile: `{"user_id": "abc", "home_region": "us-east-1"}`
Why Pinning?
- ACID Guarantee: Executing a trade requires updating wallet balances atomically. If data is split across regions, we’d need distributed transactions (2PC) which are slow and complex.
- Compliance: EU user data stays in EU, US user data stays in US (GDPR/data residency laws).
Cross-Region Execution Flow:
- EU user traveling in Asia requests a quote from `ap-south-1` (fast, uses local cache)
- User clicks “Buy” → API Gateway checks `home_region` → redirects to `eu-west-1`
- Order executes on the EU database (single-region ACID)
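A minimal sketch of the gateway-side routing decision, assuming the home region is stored alongside the user profile; endpoint names are illustrative.

```python
REGION_ENDPOINTS = {
    "us-east-1": "https://orders.us-east-1.internal",
    "eu-west-1": "https://orders.eu-west-1.internal",
    "ap-south-1": "https://orders.ap-south-1.internal",
}

def route_order_request(user_profile: dict, local_region: str) -> str:
    """Quotes are served locally; orders are always pinned to the user's home region."""
    home_region = user_profile["home_region"]   # e.g. {"user_id": "abc", "home_region": "eu-west-1"}
    if home_region == local_region:
        return REGION_ENDPOINTS[local_region]   # Fast path: user is trading from home
    # Cross-region hop: proxy (or HTTP 307 redirect) to the home region's Order Service
    return REGION_ENDPOINTS[home_region]
```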
Disaster Recovery:
- Warm Standby: Each region has a hot replica in a secondary AZ
- Streaming Replication: Postgres uses WAL (Write-Ahead Log) shipping (RPO < 1 minute)
- Promotion: If primary fails, standby promoted to primary within 30 seconds
- Trade-off: 99.999% (five nines) = ~5 min/year downtime. A 30s failover twice/year = 1 minute, leaving 4 minutes budget.
5.8.5 Data Residency & Geo-Fencing
Regulatory Compliance:
| Region | Storage Location | Allowed Trading Pairs | Special Rules |
|---|---|---|---|
| EU | Frankfurt (eu-central-1) | All except privacy coins | GDPR: Right to erasure, Data portability |
| US | Virginia (us-east-1) | SEC-approved only | FinCEN: SAR reporting for >$10k |
| APAC | Singapore (ap-southeast-1) | All | MAS: Mandatory cooling-off period for new users |
Geo-Fencing Implementation:
# Middleware at API Gateway
def check_jurisdiction(user_location, trading_pair):
    if user_location == "US" and trading_pair == "XMR-USD":  # Monero (privacy coin)
        return 403, "RESTRICTED_IN_YOUR_JURISDICTION"
    if user_location == "CN":  # China ban
        return 403, "SERVICE_NOT_AVAILABLE_IN_YOUR_COUNTRY"
    return 200, "OK"
6. Deep Dive: Reliability & Consistency
6.1 The “Double Spend” Attack
Scenario: Malicious User “Eve” has $100. She sends two requests simultaneously:
- Buy $100 of BTC.
- Buy $100 of ETH.
If processed in parallel without locking, both might check `balance >= 100` (True) and succeed. Eve spends $200 but only had $100.
Solution: Pessimistic Locking
We use SELECT ... FOR UPDATE to lock the wallet row.
BEGIN;
-- This line blocks other transactions trying to read Alice's USD wallet
SELECT balance FROM wallets WHERE user_id = 'Alice' AND currency = 'USD' FOR UPDATE;
IF balance >= 100 THEN
UPDATE wallets SET balance = balance - 100 ...;
INSERT INTO orders ...;
COMMIT;
ELSE
ROLLBACK;
END IF;
END;
- Trade-off: Locking reduces concurrency for a single user, but prevents fraud. Since we shard by user, this doesn’t block other users.
- Why not Optimistic Locking? Optimistic locking (using `version` columns) works well for low contention. In high-frequency trading, retrying failed transactions due to version conflicts adds latency and complexity to the client. Pessimistic locking guarantees the order executes (or fails) immediately. A sketch of the optimistic alternative follows for comparison.
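A minimal sketch of the optimistic-locking variant (compare-and-swap on a hypothetical `version` column on `wallets`), showing the retry loop that pessimistic locking avoids:

```python
def debit_wallet_optimistic(db, user_id, currency, amount, max_retries=3):
    """Compare-and-swap on a version column; retries on conflict instead of blocking."""
    for _ in range(max_retries):
        row = db.execute(
            "SELECT balance, version FROM wallets WHERE user_id = %s AND currency = %s",
            (user_id, currency),
        ).fetchone()
        if row["balance"] < amount:
            return {"error": "INSUFFICIENT_BALANCE"}

        updated = db.execute(
            "UPDATE wallets SET balance = balance - %s, version = version + 1 "
            "WHERE user_id = %s AND currency = %s AND version = %s",
            (amount, user_id, currency, row["version"]),
        ).rowcount
        if updated == 1:
            return {"status": "OK"}
        # Someone else modified the row first: retry (this is the latency cost FOR UPDATE avoids)
    return {"error": "CONFLICT_RETRIES_EXHAUSTED"}
```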
6.2 The 7-Second Race Condition
Scenario:
- T=0.0s: Quote Generated (Valid until T=7.0s).
- T=6.9s: User clicks “Buy”.
- T=7.1s: Request reaches Server.
- Strict Logic: Reject (Expired).
- User Experience: “I clicked in time! Your system sucks.”
Solution: The Grace Period
We add a server-side buffer (e.g., 500ms). The validator checks:
if (CurrentTime < QuoteExpiry + GracePeriod)
- Risk: The market crashes in that 500ms.
- Mitigation: The “Spread” (profit margin) we added to the quote covers these small slippage risks.
6.3 System Walkthrough: The Life of a Trade
To solidify our understanding, let’s trace a single transaction through the entire stack, examining the exact API payloads, Redis keys, and Database queries.
Scenario A: The Happy Path (Quote → Buy → Success)
Step 1: User requests a quote for 1 BTC.
- Request: `POST /quotes`
  `{ "pair": "BTC-USD", "side": "BUY", "amount": 1.00000000 }`
- Quote Service Action:
  - Fetches price from LP: `$50,000`.
  - Adds spread (+0.5%): `$50,250`.
  - Generates `quote_id`: `q_123`.
  - Signs the payload: `HMAC_SHA256(price + expiry, secret)`.
- Redis State:
  `SET quote:q_123 "{\"price\": 50250, \"expiry\": 1698765432}" EX 7`
- Response: `200 OK`
  `{ "quote_id": "q_123", "price": 50250.00, "expiry": 1698765432, "signature": "a1b2c3d4..." }`
Step 2: User accepts the quote (within 7 seconds).
- Request: `POST /orders`
  `{ "quote_id": "q_123", "order_id": "o_999", "user_id": "u_alice" }`
  (Note: `order_id` is a client-generated UUID for Idempotency)
- Order Service Action:
  - Validate Quote: `GET quote:q_123`. (If it exists, proceed.)
  - Validate Signature: Recompute HMAC. (Prevents tampering with the price.)
  - DB Transaction:

    BEGIN;
    -- 1. Lock Wallet
    SELECT balance FROM wallets WHERE user_id = 'u_alice' AND currency = 'USD' FOR UPDATE;
    -- (Result: 100,000.00)
    -- 2. Update Balance
    UPDATE wallets SET balance = balance - 50250.00 WHERE user_id = 'u_alice' AND currency = 'USD';
    -- 3. Insert Order
    INSERT INTO orders (order_id, user_id, quote_id, status) VALUES ('o_999', 'u_alice', 'q_123', 'FILLED');
    COMMIT;

- Response: `200 OK`
  `{ "status": "FILLED", "tx_id": "tx_555" }`
Scenario B: The “Double Spend” Attempt
Imagine “Eve” has $100. She sends two requests simultaneously:
- Req A: Buy $100 BTC.
- Req B: Buy $100 ETH.
Timeline:
- T=0.00s: Req A reaches DB. Starts Transaction.
  `[Tx A] SELECT balance ... FOR UPDATE; -- Locks Row`
- T=0.01s: Req B reaches DB. Starts Transaction.
  `[Tx B] SELECT balance ... FOR UPDATE; -- BLOCKED! Waits for Tx A.`
- T=0.05s: Req A updates balance ($100 → $0) and `COMMIT`s.
  - Row Lock released.
- T=0.06s: Req B unblocks and reads the new balance.
  - `balance` is now $0.
  - Logic: `IF balance < 100 THEN ROLLBACK`.
- Result: Req A succeeds (200 OK). Req B fails (402 Payment Required).
Scenario C: The “Expired Quote” Race Condition
Step 1: User attempts to trade at T=7.1s (0.1s too late).
- Request: `POST /orders { quote_id: "q_123" }`
- Order Service Action:
  - Check Redis: `GET quote:q_123`
  - Result: `(nil)` (Key evicted by Redis).
  - Fallback Check: Even if the key existed (e.g., due to lag), check `payload.expiry < Now()`.
- Response: `400 Bad Request`
  `{ "error": "QUOTE_EXPIRED", "message": "Quote q_123 is no longer valid. Please request a new price." }`
7. Alternative Solutions (Trade-offs)
7.1 RFQ vs. CLOB (Central Limit Order Book)
- CLOB (e.g., Nasdaq, Binance):
- Mechanism: Continuous matching of limit orders.
- Pros: Transparent pricing, high liquidity discovery.
- Cons: Extremely complex to engineer (matching engine), high computational cost.
- RFQ (Our System):
- Mechanism: Guaranteed price on demand.
- Pros: Simpler architecture, better UX for large trades (no slippage).
- Cons: Platform takes market risk.
7.2 SQL vs. NoSQL
- NoSQL (DynamoDB):
- Pros: Infinite scaling.
- Cons: Lack of multi-row ACID transactions. Implementing a ledger in NoSQL requires complex application-level locking (e.g., optimistic locking with version numbers), which is error-prone for financial data.
- SQL (PostgreSQL):
- Pros: Native ACID, referential integrity.
- Cons: Harder to scale writes.
- Decision: SQL wins because financial correctness > raw write speed. Sharding solves the scale issue.
7.3 Event Sourcing
- Concept: Store every transaction as an immutable event (`Deposited`, `Bought`, `Sold`). Calculate balance by replaying events.
- Pros: Perfect audit trail, easy debugging.
- Cons: Replaying millions of events to get a balance is slow. Requires “Snapshots”.
- Our Choice: Hybrid. We use a standard SQL table for current balance (fast) but log every change to an `audit_logs` table (immutable).
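A minimal sketch of the hybrid approach: update the fast "current state" table and append to the `audit_logs` table in the same transaction (column names are assumed for illustration):

```python
def apply_balance_change(db, user_id, currency, delta, reason, reference_id):
    """Keep the mutable balance and the immutable audit trail consistent atomically."""
    with db.begin():
        db.execute(
            "UPDATE wallets SET balance = balance + %s, updated_at = NOW() "
            "WHERE user_id = %s AND currency = %s",
            (delta, user_id, currency),
        )
        # Append-only: no UPDATE or DELETE is ever issued against audit_logs
        db.execute(
            "INSERT INTO audit_logs (user_id, currency, delta, reason, reference_id, created_at) "
            "VALUES (%s, %s, %s, %s, %s, NOW())",
            (user_id, currency, delta, reason, reference_id),
        )
```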
8. Low-Level Optimizations (The “Boom” Factor)
To squeeze every millisecond out of the system:
- Kernel Tuning:
- Increase TCP Buffer sizes (`net.ipv4.tcp_rmem`, `net.ipv4.tcp_wmem`) to handle high-throughput bursts.
- Increase TCP Buffer sizes (
- Connection Pooling:
- Database connections are expensive. Use PgBouncer to maintain a pool of warm connections, reducing overhead.
- Garbage Collection (GC):
- For the Quote Service (Golang), tune `GOGC` to trade memory for CPU.
- For the Quote Service (Golang), tune
- Network:
- Place Quote Services in the same Availability Zone (AZ) as the Liquidity Providers if possible (AWS `us-east-1`).
- Kernel Bypass (DPDK) is likely overkill for 230 QPS, but worth mentioning for HFT requiring microsecond latency.
9. Requirements Traceability Matrix
| Requirement | Architectural Solution |
|---|---|
| Get Quote (7s) | Quote Service + Redis (TTL 7s) + WebSocket to LPs. |
| Place Order | Order Service with HMAC validation + Idempotency keys. |
| Balance Check | PostgreSQL with SELECT FOR UPDATE (Pessimistic Locking). |
| Reliability (99.999%) | Active-Passive DB Failover + Stateless Services + Kubernetes Auto-healing. |
| Latency (<50ms) | In-memory processing (Redis) + Connection Pooling + Geolocation. |
| Consistency | Database Sharding by user_id allows local ACID transactions. |
| Scalability | Horizontal scaling of services + DB Sharding + Redis Cluster. |
| Security | API Gateway (JWT, Rate Limit) + Private Subnets + mTLS. |
| Compliance | Async KYC pipeline + Audit Logs (Event Sourcing lite). |
10. Observability & Tracing
You cannot fix what you cannot see. For a system moving millions of dollars, we need total visibility.
10.1 The RED Method (Metrics)
We instrument every service to emit these three golden signals:
- Rate: Request counts per second.
  - Metric: `http_requests_total{service="quote_svc", status="200"}`
  - Use: Detect traffic spikes or DDoS.
- Errors: Failed requests.
  - Metric: `order_failed_total{reason="insufficient_funds"}`
  - Use: Alert if order failures exceed 1% of total traffic.
- Duration: Latency distributions.
  - Metric: `quote_generation_seconds_bucket` (Histogram)
  - Use: Alert if P99 latency > 100ms.
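A minimal sketch of emitting these signals with the Prometheus Python client; metric names mirror the ones above, and the service functions are illustrative placeholders.

```python
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("http_requests_total", "Total requests", ["service", "status"])
ORDER_FAILED = Counter("order_failed_total", "Failed orders", ["reason"])
QUOTE_LATENCY = Histogram("quote_generation_seconds", "Quote generation latency")

start_http_server(9100)  # Exposes /metrics for Prometheus to scrape

@QUOTE_LATENCY.time()  # Duration: records into the _bucket/_sum/_count series
def generate_quote(pair, side, amount):
    ...

def record_order_result(status, reason=None):
    REQUESTS.labels(service="order_svc", status=status).inc()  # Rate
    if reason:
        ORDER_FAILED.labels(reason=reason).inc()                # Errors
```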
10.2 Distributed Tracing
A single order touches 4 systems: Gateway -> Order Service -> Redis -> DB. If an order is slow, Distributed Tracing tells us exactly where.
- Trace ID: Generated at the Gateway (e.g., `x-trace-id: 12345`). Passed via HTTP headers to every downstream service.
- Spans: Each service logs a “Span” with start/end timestamps.
- Visualization (e.g., Jaeger/Zipkin):
[Gateway] |-------------------------------------------| 205ms
[Order SVC] |-----------------------------| 180ms
[Redis] |----| 10ms
[DB Lock] |-----------| 150ms (Bottleneck!)
In this example, the DB Lock took 150ms, indicating database contention.
10.3 Structured Logging
Forget plain text logs. Use JSON for machine-readability (ELK Stack).
{
"level": "INFO",
"timestamp": "2023-10-27T10:00:00Z",
"service": "order-service",
"trace_id": "a1b2c3d4",
"user_id": "u_999",
"event": "order_placed",
"amount": 100.00,
"currency": "USD"
}
- Audit Logs: Separate, immutable logs for compliance. Every balance change must be recorded here and archived to WORM (Write Once Read Many) storage (e.g., S3 Object Lock).
10.4 Alerting Strategy
- P1 (Critical - Wake up on-call):
- Order Success Rate < 99.5%.
- Database connection pool saturation > 90%.
- Redis Cluster state “FAIL”.
- P2 (Warning - Ticket for tomorrow):
- Latency P99 > 150ms (SLA breach warning).
- Disk usage > 80%.
11. Deployment & Operations
11.1 Deployment Strategy: Blue/Green
For a financial system, we cannot risk a “bad deploy” corrupting the database.
- Blue (Active): Serving 100% traffic.
- Green (Staging): Deploy new version. Run integration tests.
- Switch: Update the Load Balancer to route 1% traffic to Green (Canary).
- Monitor: Watch for `HTTP 500` or latency spikes.
- Rollout: If safe, route 100% to Green.
- Rollback: If 1% fails, instantly revert LB to Blue. Users see errors for only 5 seconds.
11.2 Database Schema Evolution
- Problem: Adding a column locks the table.
- Solution: Expand-Contract Pattern.
- Expand: Add nullable column `new_col` (Zero downtime).
- Code: Update app to write to both `old_col` and `new_col`.
- Backfill: Run a background job to copy data `old` → `new`.
- Contract: Update code to read only from `new_col`. Drop `old_col`.
12. Follow-Up Questions: The Interview Gauntlet
This section covers 50 rapid-fire questions to test the depth of your design.
I. Database & Data Consistency (The Core)
- Why PostgreSQL over NewSQL? Sharded Postgres is more mature and sufficient for 1M users. Distributed SQL (CockroachDB) adds consensus latency to writes.
- Handling Hot Shards: If a “Whale” hits 10k TPS, we use Virtual Buckets to migrate that user to a dedicated physical node.
- Isolation Levels: We use READ COMMITTED for performance. SERIALIZABLE prevents race conditions but causes too many transaction aborts/retries in high-concurrency environments.
- Lock Contention: If `SELECT ... FOR UPDATE` hangs, connection pools exhaust. We set `NOWAIT` or short timeouts (e.g., 2s) to fail fast.
- Replication Lag: Users reading from replicas might see old balances. We implement “Sticky Sessions” or force reads from Primary for critical wallet views.
- Schema Migrations: Use `pg_repack` or similar tools to change tables without long locks.
- Double Booking: Without row locking, two parallel transactions read the same balance, subtract funds, and overwrite each other.
- Database Failover: Postgres Automatic Failover (PAF) takes ~30s. Writes fail during this window; users see errors.
- Data Corruption: External reconciliation (Nightly Jobs) sums all wallet balances vs. total deposits to detect drift.
II. Scalability & Performance
- Redis Eviction: `volatile-ttl` ensures we only drop expired quotes, never persistent configs.
- Load Balancing: Least Outstanding Requests handles varying service times better than Round Robin.
- Global Latency: Users in Australia will have higher latency. We can’t cheat physics; we must extend quote expiry or use Edge locations.
- CDN Caching: We generally cannot cache prices as they change every second. WebSocket is preferred.
- Write-Heavy Spikes: During crashes, we implement Queue-based Load Leveling (Kafka) to smooth out DB writes.
- Serialization: JSON is fine for this scale. Protobuf saves bandwidth but adds debugging complexity.
- Autoscaling Triggers: Scale on CPU Usage (>70%) and Request Queue Depth.
- Cache Penetration: Use Bloom Filters to block requests for non-existent symbols (“FAKE-COIN”).
III. Reliability & Fault Tolerance
- Redis Persistence: If Redis dies, quotes are lost. This is acceptable; users just request a new quote.
- Circuit Breaker: Threshold based on Error Rate (e.g., >50% failures in 10s).
- Bulkhead Pattern: Isolate thread pools for “Notifications” vs “Orders” so one slow dependency doesn’t crash the app.
- Retry Storms: Add Exponential Backoff and Jitter to client retries.
- Idempotency Storage: If Redis evicts keys, we fallback to a check in the persistent DB (slower but safer).
- Graceful Degradation: If History Service fails, the “Trade” button still works.
- Clock Skew: We use NTP on servers. Tolerance is built into the 7s expiry window.
- Zonal Failures: Deployment across 3 AZs ensures only ~33% capacity loss, which autoscaling covers.
IV. Architecture & Microservices
- Saga Pattern: We don’t use Sagas for the core trade (too slow). We use local ACID via sharding.
- Service Discovery: Kubernetes (CoreDNS) handles service IP resolution.
- Gateway vs Mesh: Gateway handles Edge concerns (Auth, Rate Limit); Mesh handles inter-service concerns (mTLS, Retries).
- Configuration: Use a dynamic config server (e.g., Consul/Etcd) with watchers to update `expiry_seconds` hot.
- Data Ownership: Order Service cannot access the Wallet Table directly. It must call the Wallet Service API to decouple schemas.
- Event Ordering: Kafka Partition Key = `user_id` ensures events for one user are sequential.
V. Security & Compliance
- Insider Trading: Admin actions require Multi-Party Approval and are logged to immutable audit trails.
- API Key Security: Scoped keys (Read-Only vs Trade). Automated rotation.
- DDoS Protection: Rate limiting at the Edge (Cloudflare) + Gateway (Token Bucket).
- Audit Immutability: Write logs to S3 with Object Lock (Governance Mode).
- PII Data: “Crypto-shredding”: Delete the encryption key for a user’s data to effectively “erase” it without modifying immutable logs.
- Internal Auth: mTLS (Mutual TLS) ensures only authorized services can talk to the Wallet Service.
VI. Operations & Observability
- Metric Cardinality: Do not tag metrics with `user_id`. Use logs for high-cardinality debugging.
- Distributed Tracing: Inject `x-trace-id` at the Gateway and propagate it everywhere.
- Chaos Engineering: Randomly kill pods (Chaos Monkey) during staging to test recovery.
- Alert Fatigue: Group related alerts. Use “Symptoms” (User can’t trade) rather than “Causes” (CPU high) for paging.
- Capacity Planning: Linear regression on past 6 months of data to forecast storage/compute needs.
VII. Business Logic & Edge Cases
- Negative Balance: Should be impossible with ACID. If it happens, freeze account and trigger manual investigation.
- Partial Fills: Requires DB schema change (`filled_amount` vs `requested_amount`).
- Market Halted: A global “Kill Switch” in Redis that the Order Service checks before every trade.
- Rounding Errors: Always use Integers (Micros/Satoshis) or BigDecimal. Never `float` or `double`.
VIII. Advanced Architecture (The 99.999% Club)
- Event Sourcing vs. CRUD: Event sourcing is better for auditability but harder to query. We use a hybrid approach (CRUD for current state, Events for history).
- CQRS (Command Query Responsibility Segregation): Use separate models for Writes (Order Service) and Reads (History Service). This allows scaling reads independently via Read Replicas.
- LMAX Disruptor: A high-performance inter-thread messaging library. Used in HFT to avoid lock contention. Overkill for 500 TPS but good for 500k TPS.
- Kernel Bypass (DPDK/Solarflare): Bypassing the Linux kernel networking stack to write directly to the NIC. Reduces latency from 10us to 1us.
- Clock Synchronization: NTP is not enough for sub-millisecond precision. Use PTP (Precision Time Protocol) with hardware timestamping.
- Garbage Collection Tuning: For Java, use ZGC to keep pauses < 1ms. For Go, use `GOGC=off` and manual memory management if needed (extreme case).
- False Sharing: CPU cache line contention. Pad data structures to 64 bytes to prevent cores from invalidating each other’s caches.
IX. Failure Modes & Disaster Recovery
- Partial Partition: What if the Order Service can reach Redis but not the DB? Answer: Fail the request safely.
- Zombie Processes: A service that thinks it’s the leader but isn’t. Use Fencing Tokens (epoch numbers) to reject writes from zombies.
- Thundering Herd: If Redis clears, 10k users hit the DB. Use Request Coalescing (Singleflight) to merge identical requests.
- Split Brain: If the cluster partitions, do we accept writes on both sides? Answer: No. Pause writes (CP system) to preserve consistency.
- Corrupted WAL: If Postgres WAL is corrupted, replay from the last snapshot and accept data loss (RPO > 0).
- Region Failure: Failover to DR region. RTO (Recovery Time Objective) ~15 mins. DNS switch.
X. Market Microstructure
- Slippage: The difference between the quoted price and executed price. In RFQ, the platform absorbs slippage (the “Spread”).
- Order Types:
- FOK (Fill or Kill): Execute fully or not at all.
- IOC (Immediate or Cancel): Execute what you can, cancel the rest.
- GTC (Good Till Cancelled): Standard limit orders (not used in RFQ).
- Spread Capture: The primary revenue model. We buy at $50,000 and sell to users at $50,250.
- Hedging: When a user buys 1 BTC, we immediately buy 1 BTC from an LP to neutralize our inventory risk.
13. Summary: The Whiteboard Strategy
If you are asked to design this in 45 minutes, draw this 4-Quadrant Layout:
1. Requirements & Core Math
- Func: Quote (7s), Order, Wallet.
- Non-Func: 99.999%, <50ms Latency, ACID.
- Scale: 230 QPS (Quotes), 12 TPS (Orders).
- Traffic: Read-heavy (20:1).
2. Architecture
↓
[Gateway (Auth/Rate Limit)]
↙ ↘
[Quote SVC] -- [Redis Cluster]
[Order SVC] -- [Sharded DB]
* Separation of Concerns: Fast (Quotes) vs Safe (Orders).
* Sharding: By User ID for local ACID.
3. Data & API
POST /orders -> { order_id, status }
Wallets: (user_id, currency, balance)
Orders: (order_id, status, quote_id)
4. Trade-offs & Deep Dives
- Concurrency: Pessimistic Locking (`FOR UPDATE`) prevents double-spend.
- Latency: Redis `volatile-ttl` + Connection Pooling + Geo-routing.
- Reliability: Grace Period for network jitter.
- Observability: Distributed Tracing + Audit Logs.