Design an Exchange-Rate Service

[!IMPORTANT] In this lesson, you will master:

  1. Data Normalization: Handling the “Decimal Disaster” of global currencies (JPY vs. EUR).
  2. Aggressive Outlier Detection: Filtering out “Drunken Providers” using Trimmed Mean and Z-Scores.
  3. High-Frequency Ingestion: Ingesting $10^{6}$ rate updates with sub-millisecond propagation.

1. Scene Setting: The Search for the “Fair” Rate

Wise prides itself on providing the “mid-market” rate. In a 60-minute interview, you aren’t just building a “getter”; you are building a mission-critical financial backbone. If the Rates Service dies or provides a wrong rate (e.g., $10$x off), the company loses millions in seconds through arbitrage.

Clarifying Questions (The “Probing” Phase)

Prove your seniority by asking these before drawing a single box:

  • Interviewer: “Design a service that provides the mid-market exchange rate for 100 currency pairs.”
  • Candidate: “Does ‘mid-market’ mean the average of Bid/Ask from a single provider, or are we aggregating across 5+ providers?”
  • Interviewer: “Aggregate across 10 providers.”
  • Candidate: “What is our staleness tolerance? If the market moves, do we need updates reflected in < 1s or < 1 min?”
  • Interviewer: “Sub-second. If a rate is older than 30 seconds, it’s ‘Toxic’ and we shouldn’t use it.”
  • Candidate: “Are we providing ‘Real-time quotes’ or ‘Fixed quotes’? (i.e., Do we guarantee a price for 30 mins once a user starts a transfer?)”
  • Interviewer: “Both. We need a latest-rate API and a way to ‘Lock’ a rate for a transfer.”

2. Requirements & Constraints

2.1 Functional Requirements

  1. Multi-Provider Aggregation: Fetch and aggregate mid-market rates from 10+ liquidity providers (LPs) simultaneously.
  2. Real-Time Data Normalization: Convert heterogeneous provider formats (FIX, JSON, XML) into a unified internal representation.
  3. Outlier Protection: Automatically identify and prune “Toxic” prices (stale or manipulated) to prevent arbitrage losses.
  4. Rate Locking: Provide a mechanism to lock a rate for 30 minutes once a user initiates a transfer.
  5. Historical Auditability: Maintain a 7-year audit trail of every raw observation for regulatory compliance.

2.2 Non-Functional Requirements

  1. Latency: Sub-second end-to-end propagation (from LP update to user’s screen).
  2. Accuracy (Zero Tolerance): Precision must support up to 18 decimal places (e.g., for ETH/standard crypto pairs).
  3. Availability: 99.999% (Five Nines). Trading cannot stop.

2.3 Compliance & Market Integrity

  • Market Abuse Detection: Monitor for “Flash Crashes” or “Spoofing” and trigger automated circuit breakers.
  • Fair Value Pricing: Ensure the “Mid-Market” rate reflects the true global consensus, matching Wise’s brand promise.

3. Capacity Planning & Estimation (The FX Firehose)

3.1 Throughput Analysis

  • Scale: 100 currency pairs × 10 Providers = 1,000 potential update streams.
  • Update Frequency: Major pairs (EUR/USD) update every $100$ ms. High-volatility moments can see $50$ updates/sec per pair.
  • Avg Aggregate Throughput: 5,000 updates/sec.
  • Peak Throughput: 25,000 updates/sec.

3.2 Storage (The 110TB Problem)

  • Retention: 7-year regulatory requirement.
  • Math:
    • $5,000$ updates/sec $\times$ $86,400$ sec/day $\approx$ 432 Million records/day.
    • Row size: 100 bytes (Pair, Provider, Bid, Ask, Timestamp, Signature).
    • Daily Ingestion: 43.2 GB/day.
    • 7-Year Volume: ~110 TB.
  • The Hardware Strategy (Tiered Storage):
    • Hot Tier (0-30 days): DynamoDB, Cassandra, or extended Kafka retention for immediate operational debugging and point-reads (quote_id_123).
    • Cold Tier (30 days - 7 years): WORM (Write Once, Read Many) data is dumped into AWS S3 (Standard IA or Glacier) via Kafka Connect, partitioned by date/currency_pair. Amazon Athena is used to query this 110TB datalake on-demand for regulators, saving massive 24/7 compute costs over a monolithic OLAP cluster like ClickHouse.
    • Live State: Redis for the “Latest Rate” snapshot (fitting 100 pairs in < 1 MB).
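The arithmetic above can be sanity-checked in a few lines (a sketch using only the estimates from this section):

```python
# Back-of-envelope storage estimate for the 7-year audit trail.
UPDATES_PER_SEC = 5_000   # average aggregate throughput
ROW_BYTES = 100           # pair, provider, bid, ask, timestamp, signature

records_per_day = UPDATES_PER_SEC * 86_400       # ~432 million records/day
daily_gb = records_per_day * ROW_BYTES / 1e9     # ~43.2 GB/day
seven_year_tb = daily_gb * 365 * 7 / 1e3         # ~110 TB total

print(f"{records_per_day:,} records/day, {daily_gb:.1f} GB/day, "
      f"{seven_year_tb:.1f} TB over 7 years")
```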

3.3 Propagation Latency

  • Target: < 500ms global propagation.
  • Components:
    • LP → Adapter (FIX Handshake): 20ms.
    • Adapter → Kafka: 10ms.
    • Aggregator Computation: < 5ms (Crucial: The Aggregator must be a Stateful Stream Processor like Apache Flink or Kafka Streams. It must maintain rolling time-windows in local memory to calculate Z-scores and Trimmed Means in O(1)/O(log N) time. Querying a database for standard deviation here will instantly violate the 500ms SLA).
    • Aggregator → Redis Global Replication: 150ms (Trans-Atlantic).
    • Total: ~200-300ms typical.

4. Architecture Comparison: Choosing the Pattern

Alternative 1: The “Fat Gateway” (Monolithic Ingestion)

A single service polls 10 providers and writes to a DB.

  • Pros: Easy to implement.
  • Cons: A single provider’s failure (slow FIX connection) can block the entire process. Hard to scale per-provider.

Alternative 2: The “Adapter-per-Provider” Pattern (Recommended)

Each provider has its own “Adapter” service that normalizes data and pushes to a unified stream.

| Feature | Alternative 1 | Alternative 2 (Recommended) |
| --- | --- | --- |
| Isolation | Low (one crash kills all) | High (Provider A down doesn’t affect Provider B) |
| Scalability | Vertical | Horizontal |
| Normalization | Ad-hoc | Unified interface |
```mermaid
flowchart TD
  P1[Reuters] -->|FIX| A1[Adapter A]
  P2[Bloomberg] -->|WS| A2[Adapter B]
  P3[Barclays] -->|REST| A3[Adapter C]

  A1 & A2 & A3 -->|Normalized JSON| K[Kafka Stream: RawObservations]

  K --> AGG[Aggregator Service]
  K --> LOG[(Audit Log: S3 Datalake + Athena)]

  AGG --> SNAP[(Latest Rate Cache: Redis)]
  API[Rates API] --> SNAP
  API --> LOCK[(Rate Lock Store: DynamoDB 30m TTL)]
```

5. The 4-Quadrant Whiteboard Layout

1. Reqs & Math
- 100 Pairs, 10 Providers
- 5,000 updates/sec
- 43GB/day audit log
- Requirements:
  * Sub-second lag
  * Toxicity detection
  * Rate locking (30m)
2. High Level Design
Providers (FIX/REST)
|
[Adapters]
|
[Kafka: RawObservations]
|
[Aggregator (Trimmed Mean)]
|
[Redis Snapshot]
3. Deep Dives
- Normalization (Minor Units)
- Outlier Control (Z-Score)
- Rate Locking (Persistence)
- Staleness Circuit Breakers
4. Scaling & Ops
- 1M Pairs: Sharding Redis
- WORM storage for Audit
- Propagation Monitoring
- Provider Drift Alerts

6. Elite Interactive: The Outlier Filter Simulator

Watch how high-frequency rate data from multiple providers is processed to find the “Fair Rate” while stripping away anomalies.

(Interactive widget: the EUR/USD aggregate rate — e.g., 1.0851 — computed from five provider feeds (Reuters, Bloomberg, Barclays, HSBC, Citibank) under a selectable filter, starting with the standard mean.)

7. Deep Dive: The Normalization Pattern

Providers send data in messy formats. You must implement a unified NormalizedRate object.

JPY vs. EUR: The Precision Problem

In financial systems, never use double or float. Use Minor Units or BigDecimal.

  • EUR/USD: 1.0851 (Scale 4)
  • USD/JPY: 151.20 (Scale 2)
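A minimal sketch of the unified NormalizedRate object in Python, using `Decimal` rather than `float` (the class name and fields are illustrative, not a real Wise schema):

```python
from dataclasses import dataclass
from decimal import Decimal

@dataclass(frozen=True)
class NormalizedRate:
    """Unified internal representation of one provider observation."""
    pair: str          # e.g. "EUR/USD"
    provider: str
    bid: Decimal       # never float: Decimal avoids binary rounding error
    ask: Decimal
    ts_ms: int         # event timestamp, epoch milliseconds

    @property
    def mid(self) -> Decimal:
        """Mid-market price: exact halfway point between bid and ask."""
        return (self.bid + self.ask) / 2

r = NormalizedRate("EUR/USD", "Reuters",
                   Decimal("1.0849"), Decimal("1.0853"), 1_700_000_000_000)
print(r.mid)  # 1.0851 — exact, with no float artifacts
```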

[!TIP] Staff Engineer Insight: Why FIX Protocol? In institutional FX, we prefer FIX (Financial Information eXchange) over WebSockets. FIX handles sequence numbering and session recovery natively, which is essential for high-frequency, low-latency updates.


8. Logic: Handling “Drunken Providers”

Providers often glitch. A bank might suddenly quote $1.50$ for EUR/USD when the real rate is $1.08$.

Outlier Strategies

  1. Trimmed Mean: Discard the highest $20\%$ and lowest $20\%$ of observations and average the rest.
  2. Z-Score (Standard Deviation): If an observation is $> 3\sigma$ away from the rolling average, discard it.
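Strategy 1 can be sketched in a few lines (the function name and the sample quotes are illustrative):

```python
from decimal import Decimal

def trimmed_mean(observations: list[Decimal], trim: float = 0.2) -> Decimal:
    """Drop the highest and lowest `trim` fraction, average the rest."""
    ordered = sorted(observations)
    k = int(len(ordered) * trim)                       # count cut from each tail
    kept = ordered[k:len(ordered) - k] if k else ordered
    return sum(kept) / len(kept)

# One drunken provider quoting 1.50 is simply discarded:
quotes = [Decimal(q) for q in ("1.0849", "1.0850", "1.0851", "1.0852", "1.50")]
print(trimmed_mean(quotes))  # averages the middle three -> 1.0851
```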

9. Technical Depth: Protocols & Propagation

9.1 FIX Protocol vs. WebSockets

While WebSockets are common for web clients, Wise uses the FIX (Financial Information eXchange) protocol for LPs.

  • Why?: FIX provides Sequence Numbering and Gap Fill logic at the application layer. If a connection drops or messages are lost, the session re-synchronizes and replays the gap, so no price tick is silently missed; plain WebSockets leave gap detection and recovery entirely to the application.
  • Low Latency: FIX engines (like QuickFIX) parse a compact tag=value encoding (35=D|49=WISE|...) that is far cheaper on the CPU than heavyweight JSON.

9.2 The Statistics of Trust: Z-Score vs. Trimmed Mean

  • The Problem: A single provider glitch (reporting $100.0$ instead of $1.0$).
  • Trimmed Mean (The Hammer): Simply drops the top/bottom 20%.
    • Cons: If 3 out of 10 providers glitch together, the mean still drifts significantly.
  • Z-Score (The Scalpel): Calculate the standard deviation ($\sigma$) of the last 1 minute of updates. If a new tick is $> 3\sigma$ from the mean, it is dropped as an outlier.
    • Pros: Highly sensitive to sudden anomalies while remaining robust to gradual market moves.
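A sketch of the rolling z-score filter described above. In production this window would live inside the Flink/Kafka Streams operator state; the 30-tick warm-up threshold is an assumed parameter:

```python
from collections import deque
from statistics import mean, pstdev

class ZScoreFilter:
    """Rolling window per pair; reject ticks > 3 sigma from the window mean."""
    def __init__(self, window: int = 600, threshold: float = 3.0):
        self.ticks = deque(maxlen=window)   # bounded: old ticks age out
        self.threshold = threshold

    def accept(self, price: float) -> bool:
        if len(self.ticks) >= 30:                     # need history to trust sigma
            mu, sigma = mean(self.ticks), pstdev(self.ticks)
            if sigma > 0 and abs(price - mu) / sigma > self.threshold:
                return False          # outlier: drop it, don't pollute the window
        self.ticks.append(price)
        return True

f = ZScoreFilter()
for p in [1.0850 + i * 0.0001 for i in range(50)]:   # gradual market move
    assert f.accept(p)                               # ...all accepted
print(f.accept(1.50))  # glitch tick, hundreds of sigma out -> False
```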

9.3 Global Rate Propagation (Speed of Light)

To ensure a user in Australia sees the same rate as one in New York:

  1. London (Primary Aggregator): Receives Reuters/CME feeds. Calculates the “Fair Price.”
  2. Redis Global Mesh: Uses Active-Active Replication (CRDTs).
  3. Local Reads: The Singapore API node reads from its local Redis replica.
    • Technical Gotcha: Clock skew between London and Singapore can make a rate look “stale” on arrival. We use Monotonic Timestamps inside the payload to ignore system clock differences.

Operational Failure Modes (Playbooks)

Scenario A: Flash Crash / Bad Data Feed (Market-Wide Price Shock)

  • Problem: 8 out of 10 providers suddenly report a 20% drop in GBP because of a “Flash Crash” or a bad data feed.
  • Playbook:
    1. Implement Global Drift Alerts.
    2. If the “Fair Price” moves more than 2% in < 5 seconds, trigger a Regional Halt.
    3. No new transfers can start for that currency pair until a human operator confirms the market move is real.
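The drift-alert step of this playbook can be sketched as follows (the 2% / 5-second thresholds come from the playbook; the class name is illustrative):

```python
from collections import deque

class DriftAlert:
    """Trip a regional halt if the fair price moves >2% within 5 seconds."""
    def __init__(self, max_move: float = 0.02, window_s: float = 5.0):
        self.history = deque()        # (timestamp, price) pairs inside the window
        self.max_move = max_move
        self.window_s = window_s

    def on_price(self, ts: float, price: float) -> bool:
        """Return True if the circuit breaker should trip."""
        self.history.append((ts, price))
        while self.history and ts - self.history[0][0] > self.window_s:
            self.history.popleft()    # evict prices older than the window
        oldest = self.history[0][1]
        return abs(price - oldest) / oldest > self.max_move

d = DriftAlert()
assert not d.on_price(0.0, 1.0850)   # baseline tick
assert not d.on_price(2.0, 1.0860)   # ~0.09% move: fine
print(d.on_price(4.0, 1.0500))       # ~3.2% crash inside 5s -> True
```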

Scenario B: Provider Adapter Memory Leak

  • Problem: Provider B’s adapter is consuming 100% CPU, causing pricing lag for all pairs it covers.
  • Playbook:
    1. The Aggregator detects “Provider B Latency.”
    2. It automatically de-weights Provider B in the calculation.
    3. Rolling restart of the Provider B microservice cluster.

10. The 30-Minute Rate Lock Store

The requirement to lock a rate for 30 minutes is not solved by caching the real-time rate. If a user initiates a transfer, the system must guarantee that specific price.

  1. Quote Generation: When the user requests a quote, the Rates API reads the current aggregated rate from the Redis Cache.
  2. The Lock Store: The API generates a unique quote_id and saves the rate, currency pair, and timestamp into a Dedicated Rate Lock Store (e.g., DynamoDB or a separate Redis cluster).
  3. TTL (Time to Live): The record is written with an exact 30-minute TTL.
  4. Execution: When the user executes the transaction 15 minutes later, the transaction service queries the Lock Store using the quote_id. It does not query the live real-time rate cache.
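The four steps above can be sketched with an in-memory stand-in for the Lock Store (the uuid-based quote_id and helper names are assumptions; production would use DynamoDB's native TTL attribute):

```python
import time
import uuid
from decimal import Decimal

LOCK_TTL_SECONDS = 30 * 60    # the 30-minute guarantee

# quote_id -> (locked rate, pair, expiry time)
_locks: dict = {}

def lock_rate(pair: str, live_rate: Decimal, now=None) -> str:
    """Snapshot the live rate at quote time; execution uses this, not the feed."""
    now = time.time() if now is None else now
    quote_id = str(uuid.uuid4())
    _locks[quote_id] = (live_rate, pair, now + LOCK_TTL_SECONDS)
    return quote_id

def execute(quote_id: str, now=None) -> Decimal:
    """Return the locked rate, or fail if the 30-minute TTL has lapsed."""
    now = time.time() if now is None else now
    rate, _pair, expiry = _locks[quote_id]
    if now > expiry:
        raise KeyError(f"quote {quote_id} expired")   # user must re-quote
    return rate

qid = lock_rate("EUR/USD", Decimal("1.0851"), now=0.0)
print(execute(qid, now=900.0))  # 15 minutes later: still the locked 1.0851
```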

11. Missing Crucial Details (The “Emerging Markets” Edge Cases)

At the Staff/Lead level, especially for global platforms dealing with exotic pairs, you must address nuances beyond simple G10 rate aggregation.

A. Inverse Pairs and Cross Rates (Triangulation)

The assumption that liquidity providers directly quote every possible currency pair (e.g., SGD to VND) is false. Providers predominantly quote against the USD (e.g., USD/SGD, USD/VND).

  • The Solution: The Aggregator must perform Synthetic Cross-Rate Calculation. If a user requests an SGD to VND quote, the system must calculate: (USD/VND Ask) / (USD/SGD Bid).
  • Concurrency Risk: This requires atomic multi-pair reads from the state cache so you don’t calculate a cross-rate using a 1-second-old USD/SGD rate alongside a 10-second-old USD/VND rate, which creates an arbitrage opportunity.
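The triangulation formula above, sketched with `Decimal` (the USD/VND and USD/SGD quotes are made-up numbers):

```python
from decimal import Decimal

def cross_rate(usd_quote_ask: Decimal, usd_base_bid: Decimal) -> Decimal:
    """Synthetic BASE/QUOTE rate via USD: (USD/QUOTE ask) / (USD/SGD-style bid).

    Both legs must come from one atomic snapshot read; mixing a fresh leg
    with a stale one opens an arbitrage window.
    """
    return usd_quote_ask / usd_base_bid

# SGD -> VND via USD (illustrative quotes):
usd_vnd_ask = Decimal("24400")    # USD/VND ask
usd_sgd_bid = Decimal("1.3400")   # USD/SGD bid
print(cross_rate(usd_vnd_ask, usd_sgd_bid))  # synthetic SGD/VND rate
```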

B. The Spread and Margin Engine

An architecture focusing purely on the “Fair Mid-Market Rate” misses the business logic. Consumer apps do not hand out raw mid-market rates without spreads.

  • The Solution: Introduce a Margin Engine situated between the Aggregator and the User API.
    1. The Aggregator calculates the Mid-Market (e.g., 1.0850).
    2. The Margin Engine evaluates real-time market volatility (Z-scores). If volatility is high, it temporarily widens the spread to shield the business from slippage.
    3. It applies the customer’s tier markup (e.g., 0.5%).
    4. The Final Quote is surfaced to the UI (e.g., 1.0904).
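The margin steps above can be sketched as one function (the 0.2% volatility surcharge is an assumed figure; the 0.5% tier markup reproduces the 1.0850 → 1.0904 example):

```python
from decimal import Decimal, ROUND_HALF_EVEN

def final_quote(mid: Decimal, tier_markup: Decimal, volatile: bool) -> Decimal:
    """Apply a volatility-widened spread, then the customer's tier markup."""
    spread = Decimal("0.002") if volatile else Decimal("0")  # assumed surcharge
    quoted = mid * (1 + spread + tier_markup)
    return quoted.quantize(Decimal("0.0001"), rounding=ROUND_HALF_EVEN)

# Calm market, 0.5% tier markup: 1.0850 -> 1.0904
print(final_quote(Decimal("1.0850"), Decimal("0.005"), volatile=False))
```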

C. The Staleness Threshold Matrix (Circuit Breakers)

If a provider feed goes dark, what does the API actually do when a user clicks “Transfer”? Rejecting the transaction creates terrible UX, but proceeding blindly risks massive financial exposure.

  • The Solution: Define a Staleness Threshold Matrix. The API checks last_updated against an escalating risk profile:
    • $< 2$ seconds old: Process normally.
    • $2 - 10$ seconds old: Widen the spread (e.g., add $+2\%$) to absorb unknown market fluctuations.
    • $> 10$ seconds old: Trip the Circuit Breaker. The API returns 503 Service Unavailable (Trading Halted).
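The matrix translates directly into a small decision function (thresholds and the +2% widening are taken from the bullets above):

```python
from decimal import Decimal
from enum import Enum

class Action(Enum):
    PROCESS = "process"
    WIDEN_SPREAD = "widen_spread"
    HALT = "halt"               # trip the circuit breaker: API returns 503

def staleness_action(age_seconds: float):
    """Map rate age to (action, extra spread) per the escalating risk profile."""
    if age_seconds < 2:
        return Action.PROCESS, Decimal("0")
    if age_seconds <= 10:
        return Action.WIDEN_SPREAD, Decimal("0.02")   # +2% absorbs unknown moves
    return Action.HALT, Decimal("0")

print(staleness_action(1.2))   # fresh: process normally
print(staleness_action(6.0))   # aging: widen spread by 2%
print(staleness_action(45.0))  # dark feed: halt trading
```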

12. Advanced: Scaling and Reliability

  • 1M Pairs: Partition rates by “Currency Cluster” (G10 vs. LatAm vs. Exotic).
  • Compliance: Every rate_snapshot_id must link back to specific Kafka offsets for auditability.

Hardware-First Intuition: The Cost of a Cache Miss

  • Redis (RAM): $1$ ms fetch.
  • Athena/S3 Cold Query: $\approx$ 1-5 seconds.

The Strategy: We pre-aggregate the “Fair Price” and push it to Redis. The Pricing API never queries the 110TB S3 audit log during a live user request. The Cold Tier is strictly for asynchronous compliance investigations.


13. Interview Pacing & Milestone Guide

| Time | Task | Key Talking Points |
| --- | --- | --- |
| 0-10m | Reqs & Estimates | Define mid-market, calculate the 43 GB/day data weight. |
| 10-25m | High-Level Design | Kafka pipeline, adapters as microservices. |
| 25-40m | Outliers & Precision | BigDecimal, trimmed-mean logic. |
| 40-50m | Rate Locking | Snapshot ID pattern. |
| 50-60m | Scaling & Failure | Regional halts, Redis sharding. |

14. Summary: Senior Interview Checklist

  • Arbitrage Protection: Automated “Halts” on high volatility.
  • Audit Trail: Link transfer_id → rate_snapshot_id → RawObservations.
  • Atomic Snapshots: Use Redis Pipeline/Multi-exec for consistency.

15. Follow-up Interview Questions

  1. “Should the adapters use Webhooks or Polling to get rates from providers?” Answer: It depends on the provider’s API. Webhooks (or long-lived WebSocket connections like FIX protocol) are preferred for real-time streams to minimize latency. If a provider only offers REST, we must poll. Polling introduces artificial latency and requires careful rate-limit management.

  2. “What happens if all 10 providers go down or stop updating?” Answer: We track the last_updated timestamp for the aggregated rate. If the data becomes “stale” (e.g., no updates for >5 seconds on a major pair during trading hours), the Aggregator triggers a “Circuit Breaker” state. The API will refuse new transfers to prevent taking on unbounded FX risk, surfacing a “Rate currently unavailable” error to the user.

  3. “How do you handle precision loss when calculating the average?” Answer: Never use standard floating-point numbers (float or double). We must use arbitrary-precision data types (like Java’s BigDecimal or explicitly storing integers as micro-cents) at every stage—from the Adapter parsing the JSON to the Aggregator calculating the mean and storing the string in Redis.

  4. “If the Aggregator service crashes, will users see old rates?” Answer: Yes, momentarily. Redis will serve the last written rate. To prevent this from becoming a liability, the Redis keys should have a short TTL (e.g., 10 seconds). If the Aggregator dies and doesn’t refresh the key, the key expires, and the API correctly fails closed rather than allowing users to trade on stale data.

  5. “How do you partition the incoming raw observations stream to prevent hotspots and ensure sequential processing for the Aggregator?” Answer: We partition the RawObservations Kafka topic by currency_pair (e.g., EUR/USD). This ensures that all updates for a specific pair land on the same partition, guaranteeing strict ordering and allowing a single Aggregator instance to calculate the Trimmed Mean/Z-Score accurately for that pair without complex distributed state coordination. Since there are 100 pairs, we can have up to 100 partitions and consumers, providing excellent horizontal scalability.
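The stable key-to-partition property behind that answer can be illustrated in a few lines (note: Kafka's default partitioner actually uses murmur2 on the key bytes; MD5 here only demonstrates the idea):

```python
import hashlib

NUM_PARTITIONS = 100   # one per currency pair at most

def partition_for(currency_pair: str) -> int:
    """Stable key -> partition mapping: every EUR/USD tick lands on the same
    partition, so one consumer sees that pair's updates in strict order."""
    digest = hashlib.md5(currency_pair.encode()).digest()
    return int.from_bytes(digest[:4], "big") % NUM_PARTITIONS

assert partition_for("EUR/USD") == partition_for("EUR/USD")   # deterministic
print(partition_for("EUR/USD"), partition_for("USD/JPY"))
```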