Design LeetCode (Remote Code Execution)
[!IMPORTANT] In this lesson, you will master:
- What is a Code Execution Service? Building intuition for running untrusted code safely.
- Requirements & Goals: The functional and non-functional requirements of an online judge.
- Capacity Estimation: Back-of-the-envelope sizing for contest-scale traffic.
1. What is a Code Execution Service?
A Remote Code Execution (RCE) service allows users to submit code in various languages (Python, C++, Java), executes it against a set of test cases, and returns the results (Pass/Fail, Runtime, Memory Usage).
[!WARNING] The Danger Zone: This is one of the most dangerous systems to build. You are literally inviting strangers to run arbitrary code on your servers. Without proper Sandboxing, a user could `rm -rf /`, mine crypto, or scan your internal network.
Real-World Examples
- LeetCode / HackerRank: Competitive programming platforms.
- Judge0: Open-source RCE API.
- Replit / CodeSandbox: Online IDEs (stateful environments).
2. Requirements & Goals
Functional Requirements
- Code Submission: User submits source code and language.
- Execution: System runs code against test cases.
- Feedback: Returns Standard Output (STDOUT), Standard Error (STDERR), and Verdict (Accepted, Wrong Answer, TLE).
- Limits: Enforce strict Time Limits (e.g., 2s) and Memory Limits (e.g., 256MB).
Non-Functional Requirements
- Security (Critical): Zero Trust. Code must run in total isolation.
- Performance: Low overhead for container startup. Users expect results in seconds.
- Concurrency: Handle thousands of simultaneous submissions.
3. Capacity Estimation
- Daily Submissions: 1 Million.
- Peak Traffic: 100 submissions/sec (during contests).
- Execution Time: Average 2 seconds per task.
- Compute Resources: Since tasks are CPU-bound, we need scalable worker nodes.
- 100 tasks/sec × 2 sec/task = 200 concurrent containers running.
- If each core handles 1 container, we need ~200 vCPUs (e.g., 25-50 large instances).
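The arithmetic above is just Little's law (in-flight work = arrival rate × service time). A minimal sanity check of the numbers:

```python
# Back-of-the-envelope check of the capacity numbers (Little's law: L = lambda * W).
peak_qps = 100        # peak submissions per second (contest burst)
avg_exec_sec = 2      # average execution time per task

concurrent_containers = peak_qps * avg_exec_sec   # sandboxes in flight at once

# Assuming 1 container per core, vCPUs needed equals concurrent containers.
vcpus_needed = concurrent_containers

# Instance count depends on machine size; e.g. 8-vCPU vs 4-vCPU workers:
instances_8vcpu = vcpus_needed // 8
instances_4vcpu = vcpus_needed // 4

print(concurrent_containers, instances_8vcpu, instances_4vcpu)
```

This is where the "~200 vCPUs (25-50 large instances)" range comes from: 200/8 = 25 eight-core machines, 200/4 = 50 four-core machines.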
4. System APIs
Submit Code
POST /v1/submissions

```json
{
  "code": "print('hello')",
  "language": "python3",
  "problem_id": "123"
}
```

Response: `{ "submission_id": "sub_abc123" }`
Get Status (Polling)
GET /v1/submissions/sub_abc123
Response:

```json
{
  "status": "PROCESSING",  // or COMPLETED
  "result": { "verdict": "Accepted", "runtime_ms": 45 }
}
```
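The polling pattern above can be sketched as a small client loop. This is a minimal sketch: `fetch` is a stand-in for a real HTTP GET against the endpoints above, injected here so the logic is testable without a server.

```python
import time

def poll_submission(submission_id, fetch, interval_s=0.5, max_attempts=20):
    """Poll GET /v1/submissions/{id} until the verdict is ready.

    `fetch(url) -> dict` stands in for a real HTTP client call that
    returns the decoded JSON body shown above.
    """
    for _ in range(max_attempts):
        body = fetch(f"/v1/submissions/{submission_id}")
        if body["status"] == "COMPLETED":
            return body["result"]
        time.sleep(interval_s)  # back off between polls
    raise TimeoutError(f"submission {submission_id} still processing")

# Example with a fake server that completes on the second poll:
responses = iter([
    {"status": "PROCESSING"},
    {"status": "COMPLETED", "result": {"verdict": "Accepted", "runtime_ms": 45}},
])
result = poll_submission("sub_abc123", lambda url: next(responses), interval_s=0)
print(result["verdict"])  # Accepted
```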
5. Database Design
We need to store submission history and problem data. For more on relational schema design, see Module 04: Database Basics.
1. Submissions Table (Postgres)
| Column | Type | Description |
|---|---|---|
| `id` | UUID | Primary Key |
| `user_id` | UUID | The submitter |
| `problem_id` | UUID | The problem solved |
| `code` | TEXT | The source code (stored in S3 if large) |
| `status` | ENUM | PENDING, PROCESSING, AC, WA, TLE, RE |
| `runtime` | INT | Execution time in ms |
| `memory` | INT | Memory usage in KB |
2. Test Cases (Object Store / S3)
- Test cases are often large files.
- Structure: `s3://problems/{problem_id}/input/1.txt`, `s3://problems/{problem_id}/output/1.txt`
- Worker nodes download these during execution.
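A worker's test-case fetch can be sketched against that key layout. This is a sketch only: `s3_get` is a stand-in for a real object-store client call (e.g. boto3's `get_object`), injected so the key construction is testable locally.

```python
def fetch_test_case(s3_get, problem_id, case_no):
    """Download one (input, expected_output) pair for a problem.

    `s3_get(key) -> bytes` stands in for a real S3 client; keys follow
    the problems/{problem_id}/input|output/{n}.txt layout above.
    """
    base = f"problems/{problem_id}"
    test_input = s3_get(f"{base}/input/{case_no}.txt")
    expected = s3_get(f"{base}/output/{case_no}.txt")
    return test_input, expected

# In-memory fake store for illustration:
store = {
    "problems/123/input/1.txt": b"1 2\n",
    "problems/123/output/1.txt": b"3\n",
}
test_input, expected = fetch_test_case(store.__getitem__, "123", 1)
```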
6. High-Level Design
High-Level Architecture: The Secure Execution Pipeline.
The system follows an Asynchronous Worker Pattern to handle long-running code execution without blocking the API:
- API Gateway / Submissions Svc: Validates the request, saves initial metadata to PostgreSQL, and pushes the job into a Redis Job Queue (See Module 08: Messaging for more).
- Job Queue: Decouples the API from execution. This buffers traffic bursts and allows for easy scaling of worker nodes.
- Worker Nodes: Independent compute nodes that pull jobs from the queue via `BLPOP` (a blocking pop).
- Sandbox Runtime (The Isolation Zone):
- Ephemeral Sandbox: The worker creates a secure environment (e.g., gVisor or Firecracker).
- Test Case Mounting: Mounts test cases from S3 as Read-Only.
- Execution: Runs the untrusted code while intercepting syscalls.
- Result Store: Once execution finishes, the worker updates the submission status in PostgreSQL (e.g., AC, WA, TLE).
[!TIP] Why Async? Code execution takes time (1-10s). Keeping an HTTP connection open is brittle. Polling or WebSockets is better for the client.
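The pipeline above (queue → sandbox → result store) reduces to a simple worker loop. This is a skeleton sketch, not a production worker: `queue_pop`, `run_in_sandbox`, and `save_result` are injected stand-ins for Redis `BLPOP`, the Firecracker/gVisor execution step, and the PostgreSQL update, respectively.

```python
import json

def worker_loop(queue_pop, run_in_sandbox, save_result, max_jobs=None):
    """Skeleton of the asynchronous worker pattern described above.

    queue_pop()         -> raw JSON job string, or None when empty (BLPOP stand-in)
    run_in_sandbox(job) -> verdict string (AC / WA / TLE / RE)
    save_result(id, v)  -> persists the verdict (UPDATE ... stand-in)
    """
    handled = 0
    while max_jobs is None or handled < max_jobs:
        raw = queue_pop()            # blocks in the real system
        if raw is None:
            break
        job = json.loads(raw)
        verdict = run_in_sandbox(job)
        save_result(job["id"], verdict)
        handled += 1
    return handled

# Wiring it up with in-memory fakes:
jobs = ['{"id": "abc-123", "lang": "python3"}']
results = {}
worker_loop(
    queue_pop=lambda: jobs.pop(0) if jobs else None,
    run_in_sandbox=lambda job: "AC",
    save_result=results.__setitem__,
)
```

The key design point: the loop never talks to the API tier directly, so workers can be scaled (or crash) independently of the submission path.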
7. Deep Dive: Isolation Strategy (The Sandbox)
How do we prevent system("rm -rf /") or kernel exploits? This is handled by the Security Boundary shown in our architecture.
Level 1: Standard Containers (Docker)
- Tool: Docker, LXC.
- Pros: Fast boot (milliseconds).
- Cons: Shared Kernel. All containers share the host OS kernel. If a hacker finds a kernel vulnerability (e.g., Dirty COW), they can escape the container and take over the host.
- Verdict: Not secure enough for untrusted public code.
Level 2: Virtual Machines (VMs)
- Tool: AWS EC2, VMWare.
- Pros: Hardware-level virtualization (Hypervisor). Very secure.
- Cons: Slow boot time (minutes). Too heavy for a 2-second script.
Level 3: MicroVMs / User-Space Kernels (The Gold Standard)
This bridges the gap between VM security and Container speed.
- gVisor (Google):
- Intercepts syscalls in User Space.
- The “Guest” application talks to gVisor (Sentry), not the Host Kernel.
- Acts as a “security proxy” for syscalls.
- Firecracker (AWS Lambda):
- Lightweight KVM-based microVMs.
- Boots in < 125ms.
- Used by AWS Lambda and Fargate.
8. Defense in Depth (Specific Mitigations)
Even with MicroVMs, apply these Linux primitives:
A. Preventing “Fork Bombs” (cgroups)
[!NOTE] War Story: The “Infinite Loop” Bill In the early days of a popular competitive programming platform, a user accidentally submitted a C++ program with an infinite loop that allocated memory on each iteration. Because there were no strict memory cgroups configured on the worker node, the process eventually triggered the Linux OOM Killer, which indiscriminately killed the main task supervisor instead of the user’s process. The node stayed up but stopped reporting status, leaving the infinite loop to burn CPU for hours until an engineer manually intervened. This incident led to the immediate adoption of strict
`pids.max` and memory cgroup limits.
A Fork Bomb (while(1) fork()) crashes a server by exhausting the Process ID (PID) table.
- Solution: Control Groups (cgroups).
- Configure `pids.max = 64`. If the code tries to spawn the 65th process, the kernel blocks it.
- Also limit CPU shares and memory (OOM Killer).
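A supervisor would apply these limits by writing to cgroup-v2 control files before moving the sandbox process into the group. The sketch below only builds the file/value pairs (the `/sys/fs/cgroup/sandbox-<id>/` path layout is an assumption for illustration); a real supervisor would write them to disk.

```python
def cgroup_limits(pids_max=64, memory_bytes=256 * 1024 * 1024, cpu_quota_us=100_000):
    """Return the cgroup-v2 writes a supervisor would perform for one sandbox.

    In a real worker these would be written under a per-submission group,
    e.g. /sys/fs/cgroup/sandbox-<id>/ (path layout assumed here).
    """
    return {
        "pids.max": str(pids_max),            # fork bomb dies at process #65
        "memory.max": str(memory_bytes),      # OOM-kills the sandbox, not the host
        "cpu.max": f"{cpu_quota_us} 100000",  # quota/period: 100% of one core
    }

limits = cgroup_limits()
```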
B. Preventing Network Scans (Namespaces)
Users shouldn’t scan your internal AWS VPC.
- Solution: Network Namespaces.
- Run the container with no network access (`--network none`).
- Only map STDIN/STDOUT.
C. Preventing File System Damage (Seccomp)
Users shouldn’t read /etc/passwd.
- Solution:
- Mount Root FS as Read-Only.
- Use Seccomp (Secure Computing Mode) to whitelist only necessary syscalls (`read`, `write`, `exit`). Block `socket` and `execve` (except via strictly controlled paths).
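A Docker-style seccomp profile for this whitelist can be sketched as data. The syscall list below is deliberately minimal and illustrative; a real interpreter needs many more (`mmap`, `brk`, `futex`, ...), so treat this as a sketch of the deny-by-default shape, not a drop-in profile.

```python
# Docker-style seccomp profile expressed as a Python dict.
# Illustrative only: a real Python runtime needs a much longer whitelist.
SECCOMP_PROFILE = {
    "defaultAction": "SCMP_ACT_ERRNO",  # deny-by-default: unlisted syscalls fail
    "syscalls": [
        {
            "names": ["read", "write", "exit", "exit_group"],
            "action": "SCMP_ACT_ALLOW",
        },
        # note: no "socket", no unrestricted "execve"
    ],
}

def is_allowed(profile, syscall):
    """Check whether a syscall would pass the whitelist."""
    return any(
        syscall in rule["names"] and rule["action"] == "SCMP_ACT_ALLOW"
        for rule in profile["syscalls"]
    )
```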
9. Data Partitioning & Sharding
We generate millions of submissions. A single DB won’t hold up.
Sharding Strategy: Shard by submission_id
- Shard by `user_id`: Good for "Show me all my submissions". Bad for global analytics or if one user spams.
- Shard by `submission_id`: Even distribution, but "Show me my submissions" requires Scatter-Gather.
- Decision: LeetCode is Write-Heavy during contests, so we prioritize Write Throughput. Sharding by `submission_id` (or using a dedicated high-write store like Cassandra/DynamoDB) is preferred. For user history, we can maintain a secondary index.
10. Interactive Decision Visualizer: The Secure Pipeline
This demo visualizes how different layers of defense block different types of attacks. Select an Attack Vector and see which layer catches it.
[!TIP] Try it yourself: Click “Fork Bomb” or “File Deletion” to see how Linux cgroups and Seccomp filters block these attacks in real-time.
11. System Walkthrough: The Life of a Submission
Let’s trace a user submitting Python code that tries to access the network.
Step 1: Submission
- User sends code via `POST /submissions`:

```json
{ "code": "import socket; s = socket.socket(); s.connect(('google.com', 80))", "lang": "python3" }
```

- API Gateway generates `submission_id: "abc-123"` and pushes to Redis: `RPUSH submission_queue "{\"id\":\"abc-123\", \"lang\":\"python3\", ...}"`
Step 2: Worker Processing
- Worker Node (Golang) pulls the job: `BLPOP submission_queue 0`.
- It launches a Firecracker MicroVM with restricted arguments:

```shell
# Conceptual command
firecracker-run \
  --kernel vmlinux \
  --rootfs python3-rootfs.ext4 \
  --network none \      # <-- network isolation
  --cpu-template T2 \
  --memory 128M
```
Step 3: Execution & Interception
- The Python code runs inside the MicroVM.
- It tries to call the `connect()` syscall.
- The kernel (inside the MicroVM) checks the network namespace and sees no network interfaces (only loopback).
- The syscall fails with `ENETUNREACH` (Network is unreachable).
Step 4: Result Collection
- The worker captures `STDERR`: `OSError: [Errno 101] Network is unreachable`.
- It writes the result to Postgres: `UPDATE submissions SET status='RUNTIME_ERROR', stderr='Network unreachable...' WHERE id='abc-123';`
12. Requirements Traceability Matrix
| Requirement | Architectural Solution |
|---|---|
| Code Isolation | gVisor / Firecracker (MicroVMs) prevent kernel sharing. |
| Resource Limits | cgroups enforce CPU, Memory, and PID limits. |
| Network Security | Network Namespaces (--network none) block internet access. |
| File System Security | Read-Only RootFS + Seccomp whitelist prevents rm -rf. |
| Scalability | Redis Job Queue decouples API from Workers. Auto-scaling workers. |
| Concurrency | Firecracker boots in <125ms, allowing high density (thousands per node). |
13. Follow-Up Questions: The Interview Gauntlet
I. Security & Isolation
- Why is Docker not enough? Docker shares the Host Kernel. A kernel exploit (e.g., Dirty Pipe) allows root access to the host.
- Explain Seccomp. It stands for Secure Computing. It's a BPF filter that whitelists syscalls. If a process calls `socket()` and it's not whitelisted, the kernel kills the process.
- How to prevent Infinite Loops? Use `setrlimit(RLIMIT_CPU)` in the runner code plus a hard timeout (SIGKILL) from the worker supervisor after 2 seconds.
- How to prevent memory exhaustion? A `cgroups` memory limit. The OOM Killer will kill the specific container, not the worker node.
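The `setrlimit` + supervisor-timeout combination can be sketched with the standard library. A minimal POSIX-only sketch: the `limit_child` hook runs in the child between `fork()` and `exec()`; the kernel sends SIGXCPU at the soft CPU limit and SIGKILL at the hard limit, while the `timeout=` argument is the supervisor's wall-clock backstop.

```python
import resource
import subprocess
import sys

def limit_child():
    """Runs in the child between fork() and exec() (POSIX only).

    A hard RLIMIT_CPU means an infinite loop is killed by the kernel
    even if the supervisor's wall-clock timeout somehow misfires.
    """
    resource.setrlimit(resource.RLIMIT_CPU, (2, 2))  # 2s of CPU time

# Run an (here: harmless) untrusted script under both limits.
proc = subprocess.run(
    [sys.executable, "-c", "print('ok')"],
    preexec_fn=limit_child,
    capture_output=True,
    text=True,
    timeout=5,  # belt-and-braces wall-clock kill from the supervisor
)
```

CPU limits alone don't catch code that sleeps forever (it burns no CPU), which is exactly why the wall-clock timeout is also needed.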
II. Scalability
- What if the Queue backs up? Auto-scale the Worker Group based on `LLEN submission_queue`. If the queue exceeds 1000, add 10 nodes.
- Handling Large Outputs: If a user prints 1GB of text, it fills the disk. Limit `STDOUT` capture to 100KB and truncate the rest.
- Shard Strategy: Shard the DB by `submission_id`. No need for complex cross-shard joins.
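The queue-depth scaling rule above can be sketched as a pure function. The "scale the batch with backlog depth" refinement is an assumption layered on the stated rule (queue > 1000 → add 10 nodes); the exact thresholds are illustrative.

```python
import math

def workers_to_add(queue_len, threshold=1000, batch=10):
    """Scaling rule from above: if LLEN(submission_queue) exceeds the
    threshold, add workers in batches of 10, scaled by how deep the
    backlog is. Thresholds are illustrative, not tuned values.
    """
    if queue_len <= threshold:
        return 0
    return batch * math.ceil((queue_len - threshold) / threshold)

print(workers_to_add(500))    # 0  (queue healthy)
print(workers_to_add(1500))   # 10 (one batch)
print(workers_to_add(3500))   # 30 (deep backlog -> three batches)
```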
III. Operational Excellence
- How to update the runtime (e.g., Python 3.9 → 3.10)? Build a new RootFS image. Rolling update the workers to use the new image.
- Malicious Users: Rate limit by User ID. If a user triggers Security Violations repeatedly, ban the account.
14. Summary: The Whiteboard Strategy
If asked to design LeetCode, draw this 4-Quadrant Layout:
1. Requirements
- Func: Execute Code, Feedback, Limits.
- Non-Func: Security (Sandbox), Speed (<2s).
- Scale: 100 QPS (Burst).
2. Architecture
```
[API] → [Redis Queue]
          ↓
   [Worker Group]
[Firecracker VM (Seccomp/NS)]
          ↓
        [DB]
```
* Async Worker: Decouples execution.
* Firecracker: MicroVM Isolation.
3. Data & API
DB: Submissions(id, user, status, result)
S3: Test Cases (Read-Only)
4. Security Layers
- Network: Namespace (`--net none`).
- FS: Read-Only RootFS.
- Syscalls: Seccomp Whitelist.
- Kernel: gVisor / MicroVM.