Design an Online Judge
Submission Pipeline, Sandbox Isolation, Test Case Execution, and Plagiarism Detection
An online judge (e.g., LeetCode, Codeforces, HackerRank) accepts code submissions, compiles and runs them in a sandboxed environment against hidden test cases, and returns a verdict. Core challenges: secure sandbox isolation (preventing malicious code from escaping), fair resource limits (CPU time, memory), high-throughput judging during contests with thousands of concurrent users, and plagiarism detection across submissions. A well-designed system handles 10K+ submissions/minute during peak contests with sub-second queue-to-verdict latency.
Submission Pipeline
A submission flows through multiple stages: queued, picked up by a worker, sandbox created, compiled, test cases executed, and verdict returned. Click "Submit Solution" to watch the pipeline animate. Multiple submissions can be in flight simultaneously.
Sandbox Isolation Strategies
The sandbox must prevent submitted code from accessing the host system, network, or other submissions. Compare three isolation approaches and configure resource limits.
Docker Containers
Isolation: Namespace + cgroup-based. Startup: ~500ms. Overhead: Low. Uses pre-built images per language. Good balance of security and speed. Vulnerable to kernel exploits without additional hardening.
gVisor / Firecracker
Isolation: User-space kernel (gVisor) or microVM (Firecracker). Startup: ~125ms (Firecracker). Overhead: Medium. Intercepts all syscalls. Used by Google and AWS Lambda. Strongest isolation.
seccomp-bpf
Isolation: Syscall filtering via BPF rules. Startup: ~5ms. Overhead: Minimal. Blocks dangerous syscalls (fork, exec, network). Lightweight but must be paired with chroot/namespaces for full isolation.
Resource Limits
Test Case Execution Timeline
Watch test cases execute sequentially. Each test case compares the actual output against the expected output and produces a verdict: AC (Accepted), WA (Wrong Answer), TLE (Time Limit Exceeded), RE (Runtime Error), or MLE (Memory Limit Exceeded).
Capacity Estimation
Estimate the infrastructure requirements for handling peak contest load.
Plagiarism Detection
After a contest, detect code plagiarism by comparing submissions. Three main approaches, each with different accuracy and performance trade-offs.
Token-Based (MOSS)
Converts code to a token stream (ignoring variable names, whitespace, comments). Uses Winnowing algorithm to select fingerprints. Compares fingerprint sets between pairs. Fast and robust against simple renaming. Used by Stanford MOSS.
AST Comparison
Parses code into an Abstract Syntax Tree. Compares tree structures using tree edit distance. Resilient to variable renaming, reordering, and formatting changes. Higher accuracy but computationally expensive: O(n^2) per pair.
N-gram Fingerprinting
Extracts overlapping n-grams of tokens. Hashes each n-gram and selects a subset (e.g., min-hash). Estimates Jaccard similarity via LSH (Locality-Sensitive Hashing). Scales to thousands of submissions with approximate matching.
Fingerprint Comparison Visualizer
def two_sum(nums, target):
seen = {}
for i, n in enumerate(nums):
comp = target - n
if comp in seen:
return [seen[comp], i]
seen[n] = i def solve(a, t):
d = {}
for idx, val in enumerate(a):
diff = t - val
if diff in d:
return [d[diff], idx]
d[val] = idx Architecture Overview
High-level architecture of an online judge system with auto-scaled judge workers and separated storage layers.
(Kafka / RabbitMQ)
(auto-scaled)
(resource limits)
(PostgreSQL)
(S3 / mounted volume)
(S3 / blob)
Key Design Decisions
Critical trade-offs when designing an online judge system.
Synchronous vs Asynchronous Judging
- User waits for result
- Simple implementation
- Blocks worker threads
- Doesn't scale for contests
- Submit → queue → poll/websocket
- Workers scale independently
- Handles burst traffic
- Retry on worker failure
Language Support Strategy
Pre-build Docker images per language (Python, C++, Java, Go, Rust). Each image has the compiler/runtime, resource limit tooling, and a read-only test case mount. Use a language registry mapping language ID to image tag, compile command, and run command.
Judging Modes
Standard judge: exact string match of output. Special judge: custom checker program (e.g., floating point tolerance, multiple valid answers). Interactive judge: judge and solution communicate via stdin/stdout (e.g., binary search game). Each mode requires different sandbox orchestration.
Worker Auto-Scaling
Scale judge workers based on queue depth. Rule: if queue depth > 2 × worker count for > 30s, add workers. Scale down when idle for > 5 min. Use Kubernetes HPA with custom metrics from the message queue. During contests, pre-warm workers 10 min before start.
Result Caching
Cache results keyed by hash(code + problem_id + language). If a user resubmits identical code, return cached verdict instantly. Saves significant compute during contests where users spam-submit. Invalidate cache when test cases are updated.