Design an Online Judge

Submission Pipeline, Sandbox Isolation, Test Case Execution, and Plagiarism Detection

An online judge (e.g., LeetCode, Codeforces, HackerRank) accepts code submissions, compiles and runs them in a sandboxed environment against hidden test cases, and returns a verdict. Core challenges: secure sandbox isolation (preventing malicious code from escaping), fair resource limits (CPU time, memory), high-throughput judging during contests with thousands of concurrent users, and plagiarism detection across submissions. A well-designed system handles 10K+ submissions/minute during peak contests with sub-second queue-to-verdict latency.

Submission Pipeline

A submission flows through multiple stages: queued, picked up by a worker, sandbox created, compiled, test cases executed, and verdict returned. Click "Submit Solution" to watch the pipeline animate. Multiple submissions can be in flight simultaneously.

Queue
Worker
Sandbox
Compile
Run Tests
Verdict
Queued0
Processing0
Completed0
Avg Time--

Sandbox Isolation Strategies

The sandbox must prevent submitted code from accessing the host system, network, or other submissions. Compare three isolation approaches and configure resource limits.

Docker Containers

Isolation: Namespace + cgroup-based. Startup: ~500ms. Overhead: Low. Uses pre-built images per language. Good balance of security and speed. Vulnerable to kernel exploits without additional hardening.

Security: 7/10

gVisor / Firecracker

Isolation: User-space kernel (gVisor) or microVM (Firecracker). Startup: ~125ms (Firecracker). Overhead: Medium. Intercepts all syscalls. Used by Google and AWS Lambda. Strongest isolation.

Security: 9.5/10

seccomp-bpf

Isolation: Syscall filtering via BPF rules. Startup: ~5ms. Overhead: Minimal. Blocks dangerous syscalls (fork, exec, network). Lightweight but must be paired with chroot/namespaces for full isolation.

Security: 5.5/10

Resource Limits

CPU Usage
0%
Memory Usage
0 MB
Disk I/O
0 KB
Network
BLOCKED

Test Case Execution Timeline

Watch test cases execute sequentially. Each test case compares the actual output against the expected output and produces a verdict: AC (Accepted), WA (Wrong Answer), TLE (Time Limit Exceeded), RE (Runtime Error), or MLE (Memory Limit Exceeded).

#1input: [2,7,11,15], t=9expected: [0,1]actual: --PENDING--
#2input: [3,2,4], t=6expected: [1,2]actual: --PENDING--
#3input: [3,3], t=6expected: [0,1]actual: --PENDING--
#4input: [1], t=2expected: []actual: --PENDING--
#5input: 10^5 elementsexpected: [4999,5000]actual: --PENDING--
#6input: 10^6 elementsexpected: [0,999999]actual: --PENDING--
#7input: all negativesexpected: [2,4]actual: --PENDING--
#8input: duplicatesexpected: [0,3]actual: --PENDING--
Passed0
Failed0
First Failure--
Total Time--

Capacity Estimation

Estimate the infrastructure requirements for handling peak contest load.

Peak submissions/sec--
CPU cores needed--
Queue depth--
Storage/day--

Plagiarism Detection

After a contest, detect code plagiarism by comparing submissions. Three main approaches, each with different accuracy and performance trade-offs.

Token-Based (MOSS)

Converts code to a token stream (ignoring variable names, whitespace, comments). Uses Winnowing algorithm to select fingerprints. Compares fingerprint sets between pairs. Fast and robust against simple renaming. Used by Stanford MOSS.

AST Comparison

Parses code into an Abstract Syntax Tree. Compares tree structures using tree edit distance. Resilient to variable renaming, reordering, and formatting changes. Higher accuracy but computationally expensive: O(n^2) per pair.

N-gram Fingerprinting

Extracts overlapping n-grams of tokens. Hashes each n-gram and selects a subset (e.g., min-hash). Estimates Jaccard similarity via LSH (Locality-Sensitive Hashing). Scales to thousands of submissions with approximate matching.

Fingerprint Comparison Visualizer

Submission A
def two_sum(nums, target):
    seen = {}
    for i, n in enumerate(nums):
        comp = target - n
        if comp in seen:
            return [seen[comp], i]
        seen[n] = i
Similarity
0%
Submission B
def solve(a, t):
    d = {}
    for idx, val in enumerate(a):
        diff = t - val
        if diff in d:
            return [d[diff], idx]
        d[val] = idx

Architecture Overview

High-level architecture of an online judge system with auto-scaled judge workers and separated storage layers.

Web Frontend
API Server
Submission Queue
(Kafka / RabbitMQ)
Judge Workers
(auto-scaled)
Sandbox
(resource limits)
Sandbox
Result
Result DB
(PostgreSQL)
Leaderboard Service
Problem DB
Test Case Storage
(S3 / mounted volume)
Code Storage
(S3 / blob)

Key Design Decisions

Critical trade-offs when designing an online judge system.

Synchronous vs Asynchronous Judging

Synchronous
  • User waits for result
  • Simple implementation
  • Blocks worker threads
  • Doesn't scale for contests
vs
Asynchronous (Preferred)
  • Submit → queue → poll/websocket
  • Workers scale independently
  • Handles burst traffic
  • Retry on worker failure

Language Support Strategy

Pre-build Docker images per language (Python, C++, Java, Go, Rust). Each image has the compiler/runtime, resource limit tooling, and a read-only test case mount. Use a language registry mapping language ID to image tag, compile command, and run command.

Judging Modes

Standard judge: exact string match of output. Special judge: custom checker program (e.g., floating point tolerance, multiple valid answers). Interactive judge: judge and solution communicate via stdin/stdout (e.g., binary search game). Each mode requires different sandbox orchestration.

Worker Auto-Scaling

Scale judge workers based on queue depth. Rule: if queue depth > 2 × worker count for > 30s, add workers. Scale down when idle for > 5 min. Use Kubernetes HPA with custom metrics from the message queue. During contests, pre-warm workers 10 min before start.

Result Caching

Cache results keyed by hash(code + problem_id + language). If a user resubmits identical code, return cached verdict instantly. Saves significant compute during contests where users spam-submit. Invalidate cache when test cases are updated.

Online judges combine systems programming, security, and distributed systems. The sandbox is the most critical component β€” a single escape could compromise the entire infrastructure.