Design an Online Judge

Submission Pipeline, Sandbox Isolation, Test Case Execution, and Plagiarism Detection

An online judge (e.g., LeetCode, Codeforces, HackerRank) accepts code submissions, compiles and runs them in a sandboxed environment against hidden test cases, and returns a verdict. Core challenges: secure sandbox isolation (preventing malicious code from escaping), fair resource limits (CPU time, memory), high-throughput judging during contests with thousands of concurrent users, and plagiarism detection across submissions. A well-designed system handles 10K+ submissions/minute during peak contests with sub-second queue-to-verdict latency.

Submission Pipeline

A submission flows through multiple stages: queued, picked up by a worker, sandbox created, compiled, test cases executed, and verdict returned. Click "Submit Solution" to watch the pipeline animate. Multiple submissions can be in flight simultaneously.

Queue

Worker

Sandbox

Compile

Run Tests

Verdict

Queued0

Processing0

Completed0

Avg Time--

Sandbox Isolation Strategies

The sandbox must prevent submitted code from accessing the host system, network, or other submissions. Compare three isolation approaches and configure resource limits.

Docker Containers

Isolation: Namespace + cgroup-based. Startup: ~500ms. Overhead: Low. Uses pre-built images per language. Good balance of security and speed. Vulnerable to kernel exploits without additional hardening.

Security: 7/10

gVisor / Firecracker

Isolation: User-space kernel (gVisor) or microVM (Firecracker). Startup: ~125ms (Firecracker). Overhead: Medium. Intercepts all syscalls. Used by Google and AWS Lambda. Strongest isolation.

Security: 9.5/10

seccomp-bpf

Isolation: Syscall filtering via BPF rules. Startup: ~5ms. Overhead: Minimal. Blocks dangerous syscalls (fork, exec, network). Lightweight but must be paired with chroot/namespaces for full isolation.

Security: 5.5/10

Resource Limits

Time limit: 5 seconds

Memory limit: 256 MB

CPU Usage

Memory Usage

0 MB

Disk I/O

0 KB

Network

BLOCKED

Test Case Execution Timeline

Watch test cases execute sequentially. Each test case compares the actual output against the expected output and produces a verdict: AC (Accepted), WA (Wrong Answer), TLE (Time Limit Exceeded), RE (Runtime Error), or MLE (Memory Limit Exceeded).

#1input: [2,7,11,15], t=9expected: [0,1]actual: --PENDING--

#2input: [3,2,4], t=6expected: [1,2]actual: --PENDING--

#3input: [3,3], t=6expected: [0,1]actual: --PENDING--

#4input: [1], t=2expected: []actual: --PENDING--

#5input: 10^5 elementsexpected: [4999,5000]actual: --PENDING--

#6input: 10^6 elementsexpected: [0,999999]actual: --PENDING--

#7input: all negativesexpected: [2,4]actual: --PENDING--

#8input: duplicatesexpected: [0,3]actual: --PENDING--

Passed0

Failed0

First Failure--

Total Time--

Capacity Estimation

Estimate the infrastructure requirements for handling peak contest load.

Concurrent contests: 5

Users per contest: 2,000

Submissions per user per hour: 10

Avg test cases per problem: 20

Peak submissions/sec--

CPU cores needed--

Queue depth--

Storage/day--

Plagiarism Detection

After a contest, detect code plagiarism by comparing submissions. Three main approaches, each with different accuracy and performance trade-offs.

Token-Based (MOSS)

Converts code to a token stream (ignoring variable names, whitespace, comments). Uses Winnowing algorithm to select fingerprints. Compares fingerprint sets between pairs. Fast and robust against simple renaming. Used by Stanford MOSS.

AST Comparison

Parses code into an Abstract Syntax Tree. Compares tree structures using tree edit distance. Resilient to variable renaming, reordering, and formatting changes. Higher accuracy but computationally expensive: O(n^2) per pair.

N-gram Fingerprinting

Extracts overlapping n-grams of tokens. Hashes each n-gram and selects a subset (e.g., min-hash). Estimates Jaccard similarity via LSH (Locality-Sensitive Hashing). Scales to thousands of submissions with approximate matching.

Fingerprint Comparison Visualizer

Submission A

def two_sum(nums, target):
    seen = {}
    for i, n in enumerate(nums):
        comp = target - n
        if comp in seen:
            return [seen[comp], i]
        seen[n] = i

Similarity

Submission B

def solve(a, t):
    d = {}
    for idx, val in enumerate(a):
        diff = t - val
        if diff in d:
            return [d[diff], idx]
        d[val] = idx

Architecture Overview

High-level architecture of an online judge system with auto-scaled judge workers and separated storage layers.

Web Frontend

→

API Server

→

Submission Queue
(Kafka / RabbitMQ)

→

Judge Workers
(auto-scaled)

→

Sandbox
(resource limits)

Sandbox

→

Result

→

Result DB
(PostgreSQL)

→

Leaderboard Service

Problem DB

Test Case Storage
(S3 / mounted volume)

Code Storage
(S3 / blob)

Key Design Decisions

Critical trade-offs when designing an online judge system.

Synchronous vs Asynchronous Judging Synchronous User waits for result
Simple implementation
Blocks worker threads
Doesn't scale for contests
 
vs
 Asynchronous (Preferred) Submit → queue → poll/websocket
Workers scale independently
Handles burst traffic
Retry on worker failure
 

Language Support Strategy

Pre-build Docker images per language (Python, C++, Java, Go, Rust). Each image has the compiler/runtime, resource limit tooling, and a read-only test case mount. Use a language registry mapping language ID to image tag, compile command, and run command.

Judging Modes

Standard judge: exact string match of output. Special judge: custom checker program (e.g., floating point tolerance, multiple valid answers). Interactive judge: judge and solution communicate via stdin/stdout (e.g., binary search game). Each mode requires different sandbox orchestration.

Worker Auto-Scaling

Scale judge workers based on queue depth. Rule: if queue depth > 2 × worker count for > 30s, add workers. Scale down when idle for > 5 min. Use Kubernetes HPA with custom metrics from the message queue. During contests, pre-warm workers 10 min before start.

Result Caching

Cache results keyed by hash(code + problem_id + language). If a user resubmits identical code, return cached verdict instantly. Saves significant compute during contests where users spam-submit. Invalidate cache when test cases are updated.