Distributed File System
A distributed file system gives the abstraction of one giant disk over a fleet of commodity servers. The original GFS/HDFS architecture — large chunks, central metadata, three-way replication — defined a generation. Modern object stores (S3, MinIO, Ceph) trade the file system semantics for flat key spaces and erasure coding. The interesting tradeoffs are around metadata centralization, small-file overhead, and the replication-vs-erasure economics.
Architecture
Capacity Estimation
| Metric | Value | Notes |
|---|---|---|
| Cluster capacity | ~100 PB | 1 K nodes × 100 TB |
| Chunk size | 64–128 MB (HDFS), 4 MB (S3 part) | tradeoff with metadata |
| Files | ~1 B | NameNode RAM dominates |
| NameNode RAM | ~150 GB | ~150 B/file metadata |
| Replication factor | 3 (hot), EC 6+3 (cold) | 1.5x storage with EC vs 3x |
| Read throughput | ~1 GB/s/client | parallel chunk fetch |
| Recovery from disk loss | ~30 min | 10 TB disk, re-replicated in parallel from many nodes (one 10 GbE link alone would need ~2.2 h) |
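The table can be cross-checked with a few lines of arithmetic. This is a minimal sketch using decimal units (1 TB = 10^12 bytes); the variable names are illustrative, and the recovery line computes only the single-link copy time, which is an assumption used as a reference point rather than a model of a real rebuild.

```python
# Back-of-envelope check of the capacity table (decimal units throughout).
GB = 10**9
TB = 10**12
PB = 10**15

nodes = 1_000
per_node_bytes = 100 * TB
cluster_bytes = nodes * per_node_bytes           # raw cluster capacity

files = 1_000_000_000
metadata_per_file = 150                          # bytes per file, from the table
namenode_ram_bytes = files * metadata_per_file


def storage_overhead(data_units: int, redundancy_units: int) -> float:
    """Raw bytes stored per logical byte."""
    return (data_units + redundancy_units) / data_units


replication_overhead = storage_overhead(1, 2)    # one original + two extra copies
ec_overhead = storage_overhead(6, 3)             # 6 data + 3 parity chunks

# Time to copy a 10 TB disk over ONE 10 GbE link (1.25 GB/s), in hours.
link_bytes_per_s = 10 * 10**9 / 8
single_link_hours = 10 * TB / link_bytes_per_s / 3600

print(cluster_bytes / PB, "PB raw")
print(namenode_ram_bytes / GB, "GB of NameNode metadata")
print(replication_overhead, "x vs", ec_overhead, "x")
print(round(single_link_hours, 2), "h over a single link")
```

The last figure is the single-link time of about 8,000 s; reaching ~30 min (1,800 s) requires the copy to be spread over roughly five or more links working in parallel.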
Chunked Storage
Files are split into chunks (HDFS blocks) of 64–128 MB. The choice of chunk size is a fundamental tradeoff:
- Larger chunks: fewer chunks per file, less metadata, better sequential throughput. No help for small files: a 1 KB file uses only 1 KB of disk, but it still costs a full file and chunk entry in the metadata service.
- Smaller chunks: metadata explodes; NameNode RAM becomes the limit.
HDFS picked 128 MB as the sweet spot for analytics workloads (multi-GB Parquet files). S3 internally splits objects into ~4 MB parts but exposes a flat key space, hiding the chunking from clients.
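To make the metadata side of the tradeoff concrete, the sketch below counts chunk entries for one file at different chunk sizes and maps a byte offset to its chunk. It uses binary units (1 MB = 2^20 bytes) and hypothetical helper names; it is an illustration of the arithmetic, not any system's actual API.

```python
MB = 1024 * 1024
GB = 1024 * MB


def chunk_count(file_size: int, chunk_size: int) -> int:
    """Chunks needed to hold file_size bytes (integer ceiling division)."""
    return -(-file_size // chunk_size)


def locate(offset: int, chunk_size: int) -> tuple[int, int]:
    """Map a byte offset in the file to (chunk index, offset inside that chunk)."""
    return divmod(offset, chunk_size)


file_size = 10 * GB
for chunk_mb in (4, 64, 128):
    n = chunk_count(file_size, chunk_mb * MB)
    print(f"{chunk_mb:>3} MB chunks -> {n} chunk entries for a 10 GB file")

# A 1 KB file still needs one chunk entry, whatever the chunk size.
print(chunk_count(1024, 128 * MB))
```

Going from 128 MB to 4 MB chunks multiplies the number of entries the metadata service must track for the same file by 32.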
Metadata Service: NameNode / Master
One central service holds the filesystem tree and the file→chunk map. Operations: create, open, list, delete, rename. The Master does not serve data; it returns chunk locations and the client streams data directly from DataNodes.
- State kept entirely in RAM for low-latency lookups; persisted via an edit log to disk + snapshot.
- HA: a standby NameNode receives the edit log via JournalNodes (a Paxos/Raft quorum); failover takes ~30 s in HDFS's standard configuration.
- Federation: multiple NameNodes each owning a namespace partition (e.g., /users on NN1, /data on NN2). Mitigates the single-master scale ceiling.
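Federation implies that something must route each path to the NameNode that owns it. A minimal sketch of that routing, assuming a static prefix table (the `MOUNT_TABLE` dict and `route` function are hypothetical names, not an HDFS API):

```python
# Hypothetical mount table: namespace prefix -> NameNode that owns it.
MOUNT_TABLE = {"/users": "NN1", "/data": "NN2"}


def route(path: str, mounts: dict[str, str] = MOUNT_TABLE) -> str:
    """Return the NameNode whose prefix is the longest match on whole path components."""
    best = None
    for prefix in mounts:
        owns = path == prefix or path.startswith(prefix.rstrip("/") + "/")
        if owns and (best is None or len(prefix) > len(best)):
            best = prefix
    if best is None:
        raise KeyError(f"no NameNode owns {path!r}")
    return mounts[best]


print(route("/users/alice/notes.txt"))
print(route("/data/events/2024/part-0.parquet"))
```

Matching on whole components matters: `/userspace/x` must not be routed to the owner of `/users`. An operation that spans two partitions, such as a rename from `/users` to `/data`, cannot be handled by a single NameNode in this scheme.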
The single-master design is the original sin and the strength: it enables strong consistency on the namespace, but caps cluster size at ~10K nodes / 1B files. S3 sidestepped this by going flat-key (no directories) and partitioning the index by key prefix.
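The division of labor described in this section (namespace in RAM, mutations persisted through an edit log, chunk locations handed to clients) can be sketched as a toy in-process model. Everything here is an illustrative assumption: the `MiniMaster` class, its method names, and the JSON log format are invented for the example and do not mirror HDFS internals.

```python
import json
from dataclasses import dataclass, field


@dataclass
class MiniMaster:
    """Toy metadata service: path -> chunk ids, chunk id -> DataNode names."""
    files: dict = field(default_factory=dict)
    locations: dict = field(default_factory=dict)
    edit_log: list = field(default_factory=list)   # stands in for the on-disk log

    def _apply(self, entry: dict) -> None:
        if entry["op"] == "create":
            self.files[entry["path"]] = list(entry["chunks"])
        elif entry["op"] == "delete":
            self.files.pop(entry["path"], None)
        elif entry["op"] == "rename":
            self.files[entry["dst"]] = self.files.pop(entry["src"])

    def mutate(self, **entry) -> None:
        # Log first, then apply, so replaying the log reproduces the namespace.
        self.edit_log.append(json.dumps(entry))
        self._apply(entry)

    def report(self, chunk_id: str, datanode: str) -> None:
        # Locations are learned from DataNode reports, not from the edit log.
        self.locations.setdefault(chunk_id, []).append(datanode)

    def open_file(self, path: str) -> list:
        """Return chunk locations only; file data never flows through the master."""
        return [(c, self.locations.get(c, [])) for c in self.files[path]]

    @classmethod
    def replay(cls, edit_log: list) -> "MiniMaster":
        """Rebuild the namespace on a standby from a copy of the edit log."""
        standby = cls()
        for line in edit_log:
            standby._apply(json.loads(line))
        standby.edit_log = list(edit_log)
        return standby


master = MiniMaster()
master.mutate(op="create", path="/data/a.parquet", chunks=["c1", "c2"])
master.report("c1", "dn1")
master.report("c1", "dn2")
master.mutate(op="rename", src="/data/a.parquet", dst="/data/b.parquet")
standby = MiniMaster.replay(master.edit_log)
print(master.open_file("/data/b.parquet"))
print(standby.files == master.files)
```

Rename is a single dictionary update here, which is the strong-consistency benefit of a single master; the cost is that every file in the cluster must fit in this one process's memory.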
Three-way Replication
The default GFS/HDFS placement: three copies, two racks. Replica 1 on the writer's rack (locality), replicas 2 and 3 on a remote rack (rack-failure survivability), replica 3 on a different node from replica 2.
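A minimal sketch of this placement rule follows. It assumes the writer is itself a storage node and puts replica 1 on that node, and it represents the cluster as a plain dict of rack name to node names; both the `place_replicas` function and the topology format are hypothetical.

```python
import random


def place_replicas(writer: str, topology: dict[str, list[str]], rng=random) -> list[str]:
    """Pick three nodes on two racks: the writer, plus two distinct nodes on one remote rack."""
    local_rack = next(rack for rack, nodes in topology.items() if writer in nodes)
    # Only remote racks with at least two nodes can host replicas 2 and 3.
    remote_racks = [
        rack for rack, nodes in topology.items()
        if rack != local_rack and len(nodes) >= 2
    ]
    if not remote_racks:
        raise ValueError("need a remote rack with at least two nodes")
    remote = rng.choice(remote_racks)
    second, third = rng.sample(topology[remote], 2)   # two different nodes
    return [writer, second, third]


topology = {"rack1": ["a", "b"], "rack2": ["c", "d", "e"], "rack3": ["f"]}
print(place_replicas("a", topology, random.Random(0)))
```

Losing either rack leaves at least one copy alive, while two of the three transfers stay inside a single rack, which limits cross-rack traffic during the write.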
Reads pick the closest replica (rack-aware client). Writes are pipelined: the client sends to replica 1, which forwards to 2, which forwards to 3, and the ack returns once all three commit. Failure of one replica during a write doesn't fail the write: the failed node is dropped from the pipeline, the write completes on the survivors, and the master later re-replicates the under-replicated chunk.
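The write pipeline can be modeled in-process as a chain of store-and-forward calls. This is a toy sketch with no networking: the `DataNode` class, the `healthy` flag, and the way a failed node is simply bypassed are simplifying assumptions, not the actual recovery protocol of GFS or HDFS.

```python
class DataNode:
    def __init__(self, name: str, healthy: bool = True):
        self.name = name
        self.healthy = healthy
        self.chunks = {}


def forward(chunk_id: str, data: bytes, pipeline: list) -> list:
    """Store on pipeline[0], forward to the rest; returns names of nodes that stored the chunk."""
    if not pipeline:
        return []
    head, rest = pipeline[0], pipeline[1:]
    if not head.healthy:
        # Bypass the failed node: the upstream sender connects to the next one.
        return forward(chunk_id, data, rest)
    head.chunks[chunk_id] = data
    # Acks flow back upstream as the recursive calls return.
    return [head.name] + forward(chunk_id, data, rest)


def write_chunk(chunk_id: str, data: bytes, pipeline: list, replication: int = 3):
    """Return (acked node names, whether the master must re-replicate)."""
    acked = forward(chunk_id, data, pipeline)
    if not acked:
        raise IOError("no replica accepted the chunk")
    return acked, len(acked) < replication


nodes = [DataNode("dn1"), DataNode("dn2", healthy=False), DataNode("dn3")]
acked, needs_rereplication = write_chunk("c1", b"payload", nodes)
print(acked, needs_rereplication)
```

The client uploads the data once; each node's outbound link carries one copy, so no single sender has to push all three copies itself.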
Storage cost: 3× raw. For petabyte-scale archives, this becomes the dominant infrastructure expense.
The Small-file Problem
1 billion 1 KB files = 1 TB of data + 150 GB of NameNode metadata. The metadata service is the bottleneck, not the disks. Mitigations:
- HAR / SequenceFile / Avro: pack many small files into one large container. HAR and SequenceFile ship with Hadoop; Avro is a separate container format commonly stored on HDFS.
- HBase / KV store on top: small records belong in a key-value store, not a file system. Use HDFS for blobs > 1 MB only.
- Object store flat namespace: S3 partitions metadata by key prefix, so small-object overhead is a per-object cost rather than centralized RAM. Better suited to many-small-object workloads.
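The container approach in the first mitigation boils down to concatenating payloads and keeping an offset index. The sketch below shows only that idea; the `pack` and `read_packed` helpers are hypothetical and do not reproduce the on-disk format of HAR, SequenceFile, or Avro.

```python
import io


def pack(files: dict[str, bytes]) -> tuple[bytes, dict[str, tuple[int, int]]]:
    """Concatenate small files into one blob; the index maps name -> (offset, length)."""
    blob = io.BytesIO()
    index = {}
    for name, payload in files.items():
        index[name] = (blob.tell(), len(payload))
        blob.write(payload)
    return blob.getvalue(), index


def read_packed(blob: bytes, index: dict[str, tuple[int, int]], name: str) -> bytes:
    """Fetch one original file back out of the container."""
    offset, length = index[name]
    return blob[offset:offset + length]


small_files = {"a.json": b"hello", "b.json": b"", "c.json": b"world!"}
blob, index = pack(small_files)
print(len(blob), index)
print(read_packed(blob, index, "c.json"))
```

The file system now tracks one large file instead of three small ones; the per-file index lives inside the container (or alongside it) rather than in the metadata service's RAM.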
Erasure Coding
Three-way replication is wasteful for cold data: you pay for two extra full copies to protect against failures that are rare. Reed-Solomon erasure coding stores K data chunks + M parity chunks; any M of the K+M chunks can be lost and the data is still recoverable. Example: RS(6, 3) stores 6 data + 3 parity chunks, tolerates the loss of any 3, and has a storage overhead of 1.5× vs replication's 3×.
Tradeoff: reading a single missing chunk requires fetching K surviving chunks from across the cluster and decoding them, which is expensive. So:
- Hot tier: 3× replication. Read latency unaffected by failures.
- Cold tier: erasure-coded. Storage cheap; reconstruction expensive but rare.
HDFS-EC, S3 (internally), Backblaze B2, and Ceph all use Reed-Solomon variants. Modern codes (Locally Recoverable Codes, LRC) trade a bit of overhead for cheaper single-chunk recovery.
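The recover-from-survivors idea can be shown with single XOR parity, which is the simplest erasure code (K data chunks + 1 parity chunk, tolerating one loss). This is deliberately not Reed-Solomon: RS generalizes the same principle to M parity chunks using finite-field arithmetic, which is too long to sketch here. Function names are illustrative.

```python
from functools import reduce
from typing import Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data_chunks: list[bytes]) -> bytes:
    """Parity chunk = XOR of all data chunks (which must be of equal length)."""
    return reduce(xor_bytes, data_chunks)


def recover(chunks: list[Optional[bytes]], parity: bytes) -> list[bytes]:
    """Rebuild at most one missing data chunk (marked None) from the survivors plus parity."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    if len(missing) > 1:
        raise ValueError("single parity tolerates only one lost chunk")
    if not missing:
        return list(chunks)
    survivors = [c for c in chunks if c is not None]
    # XOR of the parity with every surviving chunk leaves exactly the lost chunk.
    rebuilt = reduce(xor_bytes, survivors, parity)
    repaired = list(chunks)
    repaired[missing[0]] = rebuilt
    return repaired


data = [b"abcd", b"efgh", b"ijkl"]
parity = encode(data)
print(recover([data[0], None, data[2]], parity))
```

Rebuilding one 4-byte chunk required reading the other two data chunks and the parity, i.e. K chunks in total. That read amplification is the reconstruction cost that keeps erasure coding on the cold tier.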
Comparison with Object Stores
- S3: flat keys, historically eventually consistent (strongly consistent since late 2020), erasure coding internally, 99.999999999% durability claim. The benchmark every other object store is measured against. Cost-effective at petabyte scale.
- MinIO: S3-API-compatible, self-hosted, uses Reed-Solomon. Good for on-prem or hybrid; a pure storage layer (no compute integration like HDFS).
- Ceph: multi-protocol (object, block, filesystem) on one cluster. The CRUSH algorithm computes placement, so object lookups need no central metadata service (the CephFS file interface adds its own metadata servers). Operationally heavy but flexible.
- HDFS: tightly coupled to the Hadoop ecosystem. Locality-aware compute (run Spark on the node that holds the chunk). Modern stacks decouple this and use S3-compatible storage with Spark on Kubernetes.
The industry trend: most new workloads use object storage (S3 or compatible). HDFS persists mainly in existing Hadoop installations.
Failure Modes
- NameNode OOM: metadata grows past RAM. Heap-tune, federate the namespace, archive cold data to a separate cluster.
- Cluster-wide replication storm: a rack loses power; thousands of chunks go under-replicated; the master triggers re-replication from surviving copies, saturating the network. Throttle re-replication rate to ~10% of cluster bandwidth.
- Small-file hell: ML training writes millions of small files and the cluster grinds to a halt on metadata operations. Detect via NameNode RPC queue depth; refactor producers to write larger files.
- Stale reads: a client has cached chunk locations, a rebalance migrates the chunk, and the read hits a dead replica. On read failure the client refreshes locations from the master and retries.
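The stale-read mitigation is a small client-side loop: try cached locations, and on failure drop the cache entry and ask the master again. A minimal sketch, where `lookup` and `fetch` are hypothetical callables standing in for the master RPC and the DataNode read:

```python
def read_chunk(chunk_id, cache, lookup, fetch):
    """Read a chunk using cached locations, refreshing them once from the master on failure.

    cache:  dict of chunk id -> list of node names (the client's location cache)
    lookup: callable(chunk_id) -> list of node names, asks the master
    fetch:  callable(node, chunk_id) -> bytes, raises ConnectionError on a dead replica
    """
    for _attempt in range(2):          # first pass may use stale entries, second is fresh
        if chunk_id not in cache:
            cache[chunk_id] = lookup(chunk_id)
        for node in cache[chunk_id]:
            try:
                return fetch(node, chunk_id)
            except ConnectionError:
                continue               # try the next replica
        del cache[chunk_id]            # every cached replica failed: force a refresh
    raise IOError(f"no live replica for {chunk_id}")


master_view = {"c1": ["dn4"]}          # the chunk was rebalanced to dn4
client_cache = {"c1": ["dn1"]}         # the client still remembers dn1


def fetch(node, chunk_id):
    if node == "dn4":
        return b"chunk-bytes"
    raise ConnectionError(node)


print(read_chunk("c1", client_cache, master_view.__getitem__, fetch))
print(client_cache)
```

Bounding the loop at one refresh keeps a genuinely lost chunk from turning into an endless retry against the master.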
FAQ
HDFS or S3?
S3 unless you have a specific reason. Locality-aware compute (the original HDFS pitch) is mostly mooted by 25 GbE networks; S3's 11-9s durability and operational simplicity beat self-managed HDFS for most teams.
What about POSIX semantics?
HDFS is mostly-POSIX (no random write into a closed file, no hard links). S3 is not POSIX (no rename, no atomic directory operations). For applications that need POSIX, mount with FUSE or use a layered filesystem (Alluxio, JuiceFS).
Can I run Spark on S3?
Yes, this is the modern default. Performance is acceptable with the S3A connector and a proper committer (the S3A "directory" and "partitioned" staging committers and the "magic" committer avoid the rename-as-commit problem).
Why are chunks 64–128 MB rather than 1 GB?
Chunks are the unit of re-replication, so bigger chunks mean each copy takes longer and a failed node's data is spread over fewer parallel transfers. 128 MB on 10 GbE takes ~100 ms, fast enough to maintain durability targets; a 1 GB chunk would take close to a second per copy.
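The per-chunk transfer time is simple to verify. The sketch below assumes the link runs at its full nominal rate with no protocol overhead, which is optimistic; `transfer_seconds` is an illustrative helper.

```python
MiB = 1024**2


def transfer_seconds(size_bytes: int, link_gbps: float) -> float:
    """Ideal time to move size_bytes over a link of link_gbps gigabits per second."""
    return size_bytes * 8 / (link_gbps * 1e9)


for size_mib in (64, 128, 1024):
    seconds = transfer_seconds(size_mib * MiB, 10)
    print(f"{size_mib:>4} MiB on 10 GbE: {seconds * 1000:.0f} ms")
```

Transfer time grows linearly with chunk size, so the choice is less about total bytes moved during recovery than about how finely the work can be divided among nodes.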