Distributed File System — GFS, HDFS, replication, erasure coding, S3

Architecture

Capacity Estimation

Metric	Value	Notes
Cluster capacity	~100 PB	1 K nodes × 100 TB
Chunk size	64–128 MB (HDFS), 4 MB (S3 part)	tradeoff with metadata
Files	~1 B	NameNode RAM dominates
NameNode RAM	~150 GB	~150 B/file metadata
Replication factor	3 (hot), EC 6+3 (cold)	1.5x storage with EC vs 3x
Read throughput	~1 GB/s/client	parallel chunk fetch
Recovery from disk loss	~30 min	10 TB disk on 10 GbE

Chunked Storage

Files are split into chunks (HDFS blocks) of 64–128 MB. The choice of chunk size is a fundamental tradeoff:

Larger chunks — fewer chunks per file, less metadata, better sequential throughput. Bad for small files (a 1 KB file still occupies a chunk header).
Smaller chunks — metadata explodes; NameNode RAM becomes the limit.

HDFS picked 128 MB as the sweet spot for analytics workloads (multi-GB Parquet files). S3 internally splits objects into ~4 MB parts but exposes a flat key space, hiding the chunking from clients.

Metadata Service: NameNode / Master

One central service holds the filesystem tree and the file→chunk map. Operations: create, open, list, delete, rename. The Master does not serve data; it returns chunk locations and the client streams data directly from DataNodes.

State kept entirely in RAM for low-latency lookups; persisted via an edit log to disk + snapshot.
HA: a standby NameNode receives the edit log via JournalNodes (a Paxos/Raft quorum); failover takes ~30 s in HDFS's standard configuration.
Federation: multiple NameNodes each owning a namespace partition (e.g., /users on NN1, /data on NN2). Mitigates the single-master scale ceiling.

The single-master design is the original sin and the strength: it enables strong consistency on the namespace, but caps cluster size at ~10K nodes / 1B files. S3 sidestepped this by going flat-key (no directories) and partitioning the index by key prefix.

Three-way Replication

The default GFS/HDFS placement: three copies, two racks. Replica 1 on the writer's rack (locality), replicas 2 and 3 on a remote rack (rack-failure survivability), replica 3 on a different node from replica 2.

Reads pick the closest replica (rack-aware client). Writes pipeline: client sends to replica 1, which forwards to 2, which forwards to 3, ack returns when all three commit. Failure of one replica during write doesn't fail the write — the client receives a partial-ack and the master triggers re-replication.

Storage cost: 3× raw. For petabyte-scale archives, this becomes the dominant infrastructure expense.

The Small-file Problem

1 billion 1 KB files = 1 TB of data + 150 GB of NameNode metadata. The metadata service is the bottleneck, not the disks. Mitigations:

HAR / SequenceFile / Avro — pack many small files into one large container. Native HDFS feature.
HBase / KV store on top — small records belong in a key-value store, not a file system. Use HDFS for blobs > 1 MB only.
Object store flat namespace — S3 partitions metadata by key prefix; small-object overhead is per-object cost, not centralized RAM. Better suited for many-small-object workloads.

Erasure Coding

Three-way replication is wasteful for cold data — you copy to protect against failure that almost never happens. Reed-Solomon erasure coding stores K data chunks + M parity chunks; you can lose any M and recover. Example: RS(6, 3) stores 6 data + 3 parity; loses up to 3 chunks; storage overhead is 1.5× vs replication's 3×.

Tradeoff: read of a single missing chunk requires fetching K chunks across the cluster and computing — expensive. So:

Hot tier: 3× replication. Read latency unaffected by failures.
Cold tier: erasure-coded. Storage cheap; reconstruction expensive but rare.

HDFS-EC, S3 (internally), Backblaze B2, and Ceph all use Reed-Solomon variants. Modern codes (Locally Recoverable Codes, LRC) trade a bit of overhead for cheaper single-chunk recovery.

Comparison with Object Stores

S3 — flat keys, eventual consistency historically (now strongly consistent), erasure coding internally, 99.999999999% durability claim. The benchmark every other object store is measured against. Cost-effective at petabyte scale.
MinIO — S3-API-compatible, self-hosted, uses Reed-Solomon. Good for on-prem or hybrid; pure storage layer (no compute integration like HDFS).
Ceph — multi-protocol (object, block, filesystem) on one cluster. CRUSH algorithm for placement, no central metadata. Operationally heavy but flexible.
HDFS — tightly coupled to Hadoop ecosystem. Locality-aware compute (run Spark on the node that holds the chunk). Modern stacks decouple this and use S3-compatible storage with Spark on Kubernetes.

The industry trend: most new workloads use object storage (S3 or compatible). HDFS remains where it is for legacy Hadoop installations.

Failure Modes

NameNode OOM — metadata grows past RAM. Heap-tune, federate the namespace, archive cold data to a separate cluster.
Cluster-wide replication storm — a rack loses power; thousands of chunks go under-replicated; the master triggers re-replication from surviving copies, saturating network. Throttle re-replication rate to ~10% of cluster bandwidth.
Small-file hell — ML training writes millions of small files; cluster grinds to halt on metadata operations. Detect via NameNode RPC queue depth; refactor producers to write larger files.
Stale reads — client cached chunk locations; chunk migrated due to rebalance; read hits dead replica. Refresh on read failure; client retries from master.

FAQ

HDFS or S3?

S3 unless you have a specific reason. Locality-aware compute (the original HDFS pitch) is mostly mooted by 25 GbE networks; S3's 11-9s durability and operational simplicity beat self-managed HDFS for most teams.

What about POSIX semantics?

HDFS is mostly-POSIX (no random write into a closed file, no hard links). S3 is not POSIX (no rename, no atomic directory operations). For applications that need POSIX, mount with FUSE or use a layered filesystem (Alluxio, JuiceFS).

Can I run Spark on S3?

Yes, this is the modern default. Performance is acceptable with the S3A connector and proper committers (the "direct" and "magic" committers solve the rename-as-commit problem).

Why are chunks 64–128 MB rather than 1 GB?

Recovery time after a node failure scales with chunk size: bigger chunks mean longer to re-replicate one. 128 MB on 10 GbE is ~100 ms, fast enough to maintain durability targets.