Distributed File Systems

Last Updated: January 10, 2026

Ashish Pratap Singh

When you need to store a terabyte of data, a single hard drive works fine. When you need to store a petabyte, you have a problem. No single machine can hold that much data, and even if it could, you would have a single point of failure that could take down your entire system.

Distributed file systems solve this by spreading data across many machines while presenting a unified view to applications.

When your application reads /data/logs/server.log, the file system figures out which machines hold pieces of that file, fetches them, and assembles them transparently. Your application sees a normal file, but behind the scenes, dozens or hundreds of servers may be involved.

Distributed file systems also provide fault tolerance (your data survives machine failures), high throughput (many machines serving data in parallel), and data locality (processing can happen where data lives). These properties make distributed file systems the backbone of modern data infrastructure, from web search indexes to machine learning training pipelines.

In this chapter, we will explore how distributed file systems work, the architectural patterns they use, and when to choose them for your system design.

What is a Distributed File System?

A distributed file system (DFS) stores files across multiple servers but presents them to users and applications as if they were on a single machine. The distribution is transparent.

Applications use familiar file operations (open, read, write, close), and the DFS handles the complexity of locating data, coordinating access, and recovering from failures.

The diagram below illustrates this abstraction. An application sees a simple file path, but behind that path lies a sophisticated system that manages metadata about where file pieces are stored and coordinates multiple storage servers.

Notice how the metadata service acts as a directory. It knows which servers hold which pieces of the file, but it does not store the actual file data itself.

The key distinction from local file systems is that data and metadata may live on different machines, accessed over a network. This introduces challenges around consistency, latency, and failure handling that local file systems never have to consider.

A local file system can assume the disk is either working or completely failed. A distributed file system must handle partial failures, network partitions, and the reality that different parts of the system may have different views of the current state.

These challenges are what make distributed file systems interesting, and why understanding them is valuable for any system design discussion.

Why Distributed File Systems Exist

Before diving into architecture, it helps to understand the problems that drive the need for distributed file systems. Each problem shaped the design decisions we see in modern implementations.

The Capacity Problem

Single-machine storage has physical limits. A modern server might have 100 TB of disk space, which sounds like a lot until you consider the scale at which major technology companies operate.

The math is straightforward. If you need to store 100 petabytes and a single server can hold 100 terabytes, you need at least 1,000 servers.

But simply having 1,000 independent servers does not solve the problem. You need a way to coordinate them, to know which server holds which data, and to present a unified interface to applications that should not need to care about which physical machine their data lives on.

Solution: Spread data across many machines, with a coordination layer that makes them appear as a single system.

The Reliability Problem

Hard drives fail. This is not pessimism; it is physics. A drive with a 1% annual failure rate sounds reliable until you have 10,000 drives, at which point you can expect about 100 failures per year, roughly two per week.

The problem compounds when you consider that your data might live on just one of those drives. With a single copy of your data, every drive failure means permanent data loss. Even if you have backups, restoring from backup takes time, during which your data is unavailable.

Solution: Replicate data across multiple machines. If any machine fails, copies exist elsewhere. The system can continue serving data while replacing the failed replica.

The Throughput Problem

A single machine has limited I/O bandwidth. Even with fast NVMe SSDs, one server might sustain 10 GB/s of read throughput. This sounds fast until thousands of clients need data simultaneously.

Consider a concrete example. A data science team runs a training job that needs to read 10 TB of data. If they read from a single server at 10 GB/s, it takes 1,000 seconds (about 17 minutes) just to read the data once. If the same data is spread across 100 servers and they can read in parallel, it takes 10 seconds.

Solution: Serve data from many machines in parallel. Aggregate bandwidth scales with cluster size, not with the capabilities of any single machine.

The Locality Problem

Moving large amounts of data over the network is slow and expensive. Network bandwidth, while improving, still lags far behind local disk bandwidth. If you need to analyze 100 TB of logs, shipping all that data to a compute cluster could take hours and saturate your network.

This was the key insight behind MapReduce and the reason HDFS was built alongside it. Instead of moving data to computation, move computation to data. Store data near the compute resources, or let the file system inform schedulers where data lives so they can schedule tasks on the machines that already have the data.

Solution: Move computation to data rather than data to computation. The file system tracks data location and exposes this information to job schedulers.

These four problems, capacity, reliability, throughput, and locality, are the core motivations behind distributed file systems. Every design decision in GFS, HDFS, and other implementations traces back to solving one or more of these problems.

Core Architecture Patterns

Most distributed file systems share common architectural patterns, though they implement them differently based on their priorities.

Master-Worker Architecture

The dominant pattern separates metadata management from data storage. This separation is not accidental. It reflects the fundamental difference between how we access metadata and data.

Metadata is small but frequently accessed. When you list a directory or check file permissions, you need metadata. When a job scheduler decides where to run tasks, it needs to know where data chunks live. These operations are small (kilobytes) but happen constantly.

Data is large but accessed less frequently relative to its size. Reading a 100 MB chunk is a single operation, even though it transfers far more bytes than thousands of metadata queries.

The following diagram shows how this separation works in practice.

The client first asks the metadata server "where is this file?" The metadata server responds with a list of storage servers that hold the file's chunks. Then the client contacts those storage servers directly to read or write data. Notice that the actual data never flows through the metadata server.
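
To make this concrete, here is a minimal Python sketch of the flow. The class names and chunk layout are hypothetical (this is not a real GFS or HDFS client API); the point is that the metadata server returns locations only, while data bytes travel directly between the client and the storage servers.

```python
# Hypothetical sketch of the master-worker read flow. The metadata server
# only answers "where is the data?"; bytes flow directly between the
# client and the storage servers.

class MetadataServer:
    def __init__(self):
        # file path -> list of (chunk_id, [servers holding a replica])
        self.files = {
            "/data/logs/server.log": [
                ("chunk-001", ["storage-node-3", "storage-node-7"]),
                ("chunk-002", ["storage-node-1", "storage-node-9"]),
            ]
        }

    def locate(self, path):
        """Return chunk locations only; no file data passes through here."""
        return self.files[path]


class StorageServer:
    def __init__(self, chunks):
        self.chunks = chunks  # chunk_id -> bytes on this server's local disk

    def read_chunk(self, chunk_id):
        return self.chunks[chunk_id]


def read_file(path, metadata, storage_nodes):
    """Client-side read: one metadata lookup, then direct data fetches."""
    data = b""
    for chunk_id, replicas in metadata.locate(path):
        server = storage_nodes[replicas[0]]   # pick the first (closest) replica
        data += server.read_chunk(chunk_id)
    return data
```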

Metadata server (master) responsibilities:

  • Maintains the namespace (the directory tree structure)
  • Tracks which chunks belong to which files
  • Tracks which servers hold each chunk
  • Handles file creation, deletion, and permissions
  • Does NOT handle actual data transfer

Storage servers (workers) responsibilities:

  • Store actual file data as chunks on local disks
  • Serve read/write requests directly from clients
  • Send periodic heartbeats to the master to prove they are alive
  • Report their chunk inventories to the master

Why this separation matters

The separation serves three purposes.

First, scalability. Metadata is small (kilobytes per file) while data is large (gigabytes per file). They have different scaling needs and can be optimized independently.

Second, performance. Metadata can fit entirely in the master's memory, enabling O(1) lookups for file locations. Data must be disk-resident and would overwhelm any single machine's memory.

Third, simplicity. By having data flow directly between clients and storage servers, the master never becomes a bandwidth bottleneck. It handles millions of metadata operations while data moves at the aggregate bandwidth of all storage servers.

This pattern is so effective that it appears not just in file systems but in many distributed storage systems. The key insight is recognizing what needs to be centralized (coordination, metadata) and what can be distributed (actual data storage and transfer).

Chunking

Distributed file systems split files into fixed-size chunks (also called blocks or stripes). This chunking is fundamental to how distribution works.

A 500 MB file becomes four chunks. The first three are full 128 MB chunks. The last one contains the remaining 116 MB. Each chunk can be stored on a different server, replicated independently, and read in parallel.
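
The chunk arithmetic itself is simple. A quick sketch, assuming the 128 MB chunk size used in the example:

```python
CHUNK_SIZE = 128 * 1024 * 1024  # 128 MB, a common default

def split_into_chunks(file_size):
    """Return the size of each chunk for a file of the given size."""
    full, remainder = divmod(file_size, CHUNK_SIZE)
    return [CHUNK_SIZE] * full + ([remainder] if remainder else [])

sizes = split_into_chunks(500 * 1024 * 1024)    # a 500 MB file
print([s // (1024 * 1024) for s in sizes])      # [128, 128, 128, 116]
```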

Why such large chunks (64 MB - 256 MB)?

If you are familiar with local file systems, these chunk sizes seem enormous. Your laptop's file system uses 4 KB blocks. Why the difference?

| Chunk Size | Pros | Cons |
|---|---|---|
| Large (128 MB+) | Fewer metadata entries, amortized seek time, better for sequential I/O | Wasted space for small files, longer recovery time per chunk |
| Small (4 KB) | Fine-grained storage, efficient for small files | Metadata explosion, high overhead for large files |

The choice comes down to workload. Local file systems handle many small files (config files, logs, documents) where 4 KB blocks make sense. Distributed file systems like GFS and HDFS were designed for batch processing of large files (web crawls, log aggregations, training data) where sequential throughput matters more than random access.

Large chunks also reduce the metadata burden. If you have 10 PB of storage with 128 MB chunks, you have about 80 million chunks to track. With 4 KB chunks, you would have 2.5 trillion chunks, an impossible number for any metadata server to manage.

This is a recurring theme in distributed systems. Design decisions that seem unusual often make sense when you consider the specific workload and scale the system targets.

Replication

To survive failures, chunks are replicated across multiple servers. This is perhaps the most important concept for fault tolerance.

The diagram shows a chunk stored on three servers across two racks. This placement is intentional. If all replicas were in the same rack and that rack lost power or network connectivity, all copies would be unavailable simultaneously. By spreading replicas across racks, the system survives rack-level failures.

Replica placement strategies:

Placement is not random. The system follows policies to maximize fault tolerance while balancing load; a simplified placement sketch appears after the list below.

  1. Rack-aware placement: Place at least one replica in a different rack. Racks share power supplies and top-of-rack switches, so a single failure can take down an entire rack.
  2. Zone-aware placement: For multi-datacenter deployments, place replicas in different availability zones. This provides datacenter-level fault tolerance at the cost of cross-zone latency.
  3. Balanced placement: Distribute replicas to avoid hotspots. A server with too many replicas becomes a bottleneck for reads and must do more work during re-replication.
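
As a rough illustration, the sketch below picks replica targets by spreading across racks first and load-balancing second. The topology and policy are simplified assumptions; real placement policies (the HDFS default, for example) add rules such as keeping one replica near the writer.

```python
# Hypothetical rack-aware placement sketch: spread replicas across racks,
# preferring lightly loaded servers.

def place_replicas(servers, rack_of, load, n=3):
    """servers: server names; rack_of: server -> rack; load: server -> chunk count."""
    by_load = sorted(servers, key=lambda s: load[s])
    chosen, racks_used = [], set()
    # First pass: at most one replica per rack, least-loaded servers first.
    for s in by_load:
        if rack_of[s] not in racks_used:
            chosen.append(s)
            racks_used.add(rack_of[s])
        if len(chosen) == n:
            return chosen
    # Second pass: if there are fewer racks than replicas, fill the rest anyway.
    for s in by_load:
        if s not in chosen:
            chosen.append(s)
        if len(chosen) == n:
            break
    return chosen

rack_of = {"a1": "rack-A", "a2": "rack-A", "b1": "rack-B", "c1": "rack-C"}
load = {"a1": 10, "a2": 4, "b1": 7, "c1": 2}
print(place_replicas(list(rack_of), rack_of, load))   # ['c1', 'a2', 'b1']
```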

Replication factor trade-offs:

How many copies should you keep? This is a business decision as much as a technical one.

| Replication Factor | Storage Overhead | Survives N Failures | Read Bandwidth |
|---|---|---|---|
| 2 | 100% extra | 1 failure | 2x single server |
| 3 | 200% extra | 2 failures | 3x single server |
| 5 | 400% extra | 4 failures | 5x single server |

Most systems default to 3 replicas. This survives two simultaneous failures (uncommon but not rare at scale) while tripling storage costs. For critical data or data that is expensive to regenerate, some systems use higher replication factors. For data that can be recomputed (like intermediate MapReduce results), lower factors or no replication may be acceptable.

The 200% storage overhead of 3x replication motivated the development of erasure coding, which we will discuss later. Erasure coding can achieve similar fault tolerance with only 50% overhead, at the cost of more CPU during reads and writes.
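
A back-of-envelope comparison of the two approaches (an RS(6,3) Reed-Solomon layout is assumed for the erasure-coded case, one of the schemes HDFS 3.x ships with):

```python
def replication_overhead(replicas):
    """Extra storage as a fraction of raw data: 3 copies -> 2 (200% extra)."""
    return replicas - 1

def erasure_overhead(data_blocks, parity_blocks):
    """Extra storage for an RS(data, parity) layout: RS(6,3) -> 0.5 (50% extra)."""
    return parity_blocks / data_blocks

print(replication_overhead(3))   # 2   -> 200% extra, tolerates losing 2 copies
print(erasure_overhead(6, 3))    # 0.5 ->  50% extra, tolerates losing any 3 blocks
```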

Metadata Management

The metadata server is the brain of a distributed file system. Its design critically affects scalability, availability, and performance. Get it wrong, and nothing else matters.

What Metadata Includes

Metadata is everything the system needs to know about files except the actual content. This includes two types of information: namespace metadata (the directory tree) and chunk metadata (where file pieces physically reside).

Notice that a file entry points to chunks, and chunk entries point to physical servers. This two-level indirection is important. When a server fails, only chunk entries need updating. File entries remain unchanged. When files are renamed or moved, only namespace entries change. Chunk entries and physical data remain unchanged.

In-Memory Metadata

Many DFS implementations keep all metadata in the master's memory. This is a deliberate choice that trades memory capacity for lookup speed.

GFS/HDFS memory requirements

  • ~64 bytes per chunk mapping
  • ~150 bytes per file or directory entry
  • 1 billion files = ~150 GB metadata

At first glance, keeping everything in memory seems limiting. 150 GB of RAM is significant. But consider the alternative. Disk-based lookups would add milliseconds of latency to every metadata operation.
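
Plugging the rough per-entry figures above into a quick estimate shows where the memory goes (the byte counts are the approximations quoted above, not exact numbers):

```python
BYTES_PER_FILE_ENTRY = 150    # rough per-file/directory figure quoted above
BYTES_PER_CHUNK_ENTRY = 64    # rough per-chunk-mapping figure quoted above

def namespace_bytes(num_files):
    return num_files * BYTES_PER_FILE_ENTRY

def chunk_map_bytes(num_chunks):
    return num_chunks * BYTES_PER_CHUNK_ENTRY

print(namespace_bytes(1_000_000_000) / 1e9)  # 150.0 GB for 1 billion files
print(chunk_map_bytes(80_000_000) / 1e9)     # ~5 GB for 80 million chunks (10 PB at 128 MB)
```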

Implications of in-memory design

The in-memory approach has several consequences worth understanding.

First, lookup is O(1), extremely fast. A hash table lookup for chunk locations takes microseconds, enabling the metadata server to handle hundreds of thousands of operations per second.

Second, memory limits the number of files the system can track. This creates the "small files problem" that plagues HDFS deployments.

Third, master restart requires loading all metadata from disk. For a large cluster, this can take minutes, during which the entire file system is unavailable.

The small files problem

This deserves special attention because it surprises many people. Consider two scenarios:

  • Scenario A: 1,000 files, each 1 GB in size, totaling 1 TB of data.
  • Scenario B: 1,000,000 files, each 1 KB in size, totaling 1 GB of data.

Scenario B uses 1,000x more metadata despite storing 1,000x less data. Each of those million files needs a namespace entry, and while chunks might be combined for small files, the per-file overhead remains.

This is why DFS implementations actively discourage small files. If your workload involves millions of small files, a DFS may not be the right choice. Consider aggregating files (HAR archives in Hadoop), using a different storage system (object storage, databases), or restructuring your data.

Metadata Persistence

Keeping everything in memory is great for performance, but what happens when the master crashes? Memory contents are lost. Without persistence, a master failure means losing track of all files.

The solution combines two techniques: write-ahead logging and periodic snapshots.

Write-ahead log (edit log)

Before any metadata change is applied to memory, it is written to a log on disk. This sequential log write is fast (appending to a file is much faster than random writes). If the master crashes after writing to the log but before updating memory, the change can be recovered by replaying the log.

The log contains entries like "create file /user/data/file.txt with ID 12345" or "add chunk 67890 to file 12345." These entries are small and can be written quickly.

Snapshots (checkpoints)

The problem with relying only on the edit log is that the log grows forever. On restart, the master would have to replay potentially millions of operations, which could take hours.

Snapshots solve this by periodically dumping the entire in-memory state to disk. When the master restarts, it loads the most recent snapshot, then replays only the edit log entries that occurred after the snapshot. If snapshots happen hourly, restart requires replaying at most one hour of logs instead of the entire history.

This combination provides durability (every change is logged) while keeping operations fast (log writes are sequential) and restart times reasonable (snapshots bound replay time).
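
A toy version of the scheme, with a dict standing in for the in-memory namespace and JSON files standing in for the real on-disk formats, looks like this:

```python
import json
import os

class MetadataStore:
    """Toy write-ahead log + snapshot persistence for a metadata server."""

    def __init__(self, log_path="edits.log", snap_path="snapshot.json"):
        self.log_path, self.snap_path = log_path, snap_path
        self.state = {}                          # the in-memory namespace

    def apply(self, op):
        # Durability first: append the operation to the edit log and fsync...
        with open(self.log_path, "a") as log:
            log.write(json.dumps(op) + "\n")
            log.flush()
            os.fsync(log.fileno())
        # ...then mutate the in-memory state.
        self.state[op["path"]] = op["value"]

    def checkpoint(self):
        # Dump the full in-memory state, then truncate the edit log.
        with open(self.snap_path, "w") as snap:
            json.dump(self.state, snap)
        open(self.log_path, "w").close()

    def recover(self):
        # Load the latest snapshot, then replay any edits logged after it.
        if os.path.exists(self.snap_path):
            with open(self.snap_path) as snap:
                self.state = json.load(snap)
        if os.path.exists(self.log_path):
            with open(self.log_path) as log:
                for line in log:
                    op = json.loads(line)
                    self.state[op["path"]] = op["value"]
```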

Scaling Beyond a Single Master

A single master can become a bottleneck or single point of failure as the cluster grows. Different systems address this in different ways.

Scaling reads

Read load on the master comes from clients looking up file locations. Two techniques help.

First, cache metadata on clients. After looking up a file's chunk locations once, the client caches this information. Subsequent reads of the same file do not hit the master. Cache invalidation happens via lease expiration or notifications.

Second, add read-only replicas. Observer nodes receive a stream of metadata updates from the primary master and can serve read queries. This distributes read load while maintaining a single source of truth for writes.

Scaling writes

Write scaling is harder because you need to maintain consistency. The main technique is federation: multiple masters, each owning part of the namespace.

HDFS Federation splits the namespace into independent namespaces, each managed by a separate NameNode. One NameNode might own /user, another owns /data. Applications must know which NameNode handles which path, but within each namespace, you get full consistency.

High availability

A single master is also a single point of failure. High availability requires eliminating this.

In this setup, the active master writes its edit log to a shared, fault-tolerant log (the journal nodes). The standby master continuously reads this log and applies changes to its own in-memory state. It stays synchronized with the active master within seconds.

ZooKeeper handles leader election. If the active master fails (stops sending heartbeats), ZooKeeper triggers an election. The standby, already having a nearly-current copy of metadata, can take over quickly. Typical failover times are 30-60 seconds.

This architecture eliminates the single point of failure while preserving the simplicity of single-master coordination.

Read and Write Paths

Understanding how reads and writes flow through a distributed file system is essential for system design discussions. The paths reveal the trade-offs between latency, throughput, consistency, and fault tolerance.

Write Path

Writing data to a distributed file system is more complex than reading because multiple servers must coordinate to ensure all replicas receive the data. Let us walk through the sequence step by step.

Step 1-2: Location discovery

The client asks the master where to write. For a new file, the master allocates chunk IDs and selects servers to hold replicas. For appending to an existing file, it returns the current chunk servers. The master designates one server as the "primary", which will coordinate the write.

Step 3: Data push

The client pushes the actual data to all replica servers. In the GFS model shown here, the client sends data to all servers in parallel. HDFS takes a different approach using a pipeline: client sends to server 1, which forwards to server 2, which forwards to server 3. We will discuss the trade-offs shortly.

Step 4: Write command

After data is pushed, the client sends a write command to the primary. This command says "commit the data I just sent." The separation between data push and write command allows the system to push data optimally (parallel or pipelined) while maintaining ordering through the primary.

Step 5-6: Ordering and replication

The primary assigns a sequence number to the write and applies it locally. Then it sends the write command (with sequence number) to secondaries. The sequence number ensures all replicas apply writes in the same order, which is critical for consistency.

Step 7-8: Acknowledgment

Secondaries acknowledge to the primary after applying the write. Once all (or a configured quorum of) replicas acknowledge, the primary reports success to the client.
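
The primary's role as the ordering point can be sketched as follows. The classes are hypothetical and the data push is omitted; only the commit/acknowledge protocol is shown:

```python
class Replica:
    def __init__(self):
        self.log = []                 # writes applied, in sequence order

    def commit(self, seq, record):
        self.log.append((seq, record))
        return True                   # acknowledgment back to the primary

class Primary(Replica):
    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.next_seq = 0

    def write(self, record):
        seq = self.next_seq           # 1. assign a sequence number...
        self.next_seq += 1
        self.commit(seq, record)      # 2. ...apply locally...
        acks = [s.commit(seq, record) for s in self.secondaries]  # 3. forward to secondaries
        return all(acks)              # 4. succeed only when every replica acknowledges

primary = Primary([Replica(), Replica()])
print(primary.write(b"append: log line 1"))   # True once all replicas have applied it
```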

Pipeline vs parallel data push

The choice between pipeline and parallel data push involves trade-offs that come up frequently in distributed systems.

| Approach | Client Network Load | Latency | When to Use |
|---|---|---|---|
| Parallel | Client sends N copies | Wait for slowest server | Client has high bandwidth |
| Pipeline | Client sends 1 copy | Sum of N-1 forwarding hops | Client bandwidth is limited |

GFS used parallel writes because Google's datacenter had high-bandwidth networks between all machines. HDFS adopted pipeline writes because Hadoop clusters often had limited client bandwidth, particularly when clients were mappers or reducers running on commodity hardware.

The pipeline approach is interesting: the client sends data once, and each server forwards to the next. This reduces client bandwidth by a factor of N (where N is the replication factor) at the cost of higher latency. For large sequential writes where throughput matters more than latency, this trade-off often makes sense.

Read Path

Reads are simpler because they do not require coordination between replicas. Any replica with the data can serve a read request.

Step 1-2: Location lookup

The client asks the master for the file's chunk locations. The master returns a list of chunks and, for each chunk, a list of servers holding that chunk. Importantly, servers are sorted by proximity to the client. The client will prefer reading from nearby servers.

Step 3-4: Data transfer

The client contacts a storage server and requests the chunk. The server reads from local disk and returns the data along with a checksum.

Step 5-6: Verification and assembly

The client verifies the checksum to detect corruption. If the checksum fails, the client tries another replica. Once all chunks are verified, the client assembles them into the complete file.
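
The client-side loop is short: try replicas in proximity order, verify the checksum, and fall back on a mismatch. The `fetch` callable below is a stand-in for the actual transport:

```python
import zlib

def read_chunk(chunk_id, replicas, fetch):
    """fetch(server, chunk_id) -> (data, stored_crc32); hypothetical transport."""
    for server in replicas:                 # replicas come sorted by proximity
        data, stored = fetch(server, chunk_id)
        if zlib.crc32(data) == stored:
            return data                     # checksum matches: use this copy
        # Otherwise this replica is corrupt; try the next one.
    raise IOError(f"all replicas of {chunk_id} failed checksum verification")

def assemble_file(chunk_list, fetch):
    """chunk_list: ordered [(chunk_id, [replica servers]), ...]."""
    return b"".join(read_chunk(cid, reps, fetch) for cid, reps in chunk_list)
```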

Read optimizations

Several techniques improve read performance in practice.

Locality preference: The master returns replicas sorted by network distance. Reading from the same rack saves crossing the network backbone. Reading from the same server (if the client is a compute task co-located with data) avoids the network entirely.

Client-side caching: Clients cache chunk locations. After the first read of a file, subsequent reads skip the master lookup. Caches are invalidated when files are modified or after a timeout.

Short-circuit reads: This optimization bypasses the network entirely when the client and data are on the same machine. Instead of reading over a network socket, the client reads directly from the local disk. HDFS supports this for MapReduce tasks scheduled on the same nodes as their input data.

Speculative reads: For latency-sensitive applications, the client can request the same chunk from multiple replicas simultaneously and use whichever responds first. This trades extra network traffic for tail latency reduction.
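
A speculative read is easy to express with a thread pool: issue the same request to a few replicas and take whichever answers first. A sketch, assuming a `fetch(server)` callable:

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def speculative_read(replicas, fetch, fanout=2):
    """Request the same chunk from `fanout` replicas; return the first response."""
    with ThreadPoolExecutor(max_workers=fanout) as pool:
        futures = [pool.submit(fetch, server) for server in replicas[:fanout]]
        done, _ = wait(futures, return_when=FIRST_COMPLETED)
        # Note: leaving the `with` block still waits for the slower requests to
        # finish; the extra traffic is the price paid for lower tail latency.
        return next(iter(done)).result()
```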

Consistency Models

Distributed file systems make different trade-offs around consistency. The choice affects both performance and programming complexity.

Why Consistency is Hard

In a local file system, consistency is straightforward. When you write data and then read it, you see what you wrote. There is only one copy of the data, so there is nothing to disagree about.

In a distributed file system, the same data exists on multiple servers. When you write, all replicas must be updated. If a read happens while the update is in progress, different replicas might return different values. If network failures prevent some replicas from receiving updates, they fall out of sync.

The fundamental tension is between consistency (all clients see the same data), availability (the system responds to requests), and partition tolerance (the system continues operating when network failures split it). This is the CAP theorem, and distributed file systems must choose their position in this trade-off space.

Strong Consistency

Every read sees the most recent write. This is what local file systems provide and what applications expect.

How to achieve it:

  • A single master serializes all writes, assigning them a global order
  • Synchronous replication ensures all replicas have the data before acknowledging
  • Reads come from the master, from a quorum of replicas, or from a replica that is guaranteed to be up-to-date (see the overlap check below)
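
The overlap rule behind quorum reads is worth stating precisely: with N replicas, any write acknowledged by W of them and any read that consults R of them are guaranteed to intersect when R + W > N, so the read always sees the latest acknowledged write.

```python
def quorums_overlap(n, w, r):
    """True if every read of r replicas must intersect every write of w replicas."""
    return r + w > n

print(quorums_overlap(n=3, w=2, r=2))   # True:  classic majority quorums
print(quorums_overlap(n=3, w=1, r=1))   # False: a read can miss the only written copy
```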

Cost: Higher latency because writes must wait for all replicas. Lower availability during failures because the system must choose between serving stale data and refusing requests.

Strong consistency is easier to program against because applications can assume normal file semantics. However, achieving it at scale requires careful engineering.

Eventual Consistency

Reads may see stale data temporarily, but the system eventually converges to a consistent state.

When updates happen:

  1. Client writes to primary
  2. Primary acknowledges to client
  3. Primary asynchronously replicates to secondaries
  4. Eventually, all replicas have the update

The gap: Between steps 2 and 4, different replicas have different data. A read from the primary returns the new value. A read from a secondary might return the old value.

Use case: High availability, partition tolerance, and workloads where slight staleness is acceptable. A file deleted on one server may still appear on another for seconds or minutes, but this might be fine for batch analytics where you are processing yesterday's data anyway.

GFS Consistency Model

The Google File System defines a nuanced model that accepts weaker consistency in exchange for higher throughput.

The "at least once" semantics for appends is particularly interesting. GFS was designed for log collection where thousands of machines append records to shared files. Guaranteeing exactly-once would require expensive coordination.

Instead, GFS guarantees that your record is written (at least once), tells you where it landed, and lets applications handle duplicates.

This trade-off made sense for Google's workloads. Log processing pipelines already had to handle duplicates from other sources (retried RPCs, restarted jobs). Adding one more source of duplicates did not change the programming model.

HDFS Consistency

HDFS takes a more conservative approach, providing simpler semantics by restricting what operations are allowed.

By disallowing concurrent writes and random updates, HDFS sidesteps the hardest consistency problems. There is no need to reason about what happens when two writers update the same region because that is not allowed. This makes HDFS easier to understand and use correctly, at the cost of flexibility.

The restriction to append-only writes aligns well with batch processing workloads. MapReduce jobs write output files sequentially. Log collectors append records. The inability to update existing data is rarely a limitation in practice.

Fault Tolerance

Handling failures gracefully is a defining characteristic of distributed file systems. At scale, failures are not exceptional events. They are routine. A well-designed DFS assumes components will fail and builds resilience into every layer.

Failure Detection

How do you know when a server has failed? The answer is harder than it seems. A server might be down, or the network might be partitioned, or the server might be overloaded and responding slowly. From the outside, these cases look similar.

Distributed file systems use heartbeats for failure detection. Storage servers send periodic messages to the master saying "I'm alive."

Why wait 10 minutes?

This delay seems long. A server that is actually down will be unavailable for 10 minutes before the system begins recovering its data. Why not act faster?

The answer involves the cost of being wrong. Re-replication is expensive. Copying terabytes of data consumes network bandwidth, disk I/O on source and destination servers, and master CPU for coordination.

If you trigger re-replication for a server that was just rebooting or experiencing a brief network glitch, you waste resources and then have to clean up the extra replicas.

The 10-minute delay balances detection latency against false positive cost. For most failures (disk failures, kernel panics, hardware issues), 10 minutes of unavailability is acceptable. For transient issues (network blips, garbage collection pauses), the delay avoids unnecessary work.
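
Mechanically, the detector is just a table of last-seen timestamps and a timeout (the 10-minute figure from the discussion above is assumed here):

```python
import time

DEAD_AFTER = 10 * 60              # seconds without a heartbeat before declaring death

class FailureDetector:
    def __init__(self):
        self.last_seen = {}       # server -> timestamp of its most recent heartbeat

    def heartbeat(self, server):
        self.last_seen[server] = time.monotonic()

    def dead_servers(self):
        now = time.monotonic()
        return [s for s, t in self.last_seen.items() if now - t > DEAD_AFTER]
```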

Re-replication

When a server fails, its chunks become under-replicated. The master must restore the target replication factor.

Prioritization matters

Not all under-replicated chunks are equally urgent. A chunk with only 1 remaining replica is one failure away from data loss. A chunk with 2 remaining replicas (target is 3) is under-replicated but not critical.

The master prioritizes re-replication (a minimal sketch of the ordering follows this list):

  1. Critical: Chunks with only 1 replica. These are scheduled immediately.
  2. Urgent: Chunks with 2 replicas (for a 3x target). These are scheduled soon.
  3. Normal: Chunks below target but not at risk. These are scheduled as bandwidth allows.
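
A priority queue keyed on the number of surviving replicas captures this ordering. A sketch, assuming a replication target of 3:

```python
import heapq

TARGET_REPLICAS = 3

def build_repair_queue(chunk_replicas):
    """chunk_replicas: chunk_id -> surviving replica count.
    Returns a heap that yields the most at-risk chunks first."""
    heap = []
    for chunk_id, alive in chunk_replicas.items():
        if alive < TARGET_REPLICAS:
            heapq.heappush(heap, (alive, chunk_id))  # fewer replicas = higher priority
    return heap

queue = build_repair_queue({"c1": 1, "c2": 2, "c3": 3, "c4": 1})
while queue:
    alive, chunk_id = heapq.heappop(queue)
    print(f"re-replicate {chunk_id} (only {alive} replica(s) left)")
# c1 and c4 (one replica each) come out before c2; c3 is at target and skipped.
```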

Rate limiting

Even with prioritization, re-replication must be throttled. Copying all data from a failed server as fast as possible would saturate the network, impacting normal operations. Systems limit concurrent re-replication operations per server, total cluster re-replication bandwidth, and the number of chunks being copied simultaneously.

This creates an interesting trade-off. Faster re-replication restores fault tolerance sooner but impacts production traffic. Slower re-replication is gentler on the cluster but extends the window of vulnerability.

Placement during re-replication

When selecting a target server for a new replica, the master considers:

  • Rack diversity (do not put all replicas in one rack)
  • Server utilization (prefer less-loaded servers)
  • Recent write activity (avoid servers that are already busy with writes)
  • Disk space (ensure the server has room)

Data Integrity

Disks can fail silently, returning corrupted data without reporting errors. This "bit rot" is insidious because it can spread through the system undetected. A corrupted replica might be used as the source for re-replication, propagating the corruption.

Checksums

Each chunk has a checksum (typically CRC32 or a stronger hash) stored separately from the data. When reading a chunk, the server computes the checksum and compares it to the stored value.

If the checksum fails, the server reports the corruption to the master, and the client reads from another replica. The master schedules re-replication from a good replica to replace the corrupt one.

Background scrubbing

Waiting for reads to discover corruption is reactive. What about chunks that are rarely read? A corrupt replica might sit undetected until the other replicas also fail, at which point the data is lost.

Background scrubbing proactively verifies data integrity. Each storage server periodically reads its chunks and verifies checksums, even if no client requested that data. Corruptions are discovered and repaired before they matter.
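
A scrubber is a slow background loop over the local chunk inventory. A sketch, with CRC32 checksums and a hypothetical report_corruption callback:

```python
import os
import time
import zlib

def scrub(chunk_dir, stored_checksums, report_corruption, pause=0.1):
    """Re-read every local chunk and verify it against its stored checksum.
    stored_checksums: chunk filename -> expected CRC32."""
    for name, expected in stored_checksums.items():
        with open(os.path.join(chunk_dir, name), "rb") as f:
            actual = zlib.crc32(f.read())
        if actual != expected:
            report_corruption(name)   # the master re-replicates from a good copy
        time.sleep(pause)             # throttle so scrubbing does not starve clients
```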

Master Recovery

Despite high availability configurations, there are scenarios where the master must recover from a complete restart. Understanding this process helps you reason about system behavior during maintenance and failures.

Recovery sequence

  1. Load the most recent snapshot from disk
  2. Replay all edit log entries since the snapshot
  3. Wait for storage servers to report their chunk inventories
  4. Rebuild the chunk-to-server mapping from these reports
  5. Enter "safe mode" (read-only) until enough replicas are confirmed
  6. Resume normal operation

Step 3 is particularly interesting. The master does not persistently store chunk locations. It reconstructs this mapping at startup from reports by storage servers. This is intentional. Chunk locations change frequently (servers fail, new servers join, chunks are rebalanced), and the storage servers are the source of truth for what they actually have.

Minimizing downtime

For a hot standby configuration, recovery is much faster:

  1. ZooKeeper detects master failure
  2. Standby is promoted to active
  3. Standby was already applying logs, so its state is nearly current
  4. Brief safe mode while confirming chunk availability
  5. Resume normal operation

Typical failover times are 30-60 seconds, most of which is confirming the standby has applied recent logs and validating chunk availability.

Major Implementations

Let us examine how major distributed file systems implement these concepts. Each makes different trade-offs based on its target workload and operational environment.

Google File System (GFS)

GFS (2003) pioneered many patterns that later systems adopted. It was designed for Google's specific workload: large files, sequential access patterns, and append-heavy operations (web crawls, log collection, index building).

Key characteristics:

  • 64 MB chunks: Large for the era, chosen to amortize seek time and reduce metadata overhead for Google's large files.
  • 3x replication by default: Simple to understand and implement, with acceptable storage overhead for Google's scale.
  • Single master: Simplified coordination at the cost of scalability limits. Google later evolved to Colossus with distributed metadata.
  • Atomic record append: A special operation for appending records to shared files. Critical for log collection where many producers write to the same file.
  • Relaxed consistency: Accepted for the performance benefits. Applications were already designed to handle duplicates and out-of-order data.

Design philosophy: Optimize for the common case. Google's workloads involved large sequential reads and appends. Random access and small files were rare. By optimizing for the common case and accepting limitations for uncommon cases, GFS achieved excellent performance for its target workload.

GFS demonstrated that you could build a reliable file system on unreliable hardware by accepting failures as normal and building recovery into the system design.

Hadoop Distributed File System (HDFS)

HDFS is an open-source implementation inspired by GFS. It was built as part of the Hadoop project to support MapReduce processing.

| Aspect | GFS | HDFS |
|---|---|---|
| Chunk size | 64 MB | 128 MB (configurable) |
| Default replication | 3 | 3 (configurable) |
| Metadata server | GFS Master | NameNode |
| Storage servers | Chunk Servers | DataNodes |
| Consistency | Relaxed (concurrent writes) | Stronger (single-writer) |

HDFS innovations beyond GFS:

NameNode High Availability: HDFS added active/standby NameNodes with shared edit logs, eliminating the single point of failure that limited GFS.

Federation: Multiple independent namespaces, each managed by its own NameNode, allowing horizontal scaling of metadata capacity.

Erasure Coding: HDFS 3.x added erasure coding as an alternative to replication. Instead of storing 3 copies (200% overhead), erasure coding can achieve similar fault tolerance with 50% overhead. The trade-off is higher CPU during reads and writes.

Data locality API: HDFS exposes block locations to applications, enabling schedulers like YARN to place compute tasks on nodes that already have the data. This data locality is key to MapReduce and Spark performance.

HDFS remains the most widely deployed distributed file system for data analytics workloads. Its tight integration with the Hadoop ecosystem (MapReduce, Spark, Hive, HBase) makes it the default choice for many organizations.

Ceph

Ceph takes a fundamentally different approach by eliminating the central metadata server bottleneck. This makes it more complex but enables massive scaling.

The CRUSH algorithm

Ceph's key innovation is CRUSH (Controlled Replication Under Scalable Hashing). Instead of looking up data locations from a metadata server, clients compute locations algorithmically.

Given an object ID and the current cluster map (which describes the cluster topology and which servers are available), CRUSH deterministically computes which servers should store that object. All clients running the same algorithm with the same inputs arrive at the same answer.

This eliminates the metadata lookup bottleneck. To read data, a client does not ask anyone where the data is. It computes the location and contacts the storage servers directly. The only centralized component is the cluster map, which changes infrequently (only when servers are added or removed) and is distributed to all clients.
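
CRUSH itself is considerably more sophisticated (it respects the failure-domain hierarchy and per-device weights), but the core idea, clients computing placement from the object ID and the cluster map with no lookup, can be illustrated with rendezvous (highest-random-weight) hashing:

```python
import hashlib

def place(object_id, cluster_map, replicas=3):
    """Deterministically pick servers for an object from the cluster map alone.
    Every client with the same map computes the same answer: no metadata lookup."""
    def score(server):
        return hashlib.sha256(f"{object_id}:{server}".encode()).hexdigest()
    return sorted(cluster_map, key=score, reverse=True)[:replicas]

cluster_map = ["osd-1", "osd-2", "osd-3", "osd-4", "osd-5"]
print(place("object-42", cluster_map))   # the same 3 servers on every client
```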

Ceph provides multiple interfaces

  • RADOS: The core object storage layer. Everything else builds on this.
  • RBD (RADOS Block Device): Block storage for virtual machines. Used by OpenStack, Kubernetes.
  • CephFS: A POSIX-compatible file system. Uses metadata servers for the file namespace but stores data in RADOS.
  • RGW (RADOS Gateway): Object storage with S3 and Swift APIs.

This versatility makes Ceph popular for organizations that need multiple storage types. Instead of running separate systems for block storage, file storage, and object storage, Ceph provides all three on a unified cluster.

Comparison Summary

| Feature | GFS/HDFS | Ceph | Cloud Object Storage (S3) |
|---|---|---|---|
| Metadata architecture | Centralized master | Distributed (CRUSH) | Managed service |
| Scalability limit | Master memory | Near-linear | Unlimited (AWS manages it) |
| Consistency | Configurable | Strong | Strong (since 2020) |
| Operational complexity | Moderate | High | Low (managed) |
| Best for | Batch analytics | General purpose, on-premise | Cloud-native applications |

The choice between these systems depends on your specific requirements. HDFS excels for Hadoop/Spark workloads with its deep integration. Ceph provides flexibility for organizations needing multiple storage types. Cloud object storage offers simplicity for cloud-native applications.

When to Use Distributed File Systems

Understanding when to choose a distributed file system versus alternatives is as important as understanding how they work. The wrong choice can lead to complexity without benefits or limitations that block your use case.

Choose DFS When

You have large datasets that do not fit on single machines.

If your data exceeds what a single server can hold (roughly 10-100 TB for modern servers), you need distributed storage. The question becomes which kind. DFS is appropriate when your access patterns match its strengths.

You need fault tolerance without manual intervention.

DFS automatically replicates data and recovers from failures. A disk fails, and the system notices, replicates affected data to other servers, and continues operating. Compare this to a single server where a disk failure means downtime and restore from backup.

You want data locality for computation.

This is the killer feature for analytics workloads. When you can process data where it lives, you avoid shipping terabytes over the network. MapReduce, Spark, and similar frameworks exploit this. If your workload involves large scans of data (analytics, ML training), DFS data locality can be a significant performance win.

You have batch processing workloads.

DFS is optimized for large sequential operations. Writing a 1 GB file sequentially is efficient. Scanning a 10 TB dataset in parallel is efficient. The large chunk size and append-oriented design assume this access pattern.

You need high aggregate throughput.

If many clients need to read large amounts of data simultaneously, DFS distributes load across many servers. A hundred-server cluster can sustain far more throughput than any single machine.

Choose Alternatives When

You need low-latency random access.

DFS is optimized for throughput, not latency. Reading a single record involves network round trips and potentially disk seeks. If your application needs millisecond-level random access to individual records, use a database or key-value store. HBase, Cassandra, or Redis will serve individual lookups much faster than HDFS or GFS.

You have many small files.

The "small files problem" is real. Each file requires metadata, and DFS metadata servers have limits. A million small files can consume more metadata capacity than a thousand large files with the same total data. If your workload involves millions of small files, consider object storage (which handles small objects better), aggregating files into larger archives, or a database.

You are running in the cloud.

Cloud object storage (S3, GCS, Azure Blob) provides many of the same benefits as DFS with less operational burden. You do not need to run NameNodes or monitor DataNodes. The cloud provider handles replication, availability, and scaling. For cloud-native applications, object storage is often the simpler choice.

You need POSIX semantics.

Most DFS implementations provide limited POSIX support. They may lack features like random writes, hard links, extended attributes, or file locking that applications expect from local file systems. If your applications require full POSIX compatibility, consider network file systems like NFS or cloud file storage like Amazon EFS.

You need ACID transactions.

File systems, distributed or not, do not provide transactions. If you need to update multiple files atomically, or read a consistent snapshot across files, you need a database. DFS provides file-level atomicity at best.

Decision Framework

This flowchart summarizes the decision process.

At smaller scales (under 10 TB), single-machine storage or managed cloud storage is simpler and sufficient. At larger scales, the access pattern matters. Sequential batch processing suits DFS. Random access needs databases. Mixed workloads in the cloud often work best with object storage that separates storage from compute.

Summary

Distributed file systems enable storing and processing data at scales impossible for single machines. The core patterns that make this possible appear throughout distributed storage systems.

Master-worker architecture separates metadata (small, frequently accessed) from data (large, less frequently accessed relative to size). This separation enables independent scaling. The master keeps metadata in memory for fast lookups while data remains distributed across storage nodes. Data flows directly between clients and storage nodes, preventing the master from becoming a bandwidth bottleneck.

Chunking breaks files into fixed-size pieces distributed across servers. Large chunks (64-256 MB) optimize for sequential throughput by amortizing seek time and reducing metadata overhead. The trade-off is inefficiency for small files and longer recovery times.

Replication provides fault tolerance and read bandwidth. The standard 3x replication survives two simultaneous failures, which is rare but happens at scale. Erasure coding offers a storage-efficient alternative at the cost of computation and complexity.

Consistency varies by implementation. GFS accepted relaxed consistency to achieve higher throughput for append-heavy workloads. HDFS restricts operations (single writer, append-only) to maintain stronger guarantees with simpler semantics.

Fault tolerance is built into every layer. Heartbeats detect server failures. Re-replication restores redundancy automatically. Checksums catch silent corruption. Background scrubbing finds problems before they matter.

When choosing storage for your system design, match the storage to your workload:

  • DFS (HDFS, Ceph): On-premise deployments, batch processing, data locality needed for compute frameworks
  • Object storage (S3, GCS): Cloud-native applications, variable scale, managed operations preferred
  • Databases: Low-latency random access, ACID transactions, structured queries
  • Network file systems (NFS, EFS): POSIX compatibility required, legacy application support

Understanding distributed file systems helps you reason about any distributed storage system. The patterns of chunking, replication, metadata separation, and failure handling appear in object stores, distributed databases, and even distributed caches. When you encounter a new storage system, ask: how does it handle metadata? How does it replicate? How does it recover from failures? The answers will map to patterns you already understand.
