Building A Distributed File System Inspired By Gfs: Master Architecture, Chunk Replication, And Leases

2019-05-14 · Leonardo Benicio

Summary

A comprehensive technical exploration of building a distributed file system inspired by gfs: master architecture, chunk replication, and leases, covering key concepts, practical implementations, and real-world applications.

Contents

Building a Distributed File System Inspired by GFS: A Deep Dive into Master Architecture, Chunk Replication, and Leases

Introduction: Why Build a Distributed File System Inspired by GFS?

Imagine you are responsible for storing the entire web. Every night, your system ingests hundreds of terabytes of new data – crawled pages, user uploads, log files – and processes it with thousands of machines running in parallel. The data must be accessible with high throughput, must survive disk failures that happen daily at scale, and must support workloads that are overwhelmingly append-heavy yet rarely modify existing files. This was the challenge Google faced in the early 2000s, and the solution they built – the Google File System (GFS) – became one of the most influential systems in the history of distributed computing.

GFS was not the first distributed file system, nor was it the most elegant. Its design made pragmatic, sometimes controversial, trade-offs: a single master with global knowledge, large chunk sizes (64 MB) that reduced metadata overhead, a shadow master for read-only availability, and a lease mechanism to coordinate concurrent writes. It assumed that hardware failures were the norm, not the exception. It treated 99.99% of reads as append operations. And it powered the infrastructure that made Google’s search, MapReduce, Bigtable, and countless internal services possible. Over time, the lessons from GFS trickled down into open-source projects like Hadoop Distributed File System (HDFS), Ceph, and Facebook’s Tectonic, and the ideas behind it now underpin nearly every large-scale storage system in existence.

Why, then, should you care about building your own distributed file system inspired by GFS? Because understanding one foundational system thoroughly unlocks a whole domain of distributed systems knowledge. By walking through the three core mechanisms – master architecture, chunk replication, and leases – you will confront the fundamental challenges that every distributed storage engineer faces: how to maintain consistency without sacrificing performance, how to replicate data efficiently across faulty hardware, and how to design a system that scales to thousands of machines while remaining operationally simple. Building your own DFS (even a toy version) forces you to think about failure modes, concurrency control, metadata management, and performance tuning in ways that reading about them never does.

In this blog post, we will go beyond the original GFS paper and dissect each of these mechanisms in detail. We’ll add concrete examples, pseudocode, and design decisions that you can use as a blueprint for your own implementation. By the end, you will have a thorough understanding of how GFS works under the hood, why its designers made the choices they did, and how those lessons apply to modern distributed storage systems. Let’s begin with the most controversial component: the single master.

1. The Single Master Architecture: Centralized Control Simplicity

1.1 Why a Single Master?

In a distributed file system, metadata management is perhaps the most critical design decision. Metadata includes namespace information (directory trees, file names, permissions), file-to-chunk mappings (which chunks belong to a file, where they are located), and cluster state (chunk server health, load, available capacity). GFS opted for a single master node that holds all metadata in memory and coordinates all operations. This is a radical choice: it introduces a single point of failure (SPOF) and a potential bottleneck. Why did Google do this?

The reasoning was twofold. First, metadata size is relatively small compared to data. With 64 MB chunks, each million files (each about 100 MB average) require roughly 100 GB of metadata – easily fit in the memory of a modest machine even in 2003. Second, consistency is vastly simplified when there is a single authoritative source for metadata. There is no need for distributed consensus protocols (like Paxos or Raft) to resolve conflicting views of the namespace. The master simply serialises all metadata changes via a single-threaded event loop. This allows the system to guarantee strong consistency for operations like file creation, deletion, and rename.

1.2 The Master’s Responsibilities

The master does not store file data; it only manages metadata and monitors chunk servers. Its key responsibilities are:

Namespace management: Operations on directory structure, file creation, deletion, rename, and symbolic links.
Chunk-to-server mapping: For each chunk, the master maintains a list of chunk servers that hold a replica. This mapping is updated via regular heartbeat messages from chunk servers.
Chunk creation and rebalancing: When a client needs to write a new chunk, the master assigns replicas to chunk servers based on load, disk space, and rack diversity. It also initiates replication when a chunk falls below its replication factor (e.g., due to a server failure).
Lease management: For each chunk, the master designates one replica as the primary (holder of the write lease) and the rest as secondaries. We will discuss leases in Section 3.
Garbage collection: Deleting a file does not immediately remove chunks; the master marks them for deletion and periodically cleans up orphaned chunks.
Snapshot: GFS supports efficient file or directory snapshots via copy-on-write.

All these operations are executed by the master’s single-threaded event loop, which ensures sequential consistency for metadata changes.

1.3 Metadata Data Structures

To appreciate the master’s memory footprint, let’s look at the main in-memory data structures. For each file, the master stores:

File name and path.
List of chunk handles (64-bit unique IDs).
Access control lists (optional).

For each chunk handle, it stores:

Version number (incremented on each write to detect stale replicas).
List of chunk server locations (IP addresses or hostnames).
Primary lease state (which server holds the lease and its expiration time).

Additionally, the master maintains a namespace lock table to allow concurrent operations on different parts of the directory tree. This is not a traditional lock but rather a keyed mutex used to ensure that two operations on the same path do not interleave.

Below is a simplified pseudocode representation of the metadata structures:

class Master:
    # namespace: dictionary mapping path -> FileInfo
    self.namespace = {}

    # chunk2servers: dictionary mapping chunk handle -> list of ChunkServerID
    self.chunk2servers = {}

    # chunk2version: dictionary mapping chunk handle -> int
    self.chunk2version = {}

    # lease_info: dictionary mapping chunk handle -> LeaseInfo
    # LeaseInfo contains primary_id, expiration_time
    self.lease_info = {}

    # chunk servers: list of ChunkServer objects with heartbeat status
    self.chunk_servers = {}

class FileInfo:
    path: str
    chunk_handles: list[int]
    creation_time: datetime
    permissions: str

1.4 Heartbeat and Chunk Server State

Every chunk server sends a Heartbeat message to the master every few seconds (default 2 seconds in GFS). The heartbeat includes:

List of all chunks stored on that server (with version numbers).
Current disk usage, network load, and number of free slots.

The master uses this information to build its mapping. If a chunk server fails to send heartbeats for a certain timeout (e.g., 5 seconds), the master marks it as dead and initiates re-replication of all chunks hosted on that server.

To avoid stale state, the master always assumes that a chunk server’s reported chunks are correct. However, if a chunk server crashes and recovers, it may have lost some chunks that were previously on its disk. In that case, the missing chunks will not appear in the heartbeat, so the master will detect a reduction in replicas and begin re-replication.

1.5 The Scalability Bottleneck

The single master is often cited as GFS’s biggest limitation. As the cluster grows, the master must handle more heartbeats, more metadata operations, and more chunk lease requests. Google later acknowledged that the master became a bottleneck for clusters beyond a few thousand nodes. The master’s CPU and memory can be scaled up, but not indefinitely. Moreover, the master is a single point of failure: if it goes down, the entire file system becomes unavailable until a new master is restored from the operation log.

GFS addressed this with a shadow master – a secondary master that maintains a replica of the metadata by replaying the same operation log. The shadow master can serve read-only operations (e.g., metadata lookups, chunk locations) even if the primary master fails. However, because the shadow master may be slightly behind the primary (due to log replication lag), it might return stale information. Write operations must always go through the primary.

Despite its drawbacks, the single-master design is simple enough that many systems adopted it – HDFS uses a single NameNode (now with a High Availability option), and Facebook’s Tectonic uses a similar centralized metadata service. The lesson is that for workloads where metadata is small relative to data, a single master provides excellent consistency with modest implementation complexity.

1.6 Operation Log and Recovery

Every metadata change (file create, chunk creation, lease assignment, etc.) is first appended to an operation log on the local disk of the master (and optionally replicated to remote disks). This log is the authoritative source of truth for the master’s state. The master can reconstruct its in-memory state by replaying the log from the last checkpoint.

To speed up recovery, the master periodically creates a checkpoint – a compact snapshot of the entire metadata state. When the master restarts after a crash, it loads the latest checkpoint and then replays only the log entries after that checkpoint. Since the log is appended sequentially, this process is fast. In practice, GFS masters recovered in under a minute.

2. Chunk Replication: Throughput, Fault Tolerance, and Locality

2.1 Why 64 MB Chunks?

One of GFS’s most distinctive design choices is the chunk size: 64 MB. In contrast, typical file systems use block sizes of 4 KB to 64 KB. Why so large?

There are three primary reasons:

Reduce metadata overhead: Each chunk corresponds to a 64-byte entry in the master’s memory. With a 64 MB chunk size, a 1 TB file needs only about 16,000 chunks, requiring around 1 MB of master memory. If chunks were 1 MB, the same file would need a million chunks – 64 MB of metadata. For a cluster storing petabytes, massive metadata would overwhelm the single master.
Locality of operations: Large chunks amortize the cost of contacting the master. When a client reads a file, it asks the master for the locations of the first chunk, then contacts the chunk server directly. If the chunk is 64 MB, the client can stream many bytes before needing to ask for the next chunk. This reduces master load.
Sequential access patterns: GFS was designed for MapReduce-style workloads that scan entire files sequentially. In sequential access, larger chunk sizes improve throughput because the disk can stream contiguous data without seeking. The penalty for random access within a chunk is higher, but that workload is rare in GFS applications.

However, large chunks have downsides. A small file (say 1 KB) still occupies a full 64 MB chunk on disk, wasting space. GFS handled this by storing the last chunk of a file as a partial chunk, but still the chunk server allocates the full 64 MB block. Google accepted this waste because the majority of files were large. For systems like HDFS, the default block size is 128 MB, and small files are a known problem.

2.2 Replication Factor: The 3-Copy Rule

GFS stores each chunk with a replication factor of 3 – three replicas on different chunk servers. Why 3? The standard argument: with three replicas, you can tolerate one failure and still have another replica for reads. But there is more nuance. If two replicas fail (a rare event with independent failures), the system becomes unsafe. In practice, Google found that three replicas provided adequate durability for their workloads while keeping storage overhead manageable.

The replication factor is configurable. Some GFS deployments used a factor of 2 for less critical data, and others used 4 for critical data. HDFS also defaults to 3.

2.3 Chunk Placement: Rack Awareness

To survive a rack failure (e.g., a network switch outage or power loss for an entire rack), GFS spreads replicas across racks. The default strategy is to place one replica in the same rack as the client, one in a different rack, and one in yet another rack. This ensures that even if an entire rack goes offline, there is at least one replica available elsewhere.

Placing replicas across racks also improves read locality for clients in the same rack. When a client requests a chunk, the master returns the location of a replica that is on the same rack (or as close as possible). This minimizes cross-rack traffic, which is often a bottleneck.

2.4 Data Locality for Computation

GFS was designed to work closely with MapReduce. The MapReduce scheduler can use the chunk location information from the master to launch tasks on the same machine that holds a replica of the input data. This principle, called data locality, avoids moving large data over the network. A map task reading a 64 MB chunk runs on the same chunk server or at least on the same rack, resulting in high throughput.

This co-location of compute and storage is a hallmark of the data-d31dd264 computing era. Modern distributed file systems like HDFS, with their own block placement, follow this pattern. Object stores like S3 separate compute from storage, but they pay the cost of network bandwidth.

2.5 Re-replication and Chunk Versioning

When a chunk server fails, the master detects the missing replicas (because the failed server stops sending heartbeats). The master then initiates re-replication: it selects a healthy chunk server that already holds a replica and instructs it to copy the chunk to a new chunk server. The target server is chosen based on load, disk space, and rack diversity.

During re-replication, there is a risk that the chunk being copied is stale – i.e., it was written before the last mutation. GFS uses chunk version numbers to prevent this. Every time a chunk is mutated, the master increments the version number and records it in its metadata. The master then ensures that all replicas have the same version. When a chunk server reports its chunks via heartbeat, it includes the version. If a replica’s version is lower than the master’s expected version, the master considers that replica stale and discards it (or marks it for deletion). This ensures that only current replicas are used for reads and re-replication.

If all replicas have the same version and one server fails, the remaining replicas are up-to-date, so re-replication works. But if the failed server held the only current replica (due to a previous unreplicated write), then the chunk is permanently lost – data loss. To mitigate this, GFS uses a minimal replication approach: it tries to keep the replication factor as close to the target as possible, but it does not immediately replicate on every small failure; it prioritizes chunks with the lowest replication count.

2.6 Chunk Creation: Write Pipeline

When a client wants to write to a file, it first asks the master for the location of the chunk or for a new chunk allocation. The master grants a lease to one replica (the primary) and returns the locations of all replicas (typically 3). The client then sends the data to all replicas in a pipeline manner. For example, the client pushes data to the nearest replica (e.g., the one on the same rack), which forwards it to the next, which forwards to the third. This pipelining reduces network overhead and leverages the full bandwidth of the network.

The pipeline is ordered sequentially, not in parallel. The reason is to avoid the out-of-order arrival problem: if the client sent data to all replicas simultaneously, they might receive the data at different times, causing inconsistencies. With a single chain, the order of bytes is deterministic.

We will discuss the details of the write operation and the role of leases in the next section.

3. Leases: Consistency and Concurrency Control

3.1 The Problem: Concurrent Writes

Distributed file systems face a classic problem: how to allow multiple clients to write to the same file concurrently while maintaining consistency? Traditional file systems use locking, but distributed locks are expensive and add latency. GFS needed a mechanism that allowed high throughput for append-heavy workloads without requiring all clients to coordinate with each other.

GFS’s solution is the lease mechanism. For each chunk, the master grants a lease to one chunk server, called the primary. The primary is responsible for serializing all mutations (writes and record appends) to that chunk. The lease has a timeout (typically 60 seconds). The primary can renew the lease by contacting the master.

3.2 How Leases Work

The write flow for a client is as follows:

Lease request: The client contacts the master to find the primary for the chunk (or to request a new lease if none exists). The master returns the identity of the primary and the locations of all replicas.
Data push: The client pushes the data to all replicas (primary and secondaries) using the pipeline described earlier. The replicas buffer the data in memory.
Write request: After the data is buffered, the client sends a write request to the primary. The request includes the offset and data location.
Serialization at primary: The primary serializes mutations in the order they are received. It assigns a sequence number to each mutation and applies them to its local chunk. It then forwards the mutation (with the sequence number) to all secondary replicas.
Secondary acknowledgment: Each secondary applies the mutation in order and sends an acknowledgment back to the primary.
Primary response: Once the primary receives acknowledgments from all secondaries (or a configurable quorum, in GFS’s case all replicas), it responds to the client with success. If any secondary fails (or times out), the primary reports failure, and the client may retry with a new lease.

The primary holds the lease for a fixed duration. If the primary fails before the lease expires, the master waits for the lease to expire and then grants a new lease to another replica. This clean timeout avoids split-brain scenarios.

3.3 Atomic Record Append

A unique feature of GFS is atomic record append, a special write operation that does not require the client to specify an offset. The client simply sends a record to a file, and GFS automatically appends it at an offset chosen by the system. The record is guaranteed to be written atomically at least once (if a failure occurs, it might be written more than once, but each occurrence will be a valid record). This is perfect for log-style applications like crawler output or transaction logs.

How does atomic record append work with leases? The primary, upon receiving an append request, chooses an offset (the current end of the chunk) and writes the record there. It then appends the record to its own copy, then to the secondaries in order. If the record fits in the current chunk, it succeeds; if not, the primary pads the chunk (or extends it) and notifies the client to retry on the next chunk. This guarantees that the record appears exactly at the chosen offset in all replicas.

Because the primary chooses the offset, concurrent appends are serialized. However, if the primary crashes after writing to some replicas but not all, the record may appear duplicated or at different offsets. GFS’s consistency model (Section 4) accounts for this.

3.4 Lease Renewal and Master Failover

Leases must be renewed periodically to keep the primary authoritative. The master sets a lease expiration time when granting it. The primary can send a LeaseRenew request to the master before expiry. If the master is unavailable (e.g., during failover), the lease expires naturally, and after a grace period, the new master can grant a new lease to a different replica.

In the event of the master failing, the shadow master (which has been replaying the operation log) takes over. Since the shadow master may not have the most recent lease state (because lease grants are recorded in the operation log, but the log may lag), it must be conservative: it waits long enough for any old lease to expire before granting new ones. This ensures no two primaries exist simultaneously, even during a master failover.

3.5 Consistency Guarantees

With leases, GFS provides a relaxed consistency model for regular writes. After a successful write, all replicas are identical (consistent). If a write fails (e.g., because a secondary did not respond), the file region might be inconsistent – different replicas have different data. The client must retry to achieve consistency. This is acceptable for applications that can handle duplicates or missing data (e.g., MapReduce jobs that are idempotent).

For atomic record append, the guarantee is different: the record is appended atomically at the offset chosen by the primary, but across replicas the record may appear in different places if the primary fails mid-append. The system guarantees that each record is stored at least once and is not interleaved with other records. Applications must be prepared for duplicates but never for partial records.

4. Consistency Model: What You Can Rely On

4.1 Terminology

GFS defines several levels of consistency for file regions:

Consistent: All clients see the same data, regardless of which replica they read from.
Defined: A region is consistent AND also reflects the last mutation (i.e., it is consistent with the latest write).
Inconsistent (undefined): Different clients may see different data, usually due to failed writes.

After a successful write, the region is defined. After a concurrent write that succeeds (i.e., all replicas acknowledged), the region is also defined because the primary serialized the writes. However, if the system had concurrent appends that succeeded, the region may contain interleaved records from different clients, but since each append is atomic, clients see a defined state (the exact interleaving is undefined but all replicas agree on the same byte sequence).

Failures lead to inconsistent regions. For example, if a primary crashes after writing to only two of three secondaries, the third secondary lacks the data. Subsequent reads from that secondary will return stale data. GFS handles this by allowing the master to detect stale replicas via version numbers and exclude them from read responses.

4.2 Impact on Applications

GFS designers explicitly decided that application-level correctness should not depend on the file system providing strong consistency. Instead, applications like MapReduce are designed to be idempotent and to handle duplicates. For example, a web crawler stores pages as records using atomic record append. If a record appears twice, the crawler can deduplicate later based on URL checksums.

This design philosophy – shifting complexity to the application layer – is common in distributed systems. It reduces the file system’s implementation complexity and allows higher throughput. However, it places a burden on application developers to cope with duplicates, missing data, and stale reads.

4.3 Consistency for Directory Operations

Directory operations (create, rename, delete) are handled by the master with strong consistency because the master serializes all metadata changes. Thus, after a successful rename, all subsequent lookups find the new name. There is no inconsistency for namespace operations.

5. Fault Tolerance: Surviving Disasters

5.1 Master Fault Tolerance

The single master is a critical component. GFS achieves fault tolerance for the master through two mechanisms:

Operation log replication: Every metadata change is written to the local disk and also to a remote machine’s disk. If the local disk fails, the operation log can be recovered from the remote copy.
Shadow master: A separate process that replays the same operation log. It can serve read-only metadata requests when the primary master is down. During a failover, the shadow master becomes the new primary after ensuring that all old leases have expired.

Recovery time depends on the amount of log to replay. With periodic checkpoints, recovery takes seconds.

5.2 Chunk Server Fault Tolerance

Chunk server failure is handled automatically. The master detects the failure via missing heartbeats, then schedules re-replication of all chunks hosted on that server. Placement ensures that the new replica is on a different rack to maintain diversity. The replication process is asynchronous; the system continues to serve reads from the remaining replicas.

If a chunk server loses a specific chunk (e.g., disk corruption), it will report that chunk in its heartbeat with a lower version number or not at all. The master then initiates re-replication.

5.3 Data Integrity: Checksums

GFS protects against data corruption by using checksums on each chunk. Chunks are divided into 64 KB blocks, each with a 32-bit checksum. When a client reads a chunk, the chunk server verifies the checksums of the blocks that were read. If corruption is detected, the chunk server reports an error to the client and returns a different replica (if available). The master then re-replicates the chunk from a healthy replica.

Checksums are verified at read time, not during writes, to avoid performance overhead. This trade-off is acceptable because write errors are rare compared to read volume.

5.4 Garbage Collection

When a file is deleted, the master does not immediately remove the chunk metadata. Instead, it renames the file to a hidden name with a deletion timestamp. Later, a background garbage collector scans the namespace, removes chunks of files that have been deleted for more than a configurable duration (e.g., 3 days). This lazy approach avoids the complexity of dealing with open file handles and allows time for recovery if a deletion was accidental.

6. Performance and Scalability Considerations

6.1 Master Bottlenecks

The master’s ability to serve metadata requests is limited by its CPU and memory. GFS benchmarks showed that a master with 1 GB of memory could manage about 10,000 chunks per second in metadata operations. For a cluster with 1000 chunk servers and each server holding 10,000 chunks, the master would need to handle 10 million chunks – within its memory capacity (64 MB per chunk * 10M = 640 MB for chunk locations plus overhead). However, heartbeat traffic alone (every chunk server sends a list of its chunks every few seconds) can saturate the master’s network bandwidth or CPU.

GFS mitigated this by having the master handle only changes in chunk server state, not the full list each heartbeat. It uses a lease-based protocol for chunk server state replication: the master knows the set of replicas for each chunk, and the chunk servers report only when a chunk changes (e.g., due to re-replication). This reduces heartbeat overhead.

6.2 Network Throughput

The write pipeline (data pushed sequentially from client to primary to secondary) optimizes network utilization. Since data is forwarded hop-by-hop, the entire bandwidth of the network can be used without overwhelming any single link. The replication factor of 3 means that each write consumes 3 times the data size in network traffic. For a cluster with 10 Gbps links, this is manageable.

Reads are optimized by choosing the closest replica (same rack). This avoids unnecessary cross-rack traffic.

6.3 Scalability to Thousands of Nodes

GFS was designed for clusters of 1000 to 5000 nodes. Beyond that, the single master becomes a bottleneck. Google later developed Colossus, the successor to GFS, which replaced the single master with a distributed metadata service. Modern file systems like Ceph and GlusterFS use fully distributed metadata to scale to larger clusters.

Nevertheless, GFS’s success at Google’s scale validated the single-master approach for many real-world deployments. HDFS, for example, followed the same pattern and is widely used in clusters of thousands of nodes.

7. Comparisons with Other Systems

7.1 HDFS

Hadoop Distributed File System (HDFS) is directly inspired by GFS. Its architecture includes a single NameNode (master) and DataNodes (chunk servers). Block size is 128 MB (double GFS’s 64 MB). Replication factor is 3. HDFS added a High Availability mode with an active-standby NameNode pair using ZooKeeper for failover. The write pipeline, lease mechanism, and checksumming are very similar.

One difference: HDFS does not support atomic record append natively (though it has a client-side append capability). HDFS also has a more sophisticated namespace structure (permissions, quotas). Overall, HDFS is the most direct descendant of GFS.

7.2 Ceph

Ceph takes a different approach. It uses a distributed metadata cluster (MDS) and a dynamic data placement algorithm (CRUSH). There is no single master for metadata; instead, metadata is sharded across multiple servers. Ceph’s flexibility allows it to scale to exabytes. Its consistency model is stronger (eventual consistency with optional strong consistency), but it is more complex to configure and operate.

7.3 Facebook’s Tectonic

Facebook’s Tectonic file system (also known as FDS) was built for storing billions of photos and videos. It uses a centralized metadata service (Metadata Store) that is sharded across many nodes, partially inspired by GFS’s single master but without the bottleneck. Tectonic uses 16 MB chunks and erasure coding for durability, reducing storage overhead. Its lease-based replication is similar to GFS.

8. Modern Relevance: Cloud Object Stores and Beyond

Many ideas from GFS appear in modern cloud object stores such as Amazon S3, Google Cloud Storage (which is based on Colossus), and Azure Blob Storage. Object stores offer a flat namespace (buckets) rather than a hierarchical file system, but they still use chunked storage with replication/erasure coding. The concept of a primary lease for consistency is used in some form (e.g., S3’s strong consistency for new objects but eventual for overwrites – though this changed in 2020). The design of a metadata service that tracks data locations and manages replication is universal.

For developers building custom storage systems, GFS provides a proven blueprint. Even if you don’t implement the entire system, studying its trade-offs helps you design your own fault-tolerant, high-throughput distributed storage layer.

9. Building Your Own DFS: Practical Advice

If you decide to implement a simplified version of GFS (e.g., as a learning project), start with:

Master: Implement a single-threaded event loop that accepts RPCs for file creation, read, write, and append. Use an in-memory map for metadata, and persist it to an operation log.
Chunk server: Store chunks as flat files on disk. Handle heartbeats, data push, and write forwarding. Implement checksums on 64 KB blocks.
Leases: Add a simple grant/renew/expire mechanism. Use a lock to ensure only one primary per chunk.
Client library: Provide an API for file read/write/append with automatic retries and location caching.

Keep the scope small: only support writes at the chunk level (no partial block writes), and allow only one client at a time initially. Then add concurrency and leasing.

10. Conclusion

The Google File System was a landmark achievement in distributed storage. By making pragmatic trade-offs – a single master, large chunks, leases for consistency, and relaxed consistency at the application level – it achieved high throughput, fault tolerance, and operational simplicity at a scale previously thought impossible. Its influence pervades modern storage systems, from Hadoop to cloud object stores.

Building your own DFS inspired by GFS is more than an academic exercise. It forces you to confront the real challenges of building a distributed system: failures are not rare, consistency is hard, and performance demands clever design. By understanding master architecture, chunk replication, and leases in depth, you gain mental models that you can apply to any distributed system – databases, message queues, or coordination services.

So go ahead: open a code editor, write a master that accepts heartbeats, a chunk server that stores files, and a client that can write and append. You’ll learn more in a weekend than reading ten papers. And when you encounter a production problem in your day job – like a node failure causing data loss, or a consistency bug in a multi-writer scenario – you’ll remember the principles of GFS and know exactly what to do.

This article was written to provide a deep, practical understanding of GFS for engineers and students alike. Code examples are written in a generic pseudecode; real implementations would use RPC frameworks like gRPC or Apache Thrift.