A Distributed Locking Service is a coordination mechanism used in distributed systems to ensure that only one process or node at a time can access or modify a shared resource, even when that resource is spread across multiple machines or data centers.
In large-scale systems, this is essential to maintain consistency and prevent conflicts when multiple components operate concurrently.
Distributed locking is crucial for scenarios such as job schedulers, where it prevents the same job from running multiple times, and database writes, where it avoids race conditions during concurrent updates.
It also matters for shared file systems, ensuring only one node updates a file at a time, and for distributed transactions, enforcing exclusive access to critical resources.
In this chapter, we’ll explore how to design a Distributed Locking Service discussing multiple design approaches used in real-world systems.
Lets start by clarifying the requirements.
A system design interview should always start with a conversation to scope the problem. Here’s an example of how a candidate-interviewer discussion might flow:
Interviewer: "Let's design a distributed locking service"
Candidate: "Okay. The primary goal is to provide mutual exclusion. A client should be able to acquire a lock for a specific resource, and no other client should be able to acquire the same lock until it's released. What happens if a client acquires a lock and then crashes?"
Interviewer: "That's a critical point. The system must handle that. A crashed client can't hold a lock indefinitely."
Candidate: "That implies locks must have a timeout or a lease. We should also consider high availability and fault tolerance. The lock service itself can't be a single point of failure."
After gathering the details, we can summarize the key system requirements.
Building a correct distributed locking service is notoriously difficult due to the inherent challenges of distributed systems.
Lets say, the system has two lock nodes, Lock Node 1 and Lock Node 2, that coordinate with each other to maintain a consistent lock state across the cluster. When Client A acquires a lock through Lock Node 1, all nodes agree that the resource is locked, preventing others from acquiring it.
Due to a network failure, Partition A and Partition B lose communication. The cluster is now divided into two isolated groups that can’t exchange state or heartbeat messages.
In Partition A, Lock Node 1 still believes Client A holds the lock. Meanwhile, in Partition B, Lock Node 2 doesn’t see Client A’s lock and may allow Client B to acquire the same lock.
This results in two clients holding the same lock simultaneously, violating mutual exclusion. This situation is known as a split-brain scenario.
If not designed carefully, a temporary network glitch can cause data corruption, double updates, or duplicate job executions.
To overcome above challenges, our design must prioritize four key goals: