A Distributed Locking Service is a coordination mechanism used in distributed systems to ensure that only one process or node at a time can access or modify a shared resource, even when that resource is spread across multiple machines or data centers.
In large-scale systems, this is essential to maintain consistency and prevent conflicts when multiple components operate concurrently.
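At its core, the service exposes a tiny acquire/release contract. The sketch below is illustrative, not a real implementation: the `LockService` name and methods are assumptions, and a single in-process dictionary stands in for lock state that a real service would replicate across machines.

```python
import threading

class LockService:
    """Minimal sketch of a lock service's client-facing contract.

    A real distributed implementation replicates this state across nodes;
    here one in-memory dict stands in for the cluster's lock table.
    """

    def __init__(self):
        self._locks = {}               # resource name -> owning client id
        self._mutex = threading.Lock() # protects the lock table

    def acquire(self, resource, client_id):
        """Grant the lock only if no one else holds it (mutual exclusion)."""
        with self._mutex:
            if resource in self._locks:
                return False
            self._locks[resource] = client_id
            return True

    def release(self, resource, client_id):
        """Only the current owner may release the lock."""
        with self._mutex:
            if self._locks.get(resource) == client_id:
                del self._locks[resource]
                return True
            return False
```

A client that fails to acquire the lock typically retries with backoff or waits for a notification, depending on the design.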
Distributed locking is crucial for scenarios such as ensuring a scheduled job runs on only one worker at a time, preventing double updates to a shared record, and coordinating writes to shared resources.
In this chapter, we’ll explore how to design a Distributed Locking Service, discussing multiple design approaches used in real-world systems.
Let’s start by clarifying the requirements.
A system design interview should always start with a conversation to scope the problem. Here’s an example of how a candidate–interviewer discussion might flow:
Interviewer: "Let's design a distributed locking service."
Candidate: "Okay. The primary goal is to provide mutual exclusion. A client should be able to acquire a lock for a specific resource, and no other client should be able to acquire the same lock until it's released. What happens if a client acquires a lock and then crashes?"
Interviewer: "That's a critical point. The system must handle that. A crashed client can't hold a lock indefinitely."
Candidate: "That implies locks must have a timeout or a lease. We should also consider high availability and fault tolerance. The lock service itself can't be a single point of failure."
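The lease idea from the dialogue can be sketched in a few lines. The `LeaseLock` name and its parameters are hypothetical; an injectable `now` timestamp stands in for a real clock so the expiry behavior is easy to follow.

```python
import time

class LeaseLock:
    """Sketch of lease-based locking: the lock expires after ttl seconds,
    so a crashed client cannot hold it forever."""

    def __init__(self, ttl):
        self.ttl = ttl          # lease duration in seconds
        self._owner = None      # client currently holding the lock
        self._expires_at = 0.0  # timestamp when the lease lapses

    def acquire(self, client_id, now=None):
        """Grant the lock if it is free OR the previous lease has expired."""
        now = time.monotonic() if now is None else now
        if self._owner is None or now >= self._expires_at:
            self._owner = client_id
            self._expires_at = now + self.ttl
            return True
        return False

    def release(self, client_id):
        """Voluntary release by the current owner."""
        if self._owner == client_id:
            self._owner = None
            return True
        return False
```

In practice a live client periodically renews (extends) its lease; if it crashes, the renewals stop and the lock becomes available once the lease lapses.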
After gathering the details, we can summarize the key system requirements.
Building a correct distributed locking service is notoriously difficult due to the inherent challenges of distributed systems.
Let’s say the system has two lock nodes, Lock Node 1 and Lock Node 2, that coordinate with each other to maintain a consistent lock state across the cluster. When Client A acquires a lock through Lock Node 1, all nodes agree that the resource is locked, preventing others from acquiring it.
Due to a network failure, Partition A and Partition B lose communication. The cluster is now divided into two isolated groups that can’t exchange state or heartbeat messages.
In Partition A, Lock Node 1 still believes Client A holds the lock. Meanwhile, in Partition B, Lock Node 2 doesn’t see Client A’s lock and may allow Client B to acquire the same lock.
This results in two clients holding the same lock simultaneously, violating mutual exclusion. This situation is known as a split-brain scenario.
If not designed carefully, a temporary network glitch can cause data corruption, double updates, or duplicate job executions.
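The failure mode above can be reproduced with a toy simulation. The `LockNode` class is illustrative, and the partition is modeled as a replication message that never reaches the second node:

```python
class LockNode:
    """One lock node's purely local view of lock state (simplified)."""

    def __init__(self, name):
        self.name = name
        self.locks = {}  # resource -> client id, as seen by this node only

    def acquire(self, resource, client_id):
        """Grant based only on local state -- no cross-node coordination."""
        if resource in self.locks:
            return False
        self.locks[resource] = client_id
        return True

node1 = LockNode("Lock Node 1")
node2 = LockNode("Lock Node 2")

# Client A acquires the lock through Lock Node 1, but the network
# partition drops the replication message before it reaches Lock Node 2.
node1.acquire("shared-resource", "Client A")

# In Partition B, Lock Node 2 has no record of the lock and grants
# the very same resource to Client B.
node2.acquire("shared-resource", "Client B")

# Split brain: each partition believes a different client owns the lock.
```

Preventing this is exactly why real designs require a quorum of nodes to agree before granting a lock, so a minority partition cannot hand out locks on its own.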
To overcome these challenges, our design must prioritize four key goals: