What is a Distributed Locking Service?

A Distributed Locking Service is a coordination mechanism used in distributed systems to ensure that only one process or node at a time can access or modify a shared resource, even when that resource is spread across multiple machines or data centers.

In large-scale systems, this is essential to maintain consistency and prevent conflicts when multiple components operate concurrently.

Distributed locking is crucial for scenarios such as job schedulers, where it prevents the same job from running multiple times, and database writes, where it avoids race conditions during concurrent updates.

It also matters for shared file systems, ensuring only one node updates a file at a time, and for distributed transactions, enforcing exclusive access to critical resources.

In this chapter, we’ll explore how to design a Distributed Locking Service discussing multiple design approaches used in real-world systems.

Lets start by clarifying the requirements.

1. Clarifying Requirements

A system design interview should always start with a conversation to scope the problem. Here’s an example of how a candidate-interviewer discussion might flow:

Discussion

Interviewer: "Let's design a distributed locking service"

Candidate: "Okay. The primary goal is to provide mutual exclusion. A client should be able to acquire a lock for a specific resource, and no other client should be able to acquire the same lock until it's released. What happens if a client acquires a lock and then crashes?"

Interviewer: "That's a critical point. The system must handle that. A crashed client can't hold a lock indefinitely."

Candidate: "That implies locks must have a timeout or a lease. We should also consider high availability and fault tolerance. The lock service itself can't be a single point of failure."

After gathering the details, we can summarize the key system requirements.

Functional Requirements

Acquire Lock: A client can request a lock on a resource. The call should be blocking or return immediately if the lock is unavailable.
Release Lock: A client can explicitly release a lock it holds.
Lock with TTL (Lease): Locks must have an associated Time-to-Live (TTL) to prevent deadlocks from client crashes. The lock is automatically released after the TTL expires.
Lock Renewal: A client holding a lock should be able to extend its lease.

Non-Functional Requirements

High Availability: The service must remain operational even if some of its nodes fail.
Correctness: Under no circumstances should two clients believe they hold the same lock.
Low Latency: Acquiring and releasing locks should be fast.
Deadlock Freedom: The system should never enter a state where locks can’t be acquired again.

2. Key Challenges and Design Goals

Challenges

Building a correct distributed locking service is notoriously difficult due to the inherent challenges of distributed systems.

Network Partitions: The network can split, isolating nodes. During a partition, two different parts of the cluster might think the other has failed, potentially leading to two clients being granted the same lock (split-brain).
Process Crashes: A client holding a lock can crash. The system must have a way to detect this and release the lock.
Clock Drift: Nodes' clocks are never perfectly synchronized. Relying on timestamps for lease expiration can be risky if clocks drift significantly.
Race Conditions: Multiple clients might try to acquire a lock at the exact same moment. The service must serialize these requests correctly.

Example

Lets say, the system has two lock nodes, Lock Node 1 and Lock Node 2, that coordinate with each other to maintain a consistent lock state across the cluster. When Client A acquires a lock through Lock Node 1, all nodes agree that the resource is locked, preventing others from acquiring it.

Due to a network failure, Partition A and Partition B lose communication. The cluster is now divided into two isolated groups that can’t exchange state or heartbeat messages.

In Partition A, Lock Node 1 still believes Client A holds the lock. Meanwhile, in Partition B, Lock Node 2 doesn’t see Client A’s lock and may allow Client B to acquire the same lock.

This results in two clients holding the same lock simultaneously, violating mutual exclusion. This situation is known as a split-brain scenario.

If not designed carefully, a temporary network glitch can cause data corruption, double updates, or duplicate job executions.

Design Goals

To overcome above challenges, our design must prioritize four key goals:

Safety (Correctness): At any given moment, only one client can hold a lock for a given resource.
Liveness (Deadlock Freedom): All clients must eventually be able to acquire a lock (as long as it's not held forever). A crashed client must not be able to hold a lock indefinitely.
Fault Tolerance: The service must remain available and correct as long as a majority of its servers are running and can communicate.

3. API Design

Premium Content

This content is for premium members only.

Design Distributed Locking Service