Failure Detection and Heartbeats

Ashish Pratap Singh

In distributed systems, nodes fail all the time. Servers crash, networks partition, processes hang. The question isn't if failures happen, but how quickly you can detect them.

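Consider a database cluster with three replicas. The primary node crashes. If your system takes 5 minutes to detect this failure, you have at least 5 minutes of downtime, because failover cannot begin until the failure is noticed. If it detects the failure in 5 seconds, downtime can be as short as 5 seconds plus the time the failover itself takes.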

Failure detection is the foundation of fault tolerance. Without it, you can't trigger failovers, rebalance load, or alert operators. Every highly available system depends on some form of failure detection.

In this chapter, we'll explore:

  • What is failure detection?
  • Why is it hard?
  • How heartbeats work
  • Different failure detection strategies
  • The trade-offs you must consider
  • Real-world implementations

Problems Where This Pattern is Useful

Failure detection and heartbeats appear in many system design interview problems:

| Problem | How Failure Detection is Used |
|---|---|
| Distributed Database | Detecting failed replicas, triggering leader election |
| Load Balancer | Removing unhealthy servers from rotation |
| Message Queue | Detecting dead consumers, reassigning partitions |
| Distributed Cache | Detecting failed nodes, rebalancing data |
| Service Discovery | Marking services as unhealthy, updating registry |
| Coordination Service | Leader election, distributed locking |
| Container Orchestration | Restarting failed containers, rescheduling pods |
| Chat/Messaging System | Detecting offline users, presence indicators |

When interviewers ask "How would you handle node failures?", they expect you to discuss heartbeats, timeouts, and the trade-offs involved.
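
To make the heartbeat-and-timeout idea concrete, here is a minimal Python sketch of a timeout-based detector. It is an illustration under stated assumptions, not a production design: the class name `HeartbeatMonitor`, the `timeout_s` parameter, and the node IDs are hypothetical, and the clock is injectable only so the behavior can be exercised without real waiting.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Hypothetical timeout-based failure detector.

    A node is suspected of having failed once no heartbeat has been
    recorded for it for longer than `timeout_s` seconds.
    """

    def __init__(self, timeout_s: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so a fake clock can be used
        self.last_seen: Dict[str, float] = {}

    def record_heartbeat(self, node_id: str) -> None:
        """Call whenever a heartbeat message arrives from `node_id`."""
        self.last_seen[node_id] = self.clock()

    def suspected_nodes(self) -> List[str]:
        """Return the IDs of nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            node_id
            for node_id, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


if __name__ == "__main__":
    # Simulated time: the primary stops sending heartbeats after t=0.
    fake_time = [0.0]
    monitor = HeartbeatMonitor(timeout_s=5.0, clock=lambda: fake_time[0])
    monitor.record_heartbeat("primary")
    monitor.record_heartbeat("replica-1")

    fake_time[0] = 4.0
    monitor.record_heartbeat("replica-1")  # replica-1 is still alive

    fake_time[0] = 6.0
    print(monitor.suspected_nodes())
```

The single `timeout_s` value carries the main trade-off: a shorter timeout detects real crashes sooner, but it also raises the chance of suspecting a node that is merely slow or briefly unreachable.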

1. What is Failure Detection?
