Last Updated: December 29, 2025
A Load Balancer is a critical infrastructure component that distributes incoming network traffic across multiple servers to ensure no single server becomes overwhelmed.
The core idea is straightforward: instead of routing all traffic to one server (which would eventually crash under heavy load), a load balancer acts as a traffic cop, intelligently spreading requests across a pool of healthy servers. This improves application availability, responsiveness, and overall system reliability.
Popular Examples: AWS Elastic Load Balancer (ELB), NGINX, HAProxy
Load balancers appear in almost every distributed system architecture. Understanding how to design one from scratch demonstrates deep knowledge of networking, high availability, and system scalability.
In this chapter, we will explore the high-level design of a Load Balancer.
Let's start by clarifying the requirements.
Before starting the design, it's important to ask thoughtful questions to uncover hidden assumptions, clarify ambiguities, and define the system's scope more precisely.
Here is an example of how a discussion between the candidate and the interviewer might unfold:
Candidate: "What type of traffic should the load balancer handle? Are we focusing on HTTP/HTTPS traffic, or should it support any TCP/UDP traffic?"
Interviewer: "Let's design a general-purpose load balancer that supports both Layer 4 (TCP/UDP) and Layer 7 (HTTP/HTTPS) load balancing."
Candidate: "What is the expected scale? How many requests per second should the system handle?"
Interviewer: "The load balancer should handle up to 1 million requests per second at peak traffic."
Candidate: "How should we handle server failures? Should the load balancer automatically detect and route around unhealthy servers?"
Interviewer: "Yes, health checking is critical. The system should detect failures within seconds and stop routing traffic to unhealthy servers."
Candidate: "Do we need to support session persistence, where requests from the same client go to the same backend server?"
Interviewer: "Yes, sticky sessions should be supported for stateful applications, but it should be configurable."
Candidate: "What about SSL/TLS termination? Should the load balancer handle encryption?"
Interviewer: "Yes, SSL termination at the load balancer is required to offload encryption work from backend servers."
Candidate: "What availability target should we aim for?"
Interviewer: "The load balancer itself must be highly available with 99.99% uptime, since it's on the critical path for all traffic."
After gathering the details, we can summarize the key system requirements.
Beyond features, we need to consider the qualities that make the system production-ready:
Before diving into the design, let's understand the scale we are working with. These numbers will guide our architectural decisions, particularly around memory management and network capacity.
Starting with the numbers from our requirements:
Our target is 1 million requests per second at peak. Traffic is rarely uniform, so let's assume the average is about one-third of peak:
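Peak: 1,000,000 requests/second. Average: roughly 1,000,000 ÷ 3 ≈ 333,000 requests/second.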
How many connections are open at any given time? This depends on connection duration. If the average request takes 500ms to complete (including network round trip and server processing):
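Concurrent connections ≈ request rate × average duration = 1,000,000 requests/second × 0.5 s ≈ 500,000 connections at peak.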
Half a million concurrent connections is substantial but manageable with modern hardware.
Network bandwidth is often the first bottleneck at high throughput. Let's calculate what we need:
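Assuming an average of roughly 5 KB transferred per request (request plus response; the requirements do not specify payload size, so this is an assumption): 1,000,000 requests/second × 5 KB × 8 bits/byte ≈ 40 Gbps at peak.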
This is significant. A single 10 Gbps network card would not be enough. We need either multiple NICs, faster networking (25/40/100 Gbps), or multiple load balancer nodes to distribute the bandwidth requirement.
Health checks generate their own traffic. With 1,000 backend servers and checks every 5 seconds:
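1,000 backends ÷ 5 seconds = 200 health checks per second, each typically only a few hundred bytes.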
This is negligible compared to user traffic. Health checking will not be a bottleneck.
Each active connection needs state tracking in memory:
| Data | Size |
|---|---|
| Source IP + port | 6 bytes |
| Destination IP + port | 6 bytes |
| Connection state | 4 bytes |
| Timestamps | 16 bytes |
| Protocol-specific data | ~200 bytes |
| Buffer space | ~250 bytes |
| Total per connection | ~500 bytes |
For 500,000 concurrent connections:
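500,000 connections × ~500 bytes ≈ 250 MB of connection-tracking memory.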
250 MB is quite modest. Memory will not be our bottleneck. The bigger memory consideration is SSL termination, where session caches and certificate storage can consume several gigabytes.
These estimates tell us several important things about our design:
A load balancer has two distinct interfaces. The data plane handles actual traffic (this is where millions of requests flow through), and the control plane handles configuration and management (registering backends, changing algorithms, viewing health status).
The data plane is implicit since it just forwards network traffic. The control plane needs explicit APIs for operators to manage the system. Let's define those.
POST /backends

When you spin up a new application server, you need to tell the load balancer about it. This endpoint adds a backend to the pool and starts health checking it.
| Parameter | Type | Required | Description |
|---|---|---|---|
| address | string | Yes | IP address or hostname of the backend server |
| port | integer | Yes | Port the backend is listening on |
| weight | integer | No | Weight for weighted load balancing (default: 1) |
| health_check_path | string | No | HTTP path for health checks (default: /health) |
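For example, registering a backend might look like the following sketch (the admin host and exact response fields are illustrative, not defined by the API above):

```python
import requests

resp = requests.post(
    "http://lb-admin.internal:8080/backends",   # hypothetical control-plane address
    json={"address": "10.0.1.15", "port": 8080, "weight": 2, "health_check_path": "/health"},
)
print(resp.status_code)  # a 2xx code on success; error cases are listed below
print(resp.json())       # e.g., {"backend_id": "backend-a1b2c3", "status": "unknown", ...}
```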
The initial status is "unknown" because we have not run a health check yet. Within a few seconds, it will transition to "healthy" or "unhealthy."
| Status | Meaning |
|---|---|
| 400 Bad Request | Invalid address format or port out of range |
| 409 Conflict | A backend with this address:port already exists |
DELETE /backends/{backend_id}

When you want to decommission a server (for maintenance, scaling down, or replacement), this endpoint removes it from the pool. The load balancer will stop sending new traffic immediately, but existing connections are allowed to complete gracefully.
Path Parameter: backend_id - the ID returned when the backend was registered
The drained_connections field tells you how many connections were in progress when the backend was removed.
| Status | Meaning |
|---|---|
| 404 Not Found | No backend exists with this ID |
GET /backends/{backend_id}/health

Returns detailed health information about a specific backend, useful for debugging and monitoring dashboards.
The response includes both current state (status, active connections) and historical metrics (total requests, failures) to help operators understand backend performance over time.
PUT /config/algorithm

Changes how traffic is distributed across backends. This takes effect immediately for new connections (existing connections are not affected).
| Parameter | Type | Required | Description |
|---|---|---|---|
| algorithm | string | Yes | One of: round_robin, weighted_round_robin, least_connections, ip_hash, random |
| sticky_sessions | boolean | No | Enable session persistence (default: false) |
| sticky_ttl_seconds | integer | No | How long sticky sessions last (default: 3600) |
| Status | Meaning |
|---|---|
| 400 Bad Request | Unknown algorithm name or invalid TTL value |
Now we get to the interesting part: designing the system architecture. Rather than presenting a complex diagram upfront, we will build the design incrementally, starting with the simplest solution and adding components as we encounter challenges. This approach mirrors how you would tackle this problem in an interview.
Our load balancer needs to do three fundamental things: accept client traffic and route it to backend servers, continuously monitor backend health so it can route around failures, and remain highly available itself so it never becomes the weak link.
The architecture naturally splits into two parts: the data plane that handles actual traffic at high speed, and the control plane that manages configuration and health checking at a slower pace.
The data plane must be fast since every request flows through it. The control plane can be slower since configuration changes and health checks happen infrequently compared to request traffic.
Let's build this architecture step by step, starting with the most basic requirement: getting traffic from clients to backends.
The most basic job of a load balancer is accepting connections from clients and forwarding them to backend servers. This sounds simple, but there are several components involved, and understanding how they work together is key to designing a good system.
Let's introduce the components we need.
This is the entry point for all client traffic. The frontend listener binds to one or more ports (typically 80 for HTTP, 443 for HTTPS) and accepts incoming TCP connections.
When a connection arrives, the listener needs to:
In high-performance load balancers like NGINX or HAProxy, this component is optimized to handle hundreds of thousands of connections efficiently using techniques like epoll (on Linux) or kqueue (on BSD).
This is the brain of the load balancer. Given a connection, the routing engine decides which backend server should handle it.
The routing engine maintains:
When the frontend listener hands off a connection, the routing engine applies the configured algorithm (round robin, least connections, etc.) and returns the selected backend.
A backend pool is a logical group of servers that can handle the same type of traffic. In simple setups, you might have just one pool. In more complex architectures, you might have separate pools for different services (one for API servers, another for static content, etc.).
Each pool contains:
Let's trace a request through the system. The client sends a request to the load balancer's public address (e.g., https://api.example.com). The frontend listener accepts the connection, the routing engine applies the configured algorithm to pick a backend from the pool, and the request is forwarded to that backend. The response flows back through the load balancer to the client.

At this point, we have basic load balancing working. But what happens when Backend 2 crashes? Without health checking, the load balancer would keep sending traffic to it, and users would see errors. Let's address that next.
Servers fail. It is not a matter of if, but when. A process might crash, a disk might fill up, or a network partition might isolate a server. Without health checking, the load balancer would blindly keep sending traffic to dead servers, and users would see errors.
Health monitoring solves this by continuously checking each backend and automatically routing around failures.
We need a new component: the Health Checker. This is a background process that periodically probes each backend server to verify it is working correctly.
The health checker is responsible for:
Different applications need different types of health checks. A basic TCP check might be enough for some services, while others need a full HTTP request to verify the application is actually working.
| Type | How It Works | When to Use |
|---|---|---|
| TCP Check | Opens a TCP connection and closes it | Basic connectivity verification. Fast but does not verify the application is working. |
| HTTP Check | Sends an HTTP request (GET /health), expects 2xx response | Web applications. Verifies the app can handle requests. |
| Custom Script | Runs a user-defined command or script | Complex checks like database connectivity, queue depth, etc. |
Most production deployments use HTTP health checks. The application exposes a /health or /healthz endpoint that returns 200 OK when everything is working, and returns an error (or times out) when something is wrong.
Here is what happens when a backend fails:
This entire process happens automatically. No human intervention is required to handle a failed server.
We have a problem. We built a load balancer to eliminate single points of failure in our backend, but the load balancer itself is now a single point of failure. If it crashes, all traffic stops. This is obviously unacceptable for a system targeting 99.99% uptime.
The solution is to run multiple load balancer instances. But this introduces new questions:
Let's look at the two main patterns for achieving high availability.
In the active-passive pattern, one load balancer handles all traffic while a backup waits idle, ready to take over if the primary fails.
The key concept here is the Virtual IP (VIP). Clients connect to the VIP, not to either load balancer directly. Both load balancers know about this VIP, but only the active one actually responds to traffic on it.
The standby continuously monitors the primary using heartbeat messages (typically sent every second). If the primary stops responding, the standby "claims" the VIP by broadcasting an ARP message that says "I am now the owner of 192.168.1.100." Within 1-3 seconds, traffic starts flowing to the new active node.
This pattern is commonly implemented with VRRP (Virtual Router Redundancy Protocol); managed offerings such as AWS's Elastic Load Balancer provide equivalent failover behind the scenes.
Pros: Simple to set up and reason about. No state synchronization needed since only one node handles traffic at a time.
Cons: Half your capacity sits idle during normal operation. You are paying for servers that are not doing useful work.
In the active-active pattern, multiple load balancer nodes handle traffic simultaneously. If one fails, the others continue without interruption.
How it works:
Traffic is distributed across multiple load balancer nodes using one of several methods:
Pros: All resources are utilized during normal operation. Better throughput since traffic is spread across multiple nodes. More resilient since the failure of one node only reduces capacity rather than causing an outage.
Cons: More complex to manage. If you use sticky sessions, you need to synchronize session state across nodes (typically using a shared Redis cluster).
| Consideration | Active-Passive | Active-Active |
|---|---|---|
| Complexity | Simpler | More complex |
| Resource utilization | 50% (standby is idle) | 100% |
| Failover time | 1-3 seconds | Near-instant (traffic shifts to remaining nodes) |
| Sticky sessions | Easy (all state on one node) | Requires shared state store |
| Scale | Limited to one node's capacity | Scales horizontally |
For most production systems handling significant traffic, active-active is the better choice despite its complexity. The benefits of full resource utilization and instant failover outweigh the added complexity of state synchronization.
Now that we have designed each piece, let's step back and see how they all fit together. Here is the complete architecture that satisfies our requirements for traffic distribution, health monitoring, and high availability.
Let's trace how everything works together:
| Component | What It Does | Key Characteristics |
|---|---|---|
| Virtual IP / DNS | Provides a stable entry point for clients | Abstracts away the LB cluster |
| LB Nodes | Accept connections and route to backends | Stateless, horizontally scalable |
| Session Store (Redis) | Stores sticky session mappings | Shared across all LB nodes |
| Health Checker | Monitors backend and LB health | Runs continuously in the background |
| Config Manager | Manages load balancer configuration | Handles API requests, persists config |
| Config Store | Persists configuration | Could be etcd, Consul, or a database |
| Backend Pool | The application servers being load balanced | Grouped by service type |
This architecture handles millions of requests per second while automatically recovering from failures. The separation of data plane (LB nodes) and control plane (health checker, config manager) keeps the traffic path fast and simple.
Here is something that might surprise you: a load balancer does not really need a traditional database for its core operation. The data plane (traffic forwarding) is entirely in-memory since every nanosecond counts when you are handling millions of requests per second.
However, the control plane does need persistent storage. Configuration (which backends exist, what algorithm to use, health check settings) needs to survive restarts. And if you are using sticky sessions in an active-active setup, session mappings need to be shared across LB nodes.
Let's think through what data lives where.
Different types of data have different requirements for speed, persistence, and sharing. Here is how we split things up:
| Data Type | Storage Location | Why? |
|---|---|---|
| Active connections | In-memory on each LB node | Must be microsecond-fast. Each LB node manages its own connections. |
| Backend health status | In-memory on each LB node | Updated frequently (every health check). Pushed from health checker. |
| Connection counters | In-memory on each LB node | Updated on every request for least-connections algorithm. |
| Session mappings | Redis cluster | Needs to be shared across LB nodes for active-active sticky sessions. |
| Backend configuration | etcd/Consul/PostgreSQL | Persistent, versioned. Survives restarts and deployments. |
| Metrics and logs | Prometheus/InfluxDB | Historical data for dashboards and alerting. |
The key insight is that the hot path (handling requests) should never touch disk or cross a network boundary for core routing decisions. Configuration is loaded into memory at startup and updated via push notifications when it changes.
The control plane needs to store information about backends and pools. Here is a schema that supports our API and features.
Each backend server record contains everything the load balancer needs to route traffic to it:
| Field | Type | Description |
|---|---|---|
| backend_id | String (PK) | Unique identifier (e.g., "backend-a1b2c3") |
| pool_id | String (FK) | Which pool this server belongs to |
| address | String | IP address or hostname |
| port | Integer | Port the server listens on |
| weight | Integer | Weight for weighted algorithms (default: 1) |
| max_connections | Integer | Maximum concurrent connections (0 = unlimited) |
| enabled | Boolean | Administrative toggle to disable without removing |
| created_at | Timestamp | When this backend was registered |
| updated_at | Timestamp | Last configuration change |
Pools group related backends and define shared behavior:
| Field | Type | Description |
|---|---|---|
| pool_id | String (PK) | Unique identifier (e.g., "pool-api-servers") |
| name | String | Human-readable name |
| algorithm | Enum | round_robin, weighted_round_robin, least_connections, ip_hash |
| health_check_type | Enum | tcp, http, custom |
| health_check_path | String | HTTP path for health checks (e.g., "/health") |
| health_check_interval | Integer | Seconds between health checks |
| healthy_threshold | Integer | Consecutive successes to mark healthy |
| unhealthy_threshold | Integer | Consecutive failures to mark unhealthy |
| sticky_sessions | Boolean | Enable session persistence |
| sticky_ttl | Integer | How long sticky sessions last (seconds) |
When sticky sessions are enabled, we need to remember which backend each user should go to. Redis is perfect for this since it is fast, supports TTLs, and can be shared across LB nodes.
The client_identifier depends on the stickiness method:
When a request arrives, the load balancer looks up the key sticky:{client_identifier} in Redis. If a mapping exists and that backend is still healthy, the request is routed there. If not, the load balancer selects a backend using the normal algorithm and stores the new mapping with the configured TTL, so subsequent requests from the same client follow it.

The high-level architecture gives us a solid foundation, but system design interviews often go deeper. Interviewers want to see that you understand the trade-offs involved in key decisions.
In this section, we will explore the most important design choices in detail: load balancing algorithms, health checking strategies, session persistence, Layer 4 vs Layer 7, SSL termination, and handling load balancer failures.
The load balancing algorithm determines how traffic gets distributed across backends. This is one of the most important decisions in your design since it directly affects how evenly load is spread, how well the system handles failures, and how it behaves under various traffic patterns.
A good algorithm should:
Let's explore the main options, starting with the simplest and building to more sophisticated approaches.
Round robin is the simplest load balancing algorithm, and often the default choice. Requests are distributed sequentially across backends: first request goes to Backend 1, second to Backend 2, third to Backend 3, then back to Backend 1.
The algorithm maintains a counter that increments with each request. The selected backend is counter % number_of_backends.
The implementation is trivial:
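Here is a minimal sketch in Python (class and variable names are illustrative, not taken from any particular load balancer):

```python
import threading

class RoundRobinBalancer:
    """Distributes requests sequentially across a fixed list of backends."""

    def __init__(self, backends):
        self.backends = backends      # e.g., ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]
        self.counter = 0
        self.lock = threading.Lock()  # the counter is the only shared state

    def select(self):
        with self.lock:
            backend = self.backends[self.counter % len(self.backends)]
            self.counter += 1
        return backend
```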
Round robin's simplicity is its strength. There is no state to track beyond a single counter. Each LB node can run independently with its own counter since exact coordination is not necessary for good distribution over time. The algorithm adds essentially zero latency.
The simplicity comes with blind spots. Round robin treats all backends and all requests as equal, which is often not true:
Homogeneous backends (same hardware, same configuration) handling uniform requests (similar processing cost per request). This is common for stateless API servers.
What if your backends are not identical? Maybe Backend 1 is a beefy 32-core machine while Backend 2 is a smaller 8-core instance. Sending equal traffic to both would waste capacity on Backend 1 and overload Backend 2.
Weighted round robin solves this by assigning a weight to each backend. Backends with higher weights receive proportionally more traffic.
If Backend 1 has weight 3, Backend 2 has weight 1, and Backend 3 has weight 2, the total weight is 6. Over every 6 requests, Backend 1 receives 3, Backend 2 receives 1, and Backend 3 receives 2.
A naive implementation might send the first 3 requests to Backend 1, then 1 to Backend 2, then 2 to Backend 3. This creates "bursts" to high-weight servers, which can cause momentary overload.
Better implementations like NGINX's "smooth weighted round robin" interleave requests more evenly. Instead of [1,1,1,2,3,3], you get something like [1,3,1,2,1,3], spreading the load more smoothly over time.
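Here is a sketch of the smooth scheme in Python, using the same weights (3, 1, 2); the names are illustrative:

```python
class SmoothWeightedRoundRobin:
    """Smooth weighted round robin: spreads high-weight backends evenly over time."""

    def __init__(self, weights):
        self.weights = weights                  # e.g., {"backend-1": 3, "backend-2": 1, "backend-3": 2}
        self.current = {b: 0 for b in weights}  # running "credit" per backend
        self.total = sum(weights.values())

    def select(self):
        for backend, weight in self.weights.items():
            self.current[backend] += weight     # every backend earns its weight each round
        chosen = max(self.current, key=self.current.get)
        self.current[chosen] -= self.total      # the winner pays back the total weight
        return chosen

# Six selections with weights 3/1/2 yield an interleaved order such as
# backend-1, backend-3, backend-1, backend-2, backend-3, backend-1.
```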
Weighted round robin is great when you know your backends have different capacities and want to utilize them proportionally. Common scenarios:
The weights are static. If Backend 1 suddenly slows down due to a noisy neighbor or garbage collection, it keeps getting the same traffic. The algorithm does not adapt to runtime conditions.
Both round robin and weighted round robin are "blind" to what is actually happening on your backends. They distribute requests based on a predetermined pattern, regardless of whether a backend is struggling with a heavy workload.
Least connections takes a different approach: it sends each new request to whichever backend currently has the fewest active connections.
The load balancer tracks how many connections are currently active on each backend. When a new request arrives, it scans the list and picks the backend with the lowest count.
In this example, Backend 2 has only 5 active connections compared to 10 and 8 on the others. The new request goes to Backend 2.
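The selection itself is a one-liner over whatever connection counts the LB node is already tracking (a Python sketch; the counts mirror the example above):

```python
def least_connections(active_connections):
    """Return the backend with the fewest active connections."""
    # e.g., {"backend-1": 10, "backend-2": 5, "backend-3": 8} -> "backend-2"
    return min(active_connections, key=active_connections.get)
```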
Least connections naturally adapts to runtime conditions. If one backend is processing slow requests (maybe a complex database query or a large file upload), it accumulates connections and gets fewer new ones. Meanwhile, a backend that is processing requests quickly keeps its connection count low and picks up more traffic.
This self-balancing behavior is powerful. It handles:
Least connections requires state. Each LB node needs to track connection counts for every backend. In a distributed setup with multiple LB nodes, this gets tricky since each node only knows about its own connections. You have a few options:
Workloads with variable request processing times, like applications that mix quick API calls with slow database queries or file operations.
Weighted least connections combines the best of weighted round robin and least connections. It accounts for both configured capacity (weights) and real-time load (connection counts).
Instead of picking the backend with the fewest connections, we pick the one with the lowest ratio of connections to weight. A powerful server (high weight) can handle more connections before its ratio becomes high.
Even though Backend 1 has more connections (10) than Backend 3 (4), it has a higher weight, so its ratio is the same. Both are equally loaded relative to their capacity.
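A sketch of the selection, with weights chosen to match the example above (the weights themselves are illustrative):

```python
def weighted_least_connections(active_connections, weights):
    """Return the backend with the lowest connections-to-weight ratio."""
    # Matching the example above (weights are assumed):
    # active_connections = {"backend-1": 10, "backend-3": 4}
    # weights            = {"backend-1": 5,  "backend-3": 2}
    # ratios: backend-1 = 10/5 = 2.0, backend-3 = 4/2 = 2.0 -> equally loaded
    return min(active_connections, key=lambda b: active_connections[b] / weights[b])
```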
Weighted least connections gives you:
This is why many production load balancers default to weighted least connections for HTTP traffic.
The algorithm is more complex than round robin. You need to track connections per backend and calculate ratios on each request. But the computation is trivial (a few integer operations), and the better load distribution is usually worth it.
Sometimes you need session persistence, meaning requests from the same client should always go to the same backend. The simplest way to achieve this without storing any state is to use the client's IP address as a hash key.
The same IP address always produces the same hash, which always maps to the same backend (as long as the backend count does not change).
Client A (192.168.1.10) always routes to Backend 0. Client B always routes to Backend 1. No state needs to be stored anywhere.
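A deterministic sketch in Python (any stable hash works; MD5 here is just a convenient, evenly distributed choice):

```python
import hashlib

def ip_hash(client_ip, backends):
    """Map a client IP to a backend; the same IP always gets the same backend."""
    digest = hashlib.md5(client_ip.encode()).hexdigest()
    return backends[int(digest, 16) % len(backends)]

# ip_hash("192.168.1.10", ["backend-0", "backend-1", "backend-2"]) always returns
# the same backend, as long as the backend list does not change.
```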
IP hash gives you session persistence at Layer 4, without needing to parse HTTP or store session mappings. Every LB node can compute the same hash independently since the algorithm is deterministic.
This is useful when:
The approach has significant limitations. Many clients sit behind NAT gateways or corporate proxies and share a single public IP, so one "client" can represent thousands of users and overload a single backend. Mobile clients can change IP addresses as they move between networks, silently losing their stickiness. And when the number of backends changes, the mapping shifts from hash % 3 to hash % 4, so most clients map to different backends and lose their sessions.

The last problem is serious enough that consistent hashing was invented specifically to solve it.
Consistent hashing solves the "massive reshuffling" problem of regular hashing. When you add or remove a backend, only a small fraction of requests get remapped instead of most of them.
Imagine a ring numbered from 0 to 2^32. Both backends and requests get hashed onto this ring. Each request goes to the first backend found when walking clockwise from its hash position.
Consider what happens when Backend B is removed:
With simple hash % n hashing, changing from 3 backends to 2 would remap roughly 2/3 of all requests. With consistent hashing, only 1/3 get remapped (the requests that were going to the removed backend).
One problem with basic consistent hashing is that backends might be unevenly distributed on the ring. Backend A might end up responsible for a huge arc while Backend C gets a tiny one.
The solution is virtual nodes: instead of placing each backend once on the ring, place it multiple times (e.g., 100-200 positions per backend). This evens out the distribution significantly.
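Here is a compact Python sketch of a ring with virtual nodes (the names and the 150 virtual-node count are illustrative):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Consistent hashing with virtual nodes for smoother distribution."""

    def __init__(self, backends, vnodes=150):
        self.ring = []                                  # sorted (position, backend) pairs
        for backend in backends:
            for i in range(vnodes):
                pos = self._hash(f"{backend}#{i}")      # place each backend many times
                self.ring.append((pos, backend))
        self.ring.sort()
        self.positions = [pos for pos, _ in self.ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode()).hexdigest(), 16)

    def get(self, request_key):
        """Walk clockwise from the key's hash to the first backend on the ring."""
        idx = bisect.bisect(self.positions, self._hash(request_key)) % len(self.positions)
        return self.ring[idx][1]
```

Removing a backend only deletes its virtual nodes; keys that map to other backends' positions are untouched, which is exactly the limited remapping described above.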
Consistent hashing shines in scenarios with dynamic backend pools:
For most load balancing use cases where backends are relatively stable, the added complexity is not worth it. But if you are building a distributed cache or a system with frequent scaling events, consistent hashing is valuable.
Here is how the algorithms compare across key dimensions:
| Algorithm | Complexity | Statefulness | Adaptiveness | Best For |
|---|---|---|---|---|
| Round Robin | Very simple | Stateless | None | Homogeneous backends, uniform requests |
| Weighted Round Robin | Simple | Stateless | None | Known capacity differences |
| Least Connections | Moderate | Per-node state | High | Variable request processing times |
| Weighted Least Conn | Moderate | Per-node state | High | Production with mixed servers |
| IP Hash | Simple | Stateless | None | Basic session persistence |
| Consistent Hash | Complex | Stateless | None | Dynamic scaling, caching |
For most HTTP applications, start with Weighted Least Connections. It handles heterogeneous backends and variable request costs gracefully, adapting to runtime conditions automatically.
If your backends are identical and your requests are uniform (a rare but possible scenario), Round Robin is fine and slightly faster.
For session persistence, prefer cookie-based stickiness (covered in section 6.3) over IP hash when possible. IP hash is fragile due to NAT and uneven distribution.
Use Consistent Hashing when backends frequently scale up/down, or when you are load balancing cache servers where cache locality matters.
We touched on health checking in the high-level design, but there is more depth to explore.
The health checking strategy you choose affects how quickly you detect failures, how aggressively you mark servers unhealthy, and how smoothly you handle recovery.
Every health check has several configurable parameters. Getting these right matters: too aggressive, and you get false positives (marking healthy servers as unhealthy); too passive, and you keep routing to failed servers longer than necessary.
| Parameter | What It Controls | Typical Value | Trade-off |
|---|---|---|---|
| Interval | Time between checks | 5-10 seconds | Lower = faster detection, higher load |
| Timeout | Max wait for response | 2-3 seconds | Lower = faster detection, more false positives |
| Healthy Threshold | Passes needed to mark healthy | 2-3 | Higher = more stable, slower recovery |
| Unhealthy Threshold | Failures needed to mark unhealthy | 2-3 | Higher = more stable, slower failure detection |
A common configuration is: check every 5 seconds, timeout after 3 seconds, require 3 consecutive failures to mark unhealthy, and require 2 consecutive successes to mark healthy again.
TCP checks are the simplest form. The load balancer opens a TCP connection to the backend and immediately closes it. If the connection succeeds, the backend is considered healthy.
This verifies network connectivity and that something is listening on the port, but it does not verify the application is actually working. A process could be hung, accepting connections but never responding to requests.
HTTP checks go further: the load balancer sends an HTTP request and checks the response. This is the standard for web applications.
The /health endpoint should do meaningful checks. A good health endpoint might:
A bad health endpoint just returns 200 unconditionally, which tells you nothing useful.
For complex scenarios, you can run a custom script that returns success or failure. This is useful when:
Backends transition through states based on health check results. Understanding these transitions helps you tune the behavior.
The thresholds prevent flapping. If a backend fails one check (maybe a brief network blip), it does not immediately get marked unhealthy. Only after 2-3 consecutive failures does the status change. Similarly, a backend that was down needs to pass multiple checks before being trusted with traffic again.
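To make this concrete, here is a sketch of one backend's HTTP check loop combining the interval, timeout, and thresholds discussed above (Python with the third-party requests library; values and names are illustrative):

```python
import time
import requests

INTERVAL, TIMEOUT = 5, 3               # seconds between checks, per-check timeout
UNHEALTHY_AFTER, HEALTHY_AFTER = 3, 2  # consecutive results needed to flip state

def health_check_loop(base_url, backend):
    failures = successes = 0
    while True:
        try:
            resp = requests.get(f"{base_url}/health", timeout=TIMEOUT)
            ok = 200 <= resp.status_code < 300
        except requests.RequestException:
            ok = False

        if ok:
            successes, failures = successes + 1, 0
            if backend["status"] != "healthy" and successes >= HEALTHY_AFTER:
                backend["status"] = "healthy"       # trusted with traffic again
        else:
            failures, successes = failures + 1, 0
            if backend["status"] != "unhealthy" and failures >= UNHEALTHY_AFTER:
                backend["status"] = "unhealthy"     # routing engine skips this backend

        time.sleep(INTERVAL)
```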
When a backend fails health checks, you should not just immediately drop all connections to it. That would cause errors for requests that are mid-flight. Instead, use connection draining: stop sending new requests to the backend, let in-flight requests and existing connections complete, and after a drain timeout forcibly close whatever remains.
This graceful degradation ensures users experience minimal disruption when backends fail.
Not all applications are stateless. Many legacy systems (and some modern ones) store user session data in local memory on the backend server. If User A logs in on Backend 1, their session lives in Backend 1's memory. If their next request goes to Backend 2, they appear logged out.
Session persistence (also called "sticky sessions") solves this by ensuring that requests from the same user consistently go to the same backend.
Without stickiness, the load balancer distributes requests without regard to who is making them:
This creates a confusing, broken experience. Session persistence fixes it.
Cookie-based stickiness is the most common and reliable approach for HTTP traffic. The load balancer injects a cookie that identifies which backend should handle the user's requests. On the first request, the load balancer picks a backend with the configured algorithm and adds a Set-Cookie header with the backend identifier (e.g., SERVERID=backend-1). The browser includes this cookie on every subsequent request, and the load balancer reads it and routes to the same backend.
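A sketch of the cookie logic inside an L7 proxy's request hook (all names here are illustrative):

```python
import random

BACKENDS = {"backend-1": "10.0.0.1:8080", "backend-2": "10.0.0.2:8080"}
COOKIE_NAME = "SERVERID"

def route_with_sticky_cookie(request_cookies, response_headers):
    """Return the backend address, setting the stickiness cookie on first contact."""
    backend_id = request_cookies.get(COOKIE_NAME)
    if backend_id not in BACKENDS:                  # first request or stale cookie
        backend_id = random.choice(list(BACKENDS))  # any balancing algorithm works here
        response_headers.append(
            ("Set-Cookie", f"{COOKIE_NAME}={backend_id}; Path=/; HttpOnly")
        )
    return BACKENDS[backend_id]
```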
For non-HTTP traffic or when you cannot use cookies, you can route based on client IP address. This is essentially the IP Hash algorithm applied for persistence.
Use IP-based persistence only when cookie-based is not an option.
The cleanest solution is to avoid the problem entirely by making your application stateless. Instead of storing session data in backend memory, store it in a shared session store like Redis.
This requires application changes. You need to configure your framework to use Redis (or another external store) instead of local memory for sessions. For new applications, this is easy. For legacy applications, it might require significant refactoring.
If you are building a new application, design it to be stateless from the start. Use an external session store (Redis is the go-to choice) and avoid the complexity of sticky sessions entirely.
For legacy applications that cannot be modified, use cookie-based stickiness for HTTP traffic. It is the most reliable option.
Avoid IP-based persistence unless you have no other choice.
When discussing load balancers, you will often hear terms like "L4" and "L7." These refer to different layers of the network stack where the load balancer makes its decisions. The choice between them affects what the load balancer can do and how fast it can do it.
A Layer 4 load balancer operates at the transport layer. It sees TCP/UDP packets but does not understand what is inside them. It makes routing decisions based only on the source IP address and port, the destination IP address and port, and the protocol (TCP or UDP).
Speed. Since L4 does not parse application protocols, it is fast. Some hardware L4 load balancers can handle tens of millions of packets per second. It is also protocol-agnostic, meaning it works with any TCP/UDP traffic (databases, game servers, custom protocols).
A Layer 7 load balancer operates at the application layer. It understands HTTP (and sometimes other protocols) and can make intelligent routing decisions based on request content.
With L7, routing can be based on the request itself: the URL path (/api goes to API servers, /static goes to a static-content pool), the Host header (api.example.com vs www.example.com), cookies, or other headers.

L7 is more resource-intensive. The load balancer must parse HTTP, which takes CPU cycles. SSL termination requires even more processing. A server that could handle 1M packets/second at L4 might handle only 100K requests/second at L7 (rough estimate, varies widely by configuration).
| Capability | Layer 4 | Layer 7 |
|---|---|---|
| Speed | Fastest (packet forwarding) | Slower (protocol parsing) |
| Intelligence | Basic (IP, port) | Rich (URL, headers, cookies) |
| SSL termination | Pass-through only | Full termination and inspection |
| Sticky sessions | IP-based only | Cookie, header, or URL-based |
| Content routing | Not possible | Based on URL, host, method, etc. |
| Header modification | Not possible | Add, remove, or modify headers |
| Caching | Not possible | Can cache responses |
| Protocol support | Any TCP/UDP | HTTP/HTTPS (usually) |
An L7 load balancer can also modify requests in flight, for example adding headers like X-Forwarded-For or X-Request-ID. Most modern web applications use L7 load balancers because the additional capabilities outweigh the performance cost. The exception is when you have a layer of L7 load balancers behind an L4 load balancer (the L4 distributes across L7 instances).
Managing SSL certificates on every backend server is tedious and error-prone. You have dozens of servers, and each needs the certificate renewed before it expires.
This is where SSL termination helps: the load balancer handles all the encryption, and backends receive plain HTTP.
The client connects to the load balancer over HTTPS. The load balancer decrypts the request, inspects it (for routing decisions), then forwards it to a backend over plain HTTP. The response follows the reverse path.
Simplified certificate management
You install and renew certificates in one place (the load balancer) instead of on every backend server. When a certificate expires, you update it once, not fifty times.
Offload CPU from backends
SSL encryption and decryption are CPU-intensive operations. By terminating SSL at the load balancer, your backends can use their CPU for application logic instead of cryptography. This is especially valuable if your backends are handling many small requests.
Enable content-based routing
If you want to route based on URL path or headers, the load balancer needs to read the HTTP request. It cannot do this if the traffic is encrypted end-to-end. SSL termination gives the load balancer visibility into the request content.
TLS session caching
The load balancer can cache TLS sessions, so clients reconnecting do not need to do a full TLS handshake every time. This improves latency for repeat visitors.
There is a trade-off: traffic between the load balancer and backends is unencrypted. This is fine if the load balancer and backends sit in the same trusted private network (for example, a VPC or an isolated data center segment) and your compliance requirements do not mandate end-to-end encryption.
If you need encryption all the way to the backend (perhaps for compliance reasons), you have options:
Re-encryption
The load balancer decrypts client traffic, inspects it, then re-encrypts it before sending to the backend. This gives you content inspection plus backend encryption, but with double the cryptographic overhead.
SSL Passthrough
The load balancer does not decrypt traffic at all. It routes based on the TLS Server Name Indication (SNI) and forwards encrypted packets directly to backends. You lose the ability to do content-based routing, but traffic is encrypted end-to-end.
If you are terminating SSL at the load balancer, follow these security practices:
We covered high availability at a conceptual level in section 4.3. Now let's go deeper into the failure scenarios you need to handle and the mechanisms that make failover work.
The load balancer sits on the critical path for all traffic. If it fails and you do not have a backup, your entire service goes offline. This is why high availability is non-negotiable for production load balancers.
Before you can recover from a failure, you need to detect it. Load balancer nodes monitor each other using:
The detection threshold is a balance: too sensitive and you get false positives (declaring a node dead when it just had a brief network hiccup), too slow and you extend your outage.
VRRP (Virtual Router Redundancy Protocol) is the industry standard for IP failover within a single network segment. Two load balancers share a Virtual IP address, but only one responds to traffic at a time.
Failover time: 1-3 seconds
Trade-off: Simple and reliable, but half your capacity sits idle during normal operation.
With DNS-based failover, multiple load balancers each have their own IP address, and DNS health checks remove failed ones from rotation.
Failover time: Depends on DNS TTL. With a 60-second TTL, failover can take 1-2 minutes as cached DNS entries expire.
Trade-off: Works across data centers (unlike VRRP), but failover is slower.
With anycast, multiple load balancers across the world share the same IP address, and BGP routing automatically directs traffic to the nearest healthy one.
When the Asia LB fails, it stops announcing its BGP route. Traffic from Asia clients automatically routes to the next nearest healthy LB (US or EU).
Failover time: Seconds (depends on BGP convergence)
Trade-off: Requires BGP expertise and coordination with network providers. Used by Cloudflare, Google, and other global-scale services.
When a load balancer fails, connections to it are lost. There is no practical way to transfer an open TCP connection from one machine to another. Clients will see connection resets.
You can minimize the impact:
| Strategy | Benefit |
|---|---|
| Fast failover (VRRP/Anycast) | Reduces the window where new connections fail |
| Client retry logic | Applications reconnect automatically |
| External session store | Users do not lose session data |
| Short connection timeouts | Clients detect failure faster |
Recommendation: Design for stateless failover. Assume connections will be dropped on failure, and build applications that handle retries gracefully. The complexity of stateful failover (synchronizing connection state across LB nodes) is rarely worth it.