Last Updated: January 6, 2026
Every successful application eventually hits the same wall: more users, more data, more requests. What worked smoothly for 1,000 users starts showing cracks at 100,000. What handled 100 requests per second crumbles at 10,000.
This is where scalability becomes critical.
Scalability is the ability of a system to handle increased load by adding resources. The key word here is "ability": a scalable system can grow to meet demand without requiring a complete architectural overhaul.
Understanding scalability is foundational to everything else in system design. The concepts in this chapter directly influence how we achieve availability and reliability, topics we will explore in the next chapters.
Before scaling, you need to understand how to measure it. You cannot improve what you do not measure, and vague statements like "we need to scale" are useless without concrete numbers.
Scalability is typically evaluated along these dimensions:
| Metric | Description | Example |
|---|---|---|
| Requests per second (RPS) | Number of API calls the system handles | 10,000 RPS |
| Concurrent users | Users active at the same time | 50,000 concurrent |
| Data volume | Amount of data stored or processed | 10 TB storage |
| Throughput | Data transferred per unit time | 1 GB/s |
| Query rate | Database queries per second | 50,000 QPS |
| Message rate | Messages processed through queues | 100,000 msg/s |
A system scales well if it maintains acceptable performance as load increases. Here is what good and bad scaling looks like:
| Load Increase | Response Time | Behavior | What It Means |
|---|---|---|---|
| 1x (baseline) | 50ms | Baseline | Normal operation |
| 2x | 55ms | Excellent | Sublinear growth, caching working well |
| 5x | 70ms | Good | System handling load efficiently |
| 10x | 150ms | Acceptable | Linear degradation, predictable |
| 10x | 500ms | Concerning | Superlinear degradation, bottleneck forming |
| 10x | Timeout | Critical | System at breaking point |
The goal is to keep performance relatively stable as load increases. Ideally, you want linear or sublinear degradation, where doubling load does not double response time. When response times spike or the system starts timing out, you have hit a scalability wall.
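To make these numbers concrete, here is a rough back-of-envelope estimate of how user counts translate into RPS. The traffic figures are illustrative assumptions, not measurements from any real system:

```python
# Back-of-envelope capacity estimate: convert daily active users into
# peak requests per second. All inputs are assumed, illustrative values.

daily_active_users = 1_000_000
requests_per_user_per_day = 50      # assumed average per user
seconds_per_day = 86_400
peak_to_average_ratio = 3           # traffic is bursty; assume 3x peaks

average_rps = daily_active_users * requests_per_user_per_day / seconds_per_day
peak_rps = average_rps * peak_to_average_ratio

print(f"average: {average_rps:,.0f} RPS")  # ~579 RPS
print(f"peak:    {peak_rps:,.0f} RPS")     # ~1,736 RPS
```

A million daily users sounds like a lot, but under these assumptions it is under 2,000 RPS at peak. Running this kind of estimate first tells you which row of the table above you actually need to design for.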
Vertical scaling means adding more power to your existing machines. Instead of adding more servers, you upgrade to bigger ones.
This is often the first response to performance problems because it requires no architectural changes.
Vertical scaling works well for databases and other stateful services, for early-stage products, and for any workload that still fits comfortably on a single large machine.
Never dismiss vertical scaling as "not scalable." Many real-world systems run on vertically scaled databases for years. The key is knowing when horizontal scaling becomes necessary.
Vertical scaling eventually hits a ceiling. When the biggest available machine is not big enough, or when you need fault tolerance that a single machine cannot provide, you need a different approach.
Horizontal scaling means adding more machines rather than upgrading existing ones. Instead of one powerful server, you distribute the load across many commodity servers.
This is how companies like Google, Netflix, and Amazon handle billions of requests.
Many commodity servers work together behind a load balancer, which distributes incoming requests across all of them.
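As a minimal illustration, here is round-robin distribution, the simplest strategy a load balancer can use. The server addresses are placeholders; real load balancers (NGINX, HAProxy, cloud load balancers) add health checks, weighting, and connection draining on top of this idea:

```python
import itertools

# Round-robin: hand each incoming request to the next server in a
# fixed rotation. Simple, stateless, and evenly spreads uniform load.

servers = ["app-1:8080", "app-2:8080", "app-3:8080"]
rotation = itertools.cycle(servers)

def pick_server() -> str:
    """Return the next server in the rotation."""
    return next(rotation)

for _ in range(6):
    print(pick_server())  # app-1, app-2, app-3, app-1, app-2, app-3
```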
For horizontal scaling to work effectively, services should be stateless. A stateless service does not store any session data locally. Each request can be handled by any server.
The difference is significant for scaling:
In the stateful model, once a user's session is stored on Server 1, all their requests must go to that same server. This creates hotspots and makes it risky to remove servers. In the stateless model, session data lives in a shared store like Redis, so any server can handle any request. The load balancer has complete freedom to distribute traffic.
To make services stateless, move session data into a shared store such as Redis, write uploads to object storage rather than local disk, and keep no request-specific state in application memory.
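Here is a minimal sketch of externalized sessions using Redis (the host name and TTL are illustrative assumptions). Because the session lives in a shared store, any app server can create or load it:

```python
import json
import uuid

import redis  # pip install redis

# Sessions live in Redis, not in server memory, so every app server
# sees the same data and the load balancer can route freely.

r = redis.Redis(host="redis.internal", port=6379, decode_responses=True)
SESSION_TTL_SECONDS = 3600  # sessions expire after an hour

def create_session(user_id: str) -> str:
    session_id = str(uuid.uuid4())
    r.setex(f"session:{session_id}", SESSION_TTL_SECONDS,
            json.dumps({"user_id": user_id}))
    return session_id

def load_session(session_id: str):
    raw = r.get(f"session:{session_id}")
    return json.loads(raw) if raw else None
```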
A typical system is not monolithic. It has multiple components, each with different scaling characteristics and challenges. Understanding these differences is crucial because the scaling strategy that works for one tier often does not work for another.
Application servers are usually the easiest tier to scale horizontally, provided they are stateless: any server can handle any request, so adding capacity means adding machines behind the load balancer.
Databases are typically the hardest to scale because they manage state. Unlike application servers, you cannot simply spin up more database instances and put a load balancer in front of them. Data consistency, durability, and transaction isolation all complicate matters.
The approach depends on your workload pattern:
For read-heavy workloads (which most applications are), create copies of your database that handle read queries:
The primary handles all writes; replicas receive the changes and serve reads.
When to use: Read-to-write ratio is 10:1 or higher, and writes are not the bottleneck.
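At the application layer, read/write splitting can be as simple as inspecting the statement. A minimal sketch, assuming one primary and a pool of replicas (the connection objects stand in for real database connections):

```python
import random

# Sketch of application-level read/write splitting: writes go to the
# primary, reads go to a randomly chosen replica. Real routers
# (ProxySQL, for example) also handle transactions and replication lag.

class RoutingDatabase:
    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas

    def execute(self, sql: str, params=()):
        is_write = sql.lstrip().upper().startswith(
            ("INSERT", "UPDATE", "DELETE")
        )
        conn = self.primary if is_write else random.choice(self.replicas)
        return conn.execute(sql, params)
```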
When read replicas are not enough, or when write volume exceeds what a single primary can handle, you need to split your data across multiple databases based on a partition key.
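A minimal sketch of what that partitioning might look like in application code, assuming four shards and a user-ID shard key (the connection strings are placeholders):

```python
import hashlib

# Hash-based sharding keyed on user ID. Plain modulo hashing is easy to
# reason about but painful to rebalance (adding a shard remaps most
# keys); consistent hashing is the usual fix for that.

SHARDS = [
    "postgres://db-shard-0.internal/app",
    "postgres://db-shard-1.internal/app",
    "postgres://db-shard-2.internal/app",
    "postgres://db-shard-3.internal/app",
]

def shard_for(user_id: str) -> str:
    # Stable, well-distributed hash so a user always lands on one shard.
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("user-42"))  # same user, same shard, every time
```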
NoSQL databases like Cassandra, MongoDB, and DynamoDB are designed for horizontal scaling from the ground up: they partition and replicate data across nodes automatically, so capacity grows by adding machines rather than by re-architecting.
Caching reduces load on databases and improves response times. A well-designed cache can handle 100x the throughput of a database, making it essential for high-traffic systems. Redis, for example, can handle 100,000+ operations per second on a single node.
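A common way to apply this is the cache-aside pattern: check the cache first, fall back to the database on a miss, then populate the cache for the next reader. A minimal sketch, where the host name, TTL, and `fetch_user_from_db` are illustrative placeholders:

```python
import json

import redis  # pip install redis

cache = redis.Redis(host="redis.internal", decode_responses=True)
USER_TTL_SECONDS = 300  # caps staleness at five minutes

def fetch_user_from_db(user_id: str) -> dict:
    # Stand-in for a real database query.
    return {"id": user_id, "name": "example"}

def get_user(user_id: str) -> dict:
    key = f"user:{user_id}"
    cached = cache.get(key)
    if cached is not None:              # cache hit: database untouched
        return json.loads(cached)
    user = fetch_user_from_db(user_id)  # cache miss: one database read
    cache.setex(key, USER_TTL_SECONDS, json.dumps(user))
    return user
```

The TTL bounds how stale a cached entry can get; invalidating the key explicitly on writes tightens that further.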
Message queues are essential for scaling asynchronous workloads. They decouple producers from consumers, allowing each to scale independently, and they buffer traffic spikes so consumers can process at their own pace.
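The shape of that decoupling fits in a few lines. This in-process sketch uses Python's standard library queue as a stand-in for a real broker like Kafka, RabbitMQ, or SQS; the pattern is the same either way:

```python
import queue
import threading
import time

# Producers enqueue quickly and move on; the consumer drains the
# backlog at its own pace. The bounded queue buffers traffic spikes.

jobs: queue.Queue = queue.Queue(maxsize=1000)

def producer():
    for i in range(5):
        jobs.put({"email_to": f"user-{i}@example.com"})  # returns fast

def consumer():
    while True:
        job = jobs.get()
        time.sleep(0.1)  # simulate slow work, e.g. sending an email
        print("processed", job)
        jobs.task_done()

threading.Thread(target=consumer, daemon=True).start()
producer()
jobs.join()  # wait until the backlog is drained
```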
Theory is useful, but seeing scalability in action makes it stick. Let us walk through how a startup might scale a social media application from zero to millions of users. Each stage solves a specific bottleneck that emerged from growth.
At launch, everything runs on one machine. The application and database share the same server. This setup is simple, cheap, and perfectly adequate for a few thousand users. There is no distributed system complexity, no network latency between components, and debugging is straightforward.
The bottleneck emerges when the application and database start competing for CPU and memory on the same machine.
The first scaling move is usually separating the database onto its own machine. Now each component can be tuned independently. You can give the database server more RAM for caching, while the app server gets more CPU for request processing.
The bottleneck shifts to the database. As user counts grow, the database handles more queries, and read operations start slowing down.
Adding a cache layer dramatically reduces database load. Hot data (user profiles, recent posts, session data) gets served from memory. Redis can handle hundreds of thousands of reads per second, far more than MySQL. With a good caching strategy, 80-90% of reads never hit the database.
The bottleneck is now the single app server. It cannot handle the incoming request volume.
This is where horizontal scaling begins. A load balancer distributes traffic across multiple app servers. Each server is stateless, storing no session data locally. The Redis cache serves as the shared session store.
Adding more app servers is now trivial. Need more capacity? Spin up another server. Traffic spike during peak hours? Auto-scaling adds servers automatically.
The bottleneck shifts back to the database. With more app servers generating more queries, the single MySQL instance becomes overwhelmed.
Most applications are read-heavy, with reads outnumbering writes by 10:1 or more. Read replicas take advantage of this pattern. The primary database handles all writes, while replicas serve read queries. This multiplies read capacity without changing the application much.
The trade-off is replication lag. Replicas typically run milliseconds behind the primary, sometimes seconds under heavy load, so recently written data might not be immediately visible on reads. For most applications, this is acceptable.
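When stale reads do matter, a common mitigation is "read your own writes": after a user writes, pin that user's reads to the primary for a short window. A minimal sketch, where the window length is an assumption you would tune to your observed lag:

```python
import time

# After a write, route that user's reads to the primary for a short
# window so they always see their own changes. PIN_SECONDS is assumed.

PIN_SECONDS = 5.0
last_write_at: dict[str, float] = {}  # user_id -> time of last write

def record_write(user_id: str) -> None:
    last_write_at[user_id] = time.monotonic()

def reader_for(user_id: str, primary, replica):
    wrote_recently = (
        time.monotonic() - last_write_at.get(user_id, float("-inf"))
        < PIN_SECONDS
    )
    return primary if wrote_recently else replica
```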
The bottleneck becomes write throughput. One primary database can only handle so many writes per second.
Sharding is the final frontier of relational database scaling. Data is partitioned across multiple databases based on a shard key, typically user ID. Each shard handles a subset of users, distributing both read and write load.
This is powerful but comes with significant complexity. Cross-shard queries become expensive or impossible. Rebalancing shards when they grow unevenly is operationally challenging. Many teams at this stage consider moving to distributed databases like CockroachDB or Vitess that handle sharding automatically.
Scalability is about designing systems that can grow with demand without falling apart. Here are the key takeaways:

- Measure before you scale: concrete numbers (RPS, QPS, data volume) beat vague goals.
- Vertical scaling is a legitimate first step; horizontal scaling is what removes the ceiling.
- Stateless services are the prerequisite for effective horizontal scaling.
- Each tier scales differently: app servers by adding machines, databases through replicas and sharding, with caching and queues absorbing load in between.
- Scale incrementally, solving the current bottleneck rather than pre-building for hypothetical load.
Scalability addresses the question of handling more load. But a system that can handle a million users is not very useful if it goes down every week. A scaled system that crashes still fails your users.
This brings us to our next topic: availability. How do you ensure your system stays operational even when individual components fail?