Denormalization

Last Updated: May 1, 2026

6 min read

Normalization keeps your data clean by storing each fact in exactly one place. The trade-off is that data gets spread across multiple tables, and building a complete view often requires joins.

Those joins have a cost.

Denormalization is the intentional choice to bring some of that data back together. You duplicate certain values to make reads faster and simpler, while accepting that writes become more complex and require extra care to keep data consistent.

Why Denormalization Exists

In a normalized design, data is cleanly separated across tables. In StreamFlow, generating a full stream report means combining data from streams, tracks, albums, artists, and genres.
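To make the cost concrete, here is a sketch of what that full stream report query might look like. The table names come from the article; the key and column names (track_id, album_id, artist_id, genre_id, title, name, streamed_at) are assumed, as is the genre link living on tracks:

```sql
-- Hypothetical stream report: one row per stream event, enriched with
-- track, album, artist, and genre details. Key/column names are assumed.
SELECT s.streamed_at,
       t.title  AS track_title,
       al.title AS album_title,
       ar.name  AS artist_name,
       g.name   AS genre_name
FROM streams s
JOIN tracks  t  ON t.track_id   = s.track_id
JOIN albums  al ON al.album_id  = t.album_id
JOIN artists ar ON ar.artist_id = al.artist_id
JOIN genres  g  ON g.genre_id   = t.genre_id;
```

Every row of the report requires four joins; that repeated combining work is the cost this section is about.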

This structure has clear benefits:

  • Each fact is stored in one place
  • Updates are simple and consistent
  • Data anomalies are eliminated

But there’s a trade-off.

Every time the application needs something like a track with its artist name, the database has to combine data from multiple tables. For a single request, this is usually fine. Modern databases handle joins efficiently, especially when indexes are in place.

The problem shows up at scale or with specific access patterns. Consider a homepage widget that displays "trending tracks right now" and gets viewed millions of times per day. Or a reporting dashboard that always shows the same aggregated numbers. In both cases, the application is reading the same combined data over and over, and the database is doing the same work to combine it every time.

This is where the read/write asymmetry matters. Many real-world applications read data far more often than they write it. A music streaming service might process thousands of new stream events per minute, but its homepage gets millions of page views in the same period. The ratio can be 100:1 or higher.

Normalization optimizes for the write side (update one row, data stays consistent). Denormalization optimizes for the read side (the data is already combined, so reads are fast).

The key insight is that denormalization is not the absence of normalization. It is a conscious decision to reintroduce specific redundancy after starting with a clean, normalized design. You know exactly which facts are duplicated, which anomalies you are accepting, and how you plan to keep the duplicated data consistent.

Common Denormalization Patterns

Three patterns cover the majority of real-world denormalization. Each one trades a specific type of write complexity for a specific type of read improvement.

Duplicated Columns

The simplest form of denormalization is copying a column from one table into another to avoid combining them at read time.

In StreamFlow's normalized schema, getting an artist's name for a track requires following two links: tracks has album_id, which points to albums, which has artist_id, which points to artists. If the most common operation in the application is "show the track title and artist name together," that is a lot of extra work for something the user sees on every screen.

One approach is to add artist_name directly to the tracks table:

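A minimal sketch of that change, assuming the key names used above and a two-step approach (add the column, then backfill it from the normalized source of truth):

```sql
-- Add the duplicated column to tracks (column type is an assumption).
ALTER TABLE tracks ADD COLUMN artist_name VARCHAR(255);

-- Backfill it by walking the normalized links once:
-- tracks.album_id -> albums.artist_id -> artists.name.
UPDATE tracks t
SET artist_name = (
    SELECT ar.name
    FROM albums al
    JOIN artists ar ON ar.artist_id = al.artist_id
    WHERE al.album_id = t.album_id
);
```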

Now the application can read the track title and artist name from a single table. No extra lookups required.

The Cost: If an artist changes their name, you must update the artists table AND every row in tracks that belongs to that artist. Miss any rows, and the data becomes inconsistent. With the normalized design, you would update one row in artists and every read would automatically see the new name.

|                            | Normalized                      | Denormalized                                  |
|----------------------------|---------------------------------|-----------------------------------------------|
| Read: show track + artist  | Combine tracks, albums, artists | Read tracks only                              |
| Write: artist changes name | Update 1 row in artists         | Update artists + every matching row in tracks |
| Consistency                | Always consistent               | Can be temporarily inconsistent               |

Duplicated columns work best when the source value changes rarely. Artist names don't change often. If you were duplicating something that changed frequently, the sync cost would outweigh the read benefit.

Pre-computed Aggregates

Sometimes the expensive operation is not combining tables but counting or summing across many rows. If StreamFlow needs to display total stream counts on every artist's profile page, the straightforward approach is to count all rows in streams for that artist's tracks every time someone views the page. With millions of stream records, that counting operation gets expensive fast.

A pre-computed aggregate stores the result of that calculation in a separate summary table:

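A sketch of what that summary table and its write-path update might look like. The artist_stats name comes from the article; the column names and the :artist_id placeholder are assumptions:

```sql
-- One summary row per artist.
CREATE TABLE artist_stats (
    artist_id     BIGINT PRIMARY KEY REFERENCES artists (artist_id),
    total_streams BIGINT NOT NULL DEFAULT 0
);

-- On each new stream event, bump the counter (ideally in the same
-- transaction that inserts the streams row).
UPDATE artist_stats
SET total_streams = total_streams + 1
WHERE artist_id = :artist_id;
```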

Instead of counting millions of rows on every page view, the application reads a single row from artist_stats. The trade-off is that every new stream event must also update the corresponding row in artist_stats. The write path does more work so the read path does less.

Pre-computed aggregates are common in dashboards, leaderboards, and profile pages: anywhere the application repeatedly displays the same summary number.

Flattened Hierarchies

StreamFlow's genres table uses a self-referencing parent_genre_id column to represent a hierarchy: "Rock" is the parent of "Alternative Rock," which is the parent of "Indie Rock."

This is a clean, normalized design. But navigating the hierarchy is expensive. To find the top-level genre for "Indie Rock," you follow parent_genre_id to "Alternative Rock," then follow it again to "Rock," then check that "Rock" has no parent. Each level requires another lookup, and the number of levels is not fixed.
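In databases that support recursive common table expressions, that level-by-level walk looks roughly like this (a sketch using the genre names from this section; column names match the table below):

```sql
-- Walk up from 'Indie Rock' until a row with no parent is reached.
WITH RECURSIVE ancestry AS (
    SELECT genre_id, name, parent_genre_id
    FROM genres
    WHERE name = 'Indie Rock'
    UNION ALL
    SELECT g.genre_id, g.name, g.parent_genre_id
    FROM genres g
    JOIN ancestry a ON g.genre_id = a.parent_genre_id
)
SELECT name
FROM ancestry
WHERE parent_genre_id IS NULL;  -- the top-level genre: 'Rock'
```

The recursion runs once per level, which is exactly the variable-depth lookup cost described above.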

Flattening the hierarchy means storing the top-level genre directly on each row:

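One way to add and backfill that column, sketched with this section's five example genres (a general recursive backfill for arbitrary depth is omitted; the column type is an assumption):

```sql
-- Store the root of the hierarchy directly on each row.
ALTER TABLE genres ADD COLUMN top_level_genre VARCHAR(100);

UPDATE genres
SET top_level_genre = 'Rock'
WHERE name IN ('Rock', 'Alternative Rock', 'Indie Rock', 'Grunge');

UPDATE genres
SET top_level_genre = 'Jazz'
WHERE name = 'Jazz';
```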
| genre_id | name             | parent_genre_id | top_level_genre |
|----------|------------------|-----------------|-----------------|
| 1        | Rock             | NULL            | Rock            |
| 2        | Alternative Rock | 1               | Rock            |
| 3        | Indie Rock       | 2               | Rock            |
| 4        | Grunge           | 2               | Rock            |
| 5        | Jazz             | NULL            | Jazz            |

Now filtering all tracks by top-level genre requires checking a single column instead of traversing the hierarchy. The cost: if the genre tree gets reorganized (say "Alternative Rock" moves under a new parent), every row with a stale top_level_genre must be updated.

The Decision Framework

Not every slow query calls for denormalization. Before adding redundancy, ask three questions.

What is the read/write ratio?

If the data is read 1,000 times for every write, paying a higher write cost to speed up reads is a good trade. If reads and writes are roughly balanced, denormalization creates write overhead without proportional benefit.

Most user-facing features like feeds, dashboards, and search results are heavily read-skewed, making them strong candidates. Admin tools and data ingestion pipelines are often write-heavy, and denormalization there usually hurts more than it helps.

How complex are the read queries?

If the common read path touches one or two tables, denormalization won't save much. If it requires combining five tables (like StreamFlow's full stream report), the cumulative lookup cost adds up, and storing pre-combined data can make a real difference. The more tables involved and the more rows scanned, the stronger the case for denormalization.

How important is consistency?

Denormalized data can become temporarily stale. For display-oriented data (a follower count, a trending widget), brief staleness is acceptable. For financial data (payment balances, billing records) or compliance-sensitive data, even momentary inconsistency can cause real problems.

The higher the consistency requirement, the stronger the case for keeping the data normalized.

| Factor                | Lean Toward Normalization | Lean Toward Denormalization |
|-----------------------|---------------------------|-----------------------------|
| Read/write ratio      | Balanced or write-heavy   | Read-heavy (100:1+)         |
| Read query complexity | 1-2 tables                | 3+ tables combined          |
| Consistency needs     | Critical (financial, legal) | Tolerable delay (display) |
| Source data changes   | Frequently                | Rarely                      |

The guiding rule is: normalize first, denormalize selectively. Start with a clean schema where each fact lives in one place. Measure where the actual bottlenecks are. Then denormalize only those specific read paths that need it. Starting with a flat table and calling it "denormalized" is just a poorly designed schema with no awareness of which redundancy exists or how to manage it.

Keeping Data in Sync

Every denormalization creates redundancy, and redundancy requires a sync mechanism. The duplicated data will eventually become stale unless something keeps it updated.

Common approaches include updating the redundant data in application code whenever the source changes, using database triggers that fire automatically on writes, or running periodic batch jobs that recalculate the values.
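As an illustration of the trigger approach, here is a sketch in PostgreSQL syntax that keeps a duplicated tracks.artist_name column (from the earlier pattern) in sync when an artist renames. The function and trigger names are illustrative, and the key names are assumed:

```sql
-- Propagate artist renames to the duplicated column on tracks.
CREATE FUNCTION sync_artist_name() RETURNS trigger AS $$
BEGIN
    UPDATE tracks t
    SET artist_name = NEW.name
    FROM albums al
    WHERE al.album_id  = t.album_id
      AND al.artist_id = NEW.artist_id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER artist_name_sync
AFTER UPDATE OF name ON artists
FOR EACH ROW
EXECUTE FUNCTION sync_artist_name();
```

The trigger makes the sync automatic and transactional, at the cost of slowing every artist update and hiding the extra work inside the database.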

Each approach has its own trade-offs around complexity, latency, and reliability. The Data Modeling section later in this course covers sync strategies in depth.