AlgoMaster Logo

Normalization Fundamentals

Last Updated: May 1, 2026

8 min read

Storing everything in one large table might feel simple at first, but it quickly becomes hard to manage as your data grows.

You start seeing duplicate values, updates become inconsistent, and a simple delete can accidentally remove unrelated information.

Normalization helps solve this. It’s a set of guidelines for organizing your data so that each piece of information lives in exactly one place, making your database easier to maintain and much more reliable.

The Problem: Data Anomalies

To see why normalization matters, consider what happens when you store everything in a single flat table. Imagine StreamFlow kept all streaming data in one table called stream_report:

SQL

This table has every piece of information in one place. No joins needed. Queries are simple. So what's the problem?

Update Anomaly

Let’s say the artist “Adele” appears in 500 stream records.

If she changes her stage name, you now need to update all 500 rows. Miss even one, and your database ends up with inconsistent data, two different names for the same artist.

In a normalized design, this problem disappears. You update the artist’s name in one place, the artists table, and every reference stays consistent.

SQL

Insert Anomaly

Now imagine StreamFlow signs a new artist who hasn’t released any tracks yet.

In a flat table, there’s no clean way to store this. You’d have to create a fake stream record just to insert the artist, even though no streams exist yet.

That’s a sign something is wrong.

In a normalized schema, you simply insert a row into the artists table. No fake data, no workaround.

Delete Anomaly

Consider an artist with only one track, and that track has been streamed once.

If you delete that single stream record, maybe due to a user data request, you also lose the track title, album title, artist name, and genre.

All of that information disappears because it only existed in that one row.

With normalized tables, deleting a stream removes only the stream event. Everything else remains safely stored in its own table.

All three problems come from the same issue: mixing unrelated data in the same row.

A user’s email has nothing to do with a track’s genre, yet they’re stored together in a flat table.

Normalization fixes this by separating independent pieces of information into their own tables. This keeps your data consistent, reduces duplication, and makes your system far easier to maintain as it grows.

Functional Dependencies

Before jumping into the normal forms, you need one concept: functional dependencies. A functional dependency means that one column's value determines another column's value. If you know the user_id, you can determine the username, email, and country. We write this as:

This reads as "user_id functionally determines username, email, and country." Given any user_id, there is exactly one possible value for each of those columns.

Some more examples from StreamFlow:

Functional dependencies are the reason normalization works. Each normal form is essentially a rule about which functional dependencies are allowed within a table. When a table violates these rules, it contains redundancy, and redundancy leads to anomalies.

The key question is always: does this column depend on the entire primary key, or only part of it? Does it depend on the primary key directly, or through another non-key column? The normal forms answer these questions systematically.

First Normal Form (1NF)

First normal form has two requirements: every column must hold atomic (indivisible) values, and there should be no repeating groups.

Atomic Values

A column is atomic when it contains a single value, not a list, a set, or a nested structure. Consider a table that stores a user's favorite genres as a comma-separated string:

SQL

The fav_genres column violates 1NF because it packs multiple values into one field. Querying for all users who like jazz requires string parsing:

SQL

The fix is to move the multi-valued attribute into a separate table with one row per value:

SQL

Now each preference is its own row. Querying is straightforward, and you get referential integrity through foreign keys.

No Repeating Groups

Repeating groups are columns that represent the same attribute multiple times. This design stores a user's last three streamed tracks as separate columns:

SQL

This violates 1NF because track_1_id, track_2_id, and track_3_id are repeating groups of the same attribute. What happens when you need to store four recent tracks? You'd have to alter the table and add more columns. And querying "find all users who recently played track 42" requires checking three columns.

StreamFlow solves this correctly with the streams table, where each stream event is its own row. Finding recent tracks for a user is a simple query:

SQL

What About JSON Columns?

StreamFlow's tracks table has a metadata column of type JSONB, and streams has device_info as JSONB. Does storing JSON violate 1NF?

It depends on how you use it. If the JSON holds data you never need to filter, join, or aggregate on (like a track's ISRC code, BPM, or recording studio), storing it as JSON is a pragmatic choice. But if you regularly query by values inside the JSON (like filtering streams by device type), those values should probably be their own columns.

The line is practical, not dogmatic. 1NF was defined before JSON columns existed. The modern rule of thumb is: if you need to query it, index it, or join on it, make it a proper column.

Second Normal Form (2NF)

A table is in 2NF when it's in 1NF and every non-key column depends on the entire primary key, not just part of it. This rule only matters for tables with composite primary keys, because a single-column primary key has no "parts" to depend on partially.

Consider a hypothetical table that tracks which playlists contain which tracks, but also stores the track's duration:

SQL

The primary key is (playlist_id, track_id). The position and added_at columns depend on the full composite key: a track's position depends on which playlist it's in and which track it is. But track_duration depends only on track_id. A track's duration is the same regardless of which playlist contains it. This is a partial dependency, and it violates 2NF.

The following diagram shows the dependency structure:

The track_duration column (red) depends on only part of the primary key. If the same track appears in 50 playlists, the duration is stored 50 times. Update it in one row but not the others, and you have an inconsistency.

Fix: Remove the partially dependent column and store it in the table it actually belongs to:

SQL

Now duration_seconds is stored once per track, in the tracks table. When you need the duration for a playlist track, you join:

SQL

StreamFlow's playlist_tracks table is already in 2NF. It stores only the facts that depend on the full composite key: which track is in which playlist, in what position, and when it was added.

Third Normal Form (3NF)

A table is in 3NF when it's in 2NF and no non-key column depends on another non-key column. In other words, every non-key column must depend directly on the primary key, with no intermediary. When a non-key column depends on the primary key only through another non-key column, that's called a transitive dependency.

Here's an example. Suppose StreamFlow stored subscription details directly in the users table:

SQL

The dependency chain looks like this:

The tier_price doesn't depend on the user directly. It depends on the subscription_tier, which in turn depends on user_id. This is a transitive dependency. Every premium user would have the same tier_price value repeated across thousands of rows.

The red columns depend on subscription_tier (orange), not directly on the primary key. If the price for the premium tier changes, you'd need to update every premium user's row.

Fix: Extract the transitively dependent columns into their own table:

SQL

Now the tier price is stored once, in the subscription_tiers table. Updating the premium price is a single-row update. StreamFlow's actual schema separates users and subscriptions into distinct tables, following this principle.

The Practical Test for 3NF

For any non-key column, ask: "Does this column describe the entity identified by the primary key, or does it describe something else?" If tier_price describes a subscription tier rather than a user, it belongs in a subscription tier table. This question catches transitive dependencies without needing to formally map out dependency chains.

Boyce-Codd Normal Form (BCNF)

BCNF is a stricter version of 3NF. A table is in BCNF when every determinant (any column or set of columns that functionally determines another column) is a candidate key. In most practical cases, a table that's in 3NF is also in BCNF. The difference only appears in specific situations involving overlapping candidate keys.

Here's the classic scenario. Suppose StreamFlow assigns each artist to exactly one genre for editorial purposes, and each genre has exactly one editor responsible for it:

SQL

The functional dependencies are:

Every non-key column depends on the primary key (so it's in 3NF). But genre_name → editor_name means genre_name is a determinant that isn't a candidate key. This violates BCNF because a non-key column is determining another column.

The problem shows up as redundancy: if the "Rock" genre has 200 artists, the editor's name for Rock is stored 200 times.

Fix: Split into two tables:

SQL

In practice, BCNF violations are uncommon. They require a specific pattern of overlapping functional dependencies that most schemas don't have. Knowing 3NF thoroughly covers the vast majority of real-world design decisions and interview questions. BCNF is worth understanding for completeness, but don't expect it to come up often.

The Quick Test

There's a simple heuristic that captures the spirit of normalization without memorizing formal rules:

Can I update this fact in exactly one place?

Walk through the columns of any table and ask: if this value changes, how many rows do I need to update?

FactTableRows to UpdateNormalized?
An artist's nameartists1 rowYes
A track's durationtracks1 rowYes
A user's emailusers1 rowYes
A genre editor's nameartist_genre_editorsAll artists in that genreNo
A tier's priceusers_badAll users on that tierNo

If the answer is "one row," you're in good shape. If the answer is "every row that happens to share this value," the table has redundancy that normalization would eliminate.

This test works because all three normal forms target the same underlying issue: redundancy caused by facts being stored in the wrong table. 1NF eliminates multi-valued fields. 2NF eliminates facts that don't depend on the full key. 3NF eliminates facts that depend on non-key columns. But they all boil down to: each fact should live in one place.