Last Updated: May 3, 2026
A query that filters on an unindexed column forces the database to read every row in the table. On a table with 10 million rows, that means 10 million comparisons just to find a handful of matches.
Indexes solve this by giving the database a shortcut to find the rows it needs without scanning the entire table.
Insert these rows to follow along:
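The exact setup statements belong to the earlier chapters. As a minimal sketch, assuming a schema with the columns referenced later in this chapter (user_id, track_id, country, started_at; the stream_id key and column types are assumptions):

```sql
-- Assumed schema; the course's actual definition may differ
CREATE TABLE streams (
    stream_id  SERIAL PRIMARY KEY,
    user_id    INT,
    track_id   INT,
    country    VARCHAR(2),
    started_at TIMESTAMP
);

-- Representative sample rows (values illustrative)
INSERT INTO streams (user_id, track_id, country, started_at) VALUES
    (1, 101, 'US', '2024-01-15 08:30:00'),
    (2, 102, 'GB', '2024-02-03 14:10:00'),
    (3, 101, 'DE', '2024-02-20 19:45:00');
-- ...continue in the same pattern until the table holds 15 rows
```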
The streams table has 15 rows across multiple users, tracks, and countries. That is enough to illustrate how indexes work, though the real benefit of indexes becomes clear at thousands or millions of rows.
An index is a separate data structure that the database maintains alongside a table. It stores a sorted copy of one or more column values, paired with pointers back to the full rows in the table. Think of it like the index at the back of a textbook: instead of reading every page to find where "B-Tree" is discussed, you look up "B-Tree" in the index and jump straight to the right page.
Without an index, the database has one option: read every row in the table and check whether it matches the filter. With an index on the filtered column, the database can locate the matching rows directly.
The following diagram shows the difference:
The sequential scan touches every row regardless of how many match. The index lookup goes directly to the matching entries. On a table with millions of rows, that is the difference between a query that takes seconds and one that takes milliseconds.
When you run a query like this:
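A representative query (the users table and the exact column list are assumed from the surrounding discussion):

```sql
SELECT * FROM users WHERE country = 'US';
```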
The database performs a sequential scan (also called a full table scan). It starts at the first row, checks if country = 'US', moves to the second row, checks again, and repeats until it has processed every row in the table.
For the 10-row users table above, this is perfectly fine. The database reads 10 rows and returns 3 matches. But consider what happens when the table grows. If users had 5 million rows and only 200 matched the filter, the database would still read all 5 million rows to find those 200. The work scales with the total number of rows, not the number of matches.
This is where indexes change the equation. By creating an index on the country column, the database can skip straight to the rows where country = 'US' without reading the rest of the table. The work scales with the number of matching rows instead.
B-Tree (balanced tree) is the default index type in PostgreSQL, MySQL, SQL Server, and other popular relational databases.
When you create an index without specifying a type, you get a B-Tree.
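In PostgreSQL, for example, these two statements are equivalent (the index and column names are illustrative):

```sql
-- Omitting the type gives you a B-Tree by default
CREATE INDEX idx_users_country ON users (country);
-- Equivalent, with the type spelled out explicitly:
-- CREATE INDEX idx_users_country ON users USING btree (country);
```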
A B-Tree organizes values into a hierarchy of nodes. The root node sits at the top and contains a few key values that divide the data into ranges. Internal nodes further subdivide those ranges. Leaf nodes at the bottom contain the actual indexed values along with pointers to the corresponding rows in the table.
The tree stays balanced, meaning every leaf node is the same distance from the root. This guarantees that looking up any value requires the same number of steps, regardless of where in the range it falls.
To find all users where country = 'GB', the database starts at the root node and follows the path:
1. Root node: DE | JP. Since 'GB' falls between 'DE' and 'JP', follow the middle pointer.
2. Internal node: DE | GB. Since 'GB' matches, follow the pointer for 'GB'.
3. Leaf node: GB → rows 3, 6. Fetch those rows from the table.

Three steps instead of scanning every row. On a real B-Tree with millions of entries, the tree might be 3-4 levels deep, meaning any lookup touches only 3-4 pages of data, not millions.
B-Trees handle range queries naturally because the leaf nodes are linked in order. To find all users who signed up after 2024-01-01:
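In SQL, assuming a signup_date column on users:

```sql
SELECT * FROM users WHERE signup_date > '2024-01-01';
```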
The database uses the B-Tree index on signup_date to find the first entry after 2024-01-01, then walks the linked leaf nodes forward until it passes the end of the range. It does not need to scan the entire index, just the portion that matches.
This is why B-Trees support all of these operations efficiently:
| Operation | Example | How the B-Tree Helps |
|---|---|---|
| Equality | WHERE country = 'US' | Navigate tree to exact value |
| Range | WHERE signup_date > '2024-01-01' | Find start point, scan leaf nodes forward |
| BETWEEN | WHERE user_id BETWEEN 3 AND 7 | Find lower bound, scan to upper bound |
| ORDER BY | ORDER BY signup_date | Read leaf nodes in order (already sorted) |
| MIN / MAX | SELECT MIN(signup_date) | Jump to first or last leaf node |
| Prefix match | WHERE username LIKE 'ali%' | Navigate to 'ali', scan forward while prefix matches |
One operation B-Trees cannot help with is a suffix or infix search like WHERE username LIKE '%dev'. The tree is sorted by the beginning of the string, so there is no shortcut to find values ending with 'dev'. That requires a sequential scan (or a specialized index type like GIN with trigram support, covered in the next chapter).
Because B-Tree leaf nodes are already in sorted order, the database can sometimes skip the sort step entirely when the query has an ORDER BY that matches the index order.
If an index exists on signup_date, the database can read the index leaf nodes in order and return the results without a separate sort operation. This matters on large tables where sorting millions of rows is expensive.
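A sketch of such a query, using the column names from the earlier examples:

```sql
-- With an index on signup_date, results can stream straight
-- from the index leaf nodes with no separate sort step
SELECT username, signup_date
FROM users
ORDER BY signup_date;
```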
Hash indexes use a hash function to map each value to a fixed-size bucket. The database computes the hash of the lookup value and jumps directly to the right bucket.
For a query like WHERE user_id = 4, the database:
1. Computes the hash of the value 4.
2. Jumps to the bucket that hash maps to.
3. Follows the pointers stored in that bucket to the rows where user_id = 4.

This is an O(1) operation on average, compared to O(log n) for a B-Tree lookup. For pure equality checks on high-cardinality columns (many distinct values), hash indexes can be slightly faster.
Hash indexes have significant restrictions compared to B-Trees:
| Capability | B-Tree | Hash |
|---|---|---|
| Equality (=) | Yes | Yes |
| Range (>, <, BETWEEN) | Yes | No |
| ORDER BY | Yes | No |
| MIN / MAX | Yes | No |
| Prefix LIKE ('ali%') | Yes | No |
| NULL handling | Yes | Varies by database |
Because values are hashed, the original ordering is lost. The database cannot walk through hash buckets in sorted order, so range queries, sorting, and prefix matching all require a full table scan regardless of the hash index.
In practice, B-Tree indexes handle equality lookups well enough that hash indexes are rarely worth the trade-off. B-Trees give up a small amount of equality lookup speed in exchange for supporting every other operation. Most database administrators default to B-Tree for almost everything.
Hash indexes make sense in narrow scenarios: a column that is only ever filtered with =, never sorted or ranged, and where the marginal speed improvement matters. In PostgreSQL, hash indexes were not crash-safe until version 10 and are still not as widely used or tested as B-Trees.
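In PostgreSQL, a hash index is created by naming the type explicitly (the column choice here is illustrative):

```sql
CREATE INDEX idx_users_email_hash ON users USING hash (email);
```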
Indexes are not free. Every index you create comes with costs that affect writes, storage, and maintenance.
An index is a separate data structure stored on disk alongside the table. A B-Tree index on a single integer column typically uses about 20-30% of the space the table data occupies. Text columns produce larger indexes because the keys are bigger. An index on a VARCHAR(100) column takes more space than an index on an INT column.
On a table with 10 million rows, a single B-Tree index might use 200-400 MB of disk. Five indexes on the same table could add 1-2 GB of storage overhead.
Every time you insert, update, or delete a row, the database must also update every index on that table. If a table has 5 indexes, a single INSERT triggers 6 write operations: one for the table data and one for each index.
Updates are even worse when they modify an indexed column. The old entry must be removed from the index and a new one inserted. This is why heavily indexed tables can see significantly slower write performance.
B-Tree indexes can become fragmented over time as rows are inserted and deleted. Leaf nodes develop gaps where deleted entries used to be. PostgreSQL handles this with autovacuum, but on very write-heavy tables, index bloat can still cause performance degradation. Periodic REINDEX operations can reclaim wasted space.
An index is only useful if it narrows down the search to a small fraction of the table. This property is called selectivity.
A query with high selectivity matches a small percentage of rows. Indexes work well here because the database reads the index, finds a few row pointers, and fetches just those rows.
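A typical high-selectivity query (the specific id value is illustrative):

```sql
SELECT * FROM users WHERE user_id = 7;
```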
The user_id column has a unique value per row, so this query matches exactly one row. An index on user_id (which a PRIMARY KEY already provides) turns this into a fast lookup regardless of table size.
A query with low selectivity matches a large percentage of rows. In this case, an index can actually be slower than a sequential scan.
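A typical low-selectivity query; the plan column name is an assumption based on the free-tier example that follows:

```sql
SELECT * FROM users WHERE plan = 'free';
```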
If 40% of users are on the free tier, the database would need to look up 40% of the rows through the index, then fetch each one from the table. The random I/O pattern of jumping between index entries and table pages is slower than just reading the table sequentially from start to finish.
The database's query planner knows this. It tracks statistics about column value distributions and decides whether to use an index or a sequential scan based on the estimated selectivity. You can see this decision by running EXPLAIN (covered in a later chapter).
There is no exact percentage where indexes stop being useful, as it depends on factors like table size, row width, and disk speed. As a rough guide:
| Selectivity | Matching Rows | Index Useful? |
|---|---|---|
| Very high | < 1% of rows | Almost always yes |
| High | 1-10% of rows | Usually yes |
| Medium | 10-25% of rows | Depends on table size and row width |
| Low | > 25% of rows | Usually no, sequential scan is faster |
The query planner makes this decision automatically. You do not need to force index usage. Instead, focus on creating indexes on columns with high selectivity that appear frequently in WHERE clauses.
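Creating an index is a single statement. A sketch, using the streams table (the column choice is illustrative):

```sql
CREATE INDEX idx_streams_country ON streams (country);
```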
The naming convention idx_{table}_{column} is not required but is widely used. A clear name makes it easier to identify what an index covers when reviewing performance.
A unique index enforces that no two rows can have the same value in the indexed column. Primary keys automatically create a unique index.
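A sketch, assuming an email column on the users table:

```sql
CREATE UNIQUE INDEX idx_users_email ON users (email);
```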
This both speeds up lookups on email and prevents duplicate email addresses. Attempting to insert a duplicate value will fail with a constraint violation error.
Removing an index that is no longer needed frees up storage and speeds up writes:
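For example (index name illustrative):

```sql
DROP INDEX idx_streams_country;
```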
In PostgreSQL, DROP INDEX takes an exclusive lock on the table by default, which blocks other queries. For production tables, use DROP INDEX CONCURRENTLY to avoid blocking:
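For example (note that CONCURRENTLY cannot run inside a transaction block):

```sql
DROP INDEX CONCURRENTLY idx_streams_country;
```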
To see what indexes already exist on a table:
| Database | Command |
|---|---|
| PostgreSQL | SELECT indexname, indexdef FROM pg_indexes WHERE tablename = 'streams'; |
| MySQL | SHOW INDEX FROM streams; |
| SQL Server | sp_helpindex 'streams'; |
Checking existing indexes before creating new ones avoids duplicate indexes, which waste storage and slow down writes without providing any query benefit.
Primary key columns automatically get an index. You never need to create one manually.
Foreign key columns do not automatically get an index in most databases (MySQL's InnoDB is an exception, as it requires an index on foreign key columns and creates one if missing). Since foreign keys are frequently used in JOIN conditions, creating indexes on foreign key columns is one of the highest-impact optimizations you can make.
More indexes are not always better. Each index slows down writes and uses disk space. A table with 15 columns does not need 15 indexes. Focus on columns that appear in WHERE clauses, JOIN conditions, and ORDER BY clauses of your most frequent queries.
This is one of the most common ways to accidentally disable an index:
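For example, filtering on a function of started_at (the year value is illustrative):

```sql
-- An index on started_at cannot be used for this filter
SELECT * FROM streams
WHERE EXTRACT(YEAR FROM started_at) = 2024;
```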
What goes wrong here?
The index is built on the raw started_at values. When you wrap the column in a function like EXTRACT, the database cannot match the transformed values against the index. It falls back to a sequential scan.
Fix: Rewrite the condition to use the column directly:
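Expressed as a range on the raw column (dates chosen to match the year-based filter above):

```sql
SELECT * FROM streams
WHERE started_at >= '2024-01-01'
  AND started_at <  '2025-01-01';
```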
Now the database can use the B-Tree index on started_at to find the range.
Foreign key constraints enforce referential integrity, but they do not create indexes automatically (except in MySQL InnoDB). Without an index on the foreign key column, queries that join tables or delete parent rows perform sequential scans on the child table.
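A join that benefits from such an index might look like this (the tracks table and its title column are assumptions based on the schema implied earlier):

```sql
-- Without an index on streams.track_id, each track forces
-- a sequential scan of the streams table
SELECT t.title, COUNT(*) AS play_count
FROM tracks t
JOIN streams s ON s.track_id = t.track_id
GROUP BY t.title;
```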
Adding CREATE INDEX idx_streams_track_id ON streams (track_id) allows the database to look up matching streams for each track using the index instead of scanning.