Every query so far has worked with individual rows. Scalar functions transform one value at a time, WHERE filters one row at a time, and the result set contains one row per input row (minus whatever gets filtered out).

Aggregate functions work differently. They take a set of rows and collapse them into a single value: a total, a count, an average. That shift from "per-row" to "per-set" thinking is the foundation of everything in this section of the course.

Sample Data

Insert the following rows to follow along:

SQL

Aggregate Functions vs Scalar Functions

The scalar functions operate on one value at a time: give UPPER a string, get back one uppercase string. The output has the same number of rows as the input.

Aggregate functions go in the opposite direction. They consume multiple rows and produce a single result. COUNT takes a set of rows and returns how many there are. SUM takes a column of numbers and returns their total. The output has fewer rows than the input, often just one.

Scalar functions produce one output per input row. Aggregate functions reduce many rows down to one. This distinction matters because you can't freely mix them in a SELECT clause without GROUP BY, something the next few chapters will build toward.

The three most common aggregate functions are COUNT, SUM, and AVG. SQL also has MIN and MAX, which the next chapter covers separately.

COUNT

COUNT is the most frequently used aggregate function. It answers the question "how many?" But there are three distinct forms of COUNT, and they behave differently.

COUNT(*)

COUNT(*) counts every row in the result set, regardless of what values those rows contain. NULLs, duplicates, empty strings, zeroes, it counts them all.

Example: How many streams are in the table?

SQL

The streams table has 20 rows, so this returns 20. If you add a WHERE clause, COUNT(*) counts only the rows that pass the filter:

SQL

This returns 10 because ten streams have country = 'US'.

COUNT(*) is the right choice when you want to count rows, not values. It doesn't look at any particular column.

COUNT(column)

COUNT(column) counts the number of non-NULL values in that column. Rows where the column is NULL are skipped.

The streams table has a country column, and one stream (stream_id 7, from gina_null) has a NULL country. Compare:

SQL

COUNT(*) returns 20. COUNT(country) returns 19, because it ignores the one NULL.

This behavior applies to every aggregate function, not just COUNT. It's the fundamental rule of NULL handling in aggregation: aggregate functions skip NULL values. The only exception is COUNT(*), which doesn't look at column values at all.

COUNT(DISTINCT column)

COUNT(DISTINCT column) counts the number of unique non-NULL values. It combines deduplication with counting in a single operation.

How many different countries do streams come from?

SQL

The streams table has countries US, GB, KR, NULL, JP, and BR. COUNT(DISTINCT country) returns 5, not 6, because it skips NULL and then deduplicates the rest.

This is more concise than the equivalent two-step approach of using a subquery:

SQL

Putting the Three Forms Together

A single query can use all three forms to reveal the structure of a column:

SQL

Reading these three numbers together tells you a lot. There are 20 streams total, 19 have a country recorded, and those 19 values span 5 distinct countries. The gap between 20 and 19 reveals one NULL. The gap between 19 and 5 reveals significant repetition.

This pattern is useful when exploring a new dataset or validating data quality.

SUM

SUM adds up the values in a numeric column across all rows. Non-numeric columns produce an error.

Example: What's the total listening time across all streams?

SQL

This adds up the duration_listened value from all 20 rows and returns a single number.

You can apply arithmetic to the column inside SUM, or to the result. To get the total in minutes:

SQL

Or to sum a calculated value, like converting seconds to hours per row before summing:

SQL

SUM and NULLs

Like all aggregate functions, SUM skips NULL values. If three rows have values 10, NULL, and 30, SUM returns 40, not NULL. This is usually what you want, but it's worth being explicit about.

Where it matters is in calculations involving columns that can be NULL. The payments table has a refund_amount column that defaults to 0 for most rows but could be NULL if someone forgot to set it.

Example: Calculating net revenue:

SQL

This works because refund_amount in the sample data is always 0 or 9.99, never NULL. But if refund_amount were NULL for some rows instead of 0, SUM(refund_amount) would skip those rows, potentially understating refunds. The safe approach with columns that might contain NULLs is to use COALESCE:

SQL

COALESCE(refund_amount, 0) converts NULLs to 0 before SUM processes them.

SUM with Filtered Rows

SUM works with WHERE to aggregate a subset of rows.

Example: Total revenue from completed payments only:

SQL

This filters out the refunded and failed payments before summing.

SUM of Zero Rows

If WHERE eliminates all rows, SUM returns NULL, not 0.

SQL

No rows have status 'pending', so SUM has nothing to add up. The result is NULL. If you need 0 instead of NULL in this case, wrap it with COALESCE:

SQL

This returns 0 instead of NULL. The same behavior applies to AVG and all other aggregate functions except COUNT(*), which returns 0 when applied to an empty set.

AVG

AVG computes the arithmetic mean: the sum divided by the count. Like SUM, it only works on numeric columns.

Example: What's the average payment amount?

SQL

This sums all amount values and divides by the number of non-NULL values.

AVG and NULLs: The Hidden Trap

AVG skips NULL values, and this can produce results that don't match expectations. Consider a simpler example: suppose you have five rows with values 10, 20, NULL, NULL, 30.

SUM = 60 (skips the two NULLs)
COUNT(column) = 3 (skips the two NULLs)
AVG = 60 / 3 = 20

If you expected AVG to treat NULLs as 0, the "correct" average would be 60 / 5 = 12. But SQL doesn't work that way. AVG divides by the count of non-NULL values, not the total row count.

This distinction matters in the streams table if duration_listened could be NULL (representing streams that failed to log a duration). The average listening time would be calculated over only the streams that have a recorded duration, potentially overstating the true average.

To include NULLs as 0 in the average:

SQL

COALESCE converts NULLs to 0 before AVG processes them, changing both the sum (unchanged, since 0 adds nothing) and the count (increased, since 0 is not NULL).

Controlling Decimal Precision

AVG often produces results with many decimal places. Use ROUND to control precision:

SQL

This returns the average payment rounded to two decimal places.

AVG(DISTINCT column)

Like COUNT, AVG supports the DISTINCT keyword. AVG(DISTINCT column) averages only the unique values:

SQL

The payments table has amounts 9.99 and 14.99. AVG(amount) sums all 12 rows and divides by 12. AVG(DISTINCT amount) sums 9.99 and 14.99 (just the two unique values) and divides by 2, giving 12.49.

Weighted Averages

A plain AVG gives equal weight to every row. Sometimes that's misleading. If you're calculating the average track duration across all streams, each stream event should weight the track's duration by how many times it was streamed.

The simplest way to compute a weighted average in SQL is with SUM and COUNT (or SUM and SUM):

SQL

The key idea is the formula: SUM(value * weight) / SUM(weight). A track that was streamed 5 times contributes 5 times as much to the average as a track streamed once.

Combining Multiple Aggregates

A single SELECT can include multiple aggregate functions, and they all operate on the same set of rows:

SQL

Each aggregate independently processes the filtered row set. COUNT(*) counts how many completed payments exist, SUM(amount) adds their amounts, and so on. Filtering happens once (via WHERE), and all aggregates share that filtered input.

You can also mix aggregates with constants and arithmetic:

SQL

Aggregate Functions: COUNT, SUM, AVG