Every codebase eventually accumulates repeated logic. In StreamFlow, the same expression keeps showing up: “what percentage of a track did the user actually listen to?” It starts simple, duration_listened * 100.0 / duration_seconds, but soon it appears in dashboards, batch jobs, ad-hoc queries, and BI tools. Then someone adds a cap at 100 to handle clock drift, and now every copy behaves slightly differently.

User-defined functions solve this cleanly. You give that logic a name, store it in the database, and call it consistently everywhere.

Sample Data

To follow along with the examples in this chapter, insert this data into your StreamFlow database:

SQL

Twelve streams, six tracks, four users. Small enough to reason about by hand, big enough to see what functions do.

What Is a User-Defined Function?

A user-defined function is a named, stored piece of logic that takes inputs, runs some SQL (or procedural code), and returns a result. The database stores the definition once. Every session can call it. When the logic changes, you change it in one place.

Functions come in two main flavors, and the distinction matters for both syntax and performance:

Scalar functions return a single value. You call them wherever a value is allowed: in a SELECT list, a WHERE clause, an ORDER BY, or even as a default value for a column. LENGTH('hello') and NOW() are built-in scalar functions. Your own stream_completion_pct(stream_id) would be a user-defined scalar function.
Table-valued functions return a result set, essentially a table. You use them in the FROM clause, just like you'd use a real table or a view. generate_series(1, 10) in PostgreSQL is a built-in table-valued function. A function like top_tracks_for_user(user_id, 5) that returns five rows is a user-defined one.

The rest of the chapter uses these two shapes as the main organizing idea. Syntax varies between databases, but the concepts are the same everywhere.

Scalar Functions in PostgreSQL

Let's start with the simplest possible scalar function: one that computes the completion percentage for a stream. Given a track's total duration and how many seconds the user listened, return the percentage.

SQL

A few things are happening here. CREATE OR REPLACE FUNCTION defines the function (or replaces it if one with the same name and argument types already exists). The parameter list (listened INT, total INT) declares two input parameters and their types. RETURNS NUMERIC says the function returns a single numeric value. The body sits between the $$ dollar-quote markers, which is just PostgreSQL's way of letting you write code without escaping every quote. LANGUAGE SQL says the body is plain SQL, and IMMUTABLE tells the optimizer something important that we'll unpack in the next section.

Once the function exists, you can use it anywhere a numeric value is allowed:

SQL

The function produces clean, consistent results without us having to remember the formula. If we later decide we want completion capped at 100 (because sometimes clock drift produces 101%), we change it in one place:

SQL

Every query that calls stream_pct picks up the new behavior the next time it runs. That's the promise of UDFs: logic lives in one place.

SQL vs PL/pgSQL

PostgreSQL supports functions written in several languages. The two you'll see most often are:

`LANGUAGE SQL`: the body is one or more SQL statements. Great for simple wrappers around queries. The PostgreSQL optimizer can often inline these, which is the best possible outcome for performance.
`LANGUAGE plpgsql`: a procedural language with variables, loops, conditionals, and exception handling. Needed when pure SQL isn't enough, but cannot be inlined.

Here is the same function in PL/pgSQL, which opens the door to procedural constructs:

SQL

This version does the same thing, but now we have variables (DECLARE) and control flow (IF/THEN). For logic this simple, the SQL version is better. PL/pgSQL pays off when you need loops, exceptions, or multiple result sets that you assemble step by step. A good rule of thumb: reach for LANGUAGE SQL first, and only switch to plpgsql when plain SQL can't express what you need.

Deterministic vs Non-Deterministic Functions

One of the first things the optimizer wants to know about your function is: if I call this twice with the same inputs, will I get the same answer? That single question drives a lot of performance behavior, and PostgreSQL exposes it through function volatility categories:

Category	Meaning	Example
`IMMUTABLE`	Always returns the same result for the same input. No side effects. Doesn't read from the database.	`abs(x)`, `lower(s)`, pure math
`STABLE`	Within a single query, returns the same result for the same input. May read from tables.	`now()`, `current_user`, lookups
`VOLATILE` (default)	Can return different results even with the same input. May have side effects.	`random()`, `nextval()`, INSERT/UPDATE

Why does the database care? Because it can only safely cache, reorder, or pre-compute a function's result if the function is at least stable. Consider this query:

SQL

If seconds_for_minutes is IMMUTABLE, the optimizer evaluates it once at plan time, substitutes 180, and uses an index on duration_seconds. If it's VOLATILE, the optimizer has to call it for every row, because for all it knows the function might return a different value each time. On a million-row table, that's the difference between a millisecond and several seconds.

The default when you don't specify anything is VOLATILE, which is the safest but slowest choice. Mark functions as IMMUTABLE when they truly are. If you lie to the optimizer (claim immutable when the function reads tables), you'll get wrong results in some queries, because the optimizer will happily cache stale values.

Other dialects have their own versions of this concept. MySQL uses DETERMINISTIC / NOT DETERMINISTIC and READS SQL DATA / NO SQL / CONTAINS SQL / MODIFIES SQL DATA keywords. SQL Server has a WITH SCHEMABINDING option that enables deterministic optimizations for certain function patterns. The names differ. The idea is the same everywhere: tell the database how well-behaved your function is, so it can plan around it.

Table-Valued Functions

A table-valued function (TVF) returns rows instead of a single value. You use it in a FROM clause, just like a regular table. TVFs are the right tool when you want to encapsulate a parameterized query, something a view can't do because views don't take arguments.

Here is a TVF that returns a user's most-streamed tracks, limited to the top N:

SQL

The RETURNS TABLE (...) clause declares the shape of the result: each row has four columns with the listed types. Inside the function body, any query whose result matches that shape becomes the function's output. Note that this function is STABLE, not IMMUTABLE, because it reads from tables whose contents could change.

You call a TVF in the FROM clause:

SQL

You can also join a TVF to other tables:

SQL

The LATERAL keyword is what lets the function reference u.user_id from the outer query. Without LATERAL, you couldn't pass a value from the outer row into the function call. This is how you get "top N per group" behavior cleanly in PostgreSQL.

TVFs vs Views

A view is also a named, reusable query. So why have TVFs at all? The answer is parameters. Views can't take arguments, so a view that filters to "alice's top tracks" is a different view from "bob's top tracks." A TVF lets you pass the user ID as input. If your parameterization is simple, you can sometimes get away with a view and a WHERE clause in the calling query, but once the logic involves aggregation or LIMIT inside the reusable piece, only a TVF works.

Scalar UDFs and the Optimizer Fence Problem

Now we get to the part that shows up in real interviews: why scalar UDFs can make queries mysteriously slow.

Consider this reasonable-looking query:

SQL

"Show me all streams where the user listened to more than 90% of the track." Readable, right? Here's what happens under the hood in some databases: the optimizer has to call stream_pct for every single row in the joined result, because it can't see inside the function to reason about indexes, constant folding, or predicate pushdown. The function becomes an opaque black box, and the query plan devolves into a row-by-row walk.

In PostgreSQL, a LANGUAGE SQL function that is simple enough gets inlined: the optimizer substitutes the function body into the outer query and re-plans the whole thing. When inlining works, the function has essentially zero cost. When it doesn't (because the function is PL/pgSQL, or has multiple statements, or uses features that block inlining), the function becomes a per-row cost.

In SQL Server, scalar UDFs have a notorious history of being performance killers. Before SQL Server 2019, every call to a scalar UDF happened in a separate, single-threaded mini-execution context. On a million-row query, that's a million mini-contexts. SQL Server 2019 introduced scalar UDF inlining, which mostly fixes this, but only for functions that meet specific criteria. If you work with older versions or with functions that can't be inlined, you should treat scalar UDFs in WHERE clauses as a code smell.

The diagram below shows the two paths:

The practical advice: if a function is going to appear in a WHERE clause on a large table, prefer plain SQL expressions or inlineable LANGUAGE SQL functions. Avoid PL/pgSQL scalar functions in hot paths. And always test with EXPLAIN ANALYZE before deploying.

UDFs vs Views vs CTEs

By this point it probably feels like UDFs, views, and CTEs all do similar things: they give names to query fragments. The differences matter when you're picking the right tool.

Feature	View	CTE	Scalar UDF	Table-Valued UDF
Stored permanently	Yes	No (per-query)	Yes	Yes
Takes parameters	No	No	Yes	Yes
Returns	Rows	Rows	Single value	Rows
Can be indexed	Only materialized views	No	N/A	No (but underlying tables can)
Optimizer-friendly	Yes	Usually	Depends on inlining	Depends on inlining
Callable in WHERE	No	No	Yes	No
Callable in FROM	Yes	Yes	No	Yes

A rough decision framework:

Need a parameterized result set? Table-valued UDF.
Same unparameterized query used all over the place? View.
Intermediate result scoped to one query? CTE.
Need a reusable computation that returns a single value? Scalar UDF, but be careful in WHERE clauses on large tables.
Tiny bit of logic used twice in one statement? Inline it, or use a CTE. Don't create a function for two uses.

The worst choice for most situations is "turn every repeated expression into a scalar UDF." It feels clean, but in practice it often moves slow code from one place you can see to a place you can't.

Syntax in Other Dialects

PostgreSQL's syntax is only one flavor. The same ideas show up differently in MySQL and SQL Server, and interviews sometimes check whether you know at least one other dialect.

MySQL

SQL

MySQL requires the DELIMITER dance because the function body contains semicolons, and the default statement terminator is also a semicolon. DETERMINISTIC is the MySQL equivalent of PostgreSQL's IMMUTABLE. MySQL doesn't have true table-valued functions; you get scalar functions and stored procedures instead.

SQL Server

SQL

SQL Server uses the @ prefix for variables and parameters, and the schema name (dbo.) is usually part of the function name. SQL Server has three kinds of TVFs: inline TVFs (a single SELECT, which is the fastest kind), multi-statement TVFs (slower, not inlinable historically), and scalar UDFs (the historically problematic ones). In interviews, if you mention that you'd prefer inline TVFs over multi-statement TVFs in SQL Server, you'll sound like you've actually used it.

Overloading and Default Arguments

PostgreSQL allows you to define multiple functions with the same name as long as their argument types differ. This is called overloading, and it's useful when you want one conceptual function that accepts different input shapes. For example, you might have one stream_pct that takes two integers and another that takes a single stream_id:

SQL

Now both stream_pct(180, 210) and stream_pct(42::BIGINT) work. The optimizer picks the right one based on argument types. Be careful: overloading can make dropping functions tricky, because DROP FUNCTION stream_pct is ambiguous. You have to write DROP FUNCTION stream_pct(BIGINT) to name the exact variant.

Default arguments work the same way they do in most programming languages:

SQL

Now top_tracks_for_user(1) returns the top 10 automatically, while top_tracks_for_user(1, 3) overrides the default. Defaults are handy for functions with many parameters where most callers want the same values.

One thing to watch for: if you overload a function and also use default arguments, you can create ambiguous call sites that the optimizer rejects because more than one function matches. When in doubt, stick to either overloading or defaults, not both.

Production Realities

A few practical notes that don't always make it into documentation:

Dropping functions is painful. If other objects depend on the function (views, other functions, default expressions), you can't drop it without dropping them too, unless you use CASCADE, which silently drops the dependents. Always check dependencies first.
Permissions matter. Functions run with the caller's permissions by default in PostgreSQL. If you want a function to read tables the caller can't see directly, use SECURITY DEFINER (with care, because it's a classic vector for privilege escalation bugs).
Testing is awkward. Database functions don't have the same unit test tooling as application code. Tools like pgTAP exist for PostgreSQL, but adoption is low. Many teams end up testing functions implicitly through the queries that use them.
Version control is easy to neglect. A function defined with CREATE OR REPLACE in a one-off session is invisible to your source control. Always put function definitions in migration files and apply them through your normal deploy process.

None of these are reasons to avoid UDFs. They're reasons to treat them as real code, with the same care you'd give to application code.

User-Defined Functions (UDFs)