
Regular Expressions

Last Updated: May 17, 2026


A regular expression (regex) is a small pattern language for matching, extracting, and replacing text that follows a shape rather than an exact value. When you need to ask "does this string look like a phone number?", "pull every order ID out of this log line", or "split on commas or semicolons", plain string methods like Contains and IndexOf run out of road quickly. The System.Text.RegularExpressions namespace gives you a proper tool for those cases. This chapter walks through when to reach for regex, the Regex class and its operations, the pieces of pattern syntax you'll actually use, groups and captures, the RegexOptions flags, the source-generated [GeneratedRegex] form added in .NET 7, and the pitfalls that bite beginners.

When to Use Regex (and When Not To)

Regex shines when the thing you're matching has a shape but the exact characters vary. A few examples from an e-commerce app:

  • An order ID always starts with ORD-, then four digits, a dash, and a sequence of digits. IndexOf can find the prefix, but checking the digit counts cleanly is awkward.
  • A product description contains prices that look like $19.99 or $199. Extracting every one of them by hand means scanning for $, then reading digits, then maybe a dot, then more digits. Regex does that in one pattern.
  • A CSV row might use commas, semicolons, or pipe characters as separators. string.Split(',') only handles one of those at a time.

Regex is the wrong tool when:

  • You only need a plain substring check. text.Contains("FREE") is faster, clearer, and won't surprise anyone who reads it.
  • You only need a prefix or suffix. text.StartsWith("ORD-") and text.EndsWith(".json") say what they mean.
  • You're parsing JSON, XML, HTML, or any other structured language. Use System.Text.Json, an XML reader, or an HTML parser. Regex on these formats works for trivial cases and breaks the moment the input gets real.
  • The pattern would be built from raw user-supplied input. Either escape that input with Regex.Escape first or rethink the design.

A short heuristic. If you can describe the match in one sentence using words like "any digit", "one or more", "ends with", "either A or B", that's regex territory. If the description is "exactly this string" or "every line that contains X", string methods do the job.

The Regex Class

Everything lives in System.Text.RegularExpressions. The Regex class has two flavors of API, static and instance, and they offer the same operations.

The static methods take both the input and the pattern on every call:
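A minimal sketch of that call style. The input string and the order-ID pattern are illustrative, not taken from a real system:

```csharp
using System;
using System.Text.RegularExpressions;

string input = "Order ORD-2024-00123 shipped";

// Static call: the pattern travels with the input on every call.
bool hasOrderId = Regex.IsMatch(input, @"ORD-\d{4}-\d+");
Console.WriteLine(hasOrderId); // True
```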

The instance API compiles the pattern once into a Regex object, then lets you call methods on that object many times:
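A minimal sketch of the instance style, assuming the same illustrative order-ID pattern and a couple of made-up log lines:

```csharp
using System;
using System.Text.RegularExpressions;

// Build the Regex once, then reuse it for every input.
var orderId = new Regex(@"ORD-\d{4}-\d+");

string[] lines = { "Order ORD-2024-00123 shipped", "No order here" };
int matched = 0;
foreach (string line in lines)
{
    if (orderId.IsMatch(line)) matched++;
}
Console.WriteLine(matched); // 1
```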

Use the static API for one-off checks scattered around your code. Use the instance API when you have a pattern you'll apply many times in a row, especially inside a loop. The static API does cache compiled patterns internally (.NET keeps around the most recently used 15 by default), so the difference for short-lived programs is small. For long-running services or hot paths, the instance API plus RegexOptions.Compiled (or the source-generated form covered later) makes the cost predictable.

A note on string literals. Regex patterns are full of backslashes (\d, \s, \., and so on). In a regular C# string, "\d" is invalid because \d isn't a recognized escape. You'd have to write "\\d" everywhere. The verbatim string form (@"...") treats backslashes as literal, so the pattern reads the way regex documentation shows it:
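A small sketch comparing the two literal forms; the pattern itself is just an illustration:

```csharp
using System;
using System.Text.RegularExpressions;

string escaped  = "\\d{4}-\\d+"; // regular string: every backslash doubled
string verbatim = @"\d{4}-\d+";  // verbatim string: backslashes are literal

// Both literals produce the same characters, so the regex is identical.
Console.WriteLine(escaped == verbatim); // True
bool bothMatch = Regex.IsMatch("2024-00123", escaped)
              && Regex.IsMatch("2024-00123", verbatim);
Console.WriteLine(bothMatch); // True
```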

Both compile to the same regex, but the verbatim form is what most C# codebases use for patterns. We'll stick to @"..." for the rest of this chapter.

A Map of the Operations

Five operations cover almost everything you'll do with a regex. They all exist on both the static Regex class and on instance objects.

IsMatch answers a yes/no question. Match returns the first match. Matches returns every match. Replace produces a new string with matches swapped out. Split breaks the input on every match. We'll see one example of each before getting into pattern syntax in depth.
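One line per operation, as a sketch. The input text and patterns are made up for illustration:

```csharp
using System;
using System.Text.RegularExpressions;

string text = "ORD-1001, ORD-1002; ORD-1003";

bool any       = Regex.IsMatch(text, @"ORD-\d+");        // True
string first   = Regex.Match(text, @"ORD-\d+").Value;    // "ORD-1001"
int count      = Regex.Matches(text, @"ORD-\d+").Count;  // 3
string masked  = Regex.Replace(text, @"\d+", "####");    // "ORD-####, ORD-####; ORD-####"
string[] parts = Regex.Split(text, @"[,;]\s*");          // "ORD-1001", "ORD-1002", "ORD-1003"

Console.WriteLine($"{any} {first} {count} {masked} {parts.Length}");
```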

Five lines, five operations. We'll come back to each one with realistic e-commerce examples after walking through the pattern pieces they all rely on.

Pattern Syntax Basics

The pattern language has dozens of features, but a handful do most of the work. Here's a reference table for the tokens you'll meet repeatedly, followed by short explanations.

Token        Meaning
a, 1, -      A literal character matches itself
\d           Any decimal digit (0-9, plus other Unicode digits in .NET by default)
\D           Any non-digit
\w           Word character: letters, digits, underscore
\W           Anything not a word character
\s           Whitespace: space, tab, newline
\S           Anything not whitespace
.            Any character except a newline
[abc]        Any one of a, b, or c
[a-z]        Any lowercase letter
[^abc]       Any character that is not a, b, or c
^            Start of the input (or line, with Multiline)
$            End of the input (or line, with Multiline)
?            Zero or one of the preceding token
*            Zero or more
+            One or more
{n}          Exactly n
{n,m}        Between n and m
{n,}         n or more
*?, +?, ??   Lazy versions of *, +, ?
(...)        A capturing group
(?:...)      A non-capturing group
a|b          Either a or b
\., \\, \(   Escaped literal of a metacharacter

A few of these need their own walk-through, because they trip up beginners.

Character Classes

A character class matches one character from a set. [abc] matches a, b, or c. [a-z] is a shorthand for the range a through z. [^abc] is the negation: any character that is not a, b, or c.
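A three-line sketch of character classes in action, using clothing sizes as illustrative input:

```csharp
using System;
using System.Text.RegularExpressions;

bool sizeM   = Regex.IsMatch("M", "[SML]");     // True
bool sizeXL  = Regex.IsMatch("XL", "[SML]");    // True: the L is in the class
bool negated = Regex.IsMatch("XYZ", "[^A-Z]");  // False: nothing outside A-Z

Console.WriteLine(sizeM);
Console.WriteLine(sizeXL);
Console.WriteLine(negated);
```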

[SML] matches a single character that is S, M, or L. The string XL doesn't start with one of those characters, but an unanchored pattern can match anywhere in the input, and XL contains L, so the match succeeds. The third line is False because every character in XYZ is an uppercase letter, so the negated class [^A-Z] finds nothing.

Quantifiers

A quantifier says how many copies of the previous token to match. \d+ is "one or more digits". \d{4} is "exactly four digits". \d{2,4} is "between two and four digits". \d* is "zero or more digits".
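A short sketch of those quantifiers against a few illustrative inputs. The anchors ^ and $ are added here so each check applies to the whole string:

```csharp
using System;
using System.Text.RegularExpressions;

bool fourDigits  = Regex.IsMatch("2024", @"^\d{4}$");  // True
bool threeDigits = Regex.IsMatch("202", @"^\d{4}$");   // False: one digit short
bool twoToFour   = Regex.IsMatch("123", @"^\d{2,4}$"); // True
bool emptyStar   = Regex.IsMatch("", @"^\d*$");        // True: zero digits is allowed
bool emptyPlus   = Regex.IsMatch("", @"^\d+$");        // False: + needs at least one

Console.WriteLine($"{fourDigits} {threeDigits} {twoToFour} {emptyStar} {emptyPlus}");
```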

The empty string matches \d* because zero digits is a valid count for *. That's a frequent source of bugs: a pattern that should require at least one match accidentally uses * when it meant +.

Greedy vs Lazy

By default, quantifiers are greedy: they grab as many characters as possible while still allowing the overall pattern to match. Adding a ? makes them lazy: they grab as few as possible.
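A toy sketch of the difference, using a made-up snippet of markup as input:

```csharp
using System;
using System.Text.RegularExpressions;

string html = "<b>bold</b>";

string greedy = Regex.Match(html, "<.+>").Value;   // "<b>bold</b>"
string lazy   = Regex.Match(html, "<.+?>").Value;  // "<b>"

Console.WriteLine(greedy);
Console.WriteLine(lazy);
```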

The greedy .+ grabbed everything from the first < to the last >. The lazy .+? stopped at the first >. That's the difference. (As an aside, this is a toy example. Parsing real HTML with regex doesn't work; use an HTML parser.)

Anchors

^ matches the start of the input, $ matches the end. Without anchors, a pattern can match anywhere inside the string. With anchors, the pattern has to line up with the boundaries.
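A sketch with three inputs against one anchored pattern. The order-ID format is the illustrative one used earlier in this chapter:

```csharp
using System;
using System.Text.RegularExpressions;

string pattern = @"^ORD-\d{4}-\d+$";

bool exact    = Regex.IsMatch("ORD-2024-00123", pattern);              // True
bool shortSeq = Regex.IsMatch("ORD-2025-7", pattern);                  // True
bool embedded = Regex.IsMatch("Order ORD-2024-00123 placed", pattern); // False

Console.WriteLine(exact);
Console.WriteLine(shortSeq);
Console.WriteLine(embedded);
```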

The third one is False because the input has the word "Order" before ORD- and "placed" after the ID, so the pattern can't line up with the start and end. Anchors are how you say "the whole string must look like this", which is the right shape for validation.

Grouping and Alternation

Parentheses group a sequence of tokens so a quantifier can apply to all of them at once, and they also capture the matched text. a|b is alternation: match a or b.
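A brief sketch of both ideas, with illustrative inputs:

```csharp
using System;
using System.Text.RegularExpressions;

// The + applies to the whole group, so it matches repeated "abc" runs.
string repeated = Regex.Match("abcabcx", @"(abc)+").Value; // "abcabc"

// Alternation inside anchors: the whole input must be one of the options.
bool yes   = Regex.IsMatch("yes", @"^(yes|no)$");   // True
bool maybe = Regex.IsMatch("maybe", @"^(yes|no)$"); // False

Console.WriteLine($"{repeated} {yes} {maybe}");
```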

(abc)+ means "one or more abcs in a row". ^(yes|no)$ means "the whole input is either yes or no".

A capturing group costs a small amount of memory and bookkeeping. When you only need the grouping for the quantifier or alternation and don't care about the captured text, use the non-capturing form (?:...):
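A sketch showing that both forms match the same text, while only the capturing form records a numbered group:

```csharp
using System;
using System.Text.RegularExpressions;

Match capturing    = Regex.Match("abcabc", @"(abc)+");
Match nonCapturing = Regex.Match("abcabc", @"(?:abc)+");

Console.WriteLine(capturing.Value == nonCapturing.Value); // True
Console.WriteLine(capturing.Groups.Count);    // 2: group 0 plus group 1
Console.WriteLine(nonCapturing.Groups.Count); // 1: only group 0
```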

Escaping

To match a character that has special meaning in regex (., (, \, ?, *, +, ^, $, [, ], {, }, |), put a backslash in front of it. To match a literal dot, write \.. To match a literal backslash, write \\. To match a literal opening parenthesis, write \(.
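A sketch of an escaped dot in a price-like pattern, with an unescaped variant alongside for contrast. The inputs are illustrative:

```csharp
using System;
using System.Text.RegularExpressions;

bool price = Regex.IsMatch("19.99", @"^\d+\.\d{2}$"); // True
bool noDot = Regex.IsMatch("19x99", @"^\d+\.\d{2}$"); // False: x is not a dot
bool loose = Regex.IsMatch("19x99", @"^\d+.\d{2}$");  // True: unescaped . matches x

Console.WriteLine($"{price} {noDot} {loose}");
```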

The pattern says "digits, then a literal dot, then exactly two digits". Without the backslash, . would match any character, and the pattern would be much looser than intended.

The Five Operations in Practice

Each operation has a typical e-commerce use case. Walking through them with realistic inputs makes the patterns easier to remember.

IsMatch: Validating a Phone Number

A loose phone number check: ten digits, optionally separated by spaces, dashes, or dots. This is loose by design. Real phone validation needs country codes and follows ITU rules; what we want here is "does this look like a phone number at all" for a quick UI check.
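One possible pattern for that loose check, as a sketch. The 3-3-4 digit grouping is an assumption made for this example; adjust it for the formats you expect:

```csharp
using System;
using System.Text.RegularExpressions;

// Assumed shape: 3 digits, 3 digits, 4 digits, each optionally separated.
var phone = new Regex(@"^\d{3}[\s.-]?\d{3}[\s.-]?\d{4}$");

bool dashed   = phone.IsMatch("555-123-4567"); // True
bool dotted   = phone.IsMatch("555.123.4567"); // True
bool plain    = phone.IsMatch("5551234567");   // True
bool tooShort = phone.IsMatch("555-1234");     // False

Console.WriteLine($"{dashed} {dotted} {plain} {tooShort}");
```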

^...$ anchors the pattern to the whole string. [\s.-]? is "zero or one of: whitespace, dot, or dash". The same digits-then-optional-separator shape repeats for each group of digits. A production form would also strip the separators and confirm that at least ten digits remain, but for a "does this look reasonable?" check, this pattern is fine.

A similar pattern for a loose email check:
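A deliberately loose sketch: something without @ or whitespace, an @, a domain part, a dot, and a final part. This pattern is an illustration, not an RFC-complete validator:

```csharp
using System;
using System.Text.RegularExpressions;

var email = new Regex(@"^[^@\s]+@[^@\s]+\.[^@\s]+$");

bool ok       = email.IsMatch("ada@example.com");    // True
bool noDot    = email.IsMatch("ada@example");        // False: no dot after the @
bool hasSpace = email.IsMatch("ada lovelace@x.com"); // False: whitespace not allowed

Console.WriteLine($"{ok} {noDot} {hasSpace}");
```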

A note on email validation. The full RFC 5322 grammar for email addresses is genuinely complex, and "valid by the RFC" doesn't always mean "deliverable". For most apps, a loose pattern like the one above plus an actual verification email is a better trade-off than trying to encode the full RFC in regex.

Match: Extracting the First Price

Match returns a Match object describing the first match. You read .Value for the matched text, .Success to check whether it matched at all, .Index for the position, and .Length for the length.
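A sketch that pulls the first price out of an illustrative product description and reads those properties:

```csharp
using System;
using System.Text.RegularExpressions;

string description = "Wireless mouse, now $19.99 (was $24.99)";

Match first = Regex.Match(description, @"\$\d+(?:\.\d{2})?");

if (first.Success)
{
    Console.WriteLine(first.Value);  // $19.99
    Console.WriteLine(first.Index);  // 20
    Console.WriteLine(first.Length); // 6
}
```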

The pattern is \$ (a literal dollar sign), then \d+ (one or more digits), then (?:\.\d{2})? (optionally a dot and two digits). The non-capturing group is just for grouping, since we don't need the cents separately.

Matches: All Order IDs in a Log Line

Matches returns a MatchCollection. You can iterate it with foreach, ask for .Count, or index into it.
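A sketch over a made-up log line that contains several order IDs:

```csharp
using System;
using System.Text.RegularExpressions;

string log = "Shipped ORD-2024-00123 and ORD-2024-00124; ORD-2025-00001 pending";

MatchCollection ids = Regex.Matches(log, @"ORD-\d{4}-\d+");

Console.WriteLine(ids.Count); // 3
foreach (Match id in ids)
{
    Console.WriteLine(id.Value);
}
```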

Three matches, each one a full order ID. The pattern doesn't care what's between them; it just walks the string and reports every non-overlapping match.

Replace: Masking Sensitive Data

Masking is useful before writing logs. A row that contains a sequence that looks like a long card-style number should not go to a log file unchanged. A simple pattern catches sequences of 13 to 19 digits.
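A sketch of that masking step. The log row, the well-known 4111... test number, and the [REDACTED] placeholder are all illustrative choices:

```csharp
using System;
using System.Text.RegularExpressions;

string row = "card=4111111111111111 amount=49.99";

// \b keeps the match from starting or ending inside a longer run of word characters.
string safeRow = Regex.Replace(row, @"\b\d{13,19}\b", "[REDACTED]");

Console.WriteLine(safeRow); // card=[REDACTED] amount=49.99
```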

The replacement text is a plain string. You can also use $1, $2, and so on to refer back to capture groups in the match. For more complex transformations there's an overload that takes a MatchEvaluator delegate (a function that takes a Match and returns the replacement string).

A simpler example: normalizing whitespace in a product description. Replace any run of whitespace with a single space.
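A sketch of that normalization on an illustrative description containing spaces, tabs, and a newline:

```csharp
using System;
using System.Text.RegularExpressions;

string messy = "Soft   cotton\t\tT-shirt \n in blue";

// \s+ matches any run of whitespace; each run becomes one space.
string tidy = Regex.Replace(messy, @"\s+", " ");

Console.WriteLine(tidy); // Soft cotton T-shirt in blue
```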

Split: Tokenizing a CSV with Mixed Delimiters

Split breaks the input on every match. If the delimiters in a CSV-like row vary, regex split is cleaner than string.Split with a hard-coded set.
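A sketch that splits an illustrative row using three different separators:

```csharp
using System;
using System.Text.RegularExpressions;

string row = "SKU-1, SKU-2;SKU-3 | SKU-4";

string[] skus = Regex.Split(row, @"\s*[,;|]\s*");

Console.WriteLine(skus.Length);            // 4
Console.WriteLine(string.Join("/", skus)); // SKU-1/SKU-2/SKU-3/SKU-4
```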

The pattern \s*[,;|]\s* matches a single comma, semicolon, or pipe, with optional surrounding whitespace. Split returns the pieces between every match. (Real CSV parsing has to handle quoted fields with embedded commas; for that, use a dedicated CSV library. This pattern is fine for simple cases.)

Groups and Captures

Capturing groups let you pull out the pieces inside a match, not just the match as a whole. They're how regex turns "this matches a pattern" into "here are the parts I care about".

Numbered Groups

Every set of parentheses in a pattern, unless marked ?:, creates a numbered capture group. Group 0 is always the whole match; groups 1, 2, 3 are the parenthesized subpatterns from left to right.
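A sketch with three numbered groups over the illustrative order-ID format:

```csharp
using System;
using System.Text.RegularExpressions;

Match m = Regex.Match("ORD-2024-00123", @"(ORD)-(\d{4})-(\d+)");

Console.WriteLine(m.Groups[0].Value); // ORD-2024-00123
Console.WriteLine(m.Groups[1].Value); // ORD
Console.WriteLine(m.Groups[2].Value); // 2024
Console.WriteLine(m.Groups[3].Value); // 00123
```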

Three groups, each picking out one piece of the order ID. The whole match is at index 0; the captured pieces start at 1.

Named Groups

Numbered groups are fragile. If you add a new capturing group to the front of the pattern, every existing index shifts. Named groups fix that. Write (?<name>...) and access the capture with match.Groups["name"].
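The same order-ID match, sketched with named groups. The group names prefix, year, and seq are chosen for this example:

```csharp
using System;
using System.Text.RegularExpressions;

Match m = Regex.Match(
    "ORD-2024-00123",
    @"(?<prefix>ORD)-(?<year>\d{4})-(?<seq>\d+)");

Console.WriteLine(m.Groups["prefix"].Value); // ORD
Console.WriteLine(m.Groups["year"].Value);   // 2024
Console.WriteLine(m.Groups["seq"].Value);    // 00123
```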

For anything more than two groups, named captures are worth the extra characters. The pattern is self-documenting, and adding a new group somewhere else doesn't break the lookups.

You can also use named groups in Replace with ${name}:
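A sketch that keeps the prefix and year but hides the sequence number. The group names and the five-star mask are illustrative:

```csharp
using System;
using System.Text.RegularExpressions;

string line = "Order ORD-2024-00123 placed";

string hidden = Regex.Replace(
    line,
    @"(?<prefix>ORD)-(?<year>\d{4})-(?<seq>\d+)",
    "${prefix}-${year}-*****");

Console.WriteLine(hidden); // Order ORD-2024-***** placed
```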

The replacement string pulls the captured prefix and year back in by name and replaces the sequence with stars.

RegexOptions

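The Regex constructor takes an optional second argument: a RegexOptions value that changes how the pattern is interpreted. The static methods have overloads that accept the same options, and flags can be combined with |. The common ones: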

Option                   What it does
IgnoreCase               [a-z] matches A-Z too; literal letters match either case
Multiline                ^ and $ match at the start/end of each line, not just the input
Singleline               . matches newlines as well as everything else
Compiled                 Builds an in-memory IL representation of the regex for faster matching
CultureInvariant         Ignores the current culture when matching characters (avoids surprises like Turkish i)
ExplicitCapture          Only named groups capture; bare (...) becomes non-capturing
IgnorePatternWhitespace  Lets you add whitespace and # comments inside the pattern

A practical example using a handful of them together:
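A sketch that combines three flags with |. The lowercase pattern and the sample inputs are illustrative:

```csharp
using System;
using System.Text.RegularExpressions;

var orderLine = new Regex(
    @"order\s+ord-\d{4}-\d+",
    RegexOptions.IgnoreCase | RegexOptions.CultureInvariant | RegexOptions.Compiled);

bool upper = orderLine.IsMatch("ORDER ORD-2024-00123"); // True
bool mixed = orderLine.IsMatch("Order ord-2024-00123"); // True
bool other = orderLine.IsMatch("Invoice INV-1");        // False

Console.WriteLine($"{upper} {mixed} {other}");
```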

IgnoreCase lets order match ORDER and ord- match ORD-. CultureInvariant keeps the comparison stable across locales. Compiled is the interesting one.

When Compiled Pays Off

By default, a regex is interpreted: the pattern is parsed into an internal representation when the Regex is constructed, and the engine walks that representation on every match. RegexOptions.Compiled instead builds dynamic IL for the pattern when the Regex is constructed. After that, matches run faster, but construction takes longer.

The rule of thumb. If you'll call the regex tens of thousands of times or more, Compiled is worth it. For a regex you'll call once or twice, it costs more than it saves. The exact crossover depends on the pattern; Compiled shines on hot-path regexes in long-running processes.

Source-Generated Regex ([GeneratedRegex])

.NET 7 (which ships with C# 11) added a better way to write hot-path regexes: the [GeneratedRegex] attribute. You decorate a partial method with the attribute, and a source generator writes a full implementation at compile time. The result is typically at least as fast as RegexOptions.Compiled, works with ahead-of-time (AOT) compilation, and avoids the construction cost of Compiled because no IL is emitted at runtime.

The pattern looks like this:
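A minimal sketch, assuming a project targeting .NET 7 or later. The class name OrderPatterns and the method name OrderId are hypothetical, chosen for this example:

```csharp
using System;
using System.Text.RegularExpressions;

// Call site: use the generated method like any other static method.
Console.WriteLine(OrderPatterns.OrderId().IsMatch("ORD-2024-00123")); // True
Console.WriteLine(OrderPatterns.OrderId().IsMatch("ORD-24-1"));       // False

public static partial class OrderPatterns
{
    // The source generator supplies the body of this partial method at compile time.
    [GeneratedRegex(@"^ORD-\d{4}-\d+$", RegexOptions.CultureInvariant)]
    public static partial Regex OrderId();
}
```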

Three things to notice. The class is partial so the source generator can add to it. The method is partial with no body; the generator writes the body. The attribute carries the pattern and the options. At call sites, you use the method like any other static method.

For static patterns known at compile time, this is the form to reach for in new code. It generally matches or beats RegexOptions.Compiled on throughput, helps with AOT scenarios (where runtime IL emit isn't allowed), and keeps the pattern in one named place. The main restriction is that the pattern has to be a compile-time constant; you can't use it for patterns built at runtime from user input.

Common Pitfalls

A short tour of the failures that hit beginners (and plenty of experienced developers) the first time they use regex in anger.

Catastrophic Backtracking (ReDoS)

Some patterns can take exponential time on inputs that don't match. The classic shape is "a quantifier on top of another quantifier on top of overlapping alternatives". The pattern ^(a+)+$ looks innocuous, but on input like aaaaaaaaaaaaaab, the engine tries every way to split the as among the inner and outer + and only gives up after exploring all of them.

The defense:

  1. Avoid patterns with nested quantifiers over the same character class. Rewrite ^(a+)+$ as ^a+$.
  2. Set a timeout on the regex: new Regex(pattern, options, TimeSpan.FromMilliseconds(100)). After the timeout, the regex throws RegexMatchTimeoutException instead of hanging.
  3. Use Regex.Escape when the pattern is built from user input, so user-supplied metacharacters don't get interpreted as patterns.

A timeout example:
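A sketch of the timeout defense. Whether the timeout actually fires depends on the runtime version and the input length, so the code handles both outcomes; the 100 ms budget and the 40-character input are illustrative:

```csharp
using System;
using System.Text.RegularExpressions;

// Third constructor argument: the maximum time a single match may take.
var risky = new Regex(@"^(a+)+$", RegexOptions.None, TimeSpan.FromMilliseconds(100));

string input = new string('a', 40) + "b"; // cannot match because of the trailing b
string outcome;
try
{
    outcome = risky.IsMatch(input) ? "matched" : "no match";
}
catch (RegexMatchTimeoutException)
{
    // Treat a timeout as a rejection instead of letting the request hang.
    outcome = "timed out";
}
Console.WriteLine(outcome);
```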

This is the kind of defense you put in any code that runs user-controlled regex or matches user input against tricky patterns.

Parsing HTML or JSON

Don't. Both languages are recursive, with rules that regex cannot express. A pattern that "works" on hand-picked examples breaks on real-world inputs with comments, escaped quotes, nested structures, or unusual whitespace.

For JSON use System.Text.Json. For HTML use a library like HtmlAgilityPack or AngleSharp. For XML use XDocument or XmlReader. Regex is for flat patterns with shape, not nested grammars.

Forgetting to Escape User Input

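If the pattern incorporates a user-supplied string, that string can contain metacharacters. A search term like ($199) would quietly misbehave: ( and ) form a group and $ is an end-of-input anchor, so the pattern is valid but never matches the literal text. Other inputs, such as an unbalanced parenthesis, make the Regex constructor throw. Regex.Escape solves this by adding backslashes in front of the metacharacters.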
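A sketch of the difference between using a search term raw and escaping it first. The term and the text are illustrative:

```csharp
using System;
using System.Text.RegularExpressions;

string term = "($199)"; // imagine this came from a search box
string text = "Special price ($199) today";

bool raw     = Regex.IsMatch(text, term); // False: parsed as a group with a $ anchor
string safe  = Regex.Escape(term);        // \(\$199\)
bool escaped = Regex.IsMatch(text, safe); // True: matches the literal characters

Console.WriteLine($"{raw} {safe} {escaped}");
```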

Regex.Escape is the right call any time you're folding user-supplied text into a pattern. The escaped form matches the original characters literally and never accidentally turns into a regex feature.

Summary

  • Regex is for matching, extracting, replacing, and splitting text by shape, not by exact value. Plain string methods are better for substring, prefix, and suffix checks.
  • The Regex class lives in System.Text.RegularExpressions. Static methods (Regex.IsMatch, Regex.Match, Regex.Matches, Regex.Replace, Regex.Split) are convenient for one-offs; instance methods on a stored Regex are better for repeated use.
  • Use verbatim string literals (@"\d+") for patterns. Anchors ^ and $, character classes [...] and [^...], quantifiers ?, *, +, {n}, {n,m}, and grouping (...) / (?:...) cover most of the syntax you'll need.
  • Capturing groups pull pieces out of a match. Prefer named groups ((?<name>...)) over numbered ones for stability and readability.
  • RegexOptions flags like IgnoreCase, Multiline, Singleline, Compiled, and CultureInvariant tune how the engine interprets the pattern. Compiled trades construction time for per-call speed.
  • [GeneratedRegex] on a partial method (.NET 7+) is the modern default for static patterns. It generally matches or beats RegexOptions.Compiled, works with AOT, and emits no runtime IL.
  • Watch for catastrophic backtracking. Avoid nested quantifiers over the same characters, pass a match timeout to the Regex constructor for patterns that touch untrusted input, and run Regex.Escape over any user-supplied text mixed into a pattern. Don't use regex to parse JSON, XML, or HTML.

That closes the Strings section. You've now covered the basics, the immutability and interning behavior that explains why strings behave the way they do, StringBuilder for building long strings without thrash, interpolation and the formatting specifiers, verbatim and raw literals for keeping patterns and JSON readable, the comparison rules, and the regex pattern language. The capstone lab that follows pulls these pieces together into one end-to-end e-commerce program, so you can practice picking the right tool from the whole toolbox at once.