Regular Expressions (regexp)

Last Updated: May 17, 2026

Go's regexp package handles pattern matching, extraction, and replacement on strings using RE2 syntax. RE2 is a different kind of engine from the backtracking engines (PCRE and its relatives) used by Perl, Python, and JavaScript, and the differences matter: no lookaheads, no backreferences, and a guaranteed linear-time match in the size of the input. The payoff is that a pathological pattern can never freeze your program the way it can with a backtracking engine.

Why RE2 Instead of PCRE

PCRE-style engines use backtracking. When a pattern like (a+)+b runs against an input like aaaaaaaaaaaaaaaaaaaaaaaa!, the engine tries every possible way to split the run of a characters between the inner and outer groups before giving up. That's exponential in the input length. A 30-character input can take seconds, and a 50-character input can take hours. This failure mode is called catastrophic backtracking, and it's been the root cause of more than one production outage (Cloudflare 2019, Stack Overflow 2016).

RE2 builds an automaton from the pattern and walks the input once. Match time is O(m * n) where m is the pattern length and n is the input length. That's a hard guarantee, not a best-case. The trade-off is that RE2 can't express features that need to remember earlier match positions:

| Feature | PCRE | RE2 (Go) |
| --- | --- | --- |
| Lookahead (?=...) | Yes | No |
| Lookbehind (?<=...) | Yes | No |
| Backreferences \1 | Yes | No |
| Possessive quantifiers a++ | Yes | No |
| Named groups (?P<name>...) | Yes | Yes |
| Unicode character classes | Yes | Yes |
| Linear-time guarantee | No | Yes |

For most everyday work (validating input, extracting values, replacing patterns), this is a fine trade. When you genuinely need lookbehind, you write a small parser or use two regexes.
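One way to see the feature table in practice is to ask the compiler directly. The sketch below uses a hypothetical helper, compiles, that reports whether regexp.Compile accepts a pattern; the sample patterns are illustrative.

```go
package main

import (
	"fmt"
	"regexp"
)

// compiles is a hypothetical helper: it reports whether Go's regexp
// package accepts the pattern.
func compiles(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	fmt.Println(compiles(`(?P<year>\d{4})`)) // true: named groups are supported
	fmt.Println(compiles(`foo(?=bar)`))      // false: lookahead is rejected
	fmt.Println(compiles(`(a)\1`))           // false: backreference is rejected
	fmt.Println(compiles(`a++`))             // false: possessive quantifier is rejected
}
```

The unsupported features fail at compile time with an error, so they never reach the matching stage.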

Compiling Patterns

A pattern has to be compiled into a *regexp.Regexp before you can use it. There are two entry points:

regexp.Compile returns (*Regexp, error). Use it when the pattern is not a fixed string literal: it might come from configuration, a user, or a database row, and a bad pattern shouldn't crash the program.
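A minimal sketch of the Compile path, assuming the pattern arrives at runtime. The helper name compileUserPattern and the sample patterns are illustrative, not part of the regexp package.

```go
package main

import (
	"fmt"
	"regexp"
)

// compileUserPattern is a hypothetical helper: it compiles a pattern that
// arrived at runtime and wraps any syntax error with some context.
func compileUserPattern(pattern string) (*regexp.Regexp, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, fmt.Errorf("invalid pattern %q: %w", pattern, err)
	}
	return re, nil
}

func main() {
	// An unterminated character class is a syntax error, not a crash.
	if _, err := compileUserPattern(`[a-z`); err != nil {
		fmt.Println("rejected:", err)
	}

	re, err := compileUserPattern(`^[a-z]+$`)
	if err == nil {
		fmt.Println(re.MatchString("hello")) // true
	}
}
```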

When the pattern is a string literal that you've already verified, regexp.MustCompile is shorter:
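A short sketch of MustCompile with a literal pattern. The panics helper is hypothetical and exists only to show what happens when the pattern is bad.

```go
package main

import (
	"fmt"
	"regexp"
)

// panics is a hypothetical helper: it reports whether MustCompile panics
// for the given pattern.
func panics(pattern string) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	regexp.MustCompile(pattern)
	return false
}

func main() {
	// A known-good literal: there is no error value to check.
	digits := regexp.MustCompile(`\d+`)
	fmt.Println(digits.MatchString("order 42")) // true

	fmt.Println(panics(`[a-z`)) // true: the bad pattern panicked
}
```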

MustCompile panics on a bad pattern. That's the right choice for a package-level variable because the program can't function with a broken regex anyway, and a panic at startup is loud and obvious. The rule of thumb: MustCompile for literals, Compile for anything dynamic.

The package-level variable pattern is the idiomatic way to share a compiled regex across an application:
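A minimal sketch of the package-level pattern. The variable name emailPattern and the pattern itself are assumptions made for illustration; the pattern is deliberately simple and is not a complete email validator.

```go
package main

import (
	"fmt"
	"regexp"
)

// emailPattern is compiled once, during package initialization.
// The pattern is illustrative only, not a full RFC 5322 validator.
var emailPattern = regexp.MustCompile(`^[^@\s]+@[^@\s]+\.[^@\s]+$`)

// validEmail reuses the package-level compiled pattern on every call.
func validEmail(addr string) bool {
	return emailPattern.MatchString(addr)
}

func main() {
	fmt.Println(validEmail("ada@example.com")) // true
	fmt.Println(validEmail("not-an-email"))    // false
}
```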

The compile runs once at program startup. Every subsequent call to validEmail reuses the same automaton. If the customer signup endpoint validates a million addresses in an hour, that's one compile and a million walks of the automaton.

Notice the backticks around the pattern. Go raw string literals don't interpret backslash escapes, so `\d+` is exactly the three characters \, d, and +. With a regular double-quoted string you'd have to write "\\d+", doubling every backslash. For anything beyond the simplest patterns, raw strings are all but mandatory for readability:
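The two literal forms can be compared directly. In this sketch the constant names and the date pattern are illustrative.

```go
package main

import (
	"fmt"
	"regexp"
)

const (
	rawPattern    = `\d+`   // raw string: the backslash is kept as-is
	quotedPattern = "\\d+"  // interpreted string: the backslash must be doubled
)

// datePattern shows how quickly the doubled form would get noisy:
// the interpreted equivalent is "^\\d{4}-\\d{2}-\\d{2}$".
var datePattern = regexp.MustCompile(`^\d{4}-\d{2}-\d{2}$`)

func main() {
	fmt.Println(rawPattern == quotedPattern) // true
	fmt.Println(len(rawPattern))             // 3
	fmt.Println(datePattern.MatchString("2026-05-17")) // true
}
```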

Matching

The simplest question to ask about a string and a pattern is "does the pattern match anywhere in this string?". For that, you use MatchString:
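A minimal sketch of MatchString, assuming a hypothetical pattern that looks for a run of digits anywhere in the input.

```go
package main

import (
	"fmt"
	"regexp"
)

// hasDigits is an illustrative, unanchored pattern: it matches if a run
// of digits appears anywhere in the string.
var hasDigits = regexp.MustCompile(`\d+`)

func main() {
	fmt.Println(hasDigits.MatchString("order 42 shipped")) // true
	fmt.Println(hasDigits.MatchString("no numbers here"))  // false
}
```

Because the pattern has no ^ or $ anchors, a match anywhere in the string is enough.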

MatchString returns a bool. No allocations, no extracted text. If you just need to gate behavior on "is there a match", this is the cheapest call you can make.

The Match method on *Regexp does the same thing for []byte:
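A sketch of Match on []byte input. The product-code pattern here (three or four uppercase letters, a dash, two digits) and the sample codes are assumptions chosen for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// codePattern is a hypothetical product-code format: 3 or 4 uppercase
// letters, a dash, and two digits.
var codePattern = regexp.MustCompile(`^[A-Z]{3,4}-\d{2}$`)

// validCode checks raw bytes without converting them to a string first.
func validCode(code []byte) bool {
	return codePattern.Match(code)
}

func main() {
	for _, code := range []string{"MUG-01", "BOOK-01", "book-01", "HEADPHONES-01"} {
		fmt.Println(code, validCode([]byte(code)))
	}
}
```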

MUG is three letters, BOOK is four, both match. The lowercase book-01 fails because the character class [A-Z] is case-sensitive. HEADPHONES is ten letters, outside the {3,4} range, so it fails too.

When you have a string already, MatchString is the right choice. When the data is naturally []byte (file contents, network input), use Match to skip the conversion.

Finding Matches

Matching tells you "yes or no". Finding gives you the actual text that matched, or its position in the input.

FindString returns the first match as a string, or the empty string if there's no match:
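A small sketch of FindString. The firstNumber helper and its digit pattern are illustrative.

```go
package main

import (
	"fmt"
	"regexp"
)

var numberPattern = regexp.MustCompile(`\d+`)

// firstNumber is a hypothetical helper: it returns the first run of
// digits in s, or "" if there is none.
func firstNumber(s string) string {
	return numberPattern.FindString(s)
}

func main() {
	fmt.Printf("%q\n", firstNumber("order 42 shipped to dock 7")) // "42"
	fmt.Printf("%q\n", firstNumber("no numbers here"))            // ""
}
```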

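The "or empty string on no match" rule is important. Code that does if re.FindString(s) != "" is checking both "did it match" and "what was the match" in one step, but it can't tell "no match" apart from a successful match of the empty string (possible with a pattern like \d*). When that distinction matters, use FindStringIndex, which returns nil only on no match. If you only need a yes-or-no answer, MatchString is clearer and states the intent directly.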

FindStringIndex returns the byte offsets of the first match as []int{start, end}, or nil if there's no match:
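A sketch of FindStringIndex on input containing a multi-byte character, to show that the offsets count bytes. The helper name and the input are illustrative.

```go
package main

import (
	"fmt"
	"regexp"
)

var numberPattern = regexp.MustCompile(`\d+`)

// numberSpan is a hypothetical helper: it returns the [start, end) byte
// offsets of the first run of digits, or nil if there is none.
func numberSpan(s string) []int {
	return numberPattern.FindStringIndex(s)
}

func main() {
	s := "caf\u00e9 costs 42" // the é occupies two bytes in UTF-8
	loc := numberSpan(s)
	fmt.Println(loc)              // [12 14]
	fmt.Println(s[loc[0]:loc[1]]) // 42

	fmt.Println(numberSpan("no digits") == nil) // true
}
```

Counting runes instead of bytes would put the match at index 11, which is why the offsets should only be used to slice the original string.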

The returned indices are byte offsets, not rune offsets. For ASCII input these are the same. For input with multi-byte characters (UTF-8 encoded Unicode), one rune can span multiple bytes, but slicing with s[loc[0]:loc[1]] still works correctly: for valid UTF-8 input, the match boundaries always fall on rune boundaries.

When you want every match in the input, not just the first one, switch to the All variants. They take an extra n int argument: the maximum number of matches to return, or -1 for "all of them".

FindAllString walks the input once and collects every non-overlapping match. The n argument lets you cap the result. Pass -1 to get them all, or a positive number to limit the work.
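A minimal sketch of FindAllString with and without a cap. The numbers helper and the sample sentence are illustrative.

```go
package main

import (
	"fmt"
	"regexp"
)

var numberPattern = regexp.MustCompile(`\d+`)

// numbers is a hypothetical helper: it returns up to n runs of digits
// from s, or all of them when n is -1.
func numbers(s string, n int) []string {
	return numberPattern.FindAllString(s, n)
}

func main() {
	s := "3 mugs, 12 books, 7 pens"
	fmt.Println(numbers(s, -1)) // [3 12 7]
	fmt.Println(numbers(s, 2))  // [3 12]
}
```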

FindAllStringIndex returns positions instead of substrings:
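A sketch of FindAllStringIndex, assuming a hypothetical hashtag pattern. Each inner slice holds the start and end byte offsets of one match.

```go
package main

import (
	"fmt"
	"regexp"
)

var tagPattern = regexp.MustCompile(`#\w+`)

// tagPositions is a hypothetical helper: it returns the [start, end)
// byte offsets of every hashtag in s.
func tagPositions(s string) [][]int {
	return tagPattern.FindAllStringIndex(s, -1)
}

func main() {
	s := "new #golang and #regexp"
	for _, loc := range tagPositions(s) {
		fmt.Println(loc, s[loc[0]:loc[1]])
	}
}
```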

The positions are useful when you need context around the match, or when you want to highlight matches in the original string without losing whitespace and punctuation.

Here's a quick reference for the Match and Find* methods, which all follow the same naming shape:

| Method | Returns | What it gives you |
| --- | --- | --- |
| MatchString(s) | bool | Does the pattern match anywhere? |
| FindString(s) | string | First match, or "" on no match |
| FindStringIndex(s) | []int | [start, end] of first match, or nil |
| FindAllString(s, n) | []string | All matches up to n (or -1 for all) |
| FindAllStringIndex(s, n) | [][]int | Positions of all matches |
| FindStringSubmatch(s) | []string | First match plus each captured group |
| FindAllStringSubmatch(s, n) | [][]string | All matches plus their groups |

Submatches (Capturing Groups)

Parentheses in a pattern create capturing groups. When you call FindStringSubmatch, the returned slice has the whole match at index 0 and each captured group at indices 1, 2, 3, and so on.
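A minimal sketch of FindStringSubmatch, assuming a hypothetical product-code pattern with two capturing groups.

```go
package main

import (
	"fmt"
	"regexp"
)

// codePattern is illustrative: group 1 is the category, group 2 the number.
var codePattern = regexp.MustCompile(`([A-Z]+)-(\d+)`)

// splitCode is a hypothetical helper: it returns the whole match followed
// by each group, or nil if s contains no product code.
func splitCode(s string) []string {
	return codePattern.FindStringSubmatch(s)
}

func main() {
	m := splitCode("sku BOOK-01 in stock")
	fmt.Println(m[0], m[1], m[2]) // BOOK-01 BOOK 01

	fmt.Println(splitCode("nothing here") == nil) // true
}
```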

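m[0] is the entire match ("BOOK-01"). m[1] is the first group ("BOOK"). m[2] is the second group ("01"). Whenever there is a match, the slice has 1 + numGroups elements, with empty strings at the positions of groups that didn't participate. If the pattern doesn't match at all, FindStringSubmatch returns nil, so check for that before indexing.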

If you don't want a group to capture but still need the parentheses for grouping (alternation, applying a quantifier), use the non-capturing form (?:...):
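A sketch of a non-capturing group, assuming a hypothetical price pattern in which the cents part is optional.

```go
package main

import (
	"fmt"
	"regexp"
)

// pricePattern is illustrative: group 1 is the dollars, group 2 the cents.
// The (?:...)? wrapper makes the dot and cents optional without adding a
// third capturing slot.
var pricePattern = regexp.MustCompile(`\$(\d+)(?:\.(\d{2}))?`)

// parsePrice is a hypothetical helper returning the raw submatch slice.
func parsePrice(s string) []string {
	return pricePattern.FindStringSubmatch(s)
}

func main() {
	for _, in := range []string{"$19.99", "$5"} {
		m := parsePrice(in)
		fmt.Printf("%q dollars=%q cents=%q\n", m[0], m[1], m[2])
	}
}
```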

The \.(\d{2}) fragment is wrapped in (?:...). That makes the entire dot-and-cents fragment optional without producing an extra capturing slot for the wrapper. The numeric groups are still captured. When the cents portion is missing entirely (the second input), m[2] is an empty string, which is the standard signal for "this group didn't participate in the match".

Once a pattern has more than two or three groups, indexing into m[1], m[2], m[3] gets fragile. Reorder the pattern slightly and your indices all shift. Named groups fix that.

(?P<name>...) is RE2 syntax for a named capturing group. SubexpNames() returns a slice where index 0 corresponds to the whole match and the remaining indices correspond to each group. The whole-match slot always has an empty name, and unnamed groups also have empty names. Walking the names slice in parallel with the match values lets you build a map[string]string of named fields, which is far easier to maintain than positional indices.
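The parallel walk over names and values can be sketched as follows. The group names cat and num and the parseCode helper are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

var codePattern = regexp.MustCompile(`(?P<cat>[A-Z]+)-(?P<num>\d+)`)

// parseCode is a hypothetical helper: it returns the named groups of the
// first product code in s as a map, or nil if there is no match.
func parseCode(s string) map[string]string {
	m := codePattern.FindStringSubmatch(s)
	if m == nil {
		return nil
	}
	fields := make(map[string]string)
	for i, name := range codePattern.SubexpNames() {
		if name != "" { // skips the whole match and any unnamed groups
			fields[name] = m[i]
		}
	}
	return fields
}

func main() {
	fields := parseCode("BOOK-01")
	fmt.Println(fields["cat"], fields["num"]) // BOOK 01
}
```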

For one-off extraction, named groups can be overkill. For anything that gets reused or modified, they're worth the extra characters.

FindAllStringSubmatch is the all-matches version. It returns [][]string, one inner slice per match:
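A minimal sketch of FindAllStringSubmatch, assuming a hypothetical hashtag pattern where group 1 is the tag text without the leading #.

```go
package main

import (
	"fmt"
	"regexp"
)

var tagPattern = regexp.MustCompile(`#(\w+)`)

// hashtags is a hypothetical helper: each inner slice is
// [whole match, tag text] for one hashtag in s.
func hashtags(s string) [][]string {
	return tagPattern.FindAllStringSubmatch(s, -1)
}

func main() {
	for _, m := range hashtags("learning #golang and #regexp today") {
		fmt.Println(m[0], m[1])
	}
	// #golang golang
	// #regexp regexp
}
```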

m[0] keeps the #. m[1] has just the tag text. Both are useful: keep m[0] when you need the hashtag exactly as it appeared in the original string, use m[1] when you're storing the tag in a database that already knows it's a hashtag.

Replacing

Replacement is where regex really earns its keep. ReplaceAllString substitutes every match in the source with a replacement string, and the replacement can reference captured groups using $1, $2, and so on.
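A sketch of ReplaceAllString with a numbered group reference, assuming a hypothetical price pattern that captures dollars and cents separately.

```go
package main

import (
	"fmt"
	"regexp"
)

// pricePattern is illustrative: group 1 is the dollars, group 2 the cents.
var pricePattern = regexp.MustCompile(`\$(\d+)\.(\d{2})`)

// wholeDollars is a hypothetical helper: it rewrites every price to its
// dollar part only. In the replacement, $$ is a literal dollar sign and
// $1 is the first captured group.
func wholeDollars(s string) string {
	return pricePattern.ReplaceAllString(s, `$$$1`)
}

func main() {
	fmt.Println(wholeDollars("Mug $12.50, book $19.99")) // Mug $12, book $19
}
```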

Two things to notice. The replacement $$$1 produces a literal $ followed by the contents of capture group 1. The leading $$ is the escape for a single literal $, and $1 is the group reference. Without the escape, $1 alone would emit the digits with no dollar sign, and $$1 would emit the literal text $1. The other thing is that the cents are dropped: the pattern captured the dollar amount and the cents separately, and the replacement only references the dollars.

For named groups, use ${name}:
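A minimal sketch of named references in a replacement. The group names cat and num and the sample input are assumptions.

```go
package main

import (
	"fmt"
	"regexp"
)

var codePattern = regexp.MustCompile(`(?P<cat>[A-Z]+)-(?P<num>\d+)`)

// slashCodes is a hypothetical helper: it rewrites CAT-NN as CAT/NN.
func slashCodes(s string) string {
	return codePattern.ReplaceAllString(s, `${cat}/${num}`)
}

func main() {
	fmt.Println(slashCodes("MUG-01 and BOOK-02")) // MUG/01 and BOOK/02
}
```

The braces matter: without them, a reference followed directly by letters, digits, or an underscore is read as one longer name.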

The dash became a slash for every product code. The ${cat} and ${num} references make the intent obvious without forcing the reader to count parentheses.

Sometimes you want the replacement to be treated as a literal string with no $ substitution. That's what ReplaceAllLiteralString does:
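A sketch contrasting the two calls, assuming a hypothetical {price} placeholder and a replacement value that happens to contain a dollar sign.

```go
package main

import (
	"fmt"
	"regexp"
)

// placeholder is an illustrative template marker.
var placeholder = regexp.MustCompile(`\{price\}`)

// fillLiteral inserts value exactly as given.
func fillLiteral(tmpl, value string) string {
	return placeholder.ReplaceAllLiteralString(tmpl, value)
}

// fillExpanded treats $ in value as the start of a group reference.
func fillExpanded(tmpl, value string) string {
	return placeholder.ReplaceAllString(tmpl, value)
}

func main() {
	fmt.Println(fillLiteral("Total: {price}", "$19.99"))  // Total: $19.99
	fmt.Println(fillExpanded("Total: {price}", "$19.99")) // Total: .99
}
```

In the second call, $19 is read as a reference to group 19, which doesn't exist, so it expands to nothing.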

The replacement literal contains a $, but ReplaceAllLiteralString doesn't try to interpret it as a group reference. Use this when the replacement comes from a variable you don't control, or when you genuinely want a $ in the output.

For dynamic replacements (where each match becomes something different), use ReplaceAllStringFunc:
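A minimal sketch of ReplaceAllStringFunc, assuming the goal is to lowercase every hashtag. The pattern and helper name are illustrative.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var tagPattern = regexp.MustCompile(`#\w+`)

// normalizeTags is a hypothetical helper: it lowercases each hashtag and
// leaves the rest of the text untouched.
func normalizeTags(s string) string {
	return tagPattern.ReplaceAllStringFunc(s, strings.ToLower)
}

func main() {
	fmt.Println(normalizeTags("Loving #GoLang and #REGEXP")) // Loving #golang and #regexp
}
```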

The callback receives the matched substring and returns the replacement. This is the right tool when the replacement depends on the match content, like normalizing hashtag casing, encoding URLs, or looking up values from a map.

Redacting credit-card-shaped numbers in a log line is a one-liner with ReplaceAllStringFunc:
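A sketch of the redaction callback. The log line and the card number are made-up sample data, and the pattern is a simplification rather than a production-grade detector.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// cardLike matches a standalone run of 13 to 16 digits.
var cardLike = regexp.MustCompile(`\b\d{13,16}\b`)

// redact is a hypothetical helper: it masks all but the last four digits
// of every card-shaped number in line.
func redact(line string) string {
	return cardLike.ReplaceAllStringFunc(line, func(m string) string {
		return strings.Repeat("*", len(m)-4) + m[len(m)-4:]
	})
}

func main() {
	fmt.Println(redact("user 12345 paid with card 4111111111111111"))
	// user 12345 paid with card ************1111
}
```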

The callback keeps the last four digits and masks the rest. The pattern is \b\d{13,16}\b: the {13,16} quantifier skips the user ID 12345 (only five digits), and the \b word boundaries keep it from matching a digit run that happens to be embedded inside a larger token. The exact pattern for production redaction would be more nuanced, but the structure is the same.

Splitting

strings.Split only takes a fixed string separator. When the separator is a pattern (one or more spaces, comma-or-semicolon, optional whitespace around a delimiter), reach for re.Split instead.
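A minimal sketch of Split with a pattern separator, assuming a comma-separated list with inconsistent spacing.

```go
package main

import (
	"fmt"
	"regexp"
)

// commaSep matches a comma plus any whitespace around it.
var commaSep = regexp.MustCompile(`\s*,\s*`)

// splitList is a hypothetical helper: it splits s on commas and drops the
// whitespace surrounding each comma.
func splitList(s string) []string {
	return commaSep.Split(s, -1)
}

func main() {
	fmt.Printf("%q\n", splitList("mug ,  book,pen , lamp"))
	// ["mug" "book" "pen" "lamp"]
}
```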

The pattern \s*,\s* matches a comma with any amount of whitespace on either side. Split returns the parts between separators, so the whitespace around each comma is already gone (whitespace at the very start or end of the input is not touched). Doing this with strings.Split followed by strings.TrimSpace in a loop works, but it takes two passes, and a single regex expresses the intent more clearly.

The n argument limits the number of substrings returned. -1 means "all of them"; a positive number caps the result and leaves the remainder unsplit in the last element:
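A sketch of the capped form, assuming a whitespace separator and a made-up input line.

```go
package main

import (
	"fmt"
	"regexp"
)

var spaces = regexp.MustCompile(`\s+`)

// splitFirst is a hypothetical helper: it splits s at the first run of
// whitespace only, leaving the remainder intact in the second element.
func splitFirst(s string) []string {
	return spaces.Split(s, 2)
}

func main() {
	fmt.Printf("%q\n", splitFirst("BOOK 01 paperback fiction"))
	// ["BOOK" "01 paperback fiction"]
}
```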

With n = 2, the first split happens at the first whitespace, producing "BOOK" and "01 paperback fiction". The remaining whitespace inside the second part is preserved because no further splits were requested.

RE2 Syntax Basics

This isn't a regex tutorial, but a quick reference for the syntax you'll use 90% of the time:

| Category | Syntax | Meaning |
| --- | --- | --- |
| Character classes | [a-z] | Any lowercase letter |
| | [^0-9] | Any character that isn't a digit |
| | \d \D | Digit / non-digit |
| | \w \W | Word character ([A-Za-z0-9_]) / non-word |
| | \s \S | Whitespace / non-whitespace |
| | . | Any character except newline |
| Quantifiers | * | Zero or more |
| | + | One or more |
| | ? | Zero or one |
| | {n} | Exactly n |
| | {n,m} | Between n and m, inclusive |
| | *? +? ?? | Non-greedy variants |
| Anchors | ^ | Start of input (or line in multiline mode) |
| | $ | End of input (or line in multiline mode) |
| | \b | Word boundary |
| | \B | Non-word boundary |
| Groups | (...) | Capturing group |
| | (?:...) | Non-capturing group |
| | (?P<name>...) | Named capturing group |
| Alternation | a\|b | a or b |
| Flags | (?i) | Case-insensitive |
| | (?s) | Dot matches newline |
| | (?m) | ^ and $ match line boundaries |

The greedy vs non-greedy distinction trips up most beginners. \d+ matches as many digits as possible, then backs off if the rest of the pattern fails. \d+? matches as few digits as possible and grows only if forced. For most patterns, greedy is what you want. Non-greedy comes up when you're extracting content between delimiters and don't want the match to swallow multiple delimiter pairs.
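The delimiter case can be sketched with a pair of hypothetical tag patterns that differ only in the quantifier.

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	greedyTag = regexp.MustCompile(`<.+>`)  // takes as much as it can
	lazyTag   = regexp.MustCompile(`<.+?>`) // takes as little as it can
)

func main() {
	s := "<b>bold</b>"
	fmt.Println(greedyTag.FindString(s)) // <b>bold</b>
	fmt.Println(lazyTag.FindString(s))   // <b>
}
```

The greedy form swallows both tags and the text between them; the non-greedy form stops at the first closing bracket.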

The inline flag form (?i) is handy when the pattern is a string literal and you don't want to rebuild it. Putting (?i) at the start makes the whole pattern case-insensitive:
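A minimal sketch of the inline flag, assuming a hypothetical product-code pattern that should accept any letter casing.

```go
package main

import (
	"fmt"
	"regexp"
)

// bookCode is illustrative: (?i) makes the literal "book" case-insensitive.
var bookCode = regexp.MustCompile(`(?i)^book-\d+$`)

func main() {
	fmt.Println(bookCode.MatchString("BOOK-01")) // true
	fmt.Println(bookCode.MatchString("Book-01")) // true
	fmt.Println(bookCode.MatchString("mug-01"))  // false
}
```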

The flag applies from where it appears to the end of the enclosing group (the whole pattern, when it sits at the top level), so put it at the very start when you want the whole pattern to be case-insensitive.

The typical flow is simple. A pattern string is compiled once into a *Regexp. From there, the same compiled value drives every match, find, submatch, replace, or split call. The compile step is the one you don't want to repeat.

Performance and When Not to Use Regex

The compile-once rule is the single biggest performance lever. Here's the anti-pattern to avoid:
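A sketch of the anti-pattern, assuming a loop that validates a list of addresses. The function name and the simplified email pattern are illustrative.

```go
package main

import (
	"fmt"
	"regexp"
)

// countValidSlow shows the anti-pattern: the same pattern is recompiled
// on every iteration of the loop.
func countValidSlow(addrs []string) int {
	count := 0
	for _, addr := range addrs {
		re := regexp.MustCompile(`^[^@\s]+@[^@\s]+\.[^@\s]+$`) // wasteful
		if re.MatchString(addr) {
			count++
		}
	}
	return count
}

func main() {
	addrs := []string{"ada@example.com", "bad", "bob@example.org"}
	fmt.Println(countValidSlow(addrs)) // 2
}
```

The result is correct; only the repeated compile is the problem.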

Every iteration compiles the pattern from scratch. For 1,000 addresses, that's 1,000 compiles when one would do. The fix is moving the compile out:
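A sketch of the hoisted version, assuming the same illustrative email pattern. The compile happens once at package initialization and the loop only matches.

```go
package main

import (
	"fmt"
	"regexp"
)

// emailPattern is compiled once; the pattern is a simplified illustration.
var emailPattern = regexp.MustCompile(`^[^@\s]+@[^@\s]+\.[^@\s]+$`)

// countValid reuses the compiled pattern for every address.
func countValid(addrs []string) int {
	count := 0
	for _, addr := range addrs {
		if emailPattern.MatchString(addr) {
			count++
		}
	}
	return count
}

func main() {
	addrs := []string{"ada@example.com", "bad", "bob@example.org"}
	fmt.Println(countValid(addrs)) // 2
}
```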

For a typical email pattern, compiling costs on the order of microseconds, while a single match costs on the order of hundreds of nanoseconds. Hoisting the compile out of the loop can therefore be a 50x to 1000x speedup, depending on the pattern and input size; a benchmark on your own data gives the real number. This is the kind of optimization that pays for itself the moment you write it.

A more subtle anti-pattern: using regex when a string function suffices.
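A sketch of the two approaches side by side, assuming a prefix check for http://. Both helper names are illustrative.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var httpPrefix = regexp.MustCompile(`^http://`)

// isHTTPRegex works, but it brings in a regex for a fixed-prefix check.
func isHTTPRegex(url string) bool {
	return httpPrefix.MatchString(url)
}

// isHTTP does the same job with a plain string comparison.
func isHTTP(url string) bool {
	return strings.HasPrefix(url, "http://")
}

func main() {
	for _, u := range []string{"http://example.com", "ftp://example.com"} {
		fmt.Println(u, isHTTPRegex(u), isHTTP(u))
	}
}
```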

strings.HasPrefix is a direct byte comparison. There's no automaton, no allocations, no compile step. For prefix or suffix checks, fixed-substring search, or splitting on a fixed string, the strings package is faster and easier to read. Regex only earns its complexity budget when the pattern has actual structure.

Here's a quick rule for when regex is the wrong tool:

| Task | Use this instead |
| --- | --- |
| Does this string start with http://? | strings.HasPrefix |
| Does this string contain @? | strings.Contains |
| Split on a single comma | strings.Split |
| Lowercase the string | strings.ToLower |
| Replace one fixed substring with another | strings.ReplaceAll |
| Count occurrences of a fixed substring | strings.Count |
| Trim spaces from both ends | strings.TrimSpace |
| Parse HTML | golang.org/x/net/html |
| Parse JSON | encoding/json |

Regex isn't the right tool for HTML or any other recursive grammar. HTML allows nested tags, and RE2 (correctly) can't express balanced-bracket matching. Trying to scrape HTML with regex is one of those tasks that works for the simple cases in your test data and breaks the moment real-world HTML shows up. Use a proper parser.

Likewise for JSON, CSV, or any structured format with its own library. Regex on those formats works for hand-crafted inputs and fails on edge cases that the dedicated parser handles correctly.

Summary

  • Go uses RE2 syntax, which guarantees O(m * n) match time but doesn't support lookahead, lookbehind, or backreferences.
  • Compile patterns once with regexp.MustCompile for string literals or regexp.Compile for dynamic patterns. Reuse the resulting *Regexp across calls.
  • Always use raw string literals (backticks) for patterns to avoid double-escaping every backslash.
  • MatchString is the cheapest call: it returns a bool. Use it when you only need to know if there's a match.
  • FindString returns the first match or "" on no match. FindStringIndex returns positions or nil. The All variants take an n int cap and return up to n matches, or every match when n is -1.
  • FindStringSubmatch returns the whole match plus captured groups. Combine with SubexpNames() for named groups, which keep patterns self-documenting.
  • ReplaceAllString supports $1 and ${name} substitution. ReplaceAllLiteralString skips substitution. ReplaceAllStringFunc lets each match drive its own replacement.
  • re.Split is the regex-aware version of strings.Split for separators that have structure. For fixed delimiters, strings.Split is faster.
  • Don't reach for regex when a strings function works. Fixed-substring search, prefix checks, and simple replacements are all faster and clearer with the strings package.
  • The *Regexp value is safe for concurrent use across goroutines, so a single compiled pattern can serve every part of a multi-goroutine program.

In the next lesson, we'll bring all nine chapters together with a capstone lab.