Go's regexp package handles pattern matching, extraction, and replacement on strings using RE2 syntax. RE2 is a different regex engine from the PCRE family that powers Perl, Python, and JavaScript, and the differences matter: no lookaheads, no backreferences, and a guaranteed linear-time match in the size of the input. The payoff is that a pathological pattern can never freeze your program the way it can with PCRE engines.

Why RE2 Instead of PCRE

PCRE-style engines use backtracking. When a pattern like (a+)+b runs against an input like aaaaaaaaaaaaaaaaaaaaaaaa!, the engine tries every possible way to split the as between the inner and outer groups before giving up. That's exponential in the input length. A 30-character input can take seconds, and a 50-character input can take hours. This failure mode is called catastrophic backtracking, and it's been the root cause of more than one production outage (Cloudflare 2019, Stack Overflow 2016).

RE2 builds an automaton from the pattern and walks the input once. Match time is O(m * n) where m is the pattern length and n is the input length. That's a hard guarantee, not a best-case. The trade-off is that RE2 can't express features that need to remember earlier match positions:

Feature	PCRE	RE2 (Go)
Lookahead `(?=...)`	Yes	No
Lookbehind `(?<=...)`	Yes	No
Backreferences `\1`	Yes	No
Possessive quantifiers `a++`	Yes	No
Named groups `(?P<name>...)`	Yes	Yes
Unicode character classes	Yes	Yes
Linear-time guarantee	No	Yes

For 95% of real work (validating input, extracting values, replacing patterns), this is a fine trade. When you genuinely need lookbehind, you write a small parser or use two regexes.

Cost: RE2 trades a small amount of expressiveness for a strong runtime guarantee. If your input is untrusted (user-supplied patterns, log lines from the public internet), this is exactly the trade you want.

Compiling Patterns

A pattern has to be compiled into a *regexp.Regexp before you can use it. There are two entry points:

regexp.Compile returns (*Regexp, error). Use it when the pattern is not a fixed string literal: it might come from configuration, a user, or a database row, and a bad pattern shouldn't crash the program.
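A minimal sketch of that error-handling path. The helper name compileUserPattern and the sample patterns are illustrative assumptions, not part of the regexp package:

```go
package main

import (
	"fmt"
	"regexp"
)

// compileUserPattern is a hypothetical helper: it compiles a pattern that
// came from outside the program and reports a bad one as an ordinary error.
func compileUserPattern(pattern string) (*regexp.Regexp, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, fmt.Errorf("invalid pattern %q: %w", pattern, err)
	}
	return re, nil
}

func main() {
	// The character class is never closed, so Compile returns an error.
	if _, err := compileUserPattern(`[A-Z+`); err != nil {
		fmt.Println("rejected:", err)
	}

	if re, err := compileUserPattern(`[A-Z]+`); err == nil {
		fmt.Println(re.MatchString("BOOK")) // true
	}
}
```

The caller decides what a bad pattern means: reject the config, return a 400, or fall back to a default.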

When the pattern is a string literal that you've already verified, regexp.MustCompile is shorter:
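As a sketch, assuming an illustrative product-code pattern (the names productCode and compilePanics are made up for this example):

```go
package main

import (
	"fmt"
	"regexp"
)

// productCode is an illustrative pattern: 3-4 capitals, a dash, two digits.
var productCode = regexp.MustCompile(`^[A-Z]{3,4}-\d{2}$`)

// compilePanics reports whether MustCompile panics for the given pattern.
// It exists only to make the panic observable in a demo.
func compilePanics(pattern string) (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	regexp.MustCompile(pattern)
	return false
}

func main() {
	fmt.Println(productCode.MatchString("BOOK-01")) // true
	fmt.Println(compilePanics(`(unclosed`))         // true
}
```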

MustCompile panics on a bad pattern. That's the right choice for a package-level variable because the program can't function with a broken regex anyway, and a panic at startup is loud and obvious. The rule of thumb: MustCompile for literals, Compile for anything dynamic.

Cost: Compiling a pattern is expensive: it parses the regex and builds an automaton. Do it once and reuse the *Regexp. Compiling inside a loop is the most common performance mistake with this package.

The package-level variable pattern is the idiomatic way to share a compiled regex across an application:
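A sketch of that shape. The email pattern here is deliberately simplified and is an assumption for illustration; real address validation is more involved:

```go
package main

import (
	"fmt"
	"regexp"
)

// emailPattern is compiled once, when the package is initialized.
// The pattern is a simplified stand-in, not a full address grammar.
var emailPattern = regexp.MustCompile(`^[^@\s]+@[^@\s]+\.[^@\s]+$`)

// validEmail reuses the package-level regex on every call.
func validEmail(addr string) bool {
	return emailPattern.MatchString(addr)
}

func main() {
	fmt.Println(validEmail("ada@example.com")) // true
	fmt.Println(validEmail("not-an-email"))    // false
}
```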

The compile runs once at program startup. Every subsequent call to validEmail reuses the same automaton. If the customer signup endpoint validates a million addresses in an hour, that's one compile and a million walks of the automaton.

Notice the backticks around the pattern. Go raw string literals don't interpret backslash escapes, so `\d+` is exactly the three characters \d+. With a regular double-quoted string you'd have to write "\\d+", doubling every backslash. For anything beyond the simplest patterns, raw strings are mandatory for readability:

Cost: Use raw string literals (backticks) for every regex pattern. Double-escaping in "..." strings is a constant source of bugs and slows down review.

Matching

The simplest question to ask about a string and a pattern is "does the pattern match anywhere in this string?". For that, you use MatchString:
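A minimal sketch, assuming a made-up product-code pattern and helper name. The pattern is unanchored, so it matches anywhere in the string:

```go
package main

import (
	"fmt"
	"regexp"
)

// codeAnywhere has no ^ or $, so it can match in the middle of a string.
var codeAnywhere = regexp.MustCompile(`[A-Z]{3,4}-\d{2}`)

// mentionsProduct is a hypothetical helper that gates on "is there a match".
func mentionsProduct(s string) bool {
	return codeAnywhere.MatchString(s)
}

func main() {
	fmt.Println(mentionsProduct("please ship BOOK-01 today")) // true
	fmt.Println(mentionsProduct("no codes here"))             // false
}
```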

MatchString returns a bool. No allocations, no extracted text. If you just need to gate behavior on "is there a match", this is the cheapest call you can make.

The Match method on *Regexp does the same thing for []byte:
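A sketch with four sample product codes held as []byte, as if they had been read from a file or socket. The anchored pattern and the validCode helper are assumptions for illustration:

```go
package main

import (
	"fmt"
	"regexp"
)

// productCode is anchored with ^ and $ so the whole input must fit the shape.
var productCode = regexp.MustCompile(`^[A-Z]{3,4}-\d{2}$`)

// validCode checks a byte slice directly, with no string conversion.
func validCode(b []byte) bool {
	return productCode.Match(b)
}

func main() {
	inputs := [][]byte{
		[]byte("MUG-01"),
		[]byte("BOOK-01"),
		[]byte("book-01"),
		[]byte("HEADPHONES-01"),
	}
	for _, in := range inputs {
		fmt.Printf("%s %v\n", in, validCode(in))
	}
}
```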

MUG is three letters, BOOK is four, both match. The lowercase book-01 fails because the character class [A-Z] is case-sensitive. HEADPHONES is ten letters, outside the {3,4} range, so it fails too.

When you have a string already, MatchString is the right choice. When the data is naturally []byte (file contents, network input), use Match to skip the conversion.

Cost: Don't reach for a regex when a simpler string operation works. strings.Contains(s, "@") is roughly 50x faster than regexp.MatchString("@", s) because there's no pattern to compile and no automaton to walk (the package-level regexp.MatchString compiles its pattern on every call). Regex shines when the pattern has structure, not when you're searching for a fixed substring.

Finding Matches

Matching tells you "yes or no". Finding gives you the actual text that matched, or its position in the input.

FindString returns the first match as a string, or the empty string if there's no match:
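For example, with an assumed price pattern and helper name:

```go
package main

import (
	"fmt"
	"regexp"
)

// price matches a dollar sign, digits, a dot, and exactly two digits.
var price = regexp.MustCompile(`\$\d+\.\d{2}`)

// firstPrice returns the first price-shaped substring, or "" if none.
func firstPrice(s string) string {
	return price.FindString(s)
}

func main() {
	fmt.Printf("%q\n", firstPrice("mug $12.50, book $8.99")) // "$12.50"
	fmt.Printf("%q\n", firstPrice("no prices here"))         // ""
}
```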

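The "or empty string on no match" rule is important. Code that does if re.FindString(s) != "" is checking both "did it match" and "what was the match" in one step. The check is only reliable when the pattern can't match the empty string: a pattern such as \d* can succeed with an empty match, which also returns "". If you only need a yes-or-no answer, MatchString is clearer; if you need to tell an empty match from no match, use FindStringIndex, which returns nil only when nothing matched.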

FindStringIndex returns the byte offsets of the first match as []int{start, end}, or nil if there's no match:
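A sketch using an input with one two-byte character, so the byte offsets visibly differ from character positions. The pattern and helper name are assumptions:

```go
package main

import (
	"fmt"
	"regexp"
)

var price = regexp.MustCompile(`\$\d+\.\d{2}`)

// firstPriceLoc returns [start, end] byte offsets, or nil on no match.
func firstPriceLoc(s string) []int {
	return price.FindStringIndex(s)
}

func main() {
	// \u00e9 is the letter e with an acute accent: two bytes in UTF-8.
	s := "caf\u00e9 costs $4.50"

	loc := firstPriceLoc(s)
	fmt.Println(loc)               // [12 17] (the $ is the 12th character, index 11 by rune)
	fmt.Println(s[loc[0]:loc[1]])  // $4.50
	fmt.Println(firstPriceLoc("")) // []
}
```

Always check for nil before indexing into the result; indexing a nil slice panics.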

The returned indices are byte offsets, not rune offsets. For ASCII input these are the same. For input with multi-byte characters (UTF-8 encoded Unicode), one rune can span multiple bytes, so the two diverge. Slicing the input with loc[0]:loc[1] still works correctly, because Go slices strings by byte and the regex engine guarantees that match boundaries are aligned with rune boundaries.

When you want every match in the input, not just the first one, switch to the All variants. They take an extra n int argument: the maximum number of matches to return, or -1 for "all of them".
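A sketch with an assumed hashtag pattern, showing both the uncapped and capped forms:

```go
package main

import (
	"fmt"
	"regexp"
)

var hashtag = regexp.MustCompile(`#\w+`)

// tags returns up to n hashtags from s; n = -1 means no limit.
func tags(s string, n int) []string {
	return hashtag.FindAllString(s, n)
}

func main() {
	s := "#go is fun #regexp #re2"
	fmt.Println(tags(s, -1)) // [#go #regexp #re2]
	fmt.Println(tags(s, 2))  // [#go #regexp]
}
```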

FindAllString walks the input once and collects every non-overlapping match. The n argument lets you cap the result. Pass -1 to get them all, or a positive number to limit the work.

Cost: FindAllString allocates a slice holding one string per match. If you only need the count of matches, take the length of the slice returned by FindAllStringIndex, which avoids materializing each match as a string.

FindAllStringIndex returns positions instead of substrings:
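A sketch that prints each position alongside the text it covers, plus a counting helper. The names are illustrative:

```go
package main

import (
	"fmt"
	"regexp"
)

var hashtag = regexp.MustCompile(`#\w+`)

// countTags counts matches without collecting the matched text.
func countTags(s string) int {
	return len(hashtag.FindAllStringIndex(s, -1))
}

func main() {
	s := "#go is fun #regexp"
	for _, loc := range hashtag.FindAllStringIndex(s, -1) {
		fmt.Println(loc, s[loc[0]:loc[1]])
	}
	// [0 3] #go
	// [11 18] #regexp

	fmt.Println(countTags(s)) // 2
}
```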

The positions are useful when you need context around the match, or when you want to highlight matches in the original string without losing whitespace and punctuation.

Here's a quick reference for MatchString and the Find* family, which all follow the same naming shape:

Method	Returns	What it gives you
`MatchString(s)`	`bool`	Does the pattern match anywhere?
`FindString(s)`	`string`	First match, or `""` on no match
`FindStringIndex(s)`	`[]int`	`[start, end]` of first match, or `nil`
`FindAllString(s, n)`	`[]string`	All matches up to `n` (or `-1` for all)
`FindAllStringIndex(s, n)`	`[][]int`	Positions of all matches
`FindStringSubmatch(s)`	`[]string`	First match plus each captured group
`FindAllStringSubmatch(s, n)`	`[][]string`	All matches plus their groups

Submatches (Capturing Groups)

Parentheses in a pattern create capturing groups. When you call FindStringSubmatch, the returned slice has the whole match at index 0 and each captured group at indices 1, 2, 3, and so on.
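A minimal sketch with an assumed two-group product-code pattern:

```go
package main

import (
	"fmt"
	"regexp"
)

// Group 1 is the category letters, group 2 is the number.
var productCode = regexp.MustCompile(`([A-Z]+)-(\d+)`)

// splitCode returns the submatch slice, or nil when nothing matches.
func splitCode(s string) []string {
	return productCode.FindStringSubmatch(s)
}

func main() {
	m := splitCode("BOOK-01")
	fmt.Printf("%q\n", m) // ["BOOK-01" "BOOK" "01"]
}
```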

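m[0] is the entire match ("BOOK-01"). m[1] is the first group ("BOOK"). m[2] is the second group ("01"). When the pattern matches, the slice always has 1 + numGroups elements, with empty strings at positions for groups that didn't match. When it doesn't match at all, FindStringSubmatch returns nil, so check for that before indexing.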

If you don't want a group to capture but still need the parentheses for grouping (alternation, applying a quantifier), use the non-capturing form (?:...):
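A sketch assuming a price pattern with optional cents. The first input has cents; the second does not:

```go
package main

import (
	"fmt"
	"regexp"
)

// Group 1: dollars. Group 2: cents. The (?:...)? wrapper makes the
// ".cc" part optional without adding a third capturing group.
var price = regexp.MustCompile(`\$(\d+)(?:\.(\d{2}))?`)

// parts returns [whole, dollars, cents], or nil on no match.
func parts(s string) []string {
	return price.FindStringSubmatch(s)
}

func main() {
	fmt.Printf("%q\n", parts("$12.50")) // ["$12.50" "12" "50"]
	fmt.Printf("%q\n", parts("$8"))     // ["$8" "8" ""]
}
```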

The \.(\d{2}) fragment is wrapped in (?:...)?. That makes the whole cents portion optional without adding a capturing slot for the wrapper group. The numeric groups are still captured. When the cents portion is missing entirely (the second input), m[2] is an empty string, which is the standard signal for "this group didn't participate in the match".

Once a pattern has more than two or three groups, indexing into m[1], m[2], m[3] gets fragile. Reorder the pattern slightly and your indices all shift. Named groups fix that.
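A sketch of the named-group approach. The group names cat and num and the helper namedFields are assumptions for illustration:

```go
package main

import (
	"fmt"
	"regexp"
)

var productCode = regexp.MustCompile(`(?P<cat>[A-Z]+)-(?P<num>\d+)`)

// namedFields pairs each named group with the text it captured.
// It returns nil when the pattern does not match.
func namedFields(re *regexp.Regexp, s string) map[string]string {
	m := re.FindStringSubmatch(s)
	if m == nil {
		return nil
	}
	fields := make(map[string]string)
	for i, name := range re.SubexpNames() {
		if name != "" { // skip the whole match and any unnamed groups
			fields[name] = m[i]
		}
	}
	return fields
}

func main() {
	f := namedFields(productCode, "BOOK-01")
	fmt.Println(f["cat"], f["num"]) // BOOK 01
}
```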

(?P<name>...) is RE2 syntax for a named capturing group. SubexpNames() returns a slice where index 0 corresponds to the whole match and the remaining indices correspond to each group. The whole-match slot always has an empty name, and unnamed groups also have empty names. Walking the names slice in parallel with the match values lets you build a map[string]string of named fields, which is far easier to maintain than positional indices.

For one-off extraction, named groups can be overkill. For anything that gets reused or modified, they're worth the extra characters.

FindAllStringSubmatch is the all-matches version. It returns [][]string, one inner slice per match:
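A sketch that pulls hashtags out of a post, assuming a pattern with one group around the tag text:

```go
package main

import (
	"fmt"
	"regexp"
)

// Group 1 captures the tag text without the leading #.
var hashtag = regexp.MustCompile(`#(\w+)`)

// tagNames returns just the captured tag text for every hashtag in s.
func tagNames(s string) []string {
	var names []string
	for _, m := range hashtag.FindAllStringSubmatch(s, -1) {
		names = append(names, m[1])
	}
	return names
}

func main() {
	for _, m := range hashtag.FindAllStringSubmatch("#go and #regexp", -1) {
		fmt.Printf("%q %q\n", m[0], m[1])
	}
	// "#go" "go"
	// "#regexp" "regexp"

	fmt.Println(tagNames("#go and #regexp")) // [go regexp]
}
```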

m[0] keeps the #. m[1] has just the tag text. Both are useful: keep m[0] when you need to know where the hashtag started in the original string, use m[1] when you're storing the tag in a database that already knows it's a hashtag.

Replacing

Replacement is where regex really earns its keep. ReplaceAllString substitutes every match in the source with a replacement string, and the replacement can reference captured groups using $1, $2, and so on.
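A sketch that rounds prices down to whole dollars, assuming a pattern with separate dollar and cents groups. The helper name is made up:

```go
package main

import (
	"fmt"
	"regexp"
)

// Group 1: dollars. Group 2: cents.
var price = regexp.MustCompile(`\$(\d+)\.(\d{2})`)

// wholeDollars rewrites each price as a literal $ plus group 1.
func wholeDollars(s string) string {
	return price.ReplaceAllString(s, "$$$1")
}

func main() {
	s := "mug $12.50, book $8.99"
	fmt.Println(wholeDollars(s)) // mug $12, book $8

	// With only two dollar signs, "$$" is an escaped $ and "1" is plain text.
	fmt.Println(price.ReplaceAllString(s, "$$1")) // mug $1, book $1
}
```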

Two things to notice. The replacement $$$1 produces a literal $ followed by the contents of capture group 1. The leading $$ is the escape for a single literal $, and the trailing $1 is the group reference. Drop one dollar sign and $$1 would produce the literal text $1 instead of the captured digits. The other thing is that the cents are dropped: the pattern captured the dollar amount and the cents separately, and the replacement only references the dollars.

For named groups, use ${name}:
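A sketch assuming groups named cat and num, matching the references discussed here:

```go
package main

import (
	"fmt"
	"regexp"
)

var productCode = regexp.MustCompile(`(?P<cat>[A-Z]+)-(?P<num>\d+)`)

// slashCodes rewrites CAT-NN as CAT/NN using named references.
func slashCodes(s string) string {
	return productCode.ReplaceAllString(s, "${cat}/${num}")
}

func main() {
	fmt.Println(slashCodes("BOOK-01 and MUG-02")) // BOOK/01 and MUG/02
}
```

The braces matter: without them, a reference followed directly by letters, digits, or an underscore is read as one longer name.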

The dash became a slash for every product code. The ${cat} and ${num} references make the intent obvious without forcing the reader to count parentheses.

Sometimes you want the replacement to be treated as a literal string with no $ substitution. That's what ReplaceAllLiteralString does:
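A side-by-side sketch, assuming a PRICE placeholder token and made-up helper names. The pattern has no groups, so $1 expands to nothing in the non-literal version:

```go
package main

import (
	"fmt"
	"regexp"
)

var placeholder = regexp.MustCompile(`\bPRICE\b`)

// fillLiteral inserts the replacement exactly as written.
func fillLiteral(s string) string {
	return placeholder.ReplaceAllLiteralString(s, "$1.00")
}

// fillExpanded treats $1 as a group reference; group 1 does not exist,
// so it expands to the empty string.
func fillExpanded(s string) string {
	return placeholder.ReplaceAllString(s, "$1.00")
}

func main() {
	fmt.Println(fillLiteral("cost: PRICE"))  // cost: $1.00
	fmt.Println(fillExpanded("cost: PRICE")) // cost: .00
}
```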

The replacement literal contains a $, but ReplaceAllLiteralString doesn't try to interpret it as a group reference. Use this when the replacement comes from a variable you don't control, or when you genuinely want a $ in the output.

For dynamic replacements (where each match becomes something different), use ReplaceAllStringFunc:
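A sketch that lowercases every hashtag, with an assumed pattern and helper name:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var hashtag = regexp.MustCompile(`#\w+`)

// normalizeTags lowercases each hashtag and leaves other text alone.
func normalizeTags(s string) string {
	return hashtag.ReplaceAllStringFunc(s, strings.ToLower)
}

func main() {
	fmt.Println(normalizeTags("Learning #GoLang and #RE2")) // Learning #golang and #re2
}
```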

The callback receives the matched substring and returns the replacement. This is the right tool when the replacement depends on the match content, like normalizing hashtag casing, encoding URLs, or looking up values from a map.

Redacting credit-card-shaped numbers in a log line is a one-liner with ReplaceAllStringFunc:
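A sketch of that redaction, assuming a sample log line with a five-digit user ID and a sixteen-digit test number. The helper name redact is illustrative:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// cardLike matches a standalone run of 13 to 16 digits.
var cardLike = regexp.MustCompile(`\b\d{13,16}\b`)

// redact masks all but the last four digits of each card-shaped number.
func redact(line string) string {
	return cardLike.ReplaceAllStringFunc(line, func(m string) string {
		return strings.Repeat("*", len(m)-4) + m[len(m)-4:]
	})
}

func main() {
	fmt.Println(redact("user 12345 paid with 4111111111111111"))
	// user 12345 paid with ************1111
}
```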

The callback keeps the last four digits and masks the rest. The pattern is \b\d{13,16}\b: the {13,16} quantifier skips the user ID 12345 (only five digits), and the word boundaries keep it from matching digits embedded inside a larger token. The exact pattern for production redaction would be more nuanced, but the structure is the same.

Cost: ReplaceAllStringFunc allocates a fresh string for each match plus the final output. If your replacements are mostly identity (you only change a small fraction), check the match first and return the input unchanged when nothing needs to change.

Splitting

strings.Split only takes a fixed string separator. When the separator is a pattern (one or more spaces, comma-or-semicolon, optional whitespace around a delimiter), reach for re.Split instead.
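A sketch assuming a comma-separated list with uneven spacing:

```go
package main

import (
	"fmt"
	"regexp"
)

// commaSep matches a comma plus any whitespace on either side of it.
var commaSep = regexp.MustCompile(`\s*,\s*`)

// fields splits s on commas, discarding whitespace around each comma.
func fields(s string) []string {
	return commaSep.Split(s, -1)
}

func main() {
	fmt.Printf("%q\n", fields("mug , book,  pen")) // ["mug" "book" "pen"]
}
```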

The pattern \s*,\s* matches a comma with any amount of whitespace on either side. Split returns the parts between separators, so the whitespace around each comma is already gone (whitespace at the very start or end of the input isn't next to a comma, so it stays). Doing this with strings.Split followed by strings.TrimSpace in a loop works, but it takes two passes, and a single regex expresses the intent more clearly.

The n argument limits the number of substrings returned. -1 means "all of them"; a positive number caps the result and leaves the remainder unsplit in the last element:
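A sketch that splits a product line into a code prefix and the rest, using a whitespace separator:

```go
package main

import (
	"fmt"
	"regexp"
)

var spaces = regexp.MustCompile(`\s+`)

// headAndRest splits s at the first run of whitespace only.
func headAndRest(s string) []string {
	return spaces.Split(s, 2)
}

func main() {
	fmt.Printf("%q\n", headAndRest("BOOK 01 paperback fiction"))
	// ["BOOK" "01 paperback fiction"]
}
```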

With n = 2, the first split happens at the first whitespace, producing "BOOK" and "01 paperback fiction". The remaining whitespace inside the second part is preserved because no further splits were requested.

Cost: For a fixed-character separator like a single comma, strings.Split(s, ",") followed by strings.TrimSpace on each part is faster than regexp.Split because there's no automaton involved. Use regex only when the separator has structure.

RE2 Syntax Basics

This isn't a regex tutorial, but a quick reference for the syntax you'll use 90% of the time:

Category	Syntax	Meaning
Character classes	`[a-z]`	Any lowercase letter
	`[^0-9]`	Any character that isn't a digit
	`\d` `\D`	Digit / non-digit
	`\w` `\W`	Word character (`[A-Za-z0-9_]`) / non-word
	`\s` `\S`	Whitespace / non-whitespace
	`.`	Any character except newline
Quantifiers	`*`	Zero or more
	`+`	One or more
	`?`	Zero or one
	`{n}`	Exactly `n`
	`{n,m}`	Between `n` and `m`, inclusive
	`*?` `+?` `??`	Non-greedy variants
Anchors	`^`	Start of input (or line in multiline mode)
	`$`	End of input (or line in multiline mode)
	`\b`	Word boundary
	`\B`	Non-word boundary
Groups	`(...)`	Capturing group
	`(?:...)`	Non-capturing group
	`(?P<name>...)`	Named capturing group
Alternation	`a|b`	`a` or `b`
Flags	`(?i)`	Case-insensitive
	`(?s)`	Dot matches newline
	`(?m)`	`^` and `$` match line boundaries

The greedy vs non-greedy distinction trips up most beginners. \d+ matches as many digits as possible while still letting the rest of the pattern match. \d+? matches as few digits as possible and takes more only if the rest of the pattern requires it. For most patterns, greedy is what you want. Non-greedy comes up when you're extracting content between delimiters and don't want the match to swallow multiple delimiter pairs.
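A sketch of the delimiter case, assuming an input with two quoted phrases:

```go
package main

import (
	"fmt"
	"regexp"
)

var (
	greedyQuote = regexp.MustCompile(`".*"`)  // as much as possible
	lazyQuote   = regexp.MustCompile(`".*?"`) // as little as possible
)

func main() {
	s := `say "hi" and "bye"`
	fmt.Println(greedyQuote.FindString(s)) // "hi" and "bye"
	fmt.Println(lazyQuote.FindString(s))   // "hi"
}
```

The greedy version runs from the first quote to the last one in the input; the non-greedy version stops at the first closing quote.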

The inline flag form (?i) is handy when the pattern is a string literal and you don't want to rebuild it. Putting (?i) at the start makes the whole pattern case-insensitive:
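A sketch with an assumed book-code pattern and helper name:

```go
package main

import (
	"fmt"
	"regexp"
)

// The leading (?i) makes every letter in the pattern case-insensitive.
var bookCode = regexp.MustCompile(`(?i)^book-\d{2}$`)

// isBookCode accepts any casing of "book" followed by a dash and two digits.
func isBookCode(s string) bool {
	return bookCode.MatchString(s)
}

func main() {
	fmt.Println(isBookCode("BOOK-01")) // true
	fmt.Println(isBookCode("Book-01")) // true
	fmt.Println(isBookCode("book-1"))  // false
}
```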

The flag applies from where it appears to the end of the enclosing group (the whole pattern when the flag sits at the top level), so put it at the very start when you want the whole pattern to be case-insensitive.

The diagram shows the typical flow. A pattern string is compiled once into a *Regexp. From there, the same compiled value drives every match, find, submatch, replace, or split call. The compile step is the one you don't want to repeat.

Performance and When Not to Use Regex

The compile-once rule is the single biggest performance lever. Here's the anti-pattern to avoid:
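A sketch of the mistake, using a simplified email pattern and a hypothetical countValidSlow function. The result is correct; the repeated work is the problem:

```go
package main

import (
	"fmt"
	"regexp"
)

// countValidSlow recompiles the same pattern for every address.
// Anti-pattern: shown only to illustrate what to avoid.
func countValidSlow(addrs []string) int {
	n := 0
	for _, a := range addrs {
		re := regexp.MustCompile(`^[^@\s]+@[^@\s]+\.[^@\s]+$`) // compiled on each iteration
		if re.MatchString(a) {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(countValidSlow([]string{"ada@example.com", "nope", "bob@example.org"})) // 2
}
```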

Every iteration compiles the pattern from scratch. For 1,000 addresses, that's 1,000 compiles when one would do. The fix is moving the compile out:
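A sketch of the hoisted version, with the same simplified email pattern moved to a package-level variable. The names are illustrative:

```go
package main

import (
	"fmt"
	"regexp"
)

// emailPattern is compiled once, before main runs.
var emailPattern = regexp.MustCompile(`^[^@\s]+@[^@\s]+\.[^@\s]+$`)

// countValid reuses the compiled pattern for every address.
func countValid(addrs []string) int {
	n := 0
	for _, a := range addrs {
		if emailPattern.MatchString(a) {
			n++
		}
	}
	return n
}

func main() {
	fmt.Println(countValid([]string{"ada@example.com", "nope", "bob@example.org"})) // 2
}
```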

For a typical email pattern, the compile takes microseconds and a match takes hundreds of nanoseconds. Hoisting the compile out of the loop is a 50x to 1000x speedup depending on input size. This is the kind of optimization that pays for itself the moment you write it.

Cost: Compile patterns at package init or program startup. Pass *Regexp values around if a function needs one. The *Regexp is safe for concurrent use from multiple goroutines.

A more subtle anti-pattern: using regex when a string function suffices.
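A sketch comparing the two approaches for a prefix check. Both helper names are made up, and both return the same answer:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

var httpPrefix = regexp.MustCompile(`^http://`)

// isHTTPRegex works, but runs the regex engine for a fixed prefix.
func isHTTPRegex(url string) bool {
	return httpPrefix.MatchString(url)
}

// isHTTP does the same check with a plain string comparison.
func isHTTP(url string) bool {
	return strings.HasPrefix(url, "http://")
}

func main() {
	for _, u := range []string{"http://example.com", "ftp://example.com"} {
		fmt.Println(u, isHTTPRegex(u), isHTTP(u))
	}
}
```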

strings.HasPrefix is a direct byte comparison. There's no automaton, no allocations, no compile step. For prefix or suffix checks, fixed-substring search, or splitting on a fixed string, the strings package is faster and easier to read. Regex only earns its complexity budget when the pattern has actual structure.

Here's a quick rule for when regex is the wrong tool:

Task	Use this instead
Does this string start with `http://`?	`strings.HasPrefix`
Does this string contain `@`?	`strings.Contains`
Split on a single comma	`strings.Split`
Lowercase the string	`strings.ToLower`
Replace one fixed substring with another	`strings.ReplaceAll`
Count occurrences of a fixed substring	`strings.Count`
Trim spaces from both ends	`strings.TrimSpace`
Parse HTML	`golang.org/x/net/html`
Parse JSON	`encoding/json`

Regex isn't the right tool for HTML or any other recursive grammar. HTML allows nested tags, and RE2 (correctly) can't express balanced-bracket matching. Trying to scrape HTML with regex is one of those tasks that works for the simple cases in your test data and breaks the moment real-world HTML shows up. Use a proper parser.

Likewise for JSON, CSV, or any structured format with its own library. Regex on those formats works for hand-crafted inputs and fails on edge cases that the dedicated parser handles correctly.

Summary

Go uses RE2 syntax, which guarantees O(m * n) match time but doesn't support lookahead, lookbehind, or backreferences.
Compile patterns once with regexp.MustCompile for string literals or regexp.Compile for dynamic patterns. Reuse the resulting *Regexp across calls.
Always use raw string literals (backticks) for patterns to avoid double-escaping every backslash.
MatchString is the cheapest call: it returns a bool. Use it when you only need to know if there's a match.
FindString returns the first match or "" on no match. FindStringIndex returns positions or nil. The All variants take an n int cap and return every match.
FindStringSubmatch returns the whole match plus captured groups. Combine with SubexpNames() for named groups, which keep patterns self-documenting.
ReplaceAllString supports $1 and ${name} substitution. ReplaceAllLiteralString skips substitution. ReplaceAllStringFunc lets each match drive its own replacement.
re.Split is the regex-aware version of strings.Split for separators that have structure. For fixed delimiters, strings.Split is faster.
Don't reach for regex when a strings function works. Fixed-substring search, prefix checks, and simple replacements are all faster and clearer with the strings package.
The *Regexp value is safe for concurrent use across goroutines, so a single compiled pattern can serve every part of a multi-goroutine program.

In the next lesson, we'll bring all nine chapters together with a capstone lab.
