UTF-8 & Runes

Last Updated: May 17, 2026

Go strings hold raw bytes, but the text inside them is almost always UTF-8 encoded Unicode. That distinction matters the moment you store a customer name like "François Müller" or a product called "Café Latté Mug", because the character count and the byte count stop being the same number. This lesson explains how Unicode and UTF-8 fit together, what a rune is in Go, and how to use the unicode/utf8 package to work with text correctly.

Characters vs Bytes

A computer doesn't actually know what a letter is. It stores numbers. To represent text, somebody had to decide which number stands for which letter. Two different problems live underneath that decision.

The first problem is naming. Every character in every script ever invented needs a unique number. That mapping is the character set. Unicode is the modern, near-universal character set. It assigns each character a number called a code point, written like U+0041 for A or U+00E9 for é. The "U+" prefix is just notation; the value itself is a regular integer (0x41 is 65, 0x00E9 is 233).

The second problem is storage. Once you know é is code point U+00E9, how do you actually save that on disk or send it over a network? That's the encoding. UTF-8 is one popular encoding. UTF-16 and UTF-32 are others. They all represent the same Unicode characters, just with different byte patterns.

Go bakes UTF-8 in everywhere. Source files are UTF-8. String literals you write in your source code are UTF-8. The fmt package assumes UTF-8 when it formats text. You can absolutely store other encodings in a Go string (a string is just bytes), but the whole language and standard library are built around UTF-8 being the default. That choice was deliberate: UTF-8 is backward compatible with ASCII, compact for common scripts, and self-synchronizing, so you can find the start of a character from anywhere in the byte stream.
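A minimal sketch of the gap between bytes and characters, using the product name from the introduction. The helper name byteAndRuneCount is hypothetical, and utf8.RuneCountInString is covered later in this lesson.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneCount (hypothetical helper) returns the byte length of s
// and the number of Unicode code points it contains.
func byteAndRuneCount(s string) (int, int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	name := "Café Latté Mug"
	bytes, chars := byteAndRuneCount(name)
	fmt.Println("bytes:", bytes) // 16
	fmt.Println("chars:", chars) // 14
}
```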

The product name shows 14 visible characters, but len reports 16. That gap is the whole point of this lesson. The two é characters each take more than one byte in UTF-8, so the byte count is higher than the character count.

How UTF-8 Encodes a Character

UTF-8 is a variable-width encoding. A single Unicode code point uses anywhere from 1 to 4 bytes, depending on how large the code point's number is.

| Code point range | Bytes used | Covers |
| --- | --- | --- |
| U+0000 to U+007F | 1 | ASCII (English letters, digits, common punctuation) |
| U+0080 to U+07FF | 2 | Latin extended, Greek, Cyrillic, Arabic, Hebrew |
| U+0800 to U+FFFF | 3 | Most other scripts including CJK |
| U+10000 to U+10FFFF | 4 | Emoji, less common CJK, historic scripts |

The pattern of bits in each byte tells the decoder how many bytes the character spans. A byte whose top bit is 0 is a standalone ASCII byte. A byte that starts with 110 is the first byte of a two-byte sequence. A byte starting with 1110 starts a three-byte sequence, and 11110 starts a four-byte sequence. Continuation bytes always start with 10. That structure is what makes UTF-8 self-synchronizing: if you land in the middle of a string, you can scan forward to the next byte that doesn't start with 10 and you've found a character boundary.

The ASCII compatibility deserves its own paragraph. Every ASCII character (A through Z, 0 through 9, common punctuation) is exactly one byte in UTF-8, and that byte is identical to the original ASCII byte. So an English-only product catalog encoded in UTF-8 is byte-for-byte the same as it would be in plain ASCII. UTF-8 only "costs extra" when you actually use non-ASCII characters.

Here's the byte layout for a few characters you'll see in product catalogs.
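One way to see the layout is to print each character's bytes in hex and binary. This is only a sketch: the helper utf8Bytes is hypothetical, and the four sample characters are picked for illustration.

```go
package main

import "fmt"

// utf8Bytes (hypothetical helper) returns the raw UTF-8 bytes that make up s.
func utf8Bytes(s string) []byte {
	return []byte(s)
}

func main() {
	for _, s := range []string{"A", "é", "⭐", "😀"} {
		b := utf8Bytes(s)
		// "% X" prints the bytes in hex separated by spaces;
		// "%08b" prints each byte as eight binary digits.
		fmt.Printf("%s  hex: % X  binary: %08b\n", s, b, b)
	}
}
```

In the binary column, the leading bits of the first byte (0, 110, 1110, or 11110) show the sequence length, and every following byte starts with 10.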

A is one byte, é is two, ⭐ is three, and 😀 is four. The first byte of each multibyte sequence carries a bit pattern that says "I'm the start of an N-byte character", and the bytes that follow are continuation bytes. Storing the same product name in different scripts can use very different amounts of memory.
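The sketch below compares a few short strings by listing the byte width of each code point. The sample strings (including the Japanese greeting こんにちは) and the helper runeWidths are assumptions for illustration, and the helper assumes its input is valid UTF-8.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// runeWidths (hypothetical helper) returns the UTF-8 byte width of each
// code point in s. It assumes s is valid UTF-8.
func runeWidths(s string) []int {
	var widths []int
	for _, r := range s {
		widths = append(widths, utf8.RuneLen(r))
	}
	return widths
}

func main() {
	for _, s := range []string{"Hello", "Héllo", "こんにちは", "😀"} {
		fmt.Printf("%q: %d bytes, widths %v\n", s, len(s), runeWidths(s))
	}
}
```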

Hello is 5 ASCII characters, 5 bytes. Héllo swaps one e for é and the byte count jumps from 5 to 6 because é takes 2 bytes. The Japanese greeting is 5 characters but each one is 3 bytes, so 15 total. The single emoji is one character but takes the full 4 bytes because its code point is above U+FFFF.

Why len Returns Bytes

The len built-in on a string returns the number of bytes in the string. There's no character counter hidden inside a Go string, and providing one would mean either picking an encoding to count against, walking the string on every call, or caching a count somewhere. Go's designers picked the simple, fast option: len(s) is the byte count, full stop.
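To make the consequence concrete, here is a deliberately flawed length check built on len. The function name naiveFits and the 15-character limit are hypothetical, chosen so the bug is visible with a short name.

```go
package main

import "fmt"

// naiveFits is intentionally wrong: it compares the byte length,
// not the character count, against the limit.
func naiveFits(name string, maxChars int) bool {
	return len(name) <= maxChars
}

func main() {
	name := "François Müller" // 15 visible characters

	fmt.Println(len(name))           // 17
	fmt.Println(naiveFits(name, 15)) // false, even though the name has 15 characters
}
```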

The visible name is 15 characters, but len reports 17. The ç is 2 bytes and the ü is 2 bytes, so the total is 13 ASCII bytes + 2 + 2 = 17. If you wrote a "shipping label fits in 30 characters" check using len, you'd get the wrong answer the moment a name like François or Müller shows up.

This is also why you can't safely take "the third character" of a string with s[2]. Indexing a string gives you the byte at that position, not the character.
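A short sketch of byte indexing, assuming the string is "Café" (the helper byteAt is hypothetical and only wraps s[i]):

```go
package main

import "fmt"

// byteAt (hypothetical helper) returns the raw byte at index i of s.
func byteAt(s string, i int) byte {
	return s[i]
}

func main() {
	s := "Café"

	fmt.Println(byteAt(s, 2), string(byteAt(s, 2))) // 102 f
	fmt.Println(byteAt(s, 3))                       // 195 (0xC3)
	fmt.Println(byteAt(s, 4))                       // 169 (0xA9)
}
```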

s[3] is 195, which is the decimal value of 0xC3, the first byte of the two-byte sequence for é. The "third character" by eye is f, the "fourth character" is é, but indexing by byte hands you raw bytes and doesn't care about character boundaries. To work with characters, you need a rune.

The rune Type

Go represents a single Unicode code point with the rune type. rune is just an alias for int32, declared in the builtin package. It's a signed 32-bit integer that's big enough to hold any code point (the largest one, U+10FFFF, fits comfortably in 21 bits).
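A small sketch that prints a rune's type and value; the helper describe is a hypothetical name used only here.

```go
package main

import "fmt"

// describe (hypothetical helper) reports the Go type, numeric value,
// and character form of a rune.
func describe(r rune) string {
	return fmt.Sprintf("%T %d %c", r, r, r)
}

func main() {
	r := 'A'
	fmt.Println(describe(r)) // int32 65 A
}
```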

The type is int32 because that's literally what rune is. The value 65 is the code point for A. When you write 'A' in source code with single quotes, you're writing a rune literal: an integer constant whose value is the code point of that character.

Single quotes and double quotes mean different things in Go. Double quotes give you a string. Single quotes give you a rune.

| Literal | Type | Value |
| --- | --- | --- |
| "A" | string | A one-byte string containing 0x41 |
| 'A' | rune (int32) | The integer 65 |
| "é" | string | A two-byte string containing 0xC3 0xA9 |
| 'é' | rune (int32) | The integer 233 |

Mix them up and the compiler complains. var s string = 'A' won't compile because 'A' is an integer, not a string.

Rune literals support a few escape forms for characters you can't easily type. \u followed by four hex digits names a code point in the Basic Multilingual Plane, and \U followed by eight hex digits names any code point. Both produce the same kind of value, just with different ranges.
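The escape forms can be tried with a sketch like the following, which formats three runes as a character plus a U+ code point. The helper name label is hypothetical.

```go
package main

import "fmt"

// label (hypothetical helper) formats a rune as its character followed by
// its code point in U+XXXX notation.
func label(r rune) string {
	return fmt.Sprintf("%c U+%04X", r, r)
}

func main() {
	fmt.Println(label('A'))          // A U+0041
	fmt.Println(label('\u00E9'))     // é U+00E9
	fmt.Println(label('\U0001F600')) // 😀 U+1F600
}
```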

%c formats an integer as the character whose code point matches. %X formats it in uppercase hex. The 04 width pads short values with leading zeros so single-byte code points line up with multi-byte ones in display. All three are just integer values; the formatting verb decides how they're displayed.

You can freely convert between rune and int. They're the same family of numbers.
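A minimal sketch of the conversions, with roundTrip as a hypothetical helper name:

```go
package main

import "fmt"

// roundTrip (hypothetical helper) converts a rune to int and back;
// the numeric value never changes along the way.
func roundTrip(r rune) rune {
	n := int(r)
	return rune(n)
}

func main() {
	r := 'é'
	fmt.Println(int(r))            // 233
	fmt.Println(roundTrip(r) == r) // true
	fmt.Println(int('A'))          // 65
	fmt.Printf("%c\n", rune(65))   // A
}
```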

int(r) is just a type conversion, no data changes. rune(233) produces the same rune you started with. The same trick works for ASCII characters too: int('A') is 65, and rune(65) is 'A'. Treat rune-to-int as moving the same number across types.

The unicode/utf8 Package

Once you accept that len gives bytes and a string can hold multibyte characters, you need real tools for working with characters. The unicode/utf8 package is that toolkit. It's small, in the standard library, and every Go developer who handles non-ASCII text ends up using it.

Counting Code Points

utf8.RuneCountInString(s) returns the number of code points in a string. That's the closest Go gives you to "character count", though it's worth knowing that one visible character (a grapheme cluster) can be made up of multiple code points (a letter plus a combining accent, for example). For most everyday text including European, CJK, and emoji without combining marks, the code point count matches what you'd intuitively call the character count.
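The sketch below compares byte and code point counts for three product names and applies a character limit. The Japanese name, the constant maxNameChars, and the helper nameTooLong are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// maxNameChars is a hypothetical limit on product name length, in code points.
const maxNameChars = 50

// nameTooLong (hypothetical helper) compares the code point count,
// not the byte count, against the limit.
func nameTooLong(name string) bool {
	return utf8.RuneCountInString(name) > maxNameChars
}

func main() {
	names := []string{"Notebook", "Café Latté Mug", "ステンレスマグカップ"}
	for _, n := range names {
		fmt.Printf("%s: %d bytes, %d chars, too long: %t\n",
			n, len(n), utf8.RuneCountInString(n), nameTooLong(n))
	}
}
```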

Notebook is pure ASCII, so bytes and characters match. Café Latté Mug has two accented characters, each adding one extra byte, giving 16 bytes for 14 characters. The Japanese product name is 10 characters but each one is 3 bytes in UTF-8, so 30 bytes total. If you needed to enforce a "product name must be at most 50 characters" rule, utf8.RuneCountInString is what you'd compare against, not len.

Decoding the First Rune

utf8.DecodeRuneInString(s) reads the first character from a string and returns two values: the rune itself and how many bytes it occupied.
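A sketch of decoding the first rune and slicing off the rest, assuming the input string "école". The helper splitFirst is a hypothetical name.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// splitFirst (hypothetical helper) decodes the first rune of s and returns
// the rune, its width in bytes, and the remainder of the string.
func splitFirst(s string) (rune, int, string) {
	r, size := utf8.DecodeRuneInString(s)
	return r, size, s[size:]
}

func main() {
	r, size, rest := splitFirst("école")
	fmt.Printf("%c U+%04X size=%d rest=%q\n", r, r, size, rest)
	// é U+00E9 size=2 rest="cole"
}
```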

é decoded to its code point U+00E9, and size came back as 2 because that's how many bytes the character spans in UTF-8. Slicing the original string at s[size:] gives you everything after the first character, properly aligned to the next character boundary. That size value is the only safe way to advance through a string one character at a time when indexing by byte.

If the string is empty, DecodeRuneInString returns the replacement character U+FFFD and a size of 0. If the bytes at the start of the string aren't valid UTF-8, it returns the replacement character and a size of 1, advancing past the bad byte. We'll come back to the replacement character shortly.

Measuring a Rune's Byte Width

utf8.RuneLen(r) answers the reverse question: given a rune, how many bytes will it take in UTF-8?
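A sketch of RuneLen on a few sample runes. The helper encodedSize is hypothetical; it sums the widths and reports failure if any rune cannot be encoded.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encodedSize (hypothetical helper) returns the total number of bytes needed
// to encode runes as UTF-8. The bool is false if any rune is not encodable.
func encodedSize(runes []rune) (int, bool) {
	total := 0
	for _, r := range runes {
		n := utf8.RuneLen(r)
		if n < 0 {
			return 0, false
		}
		total += n
	}
	return total, true
}

func main() {
	for _, r := range []rune{'A', 'é', '⭐', '😀'} {
		fmt.Printf("%c needs %d byte(s)\n", r, utf8.RuneLen(r))
	}
	fmt.Println(utf8.RuneLen(0xD800)) // -1: a UTF-16 surrogate half

	total, ok := encodedSize([]rune("Café"))
	fmt.Println(total, ok) // 5 true
}
```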

RuneLen returns -1 for runes that aren't valid Unicode (negative values, values above U+10FFFF, or code points reserved for UTF-16 surrogate halves). For any legitimate code point, it returns 1, 2, 3, or 4 to match the UTF-8 encoding rules.

Encoding a Rune to Bytes

utf8.EncodeRune(buf, r) does the reverse of DecodeRuneInString. It writes the UTF-8 bytes for a rune into a byte slice and returns the number of bytes it wrote. The slice must be large enough for the encoded rune, or the call panics; a buffer of 4 bytes (the constant utf8.UTFMax) is always enough to hold the largest possible character.
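A minimal sketch of encoding one rune into a buffer, using the star character as the sample. The helper encode is a hypothetical name.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// encode (hypothetical helper) returns the UTF-8 bytes for r.
// utf8.UTFMax (4) is the largest number of bytes any rune can need.
func encode(r rune) []byte {
	buf := make([]byte, utf8.UTFMax)
	n := utf8.EncodeRune(buf, r)
	return buf[:n]
}

func main() {
	fmt.Printf("% X\n", encode('⭐')) // E2 AD 90
	fmt.Printf("% X\n", encode('A')) // 41
}
```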

The function wrote three bytes into buf because ⭐ is U+2B50, which falls into the 3-byte UTF-8 range. The bytes match the layout in the earlier diagram exactly. EncodeRune is what you'd reach for when you're building up a byte buffer character by character, for instance writing to a binary protocol or constructing a hashed key.

Validating UTF-8

Not every byte sequence is valid UTF-8. If your program ingests data from a file, a network call, or user input, you might end up with bytes that don't decode cleanly. utf8.ValidString(s) checks whether a string is well-formed UTF-8.
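A sketch of validating input before accepting it. The names checkInput and errInvalidUTF8 are hypothetical, and the two sample strings are chosen to show one valid and one corrupt input.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// errInvalidUTF8 is a hypothetical sentinel error for rejected input.
var errInvalidUTF8 = errors.New("input is not valid UTF-8")

// checkInput (hypothetical helper) rejects strings that are not well-formed UTF-8.
func checkInput(s string) error {
	if !utf8.ValidString(s) {
		return errInvalidUTF8
	}
	return nil
}

func main() {
	good := "Café Latté Mug"
	bad := "\xc3\x28" // lead byte 0xC3 followed by '(' instead of a continuation byte

	fmt.Println(checkInput(good)) // <nil>
	fmt.Println(checkInput(bad))  // input is not valid UTF-8
}
```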

good is a normal product name and validates cleanly. bad was crafted to put a UTF-8 lead byte (0xC3) in front of a byte that can't be a continuation byte (0x28, which is (). That's a corrupt sequence, and ValidString flags it. A common reason to use this function is to reject malformed input before it propagates into your data store, where it would later confuse search, indexing, and display code.

The Replacement Character

When decoding fails, Unicode reserves a special code point to stand in for "something was here, but I can't tell what". That's U+FFFD, the replacement character, which renders as � in most fonts. Go uses it consistently: any time the standard library has to give you a rune and the bytes don't decode, you'll see U+FFFD.
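The sketch below decodes a string that contains one bad byte, one rune at a time. The input "a\xc3(b" and the helper decodeAll are assumptions for illustration.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decodeAll (hypothetical helper) decodes s rune by rune, advancing by the
// size that DecodeRuneInString reports at each step.
func decodeAll(s string) []rune {
	var out []rune
	for len(s) > 0 {
		r, size := utf8.DecodeRuneInString(s)
		out = append(out, r)
		s = s[size:]
	}
	return out
}

func main() {
	// 0xC3 announces a two-byte sequence, but '(' is not a continuation byte.
	for _, r := range decodeAll("a\xc3(b") {
		fmt.Printf("%U ", r)
	}
	fmt.Println()
	// U+0061 U+FFFD U+0028 U+0062
}
```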

The decoder kept going. When it hit the bad 0xC3 byte, it returned U+FFFD with a size of 1, so the next iteration could try again from the next byte. That's how Go avoids getting stuck on bad input: skip one byte, emit the replacement character, and continue. The cost is that bad data quietly becomes characters in your output, so it's a good idea to validate strings up front when you can.

You can also produce the replacement character explicitly with utf8.RuneError, which is just a named constant equal to '�'.

When you see � in a log message, a database value, or a rendered web page, it almost always means somebody passed non-UTF-8 bytes through a system that expected UTF-8. The encoding got applied anyway, the bad bytes turned into U+FFFD, and the original data is gone.

The unicode Package: Classifying Runes

The unicode package is the rune-level companion to unicode/utf8. Where unicode/utf8 deals in encoding, unicode deals in what each character is: letter, digit, space, uppercase, lowercase, punctuation. Every function in the package operates on a single rune.
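A sketch of classifying a handful of runes. The helper classify and its category labels are hypothetical; the unicode functions it calls are from the standard library.

```go
package main

import (
	"fmt"
	"unicode"
)

// classify (hypothetical helper) returns a coarse category name for r.
func classify(r rune) string {
	switch {
	case unicode.IsLetter(r):
		return "letter"
	case unicode.IsDigit(r):
		return "digit"
	case unicode.IsSpace(r):
		return "space"
	case unicode.IsSymbol(r):
		return "symbol"
	default:
		return "other"
	}
}

func main() {
	for _, r := range []rune{'A', 'é', 'ñ', '7', ' ', '⭐'} {
		fmt.Printf("%q: %s\n", r, classify(r))
	}
}
```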

IsLetter says yes for A, é, and ñ because Unicode classifies them all as letters, accented or not. The digit 7 is correctly flagged as a digit. The space character is whitespace. The star isn't a letter, a digit, or a space, it's a symbol. These functions use the official Unicode category tables, so they handle scripts you've never thought about (Cyrillic, Devanagari, Hangul) the same way they handle Latin.

The case conversion functions, unicode.ToUpper and unicode.ToLower, work on runes too. They return a new rune with the right casing, or the input rune unchanged if no case mapping applies.
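A short sketch of the per-rune case functions. The helper swapCase is hypothetical and exists only to combine them.

```go
package main

import (
	"fmt"
	"unicode"
)

// swapCase (hypothetical helper) flips the case of a single rune and
// returns runes without a case unchanged.
func swapCase(r rune) rune {
	switch {
	case unicode.IsUpper(r):
		return unicode.ToLower(r)
	case unicode.IsLower(r):
		return unicode.ToUpper(r)
	}
	return r
}

func main() {
	fmt.Printf("%c %c\n", unicode.ToLower('É'), unicode.ToUpper('É')) // é É
	fmt.Printf("%c\n", unicode.ToUpper('7'))                          // 7
	fmt.Printf("%c\n", swapCase('é'))                                 // É
}
```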

unicode.ToLower turns É into é, and unicode.ToUpper leaves É unchanged because it's already uppercase. That's the rune-level operation. For converting an entire string at once, you'd use strings.ToLower or strings.ToUpper, which loop over the runes for you and build a new string.

There's a subtle limit here. Some scripts have case rules that depend on context or that turn one character into multiple characters (the German ß uppercases to SS, for example). unicode.ToUpper(r) is a one-rune-to-one-rune function, so it can't express those cases. For everyday product names and customer names, the simple per-rune mapping is what you want; for full linguistic correctness, you'd reach for golang.org/x/text/cases.

Getting the Nth Character

A natural thing to want is "give me the third character of this product name". With ASCII you could write s[2] and be done. With UTF-8 you can't, because byte 2 might be the middle of a multibyte character. The standard fix is to convert the string to a []rune and index into that.
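A sketch of character indexing through a []rune conversion. The product name "Écharpe" and the helper nthRune are assumptions for illustration.

```go
package main

import "fmt"

// nthRune (hypothetical helper) returns the code point at character index n,
// or false if n is out of range.
func nthRune(s string, n int) (rune, bool) {
	runes := []rune(s)
	if n < 0 || n >= len(runes) {
		return 0, false
	}
	return runes[n], true
}

func main() {
	name := "Écharpe"

	// É occupies bytes 0 and 1, so byte 2 is the second character.
	fmt.Println(string(name[2])) // c

	r, ok := nthRune(name, 2)
	fmt.Println(string(r), ok) // h true
}
```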

The conversion []rune(name) walks the string once, decodes each UTF-8 sequence into a rune, and returns a slice of those runes. After that, indexing by character is just slice indexing. runes[2] is the third character because slices are zero-indexed. len(runes) gives you the character count, the same number utf8.RuneCountInString would return.

The other way to think about the []rune conversion is that it normalizes "variable-width UTF-8" into "fixed-width 4-byte runes". After the conversion, every element is one character and indexing is constant time. Before the conversion, the string was bytes, and indexing was bytes. Both representations are useful for different jobs.
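String reversal is a common case where the rune slice helps. The following is a sketch that assumes the input "Crème Brûlée"; the helper reverse is a hypothetical name, and it reverses code points, not grapheme clusters.

```go
package main

import "fmt"

// reverse (hypothetical helper) reverses s by code point, so multibyte
// characters stay intact.
func reverse(s string) string {
	runes := []rune(s)
	for i, j := 0, len(runes)-1; i < j; i, j = i+1, j-1 {
		runes[i], runes[j] = runes[j], runes[i]
	}
	return string(runes)
}

func main() {
	fmt.Println(reverse("Crème Brûlée")) // eélûrB emèrC
}
```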

This is a classic example of why character-level work needs runes. Reversing a string by reversing its bytes would split the è, û, and é sequences in half and produce garbled output (with replacement characters all over). Reversing the runes keeps each character intact. The conversion string(runes) at the end rebuilds a regular Go string from the rune slice, encoding each rune back into UTF-8.

A brief note on iteration. The for i, r := range s form walks a string by code points, giving you the byte index and the decoded rune at each step. It's the most efficient way to read characters in order from a string. Converting to []rune is the right tool when you need random access by character index, and ranging over the string is the right tool when you need sequential access.
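A brief sketch of the range form, assuming the string "Café!". The helper runeOffsets is hypothetical and collects the byte index where each code point starts.

```go
package main

import "fmt"

// runeOffsets (hypothetical helper) returns the byte index at which each
// code point of s begins.
func runeOffsets(s string) []int {
	var offsets []int
	for i := range s {
		offsets = append(offsets, i)
	}
	return offsets
}

func main() {
	for i, r := range "Café!" {
		fmt.Printf("byte %d: %c\n", i, r)
	}
	fmt.Println(runeOffsets("Café!")) // [0 1 2 3 5]
}
```

The index jumps from 3 to 5 because é occupies two bytes.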

Putting It Together

A small program that pulls in most of what's been covered. It takes a customer review, reports its byte and character counts, validates the UTF-8, finds the first letter and the first digit, and uppercases the whole thing one rune at a time.
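One possible version of such a program is sketched below. The review text and the helper names (firstMatch, firstLetter, firstDigit, upperRunes) are assumptions for illustration, not the lesson's original listing.

```go
package main

import (
	"fmt"
	"unicode"
	"unicode/utf8"
)

// firstMatch returns the first rune in s for which pred reports true.
func firstMatch(s string, pred func(rune) bool) (rune, bool) {
	for _, r := range s {
		if pred(r) {
			return r, true
		}
	}
	return 0, false
}

// firstLetter and firstDigit are thin wrappers around firstMatch.
func firstLetter(s string) (rune, bool) { return firstMatch(s, unicode.IsLetter) }
func firstDigit(s string) (rune, bool)  { return firstMatch(s, unicode.IsDigit) }

// upperRunes uppercases s one rune at a time and rebuilds the string.
func upperRunes(s string) string {
	runes := []rune(s)
	for i, r := range runes {
		runes[i] = unicode.ToUpper(r)
	}
	return string(runes)
}

func main() {
	review := "café mug, 5 ⭐" // hypothetical review text

	fmt.Println("bytes:", len(review))
	fmt.Println("chars:", utf8.RuneCountInString(review))
	fmt.Println("valid:", utf8.ValidString(review))
	if r, ok := firstLetter(review); ok {
		fmt.Printf("first letter: %c\n", r)
	}
	if r, ok := firstDigit(review); ok {
		fmt.Printf("first digit: %c\n", r)
	}
	fmt.Println("upper:", upperRunes(review))
}
```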

The byte count is higher than the character count by 3, accounting for the é (1 extra byte) and the ⭐ (2 extra bytes). UTF-8 validation passes. The first letter and the first digit show up exactly where you'd expect. The uppercased version correctly turns é into É, leaves the digit and the star alone, and rejoins everything into a valid UTF-8 string at the end.

Summary

  • Unicode is a character set: it assigns each character a code point like U+00E9. UTF-8 is an encoding: it stores those code points as 1 to 4 bytes. Go source files and string literals are UTF-8 by default.
  • ASCII characters take 1 byte in UTF-8 and are byte-for-byte identical to plain ASCII. Accented Latin, Greek, Cyrillic, Arabic, and Hebrew take 2 bytes per character, most other scripts (including CJK) take 3, and emoji and rare characters take 4.
  • len(s) on a string returns the byte count in O(1). For the character count, use utf8.RuneCountInString(s), which is O(n).
  • A rune is an alias for int32, big enough to hold any Unicode code point. Rune literals use single quotes ('A', 'é', '\U0001F600') and are just integer constants.
  • The unicode/utf8 package provides the encoding-level toolkit: RuneCountInString, DecodeRuneInString, RuneLen, EncodeRune, ValidString. The unicode package classifies runes (IsLetter, IsDigit, IsSpace) and case-maps them (ToUpper, ToLower).
  • Indexing a string with s[i] returns a byte, not a character. To index by character, convert to []rune first, which costs O(n) time and allocates a new slice.
  • Invalid UTF-8 bytes decode to the replacement character U+FFFD (�). Validate input with utf8.ValidString when bytes come from outside your program.

The next lesson, String Iteration, dives into the for i, r := range s pattern that decodes runes lazily without allocating a []rune. That's the workhorse loop for processing text in Go, and it builds directly on the rune and UTF-8 ideas covered here.