Last Updated: February 1, 2026
A goroutine is Go's unit of concurrent execution. It's lighter than an OS thread, cheaper to create, and managed entirely by the Go runtime rather than the operating system.
The go keyword spawns a new goroutine that runs concurrently with the calling code. The function executes in the background while main() continues.
| Aspect | Goroutine | OS Thread |
|---|---|---|
| Initial stack size | ~2KB | ~1-8MB |
| Creation time | ~0.3 microseconds | ~10+ microseconds |
| Context switch | ~100-200ns (user space) | ~1-10 microseconds (kernel) |
| Maximum count | Millions | Thousands |
| Managed by | Go runtime | Operating system |
| Stack | Growable | Fixed |
The small initial stack is a key advantage. An OS thread typically reserves 1-8MB of stack space upfront (even if unused), limiting you to thousands of threads. Goroutines start with ~2KB stacks that grow and shrink as needed, allowing millions of concurrent goroutines on a single machine.
Goroutine stacks start small and grow dynamically. When a goroutine needs more stack space, the runtime detects the overflow via a check the compiler inserts in function prologues, allocates a larger stack (roughly doubling the size), copies the old stack into it, and updates any pointers into the stack. This is transparent to your code. Stacks can also shrink during garbage collection if they're using much less than allocated.
Go's scheduler uses the GMP model, named after its three core components:
Each goroutine (G) contains its own stack, its saved program counter and register state (captured while descheduled), and its current status (runnable, running, waiting, and so on).
An M is an OS thread. It executes goroutines and interacts with the operating system. The runtime creates M's as needed; there can be many of them, though at most GOMAXPROCS of them run Go code simultaneously.

A P is a logical processor context. It acts as a token for executing Go code: an M must acquire a P before it can run goroutines, each P owns a local run queue of runnable G's, and there are always exactly GOMAXPROCS P's.

You might wonder why we need P when we have M. The answer becomes clear when you consider what would happen without P. Let's walk through three scenarios.
Scenario 1: Without P, syscalls would waste CPUs
Imagine a goroutine making a blocking syscall (reading a file). Without the P abstraction, the OS thread would sit blocked in the kernel while its CPU went idle; keeping all cores busy would require over-provisioning threads, with no way to tell which ones were actually executing Go code.

With P, the runtime decouples the goroutine from the thread: when the M blocks in the syscall, it hands its P off to another (possibly new) M, which keeps running other goroutines. The blocked M costs only a parked thread, not a wasted CPU.
Scenario 2: Without P, work stealing would need global locks
Without local run queues attached to P's, all goroutines would sit in one global queue, and every M would contend on a single lock for every scheduling decision, a serious bottleneck at high core counts.

With P's local run queues, each M schedules from its own P's queue without locking in the common case, touching the global queue (and its lock) only occasionally.
Scenario 3: Without P, controlling parallelism would be awkward
How would you limit concurrency to 4 cores on an 8-core machine? Without P, you'd need to limit thread creation, but threads are also needed for syscalls.
With P, you simply set GOMAXPROCS=4: four P's exist, so at most four goroutines execute Go code at once, while the runtime remains free to create extra threads for blocking syscalls.
In essence, P is a "CPU token" that separates scheduling (G on P) from execution (M runs P). This decoupling is what makes Go's scheduler efficient.
GOMAXPROCS controls the number of P's, which determines the maximum number of goroutines executing simultaneously:
Or set it via the environment variable, e.g. `GOMAXPROCS=4 ./myapp`.
Default behavior: Since Go 1.5, GOMAXPROCS defaults to the number of available CPUs. Before that, it defaulted to 1.
When to change it: lower it in containers with CPU quotas (Go versions that aren't cgroup-aware see the host's CPU count, not the quota), or adjust it when profiling shows scheduling contention or underutilized cores. For most programs, the default is right.
GOMAXPROCS doesn't limit the number of goroutines or OS threads. It limits how many goroutines run simultaneously. You can have millions of goroutines with GOMAXPROCS=4; they just take turns running on those 4 P's.
When you write go f(), the runtime allocates a G (or reuses one from a free list), records f and its arguments, places the G on the current P's local run queue, and wakes an idle M/P pair if one is available.
Each M runs a scheduling loop: find a runnable G, execute it until it blocks, is preempted, or finishes, then go find the next one.
Finding a runnable G follows a priority order:

1. The P's runnext slot (a single slot for the next G to run, cache-friendly)
2. The P's local run queue
3. The global run queue
4. The netpoller, for goroutines whose I/O is ready
5. Stealing from other P's

When a P's local queue is empty, it steals work from other P's:
Work stealing takes half the victim's run queue, balancing load across processors.
Go uses preemption to prevent a single goroutine from monopolizing a P.
Cooperative preemption (before Go 1.14): a goroutine could only be preempted at function calls, where the compiler's stack check doubles as a safe point. A tight loop with no function calls could hog its P indefinitely.

Asynchronous preemption (Go 1.14+): the runtime sends a signal (SIGURG on Unix) to the thread running any goroutine that has held a P for roughly 10ms, interrupting it at nearly any instruction. Tight loops can no longer starve other goroutines.
A goroutine transitions through several states during its lifetime. Understanding these states helps with debugging and performance analysis.
| State | Triggers | Duration | Debugging Visibility |
|---|---|---|---|
| Runnable | go statement, unblocked from wait | Typically microseconds to milliseconds | Visible in pprof as "runnable", in GODEBUG as local/global queue |
| Running | Scheduled by P | Until blocked, preempted, or finished | Current goroutine in stack trace |
| Waiting | Channel op, mutex, I/O, sleep, select | Varies: nanoseconds to forever (leak!) | Visible in pprof with blocking reason |
| Dead | Return, panic, runtime.Goexit | Instant (becomes garbage) | Not visible; memory reclaimed |
Runnable state: The goroutine is ready to execute but waiting for a P. This happens when it was just created (go f()), was preempted, or was just unblocked while all P's were busy.

High runnable counts mean goroutines are competing for limited P's. Consider whether you're spawning too many goroutines or if GOMAXPROCS is too low.
Running state: The goroutine is actively executing on an M+P pair. Only GOMAXPROCS goroutines can be in this state simultaneously.
Waiting state: The goroutine is blocked, not consuming CPU. Common reasons:
- chan receive: Waiting for data on a channel
- chan send: Waiting for a receiver on a full/unbuffered channel
- select: Waiting for any case to be ready
- sync.Mutex.Lock: Waiting for a mutex
- sync.Cond.Wait: Waiting for a condition signal
- time.Sleep: Waiting for a timer
- IO wait: Network I/O via the netpoller
- syscall: Blocking syscall (M also blocked)

Debugging tip: In pprof goroutine profiles, the waiting reason shows why a goroutine is blocked. Look for goroutines stuck in chan receive with no matching sender: that's likely a leak.
When a goroutine blocks, it releases its M (and P) so others can run:
| Operation | Effect |
|---|---|
| Channel send (full buffer/no receiver) | G moves to channel's wait queue |
| Channel receive (empty buffer/no sender) | G moves to channel's wait queue |
| Mutex Lock (already locked) | G moves to mutex's wait queue |
| time.Sleep() | G moves to timer heap |
| I/O operation | M enters syscall, P handed off |
| runtime.Gosched() | G yields, moves to run queue |
System calls (file I/O, network I/O, etc.) require special handling because they block the OS thread:
When a goroutine makes a blocking syscall, the M enters the kernel still attached to its G. If the syscall doesn't return promptly, the runtime's sysmon background thread detaches the P from the blocked M and hands it to another (possibly new) M, so other goroutines keep running. When the syscall returns, the M tries to reacquire a P; if none is free, its G goes on a run queue and the M parks.
Network I/O is different from file I/O. While file reads truly block in the kernel, network operations can be made non-blocking with the right OS facilities. The netpoller is Go's integration with these facilities (epoll on Linux, kqueue on BSD/macOS, IOCP on Windows).
What the netpoller enables: goroutines use blocking-style APIs (e.g. conn.Read(buf)), but under the hood it's non-blocking and event-driven.

How it works: when a goroutine performs network I/O on a descriptor that isn't ready, the runtime registers the descriptor with the OS poller and parks the G. The M is immediately free to run other goroutines. When the descriptor becomes ready, the netpoller marks the G runnable and a P picks it up.
This is why Go handles thousands of concurrent network connections efficiently. The M's don't block on I/O.
A goroutine leak occurs when goroutines are created but never terminate. They consume memory and may hold resources:
1. Blocked channel operations:
Fix: Use a buffered channel or select with context.
The fix addresses the root cause: the goroutine blocks because its send has no receiver. We have two options:
Option 1: Buffered channel - The send can complete even without a receiver, because the buffer absorbs the value. The goroutine can then exit, and the buffered value is garbage collected later.
Option 2: Select with context - The goroutine watches for cancellation and exits cleanly when the context is done.
2. Infinite loops without exit:
Why this leaks: The goroutine has no exit condition. Even if the function that called startWorker() returns, the goroutine runs forever, consuming memory and potentially CPU.
Fix: Use context for cancellation. Context is the idiomatic way to signal "please stop" to goroutines. The goroutine checks ctx.Done() regularly and exits when cancelled.
3. Missing case in select:
Why this leaks: A bare receive (<-ch) blocks until a value arrives. If the sender crashes, the channel is abandoned, or the send never happens, this goroutine waits forever.
Fix: Add timeout or context. Every blocking operation should have an escape hatch. Use time.After for simple timeouts or ctx.Done() for cancellation that propagates through your call stack.
Important: time.After creates a timer that isn't garbage collected until it fires. In hot paths, use time.NewTimer and call timer.Stop() to avoid memory leaks from accumulated timers.
Using runtime.NumGoroutine():
Using goleak (Uber's library): call goleak.VerifyNone(t) at the end of a test (or goleak.VerifyTestMain(m) in TestMain), and the test fails if unexpected goroutines are still running when it finishes.
Using pprof:
Then visit http://localhost:6060/debug/pprof/goroutine?debug=1 to see all goroutines and their stack traces.
Using GODEBUG=schedtrace: running with GODEBUG=schedtrace=1000 makes the runtime print a one-line scheduler summary every second. Fields:

- gomaxprocs: Number of P's
- idleprocs: P's with no work
- threads: Total M's
- spinningthreads: M's looking for work
- runqueue: Global run queue size
- [...]: Local run queue sizes for each P

Using the execution tracer: view a captured trace with go tool trace trace.out
The trace shows goroutine creation, blocking, and unblocking events, GC activity, syscalls, and how work is distributed across P's over time.
Understanding the costs of goroutine operations helps you make informed design decisions.
| Metric | Value | Notes |
|---|---|---|
| Initial stack size | ~2 KB | Grows/shrinks dynamically |
| Maximum stack size | 1 GB (64-bit) | Runtime limit, configurable |
| Creation time | ~0.3 microseconds | Much faster than OS thread (~10 μs) |
| Goroutine struct | ~400 bytes | Runtime overhead per goroutine |
Practical implication: Creating a million goroutines uses about 2-3 GB of memory (stack + struct overhead). This is feasible for connection-per-goroutine servers, but watch memory usage under load.
| Operation | Cost | Notes |
|---|---|---|
| Goroutine switch | 100-200 ns | User space, minimal state |
| OS thread switch | 1-10 μs | Kernel mode, full context |
| Syscall (fast path) | ~100 ns | No actual kernel entry |
| Syscall (slow path) | ~1 μs | Enters kernel |
Why goroutine switches are fast: only a few registers (stack pointer, program counter, and a handful more) are saved and restored; the switch happens entirely in user space with no kernel transition; and the scheduler's data structures are usually hot in cache.
| Operation | Cost | Notes |
|---|---|---|
| Work stealing | ~200 ns | Per steal attempt |
| Global queue access | ~50 ns | Lock contention possible |
| Local queue push/pop | ~10 ns | Lock-free for owner |
| Netpoller check | ~100 ns | Amortized across many goroutines |
When to use more goroutines: I/O-bound work with many independent tasks (network servers, fan-out requests), where goroutines spend most of their time waiting rather than computing.

When to limit goroutines: CPU-bound work (more workers than GOMAXPROCS just adds switching overhead), memory pressure from per-goroutine stacks, or downstream resources (databases, APIs) that can't absorb unbounded concurrency.
Rule of thumb for worker pools:

- CPU-bound work: runtime.NumCPU() workers
- I/O-bound work: start with runtime.NumCPU() * 2, adjust based on profiling