Last Updated: February 1, 2026
A goroutine is Go's unit of concurrent execution. It's lighter than an OS thread, cheaper to create, and managed entirely by the Go runtime rather than the operating system.
The go keyword spawns a new goroutine that runs concurrently with the calling code. The function executes in the background while main() continues.
| Aspect | Goroutine | OS Thread |
|---|---|---|
| Initial stack size | ~2KB | ~1-8MB |
| Creation time | ~0.3 microseconds | ~10+ microseconds |
| Context switch | ~100-200ns (user space) | ~1-10 microseconds (kernel) |
| Maximum count | Millions | Thousands |
| Managed by | Go runtime | Operating system |
| Stack | Growable | Fixed |
The small initial stack is a key advantage. An OS thread typically reserves 1-8MB of stack space upfront (even if unused), limiting you to thousands of threads. Goroutines start with ~2KB stacks that grow and shrink as needed, allowing millions of concurrent goroutines on a single machine.
Goroutine stacks start small and grow dynamically. When a goroutine needs more stack space, the runtime detects the overflow via a check the compiler inserts in function prologues, allocates a larger stack (roughly doubling the size), copies the old stack into it, and updates any pointers into the stack. This is transparent to your code. Stacks can also shrink during garbage collection if they're using much less than allocated.
Go's scheduler uses the GMP model, named after its three core components:
Each goroutine (G) contains its own stack, its saved program counter and register state (captured while descheduled), and its current status (runnable, running, waiting, and so on).
An M is an OS thread. It executes goroutines and interacts with the operating system. The runtime creates M's as needed; there can be many of them, though at most GOMAXPROCS of them run Go code simultaneously.

A P is a logical processor context. It acts as a token for executing Go code: an M must acquire a P before it can run goroutines, each P owns a local run queue of runnable G's, and there are always exactly GOMAXPROCS P's.

You might wonder why we need P when we have M. The answer becomes clear when you consider what would happen without P. Let's walk through three scenarios.
Scenario 1: Without P, syscalls would waste CPUs
Imagine a goroutine making a blocking syscall (reading a file). Without the P abstraction, the OS thread would sit blocked in the kernel while its CPU went idle; keeping all cores busy would require over-provisioning threads, with no way to tell which ones were actually executing Go code.

With P, the runtime decouples the goroutine from the thread: when the M blocks in the syscall, it hands its P off to another (possibly new) M, which keeps running other goroutines. The blocked M costs only a parked thread, not a wasted CPU.
Scenario 2: Without P, work stealing would need global locks
Without local run queues attached to P's, all goroutines would sit in one global queue, and every M would contend on a single lock for every scheduling decision, a serious bottleneck at high core counts.

With P's local run queues, each M schedules from its own P's queue without locking in the common case, touching the global queue (and its lock) only occasionally.
Scenario 3: Without P, controlling parallelism would be awkward
How would you limit concurrency to 4 cores on an 8-core machine? Without P, you'd need to limit thread creation, but threads are also needed for syscalls.
With P, you simply set GOMAXPROCS=4: four P's exist, so at most four goroutines execute Go code at once, while the runtime remains free to create extra threads for blocking syscalls.
In essence, P is a "CPU token" that separates scheduling (G on P) from execution (M runs P). This decoupling is what makes Go's scheduler efficient.
GOMAXPROCS controls the number of P's, which determines the maximum number of goroutines executing simultaneously:
Or set it via the environment variable, e.g. `GOMAXPROCS=4 ./myapp`.
Default behavior: Since Go 1.5, GOMAXPROCS defaults to the number of available CPUs. Before that, it defaulted to 1.
When to change it: lower it in containers with CPU quotas (Go versions that aren't cgroup-aware see the host's CPU count, not the quota), or adjust it when profiling shows scheduling contention or underutilized cores. For most programs, the default is right.
GOMAXPROCS doesn't limit the number of goroutines or OS threads. It limits how many goroutines run simultaneously. You can have millions of goroutines with GOMAXPROCS=4; they just take turns running on those 4 P's.
When you write go f(), the runtime allocates a G (or reuses one from a free list), records f and its arguments, places the G on the current P's local run queue, and wakes an idle M/P pair if one is available.
Each M runs a scheduling loop: find a runnable G, execute it until it blocks, is preempted, or finishes, then go find the next one.
Finding a runnable G follows a priority order:

1. The P's runnext slot (a single slot for the next G to run, cache-friendly)
2. The P's local run queue
3. The global run queue
4. The netpoller, for goroutines whose I/O is ready
5. Stealing from other P's

When a P's local queue is empty, it steals work from other P's:
Work stealing takes half the victim's run queue, balancing load across processors.
Go uses preemption to prevent a single goroutine from monopolizing a P.
Cooperative preemption (before Go 1.14): a goroutine could only be preempted at function calls, where the compiler's stack check doubles as a safe point. A tight loop with no function calls could hog its P indefinitely.

Asynchronous preemption (Go 1.14+): the runtime sends a signal (SIGURG on Unix) to the thread running any goroutine that has held a P for roughly 10ms, interrupting it at nearly any instruction. Tight loops can no longer starve other goroutines.
A goroutine transitions through several states during its lifetime. Understanding these states helps with debugging and performance analysis.
| State | Triggers | Duration | Debugging Visibility |
|---|---|---|---|
| Runnable | go statement, unblocked from wait | Typically microseconds to milliseconds | Visible in pprof as "runnable", in GODEBUG as local/global queue |
| Running | Scheduled by P | Until blocked, preempted, or finished | Current goroutine in stack trace |
| Waiting | Channel op, mutex, I/O, sleep, select | Varies: nanoseconds to forever (leak!) | Visible in pprof with blocking reason |
| Dead | Return, panic, runtime.Goexit | Instant (becomes garbage) | Not visible; memory reclaimed |
Runnable state: The goroutine is ready to execute but waiting for a P. This happens when it was just created (go f()), was preempted, or was just unblocked while all P's were busy.

High runnable counts mean goroutines are competing for limited P's. Consider whether you're spawning too many goroutines or if GOMAXPROCS is too low.
Running state: The goroutine is actively executing on an M+P pair. Only GOMAXPROCS goroutines can be in this state simultaneously.
Waiting state: The goroutine is blocked, not consuming CPU. Common reasons:
- chan receive: Waiting for data on a channel
- chan send: Waiting for a receiver on a full/unbuffered channel
- select: Waiting for any case to be ready
- sync.Mutex.Lock: Waiting for a mutex
- sync.Cond.Wait: Waiting for a condition signal
- time.Sleep: Waiting for a timer
- IO wait: Network I/O via the netpoller
- syscall: Blocking syscall (M also blocked)

Debugging tip: In pprof goroutine profiles, the waiting reason shows why a goroutine is blocked. Look for goroutines stuck in chan receive with no matching sender: that's likely a leak.
When a goroutine blocks, it releases its M (and P) so others can run:
| Operation | Effect |
|---|---|
| Channel send (full buffer/no receiver) | G moves to channel's wait queue |
| Channel receive (empty buffer/no sender) | G moves to channel's wait queue |
| Mutex Lock (already locked) | G moves to mutex's wait queue |
| time.Sleep() | G moves to timer heap |
| I/O operation | M enters syscall, P handed off |
| runtime.Gosched() | G yields, moves to run queue |
System calls (file I/O, network I/O, etc.) require special handling because they block the OS thread:
When a goroutine makes a blocking syscall, the M enters the kernel still attached to its G. If the syscall doesn't return promptly, the runtime's sysmon background thread detaches the P from the blocked M and hands it to another (possibly new) M, so other goroutines keep running. When the syscall returns, the M tries to reacquire a P; if none is free, its G goes on a run queue and the M parks.
Network I/O is different from file I/O. While file reads truly block in the kernel, network operations can be made non-blocking with the right OS facilities. The netpoller is Go's integration with these facilities (epoll on Linux, kqueue on BSD/macOS, IOCP on Windows).
What the netpoller enables: goroutines use blocking-style APIs (e.g. conn.Read(buf)), but under the hood it's non-blocking and event-driven.

How it works: when a goroutine performs network I/O on a descriptor that isn't ready, the runtime registers the descriptor with the OS poller and parks the G. The M is immediately free to run other goroutines. When the descriptor becomes ready, the netpoller marks the G runnable and a P picks it up.
This is why Go handles thousands of concurrent network connections efficiently. The M's don't block on I/O.
A goroutine leak occurs when goroutines are created but never terminate. They consume memory and may hold resources:
1. Blocked channel operations:
Fix: Use a buffered channel or select with context.
The fix addresses the root cause: the goroutine blocks because its send has no receiver. We have two options:
Option 1: Buffered channel - The send can complete even without a receiver, because the buffer absorbs the value. The goroutine can then exit, and the buffered value is garbage collected later.
Option 2: Select with context - The goroutine watches for cancellation and exits cleanly when the context is done.
2. Infinite loops without exit:
Why this leaks: The goroutine has no exit condition. Even if the function that called startWorker() returns, the goroutine runs forever, consuming memory and potentially CPU.
Fix: Use context for cancellation. Context is the idiomatic way to signal "please stop" to goroutines. The goroutine checks ctx.Done() regularly and exits when cancelled.
3. Missing case in select:
Why this leaks: A bare receive (<-ch) blocks until a value arrives. If the sender crashes, the channel is abandoned, or the send never happens, this goroutine waits forever.
Fix: Add timeout or context. Every blocking operation should have an escape hatch. Use time.After for simple timeouts or ctx.Done() for cancellation that propagates through your call stack.
Important: time.After creates a timer that isn't garbage collected until it fires. In hot paths, use time.NewTimer and call timer.Stop() to avoid memory leaks from accumulated timers.
Using runtime.NumGoroutine():
Using goleak (Uber's library): call goleak.VerifyNone(t) at the end of a test (or goleak.VerifyTestMain(m) in TestMain), and the test fails if unexpected goroutines are still running when it finishes.
Using pprof:
Then visit http://localhost:6060/debug/pprof/goroutine?debug=1 to see all goroutines and their stack traces.
Using GODEBUG=schedtrace: running with GODEBUG=schedtrace=1000 makes the runtime print a one-line scheduler summary every second. Fields:

- gomaxprocs: Number of P's
- idleprocs: P's with no work
- threads: Total M's
- spinningthreads: M's looking for work
- runqueue: Global run queue size
- [...]: Local run queue sizes for each P

Using the execution tracer: view a captured trace with go tool trace trace.out
The trace shows goroutine creation, blocking, and unblocking events, GC activity, syscalls, and how work is distributed across P's over time.
Understanding the costs of goroutine operations helps you make informed design decisions.
| Metric | Value | Notes |
|---|---|---|
| Initial stack size | ~2 KB | Grows/shrinks dynamically |
| Maximum stack size | 1 GB (64-bit) | Runtime limit, configurable |
| Creation time | ~0.3 microseconds | Much faster than OS thread (~10 μs) |
| Goroutine struct | ~400 bytes | Runtime overhead per goroutine |
Practical implication: Creating a million goroutines uses about 2-3 GB of memory (stack + struct overhead). This is feasible for connection-per-goroutine servers, but watch memory usage under load.
| Operation | Cost | Notes |
|---|---|---|
| Goroutine switch | 100-200 ns | User space, minimal state |
| OS thread switch | 1-10 μs | Kernel mode, full context |
| Syscall (fast path) | ~100 ns | No actual kernel entry |
| Syscall (slow path) | ~1 μs | Enters kernel |
Why goroutine switches are fast: only a few registers (stack pointer, program counter, and a handful more) are saved and restored; the switch happens entirely in user space with no kernel transition; and the scheduler's data structures are usually hot in cache.
| Operation | Cost | Notes |
|---|---|---|
| Work stealing | ~200 ns | Per steal attempt |
| Global queue access | ~50 ns | Lock contention possible |
| Local queue push/pop | ~10 ns | Lock-free for owner |
| Netpoller check | ~100 ns | Amortized across many goroutines |
When to use more goroutines: I/O-bound work with many independent tasks (network servers, fan-out requests), where goroutines spend most of their time waiting rather than computing.

When to limit goroutines: CPU-bound work (more workers than GOMAXPROCS just adds switching overhead), memory pressure from per-goroutine stacks, or downstream resources (databases, APIs) that can't absorb unbounded concurrency.
Rule of thumb for worker pools:

- CPU-bound work: runtime.NumCPU() workers
- I/O-bound work: start with runtime.NumCPU() * 2, adjust based on profiling