Go Performance — Ultimate Handbook

🏗️

Stack vs Heap

allocation · escape analysis · GC

  Stack                              Heap
  ──────────────────────             ──────────────────────────────
  Per-goroutine (starts at 2KB)      Shared across all goroutines
  Grows/shrinks automatically        Managed by the GC
  Fast: bump pointer allocation      Slower: GC must scan and collect
  Freed when function returns        Freed when no more references

  Rule: if the compiler can prove a value does not outlive the call → stack
        if it may outlive it (pointer returned, sent on a channel, stored) → heap

What causes heap allocation Escape

// Returning a pointer — value escapes to heap
func newUser() *User {
    u := User{Name: "Alice"}  // u escapes
    return &u
}

// Storing in an interface: the value is boxed, and the box escapes if the interface does
var i any = User{}  // heap-allocated when i escapes

// Sending pointer on channel — escapes
ch <- &User{}

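// Closure capturing a variable: escapes if the closure outlives the function
// or, as here, the variable is passed on as an interface argument
x := 42
f := func() { fmt.Println(x) }  // x escapes via fmt.Println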
_ = f
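
The difference between returning a pointer and returning a value can be measured rather than assumed. The sketch below is illustrative only: the names newUserPtr, newUserVal, allocsPtr, and allocsVal are hypothetical, and //go:noinline is used so that inlining does not hide the allocation. It relies on testing.AllocsPerRun, which also works outside of go test.

```go
package main

import (
	"fmt"
	"testing"
)

type User struct{ Name string }

// Package-level sinks keep the compiler from discarding the results.
var (
	sinkPtr *User
	sinkVal User
)

// newUserPtr returns a pointer to a local, so u must outlive the call
// and is moved to the heap.
//
//go:noinline
func newUserPtr() *User {
	u := User{Name: "Alice"}
	return &u
}

// newUserVal returns a copy, so nothing has to outlive the call.
//
//go:noinline
func newUserVal() User {
	return User{Name: "Alice"}
}

// allocsPtr reports the average heap allocations per call of newUserPtr.
func allocsPtr() float64 {
	return testing.AllocsPerRun(100, func() { sinkPtr = newUserPtr() })
}

// allocsVal reports the average heap allocations per call of newUserVal.
func allocsVal() float64 {
	return testing.AllocsPerRun(100, func() { sinkVal = newUserVal() })
}

func main() {
	fmt.Println("pointer return allocs/op:", allocsPtr())
	fmt.Println("value return allocs/op:  ", allocsVal())
}
```

Building the same file with -gcflags="-m" should show the compiler's view of the same decision.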

Escape analysis — see the decisions -gcflags

# Show escape analysis decisions
go build -gcflags="-m" ./...

# More verbose (shows why)
go build -gcflags="-m=2" ./...

// Sample output:
// ./main.go:8:2: moved to heap: u
// ./main.go:12:14: x escapes to heap
// ./main.go:5:17: User{} does not escape

# Disable inlining to see more escape detail
go build -gcflags="-m -l" ./...

🔬

pprof Profiling

CPU · memory · goroutine · go tool pprof

ℹ️

Profile before you optimize. pprof tells you exactly where CPU time and memory are spent — without it, optimizations are guesswork. Always benchmark before and after a change to confirm impact.

HTTP pprof endpoint — always-on profiling net/http/pprof

import (
    "log"
    "net/http"
    _ "net/http/pprof"  // registers /debug/pprof/ routes on http.DefaultServeMux
)

func main() {
    // Expose on a separate, loopback-only port (never on your public port)
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    // ... start your actual server
}

# Capture 30s CPU profile
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"

# Heap profile
go tool pprof http://localhost:6060/debug/pprof/heap

# Goroutine dump
go tool pprof http://localhost:6060/debug/pprof/goroutine

# Open interactive web UI
go tool pprof -http=:8081 cpu.prof
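
If the public server might share http.DefaultServeMux, one option is to mount the pprof handlers on a dedicated mux. The following is a minimal sketch, not the only way to do it: newDebugMux and probe are hypothetical helper names, and probe uses httptest only so the example terminates instead of listening forever. Note that importing net/http/pprof still registers its routes on http.DefaultServeMux, so this helps only when the public server uses its own mux as well.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
)

// newDebugMux returns a mux that serves only the pprof handlers.
func newDebugMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index) // index and named profiles (heap, goroutine, ...)
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile) // CPU profile
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// probe sends an in-memory GET to the debug mux and returns the status code.
func probe(path string) int {
	rec := httptest.NewRecorder()
	newDebugMux().ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	fmt.Println("GET /debug/pprof/ ->", probe("/debug/pprof/"))
	// In a real service, something like:
	//   log.Fatal(http.ListenAndServe("localhost:6060", newDebugMux()))
}
```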

Programmatic profiling runtime/pprof

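import "runtime/pprof"  // also needs "log", "os", "runtime"

// CPU profile
f, err := os.Create("cpu.prof")
if err != nil {
    log.Fatal(err)
}
defer f.Close()
if err := pprof.StartCPUProfile(f); err != nil {
    log.Fatal(err)
}
defer pprof.StopCPUProfile()

// Heap profile (call after workload)
mf, err := os.Create("mem.prof")
if err != nil {
    log.Fatal(err)
}
defer mf.Close()
runtime.GC()  // flush recent frees so the heap stats are up to date
if err := pprof.WriteHeapProfile(mf); err != nil {
    log.Fatal(err)
}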

Benchmark + pprof -cpuprofile

# Run benchmark and produce profiles in one step
# (profile flags work on one package at a time, so target a single package)
go test -bench=BenchmarkFoo \
    -cpuprofile=cpu.prof \
    -memprofile=mem.prof \
    .

# Analyze the CPU profile
go tool pprof cpu.prof

# Inside pprof REPL:
# top10       top 10 functions by flat (self) time
# list Foo    annotated source for function Foo
# web         open the call graph as SVG in a browser (needs Graphviz)

♻️

Reducing Allocations

sync.Pool · strings.Builder · pre-alloc

sync.Pool — reuse temporary objects sync.Pool

// Pool holds temporary objects to reduce GC pressure
// Objects may be evicted at any time — never store state in them
var bufPool = sync.Pool{
    New: func() any {
        return new(bytes.Buffer)
    },
}

func buildResponse(data []byte) []byte {
    buf := bufPool.Get().(*bytes.Buffer)
    buf.Reset()                    // always reset before use
    defer bufPool.Put(buf)

    buf.Write(data)
    // Copy out: buf.Bytes() aliases memory that the next Get may overwrite
    return append([]byte(nil), buf.Bytes()...)
}

// Before sync.Pool: a new Buffer (and its backing array) per call
// After  sync.Pool: the Buffer is reused; only the returned copy allocates

strings.Builder — build strings without allocs strings.Builder

// Concatenating with + allocates a new string on every iteration
var s string
for _, w := range words {
    s += w  // O(n) allocations, O(n²) bytes copied
}

// strings.Builder: amortized growth, and String() returns without copying
var sb strings.Builder
sb.Grow(len(words) * 8)  // capacity hint (here assuming ~8 bytes per word)
for _, w := range words {
    sb.WriteString(w)
}
result := sb.String()
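
When the inputs are already in memory, the capacity hint can be exact rather than estimated. A minimal sketch, with concat as a hypothetical helper name: it sums the lengths first so that Grow reserves the final size in one step.

```go
package main

import (
	"fmt"
	"strings"
)

// concat joins words using a Builder sized exactly from the input.
func concat(words []string) string {
	n := 0
	for _, w := range words {
		n += len(w)
	}
	var sb strings.Builder
	sb.Grow(n) // reserve the final size up front; the writes below never regrow
	for _, w := range words {
		sb.WriteString(w)
	}
	return sb.String() // returns the accumulated bytes without copying
}

func main() {
	fmt.Println(concat([]string{"go", "pher"})) // prints "gopher"
}
```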

Pre-allocate slices and maps make with cap

// Without pre-alloc: the backing array is reallocated and copied each time it fills up
var results []int
for i := range items {
    results = append(results, process(items[i]))
}

// With pre-alloc: one allocation up front, no reallocations in the loop
results = make([]int, 0, len(items))
for i := range items {
    results = append(results, process(items[i]))
}

// Pre-alloc map
m := make(map[string]int, len(keys))
for _, k := range keys {
    m[k]++
}
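
The effect of the capacity argument can be observed by watching cap change across appends. The helper below, growths, is a hypothetical name used only for this sketch; it counts how many times append had to move to a larger backing array.

```go
package main

import "fmt"

// growths appends n ints to a slice and counts how often the capacity
// changes, which happens each time append allocates a new backing array.
func growths(n int, prealloc bool) int {
	var s []int
	if prealloc {
		s = make([]int, 0, n)
	}
	count := 0
	for i := 0; i < n; i++ {
		before := cap(s)
		s = append(s, i)
		if cap(s) != before {
			count++
		}
	}
	return count
}

func main() {
	fmt.Println("without pre-alloc:", growths(1000, false))
	fmt.Println("with pre-alloc:   ", growths(1000, true))
}
```

The exact count without pre-allocation depends on the runtime's growth policy, which has changed between Go releases.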

⚡

Inlining

-gcflags · inline budget · noinline

ℹ️

The Go compiler inlines small functions at their call sites, eliminating call overhead and enabling further optimizations across the call boundary. A function is inlineable if its estimated cost is within the inline budget (80 nodes in the gc compiler). Functions that use defer or recover(), or that exceed the budget, are not inlined; the exact rules change between Go releases.

Check what gets inlined -gcflags -m

# -m shows inlining decisions alongside escape analysis
go build -gcflags="-m" ./...

// Output examples ("cannot inline" lines appear with -m=2):
// ./math.go:5:6: can inline Add
// ./math.go:12:6: cannot inline Process: function too complex
// ./main.go:20:12: inlining call to Add

Writing inlineable functions Small functions

// Inlineable: simple, small, no defer or recover
// (named maxInt so it does not shadow the max builtin added in Go 1.21)
func maxInt(a, b int) int {
    if a > b {
        return a
    }
    return b
}

// Not inlineable: uses defer and recover
func safeDiv(a, b int) (result int) {
    defer func() {
        if r := recover(); r != nil { result = 0 }
    }()
    return a / b
}

// Force no inlining (useful for benchmarks)
//go:noinline
func doWork() {}
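
To see the directive's effect, the same body can be compiled twice, once with and once without //go:noinline. This is a small sketch with hypothetical function names; building it with -gcflags="-m" is expected to report add as inlineable and to stay silent about addNoInline.

```go
package main

import "fmt"

// add is small enough that the compiler should report "can inline add".
func add(a, b int) int { return a + b }

// addNoInline has the same body but opts out of inlining, so every call
// is a real function call. That is useful when a benchmark is meant to
// include the call overhead.
//
//go:noinline
func addNoInline(a, b int) int { return a + b }

func main() {
	fmt.Println(add(2, 3), addNoInline(2, 3)) // prints "5 5"
}
```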

🚰

Goroutine Leaks

blocked goroutines · goleak · runtime

⚠️

A leaked goroutine is one that blocks indefinitely, for example waiting on a channel that will never be written to or a mutex that will never be released. Its stack, and everything reachable from it, stays in memory until the process exits. In servers, leaks compound over time and can end in OOM crashes.

Common leak patterns and fixes Leak patterns

// ✗ LEAK: goroutine blocks on send if nobody reads
ch := make(chan int)  // unbuffered
go func() {
    result := doWork()
    ch <- result  // blocks forever if caller already returned
}()
return  // goroutine is now leaked

// ✓ FIX: buffered channel or context cancellation
ch := make(chan int, 1)  // buffered: the single send completes even with no receiver

// ✓ FIX: use context to stop the goroutine
go func() {
    select {
    case ch <- doWork():
    case <-ctx.Done():  // exits if context is cancelled
    }
}()

Detect leaks in tests with goleak goleak

import "go.uber.org/goleak"

func TestMain(m *testing.M) {
    goleak.VerifyTestMain(m)
}

// Or per test:
func TestWorker(t *testing.T) {
    defer goleak.VerifyNone(t)
    // test code here
    // goleak.VerifyNone fails if any goroutine leaked
}

Count goroutines at runtime runtime

// Crude check: compare goroutine counts (can be flaky, because goroutines that are still exiting are counted too)
before := runtime.NumGoroutine()
doSomething()
after := runtime.NumGoroutine()

if after > before {
    t.Errorf("goroutine leak: before=%d after=%d", before, after)
}

// pprof goroutine profile shows stack traces
// of all live goroutines — useful to identify what's blocked

🧠

Memory Efficiency

struct layout · value types · GC pressure

Struct field ordering — reduce padding Alignment

// Bad: compiler inserts 14 bytes of padding (7 + 7)
type Bad struct {
    a bool    // 1 byte
             // 7 bytes padding
    b int64   // 8 bytes
    c bool    // 1 byte
             // 7 bytes padding
}  // total: 24 bytes

// Good: largest fields first
type Good struct {
    b int64   // 8 bytes
    a bool    // 1 byte
    c bool    // 1 byte
             // 6 bytes padding
}  // total: 16 bytes

// Check size (64-bit platforms): unsafe.Sizeof(Bad{}) → 24, unsafe.Sizeof(Good{}) → 16

Value types vs pointer types Values

// Prefer values for small structs (≤ ~3 words)
// Copying a few words is cheap and lets the value stay on the stack
type Point struct{ X, Y float64 }
func distance(p Point) float64 { ... }  // value: stays on stack

// Use pointers for large structs to avoid copying
type Config struct{ ... }  // 100+ fields
func process(c *Config) { ... }  // pointer: single 8-byte copy

// Slice of values vs slice of pointers:
[]Point   // contiguous memory — cache friendly
[]*Point  // scattered heap objects — more GC work

Avoiding fmt.Sprintf for hot paths Allocations

// fmt.Sprintf allocates the result string, and passing args as interfaces can allocate too
msg := fmt.Sprintf("user %d not found", id)  // alloc

// strconv.Append* writes into a caller-supplied buffer, so it need not allocate
var buf [32]byte
b := strconv.AppendInt(buf[:0], int64(id), 10)  // no alloc

// errors.New allocates once at package init; the sentinel is safe to reuse
var ErrNotFound = errors.New("not found")
// Return ErrNotFound instead of fmt.Errorf("not found") in hot paths

// Only use fmt.Errorf (with %w) when wrapping with context at boundaries
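
The same "user %d not found" message can be assembled with append and strconv.AppendInt when it ends up in a byte buffer anyway. A minimal sketch, with appendNotFound as a hypothetical helper name:

```go
package main

import (
	"fmt"
	"strconv"
)

// appendNotFound appends "user <id> not found" to dst and returns the
// extended slice. It allocates only if dst lacks the capacity.
func appendNotFound(dst []byte, id int) []byte {
	dst = append(dst, "user "...)
	dst = strconv.AppendInt(dst, int64(id), 10)
	return append(dst, " not found"...)
}

func main() {
	var scratch [64]byte // caller-owned scratch space
	msg := appendNotFound(scratch[:0], 42)
	fmt.Println(string(msg)) // prints "user 42 not found"
}
```

Converting the result to a string, as main does here for printing, does allocate; the saving applies when the bytes are written out directly.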

📊

Reading Benchmark Output

ns/op · B/op · allocs/op · benchstat

  go test -bench=. -benchmem ./...

  BenchmarkEncode-8    1234567    42.3 ns/op    64 B/op    1 allocs/op
  │               │    │          │              │          └── heap allocations per op
  │               │    │          │              └──────────── bytes allocated per op
  │               │    │          └─────────────────────────── wall time per operation
  │               │    └────────────────────────────────────── iterations run
  │               └─────────────────────────────────────────── GOMAXPROCS (CPU count)
  └─────────────────────────────────────────────────────────── benchmark name
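
The same columns are available programmatically through testing.Benchmark, which runs a benchmark function outside of go test. In a real project the function would normally live in a _test.go file; the workload and the name benchmarkRepeat are hypothetical and serve only to produce some numbers.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// sink keeps the compiler from discarding the result.
var sink string

// benchmarkRepeat is a stand-in workload that builds a 64-byte string.
func benchmarkRepeat(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		sink = strings.Repeat("x", 64)
	}
}

func main() {
	r := testing.Benchmark(benchmarkRepeat)
	fmt.Println("iterations:", r.N)                   // second column
	fmt.Println("ns/op:     ", r.NsPerOp())           // time per operation
	fmt.Println("B/op:      ", r.AllocedBytesPerOp()) // bytes allocated per op
	fmt.Println("allocs/op: ", r.AllocsPerOp())       // heap allocations per op
}
```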

benchstat — compare two runs benchstat

# Install
go install golang.org/x/perf/cmd/benchstat@latest

# Run before and after, save output
go test -bench=. -count=5 ./... > before.txt
# (make your change)
go test -bench=. -count=5 ./... > after.txt

# Compare
benchstat before.txt after.txt

# Output shows % change and p-value:
# BenchmarkFoo  42.3ns ± 2%  31.1ns ± 1%  -26.5% (p=0.008)

What to look for Interpretation

// High allocs/op → look for heap escapes,
//   string concatenation, interface conversions

// High B/op with low allocs/op → large objects,
//   consider pooling or in-place updates

// High ns/op with 0 allocs → CPU-bound;
//   profile with -cpuprofile to find hot spots

// Noisy results (wide ± range) → run with -count=10,
//   close background apps, use benchstat for stats

// Always measure the whole system; microbenchmarks
// can be misleading if the hot path is elsewhere

📋

Quick Reference

Cheat-sheet

Concept	Command / Call	Notes
Escape analysis	go build -gcflags="-m" ./...	Shows what escapes to heap
Verbose escape	go build -gcflags="-m=2" ./...	Shows why each decision was made
Disable inlining	go build -gcflags="-l" ./...	For debugging; not for production
Inlining decisions	go build -gcflags="-m" ./...	"can inline" / "inlining call to"
CPU profile (HTTP)	/debug/pprof/profile?seconds=30	Requires import _ "net/http/pprof"
Heap profile (HTTP)	/debug/pprof/heap	Current in-use allocations
Goroutine dump	/debug/pprof/goroutine	Stack traces of all goroutines
Analyze profile	go tool pprof -http=:8081 cpu.prof	Web flame graph
Bench + profile	go test -bench=. -cpuprofile=cpu.prof	Profile a benchmark
Compare benchmarks	benchstat before.txt after.txt	golang.org/x/perf/cmd/benchstat
Object pool	sync.Pool{New: func() any { … }}	GC may evict any time; always Reset
String build	strings.Builder + Grow(n)	Grow(n) pre-sizes the buffer; String() does not copy
Pre-alloc slice	make([]T, 0, n)	Avoids reallocation in append loop
Pre-alloc map	make(map[K]V, n)	Fewer rehash events
Struct size	unsafe.Sizeof(T{})	Order fields largest-first to reduce padding
Goroutine count	runtime.NumGoroutine()	Monitor for leaks
Detect leaks in tests	goleak.VerifyNone(t)	go.uber.org/goleak
Force no inlining	//go:noinline directive	Above the function declaration
Alloc-free int→str	strconv.AppendInt(buf[:0], n, 10)	Appends to existing buffer
GC before profile	runtime.GC()	Accurate heap snapshot