Go Performance

Stack vs heap, escape analysis, pprof profiling, sync.Pool, goroutine leaks, and writing efficient Go.


Stack vs Heap

  Stack                              Heap
  ──────────────────────             ──────────────────────────────
  Per-goroutine (starts at 2 KB)     Shared across all goroutines
  Grows/shrinks automatically        Managed by the GC
  Fast: bump pointer allocation      Slower: GC must scan and collect
  Freed when function returns        Freed by the GC once unreachable

  Rule: if the compiler can prove a value's lifetime is bounded by the function call → stack
        if it escapes (returned by pointer, sent to a channel, stored) → heap
What causes heap allocation Escape
// Returning a pointer: the value outlives the call, so it moves to the heap
func newUser() *User {
    u := User{Name: "Alice"}  // u escapes
    return &u
}

// Storing in an interface: the concrete value is usually boxed on the heap
var i any = User{}  // User typically escapes

// Sending a pointer on a channel: escapes
ch <- &User{}

// Closure capturing a variable: x escapes if the closure outlives the
// function or, as here, x is passed to an interface parameter (fmt.Println)
x := 42
f := func() { fmt.Println(x) }  // x escapes
_ = f
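One way to confirm an escape rather than guess is to count allocations directly. The sketch below is illustrative only: the names newUserPtr, newUserVal, allocsPtr, and allocsVal are made up for this example, and //go:noinline is there so inlining does not hide the difference.

```go
package main

import (
	"fmt"
	"testing"
)

type User struct{ Name string }

// Package-level sinks keep the compiler from optimizing the calls away.
var (
	sinkPtr *User
	sinkVal User
)

//go:noinline
func newUserPtr() *User {
	u := User{Name: "Alice"}
	return &u // u outlives the call, so it is heap-allocated
}

//go:noinline
func newUserVal() User {
	return User{Name: "Alice"} // returned by value: copied, no heap allocation
}

// allocsPtr and allocsVal report the average heap allocations per call.
func allocsPtr() float64 {
	return testing.AllocsPerRun(1000, func() { sinkPtr = newUserPtr() })
}

func allocsVal() float64 {
	return testing.AllocsPerRun(1000, func() { sinkVal = newUserVal() })
}

func main() {
	fmt.Println("pointer return:", allocsPtr(), "allocs/call")
	fmt.Println("value return:  ", allocsVal(), "allocs/call")
}
```

On current toolchains this should report one allocation per call for the pointer version and none for the value version. testing.AllocsPerRun works in ordinary programs, not only under go test.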
Escape analysis: see the decisions -gcflags
# Show escape analysis decisions
go build -gcflags="-m" ./...

# More verbose (shows why)
go build -gcflags="-m=2" ./...

// Sample output (exact wording varies by Go version):
// ./main.go:8:2: moved to heap: u
// ./main.go:12:14: x escapes to heap
// ./main.go:5:17: User{} does not escape

# Disable inlining to see more escape detail
go build -gcflags="-m -l" ./...

pprof Profiling

Profile before you optimize. pprof shows where CPU time and memory are actually spent; without it, optimizations are guesswork. Always benchmark before and after a change to confirm impact.
HTTP pprof endpoint: always-on profiling net/http/pprof
import _ "net/http/pprof"  // registers /debug/pprof/ routes on http.DefaultServeMux

func main() {
    // Serve on a separate, loopback-only port (never on your public port)
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    // ... start your actual server
}

# Capture 30s CPU profile (quote the URL so the shell leaves "?" alone)
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"

# Heap profile
go tool pprof http://localhost:6060/debug/pprof/heap

# Goroutine dump
go tool pprof http://localhost:6060/debug/pprof/goroutine

# Open interactive web UI
go tool pprof -http=:8081 cpu.prof
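If the public server must not expose profiling at all, the handlers can be mounted on a private mux. This is a minimal sketch, not the only layout: newDebugMux, statusOf, and serveDebug are hypothetical names, and the localhost:6060 address is an assumption carried over from the snippet above.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/pprof"
)

// newDebugMux mounts the pprof handlers on a private mux. Importing
// net/http/pprof still registers them on http.DefaultServeMux as well,
// so the public server should use its own mux rather than nil.
func newDebugMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/debug/pprof/", pprof.Index) // also serves heap, goroutine, ...
	mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
	mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
	mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
	mux.HandleFunc("/debug/pprof/trace", pprof.Trace)
	return mux
}

// serveDebug blocks serving the debug mux; call it in its own goroutine.
func serveDebug(addr string) error {
	return http.ListenAndServe(addr, newDebugMux())
}

// statusOf issues an in-memory GET against h and returns the status code.
func statusOf(h http.Handler, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	return rec.Code
}

func main() {
	mux := newDebugMux()
	fmt.Println("index:    ", statusOf(mux, "/debug/pprof/"))
	fmt.Println("goroutine:", statusOf(mux, "/debug/pprof/goroutine?debug=1"))
	// In a real service: go serveDebug("localhost:6060")
}
```

The in-memory check in main avoids opening a port; in a service you would start serveDebug in a goroutine next to the real server.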
Programmatic profiling runtime/pprof
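import (
    "log"
    "os"
    "runtime"
    "runtime/pprof"
)

// CPU profile
f, err := os.Create("cpu.prof")
if err != nil { log.Fatal(err) }
defer f.Close()
if err := pprof.StartCPUProfile(f); err != nil { log.Fatal(err) }
defer pprof.StopCPUProfile()

// Heap profile (call after workload)
mf, err := os.Create("mem.prof")
if err != nil { log.Fatal(err) }
runtime.GC()  // run GC so the profile reflects live objects
if err := pprof.WriteHeapProfile(mf); err != nil { log.Fatal(err) }
mf.Close()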
Benchmark + pprof -cpuprofile
# Run a benchmark and write both profiles in one step
# (profile flags work with a single package only, so use . rather than ./...)
go test -bench=BenchmarkFoo \
    -cpuprofile=cpu.prof \
    -memprofile=mem.prof \
    .

# Analyze the CPU profile
go tool pprof cpu.prof

# Inside pprof REPL:
# top10       - top 10 functions by self (flat) time
# list Foo    - annotated source for functions matching Foo
# web         - open the call graph in a browser (needs Graphviz)

Reducing Allocations

sync.Pool: reuse temporary objects sync.Pool
// Pool holds temporary objects to reduce GC pressure
// Objects may be evicted at any time: never rely on state stored in them
var bufPool = sync.Pool{
    New: func() any {
        return new(bytes.Buffer)
    },
}

func buildResponse(data []byte) []byte {
    buf := bufPool.Get().(*bytes.Buffer)
    buf.Reset()                    // always reset before use
    defer bufPool.Put(buf)

    buf.Write(data)
    // Copy out: buf.Bytes() aliases memory the next Get may overwrite
    return append([]byte(nil), buf.Bytes()...)
}

// Before sync.Pool: a new Buffer and backing array on every call
// After  sync.Pool: the Buffer is reused; only the returned copy allocates
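To check that a pool actually pays off, compare allocation counts for a pooled and an unpooled variant. The sketch below is a rough illustration: renderFresh, renderPooled, and compare are hypothetical names, and the 4 KB payload is an arbitrary choice.

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
	"testing"
)

var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

var sink string // keeps results alive so the work is not optimized away

// renderFresh builds the output in a brand-new buffer on every call.
func renderFresh(data []byte) string {
	var buf bytes.Buffer
	buf.WriteString("payload=")
	buf.Write(data)
	return buf.String()
}

// renderPooled reuses a pooled buffer; String() copies the bytes out,
// so nothing returned to the caller aliases pooled memory.
func renderPooled(data []byte) string {
	buf := bufPool.Get().(*bytes.Buffer)
	buf.Reset()
	defer bufPool.Put(buf)
	buf.WriteString("payload=")
	buf.Write(data)
	return buf.String()
}

// compare returns average allocations per call for both variants.
func compare() (fresh, pooled float64) {
	data := bytes.Repeat([]byte("x"), 4096)
	fresh = testing.AllocsPerRun(500, func() { sink = renderFresh(data) })
	pooled = testing.AllocsPerRun(500, func() { sink = renderPooled(data) })
	return fresh, pooled
}

func main() {
	fresh, pooled := compare()
	fmt.Printf("fresh: %.0f allocs/call, pooled: %.0f allocs/call\n", fresh, pooled)
}
```

The pooled variant still allocates the returned string; what it saves is the buffer growth on each call. Exact counts depend on the Go version and on whether the GC clears the pool mid-run.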
strings.Builder: build strings with fewer allocations strings.Builder
// Concatenating with + allocates a new string on every iteration
var s string
for _, w := range words {
    s += w  // n allocations, O(n²) bytes copied
}

// strings.Builder: appends into one growing buffer; String() does not copy
var sb strings.Builder
sb.Grow(len(words) * 8)  // capacity hint (here: a guess of ~8 bytes per word)
for _, w := range words {
    sb.WriteString(w)
}
result := sb.String()
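When the final length can be computed up front, Grow can be given the exact size instead of a guess. A small sketch, with joinWords as a hypothetical helper (strings.Join already does this for the common case):

```go
package main

import (
	"fmt"
	"strings"
)

// joinWords concatenates words with sep, sizing the builder exactly once.
func joinWords(words []string, sep string) string {
	if len(words) == 0 {
		return ""
	}
	n := len(sep) * (len(words) - 1)
	for _, w := range words {
		n += len(w)
	}

	var sb strings.Builder
	sb.Grow(n) // one allocation large enough for the whole result
	for i, w := range words {
		if i > 0 {
			sb.WriteString(sep)
		}
		sb.WriteString(w)
	}
	return sb.String() // returns the accumulated bytes without copying
}

func main() {
	fmt.Println(joinWords([]string{"stack", "heap", "pool"}, ", "))
	// prints "stack, heap, pool"
}
```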
Pre-allocate slices and maps make with cap
// Without pre-alloc: append regrows and copies the backing array many times
var results []int
for i := range items {
    results = append(results, process(items[i]))
}

// With pre-alloc: zero reallocations
results = make([]int, 0, len(items))
for i := range items {
    results = append(results, process(items[i]))
}

// Pre-alloc map
m := make(map[string]int, len(keys))
for _, k := range keys {
    m[k]++
}
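The effect of pre-allocation can be observed by watching cap change during an append loop. In this sketch countGrowths is a hypothetical helper; the growth count without pre-allocation depends on the runtime's growth policy, so only the pre-allocated result (zero) is fixed.

```go
package main

import "fmt"

// countGrowths appends n ints to s and reports how many times append
// had to move to a larger backing array (observed as a change in cap).
func countGrowths(s []int, n int) int {
	growths := 0
	for i := 0; i < n; i++ {
		before := cap(s)
		s = append(s, i)
		if cap(s) != before {
			growths++
		}
	}
	return growths
}

func main() {
	const n = 10000
	fmt.Println("no pre-alloc:", countGrowths(nil, n), "growths")
	fmt.Println("pre-alloc:   ", countGrowths(make([]int, 0, n), n), "growths")
}
```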

Inlining

The Go compiler inlines small functions at their call sites, eliminating call overhead and enabling further optimizations such as constant folding and tighter escape analysis. A function is inlineable if its estimated cost, roughly a count of AST nodes, fits within the inline budget (80 by default). Functions that call recover() or exceed the budget are not inlined; the exact rules change between Go releases.
Check what gets inlined -gcflags -m
# -m shows inlining decisions alongside escape analysis
go build -gcflags="-m" ./...

// Output examples:
// ./math.go:5:6: can inline Add
// ./math.go:12:6: cannot inline Process: function too complex
// ./main.go:20:12: inlining call to Add
Writing inlineable functions Small functions
// Inlineable: simple, small, no recover
// (named maxInt to avoid shadowing the max builtin added in Go 1.21)
func maxInt(a, b int) int {
    if a > b { return a }
    return b
}

// Typically not inlined: uses defer and recover
func safeDiv(a, b int) (result int) {
    defer func() {
        if r := recover(); r != nil { result = 0 }
    }()
    return a / b
}

// Force no inlining (useful for benchmarks)
//go:noinline
func doWork() {}

Goroutine Leaks

A leaked goroutine is one that is blocked indefinitely: waiting on a channel that will never be written to or read from, or on a mutex that will never be released. A leaked goroutine holds its stack and everything it references for the life of the process. In servers, leaks accumulate over time and can end in OOM crashes.
Common leak patterns and fixes Leak patterns
// ✗ LEAK: goroutine blocks on send if nobody reads
ch := make(chan int)  // unbuffered
go func() {
    result := doWork()
    ch <- result  // blocks forever if caller already returned
}()
return  // goroutine is now leaked

// ✓ FIX: buffered channel, so the single send never blocks
ch := make(chan int, 1)

// ✓ FIX: use context to stop the goroutine
go func() {
    select {
    case ch <- doWork():  // doWork still runs to completion first
    case <-ctx.Done():    // exits if context is cancelled
    }
}()
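Put together, the buffered-channel fix might look like the sketch below. The names resultOrTimeout and demo, and the 10 ms / 50 ms timings, are assumptions made for illustration; demo polls runtime.NumGoroutine to show that the abandoned worker exits on its own.

```go
package main

import (
	"context"
	"fmt"
	"runtime"
	"time"
)

// resultOrTimeout runs work in a goroutine and waits for its result or for
// ctx to end. The channel has capacity 1, so the worker's send completes
// even when nobody is left to receive, and the goroutine can exit.
func resultOrTimeout(ctx context.Context, work func() int) (int, error) {
	ch := make(chan int, 1)
	go func() { ch <- work() }()

	select {
	case v := <-ch:
		return v, nil
	case <-ctx.Done():
		return 0, ctx.Err()
	}
}

// demo abandons a slow worker, then reports whether the call timed out and
// how many extra goroutines remain once the worker has had time to finish.
func demo() (timedOut bool, extra int) {
	before := runtime.NumGoroutine()

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Millisecond)
	defer cancel()
	slow := func() int { time.Sleep(50 * time.Millisecond); return 42 }
	_, err := resultOrTimeout(ctx, slow)

	// Poll briefly: the abandoned worker needs a little longer to exit.
	deadline := time.Now().Add(2 * time.Second)
	for runtime.NumGoroutine() > before && time.Now().Before(deadline) {
		time.Sleep(5 * time.Millisecond)
	}
	extra = runtime.NumGoroutine() - before
	if extra < 0 {
		extra = 0
	}
	return err != nil, extra
}

func main() {
	timedOut, extra := demo()
	fmt.Println("timed out:", timedOut, "extra goroutines:", extra)
}
```

With an unbuffered channel in resultOrTimeout, the worker would stay blocked on its send after the timeout and the extra count would remain at one.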
Detect leaks in tests with goleak goleak
import "go.uber.org/goleak"

func TestMain(m *testing.M) {
    goleak.VerifyTestMain(m)
}

// Or per test:
func TestWorker(t *testing.T) {
    defer goleak.VerifyNone(t)
    // test code here
    // goleak.VerifyNone fails if any goroutine leaked
}
Count goroutines at runtime runtime
// Quick check: current goroutine count (heuristic; unrelated goroutines can skew it)
before := runtime.NumGoroutine()
doSomething()
after := runtime.NumGoroutine()

if after > before {
    t.Errorf("goroutine leak: before=%d after=%d", before, after)
}

// pprof goroutine profile shows stack traces
// of all live goroutines โ€” useful to identify what's blocked

Memory Efficiency

Struct field ordering: reduce padding Alignment
// Bad: compiler inserts 14 bytes of padding (7 after a, 7 after c)
type Bad struct {
    a bool    // 1 byte
             // 7 bytes padding
    b int64   // 8 bytes
    c bool    // 1 byte
             // 7 bytes padding
}  // total: 24 bytes

// Good: largest fields first
type Good struct {
    b int64   // 8 bytes
    a bool    // 1 byte
    c bool    // 1 byte
             // 6 bytes padding
}  // total: 16 bytes

// Check size: unsafe.Sizeof(Bad{}) → 24
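The layout can be checked directly with unsafe.Sizeof and unsafe.Offsetof. This sketch assumes a 64-bit platform, where int64 is 8-byte aligned; sizes is a hypothetical helper added for the example.

```go
package main

import (
	"fmt"
	"unsafe"
)

type Bad struct {
	a bool
	b int64
	c bool
}

type Good struct {
	b int64
	a bool
	c bool
}

// sizes returns the in-memory size of each layout in bytes.
func sizes() (bad, good uintptr) {
	return unsafe.Sizeof(Bad{}), unsafe.Sizeof(Good{})
}

func main() {
	bad, good := sizes()
	fmt.Println("Bad: ", bad, "bytes; b starts at offset", unsafe.Offsetof(Bad{}.b))
	fmt.Println("Good:", good, "bytes; a starts at offset", unsafe.Offsetof(Good{}.a))
	// On 64-bit platforms: Bad is 24 bytes, Good is 16 bytes.
}
```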
Value types vs pointer types Values
// Prefer values for small structs (≤ ~3 words)
// Copying a few words is usually at least as cheap as chasing a pointer
type Point struct{ X, Y float64 }
func distance(p Point) float64 { ... }  // value: stays on stack

// Use pointers for large structs to avoid copying
type Config struct{ ... }  // 100+ fields
func process(c *Config) { ... }  // pointer: single 8-byte copy

// Slice of values vs slice of pointers:
[]Point   // contiguous memory โ€” cache friendly
[]*Point  // scattered heap objects โ€” more GC work
Avoiding fmt.Sprintf for hot paths Allocations
// fmt.Sprintf allocates at least the result string (it also uses reflection internally)
msg := fmt.Sprintf("user %d not found", id)  // alloc

// strconv.Append* writes into a caller-supplied buffer: no allocation if it fits
var buf [32]byte
b := strconv.AppendInt(buf[:0], int64(id), 10)  // no alloc

// errors.New at package level creates one sentinel error value: safe to reuse
var ErrNotFound = errors.New("not found")
// Return ErrNotFound instead of fmt.Errorf("not found") in hot paths

// Only use fmt.Errorf (with %w) when wrapping with context at boundaries
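The same append style extends to whole messages. A minimal sketch, where appendNotFound is a hypothetical helper that rebuilds the "user %d not found" text without fmt:

```go
package main

import (
	"fmt"
	"strconv"
)

// appendNotFound appends "user <id> not found" to dst and returns the
// extended slice. It allocates only if dst lacks capacity.
func appendNotFound(dst []byte, id int) []byte {
	dst = append(dst, "user "...)
	dst = strconv.AppendInt(dst, int64(id), 10)
	return append(dst, " not found"...)
}

func main() {
	var scratch [64]byte // fixed-size scratch buffer, reusable across calls
	msg := appendNotFound(scratch[:0], 42)
	fmt.Printf("%s\n", msg) // prints "user 42 not found"
}
```

This only pays off when the bytes are consumed directly, for example written to an io.Writer; converting them to a string allocates again.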

Reading Benchmark Output

  go test -bench=. -benchmem ./...

  BenchmarkEncode-8    1234567    42.3 ns/op    64 B/op    1 allocs/op
  │               │    │          │             │          └── heap allocations per op
  │               │    │          │             └───────────── bytes allocated per op
  │               │    │          └─────────────────────────── average wall time per operation
  │               │    └────────────────────────────────────── iterations run
  │               └─────────────────────────────────────────── GOMAXPROCS for the run (defaults to CPU count)
  └─────────────────────────────────────────────────────────── benchmark name
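The same columns are available programmatically through testing.Benchmark, which can help when experimenting outside go test. In this sketch encode is a stand-in workload and runEncodeBenchmark a hypothetical wrapper; neither comes from a real codebase.

```go
package main

import (
	"fmt"
	"strconv"
	"testing"
)

var sink string // package-level sink so the result escapes and is not elided

// encode is a stand-in workload; the name and body are illustrative only.
func encode(id int) string {
	return "user-" + strconv.Itoa(id)
}

// runEncodeBenchmark runs the workload under the standard benchmark harness.
func runEncodeBenchmark() testing.BenchmarkResult {
	return testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs()
		for i := 0; i < b.N; i++ {
			sink = encode(1_000_000 + i)
		}
	})
}

func main() {
	res := runEncodeBenchmark()
	fmt.Println("iterations:", res.N)
	fmt.Println("ns/op:     ", res.NsPerOp())
	fmt.Println("B/op:      ", res.AllocedBytesPerOp())
	fmt.Println("allocs/op: ", res.AllocsPerOp())
	// The same columns as the go test output line:
	fmt.Println(res.String(), res.MemString())
}
```

The run takes about a second, the default benchmark time. The numbers vary by machine, so compare runs on the same hardware.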
benchstat: compare two runs benchstat
# Install
go install golang.org/x/perf/cmd/benchstat@latest

# Run before and after, save output
go test -bench=. -count=5 ./... > before.txt
# (make your change)
go test -bench=. -count=5 ./... > after.txt

# Compare
benchstat before.txt after.txt

# Output shows % change and p-value (illustrative; layout varies by benchstat version):
# BenchmarkFoo  42.3ns ± 2%  31.1ns ± 1%  -26.5% (p=0.008)
What to look for Interpretation
// High allocs/op → look for heap escapes,
//   string concatenation, interface conversions

// High B/op with low allocs/op → large objects,
//   consider pooling or in-place updates

// High ns/op with 0 allocs → CPU-bound;
//   profile with -cpuprofile to find hot spots

// Noisy results (wide ± range) → run with -count=10,
//   close background apps, use benchstat for stats

// Always measure the whole system; microbenchmarks
// can be misleading if the hot path is elsewhere

Quick Reference

  Concept                 Command / Call                          Notes
  ──────────────────────  ──────────────────────────────────────  ────────────────────────────────────────────
  Escape analysis         go build -gcflags="-m" ./...            Shows what escapes to heap
  Verbose escape          go build -gcflags="-m=2" ./...          Shows why each decision was made
  Disable inlining        go build -gcflags="-l" ./...            For debugging; not for production
  Inlining decisions      go build -gcflags="-m" ./...            "can inline" / "inlining call to"
  CPU profile (HTTP)      /debug/pprof/profile?seconds=30         Requires import _ "net/http/pprof"
  Heap profile (HTTP)     /debug/pprof/heap                       Current in-use allocations
  Goroutine dump          /debug/pprof/goroutine                  Stack traces of all goroutines
  Analyze profile         go tool pprof -http=:8081 cpu.prof      Web UI with graph and flame graph views
  Bench + profile         go test -bench=. -cpuprofile=cpu.prof   Profile a benchmark (single package only)
  Compare benchmarks      benchstat before.txt after.txt          golang.org/x/perf/cmd/benchstat
  Object pool             sync.Pool{New: func() any { … }}        GC may evict any time; always Reset
  String build            strings.Builder + Grow(n)               One up-front allocation if Grow(n) covers the output
  Pre-alloc slice         make([]T, 0, n)                         Avoids reallocation in append loop
  Pre-alloc map           make(map[K]V, n)                        Fewer rehash events
  Struct size             unsafe.Sizeof(T{})                      Order fields largest-first to reduce padding
  Goroutine count         runtime.NumGoroutine()                  Monitor for leaks
  Detect leaks in tests   goleak.VerifyNone(t)                    go.uber.org/goleak
  Force no inlining       //go:noinline directive                 Above the function declaration
  Alloc-free int→str      strconv.AppendInt(buf[:0], n, 10)       Appends to existing buffer
  GC before profile       runtime.GC()                            Accurate heap snapshot