A ScyllaDB Community
Go Faster: Tuning The Go
Runtime For Latency And
Throughput
Paweł Obrępalski
Staff Engineer
Paweł Obrępalski (he/him)
Staff Engineer @ ShareChat
■ Aerospace researcher turned software engineer
■ Focused on large scale recommender systems
■ Leading delivery team at ShareChat
■ Enjoys biking, gym and sauna
What will we talk about today?
■ Go Runtime
● Scheduler
● Garbage Collection
■ Observability
● Metrics
● Profiles
■ Runtime Tuning
■ Our Results
The Runtime: Your Silent Partner
■ Benefits
● Effortless concurrency - can manage millions of goroutines
● Automatic memory management
● Cross-platform compatibility
■ Costs
● ~2MB increase in binary size
● Additional startup latency
● Garbage Collection overhead (usually 1-3% CPU)
■ The default behaviour is sensible for most of the workloads
● Check your code before runtime optimisations
● Can tune the behaviour by changing environment variables: GOMAXPROCS, GOGC, and GOMEMLIMIT
Multiplexing At Scale
■ G-M-P model
● G: Goroutines - lightweight threads (2KB stack initially)
● M: OS threads - created as needed, reused later
● P: Processors - fixed by GOMAXPROCS
Run Queues
■ New goroutine:
● Put on the P's local run queue (max 256)
● If full: move half to the global queue
■ Empty local?
● Get from global
● Steal from others
■ Blocking (e.g. I/O)?
● P switches to a different M
● G goes back to a run queue when ready
■ Sharing?
● Running goroutines are preempted after ~10ms
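A minimal sketch of the model above: far more Gs than Ps or Ms, all multiplexed by the scheduler (the goroutine count is illustrative):

package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	// The number of Ps is fixed by GOMAXPROCS; Gs are multiplexed onto
	// however many Ms the runtime needs.
	fmt.Println("Ps (GOMAXPROCS):", runtime.GOMAXPROCS(0))

	var wg sync.WaitGroup
	for i := 0; i < 100_000; i++ { // far more Gs than Ps or Ms
		wg.Add(1)
		go func() {
			defer wg.Done()
		}()
	}
	fmt.Println("goroutines:", runtime.NumGoroutine())
	wg.Wait()
}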
Garbage Collection
■ Problem: Find which objects are not in use
■ How to avoid halting the entire application?
● Concurrent Mark & Sweep
i. Brief stop-the-world to enable write barriers
ii. Mark all of the active objects, concurrently with the application
iii. Sweep (free) all non-active objects
■ GC runs alongside application
■ Multiple workers in both stages
■ Can tune the behaviour using GOGC/GOMEMLIMIT
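A quick way to see the GC at work from inside a process, using only the standard library (a sketch, not production instrumentation):

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Snapshot the runtime's memory and GC statistics.
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Println("completed GC cycles: ", m.NumGC)
	fmt.Println("total STW pause (ns):", m.PauseTotalNs)
	fmt.Printf("GC CPU fraction:      %.2f%%\n", m.GCCPUFraction*100)
}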
Observability
Know When to Optimize
■ You can’t optimize what you can’t see!
■ Areas to cover:
● Application (rps, latency)
● Runtime (GC, goroutines, heap)
● System (CPU, memory, network)
■ Key runtime metrics:
● /gc/cycles/total:gc-cycles - Collection frequency
● /memory/classes/heap/objects:bytes - Live heap
● /sched/latencies:seconds - Scheduling delays
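These names come from the standard runtime/metrics package, so they can also be read directly in code; a minimal sketch:

package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	// The three metric names from the list above.
	names := []string{
		"/gc/cycles/total:gc-cycles",
		"/memory/classes/heap/objects:bytes",
		"/sched/latencies:seconds",
	}
	samples := make([]metrics.Sample, len(names))
	for i, n := range names {
		samples[i].Name = n
	}
	metrics.Read(samples)
	for _, s := range samples {
		switch s.Value.Kind() {
		case metrics.KindUint64:
			fmt.Printf("%s: %d\n", s.Name, s.Value.Uint64())
		case metrics.KindFloat64Histogram:
			// Scheduling latencies come back as a histogram.
			h := s.Value.Float64Histogram()
			fmt.Printf("%s: histogram with %d buckets\n", s.Name, len(h.Counts))
		}
	}
}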
Quick Start
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Expose Prometheus metrics, including the Go runtime collectors.
	http.Handle("/metrics", promhttp.Handler())
	// your application code
	log.Fatal(http.ListenAndServe(":2112", nil))
}
■ Exposing your application metrics is just a few lines of code away
● Out of the box: go_gc, go_sched, go_memstats, go_threads, and much more…
■ Can easily add custom ones (e.g. P50/P99 latency)
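A sketch of one such custom metric using the same Prometheus client as above; the metric name app_request_duration_seconds is hypothetical, and P50/P99 are computed at query time (e.g. with histogram_quantile in PromQL):

package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Latency histogram, registered with the default registry via promauto.
var requestLatency = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "app_request_duration_seconds",
	Help:    "Request latency in seconds.",
	Buckets: prometheus.DefBuckets,
})

func handler(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	defer func() { requestLatency.Observe(time.Since(start).Seconds()) }()
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/", handler)
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":2112", nil)
}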
Profiling: Finding the Answers
■ Run with Web UI
● go tool pprof -http=:90 localhost:80/debug/pprof/profile
● Opens the UI on localhost:90 with data from your application running on port 80
■ Several types:
● CPU (/profile)
● Memory (/heap)
● Allocations (/allocs)
● Mutex (/mutex)
● Goroutines (/goroutine)
■ Exposing profiles on canary instances provides a quick way to observe actual usage
■ Ideally you want to collect the profiles from different releases (continuous profiling)
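The endpoints above are served by the standard net/http/pprof package; a minimal sketch (the :8080 port is illustrative, and mutex profiling must be enabled explicitly):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
	"runtime"
)

func main() {
	// Mutex profiling is off by default; sample ~1 in 5 contention events.
	runtime.SetMutexProfileFraction(5)
	// Serves /debug/pprof/{profile,heap,allocs,mutex,goroutine}.
	log.Fatal(http.ListenAndServe(":8080", nil))
}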
Flamegraph - CPU
■ Memory allocations: runtime.(*mheap)
■ Garbage collection: runtime.gcBgMarkWorker / runtime.bgsweep
■ Scheduler: runtime.schedule
■ Your code: everything under runtime.main
Flamegraph - Heap
■ Identify where allocations happen
■ Quickly find memory leaks
Tips
■ Keep objects local, avoid pointers when possible
■ Check escapes with go build -gcflags="-m"
■ Consider object pooling (sketch after this list)
■ Sawtooth memory usage -> excessive allocations
■ Identify problems using heap profiles
■ Run with GODEBUG=gctrace=1 to expose GC details
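A minimal object-pooling sketch with sync.Pool, reusing buffers instead of reallocating them on every call (the handle function is hypothetical):

package main

import (
	"bytes"
	"fmt"
	"sync"
)

// Buffers are reused across calls instead of being reallocated,
// reducing allocation rate and GC pressure.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

func handle(payload string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset() // return a clean buffer to the pool
		bufPool.Put(buf)
	}()
	buf.WriteString(payload)
	return buf.String()
}

func main() {
	fmt.Println(handle("hello"))
}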
Runtime Tuning
GOMAXPROCS for containers
■ Setup:
● Container: 2 CPU limit
● Host: 8 CPUs
● GOMAXPROCS: 8 - defaults to the host CPU count, so the runtime schedules more work than the 2-CPU quota allows, causing throttling
■ Solution:
● Specify manually
● Use automaxprocs (sketch below)
● Upgrade to Go 1.25!
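A sketch of the automaxprocs route; the library adjusts GOMAXPROCS at init time based on the container's CPU quota:

package main

import (
	"fmt"
	"runtime"

	_ "go.uber.org/automaxprocs" // sets GOMAXPROCS to the container CPU quota at init
)

func main() {
	// With a 2-CPU container limit on an 8-CPU host, this prints 2
	// instead of the default 8 (on Go versions before 1.25).
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}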
GOGC - Control GC Frequency
■ Target heap memory = Live heap * (1 + GOGC/100)
■ Higher GOGC value:
● GC runs less often
● Lower CPU usage
● Higher memory usage
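Putting numbers into the formula: with a 100 MB live heap, GOGC=100 targets 200 MB and GOGC=200 targets 300 MB. The same knob is available programmatically (a sketch; 200 is an illustrative value):

package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Equivalent to running with GOGC=200: with a 100 MB live heap,
	// the next GC triggers near 100 * (1 + 200/100) = 300 MB.
	previous := debug.SetGCPercent(200)
	fmt.Println("previous GOGC:", previous)
}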
Impact of different GOGC values - 50/100/200
Source: https://tip.golang.org/doc/gc-guide
GOMEMLIMIT - Avoid OOMs
■ How does it work?
● Increases GC frequency as the heap approaches the configured limit
● Soft limit - Go does not guarantee staying under it
● Overrides GOGC when necessary
■ Use it when you have full control over the execution environment (e.g. containers)
■ A good starting point is ~90% of available memory
■ Pair high GOGC with GOMEMLIMIT
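The same limit can be set programmatically via runtime/debug; a sketch assuming a 4 GiB container limit (the GOGC value is illustrative, not a recommendation):

package main

import "runtime/debug"

func main() {
	// ~90% of a 4 GiB container limit, per the starting point above.
	debug.SetMemoryLimit(3686 << 20) // bytes; equivalent to GOMEMLIMIT=3686MiB
	// Pair a high GOGC with the limit, as suggested above.
	debug.SetGCPercent(400)
}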
PGO: Let Production Guide Compilation
■ Free performance! No code changes required
● Up to ~14% depending on the workload
● Biggest gains for compute-bound workloads
■ How does it work?
● Analyzes your app's CPU usage
● Inlines hot functions more aggressively
■ How do I use it?
● Collect profiles (e.g. curl <your_service>/debug/pprof/profile > cpu.pprof)
● Build with PGO: go build -pgo=cpu.pprof (or save the profile as default.pgo, picked up automatically since Go 1.21)
● Deploy and measure results!
Our results
■ GOMAXPROCS
● Setting it to the number of available cores is a good starting point
● This usually gives the best throughput
● With higher values we've observed lower P50 but higher P99 latencies
● Observed up to 30% reduction in cost after tuning this parameter
■ Experiences with PGO?
● Easy to start with, especially if you already gather profiles from production
● Mixed results - most of our services were I/O bound and did not benefit much
● Longer build times - should be fixed with Go 1.22+
Our results
■ GOGC
● In extreme cases GC took over 40% of CPU time
■ Review heap profiles for leaks/inefficiencies, then tune GOGC
● CPU usage at different GOGC values from one of our biggest services (20k+ peak QPS):
■ 100 (default): 40% CPU
■ 200: 21.5% CPU, 72% higher peak memory usage
■ 300: 15.9% CPU, 364% higher peak memory usage
■ 500: 5% CPU, 780% higher peak memory usage
■ Tuning GOGC/GOMEMLIMIT
● Average ~5% reduction in CPU usage, ~5% reduction in P99 latency
● We’ve found GOMEMLIMIT at ~90% and high GOGC to be suitable for most workloads
● Spending more memory may be a good trade-off for cost (1 CPU core costs roughly as much as 4-5 GB of RAM)
Stay Current, Stay Fast
■ Incremental improvements across versions
■ 1.21: PGO (Profile-Guided Optimisation)
■ 1.22: Improvements to runtime decreasing CPU overhead by 1-3%
■ 1.24
● Improvements to runtime decreasing CPU overhead by 2-3%
● New map implementation based on Swiss tables
■ 1.25
● Container-aware GOMAXPROCS!
● Experimental garbage collector (10-40% reduction in overhead in some workloads)
Key Takeaways
■ Optimisation is a continuous, iterative process
■ Observability comes first - You can’t optimise what you can’t see
■ Go has great performance out of the box
■ Tuning the runtime may provide additional benefits
● Especially important at scale
● Easy to do with GOMAXPROCS, GOGC, and GOMEMLIMIT
■ Stay up to date with latest Go versions for free performance
Thank you! Let’s connect.
Paweł Obrępalski
pawel.obrepalski@sharechat.co
linkedin.com/in/obrepalski/
obrepalski.com
