A ScyllaDB Community
Go Faster: Tuning The Go
Runtime For Latency And
Throughput
Paweł Obrępalski
Staff Engineer
Paweł Obrępalski (he/him)
Staff Engineer @ ShareChat
■ Aerospace researcher turned software engineer
■ Focused on large scale recommender systems
■ Leading delivery team at ShareChat
■ Enjoys biking, gym and sauna
What will we talk about today?
■ Go Runtime
● Scheduler
● Garbage Collection
■ Observability
● Metrics
● Profiles
■ Runtime Tuning
■ Our Results
The Runtime: Your Silent Partner
■ Benefits
● Effortless concurrency - can manage millions of goroutines
● Automatic memory management
● Cross-platform compatibility
■ Costs
● ~2MB increase in binary size
● Additional startup latency
● Garbage Collection overhead (usually 1-3% CPU)
■ The default behaviour is sensible for most of the workloads
● Check your code before runtime optimisations
● Can tune the behaviour by changing environment variables: GOMAXPROCS, GOGC, and GOMEMLIMIT
Multiplexing At Scale
■ G-M-P model
● G: Goroutines - lightweight threads (2KB stack initially)
● M: OS threads - created as needed, reused later
● P: Processors - fixed by GOMAXPROCS
Run Queues
■ New goroutine:
● Put on the P's local run queue (max 256)
● If full: move half to the global queue
■ Empty local?
● Get from global
● Steal from others
■ Blocking (e.g. I/O)?
● P switches to a different M
● G goes back to a run queue when ready
■ Sharing?
● Running goroutines are preempted after ~10ms
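A minimal sketch of the model above: far more Gs than Ps or Ms, all multiplexed by the scheduler (the goroutine count is illustrative):

package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	// The number of Ps is fixed by GOMAXPROCS; Gs are multiplexed onto
	// however many Ms the runtime needs.
	fmt.Println("Ps (GOMAXPROCS):", runtime.GOMAXPROCS(0))

	var wg sync.WaitGroup
	for i := 0; i < 100_000; i++ { // far more Gs than Ps or Ms
		wg.Add(1)
		go func() {
			defer wg.Done()
		}()
	}
	fmt.Println("goroutines:", runtime.NumGoroutine())
	wg.Wait()
}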
Garbage Collection
■ Problem: Find which objects are not in use
■ How to avoid halting the entire application?
● Concurrent Mark & Sweep
i. Brief stop-the-world to enable write barriers
ii. Mark all of the active objects, concurrently with the application
iii. Sweep (free) all non-active objects
■ GC runs alongside application
■ Multiple workers in both stages
■ Can tune the behaviour using GOGC/GOMEMLIMIT
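A quick way to see the GC at work from inside a process, using only the standard library (a sketch, not production instrumentation):

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// Snapshot the runtime's memory and GC statistics.
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	fmt.Println("completed GC cycles: ", m.NumGC)
	fmt.Println("total STW pause (ns):", m.PauseTotalNs)
	fmt.Printf("GC CPU fraction:      %.2f%%\n", m.GCCPUFraction*100)
}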
Observability
Know When to Optimize
■ You can’t optimize what you can’t see!
■ Areas to cover:
● Application (rps, latency)
● Runtime (GC, goroutines, heap)
● System (CPU, memory, network)
■ Key runtime metrics:
● /gc/cycles/total:gc-cycles - Collection frequency
● /memory/classes/heap/objects:bytes - Live heap
● /sched/latencies:seconds - Scheduling delays
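These names come from the standard runtime/metrics package, so they can also be read directly in code; a minimal sketch:

package main

import (
	"fmt"
	"runtime/metrics"
)

func main() {
	// The three metric names from the list above.
	names := []string{
		"/gc/cycles/total:gc-cycles",
		"/memory/classes/heap/objects:bytes",
		"/sched/latencies:seconds",
	}
	samples := make([]metrics.Sample, len(names))
	for i, n := range names {
		samples[i].Name = n
	}
	metrics.Read(samples)
	for _, s := range samples {
		switch s.Value.Kind() {
		case metrics.KindUint64:
			fmt.Printf("%s: %d\n", s.Name, s.Value.Uint64())
		case metrics.KindFloat64Histogram:
			// Scheduling latencies come back as a histogram.
			h := s.Value.Float64Histogram()
			fmt.Printf("%s: histogram with %d buckets\n", s.Name, len(h.Counts))
		}
	}
}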
Quick Start
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Expose Prometheus metrics, including the Go runtime collectors.
	http.Handle("/metrics", promhttp.Handler())
	// your application code
	log.Fatal(http.ListenAndServe(":2112", nil))
}
■ Exposing your application metrics is just a few lines of code away
● Out of the box: go_gc, go_sched, go_memstats, go_threads, and much more…
■ Can easily add custom ones (e.g. P50/P99 latency)
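A sketch of one such custom metric using the same Prometheus client as above; the metric name app_request_duration_seconds is hypothetical, and P50/P99 are computed at query time (e.g. with histogram_quantile in PromQL):

package main

import (
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

// Latency histogram, registered with the default registry via promauto.
var requestLatency = promauto.NewHistogram(prometheus.HistogramOpts{
	Name:    "app_request_duration_seconds",
	Help:    "Request latency in seconds.",
	Buckets: prometheus.DefBuckets,
})

func handler(w http.ResponseWriter, r *http.Request) {
	start := time.Now()
	defer func() { requestLatency.Observe(time.Since(start).Seconds()) }()
	w.Write([]byte("ok"))
}

func main() {
	http.HandleFunc("/", handler)
	http.Handle("/metrics", promhttp.Handler())
	http.ListenAndServe(":2112", nil)
}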
Profiling: Finding the Answers
■ Run with Web UI
● go tool pprof -http=:90 localhost:80/debug/pprof/profile
● Opens the UI on localhost:90 with data from your application running on port 80
■ Several types:
● CPU (/profile)
● Memory (/heap)
● Allocations (/allocs)
● Mutex (/mutex)
● Goroutines (/goroutine)
■ Exposing profiles on canary instances provides a quick way to observe actual usage
■ Ideally you want to collect the profiles from different releases (continuous profiling)
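The endpoints above are served by the standard net/http/pprof package; a minimal sketch (the :8080 port is illustrative, and mutex profiling must be enabled explicitly):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
	"runtime"
)

func main() {
	// Mutex profiling is off by default; sample ~1 in 5 contention events.
	runtime.SetMutexProfileFraction(5)
	// Serves /debug/pprof/{profile,heap,allocs,mutex,goroutine}.
	log.Fatal(http.ListenAndServe(":8080", nil))
}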
Flamegraph - CPU
■ Memory allocations: runtime.(*mheap)
■ Garbage collection: runtime.gcBgMarkWorker / runtime.bgsweep
■ Scheduler: runtime.schedule
■ Your code: everything under runtime.main
Flamegraph - Heap
■ Identify where allocations happen
■ Quickly find memory leaks
Tips
■ Keep objects local, avoid pointers when possible
■ Check escapes with go build -gcflags="-m"
■ Consider object pooling (sketch after this list)
■ Sawtooth memory usage -> excessive allocations
■ Identify problems using heap profiles
■ Run with GODEBUG=gctrace=1 to expose GC details
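A minimal object-pooling sketch with sync.Pool, reusing buffers instead of reallocating them on every call (the handle function is hypothetical):

package main

import (
	"bytes"
	"fmt"
	"sync"
)

// Buffers are reused across calls instead of being reallocated,
// reducing allocation rate and GC pressure.
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

func handle(payload string) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset() // return a clean buffer to the pool
		bufPool.Put(buf)
	}()
	buf.WriteString(payload)
	return buf.String()
}

func main() {
	fmt.Println(handle("hello"))
}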
Runtime Tuning
GOMAXPROCS for containers
■ Setup:
● Container: 2 CPU limit
● Host: 8 CPUs
● GOMAXPROCS: 8 - defaults to the host CPU count, so the runtime schedules more work than the 2-CPU quota allows, causing throttling
■ Solution:
● Specify manually
● Use automaxprocs (sketch below)
● Upgrade to Go 1.25!
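A sketch of the automaxprocs route; the library adjusts GOMAXPROCS at init time based on the container's CPU quota:

package main

import (
	"fmt"
	"runtime"

	_ "go.uber.org/automaxprocs" // sets GOMAXPROCS to the container CPU quota at init
)

func main() {
	// With a 2-CPU container limit on an 8-CPU host, this prints 2
	// instead of the default 8 (on Go versions before 1.25).
	fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
}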
GOGC - Control GC Frequency
■ Target heap memory = Live heap * (1 + GOGC/100)
■ Higher GOGC value:
● GC runs less often
● Lower CPU usage
● Higher memory usage
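Putting numbers into the formula: with a 100 MB live heap, GOGC=100 targets 200 MB and GOGC=200 targets 300 MB. The same knob is available programmatically (a sketch; 200 is an illustrative value):

package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	// Equivalent to running with GOGC=200: with a 100 MB live heap,
	// the next GC triggers near 100 * (1 + 200/100) = 300 MB.
	previous := debug.SetGCPercent(200)
	fmt.Println("previous GOGC:", previous)
}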
Impact of different GOGC values - 50/100/200
Source: https://tip.golang.org/doc/gc-guide
GOMEMLIMIT - Avoid OOMs
■ How does it work?
● Increases GC frequency as the heap approaches the configured limit
● Soft limit - Go does not guarantee staying under it
● Overrides GOGC when necessary
■ Use it when you have full control over the execution environment (e.g. containers)
■ A good starting point is ~90% of available memory
■ Pair high GOGC with GOMEMLIMIT
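The same limit can be set programmatically via runtime/debug; a sketch assuming a 4 GiB container limit (the GOGC value is illustrative, not a recommendation):

package main

import "runtime/debug"

func main() {
	// ~90% of a 4 GiB container limit, per the starting point above.
	debug.SetMemoryLimit(3686 << 20) // bytes; equivalent to GOMEMLIMIT=3686MiB
	// Pair a high GOGC with the limit, as suggested above.
	debug.SetGCPercent(400)
}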
PGO: Let Production Guide Compilation
■ Free performance! No code changes required
● Up to ~14% depending on the workload
● Biggest gains for compute-bound workloads
■ How does it work?
● Analyzes your app's CPU usage
● Inlines hot functions more aggressively
■ How do I use it?
● Collect profiles (e.g. curl <your_service>/debug/pprof/profile > cpu.pprof)
● Build with PGO: go build -pgo=cpu.pprof (or save the profile as default.pgo, picked up automatically since Go 1.21)
● Deploy and measure results!
Our results
■ GOMAXPROCS
● Setting it to the number of available cores is a good starting point
● This usually gives the best throughput
● With higher values we've observed lower P50 but higher P99 latencies
● Observed up to 30% reduction in cost after tuning this parameter
■ Experiences with PGO?
● Easy to start with, especially if you already gather profiles from production
● Mixed results - most of our services were I/O bound and did not benefit much
● Longer build times - should be fixed with Go 1.22+
Our results
■ GOGC
● In extreme cases GC took over 40% of CPU time
■ Review heap profiles for leaks/inefficiencies, then tune GOGC
● CPU usage at different GOGC values from one of our biggest services (20k+ peak QPS):
■ 100 (default): 40% CPU
■ 200: 21.5% CPU, 72% higher peak memory usage
■ 300: 15.9% CPU, 364% higher peak memory usage
■ 500: 5% CPU, 780% higher peak memory usage
■ Tuning GOGC/GOMEMLIMIT
● Average ~5% reduction in CPU usage, ~5% reduction in P99 latency
● We’ve found GOMEMLIMIT at ~90% and high GOGC to be suitable for most workloads
● Spending more memory may be a good trade-off for cost (1 CPU core costs roughly as much as 4-5 GB of RAM)
Stay Current, Stay Fast
■ Incremental improvements across versions
■ 1.21: PGO (Profile-Guided Optimisation)
■ 1.22: Improvements to runtime decreasing CPU overhead by 1-3%
■ 1.24
● Improvements to runtime decreasing CPU overhead by 2-3%
● New map implementation based on Swiss tables
■ 1.25
● Container-aware GOMAXPROCS!
● Experimental garbage collector (10-40% reduction in overhead in some workloads)
Key Takeaways
■ Optimisation is a continuous, iterative process
■ Observability comes first - You can’t optimise what you can’t see
■ Go has great performance out of the box
■ Tuning the runtime may provide additional benefits
● Especially important at scale
● Easy to do with GOMAXPROCS, GOGC, and GOMEMLIMIT
■ Stay up to date with latest Go versions for free performance
Thank you! Let’s connect.
Paweł Obrępalski
pawel.obrepalski@sharechat.co
linkedin.com/in/obrepalski/
obrepalski.com
