Go Faster
Optimising Go Programs
19 February 2015
Jason Moiron
Programmer, Datadog, Inc. (datadoghq.com)
"There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time [...], yet we should not pass up our opportunities in that critical 3%." - Donald Knuth
Learning what to change:
Learning how to change it:
The go stdlib code is not scary. Dive in!
import (
	"os"
	"runtime/pprof"
)

func startProfile(destpath string) {
	f, err := os.Create(destpath)
	if err != nil {
		panic(err)
	}
	pprof.StartCPUProfile(f)
}
import (
	"log"
	"net/http"
	_ "net/http/pprof"
)

func main() {
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	// ...
}
$ go tool pprof http://localhost:6060/debug/pprof/profile
$ go tool pprof http://localhost:6060/debug/pprof/heap
$ go tool pprof http://localhost:6060/debug/pprof/block
$ go tool pprof <binary> <profile output>
Welcome to pprof!  For help, type 'help'.
(pprof) top
Total: 462 samples
     235  50.9%  50.9%      237  51.3% runtime.cgocall
      98  21.2%  72.1%       98  21.2% syscall.Syscall
      55  11.9%  84.0%       55  11.9% runtime.memmove
      41   8.9%  92.9%       41   8.9% hash/crc32.update
       7   1.5%  94.4%       11   2.4% MCentral_Grow
...
(pprof) top20 --cum
Total: 462 samples
       0   0.0%   0.0%      458  99.1% runtime.gosched0
       0   0.0%   0.0%      381  82.5% main.main
       0   0.0%   0.0%      381  82.5% runtime.main
       1   0.2%   0.2%      303  65.6% github.com/DataDog/dd-go/czlib.Decompress
     235  50.9%  51.1%      237  51.3% runtime.cgocall
(pprof) list crc32.update
Total: 462 samples
ROUTINE ====================== hash/crc32.update in /usr/local/go/src/pkg/hash/crc32/crc32.go
    41     41 Total samples (flat / cumulative)
     .      .  101: func update(crc uint32, tab *Table, p []byte) uint32 {
     .      .  102:     crc = ^crc
     6      6  103:     for _, v := range p {
    35     35  104:         crc = tab[byte(crc)^v] ^ (crc >> 8)
     .      .  105:     }
     .      .  106:     return ^crc
(pprof) disasm crc32.update
ROUTINE ====================== hash/crc32.update
    41     41 samples (flat, cumulative) 8.9% of total
-------------------- /usr/local/go/src/pkg/hash/crc32/crc32.go
     .      .  101: func update(crc uint32, tab *Table, p []byte) uint32 {
     .      .      50ca80: SUBQ $0x10, SP
     .      .      50ca84: MOVQ 0x20(SP), R9
     .      .  102:     crc = ^crc
     .      .      50ca89: MOVL 0x18(SP), AX
     .      .      50ca8d: XORL $-0x1, AX
     6      6  103:     for _, v := range p {
(snip...)
    35     35  104:         crc = tab[byte(crc)^v] ^ (crc >> 8)
     .      .      50caae: MOVL AX, BP
     .      .      50cab0: XORQ SI, BP
     .      .      50cab3: CMPQ $0x0, R9
     .      .      50cab7: JE 0x50cae5
     5      5      50cab9: MOVZX BP, BP
     1      1      50cabd: LEAQ 0(R9)(BP*4), BX
     5      5      50cac1: MOVL 0(BX), BX
    21     21      50cac3: MOVL AX, BP
     1      1      50cac5: SHRL $0x8, BP
     .      .      50cac8: XORL BP, BX
     2      2      50caca: MOVL BX, AX
$ go tool pprof -ps <binary> <profile path>
import (
	"expvar"
	"net/http"
	"time"
)

var cache map[string]*Thing
var requests int

func expose() {
	go http.ListenAndServe("localhost:6060", nil)
	requestCount := expvar.NewInt("requestCount")
	cacheSize := expvar.NewInt("cacheSize")
	for range time.Tick(5 * time.Second) {
		requestCount.Set(int64(requests))
		cacheSize.Set(int64(len(cache)))
	}
}
Expose variables on a simple json HTTP interface:
$ curl http://localhost:6060/debug/vars { "cacheSize": 421, "requestCount": 10332, ...}
Measure things relevant to you:
Free stats:
Queue processing daemon called "mouzone"
func mouzone(messages chan msg) {
	parsed := make(chan parsedMsg, 50)
	resolved := make(chan resolvedMsg, 50)
	go func() {
		for msg := range messages {
			parsed <- parse(msg)
		}
	}()
	go func() {
		for msg := range parsed {
			resolved <- resolve(msg)
		}
	}()
	for msg := range resolved {
		// ...
	}
}
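The buffered channel lengths in a pipeline like this are free queue-depth gauges, and they can be published via expvar just like the counters earlier. A sketch (the variable and function names are illustrative, not from the original daemon):

```go
package main

import (
	"expvar"
	"time"
)

// Placeholder message types standing in for the daemon's real ones.
type parsedMsg struct{}
type resolvedMsg struct{}

// exposeDepths periodically publishes the current depth of each stage's
// buffered channel, so backpressure shows up on /debug/vars.
func exposeDepths(parsed chan parsedMsg, resolved chan resolvedMsg) {
	parsedDepth := expvar.NewInt("parsedDepth")
	resolvedDepth := expvar.NewInt("resolvedDepth")
	for range time.Tick(5 * time.Second) {
		parsedDepth.Set(int64(len(parsed)))
		resolvedDepth.Set(int64(len(resolved)))
	}
}

func main() {
	parsed := make(chan parsedMsg, 50)
	go exposeDepths(parsed, make(chan resolvedMsg, 50))
	// ... run the pipeline ...
}
```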
Benchmarks are short, repeatable timing tests.
func BenchmarkZlib(b *testing.B) {
	for i := 0; i < b.N; i++ {
		// ...
	}
}
Run them with go test -bench
$ go test -bench . -benchmem
...
BenchmarkFzlibDecompress     300    4849825 ns/op    942101 B/op      1 allocs/op
BenchmarkCzlibDecompress     300    5354454 ns/op   2128336 B/op     16 allocs/op
BenchmarkZlibDecompress      100   14694251 ns/op   2141291 B/op     59 allocs/op
BenchmarkFzlibCompress        20   58481874 ns/op    262156 B/op      1 allocs/op
BenchmarkCzlibCompress        20   58993665 ns/op    639292 B/op      7 allocs/op
BenchmarkZlibCompress         10  167576430 ns/op   2455070 B/op   1026 allocs/op
Let's use benchmarking to explore Go:
func BenchmarkMapAccess(b *testing.B) {
	m := map[int]int{0: 0, 1: 1}
	for i := 0; i < b.N; i++ {
		_ = m[0] + m[1]
	}
}
func BenchmarkStructAccess(b *testing.B) {
	m := struct{ a, b int }{0, 1}
	for i := 0; i < b.N; i++ {
		_ = m.a + m.b
	}
}
$ go test -bench .
...
BenchmarkMapAccess      100000000    13.00 ns/op
BenchmarkStructAccess   500000000     0.63 ns/op
package main

func main() {
	x := struct{ a, b int }{0xaa, 0xbb}
	b := x.a + x.b
	println(b)
}
We can see how this works by looking at the assembly output from go tool 6g -S
"".main t=1 size=80 value=0 args=0x0 locals=0x18
; go has optimised some stuff away; use -N to turn it off
0x001a 00026 (structvmap.go:4)  MOVQ    $170,BX
0x0021 00033 (structvmap.go:4)  MOVQ    $187,AX
0x0028 00040 (structvmap.go:5)  MOVQ    BX,"".x+8(SP)
0x002d 00045 (structvmap.go:5)  MOVQ    AX,"".x+16(SP)
0x0032 00050 (structvmap.go:5)  ADDQ    AX,BX
; ...
package main

func main() {
	x := map[string]int{"a": 0xaa, "b": 0xbb}
	b := x["a"] + x["b"]
	println(b)
}
asm output for line 5:
"".main t=1 size=336 value=0 args=0x0 locals=0x58
; snip
0x00a4 00164 (structvmap2.go:5)  LEAQ    go.string."a"+0(SB),BX
0x00ab 00171 (structvmap2.go:5)  MOVQ    (BX),CX
0x00ae 00174 (structvmap2.go:5)  MOVQ    8(BX),BP
0x00b2 00178 (structvmap2.go:5)  MOVQ    $type.map[string]int+0(SB),BX
; snip a bunch of copying to temp variables
0x00db 00219 (structvmap2.go:5)  CALL    ,runtime.mapaccess1_faststr(SB)
; snip some more temp vars, val copied to tmp
0x00e8 00232 (structvmap2.go:5)  MOVQ    BP,"".autotmp_0001+40(SP)
0x00ed 00237 (structvmap2.go:5)  LEAQ    go.string."b"+0(SB),BX
; same map access dance for "b"
0x0136 00310 (structvmap2.go:5)  ADDQ    BP,BX
The uncut assembly for Lines 4 & 5
Maps are hash tables. Lookup involves:

- hashing the key
- using that hash to locate the right bucket
- comparing the key against the stored keys in that bucket until a match is found
Writing is similar.
var empty struct{}

func BenchmarkStringKeys(b *testing.B) {
	s, n := "reasonably-long-but-present-unique-identifier", "non-present-unique-id"
	m := map[string]struct{}{s: empty}
	for i := 0; i < b.N; i++ {
		_, _ = m[s]
		_, _ = m[n]
	}
}
func BenchmarkStructKeys(b *testing.B) {
	type key struct{ a, b int }
	k, n := key{0, 1}, key{1, 2}
	m := map[key]struct{}{k: empty}
	for i := 0; i < b.N; i++ {
		_, _ = m[k]
		_, _ = m[n]
	}
}
$ go test -bench .
...
BenchmarkStringKeys   100000000   22.8 ns/op
BenchmarkStructKeys    20000000   93.2 ns/op
Let's profile our benchmarks and see what's going on:
$ cd bench/
$ go test -c
$ ./bench.test -test.bench="StringKeys" -test.cpuprofile="stringkeys.pprof"
$ ./bench.test -test.bench="StructKeys" -test.cpuprofile="structkeys.pprof"
$ go tool pprof --ps bench.test stringkeys.pprof > stringkeys.ps
$ go tool pprof --ps bench.test structkeys.pprof > structkeys.ps
Output for stringkeys.ps:
Output for structkeys.ps:
Not related to hashing:
// src/runtime/alg.go
func strhash(a unsafe.Pointer, s, h uintptr) uintptr {
	return memhash((*stringStruct)(a).str, uintptr(len(*(*string)(a))), h)
}
runtime·aeshashstr is almost identical to runtime·aeshash
src/runtime/hashmap_fast.go has a number of fast paths:
Map overhead reduced for these common key types.
"When possible, the Go compilers will allocate variables that are local to a function in that function's stack frame."
Let's compare two simple functions that allocate and return 1K:
func stacked() [128]int64 { return [128]int64{} }
func heaped() *[128]int64 { return &[128]int64{} }
$ go test -bench Alloc
BenchmarkStackAlloc   10000000     133 ns/op
BenchmarkHeapAlloc    20000000    83.5 ns/op
Curious... Is it because returning an 8-byte pointer is cheaper than a 1K buffer?
You have to be careful that your benchmarks test what you think they're testing. Let's see what the Go compiler is doing:
$ go test -gcflags="-m" -bench Alloc
./stack_test.go:5: can inline stacked
./stack_test.go:9: can inline heaped
./stack_test.go:15: inlining call to stacked
./stack_test.go:21: inlining call to heaped
./stack_test.go:10: &[128]int64 literal escapes to heap
./stack_test.go:13: BenchmarkStackAlloc b does not escape
./stack_test.go:19: BenchmarkHeapAlloc b does not escape
./stack_test.go:21: heaped &[128]int64 literal does not escape
There are more compiler flags available, use go tool 6g --help for a list. Let's try again with optimisations and inlining off:
$ go test -gcflags="-N -l" -bench Alloc
BenchmarkStackAlloc   10000000     135 ns/op
BenchmarkHeapAlloc      500000    3146 ns/op
Let's see the assembly: go tool 6g -S -N -l stack_test.go
stacked:
"".stacked t=1 size=96 value=0 args=0x400 locals=0x400
; snip
0x0034 00052 (stack_test.go:6)  LEAQ      "".autotmp_0000+0(SP),DI
0x0038 00056 (stack_test.go:6)  MOVL      $0,AX
0x003a 00058 (stack_test.go:6)  DUFFZERO  ,$
0x003f 00063 (stack_test.go:6)  LEAQ      "".autotmp_0000+0(SP),BX
0x0043 00067 (stack_test.go:6)  LEAQ      "".~r0+1032(FP),BP
0x004b 00075 (stack_test.go:6)  MOVQ      BP,DI
0x004e 00078 (stack_test.go:6)  MOVQ      BX,SI
0x0051 00081 (stack_test.go:6)  DUFFCOPY  ,$
0x0056 00086 (stack_test.go:6)  ADDQ      $1024,SP
0x005d 00093 (stack_test.go:6)  RET       ,
heaped:
"".heaped t=1 size=112 value=0 args=0x8 locals=0x18
; snip
0x0023 00035 (stack_test.go:10)  MOVQ      $type.[128]int64+0(SB),BX
0x002a 00042 (stack_test.go:10)  MOVQ      BX,(SP)
0x002e 00046 (stack_test.go:10)  PCDATA    $0,$0
0x002e 00046 (stack_test.go:10)  CALL      ,runtime.newobject(SB)
0x0033 00051 (stack_test.go:10)  MOVQ      8(SP),BX
0x0038 00056 (stack_test.go:10)  MOVQ      BX,"".autotmp_0001+16(SP)
0x003d 00061 (stack_test.go:10)  MOVQ      "".autotmp_0001+16(SP),DI
0x0042 00066 (stack_test.go:10)  CMPQ      DI,$0
0x0046 00070 (stack_test.go:10)  JEQ       $1,94
0x0048 00072 (stack_test.go:10)  MOVL      $0,AX
0x004a 00074 (stack_test.go:10)  DUFFZERO  ,$
0x004f 00079 (stack_test.go:10)  MOVQ      "".autotmp_0001+16(SP),BX
0x0054 00084 (stack_test.go:10)  MOVQ      BX,"".~r0+32(FP)
0x0059 00089 (stack_test.go:10)  ADDQ      $24,SP
0x005d 00093 (stack_test.go:10)  RET       ,
0x005e 00094 (stack_test.go:10)  MOVL      AX,(DI)
0x0060 00096 (stack_test.go:10)  JMP       ,72
Measure, don't guess:
"Intuition == guessing == being wrong and thinking you're awesome" - Dustin Sallings
Real programs have complex interactions that are hard to visualize:
But all are measurable.
Quiz:
package main

import "math"

func main() {
	NaN := math.NaN()
	m := map[float64]bool{NaN: true, NaN: true, NaN: false}
	println(len(m))
}
Jason Moiron
Programmer, Datadog, Inc. (datadoghq.com)