Go Faster

Optimising Go Programs

19 February 2015

Jason Moiron

Programmer, Datadog, Inc. (datadoghq.com)

A Quote

"There is no doubt that the grail of efficiency leads to abuse. Programmers waste enormous amounts of time thinking about, or worrying about, the speed of noncritical parts of their programs, and these attempts at efficiency actually have a strong negative impact when debugging and maintenance are considered. We should forget about small efficiencies, say about 97% of the time [...], yet we should not pass up our opportunities in that critical 3%."

Donald Knuth, "Structured Programming with go to Statements"

The Need for Speed

Measure Everything

Learning what to change:

Learning how to change it:

The Go stdlib code is not scary. Dive in!

Identifying the 3%

Profiling

import "os"
import "runtime/pprof"

func startProfile(destpath string) {
    f, err := os.Create(destpath)
    if err != nil {
        panic(err)
    }
    if err := pprof.StartCPUProfile(f); err != nil {
        panic(err)
    }
}
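
The profile is only written out when profiling stops, so the caller must eventually call pprof.StopCPUProfile. A minimal sketch (cpu.pprof and doWork are placeholders, not from the original):

func main() {
    startProfile("cpu.pprof")
    defer pprof.StopCPUProfile() // flushes the pending samples to the file

    doWork() // the code we actually want to profile
}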

Profiling II

import (
    "log"
    "net/http"
    _ "net/http/pprof"
)

func main() {
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()
    ...
}
$ go tool pprof http://localhost:6060/debug/pprof/profile
$ go tool pprof http://localhost:6060/debug/pprof/heap
$ go tool pprof http://localhost:6060/debug/pprof/block

Using pprof output

$ go tool pprof <binary> <profile output>
Welcome to pprof!  For help, type 'help'.
(pprof) top
Total: 462 samples
 235  50.9%  50.9%      237  51.3% runtime.cgocall
  98  21.2%  72.1%       98  21.2% syscall.Syscall
  55  11.9%  84.0%       55  11.9% runtime.memmove
  41   8.9%  92.9%       41   8.9% hash/crc32.update
   7   1.5%  94.4%       11   2.4% MCentral_Grow
...
(pprof) top20 --cum
Total: 462 samples
   0   0.0%   0.0%      458  99.1% runtime.gosched0
   0   0.0%   0.0%      381  82.5% main.main
   0   0.0%   0.0%      381  82.5% runtime.main
   1   0.2%   0.2%      303  65.6% github.com/DataDog/dd-go/czlib.Decompress
 235  50.9%  51.1%      237  51.3% runtime.cgocall

Using pprof output II

(pprof) list crc32.update
Total: 462 samples
ROUTINE ====================== hash/crc32.update in /usr/local/go/src/pkg/hash/crc32/crc32.go
    41     41 Total samples (flat / cumulative)
     .      .  101: func update(crc uint32, tab *Table, p []byte) uint32 {
     .      .  102:     crc = ^crc
     6      6  103:     for _, v := range p {
    35     35  104:         crc = tab[byte(crc)^v] ^ (crc >> 8)
     .      .  105:     }
     .      .  106:     return ^crc

Using pprof output III

(pprof) disasm crc32.update
ROUTINE ====================== hash/crc32.update
    41     41 samples (flat, cumulative) 8.9% of total
-------------------- /usr/local/go/src/pkg/hash/crc32/crc32.go
     .      .   101: func update(crc uint32, tab *Table, p []byte) uint32 {
     .      .      50ca80: SUBQ $0x10, SP
     .      .      50ca84: MOVQ 0x20(SP), R9
     .      .   102: crc = ^crc
     .      .      50ca89: MOVL 0x18(SP), AX
     .      .      50ca8d: XORL $-0x1, AX
     6      6   103: for _, v := range p {
(snip...)
    35     35   104: crc = tab[byte(crc)^v] ^ (crc >> 8)
     .      .      50caae: MOVL AX, BP
     .      .      50cab0: XORQ SI, BP
     .      .      50cab3: CMPQ $0x0, R9
     .      .      50cab7: JE 0x50cae5
     5      5      50cab9: MOVZX BP, BP
     1      1      50cabd: LEAQ 0(R9)(BP*4), BX
     5      5      50cac1: MOVL 0(BX), BX
    21     21      50cac3: MOVL AX, BP
     1      1      50cac5: SHRL $0x8, BP
     .      .      50cac8: XORL BP, BX
     2      2      50caca: MOVL BX, AX

Pretty Pictures

$ go tool pprof -ps <binary> <profile path>

Expvar

import "expvar"

var cache map[string]*Thing
var requests int

func expose() {
    go http.ListenAndServe("localhost:6060", nil)

    requestCount := expvar.NewInt("requestCount")
    cacheSize := expvar.NewInt("cacheSize")
    for _ = range <- time.Tick(5 * time.Second) {
        requestCount.Set(requests)
        cacheSize.Set(len(cache))
    }
}

Expose variables on a simple json HTTP interface:

$ curl http://localhost:6060/debug/vars
{ "cacheSize": 421, "requestCount": 10332, ...}

Expvar II

Measure things relevant to you:

Free stats: expvar publishes cmdline and memstats (a runtime.MemStats snapshot) by default.

In the wild

Queue processing daemon called "mouzone"

func mouzone(messages chan msg) {
    parsed := make(chan parsedMsg, 50)
    resolved := make(chan resolvedMsg, 50)
    go func() {
        for msg := range messages {
            parsed <- parse(msg)
        }
    }()
    go func() {
        for msg := range parsed {
            resolved <- resolve(msg)
        }
    }()
    for msg := range resolved {
        // ...
    }
}
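
An illustrative way (not from the original slides) to watch such a pipeline is to publish the channel depths with expvar; a stage whose input channel stays full is the bottleneck:

func instrument(parsed chan parsedMsg, resolved chan resolvedMsg) {
    // Buffered-channel length is a cheap, live backlog gauge.
    expvar.Publish("parsedDepth", expvar.Func(func() interface{} {
        return len(parsed)
    }))
    expvar.Publish("resolvedDepth", expvar.Func(func() interface{} {
        return len(resolved)
    }))
}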

In the wild: Measuring Everything

In the wild: Profiling

In the wild: Measuring Everything II

Writing faster code

Benchmarking

Benchmarks are short, repeatable timing tests.

func BenchmarkZlib(b *testing.B) {
    for i := 0; i < b.N; i++ {
        // ...
    }
}
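
A filled-in benchmark might look like the following sketch; the name and the ~1MB input are illustrative, not the code behind the numbers below:

import (
    "bytes"
    "compress/zlib"
    "testing"
)

func BenchmarkZlibCompress1MB(b *testing.B) {
    data := bytes.Repeat([]byte("go faster "), 100000) // ~1MB of input
    b.ResetTimer()                                     // exclude the setup above
    for i := 0; i < b.N; i++ {
        var buf bytes.Buffer
        w := zlib.NewWriter(&buf)
        w.Write(data)
        w.Close()
    }
}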

Run them with go test -bench

$ go test -bench . -benchmem
...
BenchmarkFzlibDecompress         300       4849825 ns/op      942101 B/op           1 allocs/op
BenchmarkCzlibDecompress         300       5354454 ns/op     2128336 B/op          16 allocs/op
BenchmarkZlibDecompress          100      14694251 ns/op     2141291 B/op          59 allocs/op
BenchmarkFzlibCompress            20      58481874 ns/op      262156 B/op           1 allocs/op
BenchmarkCzlibCompress            20      58993665 ns/op      639292 B/op           7 allocs/op
BenchmarkZlibCompress             10     167576430 ns/op     2455070 B/op        1026 allocs/op

Benchmarking II

Let's use benchmarking to explore Go:

Structs vs Maps

func BenchmarkMapAccess(b *testing.B) {
    m := map[int]int{0: 0, 1: 1}
    for i := 0; i < b.N; i++ {
        _ = m[0] + m[1]
    }
}
func BenchmarkStructAccess(b *testing.B) {
    m := struct{ a, b int }{0, 1}
    for i := 0; i < b.N; i++ {
        _ = m.a + m.b
    }
}
$ go test -bench .
...
BenchmarkMapAccess     100000000        13.00 ns/op
BenchmarkStructAccess  500000000         0.63 ns/op

Structs vs Maps: ASM

package main

func main() {
    x := struct{ a, b int }{0xaa, 0xbb}
    b := x.a + x.b
    println(b)
}

We can see how this works by looking at the ASM output with go tool 6g -S

"".main t=1 size=80 value=0 args=0x0 locals=0x18
    ; go has optimised some stuff away; use -N to turn it off
    0x001a 00026 (structvmap.go:4)    MOVQ    $170,BX
    0x0021 00033 (structvmap.go:4)    MOVQ    $187,AX
    0x0028 00040 (structvmap.go:5)    MOVQ    BX,"".x+8(SP)
    0x002d 00045 (structvmap.go:5)    MOVQ    AX,"".x+16(SP)
    0x0032 00050 (structvmap.go:5)    ADDQ    AX,BX
    ; ...

Structs vs Maps: ASM II

package main

func main() {
    x := map[string]int{"a": 0xaa, "b": 0xbb}
    b := x["a"] + x["b"]
    println(b)
}

asm output for line 5:

"".main t=1 size=336 value=0 args=0x0 locals=0x58
    ; snip
    0x00a4 00164 (structvmap2.go:5)    LEAQ    go.string."a"+0(SB),BX
    0x00ab 00171 (structvmap2.go:5)    MOVQ    (BX),CX
    0x00ae 00174 (structvmap2.go:5)    MOVQ    8(BX),BP
    0x00b2 00178 (structvmap2.go:5)    MOVQ    $type.map[string]int+0(SB),BX
    ; snip a bunch of copying to temp variables
    0x00db 00219 (structvmap2.go:5)    CALL    ,runtime.mapaccess1_faststr(SB)
    ; snip some more temp vars, val copied to tmp
    0x00e8 00232 (structvmap2.go:5)    MOVQ    BP,"".autotmp_0001+40(SP)
    0x00ed 00237 (structvmap2.go:5)    LEAQ    go.string."b"+0(SB),BX
    ; same map access dance for "b"
    0x0136 00310 (structvmap2.go:5)    ADDQ    BP,BX

Structs vs Maps: ASM III

The uncut assembly for Lines 4 & 5

Full program readout is 138 LOC vs 1058 LOC.

More on Maps

Maps are hash tables. Lookup involves:

- hashing the key
- using the hash to pick a bucket
- comparing keys within the bucket until a match is found
- copying out the value

Writing is similar.
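
As a rough mental model (a simplification, not the runtime's actual implementation in src/runtime/hashmap.go):

type entry struct {
    key string
    val int
}

type toyMap struct {
    buckets [][]entry
}

func (m *toyMap) lookup(key string) (int, bool) {
    h := fnv32(key)                          // 1. hash the key
    b := m.buckets[h%uint32(len(m.buckets))] // 2. pick a bucket
    for _, e := range b {                    // 3. compare keys in the bucket
        if e.key == key {
            return e.val, true // 4. copy out the value
        }
    }
    return 0, false
}

// fnv32 is a stand-in hash function (FNV-1a, 32-bit).
func fnv32(s string) uint32 {
    h := uint32(2166136261)
    for i := 0; i < len(s); i++ {
        h ^= uint32(s[i])
        h *= 16777619
    }
    return h
}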

String keys vs Struct Keys

func BenchmarkStringKeys(b *testing.B) {
    s, n := "reasonably-long-but-present-unique-identifier", "non-present-unique-id"
    m := map[string]struct{}{s: empty}
    for i := 0; i < b.N; i++ {
        _, _ = m[s]
        _, _ = m[n]
    }
}
func BenchmarkStructKeys(b *testing.B) {
    type key struct{ a, b int }
    k, n := key{0, 1}, key{1, 2}
    m := map[key]struct{}{k: empty}
    for i := 0; i < b.N; i++ {
        _, _ = m[k]
        _, _ = m[n]
    }
}
$ go test -bench .
...
BenchmarkStringKeys     100000000        22.8 ns/op
BenchmarkStructKeys     20000000         93.2 ns/op

String keys vs Struct keys: Profile

Let's profile our benchmarks and see what's going on:

$ cd bench/
$ go test -c
$ ./bench.test -test.bench="StringKeys" -test.cpuprofile="stringkeys.pprof"
$ ./bench.test -test.bench="StructKeys" -test.cpuprofile="structkeys.pprof"
$ go tool pprof --ps bench.test stringkeys.pprof > stringkeys.ps
$ go tool pprof --ps bench.test structkeys.pprof > structkeys.ps

Output for stringkeys.ps:

String keys vs Struct keys: Profile II

Output for structkeys.ps:

String keys vs Struct keys: The source!

Not related to hashing:

// src/runtime/alg.go
func strhash(a unsafe.Pointer, s, h uintptr) uintptr {
    return memhash((*stringStruct)(a).str, uintptr(len(*(*string)(a))), h)
}

runtime·aeshashstr is almost identical to runtime·aeshash

src/runtime/hashmap_fast.go has fast paths for the most common key types: 32-bit integers, 64-bit integers, and strings.

Map overhead is reduced for these key types.
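
One way to see the fast paths pay off is to benchmark an int64-keyed map next to the struct-keyed one above; this sketch is illustrative and not from the original talk:

func BenchmarkInt64Keys(b *testing.B) {
    var empty struct{}
    m := map[int64]struct{}{42: empty}
    for i := 0; i < b.N; i++ {
        _, _ = m[42] // present key: taken through mapaccess2_fast64
        _, _ = m[99] // absent key: same fast path
    }
}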

Heap & Stack Allocation

"When possible, the Go compilers will allocate variables that are local to a function in that function's stack frame."

How do I know whether a variable is allocated on the heap or the stack?

Heap & Stack Allocation II

Let's compare two simple functions that allocate and return 1K:

func stacked() [128]int64 {
    return [128]int64{}
}
func heaped() *[128]int64 {
    return &[128]int64{}
}
$ go test -bench Alloc
BenchmarkStackAlloc     10000000           133 ns/op
BenchmarkHeapAlloc      20000000            83.5 ns/op

Curious... Is it because returning an 8-byte pointer is cheaper than a 1K buffer?
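
The benchmark functions behind those numbers are not shown in the talk; they presumably look roughly like this sketch, in which nothing keeps the results alive (consistent with the escape-analysis output on the next slide):

func BenchmarkStackAlloc(b *testing.B) {
    for i := 0; i < b.N; i++ {
        _ = stacked()
    }
}

func BenchmarkHeapAlloc(b *testing.B) {
    for i := 0; i < b.N; i++ {
        _ = heaped()
    }
}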

Tricking yourself

You have to be careful that your benchmarks test what you think they are testing. Let's see what the Go compiler is doing:

$ go test -gcflags="-m" -bench Alloc
./stack_test.go:5: can inline stacked
./stack_test.go:9: can inline heaped
./stack_test.go:15: inlining call to stacked
./stack_test.go:21: inlining call to heaped
./stack_test.go:10: &[128]int64 literal escapes to heap
./stack_test.go:13: BenchmarkStackAlloc b does not escape
./stack_test.go:19: BenchmarkHeapAlloc b does not escape
./stack_test.go:21: heaped &[128]int64 literal does not escape

There are more compiler flags available; use go tool 6g --help for a list. Let's try again with optimisations and inlining off:

$ go test -gcflags="-N -l" -bench Alloc
BenchmarkStackAlloc     10000000           135 ns/op
BenchmarkHeapAlloc      500000            3146 ns/op

Heap & Stack Allocation: ASM

Let's see the ASM: go tool 6g -S -N -l stack_test.go

stacked:

"".stacked t=1 size=96 value=0 args=0x400 locals=0x400
    ; snip
    0x0034 00052 (stack_test.go:6)    LEAQ    "".autotmp_0000+0(SP),DI
    0x0038 00056 (stack_test.go:6)    MOVL    $0,AX
    0x003a 00058 (stack_test.go:6)    DUFFZERO    ,$
    0x003f 00063 (stack_test.go:6)    LEAQ    "".autotmp_0000+0(SP),BX
    0x0043 00067 (stack_test.go:6)    LEAQ    "".~r0+1032(FP),BP
    0x004b 00075 (stack_test.go:6)    MOVQ    BP,DI
    0x004e 00078 (stack_test.go:6)    MOVQ    BX,SI
    0x0051 00081 (stack_test.go:6)    DUFFCOPY    ,$
    0x0056 00086 (stack_test.go:6)    ADDQ    $1024,SP
    0x005d 00093 (stack_test.go:6)    RET    ,

Heap & Stack Allocation: ASM II

heaped:

"".heaped t=1 size=112 value=0 args=0x8 locals=0x18
    ;snip
    0x0023 00035 (stack_test.go:10)    MOVQ    $type.[128]int64+0(SB),BX
    0x002a 00042 (stack_test.go:10)    MOVQ    BX,(SP)
    0x002e 00046 (stack_test.go:10)    PCDATA    $0,$0
    0x002e 00046 (stack_test.go:10)    CALL    ,runtime.newobject(SB)
    0x0033 00051 (stack_test.go:10)    MOVQ    8(SP),BX
    0x0038 00056 (stack_test.go:10)    MOVQ    BX,"".autotmp_0001+16(SP)
    0x003d 00061 (stack_test.go:10)    MOVQ    "".autotmp_0001+16(SP),DI
    0x0042 00066 (stack_test.go:10)    CMPQ    DI,$0
    0x0046 00070 (stack_test.go:10)    JEQ    $1,94
    0x0048 00072 (stack_test.go:10)    MOVL    $0,AX
    0x004a 00074 (stack_test.go:10)    DUFFZERO    ,$
    0x004f 00079 (stack_test.go:10)    MOVQ    "".autotmp_0001+16(SP),BX
    0x0054 00084 (stack_test.go:10)    MOVQ    BX,"".~r0+32(FP)
    0x0059 00089 (stack_test.go:10)    ADDQ    $24,SP
    0x005d 00093 (stack_test.go:10)    RET    ,
    0x005e 00094 (stack_test.go:10)    MOVL    AX,(DI)
    0x0060 00096 (stack_test.go:10)    JMP    ,72

In Summary

Measure, don't guess:

"Intuition == guessing == being wrong and thinking you're awesome" - Dustin Sallings

Real programs have complex interactions that are hard to visualize:

But all are measurable.

Bonus Round

Quiz:

package main

import "math"

func main() {
    NaN := math.NaN()
    m := map[float64]bool{NaN: true, NaN: true, NaN: false}
    println(len(m))
}

Thank you

Jason Moiron

Programmer, Datadog, Inc. (datadoghq.com)