If you are a Golang developer, you probably know that the Go language is often chosen when low program latency is critical. People opt for Go in such applications because it allows them to write programs with predictable latency.
But unlike other languages like C or Rust, where a developer can imagine what machine code will be compiled from some particular source code and approximately calculate the number of cycles to predict execution time. Go has some factors that affect latency implicitly. For example, the garbage collector (GC) is one of these factors. In these posts, I aim to formulate implicit rules that help you write code with predictable latency.
It’s the first post for “Go Low Latency” series, based on the “Low Latency Patterns” talk on “GopherCon Singapore 2023”, slides are available at: https://g4s8.github.io/gophercon-sg-2023
Disclaimer Link to heading
In many situations, the Go compiler is smart enough to produce optimized machine code. And typically, you don’t need to care about the internal representation of your data.
If you follow the tips from the Effective Go tutorial, your code will perform well in 90% of cases. In this post, I’ll be discussing the other 10% - situations where your code must respond with predictable latency to events. It’s essential to keep in mind that the readability of code is often more important than a marginal (~10%) performance gain.
Latency Link to heading
According to Wikipedia, latency could be defined as:
A time delay between the cause and the effect.
Of course, there are many possible sources of latency issues that we can’t control, such as hardware delays, system call latency, thread schedulers, and more.
But we can try to do our best at least at application level. One of the main enemies of latency in an application is the Garbage Collector.
Garbage Collector (GC) Link to heading
The Go garbage collector (GC) affects the latency in various ways, as cited from the latency section:
- Brief stop-the-world pauses when the GC transitions between the mark and sweep phases,
- Scheduling delays because the GC takes 25% of CPU resources when in the mark phase,
- User goroutines assisting the GC in response to a high allocation rate,
- Pointer writes requiring additional work while the GC is in the mark phase, and
- Running goroutines must be suspended for their roots to be scanned.
If we need to achieve predictable latency, we should avoid garbage collection during the execution of critical code. This can be accomplished by not allocating memory while this critical code is running because the GC’s memory triggers are based on the GOGC and GOMEMLIMIT parameters. In other words, the GC will not initiate if no new heap memory is allocated for a certain period, allowing the critical code to execute.
The Go compiler uses complex logic to determine which variables should be moved to the heap or not,
and this logic may change from one compiler version to another.
In this article, I’m attempting to formulate empirical patterns and tips for escape analysis
that help avoid heap allocation for critical code sections.
I’ve tested these examples using Go version 1.21
.
Tools Link to heading
The most common tools for escape analysis is a:
go build
tool withgcflags
, e.g.:go build -gcflags '-m=2 -l' [package]
- pprof tool used together with benchmark testing and
benhcmem
option. - ASM compiler to analyze assembly code:
go tool compile -S [file]
andgo tool objdump -s main.main -S example.com > main.go.s
- Also, I really like lensm app which helps to visualize assembly code and show the relation between it and lines of source code.
I’ll use the output of these tools in my examples.
Interfaces Link to heading
When dealing with interface function parameters, be careful — arguments for these parameters are often moved to the heap before being passed to the callee function.
However, it can be challenging to determine precisely when an argument is moved to the heap and when it is not. Even if escape analysis reports that a variable has been moved to the heap, it doesn’t guarantee that it will always happen. This ambiguity arises because the allocation logic depends on two components: the compiler and the runtime. Escape analysis is integrated into the compiler to assist in deciding whether a variable should be moved to the heap. However, during the application’s runtime, the runtime may decide not to move it to the heap. Therefore, it’s advisable to consider the escape analysis output as a hint of where allocation may potentially occur.
When you have a function with an interface parameter, and the caller function passes an argument to the callee function, the compiler generates code to convert this argument into an internal representation of interface values. This internal representation consists of a two-word data structure: one word is a pointer to the argument value, and the other word is a pointer to metadata (itable). The metadata structure contains information about the argument’s type and a function table for dynamic dispatch. In some cases, the first pointer may contain the actual value itself (not a pointer), but only if the compiler can prove that the value size is less or equal to the word length.
Example Link to heading
For now, let’s assume we disable inline optimizations with go build -gcflags='-l'
, we’ll discuss inlines a bit later.
Now let’s see the very small example:
import "fmt"
func main() {
var x1 int = 1
fmt.Println(x1)
}
The code above produces this assembly code:
MOVL $0x1, AX
CALL runtime.convT64(SB)
LEAQ 0x6e9b(IP), CX
MOVQ CX, 0x18(SP)
MOVQ AX, 0x20(SP)
LEAQ 0x18(SP), AX
MOVL $0x1, BX
MOVQ BX, CX
NOPL 0(AX)
CALL fmt.Println(SB)
And here we can see what happens actually. Let’s go step by step.
- To pass variable
x1
with1
value asPrintln
argument it is passed throughruntime.convT64
function, which takesuint64
value as a parameter and returns pointer for this value (stored inAX
):
MOVL $0x1, AX
CALL runtime.convT64(SB)
- Then it loads type information, and store it and the pointer to argument variable at stack with offset (create interface value internal representation):
LEAQ 0x6e9b(IP), CX
MOVQ CX, 0x18(SP)
MOVQ AX, 0x20(SP)
- Set varargs length to
1
:
MOVL $0x1, BX
MOVQ BX, CX
- Call
Println
:
CALL fmt.Println(SB)
One interesting point here is that compiler was not able to prove that the argument value is less than one word size and it didn’t store it as value in interface structure but still used pointer.
If we compile this example with escape analysis output enabled go build -gcflags='-l -m'
, we can see
that it reports that x1
variable is escaped to heap:
./main.go:7:13: ... argument does not escape
./main.go:7:14: x1 escapes to heap
Benchmarking Link to heading
Indeed, compiler can’t prove that the pointer of this variable will not be used later after passing it to
Println
function. But if we make primitive benchmark for this case:
package main
import (
"fmt"
"testing"
)
func BenchmarkPrintln(b *testing.B) {
for i := 0; i < b.N; i++ {
var x int = 1
fmt.Println(x)
}
}
We can suddenly discover that this case doesn’t produce any allocation:
$ go test -bench=. -benchmem .
463819 2381 ns/op 0 B/op 0 allocs/op
What?! Escape analysis output says that the value is moved to the heap, but memory benchmark didn’t detect it? How could this happen?
The answer lies in the runtime. Allocations in this example depend on the variable’s value. Let’s see what happens if we change the value in our benchmark:
package main
import (
"fmt"
"testing"
)
func BenchmarkPrintln(b *testing.B) {
for i := 0; i < b.N; i++ {
var x int = 256
fmt.Println(x)
}
}
And now when we run tests:
$ go test -bench=. -benchmem .
455181 2337 ns/op 8 B/op 1 allocs/op
So the new memory is allocated for 256
value but not for 1
.
But the assembly code is exactly the same for both variables,
it means that something different happens in runtime. The only
candidate here is the runtime.convT64
function.
The runtime and numbers cache Link to heading
Here is the implementation of convT64
function. Just to remind you:
it takes the uint64
value and returns a pointer for it.
func convT64(val uint64) (x unsafe.Pointer) {
if val < uint64(len(staticuint64s)) {
x = unsafe.Pointer(&staticuint64s[val])
} else {
x = mallocgc(8, uint64Type, false)
*(*uint64)(x) = val
}
return
}
As you can see it has two branches:
- First is called when the parameter value is less then
staticuint64s
array length (surprise: it’s 256). And it just takes the pointer of some value in this array by parameter index; - Another is called otherwise, it allocates new memory, store the value in this memory, returns a pointer for the memory.
The array staticuint64s
looks like this:
// staticuint64s is used to avoid allocating in convTx for small integer values.
var staticuint64s = [...]uint64{
0x00, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07,
// ...
0xf8, 0xf9, 0xfa, 0xfb, 0xfc, 0xfd, 0xfe, 0xff,
}
So it’s just a numbers from 0x00
to 0xff
. It’s used to avoid allocations
in runtime for small numbers, this is why our 1
variable was not allocated
on heap but 256
(it’s 0xff + 1
) was allocated. I hope now it’s clear
about allocations and the joint efforts of runtime and compiler in this topic.
Small interface arguments Link to heading
As mentioned previously, the compiler may decide to keep the value in interface structure instead of taking a pointer for it if it can prove that the value is less than or equal one word size. However, in the previous example, this behavior did not occur for some reason. It would be beneficial to investigate why it did not happen. At the moment, I don’t have an explanation for this. But if we change the code from:
package main
import "fmt"
func main() {
var x1 int = 15
fmt.Println(x1)
}
To:
package main
import "fmt"
func main() {
fmt.Println(int(15))
}
Now the compiler manages to prove it,
and the value is loaded directly from binary memory without any convT64
calls:
MOVUPS X15, 0x18(SP) ; clear the stack
LEAQ 0x6ea5(IP), DX ; load `15` value from program memory
MOVQ DX, 0x18(SP) ; store it in interface value 1st word
LEAQ 0x36f71(IP), DX ; load type info
MOVQ DX, 0x20(SP) ; store type info as 2nd word
LEAQ 0x18(SP), AX
MOVL $0x1, BX
MOVQ BX, CX
CALL fmt.Println(SB)
A summary of escape analysis Link to heading
We have just seen that the escape analysis of interface parameters can be tricky.
Even if the compiler’s output indicates that the value escapes to the heap,
it only means that the value may escape to the heap.
Sometimes, the runtime may decide not to move it to the heap,
and sometimes, the compiler may optimize it and avoid calling convT**
functions to extract pointers.
Therefore, always check your specific case with benchmarks to confirm whether the argument value is actually allocated.
Perhaps you don’t even need to optimize it.
Avoid allocations Link to heading
First, try to avoid allocations only if you have them and if you need to. Typically, zero-allocation code has a worse design when compared to simple code without zero-allocation considerations.
There are a few possible techniques. Let’s start with an example:
package main
type Inter interface { // like fmt.Stringer
Int64() int64
}
type inter64 int64 // implementation
func (i inter64) Int64() int64 {
return int64(i)
}
//go:noinline
func toInt(i Inter) int64 {
return i.Int64()
}
func main() {
x := inter64(256)
_ = toInt(x)
}
Here we can see allocation of x
via i
param:
$ go build -gcflags '-m=2 -l'
# example.com
./main.go:14:12: parameter i leaks to {heap} with derefs=0:
./main.go:14:12: flow: {heap} = i:
./main.go:14:12: from i.Int64() (call parameter) at ./main.go:15:16
./main.go:14:12: leaking param: i
./main.go:20:12: x escapes to heap:
./main.go:20:12: flow: {heap} = &{storage for x}:
./main.go:20:12: from x (spill) at ./main.go:20:12
./main.go:20:12: from toInt(x) (call parameter) at ./main.go:20:11
./main.go:20:12: x escapes to heap
<autogenerated>:1: parameter .this leaks to {heap} with derefs=0:
<autogenerated>:1: flow: {heap} = .this:
<autogenerated>:1: from .this.Int64() (call parameter) at <autogenerated>:1
Indeed, this variable is converted with convT64
like in previous example:
MOVL $0x100, AX
CALL runtime.convT64(SB)
MOVQ AX, BX
LEAQ go:itab.main.inter64,main.Inter(SB), AX
CALL main.toInt(SB)
Exact type parameters Link to heading
The simplest solution here is to avoid using interfaces:
func toInt64(i inter64) int64 {
return i.Int64()
}
func main() {
x := inter64(256) // MOVL $0x100, AX
_ = toInt64(x) // CALL main.toInt64(SB)
}
Generic functions Link to heading
Certain variables may not be moved to the heap for functions with generic parameters:
func toIntGeneric[T Inter](i T) int64 {
return i.Int64()
}
func main() {
x := inter64(256)
_ = toIntGeneric(x)
}
Here, we’ve defined a new generic function with type T
, where T
can be any type that implements the Inter interface.
Now, let’s compile it:
$ go build -gcflags '-m=2 -l'
# example.com
<autogenerated>:1: parameter .this leaks to {heap} with derefs=0:
<autogenerated>:1: flow: {heap} = .this:
<autogenerated>:1: from .this.Int64() (call parameter) at <autogenerated>:1
No allocations — that’s a win. However, it’s not as straightforward as it sounds.
LEAQ main..dict.toIntGeneric[main.inter64](SB), AX
MOVL $0x100, BX
CALL main.toIntGeneric[go.shape.int64](SB)
When we examine the assembly code, we notice a few interesting things:
- our argument was passed via the
BX
register without heap allocation; - we call a function named
toIntGeneric[go.shape.int64]
; - some data from
toIntGeneric[main.inter64]
is loaded into theAX
register;
Let’s address these points one by one. The first point is what we are aiming for, so it’s expected.
The function name includes the go.shape.int64
suffix because the Go compiler generates actual generic code only for different GC shapes,
not for each type. GC-shapes differ for types that have varying sizes, alignments, and whether or not they contain pointers.
From the generics proposal called GC Shape Stenciling we can get the definition of GC-shape as
The GC shape of a type means how that type appears to the allocator / garbage collector. It is determined by its size, its required alignment, and which parts of the type contain a pointer.
Our type inter64
appears to the allocator as just an int64
value, which doesn’t require heap allocation.
Therefore, this function accepts values of int64-shaped types as if it were:
func toIntGenericInt64(i inter64)
The third point relates to this instruction, which loads type information for inter64
into the AX
register:
LEAQ main..dict.toIntGeneric[main.inter64](SB), AX
The compiler performs this action because function calls methods on generic types via dynamic dispatch. This approach has two drawbacks for generic functions:
- calling generic functions becomes more expensive due to dynamic dispatch;
- it’s not possible to apply different optimizations to this call;
The first point has a relatively lower impact on performance compared to allocation, which generics can help avoid. However, the second point may be a more serious issue. For example, if a function with an interface type parameter could be inlined and optimized, a generic function cannot be inlined. In some cases, a function with an interface type parameter may perform significantly better for performance after optimizations than a generic function.
And this leads us to the next topic.
Inlines Link to heading
When a function call with an interface parameter is inlined, it bypasses virtual table lookup, and the compiler may not move its arguments to the heap.
Let’s seethe compiler optimizations documentation:
Only short and simple functions are inlined. To be inlined a function must conform to the rules:
- function should be simple enough, the number of AST nodes must less than the budget (80);
- function doesn’t contain complex things like closures, defer, recover, select, etc;
- function isn’t prefixed by go:noinline;
- function isn’t prefixed by go:uintptrescapes, since the escape information will be lost during inlining;
- function has body;
And now try to create a function with interface type params for inlining:
package main
type Calc interface {
Add(int) int
}
type calcInt int
func (c *calcInt) Add(n int) (sum int) {
sum = int(*c) + int(n)
*c = calcInt(sum)
return
}
func main() {
var c1 calcInt
_ = sum(&c1, 1, 2, 3)
}
func sum(calc Calc, vals ...int) (sum int) {
for _, val := range vals {
sum = calc.Add(val)
}
return
}
It looks pretty simple:
- there is a
calcInt
type ofint
with methodAdd(int) int
which adds a number to itself and returns the value. The receiver is a pointer type; Calc
interface for function parameter corresponding tocalcInt
type;- a method
sum
which accepts implementation ofCalc
and integers as vararg to calculate a sum of integer usingCalc
; main
func to call it all together;
If we build it with inlines enabled:
$ go build -gcflags '-m=1'
# example.com
./main.go:9:6: can inline (*calcInt).Add
./main.go:20:6: can inline sum
./main.go:17:9: inlining call to sum
./main.go:17:9: devirtualizing calc.Add to *calcInt
./main.go:9:7: c does not escape
./main.go:17:9: ... argument does not escape
./main.go:20:10: leaking param: calc
./main.go:20:21: vals does not escape
We can see that the call was inlined. Let’s check assembly:
; initialize c1 variable with zero
MOVQ $0x0, 0x10(SP)
; initialize varargs with 1, 2, 3
MOVUPS X15, 0x20(SP)
MOVUPS X15, 0x28(SP)
MOVQ $0x1, 0x20(SP)
MOVQ $0x2, 0x28(SP)
MOVQ $0x3, 0x30(SP)
XORL AX, AX
; call to sum inlined
; range loop head
JMP 0x457740
MOVQ AX, 0x18(SP)
MOVQ 0x20(SP)(AX*8), BX
; calling Add method on calcInt implementation directly
LEAQ 0x10(SP), AX
CALL main.(*calcInt).Add(SB)
; range loop tail
MOVQ 0x18(SP), AX
INCQ AX
NOPW
CMPQ AX, $0x3
JL 0x457722
As we can see, the main function after inlining calls Add
method on calcInt
implementation of Calc
directly. So we have no allocations here and
no dynamic dispatch.
What if we change the sum
implementation to accept generic type parameter?
func main() {
var c1 calcInt
_ = sum(&c1, 1, 2, 3)
}
func sum[T Calc](calc T, vals ...int) (sum int) {
for _, val := range vals {
sum = calc.Add(val)
}
return
}
Build it:
$ go build -gcflags '-m=1'
# example.com
./main.go:9:6: can inline (*calcInt).Add
./main.go:20:6: can inline sum[go.shape.*uint8]
./main.go:17:9: inlining call to sum[go.shape.*uint8]
./main.go:20:6: inlining call to sum[go.shape.*uint8]
./main.go:9:7: c does not escape
./main.go:16:6: moved to heap: c1
./main.go:17:9: ... argument does not escape
The function itself was inlined, but what about parameter calls?
; allocate c1 variable on heap
LEAQ 0x4eab(IP), AX
CALL runtime.newobject(SB)
MOVQ AX, 0x30(SP)
; initialize varargs
MOVUPS X15, 0x18(SP)
MOVUPS X15, 0x20(SP)
MOVQ $0x1, 0x18(SP)
MOVQ $0x2, 0x20(SP)
MOVQ $0x3, 0x28(SP)
XORL CX, CX
; function call inlined
; range loop head
JMP 0x457751
MOVQ CX, 0x10(SP)
MOVQ 0x18(SP)(CX*8), BX
; call `calc.Add(val)` dynamically
LEAQ main..dict.sum[*main.calcInt](SB), DX
LEAQ 0xffffff7e(IP), SI
CALL SI
; range loop middle
MOVQ 0x10(SP), CX
INCQ CX
; sum = <result of SI call>
MOVQ 0x30(SP), AX
; range loop tail
CMPQ CX, $0x3
JL 0x45772a
Unlike calls with interface parameters, we can see that the implementation call was not inlined.
Additionally, there is an allocation for the argument because the method has a pointer receiver,
resulting in a GC shape of *calcInt
. This leads the allocator to decide to store it in the heap.
Pros and Cons Link to heading
Here’s a summary table for the three different approaches discussed in this post:
- function with exact type parameters;
- function with interface type parameters;
- function with generic type parameters derived from interface type;
Type | Allocs | Dyn. dispatch | Inline |
---|---|---|---|
Exact | No | No | Yes |
Interface | Inline | Inline | Yes |
Generics | Shapes | Yes | No |
In the table above, “exact” type parameters can help avoid allocations entirely. They have no dynamic dispatch and can be optimized by the compiler. However, this approach may lead to less desirable design decisions in some cases.
Functions with interface type parameters can be inlined with the implementation of the interface. This allows for optimization to avoid dynamic dispatch and argument allocations, if they exist. It provides more design flexibility.
Generic parameters of certain GC shapes are not moved to the heap. However, functions with these parameters have dynamic dispatch for method calls on generic-typed parameters, which prevents implementation calls from being inlined. From a design perspective, it is similar to interface parameters or sometimes even better.
Summary Link to heading
First, verify if you have a problem. Even if escape analysis indicates heap allocation, it may not be your actual case. it may be not your case.
Small functions with interface parameters can often be inlined with the implementation. If this occurs, manual optimization may not be necessary.
Generic functions with value types are your allies, as these GC shapes are usually not moved to the heap.
As a last resort, consider implementing different functions for each type to avoid allocation in performance-critical paths. Examples like github.com/rs/zerolog can provide guidance.
Don’t hesitate to redesign your code when needed. Adjusting the design of implementation can also improve performance without significant drawbacks. For instance, using small functions in interface implementations can help the compiler inline and optimize them.
TweetUpdates Link to heading
Subscribe to get updates about next posts on this topic:
- Telegram: @g4s8_chan
- Twitter: @kirill_che
References Link to heading
- Effective Go
- Garbage Collector
- pprof
- Go assembly
- lensm helps to visualize assembly
- Generic implementation
- Compiler optimizations: