GeistHaus
log in · sign up

About on Felix Geisendörfer

Part of feedburner.com

Recent content in About on Felix Geisendörfer

stories
Profiling Improvements in Go 1.18

Without doubt, Go 1.18 is shaping up to be one of the most exciting releases since Go 1. You’ve probably heard about major features such as generics and fuzzing, but this post is not about that. Instead we’ll talk about profiling and highlight a few noteworthy improvements to look forward to.

A point of personal joy for me is that I was able to contribute to several of the improvements as part of my job at Datadog and that they will significantly improve our new Connecting Go Profiling With Tracing functionality.

If you’re new to Go profiling, you might also enjoy our Guide to Go Profiling before diving into this post.

Better CPU Profiling Accuracy

Let’s start with the most significant profiling change in Go 1.18 – the CPU profiler for Linux has become a lot more accurate. In particular, CPU bursts on multi-core systems used to be underestimated, which is fixed now.

This change goes back to late 2019 (go1.13) when Rhys Hiltner opened GH 35057 to report that he noticed a Go service that was using 20 CPU cores according to top, but only 2.4 cores according to Go’s CPU profiler. Digging deeper, Rhys found that this problem was caused by dropped POSIX signals in the kernel.

To understand the problem, it’s worth knowing that Go 1.17’s CPU profiler is built on top the setitimer(2) syscall. This allows the Go runtime to ask the Linux kernel to get a notification for every 10ms the program spends running on a CPU. These notifications are delivered as SIGPROF signals. Each delivery of a signal causes the kernel to stop the program and invoke the runtime’s signal handler routine on one of the stopped threads. The signal handler takes a stack trace of the current goroutine and adds it to the CPU profile. All of this takes less than ~10usec after which the kernel resumes the execution of the program.

So what was wrong? Well, given Rhys program reporting 20 CPU cores being used by top, the expected signal rate should have been 2,000 per second. However, the resulting profile only contained an average of 240 stack traces per second. Further investigation using Linux Event Tracing revealed that the expected amount of signal:signal_generate events were observed, but not enough signal:signal_deliver events. The difference between signal generation and delivery might seem confusing at first, but is due to the fact that standard POSIX signals do not queue, see signal(7). Only a single signal can be pending delivery at a time. If another signal of the same kind is generated while one is already pending, the new signal gets dropped.

However, this still doesn’t explain why so many signals would be generated during the exact same < ~10usec signal handler window so that they would end up on top of each other. So the next piece to this puzzle is the fact that the resolution of syscalls measuring CPU time are limited to the kernel’s software clock which measures time in jiffies, see time(7). The duration of a jiffy depends on kernel configuration, but the default is 4ms (250Hz). What this means is that the kernel isn’t even able to honor Go’s wish of receiving a signal every 10ms. Instead it has to alternate between 8ms and 12ms as can be seen in the histogram below (details).

Since alternating between 8ms and 12ms still works out to 10ms on average, it is not a problem in and of itself. What is a problem however, is what happens when e.g. 10 threads are running on different CPU cores, causing them to use 40ms of CPU time within a 4ms jiffy window. So when the kernel gets around to do its checking, it needs to generate 4 SIGPROF signals at once, causing all except one to be dropped instead of delivered. Or in other words, setitimer(2) – the kernel’s portable API advertised for profiling – turns out to be unsuitable for profiling busy multi-core systems :(.

However, given that the man pages explain the jiffy limitations of CPU time related syscalls and the semantics of process-directed signals, it’s not entirely clear if this would be recognized as a bug by the kernel maintainers. But even if setitimer(2) would be fixed, it would take a long time for people to upgrade their kernels, so Rhys brought up the idea of using timer_create(2) which offers per-thread accounting of pending signals.

Unfortunately the idea didn’t receive feedback from the Go maintainers, so the issue sat dormant until Rhys and I started discussing it again in May 2021 on twitter. We decided to collaborate, Rhys working on the critical CPU profiler changes, and me on better test coverage for cgo and other edge cases shown in the diagram below.

Another side quest of mine was to create a standalone C program called proftest to observe the signal generation and delivery behavior of setitimer(2) and timer_create(2) which confirmed another bias issue from 2016: setitimer(2)’s process-directed signals are not fairly distributed among CPU-consuming threads. Instead one thread tends to get screwed and receives less signals than all other threads. Now, to be fair, signal(7) states that the kernel chooses an arbitrary thread to which to deliver process-directed signals when multiple threads are eligible. But on the other hand macOS doesn’t show this kind of bias, and I think that’s the point where I need to stop making apologies on behalf of the Linux kernel :).

But the good news is that, except for jiffy resolution, timer_create(2) doesn’t appear to suffer from any of the ailments plaguing setitimer(2). Thanks to per-thread signal accounting, it reliably delivers the right amount of signals and shows no bias towards any particular threads. The only issue is that timer_create(2) requires the CPU profiler to be aware of all threads. This is easy for threads created by the Go runtime, but gets tricky when cgo code is spawning its own threads. Rhys patch deals with this by combining timer_create(2) and setitimer(2). When a signal is received, the signal handler checks the origin of the signal and discards it if it’s not the best signal source available for the current thread. Additionally the patch also has some clever bits to deal with avoiding bias against short-lived threads and is just a great piece of engineering in general.

Anyway, thanks to lots of review and feedback from the Go maintainers, Rhys patch was merged and will ship with Go 1.18, fixing GH 35057 and GH 14434. My test cases didn’t make the cut, partially because it’s hard to make non-flaky assertions on profiling data, but mostly because we didn’t want to risk those patches taking upstream’s attention away from the main changes themselves. Nevertheless this open source community collaboration was one of my favorite computer experiences of 2021!

Profiler Label Bug Fixes

Profiler labels (aka pprof labels or tags) allow users to associate arbitrary key/value pairs with the currently running goroutine which are inherited by child-goroutines. These labels also end up in CPU and Goroutine profiles, allowing to slice and dice these profiles by label values.

At Datadog we use profiler labels for Connecting Go Profiling With Tracing and as part of working on this feature I did a lot of testing on them. This caused me to observe stack traces in my profiles that should have labels attached, but didn’t. I proceeded by reporting the issue as GH 48577 and taking a closer look at the root cause.

Luckily the problem turned out to be really simple. The CPU profiler was sometimes looking at the wrong goroutine when adding the label and required essentially just a one-line fix:

-cpuprof.add(gp, stk[:n])
+cpuprof.add(gp.m.curg, stk[:n])

This might be a bit confusing at first since gp is the current goroutine and usually points to the same place as gp.m.curg (the current goroutine of the thread gp is running on). However, the two are not the same when Go has to switch stacks (e.g. during signal handling, or resizing the stack of the current goroutine), so if the signal arrives at an unfortunate moment, gp points to a purely internal goroutine that is executing on behalf of the user, but lacking the labels belonging to the current stack trace stk. Apparently this is a common issue, so it is even documented in the runtime/HACKING.md guide.

Considering the simplicity of the problem, I submitted a one-line patch for it. However, the patch was greeted by the classic OSS response – “Can you please write a test for it?” – delivered by Michael Pratt. I initially doubted the ROI of such a test case, but I couldn’t have been more wrong. It was definitely very difficult to write a non-flaky test to demonstrate the issue, and the initial version of the patch had to be reverted because of this. However, the process of implementing and improving the test case helped Michael to identify two additional issues related to labels, both of which he ended up fixing:

CL 369741 fixed and off-by-one error in the encoding of the first batch of pprof samples causing a small number of samples to be tagged with the wrong labels.

CL 369983 fixed an issue causing system goroutines (e.g. for GC) to inherit labels when spawned from user goroutines.

So thanks to all three problems being fixed, profiler labels will become a lot more accurate in Go 1.18. The biggest impact will be for programs using a lot of cgo since C calls run on a separate stack with gp != gp.m.curg. However, even regular programs experiencing 3-4% missing labels in Go 1.17 will benefit from this change and achieve full accuracy.

Stack Trace Bug Fix

Another profiling related bug that we discovered at Datadog is GH 49171. The problem manifested itself in delta allocation profiles with negative allocation counts. Digging deeper, we encountered non-deterministic program counter (pc) symbolization as a root cause. What this means is that the exact same stack trace (same program counters) would contain different symbols (e.g. filenames) in profiles taken at a different time, which should be impossible.

The bug was also very hard to reproduce in a standalone fashion, and I spend almost a week before succeeding and reporting it upstream. In the end it turned out to be a compiler regression involving inlined closures in Go 1.17 and was fixed by Than McIntosh.

Epilog

So as you can see, in addition to being packed with exciting new features, Go 1.18 also contains several important profiling improvements. Please let me know if I missed anything or if you have any questions or feedback.

Thank you for reading and thanks to my colleague Nick Ripley for reviewing this post.

https://felixge.de/2022/02/11/profiling-improvements-in-go-1.18/
Connecting Go Profiling With Tracing

Today I’d like to share a few thoughts on a few new profiling features we recently shipped at Datadog. I’ll explain what they do, how they work, and which Go 1.18 contributions were needed in order for things to work well.

Feature Showcase

Let’s start with the end result. Imagine you have a 100ms trace where 10ms are spent on a database query, but 90ms remain unexplained.

When this happens, you usually need to study your code for clues. Did you forget some tracing instrumentation? Time to add it, redeploy and wait. Or perhaps you need to optimize your Go code? If yes, how?

This workflow is manageable, but it turns out there is a better way – we can use profiling data to fill in the gaps. And that’s exactly what our new Code Hotspots feature does. As you can see below, our request used 90ms On-CPU time. This is a strong signal that lets us rule out Off-CPU activity such as uninstrumented service calls, mutex contentions, channel waits, sleeping, etc.

Even better, when clicking the “View Profile” button, we can view this On-CPU time as a per-request flame graph. Here we can see that our time was spent on JSON encoding.

And since our HTTP handler functions don’t show up in the stack traces, we can also indirectly infer that this work was done in a background goroutine spawned by the goroutine handling the request.

In addition to breaking down tracing information using profiling, we can also do the opposite and break down a profile’s CPU Time by Endpoint as shown below. The checkboxes can be used to filter the profile by endpoint.

And to make it easier to understand changes over time, e.g. after a deployment, we can graph this data.

How It Works

The features are built on-top of an existing Go feature called pprof labels which allows us to attach arbitrary key/value pairs to the currently running goroutine. These labels are automatically inherited when a new goroutine is spawned, so they reach into all corners of your code. Crucially these labels are also integrated into the CPU profiler, so they automatically end up in the CPU profiles it produces.

So to implement this feature, we modified the tracing code in our dd-trace-go library to automatically apply labels such as span id and endpoint whenever a new span is created. Additionally we take care of removing the labels when the span is finished. A simplified example of this implementation can be seen below:

func StartSpan(ctx context.Context, endpoint string) *span {
span := &span{id: rand.Uint64(), restoreCtx: ctx}
labels := pprof.Labels(
"span_id", fmt.Sprintf("%d", span.id),
"endpoint", endpoint,
)
pprof.SetGoroutineLabels(pprof.WithLabels(ctx, labels))
return span
}
type span struct {
id uint64
restoreCtx context.Context
}
func (s *span) Finish() {
pprof.SetGoroutineLabels(s.restoreCtx)
}

Our actual implementation is a bit more complex and also reduces the risk of leaking PII (Personally Identifiable Information). It automatically covers the HTTP and gRPC wrappers in our contrib package, as well as any custom traces your application may implement.

Once our backend receives the tracing and profiling information, we’re able to perform the lookups needed to power the features showcased earlier in this post.

Profiling Improvements in Go 1.18

As part of implementing these new features, we did a lot of testing, including unit tests, micro-benchmarks, macro-benchmarks and more. As expected, this surfaced problems in our code and allowed us to quickly fix them.

Somewhat less expected, we also discovered several issues in the Go runtime that were impacting the accuracy of pprof labels, as well as CPU profiling in general. The good news is that with the help of the community, the Go maintainers and contributions from our end – all of these issues have been fixed in the upcoming Go 1.18 release.

If you’re interested in the full details on this, check out the companion post: Profiling Improvements in Go 1.18.

That being said, unless you use a lot of cgo, the new features should already work great for you in Go 1.17.

Feedback Wanted

I’m currently looking for people who are interested in taking these new features for a spin in order to get feedback.

It doesn’t matter if you’re an existing or potential customer, just send me an email and we can set up a 30 min zoom. To sweeten the deal, I’m also happy to answer general Go profiling questions along the way :).

Appendix: Getting Started

If you want to get started quickly, you can simply copy the code below into your application.

Additionally you can check out this fully working dd-trace-go-demo application that shows an integration including HTTP and PostgreSQL.

package main
import (
"time"
"gopkg.in/DataDog/dd-trace-go.v1/ddtrace/tracer"
"gopkg.in/DataDog/dd-trace-go.v1/profiler"
)
func main() {
const (
env = "dev"
service = "example"
version = "0.1"
)
tracer.Start(
tracer.WithEnv(env),
tracer.WithService(service),
tracer.WithServiceVersion(version),
tracer.WithProfilerCodeHotspots(true),
tracer.WithProfilerEndpoints(true),
)
defer tracer.Stop()
err := profiler.Start(
profiler.WithService(service),
profiler.WithEnv(env),
profiler.WithVersion(version),
// Enables CPU profiling 100% of the time to capture hotspot information
 // for all spans. Default is 25% right now, but this might change in the
 // next dd-trace-go release.
 profiler.CPUDuration(60*time.Second),
profiler.WithPeriod(60*time.Second),
)
if err != nil {
panic(err)
}
defer profiler.Stop()
// <your application code>
}

Thanks to my colleague Nick Ripley for reviewing this post.

https://felixge.de/2022/02/11/connecting-go-profiling-with-tracing/
Advent of Go Profiling: Day 1-1: Branchless Go

It seems like my previous post has struck a chord and many of you have sent your own solutions via twitter. Before I move on to pick another day as a challenge, I’d like to do a little follow up for day 1-1.

Valentin Deleplace’s Solution

Let’s start with looking at the winning solution from Valentin Deleplace which achieves a spectacular 13.8 ns/op on my machine. This is 6.5x faster than my v3 solution and 23x faster than my v1 🤯.

func Answer(input string) (int, error) {
var prev int64 = -9999
var increases int = -1
val := int64(0)
for p, N := 0, len(input); p < N; p++ {
c := input[p]
if c != '\n' {
val = (val << 8) + int64(c)
} else {
if val > prev {
increases++
}
prev = val
val = 0
}
}
if val > prev {
increases++
}
return increases, nil
}

I knew that my solution was not ideal, but the magnitude of the improvement still surprised me. So I decided to take a closer look to see how it was done:

  • The call to strings.IndexByte() was retired in favor of directly looping over the individual characters of the input. This works well here because the average line length is very short (4 characters) and calling assembly functions from Go has a non-trivial overhead associated with it.
  • Somewhat unexpectedly, the function body fits into Go’s inlining budget. Build with -gcflags='-m' if you want to verify this yourself. This eliminates the overhead of calling the Answer function in the benchmark.
  • Looping over the input is done without using range on the input string which avoids unicode overhead.
  • Parsing of the input digits is directly integrated into the for loop and avoids iterating over those chars twice.
  • val << 8 is used instead of val*10. This changes val, but doesn’t break the val > prev comparison.
  • Instead of using a separate bool to indicate if a previous value has been seen, prev and increases are initialized to -9999 and -1 respectively which is fair game since we noticed the elves hesitance to throw negative numbers at us.
  • Error handling such as checking for valid digits has been eliminated.

Most importantly Valentin’s solution is far more compact than my v3 and remains pretty readable given the circumstances 👏.

Branchless Go

In my experience code optimization always involves a lot of trial and error. So I think it’s important to acknowledge that optimization ideas often fail and to discuss what we can learn from these failures.

One such failure was my first attempt to further optimize Valentin’s solution by eliminating branches. My inspiration for this came from Daniel Lemire’s Parsing JSON Really Quickly: Lessons Learned presentation as well as Matt Nakama’s Branchless Coding in Go.

I’ve never done this before, so the code below can certainly be improved, but here is what I came up with:

func Answer(input string) (int, error) {
var prev int64 = -9999
var increases int = -1
val := int64(0)
for p, N := 0, len(input); p < N; p++ {
c := input[p]
notNl := boolToInt64(c != '\n')
increases += int((^notNl & 1) * boolToInt64(val > prev))
prev = notNl*prev + val*(^notNl&1)
val = notNl * ((val << 8) + int64(c))
}
increases += int(boolToInt64(val > prev))
return increases, nil
}
func boolToInt64(b bool) int64 {
return int64((*(*uint8)(unsafe.Pointer(&b))) & 1)
}

The core idea is to cast bool values into 0 or 1 which allows us to take a branch like this:

if c != '\n' {
val = (val << 8) + int64(c)
} else {
val = 0
}

And turn it into the equivalent branch-free expression shown below:

val = ((val << 8) + int64(c)) * boolToInt64(c != '\n')

This works because we always compute the expression from the first branch and then multiply it by 1 if the first branch should be taken, or 0 for the second branch which is the same as setting val = 0 directly.

And while boolToInt64(c != '\n') might look like a compilation nightmare, the Go compiler seems to emit something relatively reasonable (see https://godbolt.org/z/Tzr99vqr3):

CMPB R8B, $10 ; compare c and '\n' (ASCII 10)
SETNE "".b+5(SP) ; store 1 on stack if c != '\n' otherwise 0
MOVBLZX "".b+5(SP), R9 ; move result from stack to register R9
ANDL $1, R9 ; last part of int64((*(*uint8)(unsafe.Pointer(&b))) & 1

From my perspective, something like this would be even better:

CMPB R8B, $10 ; compare c and '\n' (ASCII 10)
SETNE R9 ; store 1 in R9 if c != '\n' otherwise 0

But alas — I wasn’t able to get the compiler to generate this. Anyway, since we eliminated all branches, our solution should run very quickly now, right? Let’s have a look:

$ benchstat v4-valentin.txt v5.txt
name old time/op new time/op delta
Answer-6 13.8ns ± 1% 77.1ns ± 0% +460.24% (p=0.008 n=5+5)

Ouch. Turns out a little knowledge is a dangerous thing 🤕. Initially I was pretty sad about this, as I had put a lot of effort into all this “clever” code. But knowing that managing your emotions is often the hardest part — and because it was 3am — I decided to call it a night.

I ended up guessing the problem while riding my bicycle the next day. However I could have probably also figured it out by looking at the View -> Source output of the CPU profile shown below:

As you can see, we’re spending quite a lot of time on line 36 that performs a conditional increases++ operation.

To understand why this is bad, let’s recall the input data we are using:

199
200
208
210
200
207
240
269
260
263

As well as Valentin’s code we are trying to optimize:

if c != '\n' {
val = (val << 8) + int64(c)
} else {
if val > prev {
increases++
}
prev = val
val = 0
}

As you can see, all lines have 3 digits plus one newline, so the chance to end up in the original else block is only 1/4. For bigger input numbers the chance would be even lower. Additionally the val > prev predicate has to be true, so our unconditional execution of the increases++ operation doesn’t have a very good chance of paying off in most cases.

However, we know that the val > prev predicate by itself is true 7 out of 10 times, so perhaps we can beat the CPU’s branch predictor by limiting ourselves to only eliminating that branch? The code for this is shown below:

func Answer(input string) (int, error) {
var prev int64 = -9999
var increases int = -1
val := int64(0)
for p, N := 0, len(input); p < N; p++ {
c := input[p]
if c != '\n' {
val = (val << 8) + int64(c)
} else {
increases += int(boolToInt64(val > prev))
prev = val
val = 0
}
}
if val > prev {
increases++
}
return increases, nil
}
func boolToInt64(b bool) int64 {
return int64((*(*uint8)(unsafe.Pointer(&b))) & 1)
}

Let’s have a look how this compares to Valentin’s original solution:

benchstat v4-valentin.txt v6.txt .txt
name old time/op new time/op delta
Answer-6 13.8ns ± 1% 12.7ns ± 2% -7.91% (p=0.008 n=5+5)

Awesome, looks like we saved another nanosecond which the elves are going to turn into a gift for the children!

Alright, that’s it for today. I hope this was fun and that I’ll find time to do a post about another AoC challenge soon.

Thanks for reading this!

Appendix: FAQ

I’d imagine the branch-free code above raises quite a few eye brows. To avoid any misunderstanding, let me try to answer a few questions you may have:

Are you recommending to write Go code like this?

Absolutely not. I like to use advent of code to explore and learn. Please do not write production code like this unless “You know what you are doing"™️. I definitely wouldn’t consider myself to belong to this group :).

Will this work for different data distributions?

Probably not! The code is optimized for the small sample input used by AoC to explain the problem.

Why don’t you just implement the algorithm in assembly?

That could be fun! But for this year’s advent of code, I think I’ll only use assembly if it’s in Go’s stdlib.

https://felixge.de/2021/12/03/advent-of-go-profiling-2021-day-1-1-branchless-go/
Advent of Go Profiling: Day 1-1

Tis the season to be geeky, so it’s time for Advent of Code 2021. My theme for last year was to learn some rust, but for this year I decided to focus on Go profiling.

The idea is to code up naive solutions and use Go profiling tools to optimize them. A few folks on twitter were interested in seeing the results as blog posts, so I’m hoping to write a few of these.

You can find all code and profiles in the history of this repository: github.com/felixge/advent-2021.

Disclaimer: I currently work on Continuous Go Profiling for Datadog, but I’m not barking on behalf of my employer here.

v1: Naive Solution (link to code)

The day 1-1 challenge is pretty easy to solve, so I quickly came up with a naive solution, see below.

func Answer(input string) (int, error) {
var prev struct {
val int64
set bool
}
var increases int
for _, line := range strings.Split(input, "\n") {
val, err := strconv.ParseInt(line, 10, 64)
if err != nil {
return 0, err
}
if val > prev.val && prev.set {
increases++
}
prev.set = true
prev.val = val
}
return increases, nil
}

This works, but is it fast enough? Let’s run a benchmark.

$ go test -count 5 -run '^$' -bench . -cpuprofile=v1.cpu.pprof > v1.txt
$ benchstat v1.txt
name time/op
Answer-6 325ns ± 1%

While this may not seem bad to you, it’s definitely not Christmas Scale™️, so let’s figure out what we can optimize by looking at our CPU profile as a FlameGraph:

Looks like 44% of our time is spend in strings.Split(), let’s optimize this.

v2: Optimize strings.Split() (link to code)

Looking at the FrameGraph above, notice that strings.Split() calls strings.IndexByte() which is implemented in assembly. Turns out we can call this directly to do our own newline separation:

func Answer(input string) (int, error) {
var prev struct {
val int64
set bool
}
var increases int
var line string
for len(input) > 0 {
i := strings.IndexByte(input, '\n')
if i == -1 {
line = input
input = ""
} else {
line = input[0:i]
input = input[i+1:]
}
val, err := strconv.ParseInt(line, 10, 64)
if err != nil {
return 0, err
}
if val > prev.val && prev.set {
increases++
}
prev.set = true
prev.val = val
}
return increases, nil
}

This is a bit more complicated, but let’s see what this gives us:

$ benchstat v1.txt v2.txt
name old time/op new time/op delta
Answer-6 325ns ± 1% 168ns ± 2% -48.25% (p=0.008 n=5+5)

A 48% decrease in execution time — looks like we’re off to a good start! What’s next?

Seems like strconv.ParseInt() has become our new bottleneck.

v3: Optimize strconv.ParseInt() (link to code)

Even though the elves are not stating it explicitly, they are only giving us positive integers. So we could apply the same trick as before and call strconv.ParseUint() directly instead of going through strconv.ParseInt().

However, looking at the implementation of ParseUint it’s clear that it is doing a bunch of stuff to support different bases and bit sizes. We don’t need any of this, so let’s just implement the subset we actually need:

func parseInt(val string) (int64, error) {
var intVal int64
factor := int64(1)
for i := len(val) - 1; i >= 0; i-- {
c := val[i]
if c >= '0' && c <= '9' {
intVal += int64(c-'0') * factor
} else {
return intVal, fmt.Errorf("bad int: %q", val)
}
factor *= 10
}
return intVal, nil
}

Let’s see if it was worth it:

$ benchstat v2.txt v3.txt
name old time/op new time/op delta
Answer-6 168ns ± 2% 92ns ± 3% -45.29% (p=0.016 n=5+4)

Another 45% — we might be reaching Christmas Scale™️ here after all.

Can we do even better?

The FlameGraph above doesn’t give us any obvious clues for what to try next. But we can dig deeper. For example let’s take a look a look View -> Source:

As you can see we’re spending quite a bit of time on the sub-slice operation on line 41. Is this our next clue?

v4: The Nerd Snipe

The elves are informing me that the current solution over 3x faster than v1 — which is good enough for now:

$ benchstat v1.txt v3.txt
name old time/op new time/op delta
Answer-6 325ns ± 1% 92ns ± 3% -71.68% (p=0.016 n=5+4)

However they are willing to offer a special Go Profiling Star to anybody who manages to further improve upon v3. This is where you can come in. Just send me a link to your v4 solution in this twitter thread including your benchstats v3.txt v4.txt output and I will link to it from here.

  1. Valentin Deleplace’s solution with a -87% delta against v3
  2. Gareth Lewin’s solution with a -83% delta against v3
  3. Pontus Leitzler’s solution with a -74% delta against v3
  4. @Fugiman’s solution with a -50% delta against v3

Note: The perf delta results above are self-reported. In my own testing Valentin’s solution achieves a spectacular 14 ns/op and has -52% delta against Gareth’s solution.

That’s it for today. I’ll definitely only find time for doing a few of these posts, but I’ll try to keep them coming.

https://felixge.de/2021/12/01/advent-of-go-profiling-2021-day-1-1/
CSV Boilerplate for Go

tl;dr: Here is my boilerplate for reading and writing CSV files in Go. I hope you’ll find it useful.

Every once in a while, I have to read and/or write CSV files in Go. In theory that’s easy. One just has to include encoding/csv from the standard library, and write a bit of boilerplate to map between a Go struct and the CSV records.

However, the resulting code has often left me unsatisifed for a variety of reasons:

  1. I spend too much time reinventing the wheel, playing around with different ways to structure the code.
  2. The code is annoying to maintain, e.g. changing the output column order often requires modifications in at least 4 places: the struct definition, header writing, record writing, and record reading.
  3. The result is not very robust when it comes to bad CSV input and doesn’t handle problems such as missing, unexpected, or misordered columns or headers.
  4. Error messages often end up missing important information such as row number, column name, etc..
  5. How do I write good tests for this that will be easy to maintain in the future?

One way to deal with this is to use a high level library, and you should certainly consider that. But the reality is often an ugly mess. You might find that the library of your choice doesn’t handle a particular edge case, e.g. more than one header row being present (my bank loves putting in 6 header lines). Or the error messages you’re getting are not helpful enough for your users. Or the error handling is too strict. Or the reflection based mapping is too magic. Or the performance sucks. I could go on and on, but you get the point - dealing with CSV is nasty and calling it a format is a bit of an insult to the idea of a format. From my point of view, writing a library for this problem is simply not possible without creating a horribly overengineered mess or ignoring many edge cases.

So in the end it seems like you have to chose between two suboptimal options. You can try out a bunch of libraries and use and/or fork the one closest to your needs. Or you can roll your own boilerplate for the millionth time, only to look back at it with horrible disgust shortly after.

Your choice should probably depend on the craziness of your input, and I won’t be able to make it for you. But I’ve decided that I want to make the boilerplate route a bit less annoying for myself in the future. To accomplish this I’ve created a boilerplate template that I’ll copy and paste for my future needs, adjusting it as needed. For the rest of the blog post I want to walk you through my thought process in designing it.

To begin, I decided to use real world CSV as my example, containing bits of the typical ugliness one has to deal with. I quickly found a file called cars.csv which looks like this:

Car;MPG;Cylinders;Displacement;Horsepower;Weight;Acceleration;Model;Origin
STRING;DOUBLE;INT;DOUBLE;DOUBLE;DOUBLE;DOUBLE;INT;CAT
Chevrolet Chevelle Malibu;18.0;8;307.0;130.0;3504.;12.0;70;US
Buick Skylark 320;15.0;8;350.0;165.0;3693.;11.5;70;US
Plymouth Satellite;18.0;8;318.0;150.0;3436.;11.0;70;US
...

As you can see, there are already two ugly bits to be examined here: semicolons instead of commas as a separator, and a second header line that seems to contain type information for whatever reason. Not great, not terrible, but pretty typical for what’s out there.

The goals for my boilerplate is to read and write the file format in a way that solves the five problems outlined in the beginning. It should do so without an excessive amount of abstraction, and certainly no reflection, yet as elegantly as possible. There should also be an intermediate Go struct to hold the data in memory as shown below.

type Car struct {
Car string
MPG float64
Cylinders int64
Displacement float64
Horsepower float64
Weight float64
Acceleration float64
Model int64
Origin string
}

Usually my first naive attempt is to write some code that defines the headers and struct mapping in a way that is separate from the CSV encoding/decoding itself. Here’s an example:

var carHeaders = []string{
"Car",
"MPG",
"Cylinders",
// remaining columns ...
}
func (c *Car) UnmarshalRecord(record []string) error {
if got, want := len(record), len(carHeaders); got != want {
return fmt.Errorf("bad column number: got=%d want=%d", got, want)
}
c.Car = record[0]
if c.MPG, err = strconv.ParseFloat(record[1], 64); err != nil {
return fmt.Errorf("column=%q: %w", carHeaders[1], err)
} else if c.Cylinders, err = strconv.ParseInt(record[2], 10, 64); err != nil {
return fmt.Errorf("column=%q: %w", carHeaders[2], err)
}
// remaining columns ...
 return nil
}
func (c *Car) MarshalRecord() ([]string, error) {
return []string{
c.Car,
fmt.Sprintf("%f", c.MPG),
fmt.Sprintf("%d", c.Cylinders),
// remaining columns ...
 }, nil
}

This works, but adding, removing, or reordering any of the columns requires us to modify our code in 4 to 5 different places, and as I mentioned in the beginning, we want to avoid that. So let’s try some abstraction that allows us to declaratively specify our columns and how to marshal/unmarshal them:

type carColumn struct {
Name string
Type string
UnmarshalValue func(*Car, string) error
MarshalValue func(*Car) (string, error)
}
var carColumns = []carColumn{
{
"Car",
"STRING",
func(c *Car, val string) error {
c.Car = val
return nil
},
func(c *Car) (string, error) {
return c.Car, nil
},
},
{
"MPG",
"DOUBLE",
func(c *Car, val string) (err error) {
c.MPG, err = strconv.ParseFloat(val, 64)
return
},
func(c *Car) (string, error) {
return fmt.Sprintf("%f", c.MPG), nil
},
},
{
"Cylinders",
"INT",
func(c *Car, val string) (err error) {
c.Cylinders, err = strconv.ParseInt(val, 10, 64)
return
},
func(c *Car) (string, error) {
return fmt.Sprintf("%d", c.Cylinders), nil
},
},
// remaining columns ...
}

This is of course a bit verbose, but it solves our problem. Reordering our columns is now a single cut & paste operation, and adding or removing a column has gotten a lot easier as well. This approach even gives us a convenient way to put the Type information found in the second line of the CSV file, so we can easily duplicate this quirk when writing our own CSV files later on.

We still need to update our UnmarshalRecord and MarshalRecord from earlier. The good news is that we’re unlikely to ever have to modify this code again:

func (c *Car) UnmarshalRecord(record []string) error {
if got, want := len(record), len(carColumns); got != want {
return fmt.Errorf("bad number of columns: got=%d want=%d", got, want)
}
for i, col := range carColumns {
if err := col.UnmarshalValue(c, record[i]); err != nil {
return fmt.Errorf("column=%q: %w", col.Name, err)
}
}
return nil
}
func (c *Car) MarshalRecord() ([]string, error) {
record := make([]string, len(carColumns))
for i, col := range carColumns {
val, err := col.MarshalValue(c)
if err != nil {
return nil, err
}
record[i] = val
}
return record, nil
}

Now it’s time to implement the CSV decoding itself. My boilerplate implements this via the ReadCarsCSV function that can be seen below.

const carComma = ';'
func ReadCarsCSV(r io.Reader) ([]*Car, error) {
cr := csv.NewReader(r)
cr.Comma = carComma
records, err := cr.ReadAll()
if err != nil {
return nil, err
}
cars := make([]*Car, 0, len(records))
for i, record := range records {
if got, want := len(record), len(carColumns); got != want {
return nil, fmt.Errorf("row=%d: bad number of columns: got=%d want=%d", i+1, got, want)
}
switch i {
case 0:
for i, got := range record {
if want := carColumns[i].Name; got != want {
return nil, fmt.Errorf("unexpected header column %d: got=%q want=%q", i, got, want)
}
}
case 1:
for i, got := range record {
if want := carColumns[i].Type; got != want {
return nil, fmt.Errorf("unexpected type column %d: got=%q want=%q", i, got, want)
}
}
default:
car := &Car{}
if err := car.UnmarshalRecord(record); err != nil {
return nil, fmt.Errorf("row=%d: %w", i+1, err)
}
cars = append(cars, car)
}
}
return cars, nil
}

The code above accomplishes the main task of returning all cars read from the given io.Reader, but it also validates the number of columns, as well as the header and type names found in the first two rows. Error messages should also be good, including both the row number as well the the offending column name in case something goes wrong for one of the records.

It should also be easy to modify. For example if you want to ignore the second header line during reading, just remove the code. Or instead of requiring all headers to have a fixed position, you could dynamically discover their position from the input by matching their names against the names in the carColumns slice and then reorder the elements in the record accordingly.

Or you might decide you need a streaming interface to lower memory usage and GC pressure like this:

type CarReader struct {
r io.Reader
// ...
}
func (r *CarReader) Read(c *Car) error {
// ...
}

The possibilities are endless, and you won’t find yourself backed into a corner by the choice of your library.

Next up is converting our data back to CSV. My solution to that looks like this:

func WriteCarsCSV(w io.Writer, cars []*Car) error {
cw := csv.NewWriter(w)
cw.Comma = carComma
header := make([]string, len(carColumns))
types := make([]string, len(carColumns))
for i, col := range carColumns {
header[i] = col.Name
types[i] = col.Type
}
cw.Write(header)
cw.Write(types)
for _, car := range cars {
record, err := car.MarshalRecord()
if err != nil {
return err
}
cw.Write(record)
}
cw.Flush()
return cw.Error()
}

As you can see, we take care of writing out the quirky second header line. We also save some code by not checking the errors for every cw.Write() call and handle them by returning cw.Error() at the end instead.

Now we should think about testing. I’m a big fan of the 80/20 rule when it comes to testing, so below is a test that covers the happy path very efficiently.

func TestCSVReadWriteCycle(t *testing.T) {
in := strings.TrimSpace(`
Car;MPG;Cylinders;Displacement;Horsepower;Weight;Acceleration;Model;Origin
STRING;DOUBLE;INT;DOUBLE;DOUBLE;DOUBLE;DOUBLE;INT;CAT
Chevrolet Chevelle Malibu;18.0;8;307.0;130.0;3504.;12.0;70;US
Buick Skylark 320;15.0;8;350.0;165.0;3693.;11.5;70;US
Plymouth Satellite;18.0;8;318.0;150.0;3436.;11.0;70;US
`)
wantOut := strings.TrimSpace(`
Car;MPG;Cylinders;Displacement;Horsepower;Weight;Acceleration;Model;Origin
STRING;DOUBLE;INT;DOUBLE;DOUBLE;DOUBLE;DOUBLE;INT;CAT
Chevrolet Chevelle Malibu;18.000000;8;307.000000;130.000000;3504.000000;12.000000;70;US
Buick Skylark 320;15.000000;8;350.000000;165.000000;3693.000000;11.500000;70;US
Plymouth Satellite;18.000000;8;318.000000;150.000000;3436.000000;11.000000;70;US
`)
for i := 0; i < 2; i++ {
cars, err := ReadCarsCSV(strings.NewReader(in))
if err != nil {
t.Fatal(err)
}
buf := &bytes.Buffer{}
if err := WriteCarsCSV(buf, cars); err != nil {
t.Fatal(err)
}
gotOut := strings.TrimSpace(buf.String())
if gotOut != wantOut {
t.Fatalf("\ngot:\n%s\nwant:\n%s", gotOut, wantOut)
}
in = gotOut
}
}

The test first reads the provided in CSV, and makes sure that converting it back to CSV results in wantOut which is a little different because of the way Go formats our floating point numbers. The test then proceeds to treat the output from the first iteration as the input for the second one. This makes sure that our implementation can read the original format, as well as the slightly different output it produces itself.

Of course we could also aim to reproduce the float formatting quirks from the original file. At first glance it seems like all floats are formatted with one digit after the period. However, a closer look reveals that the Weight column has a period, but skips the following digit, e.g. 3504.. I’ve decided to not go down this rabbit hole, but it should be clear that our boilerplate approach puts us in a great position for dealing with CSV quirks like this.

Depending on how serious you are about parsing CSVs, you probably also want to test a few error cases and maybe even throw some fuzzing at this. But I’ve decided the test above is good enough for my boilerplate, so you’ll have to do this part yourself.

Ok, that’s it. I’d love to hear your thoughts, especially if you have ideas for improving it further. Or let me know if you see good alternatives to the idea of copy & pasting a bunch of boilerplate. It’s not like I love it, but as far as I can tell it hits a sweet spot for working with Go in this case.

Thanks to Thorsten Ball and Tomás Senart for reviewing this post and making many valuable suggestions.

https://felixge.de/2020/09/27/csv-boilerplate-for-go/
Implementing State Machines in PostgreSQL

Finite-state machine (FSM) is a wonderful model of computation that has many practical applications. One of my favorites is turning business logic into simple FSMs. For example consider the following requirements for an order management system:

  • Orders must not ship before they are paid.
  • Orders can be canceled, but only if they haven’t shipped yet.
  • An order that has already been paid, needs to be refunded after canceling.

Informally, turning such a list of requirements into an FSM is simply the process of coming up with: a set of states an order can be in, a set of events, a transition function that maps the combination of each state and event to a new state, and an initial state.

In practice this is usually done by drawing up a directed graph that shows how the different states are connected to each other via events:

Finite-state machine for an order management system in PostgreSQL

Besides guiding our implementation, these graphs can be very useful for discussions and analysis. For example you can immediately see how there is no way for a customer to return an order after it has shipped. Additionally the process of naming the involved states and events automatically creates a precise language that can be adopted by all stakeholders.

You might have noticed that the graph above does not show all possible combinations of all states and all events. The reason is that the missing combinations implicitly lead to an error state. The complete FSM can be shown as a graph as well, but I don’t think it’s very useful in practice.

Anyway, as promised in the title of this article we’re going to implement this FSM in PostgreSQL. This may not be everybody’s cup of tea, but you might like how this approach us gives advanced analytical powers as a free by-product. Embedding this kind of logic into the database can also help protect against race conditions, but this will perhaps be the topic of a future post 1.

Let’s begin by creating an order_events table which keeps track of all events for a given order_id.

CREATE TABLE order_events (
 id serial PRIMARY KEY,
 order_id int NOT NULL,
 event text NOT NULL,
 time timestamp DEFAULT now() NOT NULL
);

Next, let’s implement the transition function, which is the heart of every FSM:

CREATE FUNCTION order_events_transition(state text, event text) RETURNS text
LANGUAGE sql AS
$$
 SELECT CASE state
 WHEN 'start' THEN
 CASE event
 WHEN 'create' THEN 'awaiting_payment'
 ELSE 'error'
 END
 WHEN 'awaiting_payment' THEN
 CASE event
 WHEN 'pay' THEN 'awaiting_shipment'
 WHEN 'cancel' THEN 'canceled'
 ELSE 'error'
 END
 WHEN 'awaiting_shipment' THEN
 CASE event
 WHEN 'cancel' THEN 'awaiting_refund'
 WHEN 'ship' THEN 'shipped'
 ELSE 'error'
 END
 WHEN 'awaiting_refund' THEN
 CASE event
 WHEN 'refund' THEN 'canceled'
 ELSE 'error'
 END
 ELSE 'error'
 END
$$;

And before we proceed, let’s test with a few examples to make sure the function is working:

SELECT state, event, order_events_transition(state, event)
FROM (VALUES
 ('start', 'create'),
 ('awaiting_payment', 'pay'),
 ('awaiting_payment', 'cancel'),
 ('awaiting_payment', 'ship')
) AS examples(state, event);
 state | event | order_events_transition
------------------+--------+-------------------------
start | create | awaiting_payment
awaiting_payment | pay | awaiting_shipment
awaiting_payment | cancel | canceled
awaiting_payment | ship | error

The above looks correct, but it’s not immediately clear how we could use this function to enforce our FSM on all rows for the same order_id in the order_events table.

The first thing we need is a good way to take a list of events and call our transition function on them recursively to determine the resulting state. There are multiple ways to accomplish this in PostgreSQL, but perhaps the most elegant is a user-defined aggregate.

In a nutshell, a user defined aggregate is a function that has an internal value (state) that is updated for each input value that is being passed into it using a state transition function. This means it fits our FSM model like a glove:

CREATE AGGREGATE order_events_fsm(text) (
 SFUNC = order_events_transition,
 STYPE = text,
 INITCOND = 'start'
);

The above defines a new aggregate called order_events_fsm which takes a text input (one of our events) and calls the order_events_transition state transition function (FSUNC) for each input along with the current state (STYPE) which is also of type text. The initial state is start (INITCOND).

A quick test shows that it works as expected:

SELECT order_events_fsm(event ORDER BY id)
FROM (VALUES
 (1, 'create'),
 (2, 'pay'),
 (3, 'cancel')
) examples(id, event);
 order_events_fsm
------------------
awaiting_refund

Now let’s use our order_events_fsm to create a BEFORE INSERT trigger for our order_events table that makes sure all events of a given order_id are valid and don’t lead to an error state. We do this using a simple plpgsql function that executes our order_events_fsm against all existing events of the current order_id plus the new event. If the final state is error, we raise an exception which causes the current transaction to roll back. It looks like this:

CREATE FUNCTION order_events_tigger_func() RETURNS trigger
LANGUAGE plpgsql AS $$
DECLARE
 new_state text;
BEGIN
 SELECT order_events_fsm(event ORDER BY id)
 FROM (
 SELECT id, event FROM order_events WHERE order_id = new.order_id
 UNION
 SELECT new.id, new.event
 ) s
 INTO new_state;

 IF new_state = 'error' THEN
 RAISE EXCEPTION 'invalid event';
 END IF;

 RETURN new;
END
$$;

CREATE TRIGGER order_events_trigger BEFORE INSERT ON order_events
FOR EACH ROW EXECUTE PROCEDURE order_events_tigger_func();

We can verify that this works by inserting a valid sequence of order events:

INSERT INTO order_events (order_id, event) VALUES
 (1, 'create'),
 (1, 'pay'),
 (1, 'ship');
INSERT 0 3

As well as an invalid sequence of events:

INSERT INTO order_events (order_id, event) VALUES
 (2, 'create'),
 (2, 'ship');
psql:orders.sql:95: ERROR: invalid event
CONTEXT: PL/pgSQL function order_events_tigger_func() line 14 at RAISE

As expected, only the first series of inserts made it into our table:

SELECT id, order_id, event FROM order_events;
 id | order_id | event
----+----------+--------
1 | 1 | create
2 | 1 | pay
3 | 1 | ship

If you’re still on the fence about embedding this kind of logic into your database, let’s see how our approach gives us advanced analytical powers as a free by-product. Let’s consider a new data set of 3 orders:

TRUNCATE order_events;
INSERT INTO order_events (order_id, event, time) VALUES
 (1, 'create', '2017-07-23 00:00:00'),
 (1, 'pay', '2017-07-23 12:00:00'),
 (1, 'ship', '2017-07-24 00:00:00'),

 (2, 'create', '2017-07-23 00:00:00'),
 (2, 'cancel', '2017-07-24 00:00:00'),

 (3, 'create', '2017-07-23 00:00:00'),
 (3, 'pay', '2017-07-24 00:00:00'),
 (3, 'cancel', '2017-07-25 00:00:00'),
 (3, 'refund', '2017-07-26 00:00:00');

Using our order_events_fsm as a window function, we can easily get the state history of a given order:

SELECT time, order_events_fsm(event) OVER (ORDER BY id)
FROM order_events
WHERE order_id = 3;
 time | order_events_fsm
---------------------+-------------------
2017-07-23 00:00:00 | awaiting_payment
2017-07-24 00:00:00 | awaiting_shipment
2017-07-25 00:00:00 | awaiting_refund
2017-07-26 00:00:00 | canceled

But we can go even further and apply our state machines to multiple orders, e.g. by using the generate_series function and a Lateral sub-query to break down the number of orders per state for each day of a given date range:

SELECT date::date, state, count(1)
FROM
 generate_series('2017-07-23'::date, '2017-07-26', '1 day') date,
 LATERAL (
 SELECT order_id, order_events_fsm(event ORDER BY id) AS state
 FROM order_events
 WHERE time < date + '1 day'::interval
 GROUP BY 1
 ) orders
GROUP BY 1, 2
ORDER BY 1, 2;
 date | state | count
------------+-------------------+-------
2017-07-23 | awaiting_payment | 2
2017-07-23 | awaiting_shipment | 1
2017-07-24 | awaiting_shipment | 1
2017-07-24 | canceled | 1
2017-07-24 | shipped | 1
2017-07-25 | awaiting_refund | 1
2017-07-25 | canceled | 1
2017-07-25 | shipped | 1
2017-07-26 | canceled | 2
2017-07-26 | shipped | 1

There you have it, a FSM implemented as a user defined aggregate in PostgreSQL providing both data integrity and advanced analytics.

That being said, your milage may vary, and embedding your business logic into your database is always a tradeoff. But if you want some reassurance: I’ve had great success in applying this approach in combination with eager materialization to implement a realtime analytics dashboard for an application with over a billion event rows.

Anyway, I’m really looking forward to feedback on this, and am more than happy to answer any questions, so please comment.

Join me at Apple: If the article above has you excited, come and join my team at Apple. We’re hiring Go and PostgreSQL developers in Shanghai, China right now. Relocation is possible, just send me an e-mail to find out more about this role.

Thanks to Thorsten Ball and Johannes Boyne for reviewing.


  1. The code in this article is immune to concurrency anomalies when using the SERIALIZABLE transaction isolation level. Alternative you could modify the trigger to aquire an exclusive lock on the order_events table. But as mentioned, this topic deserves a separate post. ↩︎

https://felixge.de/2017/07/27/implementing-state-machines-in-postgresql/
PostgreSQL operations that you can't EXPLAIN

I am currently dabbling in PostgreSQL internals in my spare time, including trying to understand the query planner and executor. It’s a deeply humbling experience, but occasionally I’m delighted to be able to answer simple questions I’m researching.

One of these questions is how PostgreSQL is executing ORDER BY clauses inside of aggregate expressions.

# CREATE TABLE foo AS
SELECT * FROM generate_series(1, 3) i;
SELECT 3
# SELECT * FROM foo;
i
---
1
2
3
(3 rows)

Consider a simple table that holds 3 rows of an int column ranging from 1 to 3:

Now let’s assume we want to convert those rows into a json array. This can be accomplished using the json_agg aggregate function:

# SELECT json_agg(i) FROM foo;
json_agg
-----------
[1, 2, 3]
(1 row)

But what if we want to order the array elements in descending order? No problem, we can simply use the order-by-clause support available to all aggregate expressions:

# SELECT json_agg(i ORDER BY i DESC) FROM foo;
json_agg
-----------
[3, 2, 1]
(1 row)

If you’re like me, you’d probably imagine the EXPLAIN output for this query to contain three nodes: a Seq Scan, a Sort, and an Aggregate. However, when we run EXPLAIN, it turns out there is no Sort node.

# EXPLAIN SELECT json_agg(i ORDER BY i DESC) FROM foo;
QUERY PLAN
-------------------------------------------------------------
Aggregate (cost=41.88..41.89 rows=1 width=32)
-> Seq Scan on foo (cost=0.00..35.50 rows=2550 width=4)
(2 rows)

In fact, the plan even remains unchanged even if we change the sort order or remove the clause entirely:

# EXPLAIN SELECT json_agg(i ORDER BY i ASC) FROM foo;
QUERY PLAN
-------------------------------------------------------------
Aggregate (cost=41.88..41.89 rows=1 width=32)
-> Seq Scan on foo (cost=0.00..35.50 rows=2550 width=4)
(2 rows)
# EXPLAIN SELECT json_agg(i) FROM foo;
QUERY PLAN
-------------------------------------------------------------
Aggregate (cost=41.88..41.89 rows=1 width=32)
-> Seq Scan on foo (cost=0.00..35.50 rows=2550 width=4)
(2 rows)

But how is this possible? If there is no Sort node, how does the aggregate perform its sorting?

Well, it turns out that aggregate plan nodes can perform their own sorting. These sorts do not explicitly show up in EXPLAIN, but we can trace them using the trace_sort developer option:

# SET trace_sort=true;
SET
# SET client_min_messages='log';
SET
# SELECT json_agg(i ORDER BY i DESC) FROM foo;
LOG: begin datum sort: workMem = 4096, randomAccess = f
LOG: performsort starting: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG: performsort done: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG: internal sort ended, 25 KB used: CPU 0.00s/0.00u sec elapsed 0.00 sec
json_agg
-----------
[3, 2, 1]
(1 row)

Another way to see that the sort is part of the plan is to use debug_print_plan option, but the output is not for the faint of heart. Specifically you need to look for the :aggorder field below:

# SET debug_print_plan=true;
SET
# SELECT json_agg(i ORDER BY i DESC) FROM foo;
LOG: plan:
DETAIL: {PLANNEDSTMT
:commandType 1
:queryId 0
:hasReturning false
:hasModifyingCTE false
:canSetTag true
:transientPlan false
:dependsOnRole false
:parallelModeNeeded false
:planTree
{AGG
:startup_cost 41.88
:total_cost 41.89
:plan_rows 1
:plan_width 32
:parallel_aware false
:plan_node_id 0
:targetlist (
{TARGETENTRY
:expr
{AGGREF
:aggfnoid 3175
:aggtype 114
:aggcollid 0
:inputcollid 0
:aggtranstype 2281
:aggargtypes (o 23)
:aggdirectargs <>
:args (
{TARGETENTRY
:expr
{VAR
:varno 65001
:varattno 1
:vartype 23
:vartypmod -1
:varcollid 0
:varlevelsup 0
:varnoold 1
:varoattno 1
:location 16
}
:resno 1
:resname <>
:ressortgroupref 1
:resorigtbl 0
:resorigcol 0
:resjunk false
}
)
:aggorder (
{SORTGROUPCLAUSE
:tleSortGroupRef 1
:eqop 96
:sortop 521
:nulls_first true
:hashable true
}
)
:aggdistinct <>
:aggfilter <>
:aggstar false
:aggvariadic false
:aggkind n
:agglevelsup 0
:aggsplit 0
:location 7
}
:resno 1
:resname json_agg
:ressortgroupref 0
:resorigtbl 0
:resorigcol 0
:resjunk false
}
)
:qual <>
:lefttree
{SEQSCAN
:startup_cost 0.00
:total_cost 35.50
:plan_rows 2550
:plan_width 4
:parallel_aware false
:plan_node_id 1
:targetlist (
{TARGETENTRY
:expr
{VAR
:varno 1
:varattno 1
:vartype 23
:vartypmod -1
:varcollid 0
:varlevelsup 0
:varnoold 1
:varoattno 1
:location -1
}
:resno 1
:resname <>
:ressortgroupref 0
:resorigtbl 0
:resorigcol 0
:resjunk false
}
)
:qual <>
:lefttree <>
:righttree <>
:initPlan <>
:extParam (b)
:allParam (b)
:scanrelid 1
}
:righttree <>
:initPlan <>
:extParam (b)
:allParam (b)
:aggstrategy 0
:aggsplit 0
:numCols 0
:grpColIdx
:grpOperators
:numGroups 1
:aggParams (b)
:groupingSets <>
:chain <>
}
:rtable (
{RTE
:alias <>
:eref
{ALIAS
:aliasname foo
:colnames ("i")
}
:rtekind 0
:relid 404407
:relkind r
:tablesample <>
:lateral false
:inh false
:inFromCl true
:requiredPerms 2
:checkAsUser 0
:selectedCols (b 9)
:insertedCols (b)
:updatedCols (b)
:securityQuals <>
}
)
:resultRelations <>
:utilityStmt <>
:subplans <>
:rewindPlanIDs (b)
:rowMarks <>
:relationOids (o 404407)
:invalItems <>
:nParamExec 0
}
LOG: begin datum sort: workMem = 4096, randomAccess = f
LOG: performsort starting: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG: performsort done: CPU 0.00s/0.00u sec elapsed 0.00 sec
LOG: internal sort ended, 25 KB used: CPU 0.00s/0.00u sec elapsed 0.00 sec
json_agg
-----------
[3, 2, 1]
(1 row)

In conlusion: While EXPLAIN is a powerful weapon for those seeking to understand query performance, you should be aware of the fact that there are things that you can’t EXPLAIN :).

The research for this article was done using PostgreSQL 9.6.3 and LLDB, but should also apply to recent versions as well as the upcoming PostgreSQL 10.

# SELECT version();
version
--------------------------------------------------------------------------------------------------------------
PostgreSQL 9.6.3 on x86_64-apple-darwin14.5.0, compiled by Apple LLVM version 7.0.0 (clang-700.1.76), 64-bit
(1 row)
https://felixge.de/2017/05/23/postgresql-operations-that-you-cant-explain/
Go + Raspberry Pi powered Alarm Clock

Invented in 1959, the snooze button is the worst invention in the world. Not only does it defeat the point of setting an alarm, it may actually leave you more tired. Unfortunately however, my zombie morning brain doesn’t care about this, and any alarm within reach has a very poor chance of actually getting me out of bed, trapping me into an open-end snooze fest.

For this reason I usually set an alarm below my loft bed, so I’m forced to climb down the stairs to disable it. This works, but sometimes I forget to set this alarm before getting into bed, and my sleepy evening brain convinces itself that setting the alarm on the phone next to me will do just fine.

And this is exactly how I missed an important early morning conference call this week.

To avoid this from happening again, I finally implement an idea I had a while ago for for turning my raspberry pi into a smartphone controlled alarm clock that I can set from my bed, but only disable by getting up. I’ve published the Go code as Rise 1.0 on GitHub, and you can see the result below:

In a way, this is latest incarnation of a previous AirPlay based project of mine, which unfortunatley suffered from a few issues causing me to abondon it. However, I feel pretty good about this implementation, and already woke up successfully with it this morning.

Given enough interest, I’d be happy to polish this up a bit further and make it easy to install on any Raspberry Pi or other Linux machine you have laying around. Just let me know what you think!

https://felixge.de/2015/03/05/go-raspberry-pi-powered-alarm-clock/
Using TCP keepalive with Go

If you have ever written some TCP socket code, you may have wondered: “What will happen to my connection if the network cable is unplugged or the remote machine crashes?”.

The short answer is: nothing. The remote end of the connection won’t be able to send a FIN packet, and the local OS will not detect that the connection is lost. So it’s up to you as the developer to address this scenario.

In Go you have several methods available to you that can help with this. Perhaps the first one to consider is the SetReadDeadline method of the net.Conn interface. Assuming that your connection is expected to receive data at a regular interval, you can simply treat a timed out read as equivalent to an io.EOF error and Close the connection. Many existing TCP protocols support this way of error handling by defining some sort of heartbeat mechanism that requires each endpoint to send PING/PONG probes at a regular interval in order to detect both networking problems, as well as service health1. Additionally such heartbeats may also help dealing with proxy servers that look for network activity to determine the health of a connection.

So if your protocol supports heartbeats, or you have the ability to add heartbeats to your own protocol, that should be your first choice for addressing the unplugged network cable scenario.

However, what happens if you have no control over the protocol, and heartbeats are not supported?

Now it’s time to learn about TCP keepalive and how to use it with Go. TCP keepalive is defined in RFC 1122, and is not part of the TCP specification itself. It can be enabled for individual connections, but MUST be turned off by default. Enabling it will cause the network stack to probe the health of an idle connection after a default duration that must be no less than two hours. The probe packet will contain no data2, and failure to reply to an individual probe MUST NOT be interpreted as a dead connection, as individual probe packets are not reliably transmitted.

Go allows you to enable TCP keepalive using net.TCPConn’s SetKeepAlive. On OSX and Linux this will cause up to 8 TCP keepalive probes to be sent at an interval of 75 seconds after a connection has been idle for 2 hours. Or in other words, Read will return an io.EOF error after 2 hours and 10 minutes (7200 + 8 * 75).

Depending on your application, that may be too long of a timeout. In this case you can call SetKeepAlivePeriod. However, this method currently behaves different for different operating systems. On OSX, it will modify the idle time before probes are being sent. On Linux however, it will modify both the idle time, as well as the interval that probes are sent at. So calling SetKeepAlivePeriod with an argument of 30 seconds will cause a total timeout of 10 minutes and 30 seconds for OSX (30 + 8 * 75), but 4 minutes and 30 seconds on Linux (30 + 8 * 30).

I found that situation rather unsatisfying, so I ended up creating a small package called tcpkeepalive that gives you more control:

kaConn, _ := tcpkeepalive.EnableKeepAlive(conn)
kaConn.SetKeepAliveIdle(30*time.Second)
kaConn.SetKeepAliveCount(4)
kaConn.SetKeepAliveInterval(5*time.Second)

Currently only Linux and OSX are supported, but I’d be happy to merge pull requests for other platforms. If there is interest from the Go core team, I’ll also try to contribute these new methods to Go itself.

Please let me know if you found this article useful, have any questions, or spotted any errors so I can correct them.

Appendix
  1. Tuning a heartbeat mechanism to detect failures early with a low false positive rate is tricky business. Checkout the ϕ Accrual Failure Detector for a statistically sound model, as well as Damian Gryski’s go-failure implementation. Unfortunately there is no way to use it with TCP keepalive that I can think of.

  2. According to RFC 1122 keepalive probes may contain a single garbage octet for compatibility with broken implementations. However, I’m not sure if this is filtered out by the OS network stack or not, please comment if you know.

https://felixge.de/2014/08/26/using-tcp-keepalive-with-go/
GoDrone - A Parrot AR Drone 2.0 Firmware written in Go

Merry Christmas (or Newtonmas if you prefer) everybody.

Today I’m very happy to release the first version of my favorite side project, GoDrone.

GoDrone is a free software alternative firmware for the Parrot AR Drone 2.0. And yes, this hopefully makes it the first robotic visualizer for Go’s garbage collector : ).

At this point the firmware is good enough to fly and provide basic attitude stabilization (using a simple complementary filter + pid controllers), so I’d really love to get feedback from any adventurous AR Drone owners. I’m providing binary installers for OSX/Linux/Windows:

http://www.godrone.io/en/latest/index.html

But you may also choose to install from source.

Depending on initial feedback, I’d love to turn GoDrone into a viable alternative to the official firmware, and breathe some fresh air into the development of robotics software. In particular I’d like to show that web technologies can rival native mobile/desktop apps in costs and UX for providing user interfaces to robots, and I’d also like to promote the idea of using high level languages for firmware development in linux powered robots.

If you’re interested, please make sure to join the mailing list / come and say hello in IRC:

http://www.godrone.io/en/latest/user/community_support.html

https://felixge.de/2013/12/25/godrone-a-parrot-ar-drone-2.0-firmware-written-in-go/
Vim Trick: Open current line on GitHub

Update: The fugitive vim plugin supports the same feature via :Gbrowse - thanks to @CrypticSwarm for the tip!

In my never-ending quest to automate my work flow, I recently came up with a neat little vim trick I’d like to share.

One of the things I do quite frequently, is pasting links to GitHub files/lines in email and chat conversations, e.g. Protocol.js#L144. So far my workflow for this has been to navigate to the GitHub repo, hit “t” to select the file, click on the line I want, and then copy the address bar location into my clipboard.

After doing this several times in a row the other day, I decided to automate it. The first piece of the puzzle was to create a git alias for determining the GitHub URL of the current repository to put into my ~/.gitconfig file:

[alias]
url =! bash -c 'git config --get remote.origin.url | sed -E "s/.+:\\(.+\\)\\.git$/https:\\\\/\\\\/github\\\\.com\\\\/\\\\1/g"'

This allows me to easily get the URL of the current repository I am in, e.g.:

$ git url
https://github.com/felixge/node-mysql

Now I can do cool things like quickly opening this URL in my browser:

$ git url | xargs open
# or
$ open `git url`

But that still requires me to manually navigate to the file / line I am currently interested in, and I’m lazy. So I came up with this key binding for my ~/.vimrc:

nnoremap <leader>o :!echo `git url`/blob/`git rev-parse --abbrev-ref HEAD`/%\#L<C-R>=line('.')<CR> \| xargs open<CR><CR>

Now I can simply press “,o” ("," is my leader key), and my browser will automatically navigate to the file/line underneath my cursor.

Feel free to adopt this for your editor / environment and let me know if you make any improvements. For example, one thing I didn’t get around to yet is opening visual selections as line ranges.

https://felixge.de/2013/08/08/vim-trick-open-current-line-on-github/
Let's Fix File Uploading

tl;dr: We are creating an open source project to provide a turnkey file uploading solution called tus. Today we are happy to show you our first resumable file uploading demo.

It’s 2013, and adding reliable file uploading to your app is still too damn hard. And if content is king, this poses a serious problem to those interested in acquiring it. In this post I’ll describe the current uploading landscape and how we are planning to fix it.

Resumable Uploads

Have you ever tried uploading a big file over an unreliable internet connection? Depending on the site you were using, the experience can be extremely frustrating. Except for a few bigger sites, network errors will force you to restart your upload from the beginning. Even worse, some sites will be unable to detect this problem, leaving it up to you to figure out why the progress bar is stuck.

The solution of course is to offer resumable file uploads, however, this is easier said than done. While there is no shortage of bits and pieces that can help you with building your own solution, there is no simple solution that you can get up and running in a few minutes.

The first thing you’ll need to figure out is a server side API. Google has published several http protocols for resumable uploading (e.g. YouTube, Google Drive), but they all rely on a non-standard http code 308 Resume Incomplete which is on collision course with upcoming standards. Amazon S3 offers multipart uploads (not to be confused with multipart/form-data), but they are hard to use and the minimum chunk size is 5MB which is far too large to be useful under bad networking conditions.

And even if you figure out the server side, you’ll still need to take care of the client side. For HTML5, your best bet is to start out with something like jQuery File Upload which has some basic support for resumable uploads, but you’ll still have a lot of tweaking ahead of you in order to make it work with your server side API. If you are also running native iOS / Android apps, you will be entirely on your own to come up with a solution.

Fuck this. It’s 2013, and you shouldn’t have to waste several days for what will only be a rudimentary solution, hard to scale, and a pain to maintain. This is where we come in with tus. Tus is an open source project that aims to provide you with a turnkey solution to all your file uploading needs. We are designing a protocol, building a server, and are working on clients for HTML5, iOS and Android. Once these are solid, we will also provide a hosted service, making it even easier for people to get going, as well as allowing us to recoup our investment in this project.

Today is a great day to have a first look, as we have just completed our first resumable file uploading demo for HTML5. The demo is pointed at a tusd instance running on EC2, and uses jQuery File Upload under the hood. What’s exciting about it, is that it not only handles network errors, but also browser crashes and user errors like closing the page. No matter what happens, a previous upload will always continue where it left off.

So check it out, and give us your feedback. We are also very interested to hear what other issues you’d like to see solved. If you are not sure about the possibilities, just read on.

A world of possibilities

Resumable file uploads are just the tip of the iceberg, and there are many more problems ripe for a solution. However, we need your help in prioritizing them, so please let us know which of the things below you would find valuable.

  • File upload acceleration: There are many situations where file uploads fail to achieve optimal performance due to latency and TCP limitations. We hope to tackle this from multiple angles: Hosting a reverse CDN to improve latency regardless of user location, utilizing multiple TCP connections to increase the effective tcp window size, and maybe in the far future implement a specialized UDP protocol.
  • Huge files: With HD cameras on the rise, uploads become bigger and bigger. A good solution should be able to handle files > 2GB with ease.
  • Streaming: A lot of files can be processed as a stream, so tus should make it easy to let you download and process an active upload.
  • Service Integrations: It should be easy to let your users select files from their Dropbox, Facebook, Google Drive or other cloud services.
  • Processing pipelines: Most files are media files and require additional processing before being consumed by other users. Tus should easily integrate with existing processing solutions such as Transloadit, Zencoder and others.
  • Scalability: It should be easy to run a cluster of tus instances, allowing you to add or remove capacity as needed.
  • Security: HTTPS is a no-brainer, but we are hoping to raise the bar by supporting file encryption on the fly for applications that can benefit from it.
  • Protocols: HTTP is not the only protocol out there. We are interested in exploring uploads via FTP, E-Mail and other commonly used protocols.

If any of the above sounds exciting to you, please let us know. For now our priority is to release a solid server along with great client libraries for HTML5, iOS and Android, but at the end of the day it will be your feedback that will drive our roadmap.

https://felixge.de/2013/03/27/lets-fix-file-uploading/
The Pull Request Hack

After writing about Open Source And Responsibility, I’d like to share a small collaboration hack I came up with last year. Here it is:

Whenever somebody sends you a pull request, give them commit access to your project.

While it may sound incredible stupid at first, using this strategy will allow you to unleash the true power of Github. Let me explain. Over the past year, I realized that I could not allocate enough time to my open source projects anymore. And I’m not even talking about fixing bugs or adding features, I’m talking about pull requests piling up.

So why wouldn’t I simply merge them? Well, a lot of them were actually not good enough. They were lacking tests or documentation, violated coding standards, or were introducing new issues the contributor had not considered. I would often explain the problems in detail, only to find that the contributor was now lacking the time to make the changes. Of course I could make those adjustments myself, but that would often take as much time as if I had done the patch myself to begin with. So after seeing this pattern play out many times over, I started to neglect most incoming pull requests that couldn’t be merged right away.

And then I came across the hack I mentioned above. I wish I could take credit for designing it, but it really happened by coincidence. Somebody sent a pull request for a project I was no longer using myself, and I could see an issue with it right away. However, since I no longer cared about the project, and the person sending the pull request did, I simply added him as a collaborator and said something like this: “I don’t have time to maintain this project anymore, so I gave you commit access to make any changes you’d like. It would be nice to follow the style used by the rest of the code and add some tests to this patch.”.

The result was pure magic. Within a few hours the same person who had just submitted a rather mediocre patch, had now fixed things up and committed them. This was highly unusual, so I started using the same strategy for a few other small projects I was no longer interested in maintaining. And it worked, over and over again. Of course, sometimes it wouldn’t make a difference, but it was clearly working a lot better than my previous approach.

Given the success for my smaller projects, I eventually decided to also try it for my two most popular projects, node-mysql and node-formidable. Initially I was very worried about giving up control over these projects, but the results speak for themselves. Both projects are now maintained by a bunch of amazing developers, writing much better code than I ever received in the form of pull requests before.

So why does it work? Well, I have a few ideas. Once people have commit access, they are no longer worried that their patch might go unmerged. They now have the power to commit it themselves, causing them to put much more work into it. Doing the actual commit/push also changes their sense of ownership. Instead of handing over a diff to somebody else, they are now part of the project, owning a small part of it.

But the magic does not stop here. In addition to their contribution quality going up, I’ve observed many people continuing to help out with issues and patches sent by other users. This is of course fueled by Github notifying every contributor on a repository of all activity on it.

So should you really do this for all pull requests? Probably not. While I’ve given a large amount of users access to various projects of mine, I’m still looking for:

  • Github profile: Does this user stand to lose a reputation by doing something stupid?
  • Skill: Based on the patch, do I think the user could be a good developer?
  • Usefulness: Is this patch solving a valid problem?

With these checks in place, I think this approach is a fantastic way to keep projects from going stale as well as turning one man projects into small communities. However, you’re obviously free to run your projects any way you’d like.

Last but not least, I’d like to thank a few of the great people who have made significant contributions to some of my projects recently:

You guys are absolutely amazing and I can’t thank you enough for all the help.

Update: Looking back at it, talking to Peter Hintjens at MixIT was definitley an inspiration for this.

Update 2: There is an interesting discussion about this on Hacker News right now, seems like others have used this strategy with success as well!

Update 3: There is even more discussions going on at Reddit.

https://felixge.de/2013/03/11/the-pull-request-hack/
Open Source And Responsibility

Today I read a comment on Github about an important feature that is missing in one of my open source libraries. I won’t link to it, but it basically said: “If you care about your library and the community, you must implement this feature”. These were not his exact words, but author of the comment was clearly demanding more gratis attention for his problem based on my responsibility to the community.

And he is not alone. Several people in the discussion agreed with him, and there are many public examples of people going as far as suggesting that open source authors have parental responsibilities for their projects.

Tom Dale has suggested that you should not “release more than you can realistically maintain”. And that “it takes maturity to realize that open source is a responsibility”.

I disagree.

Open source is not perfect. And we would be better off accepting this.

Don’t get me wrong. I have been on both sides of this table, experiencing critical problems with open source software that I was unable to fix myself, receiving no support. It sucks.

But whose fault is it? We can actually objectively answer this question. When using somebody else’s software, I need to follow the law, more specifically copyright. So I have to check for a license file and comply with it. Which usually means I’ll come across something like this:

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

This is the ugly side of open source. Nobody is responsible for the vast majority of projects.

And unfortunately simply demanding open source authors to become more responsible does not work very well. I mean you can try, but hopefully I can give you some advise that will yield better results.

In my opinion, the key to successfully leveraging open source projects, is to become a responsible consumer. By that I mean acquiring the ability to estimate the “total costs of ownership” that arise from relying on somebody’s open source project. The idea being that you “cannot consume more than you can afford”.

There are many approaches to this, but here are a few questions that help me in this process:

  • How many other people are using this software? Does it work for them?
  • What is the worst case scenario if this software fails on me?
  • Do I have the ability to debug and fix such problems?
  • What does the license say?
  • What is the quality of the code?
  • Is good documentation available?
  • Are there automated tests?
  • Is the project maintained by a community or by a single author?
  • How does the author of the software deal with bug reports?
  • What does my own Github profile look like? Will I be taken seriously?
  • Does the author accept contributions?
  • Is the author available for consulting / support?
  • Who is his employer, how does he make money?
  • What is his motivation for writing this software?
  • Could I pay somebody else to fix issues with this software?
  • Do I have enough money to pay somebody?
  • Can I easily use a modified version of this software in production?
  • Is it viable to fork this software if the author is not cooperating?
  • What features do I actually need?
  • How much would it cost to implement the features I need from scratch?

And while it’s not pleasant, this approach has led me to realize, that in some cases, I simply couldn’t afford certain software, even so it was offered to me at no charge.

Of course this process is problematic, and there has to be a better way. We need to find better incentives for open source authors to take on more responsibility, while continuing to enjoy the benefits of low marginal costs.

Luckily I have a few ideas for this, and going forward, I will explain how we are using them for our next product, and how to make money with open source as a small bootstrapped company.

However, becoming a responsible consumer will continue to pay off, so please consider it.

https://felixge.de/2013/03/07/open-source-and-responsibility/
Hello World

After writing over 300 articles between 2006 - 2009, I went on a bit of a blogging hiatus for the past three years. This was mostly because node.js took off, and being one of the first contributors sent a steady stream of speaking opportunities my way. This was a lot of fun, but at this point I am a bit sick of traveling, and I have also realized that conferences are not a good platform to share some of the things I’m currently excited about. So for 2013 I’ve decided to cut my traveling to a minimum, using this time to do a lot more writing instead.

So what will I be writing about?

Well, if you’re following me on twitter, you may have noticed, that I’ve been talking a lot about Go lately. That’s because after a weekend of fooling around with it, I’ve started using it quite a bit over the past few months, and have become very excited about it. So there are many Go related things I’d like to share here in the future, including how it compares to node.js.

Additionally Go has inspired me to come up with a new product and business model we’re currently pursuing at transloadit. This is exciting because I’ve never written much about the process we used to bootstrap our business, and our new product should be an exciting case study for anybody interested in bootstrapping software businesses without relying on external funding. You may also be interested in the fact that our new product will be 100% open source, and that I am planning to write a lot about the underlaying strategy and business model.

And for those who are curious, the product itself is going to be a dedicated file upload server called tus, so expect to read a lot about the challenges of file uploading, and how we’re tackling them with Go.

Last but not least, I’m still heavily involved with the NodeCopter movement that Robin, Tim, Thorsten, Kida, Matti and I started last year, so as time allows, I’ll write about quad copter hacking and the events we’re organizing.

So, that’s it for now, but please make sure to subscribe to new posts if you’re interested in Making Money with Open Source, Go, File Uploads or Quad Copter hacking - it should be interesting.

https://felixge.de/2013/03/03/hello-world/