Jan 14, 2022

Sep 21, 2021

Min Read

Taming Go’s Memory Usage, or How We Avoided Rewriting Our Client in Rust

by

Mark Gritter

A couple months ago, we faced a question many young startups face. Should we rewrite our system in Rust?

At the time of the decision, we were a Go and Python shop. The tool we’re building passively watches API traffic to provide “one-click,” API-centric visibility, by analyzing the API traffic. Our users run an agent that sends API traffic data to our cloud for analysis. Our users were using us to watch more and more traffic in staging and production—and they were starting to complain about the memory usage.

This led me to spend 25 days in the depths of despair and the details of Go memory management, trying to get our memory footprint to an acceptable level. This was no easy feat, as Go is a memory-managed language with limited ability to tune garbage collection.

Spoiler: I emerged victorious and our team still uses Go. We managed to tame Go’s memory management and restore an acceptable level of memory usage.

Especially since I hadn’t found too many blog posts to guide me during this process, I thought I’d write up some of the key steps and lessons learned. I hope this blog post can be helpful to other people trying to reduce their memory footprint in Go!

How it started

The Akita command-line agent passively watches API traffic. It creates obfuscated traces in Akita’s custom protobuf format to send to the Akita cloud for further analysis, or captures HAR files for local use. The initial version of the CLI preceded my time at Akita, but I became responsible for making sure the traffic collection scaled to our users’ needs. The decision to use Go made it possible to use GoPacket, as described in Akita’s previous blog entry Programmatically Analyze Packet Captures with GoPacket. This was much easier than trying to write or adapt other TCP reassembly code. But, once we started capturing traffic from staging and production environments, instead of just manual testing and continuous integration runs, the footprint of the collection agent became much more important.

One day this past summer, we noticed that the Akita CLI, normally well-behaved while collecting packet traces, would sometimes balloon to gigabytes of memory, as measured by resident set size of the container.

Our memory spikes at the beginning of this endeavor.

We heard from our users shortly after this and the task at hand became clear: reduce the memory footprint to a predictable, stable amount. Our goal was to be similar to other collection agents such as DataDog, which we also run in our environment and could use for a comparison.

This was challenging while working within the constraints of Go. The Go runtime uses a non-generational, non-compacting, concurrent mark-and-sweep garbage collector. This style of collection avoids “stopping the world” and introducing long pauses, which is great! The Go community is justifiably proud that they have achieved a good set of design trade-offs. However, Go’s focus on simplicity means that there is only a single parameter, SetGCPercent, which controls how much larger the heap is than the live objects within it. This can be used to reduce memory overhead at the cost of greater CPU usage, or vice versa. Idiomatic usage of Go features such as slices and maps also introduce a lot of memory pressure “by default” because they’re easy to create.

When I was programming in C++, memory spikes were also a potential problem, but there were also many idiomatic ways to deal with them. For example, we could specialize memory allocations or limit particular call sites. We could benchmark different allocators, or replace one data structure by another with better memory properties. We could even change our behavior (like dropping more packets) in response to memory pressure.

I’d also helped debug similar memory problems in Java, running in the constrained environment of a storage controller. Java provides a rich ecosystem of tools for analyzing heap usage and allocation behavior on a running program. It also provides a larger set of knobs to control garbage collector behavior; the one I really missed is simply setting a maximum size. For our application, it would be acceptable to simply exit when memory usage gets too large, rather than endangering the stability of a production system by requiring a container limit to kick in.

But for my current problem, I could not give hints to the garbage collector about when or how to run. Nor could I channel all memory allocation into centralized control points. The two techniques I had are the obvious ones, but hard to carry out in progress:

Reduce the memory footprint of live objects. Objects that are actively in use cannot be garbage collected, so the first place to decrease memory usage is to reduce the size of those.
Reduce the total number of allocations performed. Go is concurrently garbage collecting as the program runs to reclaim memory that is not in use. But, Go’s design goal is to impact latency as little as possible. Not only does it take a while for Go to catch up if the rate of allocation temporarily increases, but Go deliberately lets the heap size increase so that there are no large delays waiting for memory to become available. This means that allocating lots of objects, even if they’re not all live at the same time, can cause memory usage to spike until the garbage collector can do its job.

As a case study, I’ll walk through the areas in the Akita CLI where I could apply these ideas.

Reducing the memory allocated to persistent objects

Our first profiles, using the Go heap profiler, seemed to point to an obvious culprit: the reassembly buffer.

Profile showing a bottleneck in reassembly.

As described in an earlier blog post, we use gopacket to capture and interpret network traffic. Gopacket is generally very good about avoiding excess allocations, but when TCP packets arrive out-of-order, it queues them in a reassembly buffer. The reassembly code originally allocated memory for this buffer from a “page cache” and maintained a pointer to it there, never returning memory to the garbage collector.

Our first theory is that a packet that is received by the host, but dropped by our packet capture, could cause a huge, persistent spike in memory usage. Gopacket allocated memory to hold data that has been received out-of-order; that is, the sequence numbers are ahead of where the next packet should be. HTTP can use persistent connections, so we might see megabytes, or indeed even gigabytes, of traffic while gopacket is patiently waiting for a retransmission that will never occur. This leads to both high usage immediately (because of the large amount of buffered data) and persistently (because the page cache is never freed).

We did have a timeout that forced gopacket to deliver the incomplete data, eventually. But this was set to a fairly long value, much longer than any reasonable round-trip time for real packet retransmission on a busy connection. We also had not used the settings available in gopacket to limit the maximum reassembly buffer within each stream, or the maximum “page cache” used for reassembly. This meant that the amount of memory that could be allocated had no reasonable upper limit; we were at the mercy of however fast packets arrived before our timeout.

To find a reasonable value to curb memory usage, I looked at some of the data from our system to try to estimate a per-stream limit that would be small but still large enough to handle real retransmissions. One of our incidents where memory spiked had demonstrated 3GB growth in memory usage over a period of 40 seconds, or a data rate of about 75MByte / second. A back of the envelope calculation suggested that at that data rate, we could tolerate even a 100ms round-trip time with just 7.5 MB of reassembly buffer per connection. We reconfigured gopacket to use a maximum of 4,000 “pages” per connection (each 1900 bytes, for reasons I do not understand), and a shared limit of 150,000 total pages—about 200MB.

Unfortunately, we couldn’t just use 200MB as a single global limit. The Akita CLI sets up a different gopacket reassembly stream per network interface. This allows it to process different interfaces in parallel, but our budget for memory usage has to be split up into separate limits for each interface. Gopacket doesn’t have any way to specify a page limit across different assemblers. (And, our hope that most traffic would arrive only over a single interface was quickly disproven.) So this meant that instead of having a 200MB budget to deal with actual packet loss, the actual memory available to the reassembly buffer could be as low as 20MB—enough for a few connections, but not a lot. We didn’t end up solving this problem; we dynamically divided up the 200MB equally among as many network interfaces as we were listening on.

We also upgraded to a newer version of gopacket that allocated its reassembly buffers from a sync.Pool. This class in the Go standard library acts like a free list, but its contents can be eventually reclaimed by the garbage collector. This meant that even if we did encounter a spike, the memory would eventually be reduced. But that only improves the average, not the worst-case.

Reducing these maximums got us away from those awful 5 GiB memory spikes, but we were still sometimes spiking over 1GiB. Still way too large.

Playing with the observations in DataDog for a while convinced me that these spikes were correlated with bursts of incoming API traffic.

Sidebar: Secret Insider Knowledge

To help Akita users get more control over our agent’s memory footprint, we made the network processing parameters tunable via command-line parameters not listed in our main help output. You can use --gopacket-pages to control the maximum size of the gopacket “page cache”, while --go-packet-per-conn controls the maximum number of pages a single TCP connection may use.

‍

We also expose the packet capture “stream timeout” as --stream-timeout-seconds, which controls how long we will wait, just as --go-packet-per-conn controls how much data we will accumulate.

Finally, --max-http-length controls the maximum amount of data we will attempt to capture from an HTTP request payload or response body. It defaults to 10MB.

Reducing the memory allocated to temporary objects

Since fixing the buffer situation didn’t completely solve the memory problem, I had to keep looking for places to improve our memory footprint. No other single location was keeping a lot of memory live.

In fact, even though our agent was using up to gigabytes of memory, whenever we looked at Go’s heap profile we never caught it “in the act” with more than a couple hundred MB of live objects. Go’s garbage collection strategy ensures that the total resident memory is about twice the amount occupied by all live objects—so the choice of Go is effectively doubling our costs. But our profile wasn’t ever showing us 500MB of live data, just a little bit more than 200MB in the worst cases. This suggested to me that we’d done all we could with live objects.

It was time to shift focus and look instead at total allocations. Fortunately, Go’s heap profiler automatically collects that as part of the same dump, so we can dig into that to see where we’re allocating a lot of memory, creating a backlog for the garbage collector. Here’s an example showing some obvious places to look (also available in this Gist):

Profiling the heap for reducing total allocations.

Repeated Regular Expression Compilation

One heap profile showed that 30% of allocations were under regexp.compile. We use regular expressions to recognize some data formats. The module that does this inference was recompiling those regular expressions each time it was asked to do the work:

Discovering a memory bottleneck in regular expression processing.

It was simple to move the regular expressions into module-level variables, compiled once. This meant that we were no longer allocating new objects for the regular expressions each time, reducing the number of temporary allocations.

This part of the work felt somewhat frustrating, because although nodes dropped off the allocation tree, it was harder to observe a change in end-to-end memory usage. Because we were looking for spikes in memory usage, they didn’t reliably occur on demand, and we had to use proxies like local load testing.

Visitor Contexts

The intermediate representation (IR) we use for the contents of requests and responses has a visitor framework. The very top source of memory allocation was allocating context objects within the visitor, which keep track of where in the intermediate representation the code is currently accessing. Because the visitor uses recursion, we were able to replace these using a simple preallocated stack. When we visit one level deeper in the IR, we allocate a new entry by incrementing an index to the preallocated range of context objects (and expanding it if necessary.) This converts dozens or even hundreds of allocations to just one or two.

A profile from before the change showed 27.1% of allocations coming from appendPath. One immediately after the change showed only 4.36% instead. But, although the change was large, it was not as big as I expected. Some of the memory allocation seemed to “shift” to a function that hadn’t been a major contributor before!

Switching go tool pprof to granularity=lines causes it to show line-by-line allocation counts instead of function-level totals. This helped identify a couple sources of allocation that were previously hidden within appendPath, such as creating a slice containing the entire path back to the root. Even though multiple slices can re-use the same underlying array, if there is available capacity in the shared object, it was a big win to construct these slices lazily on demand, instead of every time we switched contexts.

While these preallocations and deferred allocations had a big impact on the amount of memory allocated, as reported by the profiling, it did not seem to do a lot for the size of the spikes we observed. This suggests that the garbage collector was doing a good job reclaiming these temporary objects promptly. But making the garbage collector work less hard is still a win both in understanding the remaining problems, and in CPU overhead.

Hashing

We used deepmind/objecthash-proto to hash our intermediate representation. These hashes are used to deduplicate objects and to index unordered collections such as response fields. We had previously identified this as a source of a lot of CPU time, but it showed up as a large allocator of memory as well. We had already taken some steps to avoid re-hashing the same objects more than once, but it was still a major user of memory and CPU. Without a major redesign to our intermediate representation and on-the-wire protocol, we weren’t going to be able to avoid hashing.

There were a few major sources of allocations in the hashing library. objecthash-proto uses reflection to access the fields in protobufs, and some reflection methods allocate memory, like reflect.packEface in the profile above. Another problem was that in order to consistently hash structures, objecthash-proto creates a temporary collection of (key hash, value hash) pairs, and then sorted it by key hash. That showed up as bytes.makeSlice in the profile. And we have a lot of structures! A final annoyance is that objecthash-proto marshals every protobuf before hashing it, just to check if it’s valid. So a fair amount of memory was allocated and then immediately thrown away.

After nibbling around the edges of this problem, I decided it would be better to generate functions that did the hashing just on our structures. A great thing about objecthash-proto is that it works on any protobuf! But we didn’t need that, we just needed our intermediate representation to work. A quick prototype suggested it would be feasible to write a code generator that produced the same hashes, but did so in a more efficient way:

Precompute all the key hashes and refer to them by index. (The keys in a protobuf structure are just small integers.)
Visit the fields in a structure in the sorted order of their key hashes, so that no buffering and sorting was necessary.
Access all the fields in structures directly rather than with reflection.

All of these reduced memory usage to just the memory required for individual hash computations in the OneOfOne/xxhash library that objecthash-proto was using. For maps, we had to fall back to the original strategy of sorting the hashes, but fortunately our IR consists of relatively few maps.

This was the work that finally made a visible difference in the agent’s behavior under load.

Now the allocation profile showed mainly “useful” work we weren’t going to be able to avoid: allocating space for packets as they came in.

Temporary storage for decompressed data

We weren’t quite done. During this whole process, what I really wanted the heap profile to tell me is “what got allocated just before Go increased the size of its heap?” Then I would have a better idea what was causing additional memory to be used, not just which objects were live afterwards. Most of the time, it was not new “permanent” objects that caused an increase, but allocation of temporary objects. To help answer this question and identify those transient allocations, I collected heap profiles from an agent in our production environment every 90 seconds, using the HTTP interface to the Go profiler.

Then, when I saw a spike in memory usage, I could go back to the corresponding pair of traces just before and just after. I could look at the allocations done within that 90 seconds and see what differed from the steady state. The pprof tool lets you take the difference between one trace and another, simplifying this analysis. That turned up one more place that needed to be limited in its memory usage:

This shows that 200 MB—as large as our whole maximum reassembly buffer—was being allocated within just 90 seconds! I looked at the backtrace from io.ReadAll and discovered that the reason for the allocation was a buffer holding decompressed data, before feeding it to a parser. This was a bit surprising as I had already limited the maximum size of an HTTP request or response to 10MB. But that limit counted compressed size, not uncompressed size. We were temporarily allocating large amounts of memory for the uncompressed version of an HTTP response.

This prompted two different sets of improvements:

For data that we cared about, use a Reader rather than a []byte to move data around. The JSON and YAML parsers both accept a Reader, so the output of the decompression could be fed directly into the parser, without any extra buffer.
For data that we weren’t going to be able to fully parse anyway, we imposed a limit on the decompressed size. (Akita attempts to determine whether a textual payload can be parsed into a recognized format, but the amount of data we need to do that is small.)

Should we have switched to Rust instead?

While these improvements may seem obvious in hindsight, there were definitely times during the Great Memory Reduction when the team and I considered rewriting the system in Rust, a language that gives you complete control over memory.

Our stance on the Rust rewrite had been as follows:

🦀 PRO-REWRITE: Rust has manual memory management, so we would avoid the problem of having to wrestle with a garbage collector because we would just deallocate unused memory ourselves, or more carefully be able to engineer the response to increased load.
🦀 PRO-REWRITE: Rust is very popular among hip programmers and it seems many startup-inclined developers want to join Rust-based startups.
🦀 PRO-REWRITE: Rust is a well-designed language with nice features and great error messages. People seem to complain less about the ergonomics of Rust than the ergonomics of Go.
🛑 ANTI-REWRITE: Rust has manual memory management, which means that whenever we’re writing code we’ll have to take the time to manage memory ourselves.
🛑 ANTI-REWRITE: Our code base is already written in Go! A rewrite would set us back several person-weeks, if not several person-months, of engineering time.
🛑 ANTI-REWRITE: Rust has a higher learning curve than Go, so it would take the team (and, likely, new team members) more time to get up to speed.

And for people who thought I was joking about startups commonly facing the decision to rewrite in Rust, the Rust Rewrite Phenomenon is very real. See here:

Jean recounting an actual conversation we had on our team, with proof of Rust's popularity.

And I will not name names, but if you look closely at startup job postings and even blog posts, you’ll see the “Rust rewrite” posts. 🦀

At the end of the day, I am glad I was able to get the memory footprint in Go to a reasonable level, so that we could focus on building new functionality, instead of spending a bunch of time learning a new language and porting existing functionality. If our agent had initially been written in Python rather than Go this may have been a different story, but Go is sufficiently low-level that I don’t anticipate there being major issues with continuing to develop in it.

How it’s going

Today we’re able to ingest data from our own often-busy production environment while keeping the Akita CLI memory usage low. In our production environment, the 99th percentile memory footprint is below 200MB, and the 99.9th percentile footprint is below 280MB. We’ve avoided having to rewrite our system in Rust. And we haven’t had complaints in over a month.

While these improvements are very specific to the Akita CLI agent, the lessons learned are not:

Reduce fixed overhead. Go’s garbage collection ensures that you pay for each live byte with another byte of system memory. Keeping fixed overhead low will reduce resident set size.
Profile allocation, not just live data. This reveals what is making the Go garbage collector perform work, and spikes in memory usage are usually due to increased activity at those sites.
Stream, don’t buffer. It’s a common mistake to collect the output of one phase of processing before going on to the next. But this can lead to an allocation that is duplicative of the memory allocations you must already make for the finished result, and maybe cannot be freed until the entire pipeline is done.
Replace frequent, small allocations by a longer-lived one covering the entire workflow. The result is not very idiomatically Go-like, but can have a huge impact.
Avoid generic libraries that come with unpredictable memory costs. Go’s reflection capabilities are great and let you build powerful tools. But, using them often results in costs that are not easy to pin down or control. Idioms as simple as passing in a slice rather than a fixed-sized array can have performance and memory costs. Fortunately, Go code is very easy to generate using the standard library’s go/ast and go/format packages.

While the results we achieved are not as good as a complete rewrite in a language that lets us account for every byte, they are a huge improvement over the previous behavior. We feel that careful attention to memory usage is an important skill for systems programming, even in garbage-collected languages.

And if working on hard systems problems like these sounds fun to you, come work with me at Akita. We’re hiring!

‍

Photo by Jaime Spaniol on Unsplash.