Go's New Green Tea Garbage Collector May Cut GC Overhead by Up to 40%

Go 1.25 introduces a new experimental garbage collector that can reduce garbage collection overhead by up to 40% compared with the current implementation, a significant improvement for GC-heavy workloads.

The new garbage collector, called Green Tea, uses the same mark-sweep approach as the existing GC, but with a key difference: instead of operating on individual objects, it works at the memory page level. This means Green Tea scans and tracks entire pages globally, while tracking individual objects locally within each page rather than across the heap.

This all adds up to a better fit with the microarchitecture. We can now scan objects closer together with much higher probability, so there’s a better chance we can make use of our caches and avoid main memory. Likewise, per-page metadata is more likely to be in cache. Tracking pages instead of objects means work lists are smaller, and less pressure on work lists means less contention and fewer CPU stalls.

This approach greatly reduces the number of scans required to mark the whole heap, which is significant since “about 90% of the cost of the garbage collector is spent marking, and only about 10% is sweeping”, according to Go contributors Michael Knyszek and Austin Clements.
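To make the page-level idea concrete, here is a toy model in Go. It is only a sketch and not the runtime's implementation: the heap is a slice of objects, a "page" is simply a run of objsPerPage consecutive indices, and every name (object, buildChainHeap, markByObject, markByPage) is invented for illustration. Both functions mark the same set of reachable objects. What differs is what goes on the shared work list: individual objects in one case, whole pages in the other.

```go
package main

import "fmt"

const objsPerPage = 8 // hypothetical page capacity for this toy model

// object is a toy heap object; refs holds indices of the objects it points to.
type object struct{ refs []int }

// buildChainHeap returns n objects where object i points to i+1 and i+2.
func buildChainHeap(n int) []object {
	heap := make([]object, n)
	for i := range heap {
		for _, j := range []int{i + 1, i + 2} {
			if j < n {
				heap[i].refs = append(heap[i].refs, j)
			}
		}
	}
	return heap
}

// markByObject is a classic graph flood: the work list holds single objects.
func markByObject(heap []object, roots []int) (marked []bool, pushes int) {
	marked = make([]bool, len(heap))
	var work []int
	visit := func(o int) {
		if !marked[o] {
			marked[o] = true
			work = append(work, o)
			pushes++
		}
	}
	for _, r := range roots {
		visit(r)
	}
	for len(work) > 0 {
		o := work[len(work)-1]
		work = work[:len(work)-1]
		for _, ref := range heap[o].refs {
			visit(ref)
		}
	}
	return marked, pushes
}

// markByPage puts pages on the work list and tracks objects per page.
func markByPage(heap []object, roots []int) (marked []bool, pushes int) {
	nPages := (len(heap) + objsPerPage - 1) / objsPerPage
	marked = make([]bool, len(heap))   // object is known to be reachable
	scanned := make([]bool, len(heap)) // object's pointers were followed
	queued := make([]bool, nPages)     // page is currently on the work list
	var work []int
	see := func(o int) {
		if marked[o] {
			return
		}
		marked[o] = true
		if p := o / objsPerPage; !queued[p] {
			queued[p] = true
			work = append(work, p)
			pushes++
		}
	}
	for _, r := range roots {
		see(r)
	}
	for len(work) > 0 {
		p := work[0] // FIFO: pages can accumulate objects while they wait
		work = work[1:]
		queued[p] = false
		end := (p + 1) * objsPerPage
		if end > len(heap) {
			end = len(heap)
		}
		for o := p * objsPerPage; o < end; o++ {
			if marked[o] && !scanned[o] {
				scanned[o] = true
				for _, ref := range heap[o].refs {
					see(ref)
				}
			}
		}
	}
	return marked, pushes
}

func main() {
	heap := buildChainHeap(64)
	_, objPushes := markByObject(heap, []int{0})
	_, pagePushes := markByPage(heap, []int{0})
	fmt.Println("work-list pushes, object-level:", objPushes)
	fmt.Println("work-list pushes, page-level:  ", pagePushes)
}
```

In this model a page is pushed only when a newly discovered object lands on a page that is not already queued, so the page-level work list never sees more pushes than the object-level one. The layout here is deliberately favorable, with neighbors pointing to neighbors; a heap whose pointers jump between pages would narrow the gap.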

Knyszek and Clements also explain that Green Tea was developed in response to the challenges posed by modern CPU hardware, which risk making code slower rather than faster as hardware evolves. In particular, newer CPUs introduce non-uniform memory access, where a subset of cores has privileged access to a subset of memory; reduced memory bandwidth per CPU, due to more cores competing for memory access; and an increasing core count, making it harder for the GC algorithm to work in parallel.

On the other hand, advanced CPU features such as vector instructions and wide registers offer opportunities for significant speedups, provided the GC algorithm can take advantage of them, say Knyszek and Clements.

Vector hardware has long supported basic bit-wise operations on whole vector registers, but starting with AMD Zen 4 and Intel Ice Lake, it also supports a new bit vector “Swiss army knife” instruction that enables a key step of the Green Tea scanning process to be done in just a few CPU cycles. Together, these allow us to turbo-charge the Green Tea scan loop.

As mentioned, Green Tea can reduce garbage collection overhead by 10-40%, depending on the memory workload. For an application that spends 10% of its time in the garbage collector, this translates to an overall CPU reduction of 1-4%.
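The arithmetic behind that estimate is a simple product: the share of CPU time spent in the GC multiplied by the fraction of GC work saved. The short Go sketch below spells it out; the function name overallCPUSaving is made up for illustration.

```go
package main

import "fmt"

// overallCPUSaving returns the fraction of total CPU time saved when the
// garbage collector's share of CPU time (gcShare) shrinks by gcReduction.
// Both arguments and the result are fractions between 0 and 1.
func overallCPUSaving(gcShare, gcReduction float64) float64 {
	return gcShare * gcReduction
}

func main() {
	const gcShare = 0.10 // the application spends 10% of its CPU time in the GC
	for _, reduction := range []float64{0.10, 0.40} {
		saving := overallCPUSaving(gcShare, reduction)
		fmt.Printf("GC overhead reduced by %.0f%% -> overall CPU saving of %.0f%%\n",
			reduction*100, saving*100)
	}
}
```

The same formula shows why the benefit depends on how GC-heavy a program is: an application that spends only 2% of its time in the collector would save less than 1% of total CPU even in the best case.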

However, not all workloads benefit from Green Tea:

Green Tea is based on the hypothesis that we can accumulate enough objects to scan on a single page in one pass to counteract the costs of the accumulation process. […] But there are some workloads that often require us to scan only a single object per page at a time. This is potentially worse than the graph flood.

As a case in point, DoltHub, maker of the version-controlled SQL database Dolt, chose not to adopt Green Tea for production builds:

For Dolt, the Green Tea collector doesn’t make any difference in real-world performance numbers. Under the hood, it seems that there’s a small regression in mark time, but this isn’t measurable in our latency benchmarks.

Other early adopters reported that Green Tea runs GC less frequently in their memory-heavy app, but each cycle consumes more CPU. While this reduces the overall GC CPU consumption, it increases latency significantly. However, this behavior has already been fixed for the upcoming Go 1.26.

This variability in results is the main reason why the new garbage collector is not enabled by default, despite being production-ready according to the Go team. To test Green Tea with Go 1.25, you can enable it by setting GOEXPERIMENT=greenteagc at build time.