April 7, 2020
At Channable, we process several billion product data records for our customers every day according to user-customizable rules. Internally, this work is subdivided into jobs, where one job takes a set of products and the rules for how to process them as input, and returns the transformed set of products as output. Those datasets range from a few dozen to tens of millions of products in size. The application responsible for transforming the product data is written in Haskell1, and we have been using it in production for about a year now.
Our experience with Haskell has been overwhelmingly positive so far, but it required some effort to achieve the performance goals that we aimed for. Previously, we have shown how we optimized our Haskell implementation of Aho-Corasick to achieve run times comparable to a Rust implementation. This time, we are going to describe our journey of getting Haskell garbage collection times under control when dealing with heap sizes of up to 100 GB per instance. Ultimately, we managed to reduce the garbage collection time while parsing 320,000 product rows from 68s to 3s, reducing the total parsing time from 110s to 46s.
The Haskell runtime system employs a generational garbage collector (GC) with two generations2. Generations are numbered starting with the youngest generation at zero. Values are always allocated in a special part of the youngest generation called the nursery. During collections, values are promoted to the next higher generation if they are still used by then. The youngest generation is divided into two steps: values remain in the youngest generation for one additional step before getting promoted, called aging.
Garbage collections are performed using a copying algorithm: starting at the so called roots (values referenced from the stack(s) and global variables), the heap is scanned for live data, and every value that is found is copied to another heap. The target heap can either be the next generation (when generation 0 is collected), or a newly allocated heap for the same generation (when generation 1 is collected, or aging within generation 0). Once this process is done, the old heaps are deallocated, and any value that hasn't been copied (because it wasn't reachable) with them.
A drawback of a copying strategy is that it requires double the memory of what you need for the live data, because the old heap can only be deallocated once everything has been copied over. Therefore, for the oldest generation only, a mark-compact algorithm is used when there is a lot of live data in order to reduce memory consumption. The compacting strategy moves the live data within the same heap.
Both algorithms have in common that they traverse the data that is live by chasing any pointers they encounter. Since by definition, there are no reachable pointers to dead data (= the garbage), the run time of these algorithms is proportional to the amount of live data at the time of garbage collection.
Moreover, the more memory is allocated by the process, the more garbage collections are going to happen. In our case, we're retrieving product records from a database, parsing the rows and storing them in an in-memory cache. Hence, a lot of allocations were long lived, which meant we were also getting a lot of generation 1 collections.
The combination of those two factors resulted in a quadratic slowdown. When doubling the amount of product data we would retrieve from the database, we would get about twice as many garbage collections, and each garbage collection would take about twice as long on average.
The following graph demonstrates this problem. On the X-axis we have the number of products parsed, and on the Y-axis we have the number of wall clock seconds spent in GC.
This problem was exacerbated by the fact that a single instance of our application is responsible for evaluating multiple jobs in parallel. The Haskell garbage collector is implemented in a stop-the-world fashion3: no user threads can run while the garbage collector is active. A single thread holding a lot of live data can therefore severely impact the performance of all other threads.
In an effort to reduce the cost of large heaps, compact regions have been added to GHC in version 8.2.14. A compact region is, essentially, a separate manually managed heap where objects can only point to other objects in the same compact region, and they cannot be freed individually. Instead, the whole compact region can only be deallocated at once.
This allows the garbage collector to treat the whole region as opaque. Once it follows a pointer into the compact region, it can treat the whole region as live and stop scanning.
The illustration above shows a few values v0 … v3 in the regular heap and a few values c0 … c5 located inside a compact region. The edges between nodes indicate pointers from one value to another. During a collection, the GC starts at the roots (the thread stacks and global variables), and transitively follows all pointers until it discovered the whole graph of live values. However, it will not follow the dotted edges inside the compact region. The whole region is considered live once the GC reaches just one of the values stored inside it. As a consequence, value c4 stays around even though there are no direct references to it. This is in contrast to v0 which also has no pointers towards it and will get collected.
By trying to keep as much data in compact regions as possible, we could drastically reduce the time spent in GC of certain operations.
Unfortunately, getting things into compact regions efficiently is not always easy. Due to the nature of the data we work with, there are a lot of duplicated values (for instance, a lot of products share the same brand). When parsing the data, we ensure that these duplicates are loaded into memory only once, and then shared across all use sites. This is illustrated by the following image, where we have two rows of product data that have different titles but the same brand. The arrows indicate pointers.
This approach saves a lot of memory, meaning that we can fit more data into our caches living in the application heap, making the latency for our clients shorter. The problems start when trying to preserve this sharing when copying values into a compact region.
ghc-compact package contains two functions that sound like what we want:
compactWithSharing, which copies an object graph into a new compact region, preserving all its internal sharing, and
compactAddWithSharing, which does the same for an existing compact region.
In practice, this sharing-preservation mechanism has some serious limitations that prevented us from using it:
In order to be able to use
compactWithSharing, the whole object graph must first be built outside of the compact region. This means that until it can finally be copied into the compact region, it will still incur garbage collection costs.
A small example demonstrating this is
data Foo = Foo String data Bar = Bar Foo Foo Foo let a = Foo "some foo" let b = Foo "another foo" let complex = (Bar a a b, Bar a b b) -- This will replicate the object graph in the compact region, -- in particular, it will contain `a` and `b` only once. compacted <- compactWithSharing complex -- Compare this to the following, which would contain `a` and `b` -- each three times. compacted' <- compact complex
And while it is possible to use laziness to our advantage while building up the object graph while it is being added, this approach falls short when combined with e.g. IO effects that are required for building it.
It might be an idea to instead add the data incrementally to the compact region, as it becomes available, but
compactAddWithSharing only preserves existing sharing within its argument. It does not consider objects that were previously added:
region <- compact () -- Create an empty compact region -- This will copy `a` once and `b` once into the compact region compactAddWithSharing region (Bar a a b) -- This will again copy `a` once and `b` once into the compact region compactAddWithSharing region (Bar a b b) -- Now `region` contains two copies of each of `a` and `b`, -- instead of just one.
In order to preserve sharing, the runtime system internally allocates a hash table that maps the original addresses of objects to their new addresses inside the compact region. Before copying an object, it will check whether it was already added earlier. However, the addresses outside of compact regions may (and likely will) change during a garbage collection. The hash table therefore needs to be rewritten during a GC, incurring a severe performance penalty. The documentation notes this as being "typically about 10x slower" than the non-sharing preserving variants.
Not all hope is lost, however. When
compactAdd is used to add an object to a compact region and the object contains references to other objects that are already in the same compact region, it will not copy those other objects again. In other words, if we carefully craft the objects with sharing in mind, and add them in the right order to the compact region, we get both sharing and fast garbage collections. We can rewrite our small example from above as follows to exploit this:
let a = Foo "some foo" let b = Foo "another foo" region <- compact () -- First add `a` and `b` to the compact region compactA <- getCompact <$> compactAdd region a compactB <- getCompact <$> compactAdd region b -- `compactA` and `compactB` have the same type as `a` and `b`, -- but they refer to objects inside the compact region. -- This will not copy `compactA` nor `compactB`, because -- both are already present in `region`. compactAdd region (Bar compactA compactA compactB) -- This will again not copy `compactA` nor `compactB` compactAdd region (Bar compactA compactB compactB)
Now that we have a way of preserving sharing while adding things into a compact region fast, what is left to do is getting the sharing in the first place while reading external data.
Our approach is similar to string interning. During parsing, we maintain a hash table mapping raw (unparsed) values to their parsed representation that is stored in the compact region. For each raw value, we do a lookup in a hash table. If it is not present, it means we never saw this value before. We parse it, then add the parsed value to the compact region, and insert the compacted value into the hash table. If it is present, we retrieve the already compacted value from the hash table.
It is important to note that while the values stored in the hash table are living in a compact region, the hash table itself doesn't. This is because we need to mutate the hash table, something that is not possible for things inside compact regions. And since we no longer need the hash table after parsing, it is also desirable that it is still garbage collectable.
Alas, now we're left with a huge object living outside of a compact region again. Even though all the values of the hash table live inside a compact region, the hash table containing the references to those values lives on the regular heap. The garbage collector still has to traverse the millions of references in the hash table. We can achieve a small additional win by storing the raw values used as keys in the hash table in their own compact region, but the general problem remains. Seems like we're stuck, or are we?
Compact regions are pinned in memory. The garbage collector won't move them, or objects inside them. What if, instead of having a hash table that is backed by boxed vectors5 (as they contain just regular Haskell values), we would use a hash table that is backed by unboxed vectors containing the raw pointers to the keys and values?
Then the garbage collector would see the hash table as one big blob of bytes, and wouldn't traverse it. Luckily, GHC provides two primitives that allow roundtripping between Haskell values and raw pointers,
anyToAddr#, that allow us to do just that.
This approach comes with a very big caveat, and that is that the garbage collector no longer knows that the compact region where we're storing things into is actually live, unless we still keep a reference to it somewhere else. That in turn would mean that it potentially gets collected prematurely and all our raw addresses would become dangling pointers.
Instead of relying on conventions for ensuring the liveness of the compact region, we built a safe interface for working with raw pointers into compact regions that guarantees that the region is live when the pointer is read.
The basic idea is based on Ghosts of Departed Proofs. We tie the raw pointers to the compact region they are pointing to with a phantom type parameter (akin to the
newtype Region p = Region (Compact ()) newtype Entry p v = Entry Addr
Entry may only be read from a
Region if their
p parameters match. The
p is the proof that the entry points into that region, and not into any other region. The phantom type
v indicates the type of the value that the address points to. We need to keep it around for safely coercing between addresses and values.
The only way to obtain a
Region is in the callback to the following function (comparable to
withRegion :: Compact () -> (forall p. Region p -> r) -> r
The universally quantified
p in the type of
withRegion ensures that the region given to the callback cannot be used with an
Entry that came from another region, as their phantom types would not match.
The only way of obtaining an
Entry is by adding an object to the compact region.
addEntry :: Region p -> v -> IO (Entry p v)
compactAdd on the wrapped region, thereby guaranteeing that the object actually lives in that region, and it will never move. When dereferencing an
Entry, we require the caller to also pass the region as a proof that it is still live.
getEntry :: Region p -> Entry p v -> IO v
Technically, we don't need the region for calling the
addrToAny# primitive, the address alone is enough. But by pretending to need the region inside using
touch#, after converting the address back, we ensure that the compact region remains live from the GCs point of view.
Putting this all together, with a custom hash table implementation designed to work with the
Entry type for keys and values, we were finally able to get rid of the quadratic run time behavior we observed earlier.
As a result, the application became much more responsive for interactive users as their requests were no longer interrupted as long and often.
Mutability is another cause for slow garbage collections. Without mutability, it would be guaranteed that any object could only point to older objects, not the other way around, because for pointing to an object, it first has to exist. This means that we would only need to look at the objects in the same or younger generations for determining what is live in a given generation, and the GC could stop following pointers when it encounters an object in the older generation.
This changes when mutability enters the playing field. Either directly via e.g.
IORef, but also indirectly through laziness (a thunk that gets evaluated only later). Now we can easily make an object point to a younger object. This is problematic in a generational GC: Whether or not an object in generation 0 (the youngest) is live suddenly depends on (some) objects in generation 1.
In the Haskell runtime system, this is solved by keeping track of which objects were mutated since the last GC in the so called mut list. Those objects are taken as additional roots for the generation 0 collection.
If there are just a few of these, it hardly matters. However, if you have a lot of mutable vectors, it will additionally traverse all these vectors. And while there is a mechanism for skipping the parts of those vector that were not modified, called card tables, where "cards" of 128 elements are marked as clean or dirty6, it still adds considerable load to garbage collections.
The following image illustrates the effects on generation 0 garbage collections. It shows a situation where a mutable vector
Vec has been promoted to generation 1 already, and was updated to point to a freshly allocated value
B. Without the mut list, the GC would only see that
D are live (as it won't look into gen 1 when collecting gen 0). Due to the mutation, it now has to look at all elements from the modified card (for simplicity of size 4, not 128) of the vector. Following those pointers, it will then also find that
B is live.
For certain operations, we used to represent every row of data as one mutable vector. Every single mutation that was performed during the operation caused 128 other entries to be scanned during the next GC. However, most elements were never mutated, they just needed to be read from.
By splitting up the data into pairs of a large immutable array and a small mutable array, we could reduce the overall runtime by another few percent due to the reduced GC time. And since most, if not all, of those mutable arrays have fewer than 128 elements, we can also save the additional card table lookups by using the
SmallMutableArray# primitive type. It was introduced in GHC 7.10, and works just as a regular array, just without the card table optimization. The
primitive package provides a convenient to use wrapper around the primitive operations.
Generally, reducing overall garbage collection time is all about reducing the number of pointers the GC has to follow. This reduction can be achieved in a number of (combinable) ways:
It is safe to say that without compact regions, we would have had a harder time making Haskell work well for our use case while still writing idiomatic Haskell code. Using values from compact regions is no different from using regular Haskell values, only getting values into the compact regions quickly required some extra effort.
Under the hood, GHC Haskell also provides a surprisingly rich set of primitives for low-level operations one would rather expect from a system programming language such as C or Rust. These primitives allowed us to easily circumvent certain limitations using unsafe escape hatches. At the same time, the strong type system made it possible for us to build a safe wrapper around our potentially unsafe workarounds. It is thus possible to write really fast code in Haskell, while still having full type safety and code that is readable and maintainable.
3: In GHC 8.10, an alternative garbage collection strategy was introduced that allows part of it to run concurrently with user threads. At the time of writing, we weren't able to test it yet because not all of our dependencies were compatible with that version. ↩
Are you interested in working at Channable? Check out our vacancy page to see if we have an open position that suits you!Apply now