Tech

So Long Surrogates: How we moved to UTF-8 in Haskell

April 26, 2022

We released a blazingly fast Aho-Corasick implementation, written in Haskell, in 2019. This implementation was based on UTF-16 strings, since Haskell's text library uses that for its internal string representation. However, the most recent major update of text changed its internal string representation from UTF-16 to UTF-8. This is good news for us, since most of our customer’s data is ASCII, this update will cut our memory consumption in half. The big problem though is that our highly optimized string search library alfred-margaret assumes that its input is encoded as UTF-16 and uses that assumption to cut a few corners to improve performance. In this post we will illustrate the challenges we encountered implementing UTF-8 support in alfred-margaret and also give some insights into how we optimized our Haskell code for maximal performance.

Unicode Terms

Before we get started, we will have to specify some of the terms used in this post. As programmers, we usually use the term character differently than our users. Unicode provides us with terms that allow us to be more specific: User-perceived characters may be represented as multiple abstract characters. A visual example for this is the user-perceived character ✌🏿, consisting of the following two abstract characters:

  • U+270C VICTORY HAND
  • U+1F3FF EMOJI MODIFIER FITZPATRICK TYPE-6 🏿

Similarly, the user-perceived character “x̣” consists of these abstract characters:

  • U+0078 LATIN SMALL LETTER X x
  • U+0323 COMBINING DOT BELOW ◌̣1

When writing programs, we usually work with abstract characters and leave the user-perceived characters to the user’s GUI/font rendering system (of course there are counterexamples, e.g. the character count of a tweet). Abstract characters include simple letters, but also combining marks such as U+0323 COMBINING DOT BELOW above and control characters such as U+000A LINE FEED (LF)*. A code point is a value in the Unicode code space which ranges from 0 to 0x10FFFF. Unicode 14.0.0 (from September 2021) maps 144,697 of those 1,114,112 code points to abstract characters. We will use the notation U+XXXX for code points, where XXXX is a hexadecimal number. In order to avoid confusion, we will use the term code point in this post to refer to abstract characters. For further information, you can check out the (surprisingly readable) Unicode standard itself or the website/manifest UTF-8 Everywhere, which has a section discussing the different concepts one might associate with the term "character".

Other terms defined in Unicode that we will use are:

  • Rules for case conversion, mapping code points to their lowercase or uppercase version.
  • Rules for case folding, mapping code points to a specific case for case-insensitive comparison. Mostly but not fully congruent to lowercasing, e.g.:
    • U+00DF LATIN SMALL LETTER SHARP S: toCasefold(ß) = ss ≠ ß = toLowercase(ß)
    • U+AB7B CHEROKEE LETTER GU: toCaseFold(Ꭻ) = Ꭻ ≠ ꭻ = toLowercase(Ꭻ)
  • Encoding forms, the general term for UTF-8, UTF-16 and UTF-32, which map code point sequences to byte sequences.

Moving on from UTF-16

In Channable’s Haskell codebase, we use the text package for string operations. This package provides a Text type, which uses the UTF-16 encoding form to store Unicode strings in memory. For our use case, Text is much more efficient than Haskell’s String, which is a linked list of boxed 32-bit integers.

In its recent update to version 2.0, the Text type’s encoding was changed to UTF-8. This change has multiple advantages for us:

  • When we load serialized customer data (e.g. from a Postgres database or JSON/CBOR files), we have to decode its UTF-8 strings and encode them again into UTF-16 Text values. This is a lot of overhead for what should essentially be a no-op.
  • Most of our customer data (more than 99% of our benchmark data set used for the first alfred-margaret blog post) uses characters from the Unicode block Basic Latin. This block corresponds to the 128 ASCII characters and can be encoded using just a single byte per character.

The second point is more important: It means that compared to UTF-16, where every code point takes at least two bytes to encode, we save a lot of memory! Consider this example:

The string beeblebröx, encoded using UTF-16 LE and UTF-8

The diagrams show the string "Beeblebröx" encoded using two different encoding forms: UTF-16 and UTF-8. The lowest row of both diagrams shows the code units used by the encoding in form. In UTF-16, a code unit is 16 bits; in UTF-8, it's 8 bits. For UTF-16, we have to keep in mind that endianness is important since code units span two bytes. Text uses UTF-16 Little-Endian, meaning that the least significant byte goes first. The second row in the UTF-16 diagram shows the actual bytes of the encoded string.

This small example already shows an important point: Code points in the block Basic Latin (from U+0000 to U+007F) only take up a single byte of memory in UTF-8 and two bytes in UTF-16. In total, "Beeblebröx" uses 20 bytes in UTF-16 and only 11 bytes in UTF-8.

U+00F6 LATIN SMALL LETTER O WITH DIAERESIS (the letter ö) is part of the second most used block Latin-1 Supplement (from U+0080 to U+00FF). UTF-8 encodes this block as two bytes meaning we don’t gain or lose anything here compared to UTF-16. Less than 0.1% of the benchmark data set needs more bytes using UTF-8 than using UTF-16. Compared to the memory saved for Basic Latin code points, their contribution is tiny. In total, our benchmark data set weighs 3.08 GiB using UTF-16 and 1.55 GiB using UTF-8 (this excludes the 16 bytes used to store offset and length of the byte slice in each case). This means we save almost exactly 50% of memory for these strings by simply using another encoding.

Variable-width Encoding

When working with Text values, we usually just import the Data.Text module. This module acts as an interface which hides implementation details such as the encoding form and provides a view onto the string as a sequence of Char, which is Haskell's built-in Unicode code point type. The upside of this abstraction is that it makes it easy for us to update to text-2.0 even though the underlying encoding changes from UTF-16 to UTF-8.

However, using this interface exclusively also has a downside: If the encoding used for these Chars is variable-width, meaning that not all Chars encode to the same number of bytes, random access into a Text has a linear complexity. This example shows that both UTF-16 and UTF-8 are variable-width encodings:

The string hello×✌🏿, encoded using UTF-16 LE and UTF-8

In UTF-16, all valid code points in the range U+0000 to U+FFFF are encoded as a single code unit consisting of two bytes. Code points outside that range such as the last one, U+1F3FF EMOJI MODIFIER FITZPATRICK TYPE-6, need two code units and therefore take up four bytes. In UTF-8, a code point can be encoded as one, two, three or four single-byte code units, e.g. U+006F LATIN SMALL LETTER O, U+00D7 MULTIPLICATION SIGN, U+270C VICTORY HAND and U+1F3FF EMOJI MODIFIER FITZPATRICK TYPE-6, respectively.

Regardless of whether we use UTF-8 or UTF-16, in order to find the nth code point in a string, we need to do a linear search from its start. For example, in order to find out that the 8th code point in the UTF-8 string begins at byte 10, we need to iterate through "hello×" first and count the number of bytes for each code point.

Interlude: Aho-Corasick

In a previous article, we discuss our fast Haskell implementation of the Aho-Corasick string search algorithm. If you’ve already read that article, you can skip this section.

In order to conceptualize string search, we use the "Needle in a Haystack" idiom. We are searching for all occurrences of one or more strings (the needles) in another string (the haystack). The use case where Aho-Corasick shines is searching different haystacks for the same set of multiple needles of which the matches possibly overlap. In order to do this, we construct a finite state automaton where each state represents the progress we’ve made to a full match of one of the needles. Feeding this automaton the input string code point by code point is faster than simply searching for each needle separately because we only need to scrutinize each code point of the haystack once.

Consider this example matching the needles "shirts", "shorts" and "tshirt":

An Aho-Corasick automaton matching the needles shirts, shorts and tshirt

Suppose our haystack is "tshorts". Reading the code points "t", "s" and "h", we reach state 3. Because the next code point, "o", doesn’t match any transitions from state 3, we fall back to state 8, where there is a transition for "o", leading us to state 13. Reading the remaining code points "r", "t" and "s" we end up in state 16, having reached a match for the needle "shorts". This small example demonstrates that Aho-Corasick enables us to switch between needles without starting the search over. For the following sections, we assume that the code points used for transitions are in lowercase.

UTF-16 in alfred-margaret

For our implementation of the Aho-Corasick algorithm in alfred-margaret we went past the external Char-based Text interface in order to be able to iterate through the input efficiently2. Instead, we accessed the underlying UTF-16 encoded byte array directly, viewing it as a 16-bit code unit sequence.

For an exact string search, comparing strings on the code unit level is sufficient. However, in order to power all necessary features in Channable, alfred-margaret provides functionality to not only match strings exactly, but also case-insensitively. A common approach for this type of problem is to simply run the case-sensitive search algorithm on a lowercase or case folded copy of the input. As we don’t want to keep what is essentially two copies of the same string in memory, our implementation lowercases each code point individually while looping through the input.

In order to do this, we first need to decode code points from our code unit sequence, because Unicode case conversions and case foldings are defined on code points, not code units. In UTF-16, the first 216 code points are encoded as a single 16-bit code unit containing the code point itself, shown on the left side of this diagram:

Visualisation of the UTF-16 encoding form

Other code points are encoded as two code units called a surrogate pair, shown on the right side. In order to tell apart a code unit that encodes a code point and the beginning of a surrogate pair (the high surrogate), Unicode reserves a range of code points that are to be used for surrogates only, namely U+D800 to U+DFFF (in other words, the code units with the prefix 0b11011). Surrogates may never appear outside a surrogate pair: A high surrogate must always be followed by a low surrogate. As an interesting aside, this means that UTF-16 can not actually encode all Unicode code points, while UTF-8 and UTF-32 can. However, such Unicode strings are considered ill-formed (see for example Unicode Section 3.9 D92).

A very useful property of UTF-16 is the fact that the code points which can be encoded as a single code unit (the first 216 code points) coincide with the Basic Multilingual Plane (BMP, from U+0000 to U+FFFF) of Unicode. According to the introduction of the Unicode standard, the BMP encodes the "majority of the common characters used in the major languages of the world". Since the bulk of our customers' data falls into the BMP, we can get away with ignoring surrogates and lowercasing BMP code points only. This saves us the overhead of decoding multiple code units at a time. While it may seem like a questionable hack, the Unicode standard actually acknowledges this approach in Section 5.4 "Handling Surrogate Pairs in UTF-16":

UTF-16 enjoys a beneficial frequency distribution in that, for the majority of all text data, surrogate pairs will be very rare; non-surrogate code points, by contrast, will be very common. Not only does this help to limit the performance penalty incurred when handling a variable-width encoding, but it also allows many processes either to take no specific action for surrogates or to handle surrogate pairs with existing mechanisms that are already needed to handle character sequences.

Reflecting these facts we chose to treat the input as a sequence of fixed-width 16-bit code points, skipping surrogates. Consider how we would step the Aho-Corasick automation using as input the code unit sequence of the string "FOÖBAR🙈":

A showcase of the way alfred-margaret lowercases UTF-16 strings

For each code unit, we determine whether it is a surrogate by checking if it is in the range U+D800 to U+DFFF. If it isn’t, we interpret it as a code point and use its lowercase version to step the automaton (black arrows). For example, U+0046 LATIN CAPITAL LETTER F lowercases to U+0066 LATIN SMALL LETTER F. If the code unit is a surrogate, as in the case of the two code units U+D83D and U+DB48 constituting U+1F648 SEE-NO-EVIL MONKEY, we simply use it to step the automaton directly (gray arrows).

While the point of this post is not to explain our Aho-Corasick implementation, this should give you some context for the considerations we made when we updated alfred-margaret to work with UTF-8 strings. As opposed to most of our codebase, where we view a Text value as an opaque sequence of Chars, alfred-margaret depends on the encoding used for the internal byte slice.

On-the-fly Lowercasing in UTF-8

So what’s the problem with switching to UTF-8? In a nutshell, using UTF-8 means that we can’t treat the input as a sequence of fixed-width BMP code points any more. Consider this example, where only code points consisting of a single code unit are converted to lowercase in the UTF-8-encoded string "FOÖBAR🙈":

A showcase of the way alfred-margaret does not lowercase UTF-8 strings

This figure illustrates what happens if we just reuse the UTF-16 approach. In the case of UTF-8, only Basic Latin code points are encoded as a single code unit. While we can convert these, for any other code points such as U+00D6 LATIN CAPITAL LETTER O WITH DIAERESIS above, we can’t, since it is encoded using two code units. At least on the BMP, the UTF-8 version should have the same lowercasing behavior as the UTF-16 version in order not to break compatibility. With UTF-16, BMP code points would always be encoded as two bytes; with UTF-8, BMP code points may be encoded as one, two or three bytes. This means that in order to stay compatible, we need to actually decode the UTF-8 sequence in order to lowercase code points on the fly.

Decoding the BMP in UTF-8

If we want to stay compatible with the UTF-16 version, we need to lowercase BMP code points and leave any other code points as they are. Let's take a look at how the BMP is encoded in UTF-8.
Unlike UTF-16, UTF-8 does not use surrogates. Rather, depending on the range the code point falls into, it is encoded as one, two, three or four code units. The most significant bits (MSB) of the first code unit tell us how long this sequence is. This table from Unicode Section 3.9 visualizes how UTF-8 code points are encoded:

Table 3-6 of the Unicode standard, visualisation of the UTF-8 encoding form

The first column shows the binary representation of a 21-bit code point in a specific range. We can read the table as:

  • Code points in the range [0, 27) are encoded as a single code unit.
  • Code points in the range [27, 211) are encoded as two code units.
  • Code points in the range [211, 216) are encoded as three code units.
  • Code points in the range [216, 221) are encoded as four code units.

How exactly the code units are composed is not relevant to our use case. What’s important is that, in order to know how many code units we need to read in order to read a full code point, we only need to scrutinize the first code unit of a code unit sequence:

UTF-8 encoding example

Consider the UTF-8 encoding of the string "FÖ✓🙈":

  • The MSB of the first code unit 0x46 is 0. This means that the code point at position 0, U+0046 LATIN CAPITAL LETTER F, is one code unit long.
  • The MSBs of the code unit C3 at position 1 are 110 ⇝ 2 code units.
  • The MSBs of the code unit E2 at position 3 are 1110 ⇝ 3 code units.
  • The MSBs of the code unit F0 at position 6 are 11110 ⇝ 4 code units.

Code units that are not at the beginning of a code point always have the MSBs 10, which means that it is trivial to skip to the previous or next code point from any position in a UTF-8 encoded string. Ultimately we need to decode any code points consisting of up to three code units in order to decode all BMP code points.

Alternative: Change Automaton Construction

Knowing all this, we assumed that decoding every input code point would add a lot of processing overhead. In order to avoid this, we tried to change the way the Aho-Corasick automaton is built instead. Rather than lowercasing code points on the fly, we would add two transitions for each needle code point to the automaton: One for its lowercase version and another for its uppercase version. The advantages of this idea are:

  • Most of the changes would be in the automaton construction code. Compared to the heavily optimized automaton stepping code, there is a smaller risk of breaking things in this part of the code.
  • By building an automaton specifically for case-insensitive matching, we could even simplify the automaton stepping code quite a bit because it wouldn’t need to lowercase the input characters any more.

The following example illustrates this using the needle "Größe" (the German word for size, appearing quite often in the context of E-Commerce). Encoded using UTF-8, it becomes the code unit sequence 0x47 0x72 0xC3 0xB6 0xC3 0x9F 0x65. This Aho-Corasick automaton matches that sequence:

Whenever we read a character without a designated transition at any state, we fall back into state 0. To show how we can extend the automaton, consider transitioning from state 2 to state 4 in this automaton. This transition means "we have read the code point ö (code units 0xC3 0xB6)". We can extend this automaton by adding extra transitions and intermediate states where necessary to make it mean "we have read either the code point ö (0xC3 0xB6) or the code point Ö (0xC3 0x96)":

We could now use the same stepping code we used for the previous automaton to match our needles case-insensitively, removing the on-the-fly lowercasing code entirely. This can’t be implemented in a way that stays compatible with the old BMP-based solution though. The reason is that for any given c :: Char, the property

   (isUpper c || toUpper (toLower c) == c)
&& (isLower c || toLower (toUpper c) == c)

doesn’t necessarily hold. Here are a few examples:

  • From our "Größe" example: While the property holds for the Basic Latin code points as well as "Ö" and "ö", it doesn’t for "ẞ" (U+1E9E LATIN CAPITAL LETTER SHARP S, the uppercase "Eszett"): Char.toLower 'ẞ' == 'ß' (U+00DF LATIN SMALL LETTER SHARP S) but Char.toUpper 'ß' == 'ß' again. This means that the UTF-16 version would match "GRÖẞE", but the new approach would not3.
  • Another infamous example is "İ" (U+0130 LATIN CAPITAL LETTER I WITH DOT ABOVE). In Haskell, Char.toLower 'İ' == 'i' (U+0069 LATIN SMALL LETTER I) and Char.toUpper 'i' == 'I' (U+0049 LATIN CAPITAL LETTER I). The letter is also one of the few case conversions that are locale-sensitive; in a turkish locale, Char.toUpper 'i' == 'İ' holds, breaking the property for "I".
  • Some others, e.g. "K" (U+212A KELVIN SIGN) to "k" (U+006B LATIN SMALL LETTER K) to "K" (U+004B LATIN CAPITAL LETTER K). In total, there are around 30 such code points in Unicode.

An avid reader and Unicode buff will be pulling their hair out by now. That's because the official way to do case-insensitive string comparisons is to apply Unicode case folding to both the needles and the haystack instead of applying a case conversion to a specific case. There are a few reasons alfred-margaret does not do this:

  • Case folding is not a Char -> Char mapping but Char -> [Char] mapping. This makes it a bit harder to construct the automaton since it adds a few edge cases.
  • Moving from lowercasing to case folding now would break the API in subtle ways.

Unicode defines a simple case folding algorithm that is a Char -> Char mapping solving at least the first problem. It is also mostly compatible with lowercasing. However, Haskell's base package does not implement it. In an internal Hackathon, we hacked together an implementation but didn't have enough time to polish it to a point where it's usable.

Finally we discarded this approach. While nice in theory, the subtle bugs it certainly would’ve caused down the road weren’t worth it in practice. However, we will probably examine this again at some point in the future.

Stepping the Automaton Using Code Points

At this point we caved and decided to just decode the UTF-8 input into code points for lowercasing. In order for this to work we’d have to change the automaton construction: So far, a single transition in our Aho-Corasick state machine has the following representation in memory using a 64-bit unsigned integer:

Each transition has a code unit that it activates on as well as the state it leads to. When the wildcard bit is set, the transition activates on every code unit. Each state is represented by an array of these transitions. The last transition in this array always has the wildcard bit set and leads to a fallback state. To illustrate this, let's loop back to the "shirts", "shorts" and "tshirts" example:

Using the layout above, states 3 and 8 would be represented like this:

State 3 has two transitions: Transition 0 leads to state 0b100 (= 4) with code unit 0b1101001 (= "i"). Transition 1 is a wildcard transition that falls back into state 0b1000 (= 8). State 8 has three transitions: Transitions 2 and 3 leads to states 9 and 13 on the code units corresponding to "i" and "o" respectively, and transition 4 is a wildcard transition that back into state 0.

Because the UTF-8 code uses code points rather than code units for transitions, we had to change this layout just a little bit, increasing the number of bits available for the automaton input:

Other than that, the layout stays the same. In Haskell, the code that handles transitions depends on a few type aliases:

type Transition = Word64
type State      = Word32
type CodeUnit   = Word16
type CodePoint  = Char

Those make the following function definitions a bit nicer to read. To work with transitions, we exclusively use these functions:

transitionCodeUnit    :: Transition -> CodeUnit
transitionState       :: Transition -> State
transitionIsWildcard  :: Transition -> Bool
newTransition         :: CodeUnit -> State -> Transition
newWildcardTransition :: State -> Transition

This means that changing the layout of the transitions is purely a matter of updating their implementation and changing CodeUnit to CodePoint. GHC helps us find any other places where we need to change the code by pointing out type errors where CodePoint is to be expected according to the updated interface but CodeUnit is given. Working backwards from there we updated the remaining code. Finally, we decided to handle code points encoded as 4 code units correctly instead of staying 100% true to the UTF-16 implementation, which ignored those since they were encoded using surrogates. In practice this doesn’t make much of a difference since most of our input data is in the BMP anyways.

Haskell Performance Tuning

By now it should be clear that we care a lot about the performance of alfred-margaret. Optimizing Haskell code can be a little tricky, so we added this section to show some of the things that you should watch out for.

Enforce Strictness to Improve Worker/Wrapper Results

Laziness is a neat tool for writing high-level code while also saving a few CPU cycles here and there by avoiding unnecessary evaluation. However, it can also get in your way if you’re not careful, as it can keep the compiler from applying specific optimizations. Let’s take a look at consumeInput, the entry point of the automaton stepping code4:

consumeInput :: CodeUnitIndex -> CodeUnitIndex -> a -> State -> a
consumeInput !_offset 0 !acc !_state = acc
consumeInput !offset !remaining !acc !state =
  let
    ...
  in
    followCodePoint (offset + codeUnits) (remaining - codeUnits) acc state casedCodePoint

Each parameter of this function is prefixed by an exclamation point. This syntax, enabled by the extension bang patterns, effectively allows us to give GHC a helping hand during demand analysis. Demand analysis (a GHC-specific term that includes strictness analysis) determines how a binding is used after its definition. Consider the following example:

f :: Int -> Int -> Int
f x y = case x of
          0 -> y
          _ -> x

In f, the function parameter x is always evaluated as it needs to be pattern-matched on. We call x a strict parameter. Meanwhile y is not used in every code path, making it a lazy parameter. Adding a bang pattern to y forces GHC to evaluate y whenever f is entered, which means it will be treated as strict in any case. While GHC's demand analysis handles most cases quite well it is sometimes necessary to add these annotations manually when working with performance-sensitive code. You can find more information of demand analysis in the GHC commentary and this example-driven blog article.

Demand analysis is interesting to us because its results drive some optimizations in GHC, most notably the so-called worker/wrapper transformation. This transformation splits a function into a worker function that has an improved type and a wrapper function that observes the original type. In particular, the wrapper typically unboxes boxed arguments in order to reduce allocations and memory accesses.

Why is this so important if we are working with 32-bit Chars anyways? The reason is that in GHC's implementation of Haskell, even simple values such as Chars, Ints or Bools are per default "boxed", i.e. they are pointers to some data on the heap. For example, in this square function, x is actually a pointer to a heap-residing Int object:

square :: Int -> Int
square x = x * x

This means that this function has to (at least) dereference x, perform the multiplication and allocate a new heap object for the result. The worker-wrapper transformation saves us here by splitting this function into something akin to:

-- wrapper
square :: Int -> Int
square (I# x) = I# ($wsquare x)

-- worker
$wsquare :: Int# -> Int#
$wsquare x = x *# x

The worker function $wsquare now uses unboxed types and operations only (those marked with #) and performs no allocations. In the wrapper, you can read the pattern match (I# x) as a dereference to an Int# and the call to I# as an allocation on the heap. Wherever the unboxed arguments to square are available in the remaining codebase, GHC can now simply substitute for $wsquare to minimize intermediate allocations.

Let’s backtrack to alfred-margaret now and check out the real-world implications. What happens to consumeInput when we leave GHC’s demand analysis to its own devices and remove some of those bang patterns?

consumeInput :: CodeUnitIndex -> CodeUnitIndex -> a -> State -> a
consumeInput _offset 0 acc _state = acc
consumeInput !offset !remaining !acc !state =
  let
    ...
  in
    followCodePoint (offset + codeUnits) (remaining - codeUnits) acc state casedCodePoint

In this piece of code, the bang patterns in the first branch of consumeInput have been removed. We can make GHC dump the results of demand analysis by passing the -ddump-stranal option. GHC then prints out an intermediate representation of our program in the Core language, which is a glorified lambda calculus that it uses for optimizations specific to functional languages. In the dumped Core, the declaration of consumeInput's worker looks like this:

consumeInput [...]
  :: Int -> Int -> a -> Int -> a
[...
 Str=<L,_><S,_><S,_><L,_>,
 ...]

CodeUnitIndex and State have been resolved to Int. The Str field in consumeInput's metadata gives us the results of demand analysis: In order, it lists strictness and usage information (which is not important for our case and therefore elided) for each of the function's parameters. While remaining and acc are used in both branches of consumeInput, offset and state are not and are therefore treated as lazy parameters. To see the results of the worker/wrapper transformation, we use -ddump-worker-wrapper. The worker for consumeInput looks like this:

$wconsumeInput [...]
  :: Int -> Int# -> a -> Int -> a

Because offset and state are lazily used, they must always be provided as boxed Ints even to the worker function. consumeInput is called with a different offset for each code point in the input, so this implies a lot of small allocations, which is terrible for garbage collection! For this reason, we added a bang pattern to all patterns in the first branch as well:

consumeInput :: CodeUnitIndex -> CodeUnitIndex -> a -> State -> a
consumeInput !_offset 0 !acc !_state = acc
consumeInput !offset !remaining !acc !state =
  ...

The demand information and the worker declaration now look like this:

-- -ddump-stranal
consumeInput [...]
  :: Int -> Int -> a -> Int -> a
[...
 Str=<S,_><S,_><S,_><S,_>,
 ...]

-- -ddump-worker-wrapper
$wconsumeInput [...]
  :: Int# -> Int# -> a -> Int# -> a

Because they are dubbed to be strict parameters, offset and state are unboxed Int#s now as well! Overall, this three-character change cut our benchmark times in half and saved loads of allocations.

Use Tail Recursion where Possible

Basic functional programming courses often place a lot of emphasis on tail recursion – and for a good reason: Tail recursive function calls can be compiled into jumps, thereby avoiding the overhead that usually comes with a call. This is already described in the original alfred-margaret blog article in the section "From State Machines to Mutual Recursion".

Mind your Data Structures

Finally, we did another low-level optimization that had a surprisingly big impact. We represent a Aho-Corasick state machine using the following data structure:

data AcMachine v = AcMachine
  { machineValues               :: !(Vector [v])
  -- machineValues can’t be an UnboxedVector because we don’t have Unbox v
  , machineTransitions          :: !(UnboxedVector Transition)
  , machineOffsets              :: !(UnboxedVector Int)
  , machineRootAsciiTransitions :: !(UnboxedVector Transition)
  } deriving (Generic)

An UnboxedVector (canonical name: Data.Vector.Unboxed.Vector) is an array-based data structure that doesn't store pointers to its elements but the elements itself. In C, we'd declare uint64_t machine_transitions[]. Using unboxed vectors provides a way to store those transitions from before neatly packed adjacently in memory. However, on a hunch, we replaced them by a thin wrapper around ByteArray from the primitive package:

newtype TypedByteArray a = TypedByteArray ByteArray

As it turns out, this simple change led to a speedup of around 15% for array reads:

Results of the UnboxedVector vs TypedByteArray microbenchmark

We can see why by checking out the definition of UnboxedVector:

data Vector a = Vector Int Int ByteArray

An UnboxedVector is actually just a slice into a ByteArray! This means that every time we read an element, we save an addition by simply using a ByteArray instead. Since we know that our fields are not slices but occupy a whole ByteArray which we don’t need to resize, this is a sensible choice. The TypedByteArray wrapper made it easy to replace the type where necessary:

data AcMachine v = AcMachine
  { machineValues               :: !(Vector [v])
  , machineTransitions          :: !(TypedByteArray Transition)
  , machineOffsets              :: !(TypedByteArray Int)
  , machineRootAsciiTransitions :: !(TypedByteArray Transition)
  } deriving (Generic)

In exchange for just a small amount of additional code, this gave us a noticeable performance boost not only in the microbenchmark above but also across our whole benchmark data set.

Results

By using UTF-8 in our updated version of alfred-margaret, we now save close to 50% of memory when working with Text values. However, it was also important to preserve the library's performance. To measure this we ran three versions of alfred-margaret on our benchmark data set of roughly 1.6 billion code points:

  • UTF-16 uvec: Original implementation which uses UTF-16 Text and UnboxedVectors.
  • UTF-8 tba: UTF-8 implementation which decodes all code points and uses the TypedByteArray improvement.
  • UTF-16 tba: Updated UTF-16 implementation that incorporates the ByteArray improvement.

The task is the same for each version: Across 242 test cases, count the number of occurrences for a set of needles in a haystack. Of course we made sure that all versions deliver the same results. The distribution of code points across the whole benchmark looks like this:

Unicode Block Share Number of Code Points
Basic Latin 99.83% 1,652,853,274
Latin-1 Supplement 0.14% 2,314,640
Latin Extended-A 0.01% 242,392
Latin Extended-B 0.00% 746
Other Block, 2 UTF-8 Code Units 0.00% 4,114
Other Block, 3 UTF-8 Code Units 0.02% 371,955
Other Block, 4 UTF-8 Code Units 0.00% 0

Here’s how our candidates did; red bars indicate time needed for case-insensitive search and blue bars indicate time needed for case-sensitive search:

Benchmark results

Overall, running time actually improved compared to the original UTF-16 implementation. Unsurprisingly, the UTF-8 implementation takes a bit longer than the improved UTF-16 implementation with the same TypedByteArray improvement because it needs to decode each code point, even if it’s just a single code unit. One curious observation is that in spite of that, the UTF-8 implementation is actually faster than the improved UTF-16 implementation doing a case-insensitive search. At this point we haven’t found out why that is and are still investigating.

Conclusion

We showcased some internals of alfred-margaret and some approaches of moving it to UTF-8. Having done so enables us to move our codebase to text-2.0, cutting our memory consumption for strings in half! Even though the UTF-8 version has additional decoding overhead, we managed to speed it up compared to the old UTF-16 version. All in all our journey makes it clear that it is absolutely possible to write low-level code in Haskell to tackle performance-sensitive problems. This is especially nice since Haskell offers us a vast amount of tools to hide this complexity and integrate it cleanly into high-level code.

The section Haskell Performance Tuning shows that it is crucial to have at least some understanding of what happens inside GHC though. It helps a lot to have colleagues well-versed in GHC internals and the tooling that is available in the Haskell ecosystem. I'd like to thank mine for their guidance and the opportunity to work on this project.

Moving all of Hackage to text-2.0 is likely going to take a while, since it contains a lot of open-source contributions. However, since the module Data.Text is encoding-agnostic, most updates shouldn't need as much deliberation as this one.


1: The user-perceived character ◌̣ actually consists of multiple abstract characters as well, U+25CC DOTTED CIRCLE and U+0323 COMBINING DOT BELOW. It just looks nicer that way.

2: Of course Texts can be traversed using the Data.Text module only, for example using any one of these Folds or iter. However, both of these methods construct intermediate data structures on the heap and we don't want to rely on GHC to optimize those away.

3: Unicode actually defines toUppercase(ß) = SS. This is because U+1E9E LATIN CAPITAL LETTER SHARP S is a very recent addition to the German writing system, officially established in 2017. In Germany, you will find a lot of street signs where "Straße" (street) is written "STRASSE", see for example this article. However, Haskell's implementations of toUpper and toLower have the type Char -> Char, only ever mapping to single code points. They use the simple case conversion algorithms which Unicode provides exactly for this use case.

4: Working with UTF-encoded strings, we have (at least) two ways to index a string: By code points and by code units. Using CodeUnitIndex here makes it clear what kind of index we are dealing with: Although we ultimately want a CodePoint, we want to use the fixed-width code units to access the string in constant time. CodeUnitIndex is a newtype which means that GHC throws errors if we use an Int instead.

Paul BrinkmeierSoftware Development