Tech

March 13, 2019

In this post, we will describe our quest to create Alfred–Margaret, the fastest Haskell implementation of the Aho–Corasick string searching algorithm, which powers string search in Channable.

Channable is a feed processing tool where users can define rules to optimize their product feeds. Oftentimes these rules rely on substring searching, for example to categorize products based on keywords that occur in the title, or to exclude products that use inappropriate words in their description. Substring searching is also a subproblem of string replacement, which is another common operation for us. Because substring searching makes up such a big chunk of our workload, we need it to be fast in order to keep our tool fast and responsive.

Typically, a rule will have anywhere from one to a few hundred *needles* (strings to search for), which will be used to search through possibly millions of products. *Aho–Corasick* is a good string matching algorithm in this situation: it constructs a data structure once, which can then be used to search through many *haystacks* (strings to search through). In theory this algorithm is efficient: the time complexity is linear in the length of the haystack plus the number of matches, and independent of the number of needles. Although algorithmic complexity definitely matters, in practice a fast string matching implementation is even more important.

In the sections to follow, we will touch upon the algorithm briefly, and then explain how we got from high-level Haskell to tight machine code that outperforms the fastest Rust implementation we could find.

Before embarking on our own implementation, we decided to check what libraries are out there, to see what kind of performance is possible, and to evaluate if we could use an off-the-shelf implementation directly. Our feed processing system is written in Haskell, so we were open to binding a library with a C ABI through the foreign function interface. And if we would have to write our own, having a fast implementation to take inspiration from would also be useful.

We constructed a benchmark to evaluate string matching implementations. For a realistic comparison, we took a sample of the 242 slowest rules that our users had created, and dumped their needles and haystacks. This amounted to 3.5 gigabytes of UTF-16 data, with the number of needles per sample ranging from 1 to 5136, and a median needle length of 22 bytes. Because we use the `Text`

type for strings in Haskell, which is an array of UTF-16 code units in native endianness, the benchmarks would take UTF-16 data as input. The goal was to count the number of matches as quickly as possible. The measured time includes the time to construct the automaton.

We considered the following contenders:

- A naive Python script that loops over all needles and calls
`str.find`

in a loop to find all matches. This is not Aho–Corasick, but it provides a baseline for timing, and it provides a reference implementation which proved very valuable for checking correctness. We use Python 3.6.7. - A Haskell implementation by yuttie. This is the library we used before we started our optimization journey. We used version 0.2.0.5 of this library on GHC 8.2.2.
- A Java implementation by hankcs. We used this library in the precursor to our current feed processing system, which was written in Scala. We use version 1.2.0 of this library on OpenJDK 10.0.2.
- A C implementation by mischasan. This library looked promising, but we could not get its test suite to build, so we moved on.
- A Rust implementation by BurntSushi, which is used in Ripgrep. We used version 0.6.8 of the library on Rust 1.29.2.
^{1} - Finally, to establish a lower bound, we benchmarked a Rust program that just calls
`to_vec()`

on the haystack, the equivalent of a`malloc`

and`memcpy`

.

Our benchmarks established the following times for working through all 3.5 gigabytes of test data once:

The times shown are the means over 25 runs, the error bars indicate standard deviation.

A few things to note here:

The library we use currently is far from optimal!

The Python implementation decodes the UTF-16 input at load time. The runtime representation [need not be][pep393] UTF-16, so it has less bytes to traverse than the other contenders.

The Rust library only supports byte slices. If we would use this library, transcoding UTF-16 to UTF-8 on every FFI call would be expensive, and for string replacement we need match positions in the UTF-16 text. Therefore we match on the UTF-16 bytes directly. This means that the program is effectively doing twice the work: it treats every code unit as two characters.

The naive Python implementation is surprisingly fast, given that Python is an interpreted language, and it implements a suboptimal algorithm. There are two explanations for this. Firstly, the program spends most of its time in the heavily-optimized

`str.find`

function, which is implemented in C. Interpreter overhead is small. Secondly, most samples have few needles, so looping over all needles is not so bad. Half of the samples have only 1 or 2 needles.

Clearly there was plenty of room for speeding up our string matching. We considered binding the Rust library, but due to the way that Haskell’s foreign function interface interacts with the garbage collector, passing a string to a foreign function requires copying it. That copy alone puts any foreign library at a significant disadvantage, so we decided to see how close we could get in Haskell alone.

Before we can explain our other optimizations, we need to explain a bit of how Aho–Corasick works. At the heart of the algorithm is an automaton. The construction of this automaton is beyond the scope of this post. All we need to know for now is that the automaton consists of *states and transitions* between them. Transitions between states correspond to input characters. Additionally, every state apart from the initial state has a *failure transition* to follow when there is no transition for the current input character. Every state can have zero or more needles associated with it. When we enter a state, we report all of the associated needles as matched. There can be multiple needles per state when needles overlap.

To traverse the haystack, we take one or more steps in the automaton for every input character. If at the current state, there is a transition for the current input, then we take it, and move on to the next input character. If there is no transition, we first follow the failure transition and try again from there, until we end up in the initial state. If there is no transition from the initial state, we move on to the next input character.

To give an example, suppose we we wanted to match *tshirt, shirts, and shorts*. We would use the following automaton:

State 0 is the initial state. The solid arrows indicate transitions associated with an input character, whereas the dashed arrows indicate a failure transition. For the sake of readability, failure transitions back to the initial state are not drawn.

When matching on the text *short tshirts*, we would step through the following states:

```
Input: s h o r t _ t s h i r t s
State: 0 7 8 13 14 15 0 1 2 3 4 5 6 12
Match: ^ ^
```

For the first few input characters, we follow the input arrows. In state 15, when there is no transition for space, we first fall back to state 1, which has no transition for space either, and then we fall back to state 0. (Recall that the dotted line to state 0 is not drawn in the diagram.) In state 6, we report having matched *tshirt* and in state 12 we report having matched *shirts*. Now that we have seen how to step the automaton conceptually, we can start to consider implementation details.

As is often the case in performance sensitive code, the key to our fast implementation was to use the right representation of the automaton in memory. To step the automaton, we need to keep track of three things for each state:

- A transition table: a mapping from input character to next state.
- The state where the failure transition leads.
- A (possibly empty) set of needles to report as matched in this state.

The state itself is an integer, and in building the automaton we can assign consecutive unsigned integers to states. In the worst case, the number of states is the sum of the length of the needles, so we opted for a 32-bit integer.

For the transition table, a hash map is an obvious data structure, and it is also what yuttie’s library uses. However, unlike array-backed mutable hash tables in Rust or C, Haskell’s `HashMap`

is implemented as a tree, a pointer-heavy data structure that is less cache-friendly than a flat array. The tree admits efficient inserts with structural sharing (after the insert, both the old *and* the new hash map are available), but we do not need any of that. We only care about the performance of lookups. Building the automaton is a one-time task, whereas we need to traverse the automaton over and over.

We could build our own hash tables on top of unboxed vectors, a safe Haskell abstraction with the same memory representation as a Rust slice or C array. But do we need a hash table at all? It turns out that most transition tables are small. In the example in the previous section, 12 of the 17 states have a single transition (not counting the failure transition), three states have only the failure transition, and the remaining two have two non-failure transitions. So we tried an unordered array with linear search instead. We packed the 16-bit input code unit and the 32-bit go-to state together in a 64-bit integer, and now the transition table was a cache-friendly array of 64-bit integers.

For every state, we now have this transition table, implemented as an array, and one failure transition. If we are going to do a linear scan over the array anyway, we can store the failure transition at the end. This has three advantages:

- The failure go-to state does not need to be stored in a separate table, which is good for memory locality.
- It acts as a terminator for the linear scan, so we do not need to know the size of the table. For now
`Vector`

still tracks a length for bounds checks, but later we will pack multiple tables in one vector. - The
*lookup with default*operation can be implemented in one go: when the linear scan terminates,

we either found a regular transition, or the failure transition.

Replacing `HashMap`

with these custom transition tables got us a 66% reduction in running time. Memory layout trumps algorithmic complexity for our workload.

Although we had made great progress, our Haskell program still took twice as long as the Python version. Fiddling with strictness annotations and unboxed types helped to shave off a few percent of the running time, but we needed another order of magnitude improvement. We knew how fast string matching could be, and our memory layout was tight. Why was our program still so slow? We needed to dig into GHC’s compilation pipeline to learn what was going on. If you are not interested in Haskell optimization, you can skip this part and jump straight to the results.

Stepping the automaton is fairly simple, it can be written as a triply nested loop. To stick with the earlier example, suppose we had the automaton set up like so, in pseudo-Python to illustrate the idea without too much noise:

```
haystack = 'short tshirts'
failure_marker = 0x10000
transitions = {
0: [('t', 1), ('s', 7), (failure_marker, 0)],
1: [('s', 2), (failure_marker, 0)],
2: [('h', 3), (failure_marker, 7)],
...
8: [('i', 9), ('o', 13), (failure_marker, 0)],
...
}
needles_ending_at_state = { 0: [], 1: [], ... 6: ['tshirt'], ... }
```

Then we could step the automaton with a triply nested loop: one over the input characters, one to follow transitions, and one for the linear scan through the transition table.

```
initial_state = 0
current_state = initial_state
for i, input_char in enumerate(haystack):
is_done = False
while not is_done:
for (at_input, goto_state) in transitions[current_state]:
if at_input == input_char:
is_done = True
current_state = goto_state
for needle in needles_ending_at_state[goto_state]:
report_match_ending_at(i, needle)
break
elif at_input == failure_marker:
is_done = current_state == initial_state
current_state = goto_state
# No break required, failure terminates transition table.
```

It would not be very hard to do this in a low-level language such as Rust, in a way that compiles down to efficient machine code, but in Haskell we have a problem: Haskell does not have loops.

The way to loop in Haskell, is to make a tail recursive function. GHC compiles these down to loops, and we can verify that with `-ddump-simpl`

. This flag tells GHC to dump optimized Core, a basic, explicitly typed language which is an intermediate target in GHC’s compilation pipeline.

For example, the following functions implement transition table lookup:

```
import qualified Data.Vector.Unboxed as UVector
type State = Int32
type CodeUnit = Word16
-- A transition is a manually packed tuple of code unit, goto state, and
-- a flag bit that indicates whether the transition is the failure transition.
type Transition = Word64
transitionCodeUnit :: Transition -> CodeUnit
transitionCodeUnit t = fromIntegral (t .&. 0xffff)
transitionState :: Transition -> State
transitionState t = fromIntegral (t `shiftR` 32)
transitionIsFailure :: Transition -> Bool
transitionIsFailure t = (t .&. 0x10000) == 0x10000
-- Find the transition for the code unit by doing a linear scan over all
-- transitions. Returns `Right nextState` if a transition exists, or `Left
-- failState` in case we encountered the failure transition. It is assumed
-- that the transitions vector is terminated by a wildcard transition.
lookup :: CodeUnit -> UVector.Vector Transition -> Either State State
lookup !input ts = go 0
where
go !i = case ts UVector.! i of
t | transitionIsFailure t -> Left (transitionState t)
t | transitionCodeUnit t == input -> Right (transitionState t)
_ -> go (i + 1)
```

Note that infix `!`

is the vector indexing operator. `!`

as prefix is unrelated; it is a strictness annotation. The dumped Core for `lookup`

, after renaming and omitting a few things for readability, looks like this:

```
lookup :: Word# -> Int# -> Int# -> ByteArray# -> Either State State
lookup = \input offset len ts ->
joinrec {
go :: Int# -> Either State State
go i = case tagToEnum# @ Bool (>=# i 0#) of {
False -> crashIndexOutOfBounds;
True -> case tagToEnum# @ Bool (<# i len) of {
False -> crashIndexOutOfBounds;
True -> case indexWord64Array# ts (+# offset i) of t {
__DEFAULT -> case and# t 65536## of {
__DEFAULT -> case tagToEnum# @ Bool
(eqWord# (narrow16Word# (and# t 65535##)) input) of
{
False -> jump go (+# i 1#);
True -> Data.Either.Right @ State @ State
(I32# (narrow32Int# (word2Int# (uncheckedShiftRL# t 32#))))
};
65536## -> Data.Either.Left @ State @ State
(I32# (narrow32Int# (word2Int# (uncheckedShiftRL# t 32#))))
}
}
}
};
} in jump go 0#
```

A few things stand out:

`#`

is all over the place. An`Int#`

is an unboxed signed integer, with a target-dependent size, like`int`

in C.`Word#`

is the unsigned variant. This is in contrast to`Int`

and`Word`

in the original code, which are boxed: pointers to integers or thunks.`+#`

compiles to an ordinary add instruction,`and#`

to a bitwise and. An integer literal with`#`

suffix is an unboxed signed integer constant; a`##`

suffix makes it unsigned.- The inliner inlined
`transitionCodeUnit`

and the other helper functions. - A
`Vector`

is a base pointer, offset, and length (to support slicing), and GHC deconstructed its fields into function arguments to avoid indirection. - The tail call has been replaced with a jump.
- The function does construct either a
`Left`

or`Right`

on the heap as return value.

The middle loop for automaton stepping follows transitions. If a transition exists for the input character, the loop terminates after one step, but if the transition was a failure transition, we try again from the failure state. This translates to `followEdge`

:

```
-- Follow the edge for the input code unit from the current state. If no such
-- edge exists, follow the "failure" edge, and try again.
followEdge :: Vector (UVector.Vector Transition) -> CodeUnit -> State -> State
followEdge transitionsFrom !input !state =
case lookup input (transitionsFrom Vector.! (fromIntegral state)) of
-- Note: recursion halts eventually, because the failure transitions are
-- not cyclic. Eventually we get back to the initial state, and there we
-- stop the recursion.
Right nextState -> nextState
Left failState -> if state == stateInitial
then state
else followEdge transitionsFrom input failState
```

After inspecting the Core for this function, the culprit became clear: `lookup`

was not inlined. This meant that for every transition lookup, `followEdge`

would do a function call, `lookup`

would allocate an `Either`

return value on the heap, and then `followEdge`

would immediately match on it. We could get `lookup`

to be inlined in `followEdge`

, but then `followEdge`

would not be inlined into the calling `step`

function, the function that implements the outer loop. At this point we were stuck. We could avoid a function call for one of the loops, but not both. We could not manually inline these functions by copying the bodies either, because they were recursive. We knew exactly what kind of machine code we wanted, but not how to tell the compiler to generate it.

My colleague Radek offered a suggestion: what if — rather than implementing the triple loop as three functions — we would write *one* function with three loop variables? The rough idea is this:

```
type Matches = [Match]
emptyMatches = []
data Step
= ConsumeInput Matches State HaystackIndex
| FollowEdge Matches State HaystackIndex
| LookupTransition Matches State HaystackIndex TransitionTableIndex
find :: Automaton -> Haystack -> Matches
find automaton haystack = go (ConsumeInput emptyMatches initialState 0)
where
go :: Step -> Matches
go (ConsumeInput matches state i) = if i < length haystack
then go (FollowEdge matches state i)
else matches
go (FollowEdge matches state i) = ...
go (LookupTransition matches state i k) = ...
```

Every constructor corresponds to one of the three loops, and all of the local variables have been moved into fields. `ConsumeInput`

carries the matches found so far, the current state, and the index of the current input code unit. When looking up a transition, we also need to track the index into the transition table. The `go`

function would now first match on the current step, perform a part of the loop, and then do a tail call. This way we can do three loops in one tail recursive function!

The new approach got rid of function calls in the compiled program. The tail call in the `go`

function compiles to a jump. This helped, but now we were allocating a `Step`

value on the heap at every iteration. Could we somehow avoid the allocation too?

The next insight was that we do not need the `Step`

type at all. Why construct it right before the tail call, only to match on it immediately? We could instead specialize the `go`

function for every step:

```
find :: Automaton -> Haystack -> Matches
find automaton haystack = goConsumeInput emptyMatches initialState 0
where
goConsumeInput matches state i = if i < length haystack
then goFollowEdge matches state i
else matches
goFollowEdge matches state i = ...
goLookupTransition matches state i k = ...
```

We ended up with a set of local mutually recursive functions. None of these is self-recursive, but they do all make a tail call to one of the other `go`

functions. Optimization often comes at the cost of clarity, but in this case, what remains is still very readable Haskell. The code reads a lot like labels and jumps: every function is a label, and a call is a jump to that label. In fact, labels and jumps are exactly what the code compiles down to.

After discovering what we needed to write to get GHC to generate good code, we could optimize further. With strictness annotations we eliminated almost all allocations from the `find`

function, save from building up the list of matches. Later, by modifying `find`

to act as a fold over all matches, we eliminated those allocations too. After a struggle, the power of a high-level language paid off.

As a fold, `find`

looks like this:

```
data Next a = Step a | Done a
find :: a -> (a -> Match -> Next a) -> Automaton -> Haystack -> a
find seed onMatch automaton haystack = ...
```

In our benchmark program we fold with an increment function. The `Next`

type also allows us to early out, for example when testing whether any of the needles occurs in the haystack:

```
countMatches :: Automaton -> Haystack -> Int
countMatches = find 0 (\count _match -> Step (count + 1))
containsAny :: Automaton -> Haystack -> Bool
containsAny = find False (\_anyYet _match -> Done True)
allMatches :: Automaton -> Haystack -> Matches
allMatches = find emptyMatches (\matches match -> Step (match : matches))
```

For `countMatches`

, the generated code is specialized with a strict integer accumulator, without any allocations. `containsAny`

is similarly efficent, without allocations, returning `True`

at the first match. We added a few further optimizations, such as a sparse lookup table for the initial state, and packing all transition tables into one vector for better cache utilization, but this post is already quite long, so let’s move on.

The Glasgow Haskell Compiler uses its own code generation backend by default. By passing `-fllvm`

, GHC will use the newer LLVM backend, which benefits from the low-level optimizations that LLVM is good at. Build times suffered, but in return we got about a 20% reduction in running time. Enabling LLVM was by far the most worthwhile optimization in our entire journey, in terms of speedup per effort invested.

With our fully optimized implementation, we are now faster than any other implementation we could find on UTF-16 text:

Keep in mind that the Rust version matches UTF-16 text bytewise, whereas our Haskell version acts on 16-bit code units. For completeness, we also show how the Rust library would perform if our inputs would have been UTF-8, roughly halving the data volume. We did not benchmark our own implementation on UTF-8 strings, as it is specialized to 16-bit code units for Haskell’s `Text`

type. Compared to what we started with, we reduced string matching time by 94%.

We set out to speed up string searching, a common operation for Channable. We investigated several Aho–Corasick implementations, and concluded that our best bet was to build a fast Haskell implementation, which we are open-sourcing today. Though Haskell is not the best tool for doing low-level optimization, it is definitely possible. In the end our implementation came out faster than any of the others we had tried before.

Finding the right way to instruct GHC to generate fast code remains a challenge. Although the end result is code that is clear and fast, the journey to get there was quite the adventure. Relying so much on optimizations to ensure that very high level code compiles to the right low-level code is fragile. Case in point: after upgrading our Stackage snapshot from LTS 10 with GHC 8.2.2 to LTS 13 with GHC 8.6.3, we saw the running time of our benchmark double. After some investigation we got the times back down, but this involved a subtle combination of code changes and inlining pragmas. Still, our journey was definitely worth the trouble, for in the end we can service users with even faster previews and a more responsive tool.

Discuss this post on Reddit or on Hacker News

1: The library offers a, *sparse*, a *dense*, and a *full* automaton representation. The full representation contained a bug that caused it to miss matches, and of the sparse and dense representation, dense was faster for our use case, so we benchmarked the dense automaton. ↩