Bottlenecked on Haskell's text library

October 15, 2020

Channable is a feed processing tool where users can define rules to optimize their product feeds. In one particular feed, we found that processing took much longer than expected. We set out to investigate, and the bottleneck we found came from an unlikely place: the text library.
In this blog post we describe the bottleneck and compare four different solutions. It is accompanied by a small project that contains benchmarks for all code shown below. Try it out!

The source of the bottleneck

The bottleneck turned out to be caused by the following seemingly innocuous function:

{-# INLINE slice #-}
-- | Substring from @offset@ to @offset + len@
slice :: Int -> Int -> Text -> Text
slice offset len = T.take len . T.drop offset

At first this seemed highly unlikely. The text library is known for its performance characteristics. Besides, the feed processing software runs much more complicated algorithms than the above mentioned slice1. Yet, replacing the body of this function with id sped up the processing of this feed by about 30%. Why would this function be so slow?

Inner workings

Let's take a look at how the take and drop functions work. As it currently stands2, the text library uses UTF-16 as its internal representation. In UTF-16, a character is represented as either one or two 16-bit code units. When a character takes up two 16-bit code units, the first one is called a "high surrogate", and the second one is called a "low surrogate"3. For text specifically, this means that the Text data structure holds an array of Word16 representing code points. Besides that it holds an offset :: Int and a length :: Int, to indicate which part of the array is represented by the Text. This is very convenient for our slice function, for the only thing that needs to happen is to calculate the new offset and length. Sadly, basic arithmetic cannot be used to calculate this new offset and length. This is because some characters take up two entries in the array. Instead, both offset and length are calculated by iterating over the array, counting characters rather than code points. This makes both take and drop O(n). Still, an answer can be returned without modifying the underlying array, so they should not be that expensive. Below are the implementations of take and drop (as of text-

take :: Int -> Text -> Text
take n t@(Text arr off len)
   | n <= 0    = empty
   | n >= len  = t
   | otherwise = text arr off (iterN n t)
{-# INLINE [1] take #-}
"TEXT take -> fused" [~1] forall n t.
   take n t = unstream (S.take n (stream t))
"TEXT take -> unfused" [1] forall n t.
   unstream (S.take n (stream t)) = take n t
drop :: Int -> Text -> Text
drop n t@(Text arr off len)
   | n <= 0    = t
   | n >= len  = empty
   | otherwise = text arr (off+i) (len-i)
 where i = iterN n t
{-# INLINE [1] drop #-}
"TEXT drop -> fused" [~1] forall n t.
   drop n t = unstream (S.drop n (stream t))
"TEXT drop -> unfused" [1] forall n t.
   unstream (S.drop n (stream t)) = drop n t

In these implementations we can indeed see some calls to iterN, modifying the offset and len, and passing the underlying array unchanged. There are some rewrite rules, though, and they rewrite to something completely different, and then back again. Perhaps the problem lies there.

Stream, unstream and fusion

With rewrite rules, GHC can be instructed to replace inefficient function calls with equivalent, but more efficient ones. One famous example of this is that map f (map g xs) can be written as map (f.g) xs. The end result is semantically the same, but the second one only maps over the data structure once, making it faster. More info on rewrite rules can be found in the GHC manual.
The documentation of Data.Text give a hint as to why these specific rewrite rules exist:

Most of the functions in this module are subject to fusion, meaning that a pipeline of such functions will usually allocate at most one Text value.

To realize this fusion, there is a second representation of text values, namely Stream. Every function that is subject to fusion has rewrite rules that convert to and from an alternative implementation that uses Stream instead of Text. Depending on how these rules are applied, the compiled code might end up being completely different. Luckily, there are some compiler options that show which rules are being applied (-ddump-rule-firings), and even how they rewrite the code (-ddump-rule-rewrites). This shows that there are three important rules being applied:

Rule fired: TEXT drop -> fused (Data.Text)
Rule fired: TEXT take -> fused (Data.Text)
(lots of other, unrelated rules)
Rule fired: STREAM stream/unstream fusion (Data.Text.Internal.Fusion)

The first two rules are the ones shown above, the third rule is as follows:

{-# RULES "STREAM stream/unstream fusion" forall s. stream (unstream s) = s #-}

This rule means that converting a Stream to Text and then back to Stream again is the same as just leaving the Stream as is. This can be done because Stream and Text are different representations of the same data. The application of the three rewrite rules happens as follows (simplified):

slice offset len
 = \t -> take len (drop offset t)
 -- Applying rule "TEXT drop -> fused"
 ~ \t -> take len (unstream (S.drop offset (stream t)))
 -- Applying rule "TEXT take -> fused"
 ~ \t -> unstream (S.take len (stream (unstream (S.drop offset (stream t)))))
 -- Applying rule "STREAM stream/unstream fusion"
 ~ \t -> unstream (S.take len (S.drop offset (stream t)))

In the end we have an expression that converts the Text to Stream, applies some Stream variants of drop and take, and then converts back to Text again. This delivers on the promise of fusion: only a single new Text is created, even though there are multiple operations in the pipeline that alter the original value. For many situations this would be an optimization, as it can eliminate the creation of intermediate copies. For our slice function, however, the original implementation doesn't create intermediate copies in the first place. The optimization here is significantly more inefficient than the original because it creates a completely unnecessary copy of the input string.
Now that we understand the source of the bottleneck, we can work on a solution: making sure fusion does not happen in this particular case.


In the internal discussion of this issue, several solutions were mentioned. Each tackles the issue in their own way. They are listed below, with a description as to why they solve the problem. For easy access, the benchmark results of these solutions are placed in the benchmark repo's

Don't inline take

The first is to apply a NOINLINE pragma to take:

noInlineTakeSlice :: Int -> Int -> Text -> Text
noInlineTakeSlice offset len = noInlineTake len . T.drop offset
-- Disable inlining for Text's take
{-# NOINLINE noInlineTake #-}
noInlineTake :: Int -> Text -> Text
noInlineTake = T.take

This will cause the compiler to fail to rewrite the take function. Specifically, the TEXT take -> fused rule fails to match at the call site because the function is given a different name, and it fails to match in the definition of noInlineTake because the rule only matches when the take function is called with its two arguments.
Instead, two rules fire on drop, namely TEXT drop -> fused and TEXT drop -> unfused. Rewriting drop to the Stream version, and then back to the Text version. Note that it does not actually apply those two rules to take.
Why do this with take and not drop? Arbitrary choice. Doing this with drop or both take and drop would have essentially had the same effect.

Force sequential evaluation

The second solution forces evaluation of the intermediate result:

sequencedSlice :: Int -> Int -> Text -> Text
sequencedSlice offset limit text =
   !suffix = T.drop offset text
   T.take limit suffix

When forcing evaluation, both drop and take will be rewritten to the Stream version. However, the STREAM stream/unstream fusion rule will fail to match, because the evaluation is forced in between. Unable to combine the streams, both drop and take are rewritten back to the Text version using the TEXT drop -> unfused and TEXT take -> unfused rules.

Add a new rewrite rule

The third solution adds a new (orphan) rewrite rule:

{-# OPTIONS_GHC -Wno-orphans #-} -- Suppress warning about orphan rule
sliceWithRule :: Int -> Int -> Text -> Text
sliceWithRule offset limit = T.take limit . T.drop offset
"TEXT take . drop -> unfused" [1] forall len off t.
   S.unstream (S.take len (S.drop off ( t))) = T.take len (T.drop off t)

This rule matches on the specific problematic Stream created after rewriting the slice function, and turns it all back to the original implementation. This solution will affect all code that either directly or indirectly imports this module, so that should be kept in mind.

Reimplement the function and cheat

The last solution implements a slightly different version of the slice operation using internal functions. To get the most performance, we cheat:

reimplementedSlice :: Int -> Int -> Text -> Text
reimplementedSlice offset len t@(Text.Internal.Text u16data off prevLen)
   | offset2 >= prevLen = Text.Internal.empty
   | len2 <= 0          = Text.Internal.empty
   | otherwise          = Text.Unsafe.takeWord16 len2 $ Text.Unsafe.dropWord16 offset2 t
   offset1 = min prevLen $ max 0 offset
   len1 = min (prevLen - offset1) $ max 0 len
   offset2 = if isLowSurrogate offset1 then offset1 + 1 else offset1
   len2 = if isHighSurrogate (offset2 + len1 - 1) then len1 - 1 else len1
   -- | Return whether the code unit at the given index starts a surrogate pair.
   -- Such a code unit must be followed by a low surrogate in valid UTF-16.
   isHighSurrogate :: Int -> Bool
   isHighSurrogate !i =
       w = Text.Array.unsafeIndex u16data (off + i)
       i >= 0 && i < prevLen && w >= 0xd800 && w <= 0xdbff
   -- | Return whether the code unit at the given index ends a surrogate pair.
   -- Such a code unit must be preceded by a high surrogate in valid UTF-16.
   isLowSurrogate :: Int -> Bool
   isLowSurrogate !i =
       w = Text.Array.unsafeIndex u16data (off + i)
       i >= 0 && i < prevLen && w >= 0xdc00 && w <= 0xdfff

As said earlier, characters take up either one or two entries in the underlying array because of the UTF-16 encoding. This implementation mostly ignores that and calculates a new offset and length without iterating over the array to properly count characters. It assumes that all characters take up only one Word16. That assumption makes this implementation O(1) rather than O(n). It does make sure to at least not cut between a high and low surrogate, though, as that would create invalid UTF-16.
The assumption that the underlying array holds one Word16 per character is blatantly wrong. Still, there are some reasons to seriously consider this implementation. Besides speed, this implementation is closer to the substring operation of many common programming languages, such as Java and C#. While pointing at other programming languages is hardly an excuse, it could become a reasonable argument when having to interface with third party programs or services. This is because in such cases, matching bad behavior could be more important than having correct behavior.
Luckily, most commonly used characters are in the Basic Multilingual Plane (BMP), which take up a single Word16. This could mean that this implementation may not make much of a difference in practice. Where it does make a difference, strings are cut earlier and shorter than they should be. While not ideal, it may be sufficient for our particular use case.

Benchmark results

Below is a graph of the benchmark results. The source of the benchmark can be found here. Note that the Y-axis is logarithmic. This is because of the large difference between the values of naiveSlice and that of the other functions.


Which solution we chose

As the only constant time implementation, reimplementedSlice handily beats all other implementations in the benchmark. Still, we did not choose for this implementation. It turns out that in our real world situation, the performance difference among all solutions is negligible. The tremendous gains of reimplementedSlice shown in the benchmark are lost in the presence of other bottlenecks. Besides, reimplementedSlice demands a significant amount of work before use in production: testing for bugs, measuring actual effects on customers, not to mention future work when the text library changes its implementation. This amount of effort combined with the knowledge that it does not make a real performance difference makes its incorrect implementation impossible to defend.
The choice came down to the other three. The noInlineTake, sequencedSlice and sliceWithRule all performed very similarly in the benchmark. The sliceWithRule solution was then ruled out as its rewrite rule rule will affect anything that imports the containing file directly or indirectly. While this may make other code faster, it becomes unclear when exactly this happens, especially when the file containing the rule is several imports down the line. It is better to have this rule in the text library itself, where it would always be present wherever Text is.
The last two solutions are hard to pick from, though noInlineTake has a slight edge over sequencedSlice in the benchmark. This slight edge may not mean anything, as it could be attributed to how this particular benchmark was compiled5. Still, it allowed us to decide upon the winner. So, noInlineTake it is.

Lessons learned

Overall, this experience has reminded us that bottlenecks can be found in unlikely places. More remarkable, though, is that the implementation of a function you see when looking it up is not necessarily the implementation that ends up in the compiled program. The presence of rewrite rules should alert the viewer to investigate deeper. Lastly, not all optimizations always make code faster. Sometimes there are edge cases where they end up causing harm.
The real solution to this problem is to upstream the rule shown in sliceWithRule to the text library. A pull request for this change can be found at the haskell/text repository. Since substring operations can be quite common, this particular optimization could prove beneficial for many projects. Perhaps there are other combinations of functions that would benefit from similar rules. We look forward to having that discussion.

1: One of these algorithms is Alfred-Margaret, an implementation of Aho-Corasick mentioned in an earlier blog post.
2: The hackage page of the text library says that an investigation is ongoing about switching to UTF-8 for its internal representation.
3: For the sake of brevity, this explanation is minimal. For a more thorough explanation of Unicode and the UTF encodings I recommend reading UTF-8 Everywhere.
4: The code is reproduced here in accordance with its license, hosted on GitHub under haskell/text.
5: While writing the benchmark we hit this ghc issue, which was solved with a flag in package.yaml. Microbenchmarks like the one used here are affected by nuanced implementation details of the compiler.

Falco PeijnenburgLead Development

We are hiring

Are you interested in working at Channable? Check out our vacancy page to see if we have an open position that suits you!

Apply now