January 11, 2023
This is part 2 in our series about the parallel streaming evaluator that we use at Channable, where we use Haskell with the Conduit library to produce both result values and parallel work units in a single stream. Part 1 explained the basics of our evaluator, why it's useful and some of the use cases for it. If you have not done so yet, read part 1 first:
Parallel streaming in Haskell: Part 1 - Fast, efficient, and fun!
In part 1 we've mainly worked with
filter. They work on each item individually, so they can easily be applied on parallel streams as well.
It becomes more difficult when you need to interact with multiple items at the same time. We will refer to these types of actions as aggregations, mimicking the terminology from SQL databases. In other languages you would typically express these as a
reduce. In this blog post, we'll explain how we parallelized such aggregations.
We will use deduplication as the running example. Deduplication works by removing duplicates from the incoming items by a certain expression (such as a field). In Haskell, the type signature of deduplication would look like
deduplicate :: Ord key => (Item -> key) -> [Item] -> [Item]
Our implementation also sorts the items along the way, to make it easier to write an
O(n log n) algorithm. It's similar in spirit to
We use deduplication in the examples, but we also support other actions such as:
All of our aggregations work on the same principle: Group all items and join them in a certain way, possibly removing them. In a Haskell function, this would have the following type signature.
aggregation :: Ord key => (Item -> key) -- ^ The key on which to group the items -> (Item -> Item -> [Item]) -- ^ The function to combine 2 items, return 0, 1, or 2 items -> [Item] -- ^ The input items -> [Item] -- ^ The resulting items
In the examples below, we use numbers instead of concrete items. You can imagine that these numbers are the result of applying
(Item -> a).
A naive attempt at a parallelized deduplication algorithm might work like this:
In this case, we were unlucky and the 3 showed up in both chunks and was therefore not recognized as a duplicate. Clearly, a deduplication function needs to look at all inputs to work, so it sounds like we should just accept that we can't do this in parallel.
But never fear, we actually can. Mostly. The algorithm to do this is a classic divide-and-conquer style algorithm very similar to merge-sort. The high level approach contains the following steps:
Deduplicating and sorting the individual chunks can be done in parallel, but joining them has to be sequential. For performance, it's best to do as much as possible in the dedup+sort step and as little as possible in the joining step.
The key to joining efficiently is the Sort step, which happens in parallel. We join 2 lists that are already sorted, making the joining much easier. This is almost the same as the 'merge' step in a merge sort algorithm, with a small difference when duplicates occur.
We start the
Join step with 2 lists of deduplicated items, sorted by the deduplication key. The result of this
Join is also a sorted, deduplicated list. Start looking at the lists
x:xs means a list where
x is the first element of the list and
xs is everything else on the list. There are 3 cases:
l == r: We found a duplicate. We can either decide to always return
lor always return
ror have a special
joinfunction. In that case we return
join(l, r). We recurse on
l < r: As
lis smaller than
ris the smallest element of
r:rs(due to the sorting), we know that there is no occurrence of
r:rs. We return
lfor the new list. As
lis smaller than any items in
ls, it keeps the resulting list sorted. We recurse on
l > r. We continue with
ras the head of the result list by the same logic as
l < r, but now with the operands reversed. We recurse on
This join step is
n is the sum of the length of both lists. Once any of the list is empty, we can append the other list to the end of the resulting list.
In the image below, we walk through 2 lists step-by-step. The comparison step as outlined below is shown above each step The arrow contains the value of the head of the list.
As you might imagine, making this work with only 2 streams is a bit pointless. However, we can use the Join step multiple times, as the result of Join has the same format as the processing of an individual chunk: a sorted and deduplicated list. If we have 4 chunks, the steps are:
Scaling this up further is more of the same.
In general, the algorithm can be described as follows:
We want the result to be deterministic, independent of timing issues due to multithreading and chunk sizes. For this we have to ensure that the join operation is assocative (
a <> (b <> c) == (a <> b) <> c) and that we always keep all the inputs in the right order. It's also worth noting that if we would do
((a <> b) <> c) <> d we can't parallelize anything, but with
(a <> b) <> (c <> d) we can evaluate
a <> b and
c <> d at the same time.
Together these requirements mean that we have fixed pairs of blocks (for joining), solely based on their position in the layer before. Actual evaluation of those joins can happen in any order due to multithreading.
We've been talking about running these sort+dedup and joining steps in parallel. Remember though that an aggregation action by itself is just a conduit that will only do anything when the evaluator tries to pull data from it. In order to run this parallel work, it will wrap them in a
WOWork constructor and yield them to the next action. They'll get propagated to the end of the conduit and then
runConduitWithWork will ensure that the work gets evaluated.
WOWork units are responsible for deduplicating, sorting and joining all blocks. They will put their results during the deduplication process in an internal data structure, local to the aggregation. Every time that the conduit of the aggregation action gets pulled on, it looks in the data structure to see if it can produce any new work, for example by combining 2 existing chunks. Only after all the blocks are produced, the deduplication action can finally stream the deduplicated items.
Lets go through an example of a deduplicating and joining step, using these work units.
We'll pick a specific section from the example above. We are zooming into the highlighted section from the larger example above. In this case, we will deduplicate 2 blocks and join them. The yellow sections indicate the internal structure above. The colored blocks indicate the different producers of work.
1., the evaluator asks (via the other producers) the join work producer (purple) for some work. Purple doesn't have any work, as it doesn't have all of its inputs. It asks its input (blue) if it has some work. Blue has some work (step
2.), so it gives that to Purple. Purple moves that work along (step
3.), so it will reach the evaluator, which it can execute on a separate worker.
The blue work is running on the parallel worker. The evaluator doesn't need to wait for the work to complete. It can already ask purple for some more work (step
4.). As before, purple doesn't have all the work yet, so it asks its input for some more work. The answer comes back that there is more work by Green. The work green has (step
5.) is send to purple. Purple sends the work to the evaluator (step
The evaluator asks purple for some work for the third time (step
7.). The inputs of Purple don't have any more work. Assuming the work was already complete, Purple can make the join of the 2 blocks (step
8.) and yield it to the evaluator (step
9.). In the next blog, we explain how we deal with this situation if this assumption is not yet true.
The evaluator asks purple for the final time if it has something for us (step
10.). This time, the items are already computed by step
8. and available in the internal datastructure (step
After this, the resulting items are either placed back into the resulting datastructure to be combined with another block or are streamed towards the next conduit, if these were the final blocks that needed to be joined.
While we did not show a concrete implementation, we hope to have convinced you that our work-propagation system introduced in part 1 is flexible enough to implement aggregations, as well as the simpler
The implementations for these actions are similar to what we run on our production servers. We still have some future work in parallelizing some other minor actions. Our current system supports both sequential and parallel actions interleaved. Having this freedom is a nice benefit of our system, but we would like to parallelize all implementations as much as possible. We can port these last sequential implementations to parallel versions, but as these actions are rarely bottlenecks, this doesn't have the highest priority.
Finally, aggregations as described are not always the most efficient approach. The described algorithm works decently for all cases, but there are specific cases that might benefit from their own, specialized, implementation. An example would be when counting numbers, where the
join step is nothing more than adding two numbers together. It would be more efficient to do this within the producer instead of spawning work for this that can be executed on a separate thread, as the overhead of producing some
Work trumps the benefit of executing
+ in parallel.
Next in this four-part series is Parallel streaming in Haskell: Part 3 - A parallel work
consumer. We will show a concrete implementation for the
runConduitWithWork function from
runConduitWithWork :: Int -> ConduitT () (WorkOr Void) IO r -> IO r
This function runs the conduit and will consume and run any
WOWork parallel work units that are produced in the meantime. The implementation is suprisingly simple!
You can discuss this blog post on hackernews and on reddit.
[blog2]: "Parallel streaming in Haskell: Part 2 - Optimized parallel aggregations"
[blog4]: "Parallel streaming in Haskell: Part 4 - Conditionals and non-blocking evaluation"
Are you interested in working at Channable? Please take a look at our vacancy page to see if we have an open position that suits you!Apply now