February 24, 2017
Channable is a tool for marketeers and webshop owners that connects shops to market places, affiliate platforms, and comparison websites. We download product data through a feed or API connection, clean it up using our configurable rules engine, and we send the transformed data to any platform.
The process consists of various stages. We have “jobs” for every stage, and execute those on a pool of worker servers. There are some constraints on which jobs are allowed to run: product feed downloads must be complete before we can process any data, and jobs that access the same resource are not allowed to run in parallel. All of this was coordinated by a system that we called “the scheduler”. Around October last year, this scheduler was starting to show its limits. Channable had been growing steadily, and occasionally jobs were being submitted faster than the scheduler could schedule them. The time had come to replace it with something that scaled.
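As a rough illustration of the constraints described above, the "jobs on the same resource must not run in parallel" rule can be sketched as a small pure function. This is a hypothetical sketch with made-up `Job` and `Resource` types, not the actual scheduler code:

```haskell
import qualified Data.Set as Set

newtype Resource = Resource String deriving (Eq, Ord, Show)

data Job = Job
  { jobName      :: String
  , jobResources :: [Resource]
  } deriving (Eq, Show)

-- Given the currently running jobs, pick the queued jobs whose resources
-- are free, claiming resources greedily in queue order.
runnableJobs :: [Job] -> [Job] -> [Job]
runnableJobs running queued = go busy queued
  where
    busy = Set.fromList (concatMap jobResources running)
    go _ [] = []
    go taken (j : js)
      | any (`Set.member` taken) (jobResources j) = go taken js
      | otherwise =
          j : go (Set.union taken (Set.fromList (jobResources j))) js
```

A greedy pass like this only covers the resource rule; the actual dependency resolution was expressed in Datalog, as described below.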
Our scheduler was implemented in Python, with most of the core logic done by a Python implementation of Datalog (a fork of pyDatalog), a declarative query language based on Prolog. We figured back then that deciding which jobs can and cannot run would be a great use case for Datalog. You describe the dependencies as logical statements, and Datalog returns the jobs for which these statements are true. And because we had the power of a Prolog-like language at our disposal, very complicated dependencies could be described in the future. It turned out, however, that we never actually needed any of that power. We have had the same dependency constraints ever since we implemented them.
The Python/Datalog scheduler served us well for a long time, but as the company grew larger, it was becoming a bottleneck. It would occasionally take minutes to resolve dependencies within the job graph, when jobs were being submitted at a rate of a few dozen per second. The system was slow for several reasons.
Aside from the performance issues, our pyDatalog fork was ugly and nobody understood how it worked anymore. Furthermore, we had encountered a few bugs in the scheduler that a static type system would have prevented. So we decided that it was time for a rewrite. This time we would make algorithmic complexity a primary concern, to ensure that our system could easily scale to a million jobs. And while we were at it, we considered using a compiled language with a strong static type system as well.
Ever since Arian (our Haskell ninja) joined Channable, he has been evangelizing the beauty of Haskell, and many of our colleagues are enthusiastic about functional programming. But our main codebase is written in Python, and you don’t simply start converting parts of it to Haskell. Replacing the scheduler was the perfect opportunity: it was a small self-contained project, and Haskell suits the problem domain of constraint resolution well. So one morning, Ruud ran `stack init`. Jobmachine was born.
Our lead developer Robert usually comes in a bit later, so we had about an hour to build a working prototype. There were some concerns about using a non-mainstream language, but if we could get it done, there would be no going back. And if this small project was a success, we might get more colleagues on board.
Indeed we had the core algorithm implemented quickly, and in fact it has changed very little since. Haskell really is a great language to express this kind of thing: determining which jobs can be executed is a pure function `[Job] -> [Job]`. But of course, a real application needs to interact with the outside world. We need persistence, logging, and a REST API. We need to terminate gracefully when systemd sends a `SIGTERM`, and we need to be able to recover from non-graceful termination. It took us (Arian and Ruud) about two months to get Jobmachine into a semi-usable state. At that point we decided to take the pragmatic route: fix all blocking issues and postpone the less important ones. After all, at some point you need to ship something.
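For the graceful-shutdown part, a common pattern is to have the signal handler do nothing but signal the main loop, which then runs its own cleanup. A minimal sketch, assuming the `unix` package; the names here are ours, not Jobmachine's:

```haskell
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import System.Posix.Signals (Handler (Catch), installHandler, sigTERM)

-- Block until SIGTERM arrives, then run the given cleanup action.
-- The handler itself only fills an MVar, so it stays async-signal-simple.
waitForShutdown :: IO () -> IO ()
waitForShutdown cleanup = do
  stop <- newEmptyMVar
  _ <- installHandler sigTERM (Catch (putMVar stop ())) Nothing
  takeMVar stop   -- block until systemd asks us to stop
  cleanup         -- flush state, close connections, then exit
```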
Now that the project was nearly production-ready, it was time to get other developers involved. With Stack, getting things running in a development environment was easy:
```
git clone email@example.com:channable/jobmachine.git
cd jobmachine
stack setup
stack build
stack exec jobmachine &
stack exec worker
```
No messing around with virtualenvs and requirement files. Everybody builds with the same dependency versions. No more “works on my machine” and “did you install the latest requirements?”. We marked a few issues as “good first issue”, and tried to get more developers to work on the codebase to lower the bus factor. We took care to keep the code readable and approachable. It is said that Haskell programmers are a rare species, but actually the majority of developers at Channable had used Haskell before. With a bit of help and code review, language was not a barrier.
Then on January 4, we deployed the first version into production. With zero downtime. But that is a story for a different post. That version had contributions from three developers besides us (Arian and Ruud). At the time of writing, the repository has six contributors. Jobmachine has been running reliably since. On the few occasions when the queue filled up quickly, the system remained perfectly responsive, and when our hosting provider suffered a power outage, we lost no data and our application recovered correctly when it came back online.
This last point is more of a consequence of the architecture, but using a pure functional language definitely guided us into the right direction there.
Building, packaging, and shipping Python packages -- in a reproducible way -- has always been somewhat painful. We managed to do it with various freeze files and virtualenv hacks, but it was not pretty. Stackage snapshots completely solved this problem for us. With a curated set of packages that work together, the only thing we have to do is occasionally upgrade to a new snapshot. We are confident that when we do a `stack build`, we get a working artifact, every time. And we get a single binary to ship, which depends only on a few system libraries.
Furthermore, Haskell gave us great confidence in the correctness of our code. Many classes of bugs simply do not show up, because they are caught either at compile time or by our test suite. Some statistics: since we deployed, we have had 2 actual bugs in our Haskell code. In the meantime, we fixed more than 10 bugs in the Python code that interfaces with the Jobmachine API.
Moreover, we are able to refactor without fear. We have been experimenting with adding some new features to the job assignment algorithm recently, and the procedure of extending it is rather mechanical: extend the types to reflect information you need, fix all type errors, write a test, and change code until the tests are green.
"Btw, Haskell and QuickCheck are awesome! Changing this was a really smooth process: check tests are green, edit, check tests fail, change test, done :)"
— Laurens after implementing a feature
QuickCheck was a great aid in testing our code. For example, the following snippet caught many bugs in serialization before they were even committed:
```haskell
doesRoundtrip decode encode x =
  (deserialize decode . serialize encode) x == Right x
```
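To make the property concrete, here is a self-contained variant where `show` and `read` stand in for the real encoder and decoder pair. This is a hypothetical illustration, not Jobmachine's actual wire format, and in the real test suite QuickCheck supplies the random inputs:

```haskell
-- Stand-in serializer pair: round-trip values through String.
serialize :: Show a => a -> String
serialize = show

deserialize :: Read a => String -> Either String a
deserialize s = case reads s of
  [(x, "")] -> Right x
  _         -> Left ("no parse: " ++ s)

-- The round-trip property: decoding an encoded value gives it back.
doesRoundtrip :: (Eq a, Read a, Show a) => a -> Bool
doesRoundtrip x = deserialize (serialize x) == Right x
```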
We also use QuickCheck for larger tests. A big part of our application can be summarized as a fold over events: `Event -> State -> State`.
This means we can test invariants of our system by generating random events:
```haskell
handleEvent :: Event -> State -> State

prop "the scheduler state is always consistent" $ \events ->
  let
    -- We do a scan so we observe all intermediate states.
    states :: [State]
    states = scanl (flip handleEvent) State.empty events

    isGood state = State.recomputeRedundantFields state === state
  in
    conjoin (fmap isGood states)
```
State contains a few data structures to enable efficient lookup of jobs with certain properties. These are redundant and can be recomputed from other fields. With this test we ensure that the acceleration structures are always in sync with the main data.
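A minimal sketch of what such a redundant index and its recompute function might look like. The types here are hypothetical stand-ins, not the real `State`:

```haskell
import qualified Data.Map.Strict as Map
import qualified Data.Set as Set

data Status = Queued | Running deriving (Eq, Ord, Show)

data State = State
  { jobs         :: Map.Map Int Status             -- authoritative data
  , jobsByStatus :: Map.Map Status (Set.Set Int)   -- redundant index
  } deriving (Eq, Show)

-- Rebuild the index from the authoritative data. A consistent state is a
-- fixed point of this function, which is exactly what the property tests.
recomputeRedundantFields :: State -> State
recomputeRedundantFields st = st
  { jobsByStatus = Map.fromListWith Set.union
      [ (status, Set.singleton jobId)
      | (jobId, status) <- Map.toList (jobs st)
      ]
  }
```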
Finally, Jobmachine is several orders of magnitude faster than the old scheduler. Scheduling overhead is well below a millisecond per job even with a million jobs in the system. This is mostly due to better algorithms, but GHC -- a mature optimizing compiler -- deserves some of the credit too.
Of course, there were some issues that we ran into as well. There are some well-known Haskell gripes. Some were actually a problem in practice, others were just annoying.
In one case, Haskell did not quite deliver on its promise. Haskell is one of the few languages that can encode effects. Things like “Can this function access Redis?” and “Does this function log to the console?” can be encoded in the type system. We ended up with a mix of free monads and monad transformers, where we should have picked one. In practice the monad transformers are imposed on us by various libraries, and we should probably rewrite our code in the same style. The real issue, however, is that all of this breaks down when used with threads, because functions such as `async` operate in the IO monad. Suppose we have these functions (slightly adapted from our source):
```haskell
runWorkerLoop :: (MonadBackingStore m, MonadLogger m, MonadIO m) => WorkerState -> m WorkerState
runFetchLoop  :: (MonadBackingStore m, MonadIO m) => m ()
runWorker     :: (MonadBackingStore m, MonadLogger m, MonadIO m) => m ()
```
In `runWorker` we want to start two threads: one that runs the worker loop, and one that runs the fetch loop. (The fetch loop retrieves jobs from the backing store and puts them in a `TBQueue`, and the worker loop drains this queue. The queue is used to sequence events; for example, a job being completed also enqueues an event.) The issue here is that we cannot run `forkIO`, because we don’t have the right transformer stack. We tried various solutions, but in the end we removed most of the `MonadX m` constraints and changed everything back to plain `IO`.
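In plain `IO`, the two loops are straightforward to wire up with `forkIO` and a queue. A minimal sketch using a `Chan` from `base`; the real code uses a `TBQueue` and a backing store, so the fixed job list here is just a stand-in:

```haskell
import Control.Concurrent (forkIO)
import Control.Concurrent.Chan (newChan, readChan, writeChan)
import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
import Control.Monad (replicateM)

-- Fetch loop feeds a queue, worker loop drains it; no transformer stack
-- needed, so forkIO works directly.
runWorker :: IO [String]
runWorker = do
  queue <- newChan
  done  <- newEmptyMVar
  -- Fetch loop: in reality this would poll the backing store.
  _ <- forkIO $ mapM_ (writeChan queue) ["job-1", "job-2", "job-3"]
  -- Worker loop: drain the queue and report the processed jobs.
  _ <- forkIO $ putMVar done =<< replicateM 3 (readChan queue)
  takeMVar done
```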
Another concern is keeping code readable, and approachable for new team members who might not be very experienced with Haskell. Because Haskell is so expressive, this is perhaps more of an issue than in other languages. We mitigate this using code review. Sometimes you just have to say, “Yes that is a really nice bifunctor instance there Arian, but please just write the two lines yourself.” Or remark that
`(submittedCount m) + 1` is probably more readable than `succ . submittedCount $ m`. Keeping code readable should always be a priority, and this project was no different.
At Channable we handle a lot of jobs, such as downloading and optimizing feeds. Our Python-based scheduler, which coordinated executing these jobs on worker servers, was starting to show its limits, and we had a few more reasons to consider alternatives to Python. We built a replacement in Haskell, which has been running reliably since. Most improvements are architectural, but a pure functional language definitely guided us in the right direction. We caught many bugs at compile time that in Python might have made it into production. Local development and packaging became simpler. And most importantly, the responses of other team members have been very positive.