Haskell vs Zombies

August 16, 2023

Channable is a tool for marketeers and web shop owners that connects shops to market places, affiliate platforms, and comparison websites. We download product data through a feed file or API connection, process it and send the transformed data to any platform. Our processing pipeline is composed of several jobs, which are executed on a pool of servers.

From the point of view of our worker servers, a job is simply a command line that starts the process performing the work, together with some metadata for identifying it. The workers spawn those commands as subprocesses, and wait for them to finish – or kill them if they exceed the timeout.

Easy enough, or so it seemed. However, occasionally we would see some undead zombie-processes: some of our job executables would just keep on running, rather than being killed by the timeout after a few hours. Moreover, they were running directly under init, despite having been started by our supervisor. Somehow, processes were slipping through the cracks, becoming unsupervised.

On Linux, the only way for processes to move up in the hierarchy is when their parent process dies. In that case, init will adopt the orphaned processes. Since the supervisor process was still intact, the sub-processes must have been started by the jobs themselves. We had two situations where we didn’t take this behavior into account:

  1. The first one was accidental: a wrapper script was changed from using exec to invoking the executable as a regular command. This meant that the real workload was no longer replacing the shell process running the wrapper script, but instead ran as a subprocess of the shell.
  2. The second one was intentional (without thinking about the consequences): one particular workload had been changed to invoke a subprocess for doing the heavy lifting (which was too slow to do in Python).

In both cases, we had sub-processes running that the supervisor was not aware of. So when it terminated their parent due to e.g. a timeout, it would not terminate the subprocesses. Hence, these child processes were now orphaned, and consequently adopted by init (or the first subreaper they encountered on the way up - more about that later).

We therefore wanted to build a system that

  1. prevents orphaned subprocesses from escaping, and
  2. makes sure terminating a job also terminates all subprocesses that are potentially associated with it.

Containers, except not

One might argue that all the problems identified above are what containerization has been created for. After all, spawning a process in a tight and controlled environment that it cannot escape from seems a perfect use-case for the likes of Docker, systemd-nspawn or LXC. While this is true to some extent, going down this path is not without its own problems.

Above all, fully fledged container solutions have to be actively managed from a security and DevOps perspective. This would create an additional, significant burden on our fellow engineers. We’d also opt for rootless (non-privileged) containers, which in many cases makes the configuration even more difficult.
Besides, consider the characteristics of our operation: we spawn on-demand tasks based on a set of well-defined requests, our own scheduling algorithm and load reporting. It is tailored very well to our needs - we take into account, for example:

  • job dependencies,
  • which tasks can run concurrently and which cannot,
  • our own system of priorities based on our knowledge whether the job is interactive or not,
  • “load-balancing” based on what project the job is running for and what type it is (e.g. an import or an export).

We also monitor job output and memory usage statistics to provide users and administrators with logs. Using fully fledged containers would undoubtedly mean scrapping all of that and figuring out from scratch how to integrate it with whichever container solution we had chosen. A lot of work with no visible benefit. What we need is something that we can easily integrate with our application - more like a container-as-a-library.

Moreover, we do not really want or need complete isolation. Our foremost requirement is simply to control the lifetime of running processes, but we do not care that much about e.g. an isolated network environment or resource access.

To summarize, we’re looking for a solution that:

  • does not require additional DevOps support (ideally - no extra daemons, privilege escalations, or complex configuration),
  • does not force us to rework large parts of our system,
  • (following the above) can be called from our Haskell codebase.

Making Linux do the work for us

There are three main ways on Linux for getting subprocesses back under the control of the supervisory process.

The first is to use the PR_SET_PDEATHSIG prctl for every subprocess that is spawned. PDEATHSIG is short for “parent death signal”, i.e. the signal that the kernel sends to this process once its parent process dies. For instance, we could set it to SIGTERM and have the subprocess be gracefully terminated when the parent is terminated (e.g. by the supervisor upon hitting a timeout).
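For illustration, a minimal sketch of this approach (the helper name is ours, not from our codebase): right after fork(), the child installs the parent-death signal and re-checks its parent to close the race in which the parent dies before prctl() runs.

#define _GNU_SOURCE
#include <signal.h>
#include <sys/prctl.h>
#include <sys/types.h>
#include <unistd.h>

/* Sketch only: spawn a child that receives SIGTERM as soon as its parent dies. */
static pid_t spawn_with_pdeathsig(char *const argv[])
{
  pid_t parent = getpid();
  pid_t child = fork();
  if (child == 0)
  {
    if (prctl(PR_SET_PDEATHSIG, SIGTERM) == -1)
      _exit(127);
    if (getppid() != parent)
      _exit(127); /* the parent already died before prctl() took effect */
    execvp(argv[0], argv);
    _exit(127); /* exec failed */
  }
  return child; /* -1 if fork() failed */
}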

The downside of this approach is that it is something that each subprocess has to do itself – rather than being a solution we can implement once that covers all the cases.

Another option is to designate our supervisor process to be a so-called “subreaper”. By default, when a process dies, all its children are adopted by the init process. But that role can instead be taken by a subreaper. When processes die in the tree underneath the subreaper, the orphaned children will be adopted by the subreaper rather than init.
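A sketch of what that could look like (again, our own helper names): the supervisor marks itself as a subreaper once at startup, and orphaned descendants that get re-parented to it can then be picked up with an ordinary waitpid() loop.

#include <sys/prctl.h>
#include <sys/types.h>
#include <sys/wait.h>

/* Sketch only: run once in the supervisor at startup. */
static int become_subreaper(void)
{
  return prctl(PR_SET_CHILD_SUBREAPER, 1);
}

/* Reap any (adopted) children that have terminated, without blocking. */
static void reap_adopted_children(void)
{
  int status;
  pid_t pid;
  while ((pid = waitpid(-1, &status, WNOHANG)) > 0)
  {
    /* an adopted (or direct) child exited; account for `pid` here */
  }
}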

Unfortunately, preventing the orphaned subprocesses from escaping is only half of the goal. The other is making sure we terminate all abandoned processes. Since that is not straightforward to do with the subreaper approach, we looked into a different way:

One neat feature of Linux is namespaces. Many system resources (like processes, users, network interfaces, mount points, …) can be scoped with namespaces¹. Of particular interest to us are PID namespaces. They allow us to create a distinct process hierarchy where the top-level process within that namespace takes on the role of init. This means that

  1. All orphaned subprocesses are adopted by this init process within the namespace
  2. When the “init” process within the namespace exits, the kernel will send SIGKILL to all remaining processes in the namespace (just like stopping the real init would).

Being init immediately solves our problems: no process can escape it! It is the ultimate parent of all processes (even our “zombies” are ultimately re-parented to init) and if init dies, all other processes die as well. Sadly, you can’t be init on a regular system where that role is already taken (usually by systemd). But dig this: if you are in an isolated PID namespace, you can very well become the init process (PID 1) within that namespace! After all, there is (almost) nothing special about the init process, except that it takes up the very first process id.

With this solution, we can fulfill both of our requirements in one go: orphaned processes are captured by the init process within the namespace, and terminating the init process of the namespace kills the other remaining processes in the namespace for good: no more zombies!

Implementation

There are three ways to manipulate namespaces in the kernel, but only one is suitable for our purposes (and, in general, for Haskell’s multi-threaded runtime system). The first one is setns

int setns(int fd, int nstype);

Given a file descriptor of a namespace and its type, it will reassociate the calling thread with that namespace. The crux is the _calling thread_ part. Since we are using the multi-threaded RTS which uses the green-threads model, we do not know which OS thread we end up with, and we cannot guarantee that we will stay on the same thread when interacting with the namespace. Precisely speaking, the documentation says:

Reassociating the calling thread with a PID namespace changes only the PID namespace that subsequently created child processes of the caller will be placed in; it does not change the PID namespace of the caller itself.

So if we called setns in one Haskell thread, it would reassociate some OS thread, but we could not control if it was the same thread doing the fork. Another problem we saw, regardless of threading issues, is a chicken-and-egg situation where the namespace must exist before the call. Therefore, we cannot use it.
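For context, a hypothetical sketch of how setns() is typically used: you need a file descriptor for an already existing namespace (here borrowed from another process via /proc), and the call re-associates the calling OS thread - exactly the two properties that rule it out for us.

#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Sketch only: join the PID namespace of an existing process. Children forked
   afterwards are placed in that namespace; the caller itself is not. */
static int join_pid_ns_of(pid_t target)
{
  char path[64];
  int fd, rc;

  snprintf(path, sizeof(path), "/proc/%d/ns/pid", (int)target);
  fd = open(path, O_RDONLY | O_CLOEXEC);
  if (fd == -1)
    return -1;
  rc = setns(fd, CLONE_NEWPID);
  close(fd);
  return rc;
}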

Another one is unshare

int unshare(int flags);

It moves the calling process into new namespaces specified by the flags; for example, the CLONE_NEWPID flag creates a new PID namespace (into which the caller’s subsequently created children are placed - the caller itself stays in its original PID namespace). The problem with this call is more or less the same as with the previous one: it has complex interactions with threads if the caller is not single-threaded. For example, CLONE_NEWPID implies the CLONE_THREAD flag, which results in an error if the caller is not single-threaded.

We could have used this after the fork, when the new subprocess is single-threaded (and we actually use it for the _unbecoming root_ trick that we’ll cover later on), but there is another option that lets us do the same before the fork and gives us more control over various details: the absolute workhorse of Linux process creation - the clone family. It provides the most sophisticated control over what is shared between parent and child, so it can be used safely in multi-threaded programs. It is similar to fork() or vfork() (in fact, on modern systems those are implemented in terms of clone()).

In a nutshell, it creates a new child process with a complete memory clone of the parent process, which can either be shared (effectively creating a thread) or not. In the latter case you get a new process that will start its life as an independent child process, possibly after some setup. Since the parent's memory clone will be marked as COW (Copy-on-Write), you generally do not pay the price of making a copy of the parent process virtual memory contents when calling clone() (but you do pay for copying the page mappings which can be a substantial cost for a large heap). There are two syscalls that may be of interest - clone and clone3:

long clone(unsigned long flags, void *stack,
           int *parent_tid, int *child_tid,
           unsigned long tls);

long clone3(struct clone_args *cl_args, size_t size);

The flags argument is a combination of constants that govern properties of the newly created child - for example, that it is to be placed in new namespaces, which is exactly what we want to achieve! The second variant is a superset of the first: the clone_args struct gives more control and supports some new flags that the old call does not. See its man page for all the details.
The first call has had a _glibc wrapper_ for a long time, albeit with a different signature:

int clone(int (*fn)(void *), void *stack, int flags, void *arg, ...
          /* pid_t *parent_tid, void *tls, pid_t *child_tid */ );

For the other, a _glibc wrapper_ has only been available since the recent 2.35 release, but we don’t have that _glibc_ on our servers - and even if we did, support for it ends up being disabled in various distributions. The wrapper gives the illusion of parent-child separation by wrapping the child’s code in a function whose pointer you pass as the first argument. Why is this so handy? Since you can allocate a fresh stack for the child (the second argument), any parent variables addressed as offsets from the old stack pointer would be inaccessible in the child once clone() returns there - the compiler has no idea the stack was swapped underneath it. Glibc prevents you from making this mistake by letting you share just a single data pointer (which stays valid because the child’s memory is a clone of the parent’s) and doing the assembly plumbing itself: it saves the function pointer and the argument on the new stack and calls the function in the child as soon as the new stack is in place.
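To make this concrete, here is a rough sketch (our own names, not the code we ship) of how the glibc clone() wrapper is commonly used: the child body is a plain function, and the wrapper takes care of switching to the freshly allocated stack before calling it.

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>

struct job
{
  char *const *argv;
};

/* Runs in the child, on the new stack that glibc switched to for us. */
static int child_main(void *arg)
{
  struct job *job = arg; /* points into the COW'd copy of the parent's memory */
  execvp(job->argv[0], job->argv);
  _exit(127); /* exec failed */
}

/* Sketch only: spawn the job in new user and PID namespaces. */
static pid_t spawn_cloned(struct job *job)
{
  const size_t stack_size = 1024 * 1024;
  char *stack = malloc(stack_size);
  if (stack == NULL)
    return -1;
  /* the stack grows downwards on x86, so pass the top of the allocation */
  return clone(child_main, stack + stack_size,
               CLONE_NEWUSER | CLONE_NEWPID | SIGCHLD, job);
}

In a real implementation the parent would also have to free (or reuse) the child’s stack after reaping it.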

Pretty good so far. We can take advantage of all the kernel’s namespace features with the clone() call, and perhaps even use the nice _glibc wrapper_. Unfortunately, this system call is quite difficult to use correctly, especially in the presence of Haskell’s RTS, as it has some idiosyncrasies. Consider that after a successful clone, the child gets a copy of the parent’s (among other things):

  • file descriptors,
  • semaphores and mutexes,
  • memory mappings,
  • signal handlers.

This implies that between clone() and exec() (where we completely wipe the child process memory to start our job) we have to be very careful. If, for example, you were to call a function that takes a lock, it might happen that the lock was held at the time of clone(), so it will be held in the child - a recipe for deadlock. Ditto for any kind of global resource. Even a seemingly innocuous printf() is dangerous because it manipulates the global stdio buffer, so in the child you might see partial contents of some parent printf() operation that was occurring during cloning. Even Lennart Poettering himself learnt it the hard way. Let’s call it - problem 1.

In addition, the Haskell RTS allows you to install signal handlers (and installs some of its own by default - SIGINT, SIGQUIT, SIGPIPE and SIGTSTP). This is bad for us because if one of these signals were received after cloning and before the exec call (exec resets signal dispositions), it would trigger the _cloned_ RTS code wreaking havoc - when a Haskell signal handler is called, it uses a pipe to communicate with the IO scheduler. Since this pipe is shared with the child after cloning, any signal caught this way would magically appear as if sent to the parent.

That’s problem 2.

We have to deal with these problems somehow. Problem 1 cannot be avoided, you just have to be careful - and we can be careful, as we control the implementation. Problem 2 can be dealt with like other libraries do, notably the process library - by blocking user signals right before cloning and restoring them afterwards. It would work, but since we spawn new jobs very often, we would have to call {block,unblock}UserSignals very frequently - and it does not prevent the _default_ signal handlers from running. It’s a better idea to rely on modern kernel features, which we can do because we know this code will only run on modern kernels. This modern approach is to use the CLONE_CLEAR_SIGHAND flag (available since Linux 5.5) when calling clone(). This resets all the signal handlers for the child, which solves (2) without any fuss.
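For comparison, a sketch of that classic workaround (our own helpers, not what we ended up shipping): block every signal in the calling thread around the clone window, so that no cloned handler can fire in the child before exec().

#include <pthread.h>
#include <signal.h>

/* Sketch only: block every signal in the calling thread, saving the old mask. */
static void block_all_signals(sigset_t *saved)
{
  sigset_t all;
  sigfillset(&all);
  pthread_sigmask(SIG_SETMASK, &all, saved);
}

/* Restore a previously saved signal mask. */
static void restore_signals(const sigset_t *saved)
{
  pthread_sigmask(SIG_SETMASK, saved, NULL);
}

The parent would call block_all_signals() right before clone() and restore_signals() right after; the child would restore the mask just before exec(), since the signal mask - unlike handler dispositions - survives execve(). With CLONE_CLEAR_SIGHAND the handler problem goes away without any of this bookkeeping.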

There is one more reason why it is important to start with a fresh set of signal handlers. When the first process joins a PID namespace, it becomes the init process for that namespace, with all the usual rules governing init processes. One is that:

The only signals that can be sent to process ID 1, the init process, are those for which init has explicitly installed signal handlers. This is done to assure the system is not brought down accidentally.

Therefore, we definitely do not want to inherit signal handlers from the parent process as we do not want to take down the namespace unless we specifically want to.

Alas, CLONE_CLEAR_SIGHAND is beyond the scope of the old clone() (and _glibc wrapper_). It is not recognized as a valid constant, and we must use the new clone3() call to be able to use it. As mentioned before, there is no _glibc-wrapper_ for clone3() to ease the burden of calling it properly, so it must be called via:

long syscall(long number, ...);

This means that we’re going back to a very old UNIX technique called _fork-exec_ (we cannot rely on _glibc_ function pointer goodies, as we don’t want to code in assembly to set it up properly). It will look like this:

child_pid = syscall(SYS_clone3, &cl_args, sizeof(cl_args));

switch (child_pid)
{
case -1:
  goto bad_clone;
case 0:
  _exit(run_init(&child_args));
default: /* cleanup */

  return child_pid;
}

But there are some pitfalls we must avoid. First, passing arguments works as long as you don’t remap the stack. If the stack is not remapped, i.e. its virtual address has not been changed after clone(), its memory contents will be the same as in the parent, so all the assumptions made by the compiler for addressing variables on the stack, such as their relative positions in relation to the stack frame or the stack pointer value, will remain intact. Second, the run_init() part has to be careful not to call any _glibc_ function that could mess with the inherited state. The diagram helps to explain it better.

[Diagram: the clone()/exec() flow. The parts marked with skulls are tainted; we have to be careful there.]

So we’re ready to build our process isolation with the following steps:

  1. Configure the execution environment for the child process, i.e.
    • communication pipes intercepting its std{in,out,err}
    • environment variables
    • path and arguments of the task to be executed
    • UID/GID mappings between namespaces
  2. Clone the parent process, entering the appropriate namespaces in the child
  3. Then
    • (in parent) grab the child PID and its read/write pipes and return them
    • (in child) set up std{in,out,err}, the environment and whatever else might be needed, then exec() the job

What namespaces do we want to utilize? Definitely the PID namespace - that’s what it’s all about. But also while designing this system, we came across some interesting things that we wanted to have:

  • We want to be root in our little “container”, but we do not want the executed job to think it’s running as root - ideally it should think that it is running as a normal user. For monitoring purposes (htop, ps) we also want to see that the job is being run as our service user. For that, we need a user namespace, and we’ll pass the “real” outer user and group IDs to the child so it can set up the mappings.
  • We would like the executed job to think that it is _alone_ in the system, as that allows us to monitor resource usage per job (e.g. memory) very easily - for that we need a mount namespace (to re-mount the /proc pseudo-filesystem within the child hierarchy).

First, let’s set up a preamble with the necessary variables:

pid_t exec_in_new_ns(char *const args[], //must end with NULL pointer
                     char **environment, //must end with NULL pointer
                     int *in,
                     int *out,
                     int *err,
                     // error handling (omitted)
                     char **what_failed)
{
  int comm_fds_out[2] = {-1, -1}, comm_fds_err[2] = {-1, -1}, comm_fds_in[2] = {-1, -1};

  struct child_args child_args;
  struct clone_args cl_args = {0};
  pid_t child_pid;

  int last_error = 0;

Then we need to set up communication channels with the spawned subprocess. We chose pipes for input/output.

  /* setup pipes to communicate with the spawned child.
  Note: we are using O_CLOEXEC here to spare ourselves the trouble of closing these in init,
  but also to prevent leaking them to child processes. They can still be used as std{out,err}
  in spawned children because of dup2() magic
  */
  if (pipe2(comm_fds_out, O_CLOEXEC) != 0 ||
      pipe2(comm_fds_err, O_CLOEXEC) != 0 ||
      pipe2(comm_fds_in, O_CLOEXEC) != 0)
  {
    last_error = errno;
    goto bad_pipe;
  }

Next, we establish the environment that will be used by the init procedure - it includes the pipes we want for stdin/out/err, the path to the executable we want to run and its arguments, environment variables and the necessary arguments for clone: which namespaces we will enter, signal handling details… Please note that this structure lives on the execution stack. Because we do not remap the stack, its address relative to the stack pointer is the same in both the parent and the child - also from the compiler’s point of view - so accessing it in the child works.

  /* clone! */

  /* prepare argument array for the child */
  child_args.stdout_fd = comm_fds_out[1];
  child_args.stderr_fd = comm_fds_err[1];
  child_args.stdin_fd = comm_fds_in[0];
  child_args.path = args[0];
  child_args.argv = args;
  child_args.environment = environment;
  child_args.euid = geteuid();
  child_args.egid = getegid();

  /* prepare clone arguments */
  cl_args.flags = CLONE_NEWUSER        // init new user ns
                | CLONE_NEWPID         // init new PID ns (in that user ns)
                | CLONE_NEWNS          // init new mount namespace (for /proc)
                | CLONE_CLEAR_SIGHAND  // signal handlers are not inherited from the parent
      ;

  cl_args.exit_signal = SIGCHLD; // signal to get emitted when run_init is terminated
  // Whatever you do, DO NOT TRY to set '.stack' here as that will mess up everything without proper care !!!

Finally, we call clone3() via _syscall_. Depending on the return value, we either continue with the rest of the parent’s code or run the child’s init. Most of the error handling is omitted in these snippets, but you know what to expect - painstaking C-style _errno_ checking that drives everyone crazy.

  /* fire it up! */

  child_pid = syscall(SYS_clone3, &cl_args, sizeof(cl_args));

  switch (child_pid)
  {
  case -1:
    last_error = errno;
    goto bad_clone;
  case 0:
    /* here everything hangs by a thread (or, actually, a register). All this works only because
    the kernel set up the stack pointer to point at what happens to be the COW'd copy of
    the old stack. That's why we can refer to &child_args. Let's get out of here as soon as
    possible. */
    //  exit() would run the parent's atexit handlers and flush its (cloned) stdio buffers
    //  if exec() fails, so we use _exit() instead.
    _exit(run_init(&child_args));
  default: /* cleanup */
    close(comm_fds_out[1]);
    close(comm_fds_err[1]);
    close(comm_fds_in[0]);

    *out = comm_fds_out[0];
    *err = comm_fds_err[0];
    *in = comm_fds_in[1];

    return child_pid;
  }

  // error handling (omitted)

  return -1;
}

Let’s write the child’s entry point - run_init

static int run_init(struct child_args *args)
{
  char errstr[4096];

First, we need to configure parent-child communication by installing the opposite ends of the parent’s pipes as the child’s stdin, stdout and stderr.

  /* install stdin, stdout and stderr */
  if (dup2(args->stderr_fd, STDERR_FILENO) == -1)
  {
    _perror(args->stderr_fd, "No stderr logging", errno);
  }
  if (dup2(args->stdout_fd, STDOUT_FILENO) == -1)
  {
    _perror(args->stderr_fd, "No stdout logging", errno);
  };
  if (dup2(args->stdin_fd, STDIN_FILENO) == -1)
  {
    _perror(args->stderr_fd, "No stdin", errno);
  }

Next - mounting /proc helps create the perfect illusion of isolation.

  /* set up a new mount for procfs */

  // retry a few times if EBUSY
  int retries = 5, ret;
  while (retries-- &&
         (ret = mount("proc", "/proc", "proc", MS_NOSUID | MS_NOEXEC | MS_NODEV, NULL)) < 0 &&
         errno == EBUSY)
  {
    sched_yield();
  }
  if (ret < 0)
  {
    _perror(args->stderr_fd, "/proc: mount failed", errno);
    return EX_UNAVAILABLE;
  }

Now comes a little inception: the process that is Mr. X in the real world becomes _root (0)_ in the newly created “container”, and then we make it believe it is no longer root by creating another nested user namespace in which it is Mr. X again. Crazy, but it pays off: ps and htop show the correct UID/GID for the job, and it placates scripts that check which user is executing them and e.g. refuse to run as root (even though the permissions of the namespaced root do not actually extend beyond the namespace).

  /* finally, prepare the user namespace we're in.
  It's important because otherwise we'd act as 'nobody' to the outside world, e.g. when opening files
  not owned by the user namespace. Also, after execvp all capabilities would be lost.
  */

  // here we're becoming the root of our little "container"
  if (_prep_user_mappings(0, args->euid, 0, args->egid) < 0)
  {
    _perror(args->stderr_fd, "Failed to write user/group ID mapping in user namespace", errno);
    return EX_NOPERM;
  }

  /* do a little "unshare" trick. */
  if (unshare(CLONE_NEWUSER) == -1)
  {
    _perror(args->stderr_fd, "Continue without unsharing (uid = 0)", errno);
  }
  else
  {
    if (_prep_user_mappings(args->euid, 0, args->egid, 0) < 0)
    {
      _perror(args->stderr_fd, "Failed to write user/group ID mapping in unshared user namespace", errno);
    }
  }
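_prep_user_mappings is not shown in the listing above; below is a hypothetical sketch of what such a helper could look like, assuming it follows the usual recipe from user_namespaces(7): write a one-line mapping to /proc/self/uid_map and /proc/self/gid_map, and deny setgroups first so that an unprivileged process is allowed to write the gid_map.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Sketch only - the real helper may differ. Write `content` to `path` in one go. */
static int write_whole_file(const char *path, const char *content)
{
  ssize_t len = (ssize_t)strlen(content);
  int fd = open(path, O_WRONLY | O_CLOEXEC);

  if (fd == -1)
    return -1;
  if (write(fd, content, (size_t)len) != len)
  {
    close(fd);
    return -1;
  }
  return close(fd);
}

/* Map a single uid/gid pair between the new user namespace and its parent. */
static int _prep_user_mappings(uid_t inner_uid, uid_t outer_uid,
                               gid_t inner_gid, gid_t outer_gid)
{
  char buf[64];

  snprintf(buf, sizeof(buf), "%d %d 1\n", (int)inner_uid, (int)outer_uid);
  if (write_whole_file("/proc/self/uid_map", buf) == -1)
    return -1;

  /* must be written before gid_map when we lack CAP_SETGID in the parent namespace */
  if (write_whole_file("/proc/self/setgroups", "deny\n") == -1)
    return -1;

  snprintf(buf, sizeof(buf), "%d %d 1\n", (int)inner_gid, (int)outer_gid);
  return write_whole_file("/proc/self/gid_map", buf);
}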

Ready! We have prepared the environment, let's execute the job.

  /* exec job ! */
  execvpe(args->path, args->argv, args->environment);

  //if we're here then surely an error occurred because a successful exec never returns
  snprintf(errstr, 4096, "Cannot exec '%s'", args->path);
  _perror(args->stderr_fd, errstr, errno);
  return EX_OSERR;
}

That’s it!

If you have trouble following the user mappings logic and unshare() “tricks” then we recommend reading about it on LWN where everything is nicely explained. Please also note how all the communication is done here via pipes (printing to console would be unsafe, plus we also want the parent to know what’s going on). _perror() is a little helper we wrote for this purpose:

static void _perror(int fd, const char *s, int errnum)
{
  char buffer[1024];
  const char *errstring;

  errstring = strerror_r(errnum, buffer, sizeof(buffer));
  dprintf(fd, "%s : %s\n", s, errstring);
}

But all in all - we’re doing exactly what we described above.

OK, enough of C! Let’s finally bring namespaces to Haskell!

First, we’ll use FFI to make exec_in_new_ns() callable in Haskell:

{-# LANGUAGE ForeignFunctionInterface #-}
{-# LANGUAGE DerivingStrategies #-}

data ProcessHandles = MkProcessHandles
  { phStdIn :: Handle
  , phStdOut :: Handle
  , phStdErr :: Handle
  } deriving stock (Show, Eq)

type Env = [(String, String)]

type EnvF = Env -> Env

withCEnvironment
  :: Env
  -> (Ptr CString -> IO a)
  -> IO a
withCEnvironment env act = withMany (withCAString . envString) env $ \pEnv ->
    withArray0 nullPtr pEnv act
  where
    envString (name, val) = name ++ ('=' : val)

withCArgs
  :: [String]
  -> (Ptr CString -> IO a)
  -> IO a
withCArgs args act = withMany withCString args $ \pArgs ->
  withArray0 nullPtr pArgs act

-- | Spawns the process found under @argv !! 0@ under new unique pid-, uid- and mount- namespace.
--
-- Returns the PID of the new process along with its standard handles, if successful. On error
-- throws an instance of 'IOError' mapped from the actual @errno@
execProcessInNewNamespace
  :: EnvF -- ^ environment transformer
  -> [String] -- ^ argv
  -> IO (CPid, ProcessHandles)
execProcessInNewNamespace envF args = do
  environ <- getEnvironment
  alloca $ \pFdStdIn ->
    alloca $ \pFdStdOut ->
      alloca $ \pFdStdErr ->
        withCEnvironment (envF environ) $ \pEnv ->
          withCArgs args $ \pArgs ->
            alloca $ \pWhatFailed ->
              doExecProcessInNewNamespace
                pArgs
                pEnv
                pFdStdIn
                pFdStdOut
                pFdStdErr
                pWhatFailed

doExecProcessInNewNamespace
  :: Ptr CString -- ^ args
  -> Ptr CString -- ^ environment
  -> Ptr POSIX.FD -- ^ [out] stdin
  -> Ptr POSIX.FD -- ^ [out] stdout
  -> Ptr POSIX.FD -- ^ [out] stderr
  -> Ptr CString -- ^ [out] what failed
  -> IO (CPid, ProcessHandles) -- ^ pid
doExecProcessInNewNamespace pArgs pEnv pFdStdIn pFdStdOut pFdStdErr pWhatFailed = …

foreign import ccall unsafe "exec_in_new_ns" c_ExecInNewNs
  :: Ptr CString -- ^ args
  -> Ptr CString -- ^ environment
  -> Ptr POSIX.FD -- ^ [out] stdin
  -> Ptr POSIX.FD -- ^ [out] stdout
  -> Ptr POSIX.FD -- ^ [out] stderr
  -> Ptr CString -- ^ [out] what failed
  -> IO CPid -- ^ pid

Second, we need a higher-level interface built around the low-level FFI where we can easily wait for the process termination, terminate it when needed etc. We are going to mimic Haskell's process library to make migrating existing code as easy as possible. To achieve this, we’ll create several records to represent the process about to be created:

data CreateProcess = CreateProcess
  { cpPath :: FilePath -- ^ Path to executable e.g. @\"\/bin\/sh\"@
  , cpArgs :: [String] -- ^ Arguments e.g. @[\"-c\", \"sleep 10\"]@
  , cpEnvF :: EnvF -- ^ Function computing the environment for the new process from the current environment
  }

… and to represent the spawned child process:

data ProcessHandleInternal = Alive CPid | Reaped ProcessStatus
  deriving stock (Eq, Show)

-- | A handle to a process, which can be used to wait for termination of the process using 'waitForProcess'.
data ProcessHandle = ProcessHandle
  { unWrap :: MVar ProcessHandleInternal
  , reaperLock :: MVar ()
  }

-- | Represents a created process
data CreatedProcess = MkCreatedProcess
  { processStdin :: Handle
  , processStdout :: Handle
  , processStderr :: Handle
  , processHandle :: ProcessHandle
  }

type SuperviseProcess a = CreatedProcess -> IO a

The purpose of all of this will be revealed shortly, so keep on reading :-) One thing we can already do with it is to create a new process:

-- | Spawn an external process
createProcess :: CreateProcess -> IO CreatedProcess
createProcess cp = wrapHandles =<<
  execProcessInNewNamespace (cpEnvF cp) (cpPath cp : cpArgs cp)
  where
    wrapHandles (pid, hdls) = do
      handle <- newMVar (Alive pid)
      lock <- newMVar ()
      return MkCreatedProcess
        { processStdin = phStdIn hdls
        , processStdout = phStdOut hdls
        , processStderr = phStdErr hdls
        , processHandle = ProcessHandle { unWrap = handle, reaperLock = lock }
        }

As you can see, we track the lifetime of processes in ProcessHandleInternal (and we also have a mysterious lock that we don’t use yet), but otherwise we just repack stuff. This interface leaves process management entirely to the caller. A better interface would provide a well-defined scope of process lifetime, with supervision and guaranteed termination. We will use functions of type SuperviseProcess and the bracket pattern to ensure that we do not leak any hanging processes.

-- | A bracket-style resource handler for createProcess.
--
-- Ensures that the process gets terminated, all 'CreatedProcess' handles are closed and no zombie
-- is left behind. See 'System.Process.withCreateProcess'.
--
-- __/NOTE/__ The important difference between this call and `withCreateProcess` from the process
-- library is that it will /uninterruptibly/ block while reaping the child during cleanup
-- (for at most 15 seconds) if the child refuses to terminate after @SIGTERM@ (if it's still alive
-- and/or has not been waited for).
-- The other important difference is that it will finally send @SIGKILL@ to this misbehaving
-- process.
withCreateProcess
  :: CreateProcess
  -> (CreatedProcess -> IO a)
  -> IO a
withCreateProcess cp = bracket (createProcess cp) (terminateProcess 15)

So now we just have to implement terminateProcess correctly. It turns out to be a little tricky :-)

The main problem to watch out for is that after killing a process, the PID it was using can be reused by the kernel. So you have to be absolutely sure that you’re reaping the right one. Also, you do not want two threads waiting on a process handle and only one receiving information about the child's death while the other waits indefinitely (or, even worse, waits for an unrelated process that happens to get the same PID later). This certainly reveals to the attentive reader the purpose of the previously introduced reaperLock. The lock simply guards the reapers from stepping on each other's toes. The other lock - processHandle lock - guards our “source of truth” about the state of the process - is it alive or reaped? (Technically, we could get away with one, but this separation makes it easier to write the correct code and also allows us to query the state of the process even if someone is reaping the process at the same time).

-- | Terminate the specified @process@. This function should not be used under normal circumstances;
-- use 'withCreateProcess'. It guarantees to reap @process@ if it is still alive. It does nothing
-- if it has already been reaped, e.g. by `waitForProcess`.
--
-- It works by sending @SIGTERM@ first and waiting for at most @timeout@ time for @process@ to exit.
-- If @process@ didn't exit after this time, it will promptly send @SIGKILL@. __/NOTE/__ This call
-- blocks /uninterruptibly/ for at most @timeout@ time, if @process@ refuses to exit.
terminateProcess
  :: Delay -- ^ timeout (sends @SIGKILL@ if @p@ didn't exit after the delay)
  -> CreatedProcess -- ^ process
  -> IO ()
terminateProcess timeout p = do
  let handle = processHandle p
  withReaperLock handle $ do
    withProcessHandle handle kill
    modifyProcessHandle handle reap
  closeProcess p
  where
    kill (Alive pid) = ignoreEChild $ POSIX.signalProcess sigTERM pid
    kill (Reaped _) = pure ()

    delay = 0.0005 :: Delay -- second

    reap (Alive pid) = Reaped <$> Exception.uninterruptibleMask_ (
      do
        initialTime <- getMonotonicTime
        let deadline = initialTime + delayToTime timeout
        untilJustM $ do
          now <- getMonotonicTime
          if now <= deadline
          then do
            threadDelay (delayToMicroSeconds delay)
            waitForProcessNoHang_ pid
          else do
            killWithFire pid
            Just <$> waitForProcess_ pid
      )
    reap reaped = pure reaped

    killWithFire pid = POSIX.signalProcess sigKILL pid

With a bunch of the usual MVar helpers:

withProcessHandle
  :: ProcessHandle
  -> (ProcessHandleInternal -> IO b)
  -> IO b
withProcessHandle = withMVar . unWrap

modifyProcessHandle
  :: ProcessHandle
  -> (ProcessHandleInternal -> IO ProcessHandleInternal)
  -> IO ()
modifyProcessHandle = modifyMVar_ . unWrap

withReaperLock :: ProcessHandle -> IO b -> IO b
withReaperLock p = withMVar (reaperLock p) . const

and exception helpers:

handleSomeErrors
  :: (IOErrorType -> Bool)
  -> IO b
  -> IO b
  -> IO b
handleSomeErrors p = Exception.handleJust (guard . p . ioeGetErrorType) . const

ignoreSomeErrors
  :: (IOErrorType -> Bool)
  -> IO ()
  -> IO ()
ignoreSomeErrors p = handleSomeErrors p (pure ())

ignoreEChild :: IO () -> IO ()
ignoreEChild = ignoreSomeErrors isEChildErrorType

isEChildError :: IOError -> Bool
isEChildError = isEChildErrorType . ioeGetErrorType

isEChildErrorType :: IOErrorType -> Bool
isEChildErrorType = (== NoSuchThing)

Termination proceeds as follows:

  1. Take the reaper’s lock
  2. If the process is alive
    1. Send SIGTERM
    2. Mask async exceptions and make the reaper uninterruptible *)
    3. Wait 500 us and check the process status
    4. If it died, mark it as dead and remember its exit status.
    5. If it didn’t, go back to step 2.3, but when you’ve had enough - send it SIGKILL
    6. Unmask and close the remaining pipes
  3. If it’s already dead - do nothing
  4. Release the lock

*) That’s because we do not want to half-finish the job; the guarantee is that the uninterruptible block lasts for at most a bounded amount of time.

Having resolved these issues, we can implement things like waitForProcess etc. and finally replace the _process_ library with our version - which basically does the same thing but supports namespaces. Hurray!

- Process.withCreateProcess createProcess superviseProcess
+ Process.withCreateProcess (Process.withSameEnv createProcess) superviseProcess

The final commit looked a bit like the diff above, which made it very nice :-)

Conclusion

We started off with a problem: orphaned processes with the wrong parent that stick around forever. Finding a solution was not easy. A common approach would be to introduce containerization; however, that has its own trade-offs and is in general a complicated solution to a relatively simple problem. After looking at our options, we used Linux namespaces and the latest APIs available on our Linux servers to make sure our job supervisor kills the right processes at the right time.
By implementing the OS-level stuff in C and creating a simple Haskell wrapper, we created a compact and elegant solution to this problem that plagued our worker servers.

When you execute a script on one of our workers, it'll observe an environment very similar to what you would see in Docker:

# worker: [Info] Executing job 'P1 X 2', command ["/bin/bash","-c","ps ax & sleep 7200 ; exit 1"].
# worker: P1 X 2 |  PID TTY    STAT   TIME   COMMAND
# worker: P1 X 2 |    1 pts/0  S+     0:00   /bin/bash -c ps ax & sleep 7200 ; exit 1
# worker: P1 X 2 |    2 pts/0  R+     0:00   ps ax
# worker: P1 X 2 |    3 pts/0  S+     0:00   sleep 7200

Jobs are isolated from each other because we have implemented the execution of child processes in the namespaces of our choice in C. In addition, we were able to integrate it seamlessly with Haskell and make it available as a replacement for the haskell-process library. If there is enough interest, we can try to open-source it on Hackage, so let us know what you think. Thank you!

1: For a complete overview of these APIs and their capabilities, I recommend reading the great article series on LWN.

Marcin Rzeźnicki - Software Development
Fabian Thorand - Lead Development
