Jug as nix-for-Python

In this post, I want to show how Jug can be understood as nix for Python pipelines.

What is Jug?

Jug is a framework for Python which enables parallelization, memoization of results, and generally facilitates reproducibility of results.

Consider a very classical problem: you want to process a set of files (in a directory called data/) and then summarize the results:

from glob import glob

def count(f):
    # Imagine a long running computation
    n = 0
    for _ in open(f):
        n += 1
    return n

def mean(partials):
    final = sum(partials)/len(partials)
    with open('results.txt', 'wt') as out:
        out.write(f'Final result: {final}\n')


inputs = glob('data/*.txt')
partials = [count(f) for f in inputs]
mean(partials)

This works well, but if the count function takes a while (which would not be the case in this example), it would be great to be able to take advantage of multiple processors (or even a computer cluster), as the problem is embarrassingly parallel (this is an actual technical term, by the way).

With Jug, the code looks just a bit different and we get parallelism for free:

from glob import glob
from jug import TaskGenerator

@TaskGenerator
def count(f):
    # Long running computation
    n = 0
    for _ in open(f):
        n += 1
    return n

@TaskGenerator
def mean(partials):
    final = sum(partials)/len(partials)
    with open('results.txt', 'wt') as out:
        out.write(f'Final result: {final}\n')


inputs = glob('data/*.txt')
partials = [count(f) for f in inputs]
mean(partials)

Now, we can use Jug to obtain parallelism, memoization and all the other goodies.

Please see the Jug documentation for more info on how to do this.

What is nix?

Nix is a package management system, similar to those used in Linux distributions or conda.

What makes nix almost unique (Guix shares similar ideas) is that nix attempts perfect reproducibility using hashing tricks. Here’s an example of a nix package:

{ buildPythonPackage, fetchPypi, numpy, bottle, pyyaml, redis, six, zlib }:

buildPythonPackage rec {
  pname = "Jug";
  version = "2.0.0";
  buildInputs = [ numpy ];
  propagatedBuildInputs = [
    bottle
    pyyaml
    redis
    six
    zlib
  ];

  src = fetchPypi {
    pname = "Jug";
    version = "2.0.0";
    sha256 = "1am73pis8qrbgmpwrkja2qr0n9an6qha1k1yp87nx6iq28w5h7cv";
  };
}

This is a simplified version of the Jug package itself (the full thing is in the official repo). The nix language is a bit hard to read in detail. For today, what matters is that this is a package that depends on other packages (numpy, bottle, …) and is a standard Python package obtained from PyPI (nix has library support for these common use cases).

The result of building this package is a directory with a name like /nix/store/w8d485y2vrj9wylkd5w4k4gpnf7qh3qk-python3.6-Jug-2.0.0

You may be able to guess that the bit in the middle there, w8d485y2vrj9wylkd5w4k4gpnf7qh3qk, is a computed hash of some sort. In fact, this is the hash of the code used to build the package.

If you change the source code for the package or how it is built, then the hash will change. If you change any dependency, then the hash will also change. So, the final result identifies exactly what was used to get there.
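To make the idea concrete, here is a toy sketch in Python. It is not how nix actually computes store paths (the real scheme differs in its details); store_hash and the source and recipe strings are made up for illustration. It only shows the property described above: the hash of a package covers its source, its build recipe, and the hashes of its dependencies.

```python
import hashlib

def store_hash(name, source, build_recipe, dependencies=()):
    """Toy hash of a package: source + build recipe + dependency hashes."""
    h = hashlib.sha256()
    for part in (name, source, build_recipe, *sorted(dependencies)):
        h.update(part.encode("utf-8"))
        h.update(b"\0")  # separator, so ("ab", "c") and ("a", "bc") differ
    return h.hexdigest()[:32]

# Two hypothetical versions of a dependency
numpy_a = store_hash("numpy", "numpy source, version A", "build recipe")
numpy_b = store_hash("numpy", "numpy source, version B", "build recipe")

# The same package built against each of them
jug_a = store_hash("Jug-2.0.0", "jug source", "build recipe", [numpy_a])
jug_b = store_hash("Jug-2.0.0", "jug source", "build recipe", [numpy_b])

print(jug_a != jug_b)  # True: changing a dependency changes the hash
```

Nothing about the Jug source or its build recipe differs between jug_a and jug_b; the hash changes only because a dependency changed.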

Jug as nix-for-Python pipelines

Above, I did not present the internals of how Jug works, but it is very similar to nix. Let’s unpack the magic a bit:

@TaskGenerator
def count(f):
    ...

@TaskGenerator
def mean(partials):
    ...

inputs = glob('data/*.txt')
partials = [count(f) for f in inputs]
mean(partials)

This can be seen as an embedded domain-specific language for specifying the dependency graph:

partials = [Task(count, f)
                for f in inputs]
Task(mean, partials)

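Now, Task(count, f) will get repeatedly instantiated with a particular value for f. For example, if the files in the data directory are named 0.txt, 1.txt, …, there is one count task per file, and all of them feed into a single mean task.

[Figure: task dependency graph, from the Jug manuscript]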

Jug works by hashing together count and the values of f to uniquely identify the results of each of these tasks. If you’ve used Jug, you will certainly have noticed the appearance of a magic directory jugfile.jugdata containing files with names such as jugfile.jugdata/37/b4f7f68c489a6cf3e62fdd7536e1a70d0d2b87. This is equivalent to the /nix/store/w8d485y2vrj9wylkd5w4k4gpnf7qh3qk-python3.6-Jug-2.0.0 path above: it uniquely identifies the result of some computational process so that, if anything changes, the path will change.

Like nix, it works recursively, so that Task(mean, partials), which expands to Task(mean, [Task(count, "0.txt"), Task(count, "1.txt"), Task(count, "2.txt")]) (assuming 3 files, called 0.txt,…) has a hash value that depends on the hash values of all the dependencies.
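The recursive hashing can be sketched in a few lines of Python. This is a toy model, not Jug’s actual implementation: ToyTask, task_generator and the exact bytes being hashed are assumptions made for illustration. The point is only that a task’s hash is built from the function name and its arguments, and that an argument which is itself a task contributes its own hash.

```python
import hashlib

class ToyTask:
    """A delayed call: stores the function and arguments without running them."""
    def __init__(self, f, *args):
        self.f = f
        self.args = args

    def hash(self):
        h = hashlib.sha1()
        h.update(self.f.__name__.encode("utf-8"))
        for a in self.args:
            _update(h, a)
        return h.hexdigest()

def _update(h, value):
    if isinstance(value, ToyTask):
        # A dependency contributes its own hash (this is the recursion)
        h.update(b"T" + value.hash().encode("ascii"))
    elif isinstance(value, (list, tuple)):
        h.update(b"L")
        for v in value:
            _update(h, v)
        h.update(b"E")
    else:
        h.update(b"V" + repr(value).encode("utf-8"))

def task_generator(f):
    """Calling the decorated function builds a ToyTask instead of running f."""
    def make_task(*args):
        return ToyTask(f, *args)
    return make_task

@task_generator
def count(f):
    ...

@task_generator
def mean(partials):
    ...

final = mean([count(f) for f in ["0.txt", "1.txt", "2.txt"]])
h = final.hash()
# Laid out like the paths above: two characters, a slash, then the rest
print(f"jugfile.jugdata/{h[:2]}/{h[2:]}")
```

Replacing any one of the input file names changes the hash of the corresponding count task and, through it, the hash of the final mean task.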

So, despite the completely different origins and implementations, in the end, Jug and nix share many of the same conceptual models to achieve something very similar: reproducible computations.

NIXML: nix + YAML for easy reproducible environments

The rise and fall of bioconda

A year ago, I remember a conversation which went basically like this:

Them: So, to distribute my package, what do you think I should use?

Me: You should use bioconda.

Them: OK, that’s interesting, but what about …?

Me: No, you should use bioconda.

Them: I will definitely look into it, but maybe it doesn’t fit my package and maybe I will just …

Me: No, you should use bioconda.

That was a year ago. Mind you, I knew there were some issues with conda, but it had a decent UX (user-experience), and more importantly, the community was growing.

Since then, conda has hit scalability problems, which makes running it increasingly frustrating. It is slow (an acknowledged issue, but I have had multiple instances of waiting 20 minutes for an error message, which didn’t even help me solve the problem). Mysterious errors are not uncommon, and things that used to work now fail (I have had this happen more and more recently).

Thus, I no longer recommend bioconda so enthusiastically. What before seemed like some esoteric concerns about guaranteed correctness are now biting us.

The nix model

nix is a package manager (and the basis of the NixOS Linux distribution) with a focus on declarative, reproducible builds.

You write a little file (often called default.nix) which describes exactly what you want and the environment is generated from this, exactly the same each time. It has a lot going for it in terms of potential for science:

  1. Can be reproducible to a fault (Byte-for-Byte reproducibility, almost).
  2. Declarative means that the best practice of “store your environment for later use” is very easy to implement.¹

Unfortunately, the UX of nix is not great, and making the environments reproducible, although possible, is not trivial (though it is now much easier than it used to be). Nix is very powerful, but it uses a complicated domain-specific language and a semi-documented, ever-evolving set of build conventions, which makes it hard for even experienced users to use it directly. There is no way that I can recommend it for general use.

The stack model

Stack is a tool for Haskell which uses the following concept for reproducible environments:

  1. The user specifies a list of packages that they want to use
  2. The user specifies a snapshot of the package directory.

The snapshot determines the versions of all of the packages, which automated testing has revealed to work together (at least up to the limits of unit testing). Furthermore, there is no need to say “version X.Y.Z of package A; version Q.R.S of package B,…”: you specify a single, globally encompassing version (note that this is one of the principles we adopted in NGLess, as we describe in the manuscript).

I really like this UX:

  • Want to update all your packages? Just change this one number.
  • Didn’t work? Just change it back: you are back where you started. This is the big advantage of declarative approaches: what you did before does not matter, only the current state of the project.
  • Want to recreate an environment? Just use this easy-to-read text file (for technical reasons, two files, but you get the drift).
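As a rough illustration of why this works, here is a toy model in Python. It is not how stack is implemented; the snapshot names, package names and version numbers are all invented. A project is just a list of package names plus one snapshot identifier, and the snapshot alone decides every version.

```python
# Hypothetical snapshots: all names and version numbers are made up
SNAPSHOTS = {
    "snapshot-1": {"pkg-a": "1.0", "pkg-b": "2.3", "pkg-c": "0.9"},
    "snapshot-2": {"pkg-a": "1.1", "pkg-b": "3.0", "pkg-c": "0.9"},
}

def resolve(snapshot, packages):
    """Return the version of each requested package, as fixed by the snapshot."""
    versions = SNAPSHOTS[snapshot]
    missing = [p for p in packages if p not in versions]
    if missing:
        raise KeyError(f"not in {snapshot}: {', '.join(missing)}")
    return {p: versions[p] for p in packages}

project = {"snapshot": "snapshot-1", "packages": ["pkg-a", "pkg-b"]}
before = resolve(project["snapshot"], project["packages"])

# Update all packages: change one identifier
project["snapshot"] = "snapshot-2"
after = resolve(project["snapshot"], project["packages"])

# Did not work? Change it back, and the result is exactly what it was
project["snapshot"] = "snapshot-1"
restored = resolve(project["snapshot"], project["packages"])

print(before == restored)  # True
```

Because resolve depends only on the current contents of the project description, nothing that happened in between (the upgrade, the rollback) can leak into the result.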

Enter NIXML

https://github.com/luispedro/nixml

This is an afternoon hack, but the idea is to combine nix’s power with stack’s UX by allowing you to specify a set of packages in nix using YAML.

For example, start with this env.nlm file:

nixml: v0.0
snapshot: stable-19.03
packages:
  - lang: python
    version: 2
    modules:
      - numpy
      - scipy
      - matplotlib
      - mahotas
      - jupyter
      - scikitlearn
  - lang: nix
    modules:
      - vim

Now, running

nixml shell

returns a shell with the packages listed. Running

nixml shell --pure

returns a shell with only the packages listed, so you can be sure to not rely on external packages.

Internally, this just creates a nix file and runs it, but it adds the stack-like interface:

  1. it is always automatically pinned: stable-19.03 means the versions of these packages that were available in the stable branch in March 2019.
  2. the syntax is simple: there is no need to know about python2.withPackages or any other nix internals like that. This means a loss of power for the user, but it will be a better trade-off 99% of the time.
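To give a feel for what “creates a nix file” could mean, here is a hedged sketch in Python. It is not the actual NIXML code: nix_expression, the shape of the generated expression, and the placeholder URL are all assumptions for illustration. The spec is written as a plain dict mirroring the YAML above, so that the sketch has no dependencies.

```python
def nix_expression(spec, nixpkgs_url):
    """Turn a NIXML-like spec into the text of a nix shell expression (toy version)."""
    inputs = []
    for group in spec["packages"]:
        if group["lang"] == "python":
            interpreter = f"python{group['version']}"
            modules = " ".join(group["modules"])
            inputs.append(f"({interpreter}.withPackages (ps: with ps; [ {modules} ]))")
        elif group["lang"] == "nix":
            # Plain nix packages are listed as they are
            inputs.extend(group["modules"])
        else:
            raise ValueError(f"unsupported lang: {group['lang']}")
    body = "\n    ".join(inputs)
    return (
        f'with import (fetchTarball "{nixpkgs_url}") {{}};\n'
        "mkShell {\n"
        "  buildInputs = [\n"
        f"    {body}\n"
        "  ];\n"
        "}\n"
    )

spec = {
    "snapshot": "stable-19.03",
    "packages": [
        {"lang": "python", "version": 2, "modules": ["numpy", "scipy", "mahotas"]},
        {"lang": "nix", "modules": ["vim"]},
    ],
}

# Placeholder URL: a real tool would map the snapshot name to a pinned nixpkgs archive
expr = nix_expression(spec, "https://example.invalid/nixpkgs-stable-19.03.tar.gz")
print(expr)
```

The user never sees python2.withPackages; it only appears in the generated text, which is the trade-off described in point 2.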

This is very much a work in progress right now, but I am putting it out there as it is already usable for Python-based projects.


¹ There are two types of best-practice advice: the type that most people adopt once they try it out, and the type that you need to keep hammering into people’s heads. The second type should be seen as a failure of the tool: “best practices” are a user-experience smell.

Packaging ngless with nix & brew

I spent a little bit of time packaging ngless in these last few days. I currently have packages for nix and homebrew. I used homebrew because I wanted to get a package for Mac OS, but there is also an easy-to-install Linux port, so this could be an option for Linux users too.

Because ngless is Haskell-based, I was afraid this was going to be complicated. Before stack appeared, getting all the dependencies right for a Haskell project was a non-trivial affair (the Google query “cabal hell” returns >500k results).

Contrary to expectations, it was surprisingly easy to package ngless with both nix and brew.

Nix

The result looks like this:

{ nixpkgs ? import <nixpkgs> {}, compiler ? "ghc801" }:
nixpkgs.pkgs.haskell.packages.${compiler}.callPackage ./ngless.nix { }

plus the file ngless.nix, which was 95% autogenerated with cabal2nix.

Now, with nix, you can install ngless with a single command:

nix-env -i -f https://github.com/luispedro/ngless-nix/archive/master.tar.gz

At the moment, you need to be on the unstable channel for this to work (because some of the dependencies were only added recently).

The only non-trivial bit was tweaking the dependencies so that ngless can be compiled with GHC 8. I am still using the previous version (7.10) for development and could have just used it as well for this distribution (in fact you can still use it by specifying --arg compiler '"ghc7103"' on the command line).

However, (1) eventually GHC 8 will take over, so we might as well see the issues now, and (2) ngless compiled with GHC 8 runs 10-20% faster (in part because of a change in GHC introduced to make ngless faster¹).

Homebrew

Since I already use nix on my laptop and have experience navigating its quirks, I was confident that I could get nix to work, but I had never used brew.

Still, about 30 minutes after starting to google for brew, I had a formula that mostly worked. One or two bug fixes later and it just works. I was able to install ngless on both a Linux and a Mac machine using it. Below I show the whole thing so you can see how small the resulting file is.

On the one hand, it was much simpler than nix: dependencies are implicit, and there were no weird error messages from the nix language or any need to follow a long trail of references to figure out where something is defined. On the other hand, I did run into a few issues related to versions and caching, which would not have existed with nix.

After brew downloads the file corresponding to a specific version, it will not redownload even if you change its URL/SHA256. It just assumes that the previous file is correct and you have to manually delete it from the cache. Not a big deal, but nix would not let you make this mistake.
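The difference can be sketched as two toy caches in Python. Neither is the real implementation of brew or nix; the class names and the fetch signature are invented. The first cache is keyed by name and version and trusts whatever is already there; the second is keyed by the expected hash, so editing the hash in the package description can never return the old file.

```python
import hashlib

def sha256_hex(data):
    return hashlib.sha256(data).hexdigest()

class VersionKeyedCache:
    """Caches by (name, version) and assumes the cached file is correct."""
    def __init__(self):
        self._files = {}

    def fetch(self, name, version, expected_sha256, download):
        key = (name, version)
        if key not in self._files:
            self._files[key] = download()
        return self._files[key]  # expected_sha256 is never consulted on a hit

class HashKeyedCache:
    """Caches by expected hash and verifies every download against it."""
    def __init__(self):
        self._files = {}

    def fetch(self, name, version, expected_sha256, download):
        if expected_sha256 not in self._files:
            data = download()
            if sha256_hex(data) != expected_sha256:
                raise ValueError(f"hash mismatch for {name}-{version}")
            self._files[expected_sha256] = data
        return self._files[expected_sha256]

old, new = b"old archive", b"new archive"

by_version = VersionKeyedCache()
by_version.fetch("ngless", "0.0.0", sha256_hex(old), lambda: old)
# The description now points at a new archive, but the version is unchanged
stale = by_version.fetch("ngless", "0.0.0", sha256_hex(new), lambda: new)

by_hash = HashKeyedCache()
by_hash.fetch("ngless", "0.0.0", sha256_hex(old), lambda: old)
fresh = by_hash.fetch("ngless", "0.0.0", sha256_hex(new), lambda: new)

print(stale == old, fresh == new)  # True True
```

With the version-keyed cache, the only way out is to delete the cached file by hand, which is the situation described above.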

This seems to be the nix tradeoff: a bit rough on the user experience over the short term but easier to understand and fewer errors over the long term. If the efforts to improve nix’s usability bear fruit, it’ll be superior to alternatives. Right now, brew is still nicer for casual users.

Here is the complete brew formula:

require "language/haskell"

class Ngless < Formula
  include Language::Haskell::Cabal

  desc "Domain Specific Language for NGS Processing"
  homepage "http://ngless.readthedocs.io/"
  url "https://github.com/luispedro/ngless/archive/34e7e79be62f76b1a4d61d94a59a1e42d9111308.zip"
  sha256 "1ac04e535390b0a6189bcfc1c486b54443dc46b7f6b74bf5221b8e4c98c463bc"
  version "0.0.0"

  head "https://github.com/luispedro/ngless.git"

  depends_on "ghc" => :build
  depends_on "cabal-install" => :build

  depends_on "bwa" => :install
  depends_on "samtools" => :install

  def install
    system "m4 NGLess.cabal.m4 > NGLess.cabal"
    install_cabal_package
  end
end

¹ This change really did fix a performance bug in GHC, which should help all multi-threaded programmes, but it was ngless that triggered it.