Jug as nix-for-Python

In this post, I want to show how Jug can be understood as nix for Python pipelines.

What is Jug?

Jug is a framework for Python which enables parallelization, memoization of results, and generally facilitates reproducibility of results.

Consider a very classical problem setting: you want to process a set of files (in a directory called data/) and then summarize the results:

from glob import glob

def count(f):
    # Imagine a long running computation
    n = 0
    for _ in open(f):
        n += 1
    return n

def mean(partials):
    final = sum(partials)/len(partials)
    with open('results.txt', 'wt') as out:
        out.write(f'Final result: {final}\n')


inputs = glob('data/*.txt')
partials = [count(f) for f in inputs]
mean(partials)

This works well, but if the count function takes a while (which would not be the case in this example), it would be great to take advantage of multiple processors or even a computer cluster, as the problem is embarrassingly parallel (this is an actual technical term, by the way).

With Jug, the code looks just a bit different and we get parallelism for free:

from glob import glob
from jug import TaskGenerator

@TaskGenerator
def count(f):
    # Long running computation
    n = 0
    for _ in open(f):
        n += 1
    return n

@TaskGenerator
def mean(partials):
    final = sum(partials)/len(partials)
    with open('results.txt', 'wt') as out:
        out.write(f'Final result: {final}\n')


inputs = glob('data/*.txt')
partials = [count(f) for f in inputs]
mean(partials)

Now, we can use Jug to obtain parallelism, memoization and all the other goodies.

Please see the Jug documentation for more info on how to do this.

What is nix?

Nix is a package management system, similar to those used in Linux distributions or conda.

What makes nix almost unique (Guix shares similar ideas) is that nix attempts perfect reproducibility using hashing tricks. Here’s an example of a nix package:

{ numpy, bottle, pyyaml, redis, six, zlib }:

buildPythonPackage rec {
  pname = "Jug";
  version = "2.0.0";
  buildInputs = [ numpy ];
  propagatedBuildInputs = [
    bottle
    pyyaml
    redis
    six
    zlib
  ];

  src = fetchPypi {
    pname = "Jug";
    version = "2.0.0";
    sha256 = "1am73pis8qrbgmpwrkja2qr0n9an6qha1k1yp87nx6iq28w5h7cv";
  };
}

This is a simplified version of the Jug package itself (the full thing is in the official repo). The Nix language is a bit hard to read in detail. For today, what matters is that this is a package that depends on other packages (numpy, bottle, …) and is a standard Python package obtained from PyPI (nix has library support for these common use cases).

The result of building this package is a directory with a name like /nix/store/w8d485y2vrj9wylkd5w4k4gpnf7qh3qk-python3.6-Jug-2.0.0

You may be able to guess that the bit in the middle there, w8d485y2vrj9wylkd5w4k4gpnf7qh3qk, is a computed hash of some sort. In fact, it is a hash of the code used to build the package.

If you change the source code for the package or how it is built, then the hash will change. If you change any dependency, then the hash will also change. So, the final result identifies exactly what was used to get there.
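The idea can be sketched in a few lines of Python (purely illustrative; this is not nix’s actual hashing scheme, and the store_path function is made up for this post):

```python
import hashlib

def store_path(name, version, build_script, dep_hashes):
    # The path is derived from everything that can influence the
    # build: the recipe itself plus the hashes of all dependencies
    h = hashlib.sha256()
    h.update(build_script.encode())
    for dep in sorted(dep_hashes):
        h.update(dep.encode())
    return f'/nix/store/{h.hexdigest()[:32]}-{name}-{version}'

p1 = store_path('python3.6-Jug', '2.0.0', 'setup.py install', ['abc123'])
# Same recipe, but one dependency hash changed:
p2 = store_path('python3.6-Jug', '2.0.0', 'setup.py install', ['def456'])
```

Because the dependencies’ hashes are themselves inputs to the hash, any change anywhere in the build graph propagates all the way to the final path.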

Jug as nix-for-Python pipelines

Above, I did not present the internals of how Jug works, but they are very similar to nix. Let’s unpack the magic a bit:

@TaskGenerator
def count(f):
    ...

@TaskGenerator
def mean(partials):
    ...
inputs = glob('data/*.txt')
partials = [count(f) for f in inputs]
mean(partials)

This can be seen as an embedded domain-specific language for specifying the dependency graph:

partials = [Task(count, f)
                for f in inputs]
Task(mean, partials)
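A minimal Task can be imagined like this (an illustrative sketch, not Jug’s actual implementation; count_len is a stand-in function invented for the example):

```python
class Task:
    """A deferred function call: built now, executed later."""
    def __init__(self, func, *args):
        # Store the call without executing it
        self.func = func
        self.args = args

    def run(self):
        # Resolve arguments first: a Task argument is replaced by its
        # result, and lists of Tasks are resolved element by element,
        # which executes the graph in dependency order
        def resolve(a):
            if isinstance(a, Task):
                return a.run()
            if isinstance(a, list):
                return [resolve(x) for x in a]
            return a
        return self.func(*[resolve(a) for a in self.args])

# Mirror the jugfile: building the graph executes nothing
def count_len(s):
    return len(s)

partials = [Task(count_len, name) for name in ['0.txt', '1.txt', '22.txt']]
total = Task(sum, partials)
result = total.run()   # only now does any function actually run
```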

Now, Task(count, f) will get repeatedly instantiated with a particular value for f. For example, if the files in the data directory are named 0.txt, 1.txt, …

(Figure from the Jug manuscript.)

Jug works by hashing together count and the values of f to uniquely identify the results of each of these tasks. If you’ve used Jug, you will certainly have noticed the appearance of a magic directory jugfile.jugdata, with files with names such as jugfile.jugdata/37/b4f7f68c489a6cf3e62fdd7536e1a70d0d2b87. This is equivalent to the /nix/store/w8d485y2vrj9wylkd5w4k4gpnf7qh3qk-python3.6-Jug-2.0.0 path above: it uniquely identifies the result of some computational process, so that, if anything changes, the path will change.

Like nix, it works recursively, so that Task(mean, partials), which expands to Task(mean, [Task(count, "0.txt"), Task(count, "1.txt"), Task(count, "2.txt")]) (assuming 3 files, called 0.txt,…) has a hash value that depends on the hash values of all the dependencies.
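The recursion can be sketched like this (illustrative only; Jug’s real scheme differs in its details). A sub-task is represented here by a (function name, arguments) tuple, and its hash feeds into the hash of the task that depends on it:

```python
import hashlib

def task_hash(func_name, args):
    # A task's identity is its function plus the identities of its
    # arguments; sub-tasks contribute their own hashes recursively
    h = hashlib.sha256(func_name.encode())
    for a in args:
        if isinstance(a, tuple):   # (func_name, args) stands in for a sub-task
            h.update(task_hash(*a).encode())
        else:
            h.update(repr(a).encode())
    return h.hexdigest()

h1 = task_hash('mean', [('count', ['0.txt']), ('count', ['1.txt'])])
# Renaming one input file changes a leaf task, and therefore the root:
h2 = task_hash('mean', [('count', ['0.txt']), ('count', ['CHANGED.txt'])])
```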

So, despite the completely different origins and implementations, in the end, Jug and nix share many of the same conceptual models to achieve something very similar: reproducible computations.

Introduction to Jug: Parallel Tasks in Python

Next Tuesday, I’m giving a short talk on Jug for the Heidelberg Python Meetup.

If you miss it, you can hear it in Berlin at the BOSC2013 (Bioinformatics Open Source Conference) in July. I will take this opportunity to write a couple of posts about jug.

Jug is a cross between the venerable make and Python. In make tradition, you write a jugfile.py. Perhaps this is best illustrated by an example.

We are going to implement the dumb algorithm for finding all primes under 100. We write a function to check whether a number is prime:

def is_prime(n):
    from time import sleep
    # Sleep a little bit so that this does not run ridiculously fast
    sleep(1.)
    for j in range(2, n-1):
        if (n % j) == 0:
            return False
    return True

Then we build tasks out of this function:

from jug import Task
primes100 = [Task(is_prime, n) for n in range(2, 101)]

Each of these tasks is of the form "call is_prime with argument n". So far, we have only built the tasks; nothing has been executed. One important point to note is that the tasks are all independent.

You can run jug execute on the command line and jug will start executing tasks:

jug execute &

The nice thing is that it is fine to run multiple of these at the same time:

jug execute &
jug execute &
jug execute &
jug execute &

They will all execute in parallel. We can use jug status to check what is happening:

jug status

Which prints out:

Task name                                    Waiting       Ready    Finished     Running
----------------------------------------------------------------------------------------
primes.is_prime                                    0          74          20           5
........................................................................................
Total:                                             0          74          20           5

74 is_prime tasks are still in the Ready state, 5 are currently running (which is what we expected, right?) and 20 are done.

Wait a little bit and check again:

Task name                                    Waiting       Ready    Finished     Running
----------------------------------------------------------------------------------------
primes.is_prime                                    0           0          99           0
........................................................................................
Total:                                             0           0          99           0

Now every task is finished. If we now run jug execute, it will do nothing, because there is nothing for it to do!
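This no-op behaviour is memoization: each result is stored under a hash of the task, and a finished task is simply never run again. A toy version of the idea (illustrative only, not Jug’s implementation; cached_call and slow_square are invented for this sketch):

```python
import hashlib
import os
import pickle
import tempfile

STORE = tempfile.mkdtemp()   # stands in for jugfile.jugdata/

def cached_call(func, *args):
    # Results are stored under a hash of the function name and its
    # arguments; a second call finds the file and skips the work
    key = hashlib.sha256(repr((func.__name__, args)).encode()).hexdigest()
    path = os.path.join(STORE, key)
    if os.path.exists(path):
        with open(path, 'rb') as f:
            return pickle.load(f)
    result = func(*args)
    with open(path, 'wb') as f:
        pickle.dump(result, f)
    return result

calls = []
def slow_square(n):
    calls.append(n)   # record how many times the body actually runs
    return n * n

r1 = cached_call(slow_square, 7)
r2 = cached_call(slow_square, 7)   # second call hits the cache
```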

§

The introduction above has a severe flaw: this is not how you should compute all primes smaller than 100. Also, I have not shown how to get the prime values. On Monday, I will post a more realistic example.

It will also include a processing pipeline where later tasks depend on the results of earlier tasks.

§

(Really weird thing: as I am typing this, WordPress suggests I link to posts on feminism and Australia. Probably some Australian reference that I am missing here.)