In this post, I want to show how Jug can be understood as nix for Python pipelines.
What is Jug?
Jug is a framework for Python which enables parallelization, memoization of results, and generally facilitates reproducibility of results.
Consider a classic problem setup: you want to process a set of files (in a directory called data/) and then summarize the results:
from glob import glob

def count(f):
    # Imagine a long running computation
    n = 0
    for _ in open(f):
        n += 1
    return n

def mean(partials):
    final = sum(partials)/len(partials)
    with open('results.txt', 'wt') as out:
        out.write(f'Final result: {final}\n')

inputs = glob('data/*.txt')
partials = [count(f) for f in inputs]
mean(partials)
This works well, but if the count function takes a while (which would not be the case in this example), it would be great to take advantage of multiple processors (or even a computer cluster), as the problem is embarrassingly parallel (this is an actual technical term, by the way).
With Jug, the code looks only slightly different and we get parallelism for free:
from glob import glob
from jug import TaskGenerator

@TaskGenerator
def count(f):
    # Long running computation
    n = 0
    for _ in open(f):
        n += 1
    return n

@TaskGenerator
def mean(partials):
    # Divide by the number of partial results, as in the plain version above
    final = sum(partials)/len(partials)
    with open('results.txt', 'wt') as out:
        out.write(f'Final result: {final}\n')

inputs = glob('data/*.txt')
partials = [count(f) for f in inputs]
mean(partials)
Now, we can use Jug to obtain parallelism, memoization and all the other goodies.
Please see the Jug documentation for more info on how to do this.
What is nix?
Nix is a package management system, similar to those used in Linux distributions or conda.
What makes nix almost unique (Guix shares similar ideas) is that nix attempts perfect reproducibility using hashing tricks. Here’s an example of a nix package:
{ numpy, bottle, pyyaml, redis, six, zlib }:
buildPythonPackage rec {
  pname = "Jug";
  version = "2.0.0";
  buildInputs = [ numpy ];
  propagatedBuildInputs = [
    bottle
    pyyaml
    redis
    six
    zlib
  ];
  src = fetchPypi {
    pname = "Jug";
    version = "2.0.0";
    sha256 = "1am73pis8qrbgmpwrkja2qr0n9an6qha1k1yp87nx6iq28w5h7cv";
  };
}
This is a simplified version of the Jug package itself (the full thing is in the official repo). The Nix language is a bit hard to read in detail. For today, what matters is that this is a package that depends on other packages (numpy, bottle, …) and is a standard Python package obtained from PyPI (nix has library support for these common use-cases).
The result of building this package is a directory with a name like /nix/store/w8d485y2vrj9wylkd5w4k4gpnf7qh3qk-python3.6-Jug-2.0.0. You may be able to guess that the bit in the middle, w8d485y2vrj9wylkd5w4k4gpnf7qh3qk, is a computed hash of some sort. In fact, this is the hash of the code used to build the package.
If you change the source code for the package or how it is built, then the hash will change. If you change any dependency, then the hash will also change. So, the final result identifies exactly what was used to get there.
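To make the hashing idea concrete, here is a toy sketch in Python. It is not the real Nix algorithm: the function names, the recipe strings, the dependency names, and the use of a truncated hexadecimal SHA-256 digest are all assumptions made for illustration. The only point is that the path is derived from the build recipe and from the hashes of the dependencies.

```python
import hashlib

def store_hash(name, build_recipe, dependency_hashes):
    """Toy hash of a package: NOT the real Nix algorithm."""
    h = hashlib.sha256()
    # Hash the package name and the (hypothetical) build recipe text.
    for field in (name, build_recipe):
        h.update(field.encode() + b"\0")
    # Hash the dependencies' hashes, so a change upstream propagates here.
    for dep in sorted(dependency_hashes):
        h.update(dep.encode() + b"\0")
    return h.hexdigest()[:32]

def store_path(name, build_recipe, dependency_hashes):
    """Build a store-like path; the layout only mimics the example above."""
    digest = store_hash(name, build_recipe, dependency_hashes)
    return f"/nix/store/{digest}-{name}"

# Two hypothetical versions of a dependency.
dep_old = store_hash("dep-1.0", "build dep 1.0", [])
dep_new = store_hash("dep-1.1", "build dep 1.1", [])

# Same package, same recipe, different dependency: different path.
path_old = store_path("Jug-2.0.0", "build Jug", [dep_old])
path_new = store_path("Jug-2.0.0", "build Jug", [dep_new])
print(path_old != path_new)
```

Rebuilding with identical inputs gives the same path again, which is what makes the path usable as an identifier of the result.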
Jug as nix-for-Python pipelines
Above, I did not present the internals of how Jug works, but it is very similar to nix. Let’s unpack the magic a bit:
@TaskGenerator
def count(f):
    ...

@TaskGenerator
def mean(partials):
    ...

inputs = glob('data/*.txt')
partials = [count(f) for f in inputs]
mean(partials)
This can be seen as an embedded domain-specific language for specifying the dependency graph:
partials = [Task(count, f)
            for f in inputs]
Task(mean, partials)
Now, Task(count, f) will get repeatedly instantiated, each time with a particular value for f. For example, if the files in the data directory are named 0.txt, 1.txt, …, we get Task(count, "0.txt"), Task(count, "1.txt"), and so on.
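The idea of a decorator that records calls instead of running them can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the Task class, the resolve helper, and the task_generator decorator below are hypothetical stand-ins written for this post, not Jug's actual implementation.

```python
class Task:
    """Toy deferred call (a stand-in, not Jug's real Task class)."""
    def __init__(self, function, *args):
        self.function = function
        self.args = args

    def run(self):
        # Compute the dependencies first, then call the function itself.
        return self.function(*[resolve(a) for a in self.args])

def resolve(value):
    """Replace Tasks (also inside lists) by their computed results."""
    if isinstance(value, Task):
        return value.run()
    if isinstance(value, list):
        return [resolve(v) for v in value]
    return value

def task_generator(function):
    """Toy decorator: calling the function builds a Task instead of running it."""
    def make_task(*args):
        return Task(function, *args)
    return make_task

@task_generator
def double(x):
    return 2 * x

@task_generator
def total(values):
    return sum(values)

partials = [double(x) for x in [1, 2, 3]]
result = total(partials)      # only the graph exists; nothing has run yet
print(isinstance(result, Task))
print(result.run())           # walks the graph and computes the value
```

Because the script only builds a graph, a separate runner is free to decide which tasks to execute, in which order, and on which processor.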
Jug works by hashing together count and the values of f to uniquely identify the results of each of these tasks. If you’ve used Jug, you will certainly have noticed the appearance of a magic directory, jugfile.jugdata, containing files with names such as jugfile.jugdata/37/b4f7f68c489a6cf3e62fdd7536e1a70d0d2b87. This is equivalent to the /nix/store/w8d485y2vrj9wylkd5w4k4gpnf7qh3qk-python3.6-Jug-2.0.0 path above: it uniquely identifies the result of some computational process so that, if anything changes, the path will change.
Like nix, it works recursively, so that Task(mean, partials), which expands to Task(mean, [Task(count, "0.txt"), Task(count, "1.txt"), Task(count, "2.txt")]) (assuming 3 files, called 0.txt, …), has a hash value that depends on the hash values of all the dependencies.
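The recursive hashing can be sketched as follows. This is a toy model, not Jug's real code: the Task class, the update_hash helper, the choice of SHA-1 over the function name and the repr of plain arguments, and the two-character directory split (chosen only because it matches the shape of the example path above) are all assumptions of this sketch.

```python
import hashlib

class Task:
    """Toy task whose hash depends on its function name and its arguments."""
    def __init__(self, function, *args):
        self.function = function
        self.args = args

    def hash(self):
        h = hashlib.sha1()
        h.update(self.function.__name__.encode())
        for arg in self.args:
            update_hash(h, arg)
        return h.hexdigest()

    def result_path(self, jugdir="jugfile.jugdata"):
        # Assumed layout: first two hex digits as a subdirectory.
        digest = self.hash()
        return f"{jugdir}/{digest[:2]}/{digest[2:]}"

def update_hash(h, value):
    """Feed a value into h; Tasks contribute their own (recursive) hash."""
    if isinstance(value, Task):
        h.update(b"T" + value.hash().encode())
    elif isinstance(value, (list, tuple)):
        h.update(b"L")
        for v in value:
            update_hash(h, v)
        h.update(b"E")
    else:
        h.update(b"V" + repr(value).encode())

def count(f): ...          # bodies are irrelevant for hashing in this sketch
def mean(partials): ...

def build(filenames):
    return Task(mean, [Task(count, f) for f in filenames])

original = build(["0.txt", "1.txt", "2.txt"])
changed = build(["0.txt", "1.txt", "3.txt"])   # one input differs

print(original.hash() == build(["0.txt", "1.txt", "2.txt"]).hash())
print(original.hash() == changed.hash())
print(original.result_path())
```

Changing a single input file name changes the hash of the corresponding count task, and therefore the hash of the mean task that depends on it, just as changing a dependency changes a nix store path.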
So, despite the completely different origins and implementations, in the end, Jug and nix share many of the same conceptual models to achieve something very similar: reproducible computations.