Jug as nix-for-Python

In this post, I want to show how Jug can be understood as nix for Python pipelines.

What is Jug?

Jug is a framework for Python which enables parallelization, memoization of results, and generally facilitates reproducibility of results.

Consider a very classical problem framework: you want to process a set of files (in a directory called data/) and then summarize the results:

from glob import glob

def count(f):
    # Imagine a long running computation
    n = 0
    for _ in open(f):
        n += 1
    return n

def mean(partials):
    final = sum(partials)/len(partials)
    with open('results.txt', 'wt') as out:
        out.write(f'Final result: {final}\n')


inputs = glob('data/*.txt')
partials = [count(f) for f in inputs]
mean(partials)

This works well, but if the count function takes a while (which would not be the case in this example), it would be great to be able to take advantage of multiple processors (or even a computer cluster) as the problem is embarrassingly parallel (this is an actual technical term, by the way).

With Jug, the code looks just a bit different and we get parallelism for free:

from glob import glob
from jug import TaskGenerator

@TaskGenerator
def count(f):
    # Long running computation
    n = 0
    for _ in open(f):
        n += 1
    return n

@TaskGenerator
def mean(partials):
    final = sum(partials)/len(partials)
    with open('results.txt', 'wt') as out:
        out.write(f'Final result: {final}\n')


inputs = glob('data/*.txt')
partials = [count(f) for f in inputs]
mean(partials)

Now, we can use Jug to obtain parallelism, memoization and all the other goodies.

Please see the Jug documentation for more info on how to do this.

What is nix?

Nix is a package management system, similar to those used in Linux distributions or conda.

What makes nix almost unique (Guix shares similar ideas) is that nix attempts perfect reproducibility using hashing tricks. Here’s an example of a nix package:

{ buildPythonPackage, fetchPypi, numpy, bottle, pyyaml, redis, six, zlib }:

buildPythonPackage rec {
  pname = "Jug";
  version = "2.0.0";
  buildInputs = [ numpy ];
  propagatedBuildInputs = [
    bottle
    pyyaml
    redis
    six
    zlib
  ];

  src = fetchPypi {
    pname = "Jug";
    version = "2.0.0";
    sha256 = "1am73pis8qrbgmpwrkja2qr0n9an6qha1k1yp87nx6iq28w5h7cv";
  };
}

This is a simplified version of the Jug package itself (the full thing is in the official repo). The nix language is a bit hard to read in detail. For today, what matters is that this is a package that depends on other packages (numpy, bottle, …) and is a standard Python package obtained from PyPI (nix has library support for these common use-cases).

The result of building this package is a directory with a name like /nix/store/w8d485y2vrj9wylkd5w4k4gpnf7qh3qk-python3.6-Jug-2.0.0

You may be able to guess that the bit in the middle there, w8d485y2vrj9wylkd5w4k4gpnf7qh3qk, is a computed hash of some sort. In fact, this is the hash of the code to build the package.

If you change the source code for the package or how it is built, then the hash will change. If you change any dependency, then the hash will also change. So, the final result identifies exactly what was used to get there.
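
To make the idea concrete, here is a toy Python model of this kind of hashing. It is not nix's actual algorithm or store layout: the recipe dictionaries, the store_hash function, the numpy version numbers, and the choice of a truncated SHA-256 are all assumptions made for illustration.

```python
import hashlib
import json

def store_hash(recipe):
    """Hash a build recipe together with the hashes of its dependencies."""
    payload = {
        'name': recipe['name'],
        'version': recipe['version'],
        'build': recipe['build'],
        # Dependencies enter only through their own (recursive) hashes
        'deps': sorted(store_hash(d) for d in recipe['deps']),
    }
    encoded = json.dumps(payload, sort_keys=True).encode('utf-8')
    return hashlib.sha256(encoded).hexdigest()[:32]

# Two hypothetical versions of a dependency (version numbers are made up)
numpy_a = {'name': 'numpy', 'version': '1.16.0',
           'build': 'setup.py install', 'deps': []}
numpy_b = dict(numpy_a, version='1.16.1')

def jug_recipe(numpy):
    return {'name': 'Jug', 'version': '2.0.0',
            'build': 'setup.py install', 'deps': [numpy]}

hash_a = store_hash(jug_recipe(numpy_a))
hash_b = store_hash(jug_recipe(numpy_b))
# The Jug recipe itself is unchanged, yet its hash differs
print(hash_a != hash_b)
```

Only the dependency changed between the two calls, but because the dependency's hash is part of what gets hashed, the top-level hash changes too.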

Jug as nix-for-Python pipelines

Above, I did not present the internals of how Jug works, but it is very similar to nix. Let’s unpack the magic a bit:

@TaskGenerator
def count(f):
    ...

@TaskGenerator
def mean(partials):
    ...

inputs = glob('data/*.txt')
partials = [count(f) for f in inputs]
mean(partials)

This can be seen as an embedded domain-specific language for specifying the dependency graph:

partials = [Task(count, f)
                for f in inputs]
Task(mean, partials)

Now, Task(count, f) will get repeatedly instantiated with a particular value for f. For example, if the files in the data directory are named 0.txt, 1.txt, …, then there is one count task per file and a single mean task that depends on all of them.

[Figure from the jug manuscript]

Jug works by hashing together count and the values of f to uniquely identify the results of each of these tasks. If you’ve used jug, you will have certainly noticed the appearance of a magic directory jugfile.jugdata with files with names such as jugfile.jugdata/37/b4f7f68c489a6cf3e62fdd7536e1a70d0d2b87. This is equivalent to the /nix/store/w8d485y2vrj9wylkd5w4k4gpnf7qh3qk-python3.6-Jug-2.0.0 path above: it uniquely identifies the result of some computational process so that, if anything changes, the path will change.

Like nix, it works recursively, so that Task(mean, partials), which expands to Task(mean, [Task(count, "0.txt"), Task(count, "1.txt"), Task(count, "2.txt")]) (assuming 3 files, called 0.txt,…) has a hash value that depends on the hash values of all the dependencies.
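
The recursion can be sketched with a toy Task class. This is not Jug's real implementation: the class, the helper names, and the exact bytes being hashed are assumptions; only the two-character directory prefix mimics the path layout quoted above.

```python
import hashlib

class Task:
    """Toy stand-in for a Jug task (not Jug's actual code)."""
    def __init__(self, f, *args):
        self.f = f
        self.args = args

    def hash(self):
        h = hashlib.sha1()
        h.update(self.f.__name__.encode('utf-8'))
        for a in self.args:
            _update(h, a)
        return h.hexdigest()

def _update(h, value):
    # Recurse into tasks and lists so that dependencies affect the hash
    if isinstance(value, Task):
        h.update(b'T' + value.hash().encode('ascii'))
    elif isinstance(value, (list, tuple)):
        h.update(b'L')
        for v in value:
            _update(h, v)
    else:
        h.update(b'V' + repr(value).encode('utf-8'))

def result_path(task, base='jugfile.jugdata'):
    """Where a result would be stored: first two hex digits, then the rest."""
    digest = task.hash()
    return f'{base}/{digest[:2]}/{digest[2:]}'

def count(f): ...
def mean(partials): ...

partials = [Task(count, f) for f in ['0.txt', '1.txt', '2.txt']]
final = Task(mean, partials)
# Replacing one input file changes the hash of the final task as well
changed = Task(mean, partials[:2] + [Task(count, '3.txt')])
print(result_path(final))
print(final.hash() != changed.hash())
```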

So, despite the completely different origins and implementations, in the end, Jug and nix share many of the same conceptual models to achieve something very similar: reproducible computations.

Big Data Biology Lab Software Tool Commitments

Cross-posting from our group website.

Preamble. We produce two types of code artefacts: (i) code that is supportive of results in a results-driven paper and (ii) software tools intended for widespread use.

For an example of the first type, see the Code Ocean capsule that is linked to (Coelho et al., 2018). The main goal of this type of code release is to serve as an Extended Methods section to the paper. Hopefully, it will be useful for the small minority of readers of the paper who really want to dig into the methods or build upon the results, but the work aims at biological results.

This document focuses on the second type of code release: tools that are intended for widespread use. We’ve released a few of these in the last few years: Jug, NGLess, and Macrel. Here, the whole point is that others use the code. We also use these tools internally, but if nobody else ever adopts the tools, we will have fallen short.

The Six Commitments

  1. Five-year support (from date of publication). If we publish a tool as a paper, then we commit to supporting it for at least five years from the date of publication. We may stop developing new features, but if there are bugs in the released version, we will assume responsibility and fix them. We will also do any minor updates to keep the tool running (for example, if a new Python version breaks something in one of our Python-based tools, we will fix it). Typically, support is provided if you open an issue on the respective GitHub page and/or post to the respective mailing list.
  2. Standard, easy to install, packages. Right now, this means: we provide conda packages. In the future, if the community moves to another system, we may move too.
  3. High-quality code with continuous integration. All our published packages have continuous integration and try to follow best practices in coding.
  4. Complete documentation. We provide documentation for the tools, including tutorials, example data, and reference manuals.
  5. Work well, fail well. We strive to make our tools not only work well, but also “fail well”: that is, when the user provides erroneous input, we attempt to provide good quality error messages and to never produce bad output (including never producing partial outputs if the process terminates part-way through processing).
  6. Open source, open communication. Not only do we provide the released versions of our tools as open source, but all the development is done in the open as well.
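
As an illustration of the "never producing partial outputs" point in item 5, one common technique is to write to a temporary file and rename it into place only once writing has succeeded. The sketch below is a generic Python version of that idea, not code taken from any of our tools; the function name write_atomically is made up.

```python
import os
import tempfile

def write_atomically(path, text):
    """Write text to path so that path never holds a partial result."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory, so that the final
    # rename stays within one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix='.tmp')
    try:
        with os.fdopen(fd, 'wt') as out:
            out.write(text)
            out.flush()
            os.fsync(out.fileno())
        # Replace the destination in a single step
        os.replace(tmp_path, path)
    except BaseException:
        # On any failure, remove the temporary file and leave path untouched
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

If the process dies part-way through, the worst case is a stray temporary file; the named output either does not exist or is complete.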

Note for group members: This is a commitment from the group and, at the end of the day, the responsibility is Luis’. If you leave the group, you don’t have to be responsible for 5 years. If you leave, your responsibility is just the basic responsibility of any author: to be responsive to queries about what was described in the manuscript, but not anything beyond that. What it does mean is that we will not be submitting papers on tools that risk being difficult to maintain. In fact, while the goals above are phrased as outside-focused, they are also internally important so that we can keep working effectively even as group members move on.

Towards typed pipelines

In the beginning, there was the word. The word was a 16 bit variable and it could be used to store any type of information: you could treat it as an integer (signed or unsigned) or as a pointer. There was no typing. Because this was very error-prone, Hungarian notation was invented, which was a system whereby the name of a variable was enhanced to contain the type that the programmer intended, so that piCount was a pointer to integer named Count.

Nowadays, all languages are typed, either at compile-time (static), at run-time (dynamic), or a mix of the two and Hungarian notation is not used.

There are often big fights on whether static or dynamic typing is best (329 million Google hits for “is static or dynamic typing better”), but typing itself is uncontroversial. In fact, I think most programmers would find the idea of using an untyped programming language absurd.

Yet, there is one domain where untyped programming is still widely used, namely when writing pipelines that combine multiple programmes. In that domain, there is one type, the stream of Bytes. The stream of Bytes is one of the most successful types in programming. In fact, “everything is a file” (which is snappier than “everything is a stream of Bytes”, even though that is what it means) is often considered one of the defining features of Unix.

Like “B” programmes of old (the untyped programming language that came before “C” introduced all the type safety that is its defining characteristic), most pipelines use Hungarian notation to mark file types. For example, if we have a file called data.fq.gz, we will assume that it is a gzipped FastQ file. There is additionally some limited dynamic typing: if you try to un-gzip a file that is not actually compressed, then gzip is almost certain to detect this and fail with an error message.
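
The two mechanisms can be put side by side in a few lines of Python. The function names here are hypothetical; the only fact relied upon is that gzip streams start with the two bytes 0x1f 0x8b.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b'\x1f\x8b'

def claims_gzip(path):
    """'Hungarian notation': the type that the file name claims."""
    return path.endswith('.gz')

def is_gzip(path):
    """Dynamic check: look at the first two bytes of the content."""
    with open(path, 'rb') as f:
        return f.read(2) == GZIP_MAGIC

def check_gzip_name(path):
    if claims_gzip(path) != is_gzip(path):
        raise TypeError(f'{path}: extension and content disagree')

with tempfile.TemporaryDirectory() as d:
    good = os.path.join(d, 'data.fq.gz')
    with gzip.open(good, 'wt') as out:
        out.write('@read1\nACGT\n+\nIIII\n')
    # Same name convention, but the content was never compressed
    bad = os.path.join(d, 'fake.fq.gz')
    with open(bad, 'wt') as out:
        out.write('@read1\nACGT\n+\nIIII\n')
    check_gzip_name(good)  # passes silently
    try:
        check_gzip_name(bad)
    except TypeError as e:
        print('rejected:', e)
```

Note that the check says nothing about whether the decompressed content is really FastQ: the name still carries most of the type information.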

However, for any other type of programming, we would never consider this an acceptable level of typing: semi-defined Hungarian notation and occasional dynamic checks is not a combination proposed by any modern programming language. And, thus, the question is: could we have typed pipelines, where the types correspond to file types?

When I think of the pros and cons of this idea, I constantly see that it is simply a rehash of the discussion of types in programming languages.

Advantages of typed pipelines

  1. Better error checking. This is the big argument for types in programming languages: it’s a safety net.
  2. More automated type-transformations. We call this casting in traditional programming languages, to transform one type into another. In pipelines, this could correspond to automatically compressing/decompressing files, for example. Some tools support this already: just pass in a file with the extension .gz and it will be uncompressed on the fly, but not all do and it’s not universal (gzip is widely supported, but not universally so and it is hard to keep track of which tools support bzip2). In regular programming languages, we routinely debate how much casting should be automatic and how much needs to be made explicit.
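
A minimal sketch of such an automatic cast in Python, assuming that the file extension is trusted to give the compression type (the names open_text and OPENERS are made up for this example):

```python
import bz2
import gzip
import os
import tempfile

# Map from file extension to the function that opens that kind of file
OPENERS = {'.gz': gzip.open, '.bz2': bz2.open}

def open_text(path):
    """Open path for reading text, decompressing based on the extension."""
    for ext, opener in OPENERS.items():
        if path.endswith(ext):
            return opener(path, 'rt')
    return open(path, 'rt')

def count(path):
    """Count lines, whatever the compression of the input."""
    with open_text(path) as f:
        return sum(1 for _ in f)

with tempfile.TemporaryDirectory() as d:
    writers = [('a.txt', open), ('a.txt.gz', gzip.open), ('a.txt.bz2', bz2.open)]
    for name, opener in writers:
        with opener(os.path.join(d, name), 'wt') as out:
            out.write('one\ntwo\nthree\n')
    counts = [count(os.path.join(d, name)) for name, _ in writers]
print(counts)
```

The caller of count never mentions compression, which is the implicit-cast end of the design space; an explicit version would require the caller to name the decompression step.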

Disadvantages

  1. It’s more work, at least at the beginning. I still believe that it pays off in the long run, but you do require a bit more investment at the onset. In most programming languages, the debate about typing is dead: typing won. However, there is still debate on static vs dynamic typing, which hinges on this idea of more work now for less work later.
  2. False sense of security as there will still be bugs and slight mismatches in file types (e.g., you can imagine a pipeline that takes an image as an input and can accept TIFF files, but actually fails in one of the 1,000s of subtypes of TIFF files out there as TIFF is a complex format). This is another mirror of some debates in programming languages: what if you have a function that accepts integers, but only integers between 0 and 100, is it beneficial to have a complex type system that guarantees this?
  3. Sometimes, it may just be the right thing to do to pass one type as another type and a language could be too restrictive. I can imagine situations where it could make sense to treat a FastQ file as a columnar file to extract a particular element from the headers. Perhaps our language needs an escape hatch from typing. Some languages, such as C, provide such escape hatches by allowing you to treat everything as Bytes if you wish to. Others do not (in fact, many do so indirectly by allowing you to call C code for all the dangerous stuff).

I am still not sure what a typed pipeline would look like exactly, but I increasingly see this as the question behind the research programme that we started with NGLess (see manuscript), although I had started thinking about this before, with Jug (https://jug.readthedocs.io/en/latest/).

Update (July 6 2019): After the initial publication of this post, I was pointed to Bioshake, which seems to be a very interesting attempt to go down this path.

Update II (July 10 2019): A few more pointers: janis is a Python-library for workflow definition (an EDSL) which includes types for bioinformatics. Galaxy also includes types for file data.

NG-meta-profiler & NGLess paper published

(I wanted to write about this earlier, but June was a crazy month with manuscript submissions, grant submissions, and a lot of travel.)

The first NGLess manuscript was finally published. See this twitter thread for a summary, which I will not rehash here as I have already written extensively about NGLess and the ideas behind it here.

I also wrote a Nature Microbiology Community blogpost with some of the history behind the tool, emphasizing again how long it takes to get to a robust tool.

Compared to the preprint, the major change is that (in response to reviewer comments) we enhanced the benchmarking section. Profiling metagenomics tools is difficult. If you use an in silico simulation, you need a realistic distribution as a basis. In our case, we used real data to define the species distribution (using mOTUs2, see https://motu-tool.org/). To obtain the simulated metagenomes, we simulated reads from sequenced genomes according to the real data distribution. There are limitations to this approach, in that we still miss a lot of the complexity of real samples, but we cannot simulate the true unknown. In the end, the functional profiles produced by NG-meta-profiler had high correlations with the ground truth (0.88 for the human gut, 0.82 for the marine environment, Spearman correlation).
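
For readers unfamiliar with the measure, Spearman correlation is the Pearson correlation of the ranks. The following pure-Python sketch shows the computation; the abundance values are invented for illustration and are not data from the paper.

```python
def ranks(values):
    """Ranks starting at 1, with ties receiving their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Find the run of equal values starting at position i
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def spearman(xs, ys):
    return pearson(ranks(xs), ranks(ys))

# Made-up abundances for five functional categories
truth = [10, 20, 30, 40, 50]
estimated = [12, 33, 18, 39, 70]
rho = spearman(truth, estimated)
print(round(rho, 2))
```

Because only ranks matter, the large overestimate of the last category does not lower the score, while swapping the order of the second and third categories does.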

Additionally, we included a more explicit discussion of the advantages of NGLess for developing tools like NG-meta-profiler, in the Pipeline design with NGLess section.

NIXML: nix + YAML for easy reproducible environments

The rise and fall of bioconda

A year ago, I remember a conversation which went basically like this:

Them: So, to distribute my package, what do you think I should use?

Me: You should use bioconda.

Them: OK, that’s interesting, but what about …?

Me: No, you should use bioconda.

Them: I will definitely look into it, but maybe it doesn’t fit my package and maybe I will just …

Me: No, you should use bioconda.

That was a year ago. Mind you, I knew there were some issues with conda, but it had a decent UX (user-experience), and more importantly, the community was growing.

Since then, conda has hit scalability concerns, which means that running it is increasingly frustrating: it is slow (an acknowledged issue, but I have had multiple instances of waiting 20 minutes for an error message, which didn’t even help me solve the problem); mysterious errors are not uncommon; and things that used to work now fail (I have had this more and more recently).

Thus, I no longer recommend bioconda so enthusiastically. What before seemed like some esoteric concerns about guaranteed correctness are now biting us.

The nix model

nix is a package manager (and the basis of the NixOS Linux distribution) with a focus on declarative, reproducible builds.

You write a little file (often called default.nix) which describes exactly what you want and the environment is generated from this, exactly the same each time. It has a lot going for it in terms of potential for science:

  1. Can be reproducible to a fault (Byte-for-Byte reproducibility, almost).
  2. Declarative means that the best practice of storing your environment for later use is very easy to implement [1].

Unfortunately, the UX of nix is not great and making the environments reproducible, although possible, is not so trivial (although it is now much easier). Nix is very powerful, but it uses a complicated domain-specific language and a semi-documented, ever-evolving set of build conventions, which makes it hard for even experienced users to use it directly. There is no way that I can recommend it for general use.

The stack model

Stack is a tool for Haskell which uses the following concept for reproducible environments:

  1. The user specifies a list of packages that they want to use
  2. The user specifies a snapshot of the package directory.

The snapshot determines the versions of all of the packages, which automated testing has revealed to work together (at least up to the limits of unit testing). Furthermore, there is no need to say “version X.Y.Z of package A; version Q.R.S of package B,…”: you specify a single, globally encompassing version (note that this is one of the principles we adopted in NGLess, as we describe in the manuscript).

I really like this UX:

  • Want to update all your packages? just change this one number.
  • Didn’t work? just change it back: you are back where you started. This is the big advantage of declarative approaches: what you did before does not matter, only the current state of the project.
  • Want to recreate an environment? just use this easy to read text file (for technical reasons, two files, but you get the drift).
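
The snapshot idea can be modelled in a few lines of Python. This is a toy, not how stack is implemented: the snapshot names, the version numbers, and the resolve function are all invented for illustration.

```python
# Hypothetical snapshots: names and versions are invented for illustration
SNAPSHOTS = {
    'snapshot-1': {'numpy': '1.0', 'scipy': '1.0', 'matplotlib': '2.0'},
    'snapshot-2': {'numpy': '1.1', 'scipy': '1.2', 'matplotlib': '2.1'},
}

def resolve(project):
    """Return the exact version of every requested package."""
    versions = SNAPSHOTS[project['snapshot']]
    missing = [p for p in project['packages'] if p not in versions]
    if missing:
        raise KeyError(f'not in snapshot {project["snapshot"]}: {missing}')
    return {p: versions[p] for p in project['packages']}

project = {'snapshot': 'snapshot-1', 'packages': ['numpy', 'scipy']}
before = resolve(project)
after = resolve(dict(project, snapshot='snapshot-2'))  # update everything
again = resolve(dict(project, snapshot='snapshot-1'))  # and change it back
print(before, after, again == before)
```

The project never names a package version: a single identifier fixes all of them, and the result depends only on the current description, not on what was installed before.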

Enter NIXML

https://github.com/luispedro/nixml

This is an afternoon hack, but the idea is to combine nix’s power with stack’s UX by allowing you to specify a set of packages in nix, using YAML.

For example, start with this env.nlm file,

nixml: v0.0
snapshot: stable-19.03
packages:
  - lang: python
    version: 2
    modules:
      - numpy
      - scipy
      - matplotlib
      - mahotas
      - jupyter
      - scikitlearn
  - lang: nix
    modules:
      - vim

Now, running

nixml shell

returns a shell with the packages listed. Running

nixml shell --pure

returns a shell with only the packages listed, so you can be sure to not rely on external packages.

Internally, this just creates a nix file and runs it, but it adds the stack-like interface:

  1. it is always automatically pinned: you see the stable-19.03 thing? That means the version of these packages that was available in the stable branch in March 2019.
  2. the syntax is simple, no need to know about python2.withPackages or any other nix internals like that. This means a loss of power for the user, but it will be a better trade-off 99% of the time.
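
To give a flavour of the translation step, here is a sketch in Python that renders such a description as a nix shell expression. It is not NIXML's actual code or output: the to_nix function is hypothetical, the environment is given as a plain dictionary rather than parsed from YAML, and pinning to the snapshot is deliberately left out (the generated expression uses whatever <nixpkgs> resolves to).

```python
def to_nix(env):
    """Render an environment description as a nix shell expression."""
    inputs = []
    for block in env['packages']:
        modules = ' '.join(block['modules'])
        if block['lang'] == 'python':
            interpreter = f"python{block['version']}"
            inputs.append(
                f'({interpreter}.withPackages (ps: with ps; [ {modules} ]))')
        elif block['lang'] == 'nix':
            inputs.extend(block['modules'])
        else:
            raise ValueError(f"unsupported lang: {block['lang']}")
    # Assumption: a real tool would pin nixpkgs to the requested snapshot here
    lines = ['with import <nixpkgs> {};', 'mkShell {', '  buildInputs = [']
    lines.extend(f'    {i}' for i in inputs)
    lines.extend(['  ];', '}'])
    return '\n'.join(lines)

# Same structure as the example file above, with a shorter module list
env = {
    'snapshot': 'stable-19.03',
    'packages': [
        {'lang': 'python', 'version': 2, 'modules': ['numpy', 'scipy']},
        {'lang': 'nix', 'modules': ['vim']},
    ],
}
expression = to_nix(env)
print(expression)
```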

Very much a work in progress right now, but I am putting it out there as it is already usable for Python-based projects.


  1. There are two types of best practices advice: the type that, most people, once they try it out, adopt; and the type that you need to keep hammering into people’s heads. The second type should be seen as a failure of the tool: “best practices” are a user-experience smell.

NGLess preprint is up

We have posted a preprint describing NG-meta-profiler and NGLess in general:

NG-meta-profiler: fast processing of metagenomes using NGLess, a domain-specific language. Luis Pedro Coelho, Renato Alves, Paulo Monteiro, Jaime Huerta-Cepas, Ana Teresa Freitas, Peer Bork.

My initial goal was to develop a tool that (1) used a domain-specific language to describe computation and (2) was actually used in production. I did not want a proof-of-concept, as one of the major arguments for developing a domain-specific language (DSL) is that it is more usable than just doing a traditional library in another language. As I am skeptical that you can fully evaluate how good a tool is without long-term, real-world usage, I wanted NGLess to be used in my day-to-day research.

NGLess has been a long time cooking but is now a tool that we use every day to produce real results. In that sense, at least, our objectives have been achieved.

Now, we hope that others find it as useful as we do.

Why NGLess took so long to become a robust tool (but now IS a robust tool)

Titus Brown posted that good research software takes 2-3 years to produce. As we are close to submitting a manuscript for our own NGLess, which took a bit longer than that, I will add some examples of why it took so long to get to this stage.

There is a component of why it took so long that is due to people issues and to the fact that NGLess was mostly developed as we needed to process real data (and, while I was working on other projects, rather than on NGLess). But even if this had been someone’s full time project, it would have taken a long time to get to where it is today.

It does not take so long because there are so many Big Ideas in there (I wish). NGLess contains just one Big Idea: a domain-specific language that results in a tool that is not just a proof of concept but is a better tool because it uses a DSL; everything else follows from that.

Rather, what takes a long time is to find all the weird corner cases. Most of these are issues the majority of users will never encounter, but collectively they make the tool so much more robust. Here are some examples:

  • Around Feb 2017, a user reported that some samples would crash ngless. The user did not seem to be doing anything wrong, but half-way through the processing, memory usage would start growing until the interpreter crashed. It took me the better part of two days to realize that their input files were malformed: they consisted of a few million well-formed reads, then a multi-Gigabyte long series of zero Bytes. Their input FastQs were, in effect, a gzip bomb.

    There is a kind of open source developer that would reply to this situation by saying well, knuckle-head, don’t feed my perfect software your crappy data, but this is not the NGLess way (whose goal is to minimize the effort of real-life people), so we considered this a bug in NGLess and fixed it so that it now (correctly) complains of malformed input and exits.

  • Recently, we realized that if you use the motus module in a system with a badly working locale, ngless could crash. The reason is that, when using that module, we print out a reference for the paper, which includes some authors with non-ASCII characters in their names. Because of some weird combination of the Haskell runtime system and libiconv (which seems to generally be a mess), it crashes if the locale is not installed correctly.

    Again, there is a kind of developer who would respond to this by well, fix your locale installation, knuckle-head, but we added a workaround.

  • When I taught the first ngless workshop in late 2017, I realized that one of the inconsistencies in the language was causing a lot of confusion for the learners. So, the next release fixed that issue.
  • There are two variants of FastQ files, depending on whether the qualities are encoded by adding 33 or 64. It is generally trivial to infer which one is being used, though, so NGLess heuristically does so. In Feb 2017, a user reported that the heuristics were failing on one particular (well-formed) example, so we improved the heuristics.
  • There are 25 commits which say they produce “better error messages”. Most of these resulted from a confused debugging session.
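
As an illustration of the kind of heuristic involved in the FastQ encoding example, here is a simple Python version. It is not the heuristic that NGLess uses: the function name and the single threshold are assumptions, and the sketch also rejects characters that cannot be quality values (such as zero Bytes) with an explicit error.

```python
def guess_quality_offset(quality_lines):
    """Guess whether FastQ qualities are encoded with offset 33 or 64."""
    lowest = None
    for line in quality_lines:
        for ch in line:
            code = ord(ch)
            # Quality characters must be printable ASCII ('!' to '~')
            if not 33 <= code <= 126:
                raise ValueError(
                    f'malformed input: unexpected character {ch!r} in qualities')
            if lowest is None or code < lowest:
                lowest = code
    if lowest is None:
        raise ValueError('malformed input: no quality values found')
    # Characters below ';' (59) cannot occur in the offset-64 variants
    return 33 if lowest < 59 else 64

print(guess_quality_offset(['!!5:', 'IIII']))
print(guess_quality_offset(['hhhh', 'ffff']))
```

A well-formed offset-33 file in which every quality happens to be high (for example, only 'I' characters) would be misclassified as offset 64 by this rule, which is the sort of corner case that only shows up with enough real inputs.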

None of these issues took that long to fix, but they only emerge through a prolonged beta use period.

You need users to try all types of bad input files, you need to try to teach the tool to understand where the pain points for new users are, you need someone to try it out in a system with a mis-installed locale, &c.

One possible conclusion is that for certain kinds of scientific software, it is actually better if it is done as a side-project: you can keep publishing other stuff, you can apply it on several problems, and the long gestation period catches all these minor issues, even while you are being productive elsewhere. (This was also true of Jug: it was never really a project per se, but after a long time it became usable and its own paper).