Towards typed pipelines

In the beginning, there was the word. The word was a 16-bit variable and it could be used to store any type of information: you could treat it as an integer (signed or unsigned) or as a pointer. There was no typing. Because this was very error-prone, Hungarian notation was invented, which was a system whereby the name of a variable was enhanced to contain the type that the programmer intended, so that piCount was a pointer to integer named Count.

Nowadays, all languages are typed, either at compile-time (static), at run-time (dynamic), or a mix of the two, and Hungarian notation is not used.

There are often big fights on whether static or dynamic typing is best (329 million Google hits for “is static or dynamic typing better”), but typing itself is uncontroversial. In fact, I think most programmers would find the idea of using an untyped programming language absurd.

Yet, there is one domain where untyped programming is still widely used, namely when writing pipelines that combine multiple programmes. In that domain, there is one type, the stream of Bytes. The stream of Bytes is one of the most successful types in programming. In fact, “everything is a file” (which is snappier than “everything is a stream of Bytes”, even though that is what it means) is often considered one of the defining features of Unix.

Like “B” programmes of old (the untyped programming language that came before “C” introduced all the type safety that is its defining characteristic), most pipelines use Hungarian notation to mark file types. For example, if we have a file called data.fq.gz, we will assume that it is a gzipped FastQ file. There is additionally some limited dynamic typing: if you try to un-gzip a file that is not actually compressed, then gzip is almost certain to detect this and fail with an error message.
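To make the contrast concrete, here is a minimal sketch of comparing the “Hungarian” claim in a file name with a dynamic check of the content. The only fact it relies on is that gzip streams start with the two bytes 0x1f 0x8b; the function names and the toy files are made up for illustration.

```python
import gzip
import os
import tempfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def looks_gzipped(path):
    """Return True if the file content starts with the gzip magic number."""
    with open(path, "rb") as handle:
        return handle.read(2) == GZIP_MAGIC


def name_matches_content(path):
    """Compare what the file name claims with what the content says."""
    return path.endswith(".gz") == looks_gzipped(path)


def demo():
    """Build one honest and one mislabelled file and check both."""
    record = b"@read1\nACGT\n+\nIIII\n"
    with tempfile.TemporaryDirectory() as tmp:
        honest = os.path.join(tmp, "data.fq.gz")
        with gzip.open(honest, "wb") as out:
            out.write(record)
        # Same name pattern, but the content was never compressed.
        mislabelled = os.path.join(tmp, "other.fq.gz")
        with open(mislabelled, "wb") as out:
            out.write(record)
        return name_matches_content(honest), name_matches_content(mislabelled)
```

Note that this only checks the outer layer: nothing here verifies that what is inside the gzip stream is really FastQ.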

However, for any other type of programming, we would never consider this an acceptable level of typing: semi-defined Hungarian notation and occasional dynamic checks is not a combination proposed by any modern programming language. And, thus, the question is: could we have typed pipelines, where the types correspond to file types?
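One way to picture the question is a sketch like the following, where each file type is a small Python class and each pipeline step declares what it consumes and produces. All names here (FastQ, GzippedFastQ, Bam, stage, decompress, align) are hypothetical and are not taken from any existing tool; the steps do no real work and only track types.

```python
import functools
from dataclasses import dataclass


@dataclass(frozen=True)
class FastQ:
    path: str


@dataclass(frozen=True)
class GzippedFastQ:
    path: str


@dataclass(frozen=True)
class Bam:
    path: str


def stage(input_type, output_type):
    """Decorator that enforces the declared file types of a pipeline step."""
    def wrap(func):
        @functools.wraps(func)
        def checked(value):
            if not isinstance(value, input_type):
                raise TypeError(
                    f"{func.__name__} expects {input_type.__name__}, "
                    f"got {type(value).__name__}")
            result = func(value)
            if not isinstance(result, output_type):
                raise TypeError(
                    f"{func.__name__} should return {output_type.__name__}")
            return result
        return checked
    return wrap


@stage(GzippedFastQ, FastQ)
def decompress(reads):
    # A real step would run gzip here; the sketch only tracks the types.
    return FastQ(reads.path[:-len(".gz")])


@stage(FastQ, Bam)
def align(reads):
    # Placeholder for calling an aligner.
    return Bam(reads.path + ".bam")
```

With this, `align(decompress(GzippedFastQ("data.fq.gz")))` goes through, while `align(GzippedFastQ("data.fq.gz"))` raises a TypeError before any tool is run. This is still dynamic checking; a static version would reject the second call before the pipeline starts at all.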

When I think of the pros and cons of this idea, I constantly see that it is simply a rehash of the discussion of types in programming languages.

Advantages of typed pipelines

  1. Better error checking. This is the big argument for types in programming languages: it’s a safety net.
  2. More automated type-transformations. In traditional programming languages, transforming one type into another is called casting. In pipelines, this could correspond to automatically compressing/decompressing files, for example. Some tools support this already: just pass in a file with the extension .gz and it will be uncompressed on the fly, but not all do and it’s not universal (gzip is widely supported, but not universally so and it is hard to keep track of which tools support bzip2). In regular programming languages, we routinely debate how much casting should be automatic and how much needs to be made explicit.
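The automatic “cast” from the second point can be sketched in a few lines of Python using only the standard library. The helper name open_text and the choice to dispatch on the file extension are my assumptions for the example; a typed pipeline could instead dispatch on the declared type.

```python
import bz2
import gzip
import os
import tempfile

# Map from extension to the function that opens that kind of file.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}


def open_text(path):
    """Open a possibly compressed text file, picking the reader by extension."""
    _, ext = os.path.splitext(path)
    opener = OPENERS.get(ext, open)  # fall back to a plain open()
    return opener(path, "rt", encoding="utf-8")


def demo():
    """Write the same record plain, gzipped and bzip2-ed, then read all three."""
    content = "@read1\nACGT\n+\nIIII\n"
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        for ext, writer in [("", open), (".gz", gzip.open), (".bz2", bz2.open)]:
            path = os.path.join(tmp, "data.fq" + ext)
            with writer(path, "wt", encoding="utf-8") as out:
                out.write(content)
            with open_text(path) as handle:
                results.append(handle.read())
    return results
```

Downstream code reads the same text in all three cases and never needs to know which compression, if any, was used.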

Disadvantages

  1. It’s more work, at least at the beginning. I still believe that “it pays off in the long run” is true, but you do require a bit more investment at the outset. In most programming languages, the debate about typing is dead: typing won. However, there is still debate on static vs dynamic typing, which hinges on this idea of more work now for less work later.
  2. False sense of security as there will still be bugs and slight mismatches in file types (e.g., you can imagine a pipeline that takes an image as an input and can accept TIFF files, but actually fails in one of the 1,000s of subtypes of TIFF files out there as TIFF is a complex format). This is another mirror of some debates in programming languages: what if you have a function that accepts integers, but only integers between 0 and 100, is it beneficial to have a complex type system that guarantees this?
  3. Sometimes, it may just be the right thing to do to pass one type as another type and a language could be too restrictive. I can imagine situations where it could make sense to treat a FastQ file as a columnar file to extract a particular element from the headers. Perhaps our language needs an escape hatch from typing. Some languages, such as C, provide such escape hatches by allowing you to treat everything as Bytes if you wish to. Others do not (in fact, many do so indirectly by allowing you to call C code for all the dangerous stuff).
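The “integers between 0 and 100” example from the second point can at least be approximated with a wrapper that validates on construction. This is only a sketch with hypothetical names (Percentage, describe), and it is a run-time check: having the type system guarantee the range statically would need something like refinement types, which is exactly the kind of extra complexity under debate.

```python
class Percentage:
    """An integer that is checked to lie between 0 and 100 inclusive."""

    def __init__(self, value):
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError("Percentage needs an int")
        if not 0 <= value <= 100:
            raise ValueError(f"{value} is outside 0..100")
        self.value = value


def describe(share):
    """Meant to receive a Percentage, so it does not repeat the range check."""
    return f"{share.value}% of reads mapped"
```

Here `describe(Percentage(42))` works, while `Percentage(101)` fails with a ValueError at the point where the bad value is created rather than somewhere downstream.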

I am still not sure what a typed pipeline would look like exactly, but I increasingly see this as the question behind the research programme that we started with NGLess (see manuscript), although I had started thinking about this before, with Jug (https://jug.readthedocs.io/en/latest/).

Update (July 6 2019): After the initial publication of this post, I was pointed to Bioshake, which seems to be a very interesting attempt to go down this path.

Update II (July 10 2019): A few more pointers: janis is a Python library for workflow definition (an EDSL) which includes types for bioinformatics. Galaxy also includes types for file data.

NG-meta-profiler & NGLess paper published

(I wanted to write about this earlier, but June was a crazy month with manuscript submissions, grant submissions, and a lot of travel.)

The first NGLess manuscript was finally published. See this Twitter thread for a summary, which I will not rehash here as I have already written extensively about NGLess and the ideas behind it here.

I also wrote a Nature Microbiology Community blogpost with some of the history behind the tool, emphasizing again how long it takes to get to a robust tool.

Compared to the preprint, the major change is that (in response to reviewer comments) we enhanced the benchmarking section. Profiling metagenomics tools is difficult. If you use an in silico simulation, you need a realistic distribution as a basis. In our case, we used real data to define the species distribution (using mOTUs2, see https://motu-tool.org/). To obtain the simulated metagenomes, we simulated reads from sequenced genomes according to the real data distribution. There are limitations to this approach, in that we still miss a lot of the complexity of real samples, but we cannot simulate the true unknown. In the end, the functional profiles produced by NG-meta-profiler had high correlations with the ground truth (0.88 for the human gut, 0.82 for the marine environment, Spearman correlation).
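For readers less familiar with the metric: Spearman correlation is the Pearson correlation computed on ranks, so it measures whether the estimated profile orders the features the same way as the ground truth. The sketch below is a plain-Python illustration with made-up numbers; it is not the benchmarking code from the paper and the values are not from our data.

```python
def average_ranks(values):
    """Rank values starting at 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy abundances (invented): ground truth vs. an estimate that swaps two pairs.
truth = [1, 2, 3, 4, 5]
estimate = [2, 1, 4, 3, 5]
```

For these toy vectors, `spearman(truth, estimate)` comes out at 0.8: two adjacent pairs are swapped, but the overall ordering is largely preserved.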

Additionally, we included a more explicit discussion of the advantages of NGLess for developing tools like NG-meta-profiler, in the Pipeline design with NGLess section.