Towards typed pipelines

In the beginning, there was the word. The word was a 16-bit variable and it could be used to store any type of information: you could treat it as an integer (signed or unsigned) or as a pointer. There was no typing. Because this was very error-prone, Hungarian notation was invented: a system whereby the name of a variable was enhanced to encode the type that the programmer intended, so that piCount was a pointer to an integer named Count.

Nowadays, all languages are typed, either at compile-time (static), at run-time (dynamic), or a mix of the two, and Hungarian notation is no longer used.

There are often big fights over whether static or dynamic typing is best (329 million Google hits for “is static or dynamic typing better”), but typing itself is uncontroversial. In fact, I think most programmers would find the idea of using an untyped programming language absurd.

Yet, there is one domain where untyped programming is still widely used, namely when writing pipelines that combine multiple programmes. In that domain, there is a single type: the stream of Bytes. The stream of Bytes is one of the most successful types in programming. In fact, “everything is a file” (which is snappier than “everything is a stream of Bytes”, even though that is what it means) is often considered one of the defining features of Unix.

Like “B” programmes of old (the untyped programming language that came before “C” introduced all the type safety that is its defining characteristic), most pipelines use Hungarian notation to mark file types. For example, if we have a file called data.fq.gz, we will assume that it is a gzipped FastQ file. There is additionally some limited dynamic typing: if you try to un-gzip a file that is not actually compressed, then gzip is almost certain to detect this and fail with an error message.

However, in any other type of programming, we would never consider this an acceptable level of typing: loosely defined Hungarian notation plus occasional dynamic checks is not a combination that any modern programming language would propose. And, thus, the question is: could we have typed pipelines, where the types correspond to file types?
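
As a thought experiment, here is a minimal sketch of what declaring file types could look like. Everything in it is hypothetical (the FastQ class and trim_reads are made up for illustration, not the API of any existing tool); the only point is that the type is checked once, up front, instead of a tool failing halfway through the pipeline:

```python
# Hypothetical sketch: a declared file type with a cheap dynamic check.
import gzip
from pathlib import Path

class FastQ:
    """A FastQ file; construction verifies that the content looks like FastQ."""
    def __init__(self, path):
        self.path = Path(path)
        opener = gzip.open if self.path.suffix == '.gz' else open
        with opener(self.path, 'rt') as f:
            first = f.readline()
        if not first.startswith('@'):
            raise TypeError(f'{path} does not look like a FastQ file')

def trim_reads(reads: FastQ) -> FastQ:
    """A pipeline step whose signature documents (and could enforce) its file types."""
    ...  # the actual trimming tool would be invoked here

reads = FastQ('data.fq.gz')   # fails early if the file is mislabelled
trimmed = trim_reads(reads)
```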

When I think of the pros and cons of this idea, I constantly see that it is simply a rehash of the discussion of types in programming languages.

Advantages of typed pipelines

  1. Better error checking. This is the big argument for types in programming languages: it’s a safety net.
  2. More automated type transformations. In traditional programming languages, transforming one type into another is called casting. In pipelines, this could correspond to automatically compressing/decompressing files, for example. Some tools support this already: just pass in a file with the extension .gz and it will be decompressed on the fly. But support is far from universal (gzip is widely, though not universally, supported, and it is hard to keep track of which tools also accept bzip2); a sketch of what such an automatic cast could look like follows this list. In regular programming languages, we routinely debate how much casting should be automatic and how much needs to be made explicit.
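
Here is what such an automatic cast could look like, sketched under the assumption that dispatching on magic bytes (rather than on the Hungarian file extension) is acceptable; open_maybe_compressed is a made-up helper, not part of any real tool:

```python
import gzip
import bz2

def open_maybe_compressed(path):
    """Open a file as text, transparently 'casting' gzip or bzip2 to plain text.

    Dispatches on the file's magic bytes rather than on its extension."""
    with open(path, 'rb') as f:
        magic = f.read(3)
    if magic[:2] == b'\x1f\x8b':      # gzip magic bytes
        return gzip.open(path, 'rt')
    if magic == b'BZh':               # bzip2 magic bytes
        return bz2.open(path, 'rt')
    return open(path, 'rt')
```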

Disadvantages

  1. It’s more work, at least at the beginning. I still believe that it pays off in the long run, but it does require a bit more investment at the onset. In most programming languages, the debate about typing is dead: typing won. However, there is still debate on static vs. dynamic typing, which hinges on exactly this trade-off of more work now for less work later.
  2. False sense of security: there will still be bugs and slight mismatches in file types (e.g., you can imagine a pipeline that takes an image as an input and can accept TIFF files, but actually fails on one of the thousands of subtypes of TIFF files out there, as TIFF is a complex format). This is another mirror of a debate in programming languages: if you have a function that accepts integers, but only integers between 0 and 100, is it beneficial to have a type system complex enough to guarantee this?
  3. Sometimes, passing one type off as another is exactly the right thing to do, and a typed language could be too restrictive. I can imagine situations where it could make sense to treat a FastQ file as a columnar file to extract a particular element from the headers (see the sketch after this list). Perhaps our language needs an escape hatch from typing. Some languages, such as C, provide such escape hatches by allowing you to treat everything as Bytes if you wish to. Others do not (although many provide one indirectly by allowing you to call C code for all the dangerous stuff).
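
The escape hatch from point 3, sketched in Python: this assumes Illumina-style colon-separated headers (which not every FastQ file has) and deliberately ignores any notion of FastQ as a type, treating the file as plain lines:

```python
import gzip

def lanes_from_fastq(path):
    """Treat a FastQ file as a columnar text file: pull the lane number
    out of each Illumina-style header (field 4 of the colon-separated ID)."""
    opener = gzip.open if path.endswith('.gz') else open
    with opener(path, 'rt') as f:
        for i, line in enumerate(f):
            if i % 4 == 0:                     # header lines
                yield line.split(':')[3]

# e.g., set(lanes_from_fastq('data.fq.gz')) lists the lanes present in the file
```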

I am still not sure what a typed pipeline would look like exactly, but I increasingly see this as the question behind the research programme that we started with NGLess (see manuscript), although I had started thinking about this before, with Jug (https://jug.readthedocs.io/en/latest/).

Update (July 6 2019): After the initial publication of this post, I was pointed to Bioshake, which seems to be a very interesting attempt to go down this path.

Update II (July 10 2019): A few more pointers: janis is a Python library for workflow definition (an EDSL) which includes types for bioinformatics. Galaxy also includes types for file data.

NG-meta-profiler & NGLess paper published

(I wanted to write about this earlier, but June was a crazy month with manuscript submissions, grant submissions, and a lot of travel.)

The first NGLess manuscript was finally published. See this twitter thread for a summary, which I will not rehash here as I have already written extensively about NGLess and the ideas behind it on this blog.

I also wrote a Nature Microbiology Community blogpost with some of the history behind the tool, emphasizing again how long it takes to get to a robust tool.

Compared to the preprint, the major change is that (in response to reviewer comments) we enhanced the benchmarking section. Benchmarking metagenomics tools is difficult. If you use an in silico simulation, you need a realistic distribution as a basis. In our case, we used real data to define the species distribution (using mOTUs2, see https://motu-tool.org/). To obtain the simulated metagenomes, we simulated reads from sequenced genomes according to that real data distribution. There are limitations to this approach, in that we still miss a lot of the complexity of real samples, but we cannot simulate the true unknown. In the end, the functional profiles produced by NG-meta-profiler had high correlations with the ground truth (0.88 for the human gut, 0.82 for the marine environment, Spearman correlation).
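
For the comparison step itself (a sketch for illustration only, not the actual benchmarking code from the paper), this boils down to a Spearman correlation between the inferred profile and the simulated ground truth:

```python
import pandas as pd
from scipy.stats import spearmanr

def profile_correlation(truth: pd.Series, inferred: pd.Series) -> float:
    """Spearman correlation between a ground-truth functional profile and an
    inferred one, over the union of features (missing features count as zero)."""
    features = truth.index.union(inferred.index)
    rho, _pvalue = spearmanr(truth.reindex(features, fill_value=0),
                             inferred.reindex(features, fill_value=0))
    return rho
```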

Additionally, we included a more explicit discussion of the advantages of NGLess for developing tools like NG-meta-profiler, in the “Pipeline design with NGLess” section.

Thoughts on “Revisiting authorship, and JOSS software publications”

This is a direct response to Titus’ post: Revisiting authorship, and JOSS software publications. In fact, I started writing it as a comment there and it became so long that I decided it was better as its own post.

I appreciate the tone of Titus’ post, more asking questions than answering them, so here are my two or three cents:

There is nothing special about software papers. Different fields have different criteria for what constitutes authorship. I believe that particle physics has an approach close to “everyone who committed to the git repo gets to be an author”, which leads to papers with >1000 authors (note that I am not myself a particle physicist or anything close to it). At that point, the currency of authorship and citations is diluted to the point where I seriously don’t know what the point is. What is maybe special about software papers is that, because they are newer, there hasn’t been time for a rough consensus to emerge on what the criteria should be (I guess this is done in part in discussions like these). Having said that, even in well-established fields, you still have issues that are never resolved (in molecular biology: the question of whether technicians should be listed as authors brings in strong opinions on both sides).

My position is against the idea that every positive contribution deserves authorship. Some positive contributions are significant enough that they deserve authorship, others an acknowledgement, and others can even go unmentioned. Yes, it’s a judgement call what is “significant” and what is not, but I think someone who, for example, reports a bug that leads to a bugfix is a clear NO (even if it’s a good bug report). Even small code contributions should not lead to authorship (an acknowledgement is perhaps where my decision boundary sits at that point). People who proofread a manuscript also don’t get authorship even if they do find a few typos or suggest a few wording changes.

Of course, contributions need not be code. Tutorials, design, &c, all count. But they should be significant. I also would not consider adding as an author someone who asked a good question during a seminar on the work, even though those sometimes turn out to be like good bug reports in that you improve the work based on them. The fact that significance is a judgement call does not imply that we should drop significance as a criterion.

I think authorship is also about responsibility. If you are an author, then you must take responsibility for some part of the work (naturally, not all of it, but some of it). If there are issues later, it is your responsibility to, at the very least, explain what you did or even to fix it, &c. You should be involved in the paper writing and if, for example, some work is needed during revision on that particular aspect of the code, you need to do it.

From my side, I have submitted several patches to projects which were best-efforts at the time, but I don’t want to take any responsibility for the project beyond that. If the author of one of those projects now told me that I needed to redo that to work on Mac OS X because one of the reviewers complained about it, I’d tell them “sorry, I cannot help you”. I don’t think authors should get to do that.

I would get off-topic here, but I also wish there was more of an explicit expectation that if you publish a software paper, you shall provide minimal maintenance for 5-10 years. Too often software papers are the obituary of the project rather than the announcement.

My answer to “Another question: does authorship keep accruing over versions? Should all the authors on sourmash 2.0 be authors on sourmash 3.0?” is a strong NO. You don’t double dip. If anything, I think it’s generally the authors of version 2 that often lose out as people benefit from their work and keep citing version 1.

Finally, a question of my own: what if someone does something outside the project that clearly benefits the project, should they be authors? For example, what if someone does a bioconda package for your code? Or writes an excellent blogpost/tutorial that brings you a very large number of users (or runs a tutorial on it)? Should they be authors? My answer is that it first needs to be significant, and it should not be automatic. It may be appropriate to invite them for authorship, but they should then commit to (at the very least) reading the manuscript and keeping up with the development of the project over the medium-term.

The European Court of Justice’s decision that CRISPR’d plants are GMOs is the right interpretation of a law that is bonkers

The European Court of Justice (think of it as the European Supreme Court) declared that CRISPR’d plants count as GMOs.

I think the Court is correct: CRISPR’d plants are GMOs. The EU does not have a tradition of “legislation by judicial decision” like the US’s Supreme Court (although there have been some instances of such, as in the Uber case). Thus, even though I wish the decision had gone the other way as a matter of legislation, as a matter of legal interpretation it seems clear that the intent of the law was to ban modern biotechnology as scary, and I don’t see how CRISPR does not fit that description.

The decision is scientifically bonkers, in that it says that older atomic gardening plants are kosher, but the exact same organism would be illegal if it were obtained by bioengineering methods. According to this decision, you can use CRISPR to obtain and test a mutation. At this point, it’s a GMO, so you cannot sell it in most of Europe. Then you use atomic gardening, PCR, and cross-breeding and you obtain exactly the same genotype. However, now, it’s not a GMO, so it’s fine to sell it. Being a GMO is thus not a property of the plant, but a property of its history. Some plants may carry markers which will identify them as GMOs, but there may be many cases where you have two identical plants, only one of which is a GMO. This has irked some scientists (see this NYT article), but frankly, it is the original GMO law that is bonkers in that it regulates the method by which a plant is obtained instead of regulating the end result.

On this one, blame the lawmakers, not the court.

HT/ @PhilippBayer

NGLess preprint is up

We have posted a preprint describing NG-meta-profiler and NGLess in general:

NG-meta-profiler: fast processing of metagenomes using NGLess, a domain-specific language. Luis Pedro Coelho, Renato Alves, Paulo Monteiro, Jaime Huerta-Cepas, Ana Teresa Freitas, Peer Bork

My initial goal was to develop a tool that (1) used a domain-specific language to describe the computation and (2) was actually used in production. I did not want a proof-of-concept, as one of the major arguments for developing a domain-specific language (DSL) is that it is more usable than just writing a traditional library in another language. As I am skeptical that you can fully evaluate how good a tool is without long-term, real-world usage, I wanted NGLess to be used in my day-to-day research.

NGLess has been a long time cooking, but it is now a tool that we use every day to produce real results. In that sense, at least, our objectives have been achieved.

Now, we hope that others find it as useful as we do.

Quick followups: NGLess benchmark & Notebooks as papers

A quick follow-up on two earlier posts:

We finalized the benchmark for ngless that I had discussed earlier:

As you can see, NGLess performs much better than either MOCAT or htseq-count. We tried to use featureCounts too, but that completely failed to produce results for some of the samples (we gave it a whopping 1TB of RAM, but it used it all up before crashing).

It also reveals that although ngless was developed in the context of our metagenomics work, it would likely also do well on the type of problems for which htseq-count is currently used, in the domain of RNA-seq.

§

Earlier, I also wrote skeptically about the idea of replacing papers with Jupyter notebooks:

Is it even a good idea to have the presentation of the results mixed with their computation?

I do see the value in companion Jupyter notebooks for many cases, but as a replacement for the main paper, I am not even sure it is a good idea.

There is a lot of accidental complexity in code. A script that generates a publication plot may easily have 50 lines that do nothing more than set up the plot just right: (1) set up the subplots, (2) set x- and y-labels, (3) fix colours, (4) scale the points, (5) reset the font sizes, &c. What value is there in keeping all of this in the main presentation of the results?
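
To make that concrete, here is a generic illustration (not the actual plot-comparison.py script discussed below; the function, data, and labels are invented) where almost every line is presentation rather than computation:

```python
import matplotlib.pyplot as plt

def plot_benchmark(results_a, results_b, labels):
    """Toy example: nearly every line below is plot set-up, not analysis."""
    fig, (ax0, ax1) = plt.subplots(1, 2, figsize=(8, 3), sharey=True)  # (1) subplots
    for ax, results, title in [(ax0, results_a, 'Dataset A'),
                               (ax1, results_b, 'Dataset B')]:
        ax.bar(range(len(results)), results, color='#2b8cbe')          # (3) colours
        ax.set_xticks(range(len(labels)))
        ax.set_xticklabels(labels, rotation=45, fontsize=8)            # (5) font sizes
        ax.set_xlabel('Tool')                                          # (2) x-label
        ax.set_title(title)
    ax0.set_ylabel('Time (arbitrary units)')                           # (2) y-label
    fig.tight_layout()
    return fig
```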

The little script that generates the plot above is an excellent example of this. It is available online (on github: plot-comparison.py). It’s over 100 lines and, even then, the final result required some minor aesthetic manipulations in inkscape (so that, if you run it, the result is slightly uglier: in particular, the legend is absent).

Would it really add anything to the presentation of the manuscript to have those 100 lines of code be intermingled with the presentation of ngless as a metagenomics profiler?

In this case, I am confident that a Jupyter notebook would be worse than the current solution of a PDF as a main presentation with the data table and plotting scripts as supplemental material.

Why does natural evolution use genetic algorithms when it’s not a very good optimization method?

[Epistemic status: waiting for jobs to finish on the cluster, so I am writing down random ideas. Take it as speculation.]

Genetic algorithms are a pretty cool idea: when you have an objective function, you can optimize it by starting with some random guesses and then using mutation and cross-over to improve them. At each generation, you keep the most promising candidates and eventually you converge on a good solution. Just like evolution does.
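
For concreteness, here is a toy version of that loop (maximizing an arbitrary objective over bit strings; all parameters are arbitrary and nothing here is tuned for any real application):

```python
import random

def genetic_algorithm(objective, genome_length=50, pop_size=100, generations=200):
    """Toy genetic algorithm: random initial guesses, then mutation and
    cross-over, keeping the most promising candidates each generation."""
    pop = [[random.randint(0, 1) for _ in range(genome_length)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=objective, reverse=True)
        survivors = pop[:pop_size // 2]                      # keep the best half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(genome_length)            # cross-over point
            child = a[:cut] + b[cut:]
            if random.random() < 0.1:                        # occasional mutation
                pos = random.randrange(genome_length)
                child[pos] = 1 - child[pos]
            children.append(child)
        pop = survivors + children
    return max(pop, key=objective)

# e.g., maximize the number of ones: genetic_algorithm(sum)
```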

Unfortunately, in practice, this idea does not pan out: genetic algorithms are not that effective. In fact, I am not aware of any general problem area where they are considered the best option. For example, in machine learning, the best methods tend to be variations of stochastic gradient descent.

And, yet, evolution uses genetic algorithms. Why doesn’t evolution use stochastic gradient descent or something better than genetic algorithms?

What would evolutionary gradient descent even look like?

First of all, let’s assure ourselves that we are not just using vague analogies to pretend we have deep thoughts. Evolutionary gradient descent is at least conceptually possible.

To be able to do gradient descent, a reproducing bacterium would need two pieces of information to compare itself to its mother cell: (1) how it differs in genotype and (2) how much better it is doing than its parent. Here is one possible implementation of this idea: (1) tag the genome where it differs from the mother cell (epigenetics!) and (2) keep some memory of how fast the mother could grow. When reproducing, if we are performing better than our mother, introduce more mutations in the regions where we differ from her.
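
Rendered as code, purely as speculation, one division step might look something like this (the fitness function, the mutation rate, and the boost factor are all made up): each daughter remembers which sites she changed and how fast her mother grew, and she mutates those sites more aggressively when she is outgrowing her mother:

```python
import random

def divide(genome, changed, mother_growth, fitness, mutation_rate=0.01, boost=10):
    """One division step in a toy 'evolutionary gradient descent'.

    genome: list of 0/1 sites; changed: set of sites that differ from the mother;
    mother_growth: the mother's growth rate (fitness)."""
    my_growth = fitness(genome)
    doing_better = my_growth > mother_growth
    daughter = list(genome)
    new_changed = set()
    for pos in range(len(genome)):
        rate = mutation_rate
        if doing_better and pos in changed:
            rate *= boost          # focus mutations where recent changes paid off
        if random.random() < rate:
            daughter[pos] = 1 - daughter[pos]
            new_changed.add(pos)
    return daughter, new_changed, my_growth
```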

Why don’t we see something like this in nature? Here are some possible answers:

Almost all mutations are deleterious

Almost always (viruses might be an exception), higher mutation rates are bad. Even applied in a controlled way (only to the regions that seem to matter), adding more mutations will make things worse rather than better.

The signal is very noisy

Survival or growth is a very noisy signal of how good a genome is. Maybe we just got luckier than our parents by being born at a time of glucose plenty. If the environment is not stable, overreacting to a single observation may be the wrong thing to do.

The relationship between phenotype and genotype is very tenuous

What we’d really like to do is something like “well, it seems that in this environment, it is a good idea if membranes are more flexible, so I will mutate membrane-stiffness more”. However, the relationship between membrane stiffness and the genotype is complex. There is no simple “mutate membrane-stiffness” option for a bacterium. Epistatic effects are killers for simple ideas like this one.

On the other hand, the relationship between any particular weight in a deep CNN and the output is also very tenuous, yet gradient descent still works there.

The cost of maintaining the information for gradient descent is too high

Perhaps it’s just not worth keeping all this accounting information, especially since it would add yet another layer of noise.

Maybe there are examples of natural gradient descent, we just haven’t found them yet

There are areas of genomes that are more recombination-prone than others (and somatic mutation in the immune system is certainly a mechanism of controlled chaos). Viruses may be another place where some sort of gradient descent could be found. Maybe plants, with their weird genomes, are using all those extra copies to transmit information across generations like this.

As I said, this is a post of random speculation while I wait for my jobs to finish….