Random work-related updates

There is life besides covid19.

  1. We’ve started blogging a bit as a group on our website, which is a development I am happy about.
  • http://big-data-biology.org/blog/2020-04-29-NME/ : The famous Tank story of a neural network that purported to tell apart American from Soviet tanks, but just classified on the weather (or time of day) is probably an urban legend, but this one is true. What I like about this type of blogging is that this is the type of story that takes a lot of time to unravel, but does not make it to the final manuscript and gets lost. [same story as Twitter thread]. Most of the credit for this post should go to Célio.
  • http://big-data-biology.org/blog/2020-04-10-cryptic : A different type of blogpost: this is a work-in-progress report on some our work. We are still far away from being able to turn this story into a manuscript, but, in the meanwhile, putting it all in writing and out there may accelerate things too. Here, Amy and Célio share most of the credit.

2. We’ve updated the macrel preprint, which includes a few novel things. We have also submitted it to a journal now, so fingers crossed! In parallel, macrel has been updated to version 0.4, which is now the version described in the manuscript.

3. NGLess is now at version 1.1.1. This is mostly a bugfix release compared to v1.1.0.

4. We submitted two abstracts to ISMB 2020: The first one focuses on macrel, while the second one is something we should soon blog about: we have been looking into the problem of predict small ORFs (smORFs) more broadly. For this we took the dataset from (Sberro et al., 2019) which had used conservation signatures to identify real smORFs in silico and treated it as a classification problem: is is possible to based solely on the sequence (including the upstream sequence) to identify whether a smORF is conserved or not (which we are taking as a proxy for it being a functional molecule).

Did anyone in Santa Clara County get Covid-19?

This is not a real question, we know that yes, unfortunately, they did. There are over one thousand confirmed cases. But a recent preprint, posits that over 40,000 and maybe as many as 80,000 people did.

The preprint is COVID-19 Antibody Seroprevalence in Santa Clara County, California by Bendavid et al., MedArxiv, 2020 https://doi.org/10.1101/2020.04.14.20062463

The preprint is not very informative on methods, but, it seems to me, that, if this was the only evidence we had about Santa Clara County, it would barely be enough, according to the traditional standards of scientific evidence, for the authors to claim that anyone in Santa Clara got infected. In any case, the prevalence reported is likely an over estimate.

The evidence

The authors tested 3,330 individuals using a serological (antibody) test and obtained 50 positive results (98.5% tested negative, 1.5% tested positive).

This test was validated by testing on 401 samples from individuals who were known to not have had covid19. Of these 401 known negatives, the test returned positive for 2 of them. The test was also performed on 197 known positive samples, and returned positive for 178 of them.

There are some further adjustments to the results to match demographics, but they are not relevant here.

The null hypothesis

The null hypothesis is that none of the 3,330 individuals had ever come into contact with covid19 and had no antibodies.

Does the evidence reject the null hypothesis?

Maybe, but it’s not so clear and we need to bring in the big guns to get that result. In any case, the estimate of 1.5% is almost certainly an over-estimate.

Naïve version

  1. Let’s estimate the specificity: the point estimate is 99.5% (399/401), but with a confidence interval of [98.3-99.9%]
  2. With the point estimate of 99.5%, then the the p-value of having at least one infection is a healthy 2·10⁻¹¹
  3. However, if the confidence interval is [98.3-99.9%], we should perhaps assume a worst-case. If specificity is 98.3% then the number of positive tests is actually slightly lower than expected (we expected 57 just by chance!). Only if the specificity is above 98.7% do we get some evidence that there may have been at least one infected person.

With this naive approach, the authors have not shown that they were able to detect any infection.

Semi-Bayesian approach

A more complex model samples over the possible values of the specificity

  1. If the specificity is modeled as a random variable, then we have observed 2 positives out of 401 in known cases.
  2. So, we can set a posterior for it of Beta(402, 3) (assuming the classical Beta(1,1) prior).
  3. Let’s now simulate 3330 tests, assuming they are all negative, with the given specificity.
  4. If we repeat this many times, then, about 7% of the times, we get 50 or more false positives! So, p-value=0.07. As the olds say, trending towards significance.

Still no evidence for any infections in Santa Clara County.


Finally, the big guns get the right result (we actually know that there are infections in Santa Clara County).

We now do a full Bayesian model

  1. We model the specificity as above (with a prior Beta(1,1)), but keep also model the true prevalence as another Bayesian variable.
  2. Now, we have some true positives in the set, given by the prevalence.
  3. We model the sensitivity in the same way that we had modeled the specificity already, as a random variable constrained by observed data.
  4. We know that the true + false positives equals 50.

Now the prevalence credible interval is (0.4-1.7%). It is not credible that there are zero individuals who are positives anymore. The credible interval for the number of positives in the set is (15-51).

The posterior distribution for the prevalence is the following:

In this post, I did not even consider the fact that the sampling was heavily biased (it is: the authors recruited through Facebook, it can hardly be expected that people who are looking to get tested would not be enriched for those who suspect were sick).

In the end, I conclude that there is evidence that the prevalence is greater than zero, but likely not as large as the authors claim.

PS: Code is at https://gist.github.com/luispedro/bbacfa6928dd3f142aba64bfa4bd3334

What’s the deal with computational irreproducibility?

How can computational results be so hard to reproduce? Even with the same input and the same code one can get different results. Shouldn’t computers always return the same results for the same computation?

Let’s look at a few classes of problems, from the easiest to solve to the most complicated.

Different, but equivalent, results

Two different gzip files can uncompress to the same result.

This is obviously a meaningless difference. When we promise that we return a certain result, we should not bound ourselves to specific ways of encoding it.

On NGLess, we did learn this the hard way because some of our tests seemed to be flaky at some point as we were comparing the compressed files. So, depending on the machine, tests would either pass or fail. Now, we test the uncompressed versions

Incompletely specified results

What does a sort algorithm return? Well, obviously it should return a sorted version of its inputs. The problem comes when there are “equivalent” (but not identical) items in the set: in which order should they be returned?

In this case, one can use stable sorting, which preserve the order of “equivalent” input elements. Unfortunately, the fastest sorting algorithms are not stable and use randomness (see below). Alternatively, one can use some tie breaking system so that no two elements compare equal. Now, the results are fully specified by the inputs. This can be done even on attributes that would otherwise be meaningless: for example, if you want to display the results of your processing so that the highest scoring sequences come first, you can sort by scores and, if the scores are identical, break ties using the sequence itself (it’s pretty meaningless that sequences starting with Alanines should come before those starting with Valines, but it means that the output is completely specified).

Another problem is when results depend on the environment. For example, if your tool sorts strings, then it will depend on the environment how this sorting is done! This is a huge rabbit hole and arguably a big mistake in API design that the default sort function in programming languages is not a pure function but depends on some deeply hidden state, but we have seen it cause problems where partial results were sorted in incompatible ways. In NGLess, we always use UTF-8 and we always sort in the same way (our results matrices are sorted by row name and those always use the same sort). The cost is that we will not respect all the nuances in how sorting “should” be done differently in Canadian French vs. European French. The gain is that Canadians and French will get the same results.


Many algorithms use random numbers. In practice, one rarely needs truly random numbers and can use pseudo-random numbers, which only behave random, but are perfectly reproducible.

Furthermore, these methods can take a seed which sets their internal machinery to known values so that one can obtain the same sequence every time.

As an alternative to setting it to the same value (which is not appropriate for every situation), one can also set it to a data-dependent value. For example, if you process sequences by batches of 100 sequences, it may be inappropriate to reuse the same seed for every new batch as this could easily create biases. Instead, one can set the seed based on a simple computation from the input data itself (a quick hash of the inputs). Now, each batch will be (1) reproducible and (2) have a different pseudo-random pattern.

Non-deterministic results

Some processes can become truly non-deterministic, not just pseudo-random. This can easily happen if, for example, threads are used.

In the example above, I mentioned resetting the seed for each batch. In a sequential system, this would be complete overkill, but if batches are being processed by separate threads, it is the only way.

Even with these tricks, one can easily have non-deterministic results if there is any state shared between batches or if the order in which they are computed influences the result.

Heuristics and approximations

Now, we get into the really complicated cases: very often, we do not have a true solution to the problem. When we use a tool like bwa we are not really solving the problem of find all the best alignments given a specific definition. There is software that solves that (e.g., Swipe), but it is too slow. Instead, bwa employs a series of very smart heuristics that will give you a very good approximate solution at a small fraction of the (computational) cost.

Now, definition becomes the output is the result of running bwa. This seems qualitatively different from saying the output is a sorted version of the input.


If we now admit that our results are defined by this is the result of running program X on the data as opposed to a more classical mathematical definition, then it becomes imperative that we specify which program it is. Furthermore, we need to specify which version of the program it is. In fact, specifying version is a well-recognized best practice in computational software.

The problem is that it is very hard to version the full stack. You may write in your manuscript that you are using version 1.6.3 of tool X, but if that tool depends on Numpy and Python, you may need to define the full version of those as well (and even that may not be enough). So, while it may be true that computers return the same result for the same computation, this means that we need to present the computer with the same computation all the way from the script code we wrote through to the device drivers.

In this respect, R is a bit better than Python at keeping compatibility, but even R has changed elements such as the random number generator it uses so that even if you were setting the seed to a fixed value as we discussed above, it would give you different results.

My preference is that, if people are going to be providing versions, that they that they provide a machine-readable way to generate the full environment (e.g., a default.nix file, a environment.yml conda file, …). Otherwise, while it is not completely useless, it is often not that informative either.

Nonetheless, this comes with costs: it becomes harder to compose. If tool 1 needs Python 3.6.4, tool 2 needs Python 3.5.3, and tool 3 needs Python 3.5.1, we must have all of them available and switch between them. We do have more and more infrastructure to make this switches fast-enough, but we still end installing Gigabytes of dependencies to run a script of 230 lines.

This also points to another direction: the more we can move away from this is the result of running X v1.2.3 to having outputs be defined by their inputs, the less dependent on specific versions of the tools we become. It may be impossible to get this 100%, but maybe we can get better than we have now. With NGLess, we have tried to move that way in minor ways so that the result does not depend on the version of the tool being run, but we’re still not 100% there.

Big Data Biology Lab Software Tool Commitments

Cross-posting from our group website.

Preamble. We produced two types of code artefacts: (i) code that is supportive of results in a results-driven paper and (ii) software tools intended for widespread use.

For an example of the first type, see the Code Ocean capsule that is linked to (Coelho et al., 2018). The main goal of this type of code release is to serve as an Extended Methods section to the paper. Hopefully, it will be useful for the small minority of readers of the paper who really want to dig into the methods or build upon the results, but the work aims at biological results.

This document focuses on the second type of code release: tools that are intended for widespread use. We’ve released a few of these in the last few years: JugNGLess, and Macrel. Here, the whole point is that others use the code. We also use these tools internally, but if nobody else ever adopts the tools, we will have fallen short.

The Six Commitments

  1. Five-year support (from date of publication) If we publish a tool as a paper, then we commit to supporting it for at least five years from the date of publication. We may stop developing new features, but if there are bugs in the released version, we will assume responsibility and fix them. We will also do any minor updates to keep the tool running (for example, if a new Python version breaks something in one of our Python-based tools, we will fix it). Typically, support is provided if you open an issue on the respective Github page and/or post to the respective mailing-list.
  2. Standard, easy to install, packages Right now, this means: we provide conda packages. In the future, if the community moves to another system, we may move too.
  3. High-quality code with continuous integration All our published packages have continuous integration and try to follow best practices in coding.
  4. Complete documentation We provide documentation for the tools, including tutorials, example data, and reference manuals.
  5. Work well, fail well We strive to make our tools not only work well, but also “fail well”: that is, when the user provides erroneous input, we attempt to provide good quality error messages and to never produce bad output (including never producing partial outputs if the process terminates part-way through processing).
  6. Open source, open communication Not only do we provide the released versions of our tools as open source, but all the development is done in the open as well.

Note for group members: This is a commitment from the group and, at the end of the day, the responsibility is Luis’ responsibility. If you leave the group, you don’t have to be responsible for 5 years. If you leave, your responsibility is just the basic responsibility of any author: to be responsive to queries about what was described in the manuscript, but not anything beyond that. What it does mean is that we will not be submitting papers on tools that risk being difficult to maintain. In fact, while the goals above are phrased as outside-focused, they are also internally important so that we can keep working effectively even as group members move on.

Towards typed pipelines

In the beginning, there was the word. The word was a 16 bit variable and it could be used to store any type of information: you could treat it as an integer (signed or unsigned) or as a pointer. There was no typing. Because this was very error-prone, Hungarian notation was invented, which was a system whereby the nameof a variable was enhanced to contain the type that the programmer intended, so that piCount was a pointer to integer named Count.

Nowadays, all languages are typed, either at compile-time (static), at run-time (dynamic), or a mix of the two and Hungarian notation is not used.

There are often big fights on whether static or dynamic typing is best (329 million Google hits for “is static or dynamic typing better”), but typing itself is uncontroversial. In fact, I think most programmers would find the idea of using an untyped programming language absurd.

Yet, there is one domain where untyped programming is still widely used, namely when writing pipelines that combine multiple programmes. In that domain, there is one type, the stream of Bytes. The stream of Bytes is one of the most successful types in programming. In fact everything is a file (which is snappier than everything is a stream of Bytes, even though that is what it means) is often considered one of the defining features of Unix.

Like “B” programmes of old (the untyped programming language that came before “C” introduced all the type safety that is its defining characteristic), most pipelines use Hungarian notation to mark file types. For example, if we have a file called data.fq.gz, we will assume that it is a gzipped FastQ file. There is additionally some limited dynamic typing: if you try to un-gzip a file that is not actually compressed, then gzip is almost certain to detect this and fail with an error message.

However, for any other type of programming, we would never consider this an acceptable level of typing: semi-defined Hungarian notation and occasional dynamic checks is not a combination proposed by any modern programming language. And, thus, the question is: could we have typed pipelines, where the types correspond to file types?

When I think of the pros and cons of this idea, I constantly see that it is simply a rehash of the discussion of types in programming languages.

Advantages of typed pipelines

  1. Better error checking. This is the big argument for types in programming languages: it’s a safety net.
  2. More automated type-transformations. We call this casting in traditional programming languages, to transform one type into another. In pipelines, this could correspond to automatically compressing/decompressing files, for example. Some tools support this already: just pass in a file with the extension .gz and it will be uncompressed on the fly, but not all do and it’s not universal (gzip is widely supported, but not universally so and it is hard to keep track of which tools support bzip2). In regular programming languages, we routinely debate how much casting should be automatic and how much needs to be made explicit.


  1. It’s more work, at least at the beginning. I still believe that it pays off in the long run is true, but you do require a bit more investment at the onset. In most programming languages, the debate about typing is dead: typing won; However, there is still debate on static vs dynamic typing, which hinges on this idea of more work now for less work later.
  2. False sense of security as there will still be bugs and slight mismatches in file types (e.g., you can imagine a pipeline that takes an image as an input and can accept TIFF files, but actually fails in one of the 1,000s of subtypes of TIFF files out there as TIFF is a complex format). This is another mirror of some debates in programming languages: what if you have a function that accepts integers, but only integers between 0 and 100, is it beneficial to have a complex type system that guarantees this?
  3. Sometimes, it may just be the right thing to do to pass one type as another type and a language could be too restrictive. I can imagine situations where it could make sense to treat a FastQ file as a columnar file to extract a particular element from the headers. Perhaps our language needs an escape hatch from typing. Some languages, such as C provide such escape hatches by allowing you to treat everything as Bytes if you wish to. Others do not (in fact, many do so indirectly by allowing you to call C code for all the dangerous stuff).

I am still not sure what a typed pipeline would look like exactly, but I increasingly see this as the question behind the research programme that we started with NGLess (see manuscript), although I had started thinking about this before, with Jug (https://jug.readthedocs.io/en/latest/).

Update (July 6 2019): After the initial publication of this post, I was pointed to Bioshake, which seems to be a very interesting attempt to go down this path.

Update II (July 10 2019): A few more pointers: janis is a Python-library for workflow definition (an EDSL) which includes types for bioinformatics. Galaxy also includes types for file data.

NG-meta-profiler & NGLess paper published

(I wanted to write about this earlier, but June was a crazy month with manuscript submissions, grant submissions, and a lot of travel.)

The first NGLess manuscript was finally published. See this twitter thread for a summary, which I will not rehash here as I have already written extensively about NGLess and the ideas behind it here.

I also wrote a Nature Microbiology Community blogpost with some of the history behind the tool, emphasizing again how long it takes to get to a robust tool.

Compared to the preprint, the major change is that (in response to reviewer comments) we enhanced the benchmarking section. Profiling metagenomics tools is difficult. If you use an in silico simulation, you need a realistic distribution as a basis. In our case, we used real data to define the species distribution (using mOTUs2, see https://motu-tool.org/). To obtain the simulated metagenomes, we simulated reads from sequenced genomes according to the real data distribution. There are limitations to this approach, in that we still miss a lot of the complexity of real samples, but we cannot simulate the true unknown. In the end, the functional profiles produced by NG-meta-profiler had high correlations with the ground truth (0.88 for the human gut, 0.82 for the marine environment, Spearman correlation).

Additionally, we included a more explicit discussion of the advantages of NGLess for developing tools like NG-meta-profiler, in the Pipeline design with NGLess section.

Thoughts on “Revisiting authorship, and JOSS software publications”

This is a direct response to Titus’ post: Revisiting authorship, and JOSS software publications. In fact, I started writing it as a comment there and it became so long that I decided it was better as its own post.

I appreciate the tone of Titus’ post, more asking questions than answering them, so here are my two or three cent:

There is nothing special about software papers. Different fields have different criteria for what consistutes authorship. I believe that particle physics has an approach close to “everyone that committed to the git repo gets to be an author”, which leads to papers with >1000 authors (note that I am not myself a particle physicist or anything close to it). At that point, the currency of authorship and citations is dilluted to the point where I seriously don’t know what the point is. What is maybe special about software papers is that, because they are newer, there hasn’t been time for a rough-consensus to emerge on what should be the criteria (I guess this is done in part in discussions like these). Having said that, even in well-established fields, you still have issues that are never resolved (in molecular biology: the question of whether technicians should be listed as authors brings in strong opinions on both sides).

My position is against every positive contribution deserves authorship. Some positive contributions are significant enough that they deserve authorship, others acknowledgements, others can even go unmentioned. Yes, it’s a judgement call what is “significant” and what is not, but I think someone who, for example, reports a bug that leads to a bugfix is a clear NO (even if it’s a good bug report). Even small code contributions should not lead to authorship (perhaps an acknowledgement is where my indecision boundary is at that point). People who proof read a manuscript also don’t get authorship even if they do find a few typos or suggest a few wording changes.

Of course, contribution need not be code. Tutorials, design, &c, all count. But they should be significant. I also would not consider adding as an author an individual that asked a good question during a seminar on the work. Those sometimes turn out to be like good bug reports in that you improve the work based on them. The fact that significance is a judgement call does not imply that we should drop significance as a criterion.

I think authorship is also about responsibility. If you are an author, then you must take responsibility for some part of the work (naturally, not all of it, but some of it). If there are issues later, it is your responsibility to, at the very least, explain what you did or even to fix it, &c. You should be involved in the paper writing and if, for example, some work is need for revision on that particular aspect of the code, you need to do it.

From my side, I have submitted several patches to projects which were best-efforts at the time, but I don’t want to take any responsibility for the project beyond that. If the author of one of those projects now told me that I needed to redo that to work on Mac OS X because one of the reviewers complained about it, I’d tell them “sorry, I cannot help you”. I don’t think authors should get to do that.

I would get off-topic here, but I also wish there was more of an explicit expectation that if you publish a software paper, you shall provide minimal maintenance for 5-10 years. Too often software papers are the obituary of the project rather than the announcement.

My answer to “Another question: does authorship keep accruing over versions? Should all the authors on sourmash 2.0 be authors on sourmash 3.0?” is a strong NO. You don’t double dip. If anything, I think it’s generally the authors of version 2 that often lose out as people benefit from their work and keep citing version 1.

Finally, a question of my own: what if someone does something outside the project that clearly benefits the project, should they be authors? For example, what if someone does a bioconda package for your code? Or writes an excellent blogpost/tutorial that brings you a very large number of users (or runs a tutorial on it)? Should they be authors? My answer is that it first needs to be significant, and it should not be automatic. It may be appropriate to invite them for authorship, but they should then commit to (at the very least) reading the manuscript and keeping up with the development of the project over the medium-term.

The European Court of Justice’s decision that CRISPR’d plants are GMOs is the right interpretation of a law that is bonkers

The European Court of Justice (think of it as the European Supreme Court) declared that CRISPR’d plants count as GMOs.

I think the Court is correct, CRISPR’d plants are GMOs. The EU does not have a tradition of “legislation by judicial decision” like the US’s Supreme Court (although there have been some instances of such, as in the Uber case). Thus, even though I wish the decision had gone the other way as a matter of legislation, as a matter of legal interpretation, it seems clear that the intent of the law was to ban modern biotechnology as scary and, I don’t see how CRISPR does not fill that role.

The decision is scientifically bonkers, in that it says that older atomic gardening plants are kosher, but the exact same organism would be illegal if it were to be obtained by bioengineering methods. According to this decision, you can use CRISPR to obtain and test a mutation. At this point, it’s a GMO, so you cannot sell it in most of Europe. Then you use atomic gardening, PCR, and cross-breeding and you obtain exactly the same genotype. However, now, it’s not a GMO, so it’s fine to sell it. The property of being a GMO is not a property of the plant, but the property of its history. Some plants may carry markers which will identify them as GMOs, but there may be many cases where you can have two identical plants, only one of which is a GMO. This has irked some scientists (see this NYT article), but frankly, it is the original GMO law that is bonkers in that it regulates a method of how to obtain a plant instead of regulating the end result.

On this one, blame the lawmakers, not the court.

HT/ @PhilippBayer

NGLess preprint is up

We have posted a preprint describing NG-meta-profiler and NGLess in general:

NG-meta-profiler: fast processing of metagenomes using NGLess, a domain-specific language Luis Pedro CoelhoRenato AlvesPaulo MonteiroJaime Huerta-CepasAna Teresa FreitasPeer Bork 

My initial goal was to develop a tool that (1) used a domain-specific language to describe computation (2) was actually used in production. I did not want a proof-of-concept as one of the major arguments for developing a domain-specific language (DSL)  is that it is more usable than just doing a traditional library in another language. As I am skeptical that you can fully evaluate how good a tool is without long-term, real-world, usage,  I wanted NGLess to be used in my day-to-day research.

NGLess has been a long-time cooking but is now a tool that we use every day to produce real results. In that sense, at least, our objectives have been achieved.

Now, we hope that others find it as useful as we do.

Quick followups: NGLess benchmark & Notebooks as papers

A quick follow-up on two earlier posts:

We finalized the benchmark for ngless that I had discussed earlier:

As you can see, NGLess performs much better than either MOCAT or htseq-count. We tried to use featureCounts too, but that completely failed to produce results for some of the samples (we gave it a whopping 1TB of RAM, but it used it all up before crashing).

It also reveals that although ngless was developed in the context of our metagenomics work, it would also likely do well on the type of problems for which htseq-count is currently being used, in the domain of RNA-seq.


Earlier, I also  wrote skeptically about the idea of replacing papers with Jupyter notebooks:

Is it even a good idea to have the presentation of the results mixed with their computation?

I do see the value in companion Jupyter notebooks for many cases, but as a replacement for the main paper, I am not even sure it is a good idea.

There is a lot of accidental complexity in code. A script that generates a publication plot may easily have 50 lines that do nothing more than set up the plot just right: (1) set up the subplots, (2) set x- and y-labels, (3) fix colours, (4) scale the points, (5) reset the font sizes, &c. What value is there in keeping all of this in the main presentation of the results?

The little script that generates the plot above is an excellent example of this. It is available online (on github: plot-comparison.py). It’s over 100 lines and, even then, the final result required some minor aesthetic manipulations in inkscape (so that, if you run it, the result is slightly uglier: in particular, the legend is absent).

Would it really add anything to the presentation of the manuscript to have those 100 lines of code be intermingled with the presentation of ngless as a metagenomics profiler?

In this case, I am confident that a Jupyter notebook would be worse than the current solution of a PDF as a main presentation with the data table and plotting scripts as supplemental material.