Perfect reproducibility using ngless [2/5]

NOTE: As of Feb 2016, ngless is available only as a pre-release to allow for testing of early versions by fellow scientists and to discuss ideas. We do not consider it /released/ and do not recommend use in production as some functionality is still in testing. Please get in touch if you are interested in using ngless in your projects.

This is the second of a series of five posts introducing ngless.

  1. Introduction to ngless
  2. Perfect reproducibility [this post]
  3. Fast and high quality error detection
  4. Extending and interacting with other projects
  5. Miscellaneous

Perfect reproducibility using ngless

With ngless, your analysis is perfectly reproducible forever. In particular, this is achieved by the use of explicit version strings in ngless scripts. Let us look at the first two lines of the example we used before:

ngless "0.0"
import OceanMicrobiomeReferenceGeneCatalog version "1.0"

The first line notes the version of ngless that is to be used to run this script. At the moment, ngless is in a pre-release phase and so the version string is “0.0”. In the future, however, this will enable ngless to keep improving while still allowing all past scripts to work exactly as intended. No more “I updated some software package and now all my scripts give me different results.” Everything is explicitly versioned.

There are several command line options for ngless, which can change the way that it works internally (e.g., where it keeps its temporary files, whether it is verbose or not, how many cores it should use, &c). You can also use a configuration file to set these options. However, no command line or configuration option changes the final output of the analysis. Everything you need to know about the results is in the script.

Reproducible, extendable, and reviewable

It’s not just about reproducibility. In fact, reproducibility is often not that great per se: if you have a big virtual machine image with 50 dependencies, which runs a 10,000 line hairy script to reproduce the plots in a paper, that’s reproducible, but not very useful (except if you want to really dig in). Ngless scripts, however, are easily extendable and even inspectable. Recall the rest of the script (the bits that do actual work):

input = paired('data/data.1.fq.gz', 'data/data.2.fq.gz')
preprocess(input) using |read|:
    read = substrim(read, min_quality=30)
    if len(read) < 45:
        discard

mapped = map(input, reference='omrgc')
summary = count(mapped, features=['ko', 'cog'])
write(summary, ofile='output/functional.summary.txt')

If you have ever worked a bit with NGS data, you can probably get the gist of what is going on. Except for maybe some of the details of what substrim does (it trims the read to the largest substring where all nucleotides are of at least the given quality, see docs), your guess of what is going on would be pretty accurate.
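The logic behind substrim can be sketched in a few lines of Python. This is an illustrative re-implementation of the "longest run of good-quality bases" idea, not ngless's actual code:

```python
def substrim(qualities, min_quality):
    """Return (start, end) delimiting the longest contiguous run of
    positions whose quality scores are all >= min_quality.

    A sketch of the idea behind ngless's substrim; the real
    implementation operates on reads, not bare quality lists."""
    best_start, best_len = 0, 0
    run_start = 0
    for i, q in enumerate(qualities):
        if q < min_quality:
            # run is broken; the next candidate run starts after this base
            run_start = i + 1
        elif i + 1 - run_start > best_len:
            best_start, best_len = run_start, i + 1 - run_start
    return best_start, best_start + best_len

# Example: only the middle of the read has high quality
quals = [12, 35, 38, 40, 33, 10, 31]
substrim(quals, min_quality=30)  # -> (1, 5): keeps positions 1..4
```

The trimmed read would then be the slice of the original read between those two positions, and the `if len(read) < 45: discard` step in the script above throws away reads that become too short after trimming.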

It is easily extendable: If you want to add another functional table, perhaps using KEGG modules, you just need to edit the features argument to the count function (you’d need to know which ones are available, but after looking that up, it’s a trivial change).

If you now wanted to perform a similar analysis on your data, I bet you could easily adapt the script for your settings.


A few weeks ago, I asked on twitter/facebook:

Science people: anyone know of any data on how often reviewers check submitted software/scripts if they are available? Thanks.


I didn’t get an answer to the original question, but there was a bit of discussion and, as far as I know, nobody really checks code submitted along with papers (which is, by itself, not enough of a reason to not demand code). However, if you were reviewing a paper and the supplemental material had the little script above, you could easily check it out and make sure the authors used the right settings and databases. The resulting code is inspectable and reviewable.


Friday Links

I have so many links this week, that I am thinking of changing the format, from a regular Friday Links feature to posting some of these short notes as their own posts.

1. On reproducibility and incentives

The linked position paper has an interesting discussion of incentives [1]:

Publishing a result does not make it true. Many published results have uncertain truth value. Dismissing a direct replication as “we already knew that” is misleading; the actual criticism is “someone has already claimed that.”

They also discuss how for-profit actors (pharma and biotech) have better incentives to replicate (and to get it right in the first place, rather than merely published):

Investing hundreds of thousands of dollars on a new treatment that is ineffective is a waste of resources and an enormous burden to patients in experimental trials. By contrast, for academic researchers there are few consequences for being wrong. If replications get done and the original result is irreproducible nothing happens.

2. A takedown of a PNAS paper:

If, as an Academic Editor for PLOS One I had received this article as a manuscript, I would probably have recommended Rejection without sending it out for further review. But if I had sent the manuscript out for review, I would have chosen at least some reviewers with relevant psychometric backgrounds.


By the way, the linked publication has a very high altmetric score (top 5% and it was only published last week).

3. On learning bioinformatics. This is the 21st century, the hard part is motivation

4. An awesome video via Ed Yong: what happens when a mosquito bites

I typically avoid science popularizations as frustrating to read (they are often oversimplified to the point of being wrong, or try to spin a scientific point into some BS philosophy), but Ed Yong is so refreshing. He is truly fascinated by the science itself and understands it.

(The Economist, which I used to have time to read, is also excellent at science, as it is at everything else).

5. A gem found by Derek Lowe:

Emma, please insert NMR data here! where are they? and for this compound, just make up an elemental analysis…

In the supplemental data of a published paper.

6. The dangers of lossy image compression: scanning documents semi-randomly changes numbers. Because rows of digits can look very similar in pixel distance, the compressor reuses patches of the image, substituting one number for another!

7. The Mighty Named Pipe

[1] I quote from the preprint on arXiv, I hope it hasn’t changed in the final version.