Numpy/scipy backwards stability debate (and why freezing versions is not the solution)

This week, a discussion broke out about the stability of the Python scientific ecosystem. It was triggered by a blog post from Konrad Hinsen, which led to several twitter follow-ups.

First of all, let me say that numpy/scipy are great. I use them and recommend them all the time. I am not disparaging the projects as a whole or the people who work on them. It’s just that I would prefer if they were more stable. Given twitter’s limitations, perhaps this was not as clear as I would like in my own twitter response:

I pointed out that I have been bitten by API changes:

All of these are individually minor (and can be worked around), but these are just the issues that I have personally run into and that caused enough problems for me to remember them. The most serious was the mannwhitneyu change, which was a silent change (i.e., the function started returning a different result rather than raising an exception or another type of error).


Konrad had pointed out the Linux kernel project as one extreme version of “we never break user code”:

The other extreme is the Silicon-Valley-esque “move fast and break stuff”, which is appropriate for a new project. These are not binary decisions, but two extremes of a continuum. I would hope to see numpy move more towards the “APIs are contracts we respect” side of the spectrum as I think it currently behaves too much like a startup.

Numpy does not use semantic versioning, but if it did almost all its releases would be major releases as they almost always break some code. We’d be at Numpy 14.0 by now. Semantic versioning would allow for a smaller number of “large, possibly-breaking releases” (every few years) instead of a constant stream of minor backwards-incompatible changes. We’d have Numpy 4.2 now, and a list of deprecated features to be removed by 5.0.

Some of the problems that have arisen could have been solved by (1) introducing a new function with the better behaviour, (2) deprecating the old one, (3) waiting a few years and removing the original version (in a major release, for example). This would avoid the most important problem, silent changes.
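The first two steps can be sketched in a few lines of Python. The names median and median_interpolated are hypothetical stand-ins for any function whose behaviour needs to change; this is a minimal sketch of the pattern, not how numpy or scipy actually manage their deprecations.

```python
import warnings


def median_interpolated(values):
    """Step 1: a new function (hypothetical) with the better behaviour.

    For even-length input it averages the two middle values.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def median(values):
    """Step 2: the old function keeps its old result, but warns callers."""
    warnings.warn(
        "median() is deprecated and will be removed in the next major "
        "release; use median_interpolated() instead",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller's line
    )
    ordered = sorted(values)
    return ordered[len(ordered) // 2]  # old behaviour: upper middle element
```

Old code keeps getting the old answer (plus a visible warning) until the old name is removed in step 3, so nothing changes silently.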


A typical response is to say “well, just use anaconda (or similar) to freeze your dependencies”. Again, let me be clear, I use and recommend anaconda for everything. It’s great. But, in the context of the backwards compatibility problem, I don’t think this recommendation is really thought through, as it only solves a fraction of the problem at hand (again, an important fraction, but it’s not a magic wand). (See also this post by Titus Brown).

What does anaconda not solve? It does not solve the problem of the intermediate layer, libraries which use numpy, but are to be used by final users. What is the expectation here? That I release my computer vision code (mahotas) with a note: Only works on Numpy 1.11? What if I want a project that uses both mahotas and scikit-learn, but scikit-learn is for Numpy 1.12 only? Is the conclusion that you cannot mix mahotas and scikit-learn? This would be worse than the Python 2/3 split. A typical project of mine might use >5 different numpy-dependent libraries. What are the chances that they all expect the exact same numpy version?

Right now, the solution I use in my code is “if I know that this numpy function has changed behaviour, either work around it, avoid it, or reimplement it (perhaps by copying and pasting from numpy)”. For example, some functions return views or copies depending on which version of numpy you have. To handle that, just add a “copy()” statement to all of them and now you always have a copy. It’s computationally inefficient, but avoiding even a single bug over a few years probably saves more time in the end.
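As an illustration of the defensive copy, here is a minimal sketch using np.diagonal, which is (to my knowledge) one of the functions whose result went from a copy to a read-only view across numpy releases. The wrapper name diagonal_copy is hypothetical.

```python
import numpy as np


def diagonal_copy(matrix):
    """Return the diagonal as an independent, writable array.

    Whether np.diagonal hands back a copy or a view depends on the numpy
    version (an assumption stated in the text above); the explicit copy()
    makes the result behave the same either way.
    """
    return np.diagonal(matrix).copy()


m = np.arange(9).reshape(3, 3)
d = diagonal_copy(m)
d[0] = 100  # safe: this never writes through to m
```

The extra copy costs memory and time, but the caller no longer needs to know which numpy version is installed.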

It also happens all the time that I have an anaconda environment, add a new package and numpy is upgraded/downgraded. Is this to be considered buggy behaviour by anaconda? Anaconda currently does not upgrade everything to Python 3 when you request a package that is not available on Python 2, nor does it downgrade from 3 to 2; why should it treat numpy any differently if there is no guarantee that behaviour is consistent across numpy versions?

Sometimes the code at hand is not even an officially released library, but some code from another project. Let’s say that I have code that takes a metagenomics abundance matrix, does some preprocessing and computes stats and plots. I might have written it originally for a paper a few years back, but now want to do the same analysis on new data. Is the recommendation that I always write from scratch because it’s a new numpy version? What if it’s someone else asking me for the code? Is the recommendation that I ask “are you still on numpy 1.9, because I only really tested it there”? Note that Python’s dynamic nature actually makes this problem worse than in statically checked languages.

What about training materials? As I also wrote on twitter, it’s when teaching Python that I suffer most from Python 2-vs-Python 3 issues. Is the recommendation that training materials clearly state “This is a tutorial for numpy 1.10 only. Please downgrade to that version or search for a more up to date tutorial”? Note that beginners are the ones most likely to struggle with these issues. I can perfectly understand what it means that: “array == None and array != None do element-wise comparison” (from the numpy 1.13 release notes). But if I were just starting out, would I understand it immediately?
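For a beginner, the release note quoted above boils down to the difference between the two checks in this small sketch (the variable names are made up for illustration, and the element-wise result assumes numpy 1.13 or later, as that note describes):

```python
import numpy as np

arr = np.array([1.0, 2.0, 3.0])

# Identity check: always a single bool, whatever the numpy version
has_data = arr is not None

# Equality check: with element-wise comparison this is an array with one
# entry per element, not a single True/False
elementwise = (arr == None)  # noqa: E711 (deliberate, to show the behaviour)
```

Code that wrote `if arr == None:` to test for a missing array therefore changes meaning, which is exactly the kind of thing that is hard to explain in a first tutorial.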

Freezing the versions solves some problems, but does not solve the whole issue of backwards compatibility.


How NGLess uses its version declaration

NGLess is my metagenomics tool, which is based on a domain specific language. So, NGLess is both a language and a tool (which implements the language).

Since the beginning, ngless has had a focus on reproducibility and one of the small ways in which this was implemented was that ngless requires a version declaration. Every ngless script is required to start with a version declaration:

    ngless "0.5"

This was always intended to enable the language to change while keeping perfect reproducibility of past scripts. Until recently, though, this was just hypothetical.

In October, I taught a course on NGLess and it became clear that one of the minor inconsistencies in the previous version of the language (at the time, version “0.0”) was indeed confusing. In the previous version of the language, the preprocess function modified its arguments. No other function did this.

In version “0.5” (which was released on November 1st), preprocess is now a pure function, so that you must assign its output to a variable.

However, and this is where the version declaration comes into play, the newer executable still accepts scripts with the version declaration ngless "0.0". Furthermore, if you declare your script as using ngless 0.0, then the old behaviour is used. Thus, we fixed the language, but nobody needs to update their scripts.

Implementation note (which shouldn’t concern the user, but may be interesting to others): before interpretation, ngless will transform the input script, adding checks and optimizing it. A new pass (which is only enabled if the user requested version “0.0”) simply transforms the older code into its newer counterpart. Then, the rest of the process proceeds as if the user had typed in the newer version.
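The idea of such a pass can be sketched with a toy script representation in Python. This is not ngless’s actual code: the Call class and the functions upgrade_preprocess and prepare are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Call:
    """Toy statement: `target = func(arg)`, or a bare `func(arg)` if target is None."""
    func: str
    arg: str
    target: Optional[str] = None


def upgrade_preprocess(script: List[Call]) -> List[Call]:
    """Rewrite the old in-place `preprocess(x)` as `x = preprocess(x)`."""
    return [
        Call(s.func, s.arg, target=s.arg)
        if s.func == "preprocess" and s.target is None
        else s
        for s in script
    ]


def prepare(version: str, script: List[Call]) -> List[Call]:
    """Run the compatibility pass only for scripts declaring version "0.0"."""
    if version == "0.0":
        script = upgrade_preprocess(script)
    return script  # every later pass only ever sees the newer form


old_script = [Call("preprocess", "input"), Call("map", "input", target="mapped")]
upgraded = prepare("0.0", old_script)
```

Because the rewrite happens up front, nothing downstream of prepare needs to know that two versions of the language exist.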

Perfect reproducibility using ngless [2/5]

NOTE: As of Feb 2016, ngless is available only as a pre-release to allow for testing of early versions by fellow scientists and to discuss ideas. We do not consider it /released/ and do not recommend use in production as some functionality is still in testing. Please get in touch if you are interested in using ngless in your projects.

This is the second of a series of five posts introducing ngless.

  1. Introduction to ngless
  2. Perfect reproducibility [this post]
  3. Fast and high quality error detection
  4. Extending and interacting with other projects
  5. Miscellaneous

Perfect reproducibility using ngless

With ngless, your analysis is perfectly reproducible forever. In particular, this is achieved by the use of explicit version strings in the ngless scripts. Let us look at the first two lines in the example we used before:

ngless "0.0"
import OceanMicrobiomeReferenceGeneCatalog version "1.0"

The first line notes the version of ngless that is to be used to run this script. At the moment, ngless is in a pre-release phase and so the version string is “0.0”. In the future, however, this will enable ngless to keep improving while still allowing all past scripts to work exactly as intended. No more, “I updated some software package and now all my scripts give me different results.” Everything is explicitly versioned.

There are several command line options for ngless, which can change the way that it works internally (e.g., where it keeps its temporary files, whether it is verbose or not, how many cores it should use, &c). You can also use a configuration file to set these options. However, no command line or configuration option changes the final output of the analysis. Everything you need to know about the results is in the script.

Reproducible, extendable, and reviewable

It’s not just about reproducibility. In fact, reproducibility is often not that great per se: if you have a big virtual machine image with 50 dependencies, which runs a 10,000 line hairy script to reproduce the plots in a paper, that’s reproducible, but not very useful (except if you want to really dig in). Ngless scripts, however, are easily extendable and even inspectable. Recall the rest of the script (the bits that do actual work):

input = paired('data/data.1.fq.gz', 'data/data.2.fq.gz')
preprocess(input) using |read|:
    read = substrim(read, min_quality=30)
    if len(read) < 45:
        discard

mapped = map(input, reference='omrgc')
summary = count(mapped, features=['ko', 'cog'])
write(summary, ofile='output/functional.summary.txt')

If you have ever worked a bit with NGS data, you can probably get the gist of what is going on. Except for maybe some of the details of what substrim does (it trims the read by finding the largest substring where all nucleotides are of at least the given quality, see docs), your guess of what is going on would be pretty accurate.

It is easily extendable: If you want to add another functional table, perhaps using kegg modules, you just need to edit the features argument to the count function (you’d need to know which ones are available, but after looking that up, it’s a trivial change).

If you now wanted to perform a similar analysis on your data, I bet you could easily adapt the script for your settings.


A few weeks ago, I asked on twitter/facebook:

Science people: anyone know of any data on how often reviewers check submitted software/scripts if they are available? Thanks.


I didn’t get an answer for the original question, but there was a bit of discussion and as far as I know nobody really checks code submitted along with papers (which is, by itself, not enough of a reason to not demand code). However, if you were reviewing a paper and the supplemental material had the little script above, you could easily check it out and make sure the authors used the right settings and databases. The resulting code is inspectable and reviewable.

Friday Links

I have so many links this week that I am thinking of changing the format, from a regular Friday Links feature to posting some of these short notes as their own posts.

1. On reproducibility and incentives

The linked position paper has an interesting discussion of incentives [1]:

Publishing a result does not make it true. Many published results have uncertain truth value. Dismissing a direct replication as “we already knew that” is misleading; the actual criticism is “someone has already claimed that.”

They also discuss how for-profit actors (pharma and biotech) have better incentives to replicate (and to get it right, not just published, in the first place):

Investing hundreds of thousands of dollars on a new treatment that is ineffective is a waste of resources and an enormous burden to patients in experimental trials. By contrast, for academic researchers there are few consequences for being wrong. If replications get done and the original result is irreproducible nothing happens.

2. A takedown of a PNAS paper:

If, as an Academic Editor for PLOS One I had received this article as a manuscript, I would probably have recommended Rejection without sending it out for further review. But if I had sent the manuscript out for review, I would have chosen at least some reviewers with relevant psychometric backgrounds.


By the way, the linked publication has a very high altmetric score (top 5% and it was only published last week).

3. On learning bioinformatics. This is the 21st century, the hard part is motivation

4. An awesome video via Ed Yong: what happens when a mosquito bites

I typically avoid science popularizations as frustrating to read (often oversimplified to the point of being wrong or trying to spin a scientific point into some BS philosophy), but Ed Yong is so refreshing. He is truly fascinated by the science itself and understands it.

(The Economist, which I used to have time to read, is also excellent at science, as it is at everything else).

5. A gem found by Derek Lowe:

Emma, please insert NMR data here! where are they? and for this compound, just make up an elemental analysis…

In the supplemental data of a published paper.

6. The dangers of lossy image compression: scanning documents can semi-randomly change numbers. Because rows of numbers may look very similar in pixel distance, the system reuses patches of the image!

7. The Mighty Named Pipe

[1] I quote from the preprint on arXiv; I hope it hasn’t changed in the final version.