Numpy/scipy backwards stability debate (and why freezing versions is not the solution)

This week, a discussion broke out about the stability of the Python scientific ecosystem. It was triggered by a blog post from Konrad Hinsen, which led to several Twitter follow-ups.

First of all, let me say that numpy/scipy are great. I use them and recommend them all the time. I am not disparaging the projects as a whole or the people who work on them. It’s just that I would prefer them to be more stable. Given Twitter’s limitations, perhaps this was not as clear as I would have liked in my own Twitter response:

I pointed out that I have been bitten by API changes:

All of these are individually minor (and can be worked around), but these are just the issues that I have personally run into and that caused enough problems for me to remember them. The most serious was the mannwhitneyu change, which was a silent change (i.e., the function started returning a different result rather than raising an exception or another type of error).

*

Konrad had pointed out the Linux kernel project as one extreme version of “we never break user code”:

The other extreme is the Silicon-Valley-esque “move fast and break stuff”, which is appropriate for a new project. These are not binary decisions, but two extremes of a continuum. I would hope to see numpy move more towards the “APIs are contracts we respect” side of the spectrum as I think it currently behaves too much like a startup.

Numpy does not use semantic versioning, but if it did, almost all its releases would be major releases, as they almost always break some code. We’d be at Numpy 14.0 by now. Semantic versioning would allow for a smaller number of “large, possibly-breaking releases” (every few years) instead of a constant stream of minor backwards-incompatible changes. We’d have Numpy 4.2 now, and a list of deprecated features to be removed by 5.0.

Some of the problems that have arisen could have been solved by (1) introducing a new function with the better behaviour, (2) deprecating the old one, and (3) waiting a few years before removing the original version (in a major release, for example). This would avoid the most important problem: silent changes.
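
As a minimal sketch of that pattern (the function names here are hypothetical, not a real numpy or scipy API):

    import warnings

    def center_v2(arr):
        # (1) new function with the better behaviour: always returns a new array
        return arr - arr.mean()

    def center(arr):
        # (2) old entry point, kept as a loud-but-working alias for a few releases;
        # (3) it is only removed much later, in a major release
        warnings.warn("center() is deprecated; use center_v2() instead",
                      DeprecationWarning, stacklevel=2)
        return center_v2(arr)

Callers then get an explicit warning well before the old name disappears, instead of a silent change in results.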

*

A typical response is to say “well, just use anaconda (or similar) to freeze your dependencies”. Again, let me be clear: I use and recommend anaconda for everything. It’s great. But, in the context of the backwards compatibility problem, I don’t think this recommendation is really thought through, as it only solves a fraction of the problem at hand (again, an important fraction, but it’s not a magic wand). (See also this post by Titus Brown.)

What does anaconda not solve? It does not solve the problem of the intermediate layer: libraries which use numpy, but are themselves meant to be used by end users. What is the expectation here? That I release my computer vision code (mahotas) with a note: “Only works on Numpy 1.11”? What if I want a project that uses both mahotas and scikit-learn, but scikit-learn is for Numpy 1.12 only? Is the conclusion that you cannot mix mahotas and scikit-learn? This would be worse than the Python 2/3 split. A typical project of mine might use >5 different numpy-dependent libraries. What are the chances that they all expect the exact same numpy version?
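
To make the conflict concrete (the library name and version numbers below are made up), here is the difference between an exact pin and a version range in a library’s setup.py:

    from setuptools import setup

    setup(
        name='somevisionlib',   # hypothetical intermediate-layer library
        version='1.0',
        # An exact pin makes the library unusable next to anything that
        # needs a different numpy:
        #     install_requires=['numpy==1.11.0'],
        # A range avoids that, but it only helps if numpy's behaviour is
        # actually stable across the whole range:
        install_requires=['numpy>=1.9,<2.0'],
    )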

Right now, the solution I use in my code is: if I know that a numpy function has changed behaviour, either work around it, avoid it, or reimplement it (perhaps by copying and pasting from numpy). For example, some functions return views or copies depending on which version of numpy you have. To handle that, just add a “copy()” call to all of them and now you always have a copy. It’s computationally inefficient, but avoiding even a single bug over a few years probably saves more time in the end.
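
A concrete case is np.diagonal, which switched from returning a copy to returning a read-only view; forcing a copy gives the same behaviour on every version:

    import numpy as np

    x = np.arange(9.0).reshape(3, 3)

    # older numpy: x.diagonal() was a copy; newer numpy: a read-only view.
    # Adding .copy() makes the result writable and independent everywhere,
    # at the cost of one extra allocation.
    d = x.diagonal().copy()
    d[0] = -1.0   # never writes back into x, whatever the numpy version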

It also happens all the time that I have an anaconda environment, add a new package, and numpy is upgraded/downgraded. Is this to be considered buggy behaviour by anaconda? Anaconda currently does not upgrade everything to Python 3 when you request a package that is not available on Python 2, nor does it downgrade from 3 to 2; why should it treat numpy any differently if there is no guarantee that behaviour is consistent across numpy versions?

Sometimes the code at hand is not even an officially released library, but some code from another project. Let’s say that I have code that takes a metagenomics abundance matrix, does some preprocessing, and computes stats and plots. I might have written it originally for a paper a few years back, but now want to run the same analysis on new data. Is the recommendation that I always rewrite it from scratch because there is a new numpy version? What if it’s someone else asking me for the code? Is the recommendation that I ask “are you still on numpy 1.9? Because I only really tested it there”? Note that Python’s dynamic nature actually makes this problem worse than in statically checked languages.

What about training materials? As I also wrote on Twitter, it’s when teaching Python that I suffer most from Python 2-vs-Python 3 issues. Is the recommendation that training materials clearly state “This is a tutorial for numpy 1.10 only. Please downgrade to that version or search for a more up-to-date tutorial”? Note that beginners are the ones most likely to struggle with these issues. I can perfectly understand what it means that “array == None and array != None do element-wise comparison” (from the numpy 1.13 release notes). But if I were just starting out, would I understand it immediately?
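
For example, under numpy 1.13 the comparison is element-wise, while older releases returned a single scalar False for the same expression:

    import numpy as np

    a = np.array([1.0, 2.0, 3.0])

    print(a == None)
    # numpy >= 1.13: element-wise result, [False False False]
    # older numpy:   a single False, so `if a == None:` that used to run
    # quietly now raises a "truth value is ambiguous" ValueError instead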

Freezing the versions solves some problems, but does not solve the whole issue of backwards compatibility.

7 thoughts on “Numpy/scipy backwards stability debate (and why freezing versions is not the solution)”

  1. Really nice post, thank you!

    I am glad you mentioned Semantic Versioning. It had not come up in the Twitter conversations yet and I was waiting to drop it in :).

  2. Quite confused by some items here. It is my understanding that conda does not at all support an upgrade between Python major versions (i.e. 2->3) within an environment?
    Also, you write about the numpy view vs copy issue: “To handle that, just add a “copy()” statement to all of them and now you always have a copy. It’s computationally inefficient, but avoiding even a single bug over a few years probably saves more time in the end.”
    How is it more computationally inefficient than what the design of your code demands? If you only need a view, you would NOT add “.copy()” to the numpy call, so when numpy now returns a copy, THAT in turn is inefficient on numpy’s side. If your code on the other hand REQUIRES a copy, you add a “.copy()” call, but that’s not more inefficient than before, is it? You need a copy, you get a copy; whether it comes from the upper-level numpy function or from an added “.copy()” call does not change efficiency, to my knowledge?
    Finally, in your last section you write about the problem of keeping teaching material up to date, and I share this pain, especially for fast-moving new plotting libraries. But your text mixes Python 2/3 with numpy issues in a way I find confusing; I cannot see the connection?

    1. I might have been unclear, so perhaps I can clarify it here:

      “It is my understanding that conda does not at all support an upgrade between Python major versions (i.e. 2->3) within an environment?”

      Indeed. However, it will upgrade (and even downgrade) numpy versions, often even automatically. My question is whether this should be considered a bug, given that changing numpy versions can change behaviour.

      ” If your code on the other hand REQUIRES a copy(), you add a “.copy()” call, but that’s not more inefficient than before, is it?”

      The inefficiency is that, when running on versions of numpy that return a copy, this creates a second copy. Actually, I just noticed that in the case of `diagonal()`, this is even recommended as a way to make code work across versions https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.diagonal.html

      Btw, the way to consistently get a view of the diagonal is `arr.flat[::arr.shape[0]+1]`. So, if I need a copy, I write `arr.diagonal().copy()` and if I need a view I write `arr.flat[::arr.shape[0]+1]`. I avoid using `arr.diagonal()` as I don’t know what it will do.
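
      A small demonstration of the flat-index idiom (a square, C-contiguous array assumed):

          import numpy as np

          arr = np.arange(9.0).reshape(3, 3)
          # write to the diagonal in place, without caring what arr.diagonal()
          # returns on this numpy version:
          arr.flat[::arr.shape[0] + 1] = 0.0
          print(arr)   # the diagonal is now all zeros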

      *

      In the last bit, again the question is whether numpy versions should be treated like the python 2/3 switch, where materials were quite clearly targeted to a specific version.

      1. I agree with the downgrade confusion that sometimes occurs when package maintainers are too strict in their dependency specifications. Note that you can pin your numpy to a version (or minor version), which will be taken into account during “conda update --all” execution. That way you can update other dependencies (if they don’t insist on the newest numpy) while keeping your numpy version stable. https://conda.io/docs/user-guide/tasks/manage-pkgs.html#preventing-packages-from-updating-pinning
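
        For example, pinning is just a text file named pinned inside the environment’s conda-meta directory, with one spec per line, e.g.:

            numpy 1.11.*

        With that in place, “conda update --all” upgrades other packages but keeps numpy at 1.11.*.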

  3. Yes, API stability is my main issue with Python. Agree on all points.
    While numpy changes frequently, newer packages, such as the interesting visualization libraries (holoviews, bokeh, etc.), are even worse in this regard. This is quite understandable, as they are still maturing, but from an end-user perspective it is very frustrating.
    I end up writing code that can’t be used a couple of months later without rewriting it, because I want to use it in combination with other libraries that have conflicting version dependencies.
    It really hurts the Scipy/Numpy ecosystem.
