Thoughts on software sustainability

This is motivated by the latest Microbinfie podcast, episode #24: Experimental projects lead to experimental software, which discussed the issue of “scientific software sustainability”.

As background: in our group, we have made a public commitment to supporting our tools for at least 5 years, starting from the date of publication.¹

  1. I have become increasingly convinced that code made available for reproducibility, what I call Extended Methods, should be encouraged and rewarded, but is very different from making tools, which should also be encouraged and rewarded, under different criteria. The failure to distinguish between Extended Methods code and tools plagues the whole discussion in the discipline. For example, hardcoded internal paths may be acceptable for Extended Methods, but should be an immediate Nope for tools: I would consider it immediate cause to recommend rejection if a manuscript tried to sell code with internal paths as a tool. On the other hand, it should also be more accepted to present code that is explicitly marked as Extended Methods and not necessarily intended for widespread use. Too often, there is pressure to pretend otherwise and blur these concepts.
  2. Our commitment is only for tools, which we advertise as for others to use. Code that is adjacent to a results-driven paper does not come with the same guarantees! How do you tell? If we made a bioconda release, it’s a tool.
  3. Our long-term commitment has the most impact before publication! Ideally, it should have almost no impact on what we do after publication. This may sound like a paradox, but if tools are written by students and postdocs who will move on, the lab as a whole needs to avoid the situation where, after the student has moved on, someone else (including me!) is forced to debug unreadable, untested code written in 5 different programming languages with little documentation. On the other hand, if the code runs on a continuous integration (CI) system with an exhaustive test suite and a new version of Numpy breaks one of the tests, it is not so difficult to find the offending function call, add a fix, check with CI that it builds and passes the tests, and push out a new release (the sketch after this list shows the kind of regression test I mean). Similarly, having bioconda releases cuts down on support requests: we can simply point users to bioconda to install.
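
As a minimal sketch of what such a CI-run test looks like, here is a pytest-style regression test for a hypothetical tool function (normalize_abundances is made up for illustration): fixed inputs, fixed expected outputs, so that a behavioural change in a dependency fails loudly in CI and points at the offending call.

import numpy as np

def normalize_abundances(counts):
    '''Hypothetical tool function: divide each sample (row) by its total.'''
    counts = np.asarray(counts, dtype=float)
    return counts / counts.sum(axis=1, keepdims=True)

def test_normalize_abundances():
    # Fixed input, fixed expected output: if a new numpy release changes
    # behaviour, CI flags this test instead of a user hitting the bug
    res = normalize_abundances([[1, 3], [2, 2]])
    assert np.allclose(res, [[0.25, 0.75], [0.5, 0.5]])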

¹ In practice, this will often be a longer period, on both ends, as tools are supported pre-publication (e.g., macrel is only preprinted, but we will support it) and post-publication (I am still fixing the occasional issue in mahotas, even though the paper came out in 2013).

Numpy/scipy backwards stability debate (and why freezing versions is not the solution)

This week, a discussion broke out about the stability of the Python scientific ecosystem. It was triggered by a blogpost from Konrad Hinsen, which led to several twitter follow-ups.

First of all, let me say that numpy/scipy are great. I use them and recommend them all the time. I am not disparaging the projects as a whole or the people who work on them; I would just prefer if they were more stable. Given twitter's limitations, perhaps this was not as clear as I would have liked in my own twitter response.

I pointed out that I have been bitten by API changes:

All of these are individually minor (and can be worked around), but they are just the issues that I have personally run into and that caused enough problems for me to remember them. The most serious was the mannwhitneyu change, which was silent (i.e., the function started returning a different result rather than raising an exception or another type of error).
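
As a concrete sketch of the defensive style this forces on you (assuming a scipy recent enough to accept the alternative keyword, which was added in 0.17): passing alternative explicitly pins the behaviour, instead of relying on a default that changed across scipy releases.

import numpy as np
from scipy import stats

rng = np.random.RandomState(42)
xs = rng.normal(0.0, 1.0, size=100)
ys = rng.normal(0.5, 1.0, size=100)

# Being explicit about the alternative hypothesis means the result does
# not depend on which scipy version happens to be installed
u, p = stats.mannwhitneyu(xs, ys, alternative='two-sided')
print(u, p)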

*

Konrad had pointed out the Linux kernel project as one extreme: the "we never break user code" policy.

The other extreme is the Silicon-Valley-esque “move fast and break stuff”, which is appropriate for a new project. These are not binary decisions, but two extremes of a continuum. I would hope to see numpy move more towards the “APIs are contracts we respect” side of the spectrum as I think it currently behaves too much like a startup.

Numpy does not use semantic versioning, but if it did, almost all of its releases would be major releases, as they almost always break some code; we would be at Numpy 14.0 by now. Semantic versioning would allow for a smaller number of large, possibly-breaking releases (every few years) instead of a constant stream of minor backwards-incompatible changes; we would have Numpy 4.2 now, with a list of deprecated features to be removed in 5.0.

Some of the problems that have arisen could have been avoided by (1) introducing a new function with the better behaviour, (2) deprecating the old one, and (3) removing the original version a few years later (in a major release, for example). This would avoid the most important problem: silent changes.
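
A minimal sketch of that pattern, with hypothetical functions (median_v2 stands in for whatever the corrected behaviour is):

import warnings

def median_v2(values):
    '''New function: interpolates between the two middle values for even n.'''
    vs = sorted(values)
    n = len(vs)
    if n % 2 == 1:
        return vs[n // 2]
    return (vs[n // 2 - 1] + vs[n // 2]) / 2.0

def median(values):
    '''Old function: keeps its old behaviour for a few releases, but warns loudly.'''
    warnings.warn(
        'median() is deprecated and will be removed in release 5.0; '
        'use median_v2() instead',
        DeprecationWarning,
        stacklevel=2)
    vs = sorted(values)
    return vs[len(vs) // 2]  # old behaviour: no interpolation for even n

Users get a warning they can act on at their own pace, and, crucially, the old function never silently starts returning different results.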

*

A typical response is to say "well, just use anaconda (or similar) to freeze your dependencies". Again, let me be clear: I use and recommend anaconda for everything. It's great. But, in the context of the backwards compatibility problem, I don't think this recommendation is really thought through, as it only solves a fraction of the problem at hand (an important fraction, but it's not a magic wand). (See also this post by Titus Brown.)

What does anaconda not solve? The problem of the intermediate layer: libraries which use numpy, but are themselves meant to be used by final users. What is the expectation here? That I release my computer vision code (mahotas) with a note: only works on Numpy 1.11? What if I want a project that uses both mahotas and scikit-learn, but scikit-learn only works on Numpy 1.12? Is the conclusion that you cannot mix mahotas and scikit-learn? That would be worse than the Python 2/3 split. A typical project of mine might use >5 different numpy-dependent libraries. What are the chances that they all expect the exact same numpy version?

Right now, the solution I use in my code is: if I know that a numpy function has changed behaviour, work around it, avoid it, or reimplement it (perhaps by copying and pasting from numpy). For example, some functions return views or copies depending on which version of numpy you have. To handle that, I add a copy() call after all of them, so I always have a copy. It's computationally inefficient, but avoiding even a single bug over a few years probably saves more time in the end.
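
A minimal sketch of that workaround; np.diagonal is one documented case (it used to return a copy and, in recent numpy versions, returns a read-only view):

import numpy as np

m = np.arange(9).reshape((3, 3))

# Whether np.diagonal returns a copy or a read-only view depends on the
# numpy version; .copy() makes it a private, writeable array either way
d = np.diagonal(m).copy()
d[0] = -1  # safe on any numpy version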

It also happens all the time that I have an anaconda environment, and adding a new package upgrades or downgrades numpy. Is this to be considered buggy behaviour by anaconda? Anaconda currently does not upgrade everything to Python 3 when you request a package that is not available on Python 2, nor does it downgrade from 3 to 2; why should it treat numpy any differently if there is no guarantee that behaviour is consistent across numpy versions?

Sometimes the code at hand is not even an officially released library, but some code from another project. Say I have code that takes a metagenomics abundance matrix, does some preprocessing, and computes stats and plots. I might have written it for a paper a few years back, but now want to run the same analysis on new data. Is the recommendation that I rewrite it from scratch because there is a new numpy version? What if someone else asks me for the code? Is the recommendation that I ask, "are you still on numpy 1.9? Because I only really tested it there"? Note that Python's dynamic nature actually makes this problem worse than in statically checked languages.

What about training materials? As I also wrote on twitter, it's when teaching Python that I suffer most from Python 2-vs-Python 3 issues. Is the recommendation that training materials clearly state "This is a tutorial for numpy 1.10 only. Please downgrade to that version or search for a more up-to-date tutorial"? Note that beginners are the ones most likely to struggle with these issues. I can perfectly understand what it means that "array == None and array != None do element-wise comparison" (from the numpy 1.13 release notes). But if I were just starting out, would I understand it immediately?
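
For the record, a tiny sketch of what that release note means in practice; the second line is the identity test a beginner probably wanted:

import numpy as np

arr = np.zeros(3)
print(arr == None)  # element-wise: array([False, False, False])
print(arr is None)  # identity test: False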

Freezing the versions solves some problems, but does not solve the whole issue of backwards compatibility.

Building ImageJ Hyperstacks from Python


ImageJ calls a 4D image a hyperstack (it can actually be 5D or even higher). You can save these and open them, and ImageJ will show a nice GUI for them.

Unfortunately, it's not very well documented (if at all) how ImageJ recognises some images as hyperstacks. This is what I found: a hyperstack is a multi-page TIFF whose image description tag contains information on the hyperstack.

I can generate one by outputting the individual slices to files (in the right order) and then calling tiffcp on the command line to concatenate them together. If the metadata is there, ImageJ will recognise it as a hyperstack.

§

Here is how to do it in Python and mahotas-imread [1].

Define the metadata (as a template):

_imagej_metadata = """ImageJ=1.47a
images={nr_images}
channels={nr_channels}
slices={nr_slices}
hyperstack=true
mode=color
loop=false"""

Now, we write a function which takes a z-stack and a filename to write to:

def output_hyperstack(zs, oname):
    '''
    Write out a hyperstack to ``oname``

    Parameters
    ----------
    zs : 4D ndarray
        dimensions should be (c,z,x,y)
    oname : str
        filename to write to
    '''

Some basic imports:

import tempfile
import shutil
from os import system
# Assuming mahotas is installed with the imread backend, so that mh.imsave
# can write the TIFF image description tag:
import mahotas as mh
try:
    # We create a directory to save the results
    tmp_dir = tempfile.mkdtemp(prefix='hyperstack')

    # Channels are in first dimension
    nr_channels = zs.shape[0]
    nr_slices = zs.shape[1]
    nr_images = nr_channels*nr_slices
    metadata = _imagej_metadata.format(
                    nr_images=nr_images,
                    nr_slices=nr_slices,
                    nr_channels=nr_channels)

We build up the final metadata string by substituting the right values into the template. Now, we output all the images as separate TIFF files:

frames = []
index = 0
for s1 in range(zs.shape[1]):
    for s0 in range(zs.shape[0]):
        fname = '{}/s{:03}.tiff'.format(tmp_dir, index)
        # Do not forget to output the metadata!
        mh.imsave(fname, zs[s0, s1], metadata=metadata)
        frames.append(fname)
        index += 1

We call tiffcp to concatenate all the inputs:

cmd = "tiffcp {inputs} {tmp_dir}/stacked.tiff".format(inputs=" ".join(frames), tmp_dir=tmp_dir)
r = system(cmd)
if r != 0:
    raise IOError('tiffcp call failed')
shutil.copy('{tmp_dir}/stacked.tiff'.format(tmp_dir=tmp_dir), oname)

Finally, we remove the temporary directory:

finally:
    shutil.rmtree(tmp_dir)

And, voilà! This function will output a file with the right format for ImageJ to think it is a hyperstack.
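
For instance (a hypothetical 2-channel, 5-slice stack of random 128x128 images, just to exercise the function):

import numpy as np

zs = (255 * np.random.random_sample((2, 5, 128, 128))).astype(np.uint8)
output_hyperstack(zs, 'hyperstack.tiff')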

[1] Naturally, you can use other packages, but you need one which lets you write the image description TIFF tag.

Mahotas-imread Now Accepts Options When Writing

This week, I committed some code to mahotas-imread to allow for setting options when saving:

from imread import imsave
image = ...
imsave('file.jpeg', image, opts={ 'jpeg:quality': 95 })

This saves the image array to the file file.jpeg with quality 95 (out of 100).

§

This is only available in the version from github (at the moment), but I will probably put up a new release soon.

(If you like and use mahotas, please cite the software paper.)

Using imread to save disk space

Imread recently gained the ability to read & write metadata.

We deal with images around here and they can get very large in terms of disk space. To make things worse, the microscope does not save them in compressed form.

Imread, however, saves in compressed TIFF. So, we need to (1) open the file and (2) resave it, without losing the metadata that comes with the file in the process.

This is what I ended up with:

import os
from os import path
import tempfile
import imread

def resave_file(f):
    '''
    resave_file(f)

    Resave a file using imread, preserving metadata

    Parameters
    ----------
    f : str
        Filename
    '''
    imdata, meta = imread.imread(f, return_metadata=True)
    # Save to a temporary file in the same directory, so the final rename
    # stays within one filesystem (and is therefore atomic)
    tf = tempfile.NamedTemporaryFile('w',
                prefix='imread_resave_',
                suffix='.tiff',
                delete=False,
                dir=path.dirname(f))
    tf.close()
    imread.imsave(tf.name, imdata, metadata=meta)
    os.rename(tf.name, f)

On a test directory, disk usage went from 55GB down to 12GB. We use a two-step process (save to a temporary file, then rename it over the original) so that an interruption cannot leave a half-written file in place of the data.

Note: This only works with the github version of imread.

Making Your Mistakes in Public

I recently released a new version of mahotas, my computer vision package. It was version 1.0.1, which was quickly followed by 1.0.2.

1.0 had introduced a mistake, which was caught and fixed by Tony S Yu on github (issue 33). Along with a few other improvements, this warranted a new release.

Then, 1.0.1 had a little mistake of its own, caught by Jean-Patrick Pommier on the pythonvision mailing list.

Thus, 1.0.2 was necessary. I released 1.0.2 about an hour after I was alerted to the errors.

§

This is why I develop my scientific code as open source: (1) other people catch your mistakes and fix them for you (and the whole community wins)! (2) Because the code was public, I felt a great urge to fix the mistakes very fast. I even added tests to make sure they would not happen again. I am harnessing public pressure for good.

Public code becomes better, as its openness pushes me to keep it as pristine as possible.