Thoughts on software sustainability

This is motivated by the latest Microbinfie podcast, episode #24: Experimental projects lead to experimental software, which discussed the issue of “scientific software sustainability”.

As background, in our group, we have made a public commitment to supporting our tools for at least 5 years, starting from the date of publication.¹

  1. I have increasingly become convinced that code made available for reproducibility, what I call Extended Methods, is one thing that should be encouraged and rewarded, but is very different from making tools, which is another thing that should be encouraged and rewarded. The criteria for evaluating these two modes of code release should be very different. The failure to distinguish between Extended Methods code and tools plagues the whole discussion in the discipline. For example, internal paths may be acceptable for Extended Methods, but should be an immediate Nope for tools: I would consider it immediate cause to recommend rejection if a manuscript tried to sell code with internal paths as a tool. On the other hand, it should also be more accepted to present code that is explicitly marked as Extended Methods and not necessarily intended for widespread use. Too often, there is pressure to pretend otherwise and to blur these concepts.
  2. Our commitment is only for tools, which we advertise as being for others to use. Code that is adjacent to a results-driven paper does not come with the same guarantees! How do you tell? If we made a bioconda release, it’s a tool.
  3. Our long-term commitment has the most impact before publication! In fact, ideally, it should have almost no impact on what we do after publication. This may sound like a paradox, but if tools are written by students/postdocs who will move on, the lab as a whole needs to avoid being in a situation where, after the student has moved on, someone else (including me!) is forced to debug unreadable, untested code written in 5 different programming languages with little documentation. On the other hand, if we have it set up so that it runs on a continuous integration (CI) system with an exhaustive test suite and it turns out that a new version of Numpy breaks one of the tests, it is not so difficult to find the offending function call, add a fix, check with CI that it builds and passes the tests, and push out a new release (a sketch of this kind of regression test follows this list). Similarly, having bioconda releases cuts down on support requests: we can simply tell users to install with bioconda.
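
As a purely illustrative sketch (normalize and its test are hypothetical, not code from any of our tools), this is the kind of small pytest-style regression test that CI runs on every change:

import numpy as np

def normalize(values):
    # Hypothetical library function: scale values to the [0, 1] range.
    values = np.asarray(values, dtype=float)
    return (values - values.min()) / (values.max() - values.min())

def test_normalize():
    # If a new Numpy release changes behaviour here, CI fails at this assert,
    # pointing directly at the offending function instead of at a user's bug report.
    assert np.allclose(normalize([2, 4, 6]), [0.0, 0.5, 1.0])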

¹ In practice, this will often be a longer period, on both ends, as tools are supported pre-publication (e.g., macrel is only preprinted, but we will support it) and post-publication (I am still fixing the occasional issue in mahotas, even though the paper came out in 2013).

Big Data Biology Lab Software Tool Commitments

Cross-posting from our group website.

Preamble. We produce two types of code artefacts: (i) code that supports the results in a results-driven paper and (ii) software tools intended for widespread use.

For an example of the first type, see the Code Ocean capsule linked to (Coelho et al., 2018). The main goal of this type of code release is to serve as an Extended Methods section to the paper. Hopefully, it will be useful for the small minority of readers who really want to dig into the methods or build upon the results, but the work aims at biological results.

This document focuses on the second type of code release: tools that are intended for widespread use. We’ve released a few of these in the last few years: Jug, NGLess, and Macrel. Here, the whole point is that others use the code. We also use these tools internally, but if nobody else ever adopts them, we will have fallen short.

The Six Commitments

  1. Five-year support (from date of publication). If we publish a tool as a paper, then we commit to supporting it for at least five years from the date of publication. We may stop developing new features, but if there are bugs in the released version, we will assume responsibility and fix them. We will also make any minor updates needed to keep the tool running (for example, if a new Python version breaks something in one of our Python-based tools, we will fix it). Typically, support is provided if you open an issue on the respective GitHub page and/or post to the respective mailing list.
  2. Standard, easy-to-install packages. Right now, this means: we provide conda packages. In the future, if the community moves to another system, we may move too.
  3. High-quality code with continuous integration. All our published packages have continuous integration and try to follow best practices in coding.
  4. Complete documentation. We provide documentation for the tools, including tutorials, example data, and reference manuals.
  5. Work well, fail well. We strive to make our tools not only work well, but also “fail well”: that is, when the user provides erroneous input, we attempt to provide good-quality error messages and to never produce bad output (including never producing partial output if the process terminates part-way through).
  6. Open source, open communication. Not only do we provide the released versions of our tools as open source, but all the development is done in the open as well.

Note for group members: This is a commitment from the group and, at the end of the day, it is Luis’ responsibility. If you leave the group, you are not responsible for the full 5 years. If you leave, your responsibility is just the basic responsibility of any author: to be responsive to queries about what was described in the manuscript, but nothing beyond that. What this does mean is that we will not submit papers on tools that risk being difficult to maintain. In fact, while the goals above are phrased as outside-focused, they are also internally important so that we can keep working effectively even as group members move on.

Thoughts on “Revisiting authorship, and JOSS software publications”

This is a direct response to Titus’ post: Revisiting authorship, and JOSS software publications. In fact, I started writing it as a comment there and it became so long that I decided it was better as its own post.

I appreciate the tone of Titus’ post, more asking questions than answering them, so here are my two or three cents:

There is nothing special about software papers. Different fields have different criteria for what constitutes authorship. I believe that particle physics has an approach close to “everyone who committed to the git repo gets to be an author”, which leads to papers with >1000 authors (note that I am not myself a particle physicist or anything close to it). At that point, the currency of authorship and citations is diluted to the point where I seriously don’t know what the point is. What is maybe special about software papers is that, because they are newer, there hasn’t been time for a rough consensus to emerge on what the criteria should be (I guess this is done in part in discussions like these). Having said that, even in well-established fields, you still have issues that are never resolved (in molecular biology: the question of whether technicians should be listed as authors brings out strong opinions on both sides).

My position is against the idea that every positive contribution deserves authorship. Some positive contributions are significant enough that they deserve authorship; others, acknowledgements; others can even go unmentioned. Yes, it’s a judgement call what is “significant” and what is not, but I think someone who, for example, reports a bug that leads to a bugfix is a clear NO (even if it’s a good bug report). Even small code contributions should not lead to authorship (perhaps an acknowledgement is where my indecision boundary is at that point). People who proofread a manuscript also don’t get authorship, even if they find a few typos or suggest a few wording changes.

Of course, contributions need not be code. Tutorials, design, &c, all count. But they should be significant. I also would not consider adding as an author an individual who asked a good question during a seminar on the work, even though those sometimes turn out to be like good bug reports, in that you improve the work based on them. The fact that significance is a judgement call does not imply that we should drop it as a criterion.

I think authorship is also about responsibility. If you are an author, then you must take responsibility for some part of the work (naturally, not all of it, but some of it). If there are issues later, it is your responsibility to, at the very least, explain what you did, or even to fix it, &c. You should be involved in writing the paper and, if some work is needed during revision on your particular aspect of the code, you need to do it.

From my side, I have submitted several patches to projects which were best efforts at the time, but I don’t want to take any responsibility for those projects beyond that. If the author of one of those projects now told me that I needed to redo my patch to work on Mac OS X because one of the reviewers complained about it, I’d tell them “sorry, I cannot help you”. I don’t think authors should get to do that.

I would get off-topic here, but I also wish there was more of an explicit expectation that if you publish a software paper, you shall provide minimal maintenance for 5-10 years. Too often software papers are the obituary of the project rather than the announcement.

My answer to “Another question: does authorship keep accruing over versions? Should all the authors on sourmash 2.0 be authors on sourmash 3.0?” is a strong NO. You don’t double dip. If anything, I think it’s generally the authors of version 2 that often lose out as people benefit from their work and keep citing version 1.

Finally, a question of my own: what if someone does something outside the project that clearly benefits the project, should they be authors? For example, what if someone does a bioconda package for your code? Or writes an excellent blogpost/tutorial that brings you a very large number of users (or runs a tutorial on it)? Should they be authors? My answer is that it first needs to be significant, and it should not be automatic. It may be appropriate to invite them for authorship, but they should then commit to (at the very least) reading the manuscript and keeping up with the development of the project over the medium-term.

Utilitarian Scientific Software Development

Yesterday, I added a new feature to ngless: if the user asks it to run a non-existent script, it will try to give an error message with a guess of what you probably meant.

For example, if you type Profiles.ngl, but the script is actually called profile.ngl:

$ ngless Profiles.ngl

Exiting after fatal error:
File `Profiles.ngl` does not exist. Did you mean 'profile.ngl' (closest match)?

Previously, it would just say Profiles.ngl not found, without a suggestion.
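
Ngless itself is written in Haskell, but the underlying idea is simple enough to sketch in a few lines of Python (this is an illustration only, not the actual implementation; missing_file_message is a made-up helper):

import difflib

def missing_file_message(requested, available):
    # Suggest the closest existing filename, if any is reasonably close.
    matches = difflib.get_close_matches(requested, available, n=1)
    if matches:
        return "File `%s` does not exist. Did you mean '%s' (closest match)?" % (requested, matches[0])
    return "File `%s` does not exist." % requested

print(missing_file_message('Profiles.ngl', ['profile.ngl', 'helper.ngl']))
# File `Profiles.ngl` does not exist. Did you mean 'profile.ngl' (closest match)?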

It took me about 10-15 minutes to implement (most of the functionality was already present, as ngless already implemented something similar in another context). Is it worth it?

When is it worth it to add bells & whistles to research software?

I think we should think about it, in an ideal world, using the utilitarian principle of research software development: software should reduce the overall human effort. If this feature saves more time overall than it took to write, then it’s worth it.

This Utilitarian Principle says that these 15 minutes were well invested if (and only if) this ngless feature saves more than 15 minutes for all its users over its lifetime. I expect that every time a user triggers this error, they’ll save a few seconds (say 2 seconds). 15 minutes is 900 seconds. Thus, this feature is worth it if it is triggered over 450 times. Given that we hope that ngless will be widely used, this feature is certainly worth it.
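
The break-even arithmetic is simple enough to write down explicitly (the 2-second figure is, of course, just a guess):

implementation_cost = 15 * 60   # ~15 minutes of developer time, in seconds
saved_per_trigger = 2           # rough guess: ~2 seconds saved each time the error triggers
break_even = implementation_cost / saved_per_trigger
print(break_even)               # 450 triggers and the feature has paid for itself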

This principle also implies that it would not be worth adding such a feature to a script that is only used in an internal analysis. So code that is only meant to be used by myself, or by myself and a small number of others, should have fewer bells & whistles.

In a non-ideal world, we need to take into account the incentives of the scientific (or commercial) world and the possibility of market failure: the market does not always reward the most selfless behaviour (this includes the “market” for scientific rewards where shiny new things are “paid” more than boring old maintenance).

Fast and useful errors with ngless [3/5]

NOTE: As of Feb 2016, ngless is available only as a pre-release, to allow for testing of early versions by fellow scientists and to discuss ideas. We do not consider it released and do not recommend use in production, as some functionality is still in testing. Please get in touch if you are interested in using ngless in your projects.

This is the third of a series of five posts introducing ngless.

  1. Introduction to ngless
  2. Perfect reproducibility using ngless
  3. Fast and high quality error detection [this post]
  4. Extending and interacting with other projects
  5. Miscellaneous

If you are the rare person who just writes code without bugs (if your favorite editor is cat), then you can skip this post, as it only concerns those of us who make mistakes. Otherwise, I will assume that your code will have bugs. Your code will have silly typos and large mistakes.

Too many tools work well, but fail badly; that is, if all their dependencies are there, all the files are exactly right, and the user specifies all the right options, then the tool will work perfectly; but make any mistake and you will get a bizarre error that is hard to fix. In other words, the tool is bad at failing. Ngless promises to work well and to fail well.

Make some errors impossible

Let us recall our running example:

ngless "0.0"
import OceanMicrobiomeReferenceGeneCatalog version "1.0"

input = paired('data/data.1.fq.gz', 'data/data.2.fq.gz')
preprocess(input) using |read|:
    read = substrim(read, min_quality=30)
    if len(read) < 45:
        discard

mapped = map(input, reference='omrgc')
summary = count(mapped, features=['ko', 'cog'])
write(summary, ofile='output/functional.summary.txt')

Note that we do not specify paths for the ‘omrgc’ reference or the functional map file. We also do not specify files for intermediate files. This is all implicit and you cannot mess it up. The Fastq encoding is auto-detected, removing one more opportunity for you to mess up (although you can specify the encoding if you really want to).

Ngless always uses the three-step safe output-writing pattern:

  1. write the output to a temp file,
  2. sync the file and its directory to disk,
  3. rename the temp file to the final output name.

The final step is atomic: the operating system guarantees that it either completes fully or not at all, even if there is an error, so that you never get a partial file. Thus, if there is an output file, you know that ngless finished without errors (up to that point, at least) and that the output is correct. No more asking “did the cluster crash affect this file? Maybe I need to recompute, or maybe I should count the number of lines to make sure it’s complete.” None of that: if the file is there, it is complete.

Side-note: programming languages (or their standard libraries) should have support for this safe-output writing pattern. I do not know of any language that does.
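
For illustration, here is a minimal sketch of the pattern in Python (ngless implements this internally, in Haskell; safe_write is a hypothetical helper, and error handling is kept minimal):

import os
import tempfile

def safe_write(path, data):
    dirname = os.path.dirname(path) or '.'
    # 1. write the output to a temporary file in the same directory
    fd, tmp_path = tempfile.mkstemp(dir=dirname, prefix='.partial-')
    try:
        with os.fdopen(fd, 'wb') as f:
            f.write(data)
            # 2. sync the file contents to disk
            f.flush()
            os.fsync(f.fileno())
        # 3. atomically rename the temporary file to its final name
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise
    # also sync the directory, so the rename itself is durable
    dir_fd = os.open(dirname, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)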

Make error detection fast

Have you ever run a multi-step pipeline where the very last step (often saving the results) has a silly typo and everything fails disastrously at that point, wasting hours of compute time? I know I have. Ngless tries as hard as possible to make sure that doesn’t happen.

Although ngless is interpreted rather than compiled, it performs an initial pass over your script to check for all sorts of possible errors.

Ngless is a typed language and all types are checked, so that if you try to run the count function without first mapping, you will get an error message.

All arguments to functions are also checked. This even checks some rules that would be hard to impose using a more general purpose programming language: for example, when you call count, either (1) you are using a built-in reference which has its own annotation files or (2) you have to pass in the path to a GTF or gene map file so that the output of the mapping can be annotated and summarized. This constraint would be hard to express in, for example, Java or C++, but ngless can check this type of condition easily.

The initial check makes sure that all necessary input files exist and can be read and even that any directories used for output are present and can be written to (in the script above, if a directory named output is not present, you will get an immediate error). If you are using your own functional map, it will read the file header to check that any features you use are indeed present (in the example above, it checks that the ‘ko’ and ‘cog’ features exist in the built-in ocean microbiome catalog).

All typos and other similar errors are caught immediately. If you mistype the name of your output directory, ngless will let you know in 0.2 seconds rather than after hours of computation.
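
This kind of up-front validation is easy to emulate even in an ordinary script. A rough sketch in Python, with hypothetical paths:

import os
import sys

inputs = ['data/data.1.fq.gz', 'data/data.2.fq.gz']
output_dir = 'output'

# Check every input and output location before doing any real work
errors = []
for path in inputs:
    if not os.access(path, os.R_OK):
        errors.append('Cannot read input file: %s' % path)
if not (os.path.isdir(output_dir) and os.access(output_dir, os.W_OK)):
    errors.append('Cannot write to output directory: %s' % output_dir)
if errors:
    sys.exit('\n'.join(errors))

# ... only now does the hours-long computation start ...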

You can also just run the checks with ngless -n script-name.ngl: it does nothing except run all the validation steps.

Again, this is an idea that could be interesting to explore in the context of general purpose languages.

Make error messages helpful

[Screenshot of a dialog box reading “An unknown error occurred”: an unhelpful error message.]

As much as possible, when an error is detected, the message should help you make sense of it and fix it. A tool cannot always read your mind, but as much as possible, ngless error messages are descriptive.

For example, if you used an illegal argument/parameter to a function, ngless will remind you of what the legal arguments are. If it cannot write to an output file, it will say that it cannot write to an output file (and not just “IO Error”). If a file is missing, it will tell you which file (and it will tell you in about 0.2 seconds).

Summary

Ngless is designed to make some errors impossible, while trying hard to give you good error messages for the errors that will inevitably crop up.

Why Python is Better than Matlab for Scientific Software


This is an argument I made at EuBIAS when arguing for the use of mahotas and the Python numpy-stack for bioimage informatics. I had a few conversations around this and decided to write a longer post summarizing the whole argument.

0. My argument mostly applies to new projects

If you have legacy code in Matlab, then it may make sense to just continue using it. If it works, don’t fix it. However, if your Matlab code keeps causing you pain, Python might be a solution.

Note too that porting code is not the same as writing from scratch. You can often convert code from MATLAB to Python in a small fraction of the time it would take you to start from scratch.

1. Python has caught up with Matlab and is in the process of overtaking it.

This is my main argument: the momentum is in Python’s direction. Even two or three years ago, Python was behind. Now, it’s sailing past Matlab.

[Graph: number of lines of code over time in numpy, matplotlib, mahotas, skimage, and sklearn]

This graph shows the number of lines of code in some important projects for bioimage informatics/science in general (numpy, matplotlib, mahotas, skimage, and sklearn). As you can see, the base projects on the top (numpy and matplotlib) have been stable for some years, while the more applied packages at the bottom have exploded in recent years.

Depending on what you are doing, Python may already support it better. It is now Matlab that is playing catch-up with open source software (for example, Matlab is now introducing its own version of dataframes, which Python has through Pandas [itself, a version of R’s dataframe object]).

The Python projects are also newer and tend, therefore, to be programmed in a more modern way: it is typical to find automated testing, excellent and complete documentation, a dedicated peer-reviewed publication, &c. This ain’t your grandfather’s open source with a dump on sourceforge and a single README file full of typos.

As an example of the large amount of activity going on in the Python world, just this week, Yhat released ggplot for Python [1]. So, while last week I was still pointing to plotting as one of the weaknesses of Python, that may no longer be true.

2. Python is a real programming language

Matlab is not; it is a linear algebra package. This means that if you need to add some non-numerical capabilities to your application, it gets hairy very fast.

For scientific purposes, when writing a small specialized script, Python may often be the second-best choice: for linear algebra, Matlab may have nicer syntax; for statistics, R is probably nicer; for heavy regular expression usage, Perl (ugh) might still be nicer; if you want speed, Fortran or C(++) may be a better choice; to design a webpage, perhaps you want node.js. Python is not perfect for any of these, but it is acceptable for all of them.

In every area, specialized languages are the best choice, but Python is the second best in more areas [2].

3. Python can easily interface with other languages

Python can interface with any language that can interact through C, which is most languages. There is a missing link to some important Java software, but work is being done to address that too. Technically, the same is true of Matlab.

However, the Python community, and especially the scientific Python community, has been extremely active in developing tools to make this as easy as you’d like (e.g., Cython). Therefore, many tools/libraries in C-based languages are already wrapped in Python for you. I semi-routinely get sent little snippets of R code to do something. I will often just import rpy2 and use it from Python, without having to change the rest of my code.
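
As an illustration of how little ceremony this involves, here is a minimal sketch of calling an R snippet from Python with rpy2 (assuming R and rpy2 are installed):

import rpy2.robjects as robjects

# Run a bit of R code directly and pull the result back into Python
pvalue = robjects.r('t.test(rnorm(100), rnorm(100, mean=0.5))$p.value')[0]
print(pvalue)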

4. With Python, you can have a full open-source stack

This means that you are allowed to, for example, ship a whole virtual machine image with your paper. You can also look at all of the code that your computation depends on. No black boxes.

5. Matlab Licensing issues are a pain. And expensive.

Note that I left the word expensive for the end, although in some contexts it may be crucial. Besides the basic Matlab licenses, you will often need to buy a few more licenses for specific toolboxes. If you need to run the code on a cluster, that will often mean more licenses.

However, even when you do have the money, this does not make the problem go away: now you need to administer the licensing. When I was at CMU, we had campus-wide licenses, and yet it took a while to configure the software on every new user’s computer (with some annoying minor issues, like the fact that the username on your private computer needed to match the username you had on campus); you couldn’t run it outside the network (unless you set up a VPN, but this still means you need network access to run a piece of software); &c. Every so often, the license server would go down and stop everybody’s work. These secondary costs can be as large as the licensing costs themselves.

Furthermore, using Python means you can more easily collaborate with people who don’t have access to Matlab. This includes a future version of yourself who decides to economize on Matlab licenses (and if the number of users shrinks and your institution decides to drop the campus licensing, you will not be one of the few groups now forced to buy it out-of-pocket).

By the way, if you do want support, there are plenty of options for purchasing it [3]: larger companies as well as individual consultants available in any city. The issue of support is orthogonal to the licensing of the software itself.

§

Python does have weaknesses, of course; but that’s for another post.

[1] Yhat is a commercial company releasing open-source software, by the way; for those keeping track at home.
[2] I remember people saying this about C++ back in the day.
[3] However, because science is a third-world economy, it may be easier to spend 10k on Matlab licenses that come with phone support than to spend 1k on a local Python consultant.

To reproduce the paper, you cannot use the code we used for the paper

Over the last few posts, I described my nuclear segmentation paper.

It has a reproducible research archive.

§

If you now download that code, that is not the code that was used for the paper!

In fact, the version that generates the tables in the paper does not run anymore, because it only runs with old versions of numpy!

In order for it to perform the computation in the paper, I had to update the code. In order to run the code as it was used in the paper, you need to get old versions of the software.

§

To some extent, this is due to numpy’s frustrating lack of forward compatibility [1]. The issue at hand was the changed semantics of the histogram function.

In the end, I think I completely avoided that function in my code for a few years as it was toxic (when you write libraries for others, you never know which version of numpy they are running).
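
One defensive option, sketched below, is to pin the semantics down explicitly instead of relying on version-dependent defaults:

import numpy as np

data = np.random.rand(1000)
# Passing explicit bin edges (instead of a bin count) removes any dependence
# on how a particular numpy version interprets the default bins.
edges = np.linspace(0.0, 1.0, 11)
counts, _ = np.histogram(data, bins=edges)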

§

But as much as I can gripe about numpy breaking code between minor versions, they would eventually be justified in changing their API with the next major version change.

In the end, the half-life of code is such that each year, it becomes harder to reproduce older papers even if the code is available.

[1] I used to develop for the KDE Project, where you did not break users’ code, ever; so I find it extremely frustrating to have to explain that you should not change an API on aesthetic grounds between minor versions.

Working Around Bugs in Third Party Libraries

This is another story of continuous improvement, this time in mahotas’ little brother imread.

Last week, Volker Hilsenstein here at EMBL had a few problems with imread on Windows. This is one of those very hard issues: how to help someone on a different platform, especially one which you know nothing about?

In the end, the problem was not Windows per se, but an old version of libtiff. In that version, there is a logic error (literally, there is a condition which is miswritten and always false) and the code will attempt to read a TIFF header from a file even when writing. Mahotas-imread was not ready for this.

Many (especially in the research open-source world, unfortunately) would just say: well, I won’t support broken versions of libtiff; if your code does not adhere to the spec, then my code simply won’t work for you. See this excellent old essay by Joel Spolsky on this sort of thing.

In my case, I prefer to work around the bug: when libtiff tries to read in write mode, we return no data, which it correctly handles. I wrote the following data-reading function to pass to libtiff:

// Dummy read callback: report zero bytes read, so that a buggy libtiff
// which attempts to read while in write mode simply gets no data.
tsize_t tiff_no_read(thandle_t, void*, tsize_t) {
        return 0;
}

The purpose of this code is simply to make imread work even on a broken, 5-year-old version of a third-party library.

§

In the meantime, we also fixed compilation on Cygwin, as well as a code path which led to a hard crash.

The possibility of a hard crash, especially, made me decide that this was important enough to merit a new release.

Extended Depth of Field In Python using Mahotas

I wanted to implement an extended depth of field algorithm last week for visualisation.

The idea of extended depth of focus is that you have a stack of images, taken at different focal points, and you build a single artificial image so that you get everything in focus. Some simple methods work OK, some very advanced methods work better. We were happy with OK for our setting. Here’s how to do it with mahotas:

Start with standard imports:

import numpy as np
import mahotas as mh

We are going to assume that you have an image object, which has three dimensions: the stack, height, and width:

stack,h,w = image.shape
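
If you are starting from a set of image files on disk, such a stack could be built like this (the filenames are hypothetical; any stack of same-sized 2-D slices works):

filenames = ['stack_%02d.png' % i for i in range(10)]
image = np.array([mh.imread(f, as_grey=True) for f in filenames])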

We use mh.sobel as the measure of how in-focus each pixel is [1]:

focus = np.array([mh.sobel(t, just_filter=True) for t in image])

Now, we select the best slice at each pixel location:

best = np.argmax(focus, 0)

So far, very easy. The next part is the hard part. We want to do the following:

r = np.zeros((h,w))-1
for y in range(h):
    for x in range(w):
        r[y,x] = image[best[y,x], y, x]

But this is very slow (never run nested loops in Python if you can avoid it). We get the same result with a slightly less legible, but faster manipulation [2]:

image = image.reshape((stack,-1)) # image is now (stack, nr_pixels)
image = image.transpose() # image is now (nr_pixels, stack)
r = image[np.arange(len(image)), best.ravel()] # Select the right pixel at each location
r = r.reshape((h,w)) # reshape to get final result
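
As a side note on footnote [2]: with a recent numpy (1.15 or later, so newer than this post), the same selection can be written directly on the original 3-D stack, avoiding the explicit index array (an untested sketch):

# `image` here refers to the original (stack, h, w) array, before the
# reshape/transpose above; `best` is the (h, w) argmax array from earlier.
r = np.take_along_axis(image, best[None, :, :], axis=0)[0]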

Et voilà!

§

Here is an example, from a stack of plankton images. This is the maximum intensity projection:

[Image: maximum intensity projection]

This is the most in-focus slice (using the sobel operator):

[Image: most in-focus slice]

And this is the extended depth of field result:

[Image: extended depth of field result]

It is clearly sharper (look at some of the organelles on the top-left: they are a blur in the other two images), at the expense of some possible noise. I actually played around with blurring the image a little bit and it did improve things ever so slightly.

§

Cite the mahotas paper if using this in a scientific publication.

[1] Other methods simply use a different measure here.
[2] I am not 100% convinced that this is the best approach. After all, we create an index array of size len(image) just to do the selection. I would be happy to find an alternative.