Utilitarian Scientific Software Development

Yesterday, I added this new feature to ngless: if the user asks it to run a non-existent script, it will try it give an error message with a guess of what you probably really meant.

For example, if you type Profiles.ngl, but the script is actually called profile.ngl:

$ ngless Profiles.ngl

Exiting after fatal error:
File `Profiles.ngl` does not exist. Did you mean 'profile.ngl' (closest match)?

Previously, it would just say Profiles.ngl not found, without a suggestion.

It took me about 10-15 minutes to implement this (actually most of the functionality was already present as ngless already implemented similar functionality in other context). Is it worth it?

When is it worth it to add bells & whistles to research software?

I think we should think about it, in an ideal world, using the utilitarian principle of research software development: software should reduce the overall human effort. If this feature saves more time overall than it took to write, then it’s worth it.

This Utilitarian Principle says that these 15 minutes were well invested if (and only if) this ngless features saves more than 15 minutes for all its users over its lifetime. I expect that every time an user triggers this error, they’ll save a few seconds (say 2 seconds). 15 minutes is 900 seconds. Thus, this feature is worth it if it is triggered over 450 times. Given that we hope that ngless will be widely used, this feature is certainly worth it.

This principle also makes the argument that it would not be worth to add such a feature to a script that is only used in an internal analysis. So, code that was only meant to be used by myself or by myself and a small number of others, should have fewer bells & whistles.

In a non-ideal world, we need to take into account the incentives of the scientific (or commercial) world and the possibility of market failure: the market does not always reward the most selfless behaviour (this includes the “market” for scientific rewards where shiny new things are “paid” more than boring old maintenance).

Advertisements

Fast and useful errors with ngless [3/5]

NOTE: As of Feb 2016, ngless is available only as a pre-release to allow for testing of early versions by fellow scientists and to discusss ideas. We do not consider it /released/ and do not recommend use in production as some functionality is still in testing. Please if you are interested in using ngless in your projects.

This is the first of a series of five posts introducing ngless.

  1. Introduction to ngless
  2. Perfect reproducibility using ngless
  3. Fast and high quality error detection [this post]
  4. Extending and interacting with other projects
  5. Miscellaneous

If you are the rare person who just writes code without bugs (if your favorite editor is cat), then you can skip this post as it only concerns those of us who make mistakes. Otherwise, I will assume that /your code will have bugs/. Your code will have silly typos and large mistakes.

Too many tools work well, but fail badly; that is, if all their dependencies are there, all the files are exactly perfect and the user specificies all the right options, then the tool will work perfectly; but any mistake and you will get a bizarre error, which will be hard to fix. Thus,the tool is bad at failing. Ngless promises to work well and fail well.

Make some errors impossible

Let us recall our running example:

ngless "0.0"
import OceanMicrobiomeReferenceGeneCatalog version "1.0"

input = paired('data/data.1.fq.gz', 'data/data.2.fq.gz')
preprocess(input) using |read|:
    read = substrim(read, min_quality=30)
    if len(read) < 45:
        discard

mapped = map(input, reference='omrgc')
summary = count(mapped, features=['ko', 'cog'])
write(summary, ofile='output/functional.summary.txt')

Note that we do not specify paths for the ‘omrgc’ reference or the functional map file. We also do not specify files for intermediate files. This is all implicit and you cannot mess it up. The Fastq encoding is auto-detected, removing one more opportunity for you to mess up (although you can specify the encoding if you really want to).

Ngless always uses the three step output safe writing pattern:

  1. write the output to a temp file,
  2. sync the file and its directory to disk,
  3. rename the temp file to the final output name.

The final step is atomic. That is, the operating system garantees that it either fully completes or never executes even if there is an error, so that you never get a partial file. Thus, if there is an output file, you know that ngless finished without errors (up to that point, at least) and that the output is correct. No more asking “did the cluster crash affect this file? Maybe I need to recompute or maybe I count the number of lines to make sure it’s complete”. None of that: if the file is there, it is complete.

Side-note: programming languages (or their standard libraries) should have support for this safe-output writing pattern. I do not know of any language that does.

Make error detection fast

Have you ever run a multi-step pipeline where the very last step (often saving the results) has a silly typo and everything fails disastrously at that point wasting you hours of compute time? I know I have. Ngless tries as hard as possible to make sure that doesn’t happen.

Although ngless is interpreted rather than compiled, it performs an initial pass over your script to check for all sorts of possible errors.

Ngless is a typed language and all types are checked so that if you try to run the count function without first maping, you will get an error message.

All arguments to functions are also checked. This even checks some rules that would be hard to impose using a more general purpose programming language: for example, when you call count, either (1) you are using a built-in reference which has its own annotation files or (2) you have to pass in the path to a GTF or gene map file so that the output of the mapping can be annotated and summarized. This constraint would be hard to express in, for example, Java or C++, but ngless can check this type of condition easily.

The initial check makes sure that all necessary input files exist and can be read and even that any directories used for output are present and can be written to (in the script above, if a directory named output is not present, you will get an immediate error). If you are using your own functional map, it will read the file header to check that any features you use are indeed present (in the example above, it checks that the ‘ko’ and ‘cog’ features exist in the built-in ocean microbiome catalog).

All typos and other similar errors are caught immediately. If you mistype the name of your output directory, ngless will let you know in 0.2 seconds rather than after hours of computation.

You can also just run the tests with ngless -n script-name.ngl: it does nothing except run all the validation steps.

Again, this is an idea that could be interesting to explore in the context of general purpose languages.

Make error messages helpful

An unknown error occurred

An unhelpful error message

As much as possible, when an error is detected, the message should help you make sense of it and fix it. A tool cannot always read your mind, but as much as possible, ngless error messages are descriptive.

For example, if you used an illegal argument/parameter to a function, ngless will remind you of what the legal arguments are. If it cannot write to an output file it will say it cannot write to an output file (and not just “IO Error”). If a file is missing, it will tell you which file (and it will tell you in about 0.2 seconds.

Summary

Ngless is designed to make some errors impossible, while trying hard to give you good error messages for the errors that will inevitably crop up.

Why Python is Better than Matlab for Scientific Software

Why Python is Better than Matlab for Scientific Software

This is an argument I made at EuBIAS when arguing for the use of mahotas and the Python numpy-stack for bioimage informatics. I had a few conversations around this and decided to write a longer post summarizing the whole argument.

0. My argument mostly applies for new projects

If you have legacy code in MATLAB, then it may make sense to just continue using it. If it works, don’t fix it. However, if your Matlab code keeps causing you pain, Python might be a solution.

Note too that porting code is not the same as writing from scratch. You can often convert code from MATLAB to Python in a small fraction of the time it would take you to start from scratch.

1. Python has caught up with Matlab and is in the process of overtaking it.

This is my main argument: the momentum is in Python’s direction. Even two or three years ago, Python was behind. Now, it’s sailing past Matlab.

nr_lines_python

This graph shows the number of lines of code in some important projects for bioimage informatics/science in general (numpy, matplotlib, mahotas, skimage, and sklearn). As you can see, the base projects on the top (numpy and matplotlib) have been stable for some years, while the more applied packages at the bottom have exploded in recent years.

Depending on what you are doing, Python may even better support it. It is now, Matlab which is playing catch-up with open source software (for example, Matlab is now introducing their own versions of Dataframe, which Python has through Pandas [ itself, a version of R’s Dataframe object]).

The Python projects are also newer and tend, therefore, to be programmed in a more modern way: it is typical to find automated testing, excellent and complete documentation, a dedicated peer-reviewed publication, &c. This ain’t your grandfather’s open source with a dump on sourceforge and a single README file full of typos.

As an example of the large amount of activity going on in the Python world, just this week, Yhat released ggplot for Python [1]. So, while last week, I was still pointing to plotting as one of the weakneses of Python, it might no longer be true.

2. Python is a real programming language

Matlab is not, it is a linear algebra package. This means that if you need to add some non-numerical capabilities to your application, it gets hairy very fast.

For scientific purposes, when writing a small specialized script, Python may often be the second best choice: for linear algebra, Matlab may have nicer syntax; for statistics, R is probably nicer; for heavy regular expression usage, Perl (ugh) might still be nicer; if you want speed, Fortran or C(++) may be a better choice. To design a webpage; perhaps you want node.js. Python is not perfect for any of these, but is acceptable for all of them.

In every area, specialized languages are the best choice, but Python is the second best in more areas [2].

3. Python can easily interface with other languages

Python can interfact with any language which can be interacted through C, which is most languages. There is a missing link to some important Java software, but some work is being done to address that too. Technically, the same is true of Matlab.

However, the Python community and especially the scientific Python community has been extremely active in developing tools to make this as easy as you’d like (e.g., Cython).  Therefore, many tools/libraries in C-based languages are already wrapped in Python for you. I semi-routinely get sent little snippets of R code to do something. I will often just import rpy2 and use it from Python without having to change the rest of my code.

4. With Python, you can have a full open-source stack

This means that you are allowed to, for example, ship a whole virtual machine image with your paper. You can also see look at all of the code that your computation depends on. No black boxes.

5. Matlab Licensing issues are a pain. And expensive.

Note that I left the word expensive to the end, although in some contexts it may be crucial. Besides the basic MATLAB licenses, you will often need to buy a few more licenses for specific toolboxes. If you need to run the code on a cluster, often that will mean more licenses.

However, even when you do have the money, this does not make the problem go away: now you need admin for the licensing. When I was at CMU, we had campus-wide licenses and, yet, it took a while to configure the software on every new user’s computer (with some annoying minor issues like the fact that the username on your private computer needed to match the username you had on campus), you couldn’t run it outside the network (unless you set up a VPN, but this still means you need network access to run a piece of software), &c. Every so often, the license server would go down and stop everybody’s work. These secondary costs can be as large as the licensing costs.

Furthermore, using Python means you can more easily collaborate with people who don’t have access to Matlab. Even with a future version of yourself who decided to economize on Matlab licenses (or if the number of users shrinks and your institution decides to drop the campus licensing, you will not be one of the few groups now forced to buy it out-of-pocket).

By the way, if you do want support, there are plenty of options for purchasing it [3]: larger companies as well as individual consultants available in any city. The issue of support is orthogonal to the licensing of the software itself.

§

Python does have weaknesses, of course; but that’s for another post.

[1] Yhat is a commercial company releasing open-source software, by the way; for those keeping track at home.
[2] I remember people saying this about C++ back in the day.
[3] However, because science is a third-world economy, it may be easier to spend 10k on Matlab licenses which come with phone support to spending 1k on a local Python consultant.

To reproduce the paper, you cannot use the code we used for the paper

Over the last few posts, I described my nuclear segmentation paper.

It has a reproducible research archive.

§

If you now download that code, that is not the code that was used for the paper!

In fact, the version that generates the tables in the paper does not run anymore, because it only runs with old versions of numpy!

In order for it to compute the computation in the paper, I had to update the code. In order to run the code in the paper, you need to get old versions of software.

§

To some extent, this is due to numpy’s frustrating lack of forward compatibility [1]. The issue at hand was the changed semantics of the histogram function.

In the end, I think I completely avoided that function in my code for a few years as it was toxic (when you write libraries for others, you never know which version of numpy they are running).

§

But as much as I can gripe about numpy breaking code between minor versions, they would eventually be justified in changing their API with the next major version change.

In the end, the half-life of code is such that each year, it becomes harder to reproduce older papers even if the code is available.

[1] I used to develop for the KDE Project where you did not break user’s code ever and so I find it extremely frustrating to have to explain that you should not change an API on esthetical grounds in between minor versions.

Working Around Bugs in Third Party Libraries

This is another story of continuous improvement, this time in mahotas’ little brother imread.

Last week, Volker Hilsenstein here at EMBL had a few problems with imread on Windows. This is one of those very hard issues: how to help someone on a different platform, especially one which you know nothing about?

In the end, the problem was not Windows per se, but an old version of libtiff. In that version, there is a logic error (literally, there is a condition which is miswritten and always false) and the code will attempt to read a TIFF header from a file even when writing. Mahotas-imread was not ready for this.

Many (especially in the research open-source world, unfortunately) would just say: well, I won’t support broken versions of libtiff: if your code does not adhere to the spec, I am just going to not work for you, if you don’t do exactly what you should, then I won’t work either. See this excellent old essay by Joel Spolsky on this sort of thing.

In my case, I prefer to work around the bug and when libtiff tries to read in write mode, return no data; which it correctly handles. I wrote the following data reading function to pass to libtiff:

tsize_t tiff_no_read(thandle_t, void*, tsize_t) {
        return 0;
}

The purpose of this code is simply to make imread work even on a broken, 5 year old version of a third party library.

§

In the meanwhile, we also fixed compilation in Cygwin as well as a code path which led to a hard crash.

Especially the possibility of a hard crash made me decide that this was important enough to merit a new release.

Extended Depth of Field In Python using Mahotas

I wanted to implement an extended depth of field algorithm last week for visualisation.

The idea of extended depth of focus is that you have a stack of images, taken at different focal points, and you build a single artificial image so that you get everything in focus. Some simple methods work OK, some very advanced methods work better. We were happy with OK for our setting. Here’s how to do it with mahotas:

Start with standard imports:

import numpy as np
import mahotas as mh

We are going to assume that you have an image object, which has three dimensions: the stack, height, and width:

stack,h,w = image.shape

We use mh.sobel as the measure of infocusness for each pixel [1]:

focus = np.array([mh.sobel(t, just_filter=True) for t in image])

Now, we select the best slice at each pixel location:

best = np.argmax(focus, 0)

So far, very easy. The next part is the hard part. We want to do the following:

r = np.zeros((h,w))-1
for y in xrange(h):
    for x in xrange(w):
        r[y,x] = image[best[y,x], y, x]

But this is very slow (never run nested loops in Python if you can avoid it). We get the same result with a slightly less legible, but faster manipulation [2]:

image = image.reshape((stack,-1)) # image is now (stack, nr_pixels)
image = image.transpose() # image is now (nr_pixels, stack)
r = image[np.arange(len(image)), best.ravel()] # Select the right pixel at each location
r = r.reshape((h,w)) # reshape to get final result

Et voilà!

§

Here is an example, from a stack of plankton images. This is is the maximum intensity projection:

Maximum Intensity Projection

This is the most in-focus slice (using the sobel operator):

Most in-focus slice

And this is the extended depth of field result:

Extended Depth of Field

It is clearly sharper (look at some of the organelles on the top-left: they are a blur in the other two images),at the expense of some possible noise. I actually played around with blurring the image a little bit and it did improve things ever so slightly.

§

Cite the mahotas paper if using this in a scientific publication.

[1] Other methods simply use a different measure here.
[2] I am not 100% convinced that this is the best. After all we create an array of size len(image) just to index. I would be happy to find an alternative.