Working Around Bugs in Third Party Libraries

This is another story of continuous improvement, this time in mahotas’ little brother imread.

Last week, Volker Hilsenstein here at EMBL had a few problems with imread on Windows. This is one of those very hard issues: how to help someone on a different platform, especially one which you know nothing about?

In the end, the problem was not Windows per se, but an old version of libtiff. In that version, there is a logic error (literally, there is a condition which is miswritten and always false) and the code will attempt to read a TIFF header from a file even when writing. Mahotas-imread was not ready for this.

Many (especially in the research open-source world, unfortunately) would just say: well, I won’t support broken versions of libtiff: if your code does not adhere to the spec, I am just going to not work for you, if you don’t do exactly what you should, then I won’t work either. See this excellent old essay by Joel Spolsky on this sort of thing.

In my case, I prefer to work around the bug and when libtiff tries to read in write mode, return no data; which it correctly handles. I wrote the following data reading function to pass to libtiff:

tsize_t tiff_no_read(thandle_t, void*, tsize_t) {
        return 0;
}

The purpose of this code is simply to make imread work even on a broken, 5 year old version of a third party library.

§

In the meanwhile, we also fixed compilation in Cygwin as well as a code path which led to a hard crash.

Especially the possibility of a hard crash made me decide that this was important enough to merit a new release.

Advertisements

Extended Depth of Field In Python using Mahotas

I wanted to implement an extended depth of field algorithm last week for visualisation.

The idea of extended depth of focus is that you have a stack of images, taken at different focal points, and you build a single artificial image so that you get everything in focus. Some simple methods work OK, some very advanced methods work better. We were happy with OK for our setting. Here’s how to do it with mahotas:

Start with standard imports:

import numpy as np
import mahotas as mh

We are going to assume that you have an image object, which has three dimensions: the stack, height, and width:

stack,h,w = image.shape

We use mh.sobel as the measure of infocusness for each pixel [1]:

focus = np.array([mh.sobel(t, just_filter=True) for t in image])

Now, we select the best slice at each pixel location:

best = np.argmax(focus, 0)

So far, very easy. The next part is the hard part. We want to do the following:

r = np.zeros((h,w))-1
for y in xrange(h):
    for x in xrange(w):
        r[y,x] = image[best[y,x], y, x]

But this is very slow (never run nested loops in Python if you can avoid it). We get the same result with a slightly less legible, but faster manipulation [2]:

image = image.reshape((stack,-1)) # image is now (stack, nr_pixels)
image = image.transpose() # image is now (nr_pixels, stack)
r = image[np.arange(len(image)), best.ravel()] # Select the right pixel at each location
r = r.reshape((h,w)) # reshape to get final result

Et voilà!

§

Here is an example, from a stack of plankton images. This is is the maximum intensity projection:

Maximum Intensity Projection

This is the most in-focus slice (using the sobel operator):

Most in-focus slice

And this is the extended depth of field result:

Extended Depth of Field

It is clearly sharper (look at some of the organelles on the top-left: they are a blur in the other two images),at the expense of some possible noise. I actually played around with blurring the image a little bit and it did improve things ever so slightly.

§

Cite the mahotas paper if using this in a scientific publication.

[1] Other methods simply use a different measure here.
[2] I am not 100% convinced that this is the best. After all we create an array of size len(image) just to index. I would be happy to find an alternative.

Jug now outputs metadata on the computation

This week-end, I added new functionality to jug (previous related posts). Jug can now output the final result of a computation including metadata on all the intermediate inputs!

For example:

from jug import TaskGenerator
from jug.io import write_metadata # <---- THIS IS THE NEW STUFF

@TaskGenerator
def double(x):
    return 2*x

x = double(2)
x2 = double(x)write_metadata(x2, 'x2.meta.yaml')

When you execute this script (with jug execute), the write_metadata function will write a YAML description of the computation to the file x.meta.yaml!

This file will look like this:

args:
- args: [2]
  meta: {completed: 'Sat Aug  3 18:31:42 2013', computed: true}
  name: jugfile.double
meta: {completed: 'Sat Aug  3 18:31:42 2013', computed: true}
name: jugfile.double

It tells you that a computation named jugfile.double was computated at 18h31 on Saturday August 3. It also gives the same information recursively for all intermediate results.

§

This is the result of  a few conversations I had at BOSC 2013.

§

This is part of the new release, 0.9.6, which I put out yesterday.

The Hard Part is Motivation. Books. &c

Building Machine Learning Systems with Python

Because my book just came out, I am again excerpting from a Portuguese interview with me I mentioned previously:

Who is the target audience for this book?

Luis Pedro: There are two distinct audiences: The first are programmers who do not know much of machine learning, but liked to use a classifier, for example. The second are people who do not need an introduction to machine learning (because they already know), but maybe do not know how to do it with the tools in Python.

What were the main difficulties encountered in this process?

Luis Pedro: The challenge is always to find examples that are not too easy, but they are not also too difficult for an introduction. In one case (image classification), I took some photographs to create a dataset. In the literature, there are classical examples which are trivial to handle with modern techniques and current research problems that are very difficult. My dataset consists of poorly framed photographs taken with a mobile phone camera. Therefore, it also has a style more akin to a problem area or mobile web. The aim is to distinguish photographs of buildings, natural landscape, or texts.

There was nothing too complex for anyone who knows the area. In fact, when it comes to writing about something you already know how to do (as was the case), the hard part is the motivation.

[…]

One thing we stress throughout is principled evaluation and we warn against overselling your results. Even in the world of research, we still find papers that mix training set and test set! Obviously not coming from groups that work in machine learning, but in more applied areas, we still find people who test hyper-parameters in the test set.

Making Your Mistakes in Public

I recently released a new version of mahotas, my computer vision package. It was version 1.0.1, which was quickly followed by 1.0.2.

1.0 had introduced a mistake, which was caught and fixed by Tony S Yu on github (issue 33). Along with a few other improvements, this warranted a new release.

Then, 1.0.1 had a little mistake of its own, caught by Jean-Patrick Pommier [jip’s homepage] on the pythonvision mailing list.

Thus, 1.0.2 was necessary. I released 1.0.2 about an hour after I was alerted to the errors.

§

This is why I develop my scientific code as open source. (1) Other people catch your mistakes and fix them for you  (and the whole community wins)! (2) because the code was public, I felt a great urge to fix the mistakes very fast. I even added tests to make sure they would not happen again. I am harnessing public pressure for good.

Public code becomes better as its openness pushes me to have it as pristine as possible.

Friday Links

1.How genetic engineering ruined dogs

2. A Professor is unhappy that his PlOS One paper ended up in a commercial book

This is the old copyleft and viral licensing discussion from software: GPL or BSD?

I am posting because I found it interesting, not because I agree.

I actually found it a bit naïve that the guy did not know what open access really meant. If one is publishing in an open access journal, one should know what CC-BY means.

Still, when will someone propose a version of the GPL for scientific papers?

3. A scientist associated with a certain idea that is gaining traction says that ambitious young researchers should try to find the major flaws in that idea! This is what being a good scientist is.

4. From a post about the IMF & Academia:

Just as Canadians know much more about the US, than Americans know about Canada, bloggers know much more about academia than academics know about the blogosphere.

As with the previous link, my purpose here is to about the scientific aspect, not the politics.

5. Melatonin for sleep dysfunctions in children.

As a biologist, my initial thought is that taking a hormone is probably more likely to have side effects than a small molecule (ie, traditional drug). Small molecules have side effects all the time, but hormones should be expected to be all over the place, especially in children. However, it seems calling it natural makes many people feel the opposite (also lowers the regulatory burden because this feeling is codified in law).

6. An interview with me (in Portuguese).

People Agreeing With Me on the Internet

A couple of things I noticed lately that relate to a few of my posts/ideas.

§ One

Many “scientists should do x” conversations need to be reframed as”the culture of science needs to support x.”

— Jacquelyn Gill (@JacquelynGill) May 16, 2013

This is a very nice way to phrase what I wrote about collective action problems in sharing code

§ Two

First of all, a really cool thing, Thomas J. Webb commented on his own paper with a blog update.

In the blog post, he writes (emphasis is mine):

With some trepidation, I opened up the file of R code I’d used for the original analysis. And got a pleasant surprise: it was readable! Largely this was because I submitted it as an appendix to our paper, and so had taken more care than usual to annotate it carefully. I think this demonstrates an under-apprecaiated virtue of sharing data and code: in preparing it such that it is comprehensible to others, it becomes much more useful to your future self. This point is nicely made in a new paper by Ethan White and colleagues on making using and reusing data easier.

Exactly: sharing code makes you write better code.

(hat tip to @mattjhodgkinson on twitter).