Paper review: Assessing the efficacy of low-level image content descriptors for computer-based fluorescence microscopy image analysis


Assessing the efficacy of low-level image content descriptors for computer-based fluorescence microscopy image analysis by L. Shamir in Journal of Microscopy, 2011 [DOI]

This is an excellent simple paper [1]. I will jump to the punchline (slightly edited by me for brevity):

This paper demonstrates that microscopy images that were previously used for developing and assessing the performance of bioimage classification algorithms can be classified even when the biological content is removed from the images [by replacing them with white squares], showing that previously reported results might be biased, and that the computer analysis could be driven by artefacts rather than by the actual biological content.

Here is an example of what the author means:


Basically, the author shows that even after modifying the images by drawing white boxes over the cells, classifiers still appear to perform well. Thus, they are probably picking up on artefacts instead of signal.

This is (and this analogy is from the paper, although not exactly in this form) like a face recognition system that seems to work very well because, in all of the images it has of me, I am wearing the same shirt. It can perform very well on the training data, but will be fooled by anyone who wears that shirt.


This is very important work, as it points to the fact that many previous results were probably overinflated. Looking at the dates when this work was done, this was probably at the same time that I was working on my own paper on evaluation of subcellular location determination (just that it took a while for that one to appear in print).

I expect that my proposed stricter protocol for evaluation (train and test on separate images) would be more protected against this sort of effect [2]: we are now modeling the real problem instead of a proxy problem.


I believe two things about image analysis of biological samples:

  1. Computers can be much better than humans at this task.
  2. Some (most? much of?) published literature overestimates how well computers do with the method being presented.

Note that there is no contradiction between the two, except that point 2, if widely believed, can make it harder to convince people of point 1.

(There is also a third point which is most people overestimate how well humans do.)

[1] Normally, I’d review recent papers only, but not only had this one escaped my attention when it came out (in my defense, it came out just when I was trying to finish my PhD thesis), it also deals with themes I have blogged about before.
[2] I tried a bit of testing around here, but it is hard to automate the blocking of the cells. Automatic thresholding does not work because it depends on the shape of the signal! This is why the author of this paper drew squares by hand.
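To see why automatic thresholding is signal-dependent, here is a minimal pure-Python Otsu sketch (my own illustration, not code from either paper): Otsu always returns some threshold, and where that threshold lands depends entirely on the intensity distribution of the image, so the mask it produces is shaped by the very signal one is trying to blank out.

```python
def otsu_threshold(pixels):
    """Otsu's method on 8-bit intensities: pick the threshold that
    maximises the between-class variance of the two resulting groups."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = sum0 = 0
    for t in range(256):
        w0 += hist[t]          # weight of the class at or below t
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        sum0 += t * hist[t]
        mu0, mu1 = sum0 / w0, (sum_all - sum0) / w1
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

# A field with bright cells: the threshold lands below the bright mode.
with_cells = [10] * 90 + [200] * 10
# A field with only faint background texture: Otsu still happily returns
# a threshold, now splitting the background against itself.
background_only = [10] * 90 + [30] * 10
```

In other words, the threshold is not an objective "where the cells are" oracle; it adapts to whatever distribution it is handed, which is exactly why hand-drawn squares were needed.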

New Paper: Determining the subcellular location of new proteins from microscope images using local features

I have a new paper out:

Luis Pedro Coelho, Joshua D. Kangas, Armaghan Naik, Elvira Osuna-Highley, Estelle Glory-Afshar, Margaret Fuhrman, Ramanuja Simha, Peter B. Berget, Jonathan W. Jarvik, and Robert F. Murphy, Determining the subcellular location of new proteins from microscope images using local features in Bioinformatics, 2013 [Advanced Access]

(It’s not open access, but feel free to email me for a preprint.)

[Figure: nuclear examples]

As you can see, this was 10 months in review, so I am very happy that it is finally out. To be fair, the final version is much improved due to some reviewer comments (alas, not all reviewer comments were constructive).

There are two main ideas in this paper. We could perhaps have broken this up into two minimum publishable units, but the first idea immediately brings up a question. We went ahead and answered that too.

The main point of the paper is that:

1. The evaluation of bioimage classification systems (in the context of subcellular classification, but others too) has under-estimated the problem.

Almost all evaluations have used the following protocol [1]:

  1. Define the classes of interest, such as the organelles: nuclear, Golgi, mitochondria, …
  2. For each of these, select a representative marker (i.e., DAPI for the nuclear class, &c).
  3. Collect multiple images of different cells tagged with the representative marker for each class.
  4. Test whether a system trained on some images of that marker can recognise other images of the same marker.
  5. Use cross-validation over these images. Get good results. Publish!

Here is the point of this paper: By using a single marker (a tagged protein or other fluorescent marker) for each class, we are unable to distinguish between two hypotheses: (a) the system is good at distinguishing the classes and (b) the system is good at distinguishing the markers. We show empirically that, in many cases, you are distinguishing markers and not locations!
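The confound can be seen in a toy simulation (mine, not from the paper — the data, the single scalar feature, and the 1-NN classifier are all illustrative assumptions). The feature encodes only a per-marker artefact and no biology at all, yet the classic within-marker cross-validation looks perfect, while holding out whole markers drops the system to chance:

```python
def nn_predict(train, x):
    """1-nearest-neighbour on a single scalar feature."""
    return min(train, key=lambda item: abs(item[0] - x))[1]

def accuracy(train, test):
    return sum(nn_predict(train, f) == lab for f, lab, _ in test) / len(test)

# Toy data: two classes (A, B), two markers per class.  The feature
# encodes only a per-marker artefact, not any real biological content.
artefact = {'A-m1': 0.0, 'A-m2': 10.0, 'B-m1': 5.0, 'B-m2': 15.0}
images = [(artefact[m] + 0.01 * i, m[0], m)
          for m in artefact for i in range(10)]

# Protocol (a): classic cross-validation; train and test share markers.
train_a = [(f, lab) for k, (f, lab, m) in enumerate(images) if k % 2 == 0]
test_a = [img for k, img in enumerate(images) if k % 2 == 1]

# Protocol (b): hold out one marker per class entirely.
train_b = [(f, lab) for f, lab, m in images if m.endswith('m1')]
test_b = [img for img in images if img[2].endswith('m2')]

print(accuracy(train_a, test_a))  # within-marker CV: perfect
print(accuracy(train_b, test_b))  # held-out markers: chance (0.5)
```

Both numbers come from the same classifier on the same features; only the split changed.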

This is a complex idea, and I will have at least another post just on this idea.

The natural follow-up question is: how can we get better results on this new problem?

2. Local features work very well for bioimage analysis. Using SURF and an adaptation of SURF we obtained a large accuracy boost. The code is available in my library mahotas.
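The local-features idea can be sketched in a few lines. This is not the paper's actual SURF pipeline (that lives in mahotas); it is a toy bag-of-features illustration under simplifying assumptions: dense patches stand in for interest points, and a fixed two-word codebook stands in for a learned one.

```python
def patches(img, size=3, step=2):
    """Extract flattened size x size patches on a regular grid (a toy
    stand-in for an interest-point detector such as SURF)."""
    h, w = len(img), len(img[0])
    out = []
    for i in range(0, h - size + 1, step):
        for j in range(0, w - size + 1, step):
            out.append(tuple(img[i + a][j + b]
                             for a in range(size) for b in range(size)))
    return out

def bag_of_features(img, codebook):
    """Histogram of nearest-codeword assignments: the image-level
    descriptor that a classifier is then trained on."""
    hist = [0] * len(codebook)
    for p in patches(img):
        dists = [sum((x - y) ** 2 for x, y in zip(p, c)) for c in codebook]
        hist[dists.index(min(dists))] += 1
    return hist

# Toy usage: a two-word codebook of "dark" and "bright" 3x3 patches.
codebook = [(0,) * 9, (255,) * 9]
dark_img = [[0] * 8 for _ in range(8)]
```

The key property is that the descriptor is built from many small local measurements rather than one global one, which is what makes it robust to where in the field the cells happen to sit.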

I had pointed out in my review of Liscovitch et al. that we had similarly obtained good results with local features.

I will have a few posts on this paper, including at least one on things that we left out because they did not work very well.

[1] All that I know. I may be biased towards the subcellular location literature (which I know very well), but other literatures may have been aware of this problem. Add a comment below if you know of something.

Parameter Optimization and Early Exit

Andrew Gelman calls this the folk theorem of statistical computing:

When you have computational problems, often there’s a problem with your model.

(This is not a folk theorem per se, as it is not a theorem; but the name stuck. [1])

There is an interesting corollary:

When you are performing parameter selection using cross-validation or a similar blind procedure, the bad parameter choices take longer than the good ones!

This reminds me of the advertising quip (typically attributed to John Wanamaker): “Half the money I spend on advertising is wasted; the trouble is I don’t know which half.”

Of course, you can reply: if I knew the right parameters, I would not need to find them in the first place. Still, understanding that your computation is slower in the really bad cases can be actionable.


Here is an example from machine learning: if you are using a support vector machine based system, you will often need to fit two parameters:

  1. The SVM penalty $C$.
  2. The kernel parameter (in the case of radial basis functions, the width $\sigma$).

A simple solution is to try a grid of these parameters:

for c in [2**(-2), 2**(-1), 2**0, ...]:
    for s in [2**(-2), 2**(-1), 2**0, ...]:
        use cross-validation to evaluate using c & s
pick the best one
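The loop above, made concrete as a plain Python sketch (the `cross_val_error` callback is a hypothetical stand-in for whatever cross-validation routine you use):

```python
def grid_search(cross_val_error, c_values, s_values):
    """Exhaustive grid search: evaluate every (C, sigma) pair with
    cross-validation and keep the pair with the lowest error."""
    best, best_err = None, float('inf')
    for c in c_values:
        for s in s_values:
            err = cross_val_error(c, s)
            if err < best_err:
                best_err, best = err, (c, s)
    return best, best_err

# Toy usage with a mock error surface minimised at C=1, sigma=2.
grid = [2 ** k for k in range(-2, 3)]
best, err = grid_search(lambda c, s: abs(c - 1) + abs(s - 2), grid, grid)
```

Note that every pair is evaluated to completion, even the hopeless ones; that is the waste the rest of this post is about.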

There is a fair share of wasted computation. For example, let’s say you have a parameter choice that is so awful it gives you 100% error and another which is so good it gives you 0% error. If you are lucky and you check the good combination first, you can abort the bad parameter choices early: the first time you see a single misclassified example, you know that choice can never be as good.

This leads to the following algorithm:

error = { param -> 0 for all parameters }
until done:
    next_param = the parameter with the lowest accumulated error that has not yet run all folds
    run a cross-validation fold on next_param
    error[next_param] += error on this fold
    best_err = error[best completed parameter value]
    if best_err is the minimum over all accumulated errors:
        return best completed parameter value

This aborts as early as possible with the best error.
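A runnable version of the pseudocode above (my own minimal sketch, not the actual implementation in the package; `run_fold` is a hypothetical callback that runs one cross-validation fold and returns its error):

```python
def early_exit_search(params, run_fold, n_folds):
    """Run cross-validation folds, always advancing the candidate with
    the lowest accumulated error; stop once some candidate has completed
    all folds with a total error no pending candidate can still beat."""
    err = {p: 0.0 for p in params}
    folds_done = {p: 0 for p in params}
    while True:
        completed = [p for p in params if folds_done[p] == n_folds]
        pending = [p for p in params if folds_done[p] < n_folds]
        if completed:
            best = min(completed, key=lambda p: err[p])
            # accumulated error only grows, so any pending candidate
            # already at or above the best total can never win:
            # abort them all and return.
            if all(err[p] >= err[best] for p in pending):
                return best
        # advance the most promising unfinished candidate
        nxt = min(pending, key=lambda p: err[p])
        err[nxt] += run_fold(nxt, folds_done[nxt])
        folds_done[nxt] += 1
```

With one clearly good and one clearly bad setting, the bad one gets a single fold before being abandoned, which is where the savings come from.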


This is implemented in my machine learning package (github link) by default.

I tested it on murphylab_slf7dna (with a bit of hacking of the internals to print statistics &c). I see that fitting with the right parameters takes 650ms (after preprocessing). We check a total of 48 parameter values. So we might expect that to take 0.65 * 48 = 31s. Since bad parameters take longer, it actually takes 48s (50% longer).

Using the early exit trick takes it down to 24s. This is half the time of running the full grid, despite the fact that slightly more than half of the full grid was run: 57%.

[1] A really interesting question is whether you can formalise and prove it. A good model will often be one that has a nice little “fitness” peak, which is also easy to fit. Likelihood functions that are ill-conditioned or have local optima all over the place may correspond to worse models. There may be a proof hiding in here.