FAQ: How Many Clusters Did You Use?

Luis Pedro Coelho, Joshua D. Kangas, Armaghan Naik, Elvira Osuna-Highley, Estelle Glory-Afshar, Margaret Fuhrman, Ramanuja Simha, Peter B. Berget, Jonathan W. Jarvik, and Robert F. Murphy, Determining the subcellular location of new proteins from microscope images using local features in Bioinformatics, 2013 [Advanced Access]  [Previous discussion on this blog]

Coelho, Luis Pedro, Tao Peng, and Robert F. Murphy. “Quantifying the Distribution of Probes Between Subcellular Locations Using Unsupervised Pattern Unmixing.” Bioinformatics 26.12 (2010): i7–i12. DOI: 10.1093/bioinformatics/btq220  [Previous discussion on this blog]

Both of my Bioinformatics papers above use the concept of a bag of visual words: the first for classification, the second for pattern unmixing.

Visual words are formed by clustering local appearance descriptors. The descriptors may have different origins (see the papers above and the references below) and the visual words are used differently, but the clustering is a common intermediate step.
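To make the clustering step concrete, here is a minimal sketch of building a bag of visual words. It is not the code from the papers: the plain Lloyd's k-means, the function names, and the toy 2-D "descriptors" are all assumptions chosen for illustration (real descriptors such as SURF have many more dimensions).

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(centroids, d):
    """Index of the centroid (visual word) closest to descriptor d."""
    return min(range(len(centroids)), key=lambda j: sq_dist(centroids[j], d))

def kmeans(descriptors, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns a list of k centroids."""
    rng = random.Random(seed)
    centroids = [list(d) for d in rng.sample(descriptors, k)]
    for _ in range(n_iter):
        members = [[] for _ in range(k)]
        for d in descriptors:
            members[nearest(centroids, d)].append(d)
        for j, group in enumerate(members):
            if group:  # keep the old centroid if a cluster went empty
                centroids[j] = [sum(col) / len(group) for col in zip(*group)]
    return centroids

def bag_of_words(image_descriptors, centroids):
    """Normalised histogram of visual-word counts for one image."""
    counts = [0] * len(centroids)
    for d in image_descriptors:
        counts[nearest(centroids, d)] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else counts

# Toy data (hypothetical values): two images, each a list of 2-D descriptors.
images = [
    [(0.0, 0.1), (0.1, 0.0), (5.0, 5.1)],
    [(5.1, 5.0), (4.9, 5.0), (0.0, 0.0)],
]
pooled = [d for img in images for d in img]   # cluster descriptors from all images
centroids = kmeans(pooled, k=2)               # k is the number of visual words
histograms = [bag_of_words(img, centroids) for img in images]
```

Each image ends up represented by a fixed-length histogram, whatever the number of descriptors it produced, and that histogram is what a classifier or unmixing method would consume.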

A common question when I present this work is how many clusters do I use? Here’s the answer: it does not matter too much.

I used to just pick a round number like 256 or 512, but for the local features paper, I decided to look at the issue a bit closer. This is one of the panels from the paper, showing accuracy (y-axis) as a function of the number of clusters (x-axis):


As you can see, if you use enough clusters, you’ll do fine. If I had extended the results rightwards, then you’d see a plateau (read the full paper & supplements for these results) and then a drop-off. The vertical line shows N/4, where N is the number of images in the study. This seems like a good heuristic across several datasets.

One very interesting result is that choosing clusters by minimising AIC can be counter-productive! Here is the killer data (remember, we would be minimizing the AIC):


Minimizing the AIC leads to lower accuracy! AIC was never intended to be used in this context, of course, but it is often used as a criterion to select the number of clusters. I’ve done it myself.

Punchline: If doing classification using visual words, minimising AIC may be detrimental; try using N/4 instead (N = number of images).

Other References

This paper (reviewed before on this blog) presents supporting data too:

Noa Liscovitch, Uri Shalit, & Gal Chechik (2013). FuncISH: learning a functional representation of neural ISH images Bioinformatics DOI: 10.1093/bioinformatics/btt207

Paper review: Assessing the efficacy of low-level image content descriptors for computer-based fluorescence microscopy image analysis

Paper review:

Assessing the efficacy of low-level image content descriptors for computer-based fluorescence microscopy image analysis by L. Shamir in Journal of Microscopy, 2011 [DOI]

This is an excellent simple paper [1]. I will jump to the punchline (slightly edited by me for brevity):

This paper demonstrates that microscopy images that were previously used for developing and assessing the performance of bioimage classification algorithms can be classified even when the biological content is removed from the images [by replacing them with white squares], showing that previously reported results might be biased, and that the computer analysis could be driven by artefacts rather than by the actual biological content.

Here is an example of what the author means:


Basically, the author shows that even after modifying the images by drawing white boxes where the cells are, classifiers still manage to do apparently well. Thus, they are probably picking up on artefacts instead of signal.
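The manipulation itself is simple to sketch. The helper below is a hypothetical illustration, not the author's code: it assumes an 8-bit greyscale image stored as a list of rows and box coordinates supplied by hand, and it paints each box white.

```python
def blank_boxes(image, boxes, fill=255):
    """Return a copy of `image` with every box painted with `fill`.

    image: list of rows of pixel values (assumed 8-bit greyscale).
    boxes: iterable of (top, left, bottom, right); bottom/right are exclusive.
    """
    out = [list(row) for row in image]  # copy, so the input stays untouched
    for top, left, bottom, right in boxes:
        for r in range(max(top, 0), min(bottom, len(out))):
            for c in range(max(left, 0), min(right, len(out[r]))):
                out[r][c] = fill
    return out

# Toy 4x4 black image with one hand-drawn box covering the "cell" in the centre.
image = [[0] * 4 for _ in range(4)]
masked = blank_boxes(image, [(1, 1, 3, 3)])
```

If a classifier trained and tested on images like `masked` still scores well above chance, it cannot be relying on the biological content inside the boxes.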

This is (and this analogy is from the paper, although not exactly in this form) like a face recognition system which seems to work very well because all of the images it has of me have me wearing the same shirt. It can perform very well on the training data, but will be fooled by anyone who wears the same shirt.


This is a very important work as it points to the fact that many previous results were probably overinflated. Looking at the dates when this work was done, this was probably at the same time that I was working on my own paper on evaluation of subcellular location determination (just that it took a while for that one to appear in print).

I expect that my proposed stricter protocol for evaluation (train and test on separate images) would be more protected against this sort of effect [2]: we are now modeling the real problem instead of a proxy problem.


I believe two things about image analysis of biological samples:

  1. Computers can be much better than humans at this task.
  2. Some (most? much of?) published literature overestimates how well computers do with the method being presented.

Note that there is no contradiction between the two, except that point 2, if widely believed, can make it harder to convince people of point 1.

(There is also a third point, which is that most people overestimate how well humans do.)

[1] Normally, I’d review recent papers only, but not only had this one escaped my attention when it came out (in my defense, it came out just when I was trying to finish my PhD thesis), it also deals with themes I have blogged about before.
[2] I tried a bit of testing around here, but it is hard to automate the blocking of the cells. Automatic thresholding does not work because it depends on the shape of the signal! This is why the author of this paper drew squares by hand.

Is Cell Segmentation Needed for Cell Analysis?

Having just spent some posts discussing a paper on nuclear segmentation (all tagged posts), let me ask the question:

Is cell segmentation needed? Is this a necessary step in an analysis pipeline dealing with fluorescent cell images?

This is a common FAQ whenever I give a talk on my work which does not use segmentation, for example, using local features for classification (see the video). It is a FAQ because, for many people, it seems obvious that the answer is that Yes, you need cell segmentation. So, when they see me skip that step, they ask: shouldn’t you have segmented the cell regions?

Here is my answer:

Remember Vapnik’s dictum [1]: do not solve, as an intermediate step, a harder problem than the problem you really need to solve.

Thus the question becomes: is your scientific problem dependent on cell segmentation? In the case, for example, of subcellular location determination, it is not: all the cells in the same field display the same phenotype, and your goal is to find out what it is. Therefore, you do not need to have an answer for each cell, only for the whole field.

In other problems, you may need to have a per-cell answer: for example in some kinds of RNAi experiment only a fraction of the cells in a field display the RNAi phenotype and the others did not take up the RNAi. Therefore, segmentation may be necessary. Similarly, if a measurement such as distance of fluorescent bodies to cell membrane is meaningful, by itself (as opposed to being used as a feature for classification), then you need segmentation.

However, sometimes you can get away without segmentation.


An important point to note is the following: while it may be good to have access to perfect segmentation, imperfect segmentation (i.e., the type you actually get) may not help as much as the perfect kind.


Just to be sure, I was not the first person to notice that you do not need segmentation for subcellular location determination. I think this is the first reference:

Huang, Kai, and Robert F. Murphy. “Automated classification of subcellular patterns in multicell images without segmentation into single cells.” Biomedical Imaging: Nano to Macro, 2004. IEEE International Symposium on. IEEE, 2004. [Google scholar link]

[1] I’m quoting from memory. It may be a bit off. It sounds obvious when you put it this way, but it is still often not respected in practice.

Video Abstract for Our Paper

Available on figshare. Check it out!

Can’t embed because WordPress would require me to pay a bit too much right now [it will probably be the only video I’ll post this year].

I did this on Linux and was surprised at how much the open source video editing software has grown. Everything worked well and the interfaces were good. It only took a few hours (which seems a lot, but this might reach as many people as if I gave a talk and I’d certainly spend as much time preparing for a talk).

I wish I had a better microphone than my laptop microphone, though.

Recognition of an Organelle Marker is not the Same as Recognition of the Organelle

Luis Pedro Coelho, Joshua D. Kangas, Armaghan Naik, Elvira Osuna-Highley, Estelle Glory-Afshar, Margaret Fuhrman, Ramanuja Simha, Peter B. Berget, Jonathan W. Jarvik, and Robert F. Murphy, Determining the subcellular location of new proteins from microscope images using local features in Bioinformatics, 2013 [DOI]

As I wrote on Wednesday, this paper has two main ideas: (1) traditional subcellular location determination systems do not generalize very well and (2) local features do better. I will now try to explain the first point in depth.


Here are the first two sentences of the abstract (added emphasis):

Evaluation of previous systems for automated determination of subcellular location from microscope images has been done using datasets in which each location class consisted of multiple images of the same representative protein. Here, we frame a more challenging and useful problem where previously unseen proteins are to be classified.

To expand on this: the typical evaluation model is the following:

  1. Define the classes of interest (e.g., the major organelles: nucleus, mitochondria, …).
  2. For each class, choose a representative. It could be a protein which was fluorescently tagged or another fluorescent marker (like DAPI for DNA). In our work, we only used fluorescent proteins, but the same logic applies to small molecular markers.
  3. Collect multiple images of cells tagged with this marker.
  4. Split up the set of images into training and testing groups. Learn a classifier on the training set, evaluate it on the testing set.
  5. Report the results.

The techniques were, almost always, feature based [1]. A feature is a function which computes a number from the image. By computing numbers which represent the properties of interest, we can hope that images from the same class will have similar results. The following image illustrates this [2]:


Images of known proteins (left and right) are projected into a low dimensional space of features. Then an image of unknown label can be predicted by looking in this low dimensional space as well.
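As a toy illustration of "a feature is a function which computes a number from the image", here is a sketch with two made-up features and a nearest-neighbour lookup in feature space. The feature choices, the 8-bit pixel assumption, and the labels are hypothetical; the systems discussed here use far richer feature sets.

```python
def features(image, threshold=128):
    """Map an image (list of rows, assumed 8-bit) to a 2-D feature vector."""
    pixels = [p for row in image for p in row]
    mean_intensity = sum(pixels) / len(pixels) / 255          # scaled to [0, 1]
    bright_fraction = sum(p > threshold for p in pixels) / len(pixels)
    return (mean_intensity, bright_fraction)

def predict_1nn(train_feats, train_labels, query_feat):
    """Label of the training image closest to the query in feature space."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    best = min(range(len(train_feats)),
               key=lambda i: sq_dist(train_feats[i], query_feat))
    return train_labels[best]

# Two labelled toy images and one unlabelled query (all values hypothetical).
dim = [[10, 10], [10, 10]]
bright = [[200, 200], [200, 10]]
train_feats = [features(dim), features(bright)]
train_labels = ["dim-pattern", "bright-pattern"]

query = [[190, 210], [220, 20]]
predicted = predict_1nn(train_feats, train_labels, features(query))
```

The prediction depends only on where the query lands in the feature space, which is why the choice of features (and what they happen to pick up on) matters so much.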


We can get very high accuracies, above 95% in some cases, with this family of systems, which has been interpreted as meaning that automated systems can determine the location of proteins with high accuracy. There is a big hidden assumption in the reasoning, however!

There are two hypotheses that are consistent with the data:

  1. The system is very good at recognizing this location.
  2. The system is very good at recognizing this protein.

Under the second hypothesis, the system is very good at recognizing the marker you used for DNA (say DAPI), but may fail miserably when presented with another nuclear marker.


Fundamentally, to test between the two hypotheses above, we need datasets with multiple proteins per location. This is what we collected. And, when we tested the generalization ability of traditional methods, they fell short.

While a traditional approach was able to get 84% accuracy when it only needed to recognize the proteins it had been trained on (10 classes), it fell to 62% when it needed to recognize locations of new proteins. However, this is the important problem: to determine the location of new proteins, not the ones the system was trained on.
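The difference between the two evaluations comes down to how the images are split. Below is a sketch of a leave-one-protein-out split, where every image of the held-out protein goes to the test set. The function name and the toy labels are assumptions for illustration and may differ from the exact protocol in the paper.

```python
def leave_one_group_out(groups):
    """Yield (held_out, train_idx, test_idx), one fold per distinct group.

    groups[i] is the group (here: the protein) that image i belongs to, so
    no protein ever appears on both sides of a split.
    """
    for held_out in sorted(set(groups)):
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train_idx, test_idx

# Hypothetical dataset: two proteins per location, two images per protein.
proteins = ["A", "A", "B", "B", "C", "C", "D", "D"]
locations = ["nuclear"] * 4 + ["mitochondria"] * 4

folds = list(leave_one_group_out(proteins))
for held_out, train_idx, test_idx in folds:
    # The classifier would be trained on train_idx and asked for the
    # *location* of the images in test_idx, whose protein it has never seen.
    train_proteins = {proteins[i] for i in train_idx}
    test_proteins = {proteins[i] for i in test_idx}
```

A plain random split of the same eight images would usually put images of one protein on both sides, which is exactly the setting where recognising the protein is enough to look good.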

As the title says: Recognition of an Organelle Marker is not the Same as Recognition of the Organelle


Over the next few posts I will explain how we tested this & then, finally, how we got some better results on this harder problem.

[1] There is an exception that I know of, from the beginning of the field: Danckaert et al. 2003 in Traffic. They used a neural network with a single hidden layer directly on the pixels. It would be very interesting to re-attempt this approach for cell images with the deep learning technology that has been developed in the meantime (I don’t have enough time to do it myself, so feel free to take this idea and run with it; or get in touch if you want to do it together).
[2] This image is in Wikipedia, but I put it there, so I don’t need to credit it.

New Paper: Determining the subcellular location of new proteins from microscope images using local features

I have a new paper out:

Luis Pedro Coelho, Joshua D. Kangas, Armaghan Naik, Elvira Osuna-Highley, Estelle Glory-Afshar, Margaret Fuhrman, Ramanuja Simha, Peter B. Berget, Jonathan W. Jarvik, and Robert F. Murphy, Determining the subcellular location of new proteins from microscope images using local features in Bioinformatics, 2013 [Advanced Access]

(It’s not open access, but feel free to email me for a preprint.)

Nuclear examples

As you can see, this was 10 months in review, so I am very happy that it is finally out. To be fair, the final version is much improved due to some reviewer comments (alas, not all reviewer comments were constructive).

There are two main ideas in this paper. We could perhaps have broken this up into two minimum publishable units, but the first idea immediately brings up a question. We went ahead and answered that too.

The main point of the paper is that:

1. The evaluation of bioimage classification systems (in the context of subcellular classification, but others too) has under-estimated the problem.

Almost all evaluations have used the following model [1]:

  1. Define the classes of interest, such as the organelles: nuclear, Golgi, mitochondria, …
  2. For each of these, select a representative marker (e.g., DAPI for the nuclear class, &c).
  3. Collect multiple images of different cells tagged with the representative marker for each class.
  4. Test whether a system trained on some images of that marker can recognise other images of the same marker.
  5. Use cross-validation over these images. Get good results. Publish!

Here is the point of this paper: By using a single marker (a tagged protein or other fluorescent marker) for each class, we are unable to distinguish between two hypotheses: (a) the system is good at distinguishing the classes and (b) the system is good at distinguishing the markers. We show empirically that, in many cases, you are distinguishing markers and not locations!

This is a complex idea, and I will have at least another post just on this idea.

The natural follow-up question is how can we get better results in this new problem?

2. Local features work very well for bioimage analysis. Using SURF and an adaptation of SURF we obtained a large accuracy boost. The code is available in my library mahotas.

I had pointed out in my review of Liscovitch et al. that we had similarly obtained good results with local features.

I will have a few posts on this paper, including at least one on things that we left out because they did not work very well.

[1] All that I know. I may be biased towards the subcellular location literature (which I know very well), but other literatures may have been aware of this problem. Add a comment below if you know of something.