FAQ: How Many Clusters Did You Use?

Luis Pedro Coelho, Joshua D. Kangas, Armaghan Naik, Elvira Osuna-Highley, Estelle Glory-Afshar, Margaret Fuhrman, Ramanuja Simha, Peter B. Berget, Jonathan W. Jarvik, and Robert F. Murphy, Determining the subcellular location of new proteins from microscope images using local features in Bioinformatics, 2013 [Advanced Access]  [Previous discussion on this blog]

Coelho, Luis Pedro, Tao Peng, and Robert F. Murphy. “Quantifying the Distribution of Probes Between Subcellular Locations Using Unsupervised Pattern Unmixing.” Bioinformatics 26.12 (2010): i7–i12. DOI: 10.1093/bioinformatics/btq220  [Previous discussion on this blog]

Both of my Bioinformatics papers above use the concept of a bag of visual words: the first for classification, the second for pattern unmixing.

Visual words are formed by clustering local appearance descriptors. The descriptors may have different origins (see the papers above and the references below) and the visual words are used differently, but the clustering is a common intermediate step.
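To make the intermediate step concrete, here is a minimal sketch of the bag-of-visual-words idea: once cluster centers have been learned (by k-means or similar), each local descriptor is assigned to its nearest center, and an image is represented by the histogram of assignments. The toy 2-D "descriptors" below are made up for illustration; real descriptors (SIFT, SURF) have 64 or 128 dimensions.

```python
import math

def nearest_word(descriptor, centers):
    """Index of the closest cluster center (visual word) by Euclidean distance."""
    def dist(c):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(descriptor, c)))
    return min(range(len(centers)), key=lambda i: dist(centers[i]))

def bag_of_words(descriptors, centers):
    """Histogram of visual-word counts for one image."""
    counts = [0] * len(centers)
    for d in descriptors:
        counts[nearest_word(d, centers)] += 1
    return counts

# Toy example: 2-D "descriptors" and two cluster centers.
centers = [(0.0, 0.0), (10.0, 10.0)]
descriptors = [(0.5, 0.1), (9.8, 10.2), (10.1, 9.9)]
print(bag_of_words(descriptors, centers))  # → [1, 2]
```

The resulting fixed-length histogram is what gets fed to a standard classifier, regardless of how many descriptors each image produced.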

A common question when I present this work is how many clusters do I use? Here’s the answer: it does not matter too much.

I used to just pick a round number like 256 or 512, but for the local features paper, I decided to look at the issue a bit closer. This is one of the panels from the paper, showing accuracy (y-axis) as a function of the number of clusters (x-axis):


As you can see, if you use enough clusters, you’ll do fine. If I had extended the results rightwards, then you’d see a plateau (read the full paper & supplements for these results) and then a drop-off. The vertical line shows N/4, where N is the number of images in the study. This seems like a good heuristic across several datasets.
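The N/4 heuristic is simple enough to state in a couple of lines. The minimum floor below is my own addition (the paper's panels suggest accuracy suffers only when the cluster count is very small), so treat it as an assumption:

```python
def suggested_num_clusters(n_images, minimum=64):
    """N/4 heuristic: roughly one visual word per four images in the study.
    The floor of 64 is an assumption, not from the paper."""
    return max(minimum, n_images // 4)

print(suggested_num_clusters(4096))  # → 1024
print(suggested_num_clusters(100))   # → 64 (floor kicks in)
```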

One very interesting result is that choosing clusters by minimising AIC can be counter-productive! Here is the killer data (remember, we would be minimizing the AIC):


Minimizing the AIC leads to lower accuracy! AIC was never intended to be used in this context, of course, but it is often used as a criterion to select the number of clusters. I’ve done it myself.
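For readers who want to see what "selecting k by AIC" even means for k-means: AIC needs a likelihood, so one has to bolt a probabilistic model onto the clustering. A common rough reading treats each cluster as a spherical Gaussian with a shared variance; the sketch below uses that approximation (formulations differ in the exact parameter count and variance model, so this is one choice among several, not the paper's code):

```python
import math

def kmeans_aic(points, centers):
    """Rough AIC for a k-means clustering, treating clusters as spherical
    Gaussians with a single shared variance. One common approximation
    among several; exact formulations vary."""
    n, d, k = len(points), len(points[0]), len(centers)
    # Residual sum of squares: each point against its nearest center.
    rss = 0.0
    for p in points:
        rss += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    sigma2 = rss / (n * d)            # shared variance estimate
    log_lik = -0.5 * n * d * (math.log(2 * math.pi * sigma2) + 1)
    n_params = k * d + 1              # k centers in d dims, plus the variance
    return 2 * n_params - 2 * log_lik

# Two well-separated toy clusters: AIC prefers k=2 over k=1 here.
pts = [(0.0, 0.0), (0.1, 0.0), (10.0, 10.0), (10.0, 10.1)]
one = [(5.025, 5.025)]
two = [(0.05, 0.0), (10.0, 10.05)]
print(kmeans_aic(pts, two) < kmeans_aic(pts, one))  # → True
```

On toy data like this AIC behaves sensibly; the paper's point is that the k it picks on real visual-word data sits well below the accuracy plateau.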

Punchline: if doing classification using visual words, minimising AIC may be detrimental; try using N/4 (N = number of images).

Other References

This paper (reviewed before on this blog) presents supporting data too:

Noa Liscovitch, Uri Shalit, & Gal Chechik (2013). FuncISH: learning a functional representation of neural ISH images Bioinformatics DOI: 10.1093/bioinformatics/btt207

Is Cell Segmentation Needed for Cell Analysis?

Having just spent some posts discussing a paper on nuclear segmentation (all tagged posts), let me ask the question:

Is cell segmentation needed? Is this a necessary step in an analysis pipeline dealing with fluorescent cell images?

This is a common FAQ whenever I give a talk on my work which does not use segmentation, for example, using local features for classification (see the video). It is a FAQ because, for many people, it seems obvious that the answer is that Yes, you need cell segmentation. So, when they see me skip that step, they ask: shouldn’t you have segmented the cell regions?

Here is my answer:

Remember Vapnik‘s dictum [1]: do not solve, as an intermediate step, a harder problem than the problem you really need to solve.

Thus the question becomes: is your scientific problem dependent on cell segmentation? In the case, for example, of subcellular location determination, it is not: all the cells in the same field display the same phenotype, and your goal is to find out what it is. Therefore, you do not need an answer for each cell, only for the whole field.

In other problems, you may need a per-cell answer: for example, in some kinds of RNAi experiments, only a fraction of the cells in a field display the RNAi phenotype because the others did not take up the RNAi. Therefore, segmentation may be necessary. Similarly, if a measurement such as the distance of fluorescent bodies to the cell membrane is meaningful by itself (as opposed to being used as a feature for classification), then you need segmentation.

However, sometimes you can get away without segmentation.


An important point to note is the following: while it may be good to have access to perfect segmentation, imperfect segmentation (i.e., the type you actually get) may not help as much as the perfect kind.


To be clear, I was not the first person to notice that you do not need segmentation for subcellular location determination. I think this is the first reference:

Huang, Kai, and Robert F. Murphy. “Automated classification of subcellular patterns in multicell images without segmentation into single cells.” Biomedical Imaging: Nano to Macro, 2004. IEEE International Symposium on. IEEE, 2004. [Google scholar link]

[1] I’m quoting from memory, so it may be a bit off. It sounds obvious when you put it this way, but it is still often not respected in practice.

Bioimage Analysis Proceedings Session at #ISBMECCB (Live Blogging)

This afternoon, I sat in the Bioimage Informatics Proceedings session.

Automated cellular annotation for high-resolution images of adult Caenorhabditis elegans by Sarah J. Aerni et al. [DOI]

This work uses C. elegans to study development; in this organism, all 959 cells can be uniquely identified. The question is how to do so automatically, as it takes hours or days to do it visually.

Unlike previous work, they use morphological features of cells and not just expected location. They also allow for variable cell division. The result is higher accuracy in labeled data.

FuncISH: learning a functional representation of neural ISH images by Noa Liscovitch et al. [DOI]

(I blogged about this paper before)

This work looks at gene expression in the brain. Images are represented using local features. They do not use the scale invariance of the SIFT representation as the images are all at the same scale.

The genes are mapped to functional annotations, which is more effective than the previously published baselines, which only used the images. This can pick up similarity of genes that are expressed in different cell regions.

Automated annotation of gene expression image sequences via non-parametric factor analysis and conditional random fields by Iulian Pruteanu-Malinici et al. [DOI]

Work with in-situ hybridization images on Drosophila embryos across genes and time. Features were extracted using a sparse Bayesian factor model. Then, the temporal aspect of the data is modeled using a conditional random field, which improves results when compared to considering the inputs as independent.

A high-throughput framework to detect synapses in electron microscopy images by Saket Navlakha et al. [DOI]

This was a presentation of methodological advances in detecting synapses, involving both new laboratory and new computational methods. The basic lab technique was a 50-year-old method that had fallen out of use. The most interesting aspect is that the experimental technique is justified because it makes (automatic) analysis easier.

They also tackled the typically ignored problem of generalizing a model learned on a particular set of samples to a new set of similar, but not identical, samples. They empirically showed that co-training works well for this problem if you are careful. Nice!

Video Abstract for Our Paper

Available on figshare. Check it out!

Can’t embed because WordPress would require me to pay a bit too much right now [it will probably be the only video I’ll post this year].

I did this on Linux and was surprised at how much the open source video editing software has grown. Everything worked well and the interfaces were good. It only took a few hours (which seems a lot, but this might reach as many people as if I gave a talk and I’d certainly spend as much time preparing for a talk).

I wish I had a better microphone than my laptop microphone, though.

New Paper: Determining the subcellular location of new proteins from microscope images using local features

I have a new paper out:

Luis Pedro Coelho, Joshua D. Kangas, Armaghan Naik, Elvira Osuna-Highley, Estelle Glory-Afshar, Margaret Fuhrman, Ramanuja Simha, Peter B. Berget, Jonathan W. Jarvik, and Robert F. Murphy, Determining the subcellular location of new proteins from microscope images using local features in Bioinformatics, 2013 [Advanced Access]

(It’s not open access, but feel free to email me for a preprint.)

Nuclear examples

As you can see, this was 10 months in review, so I am very happy that it is finally out. To be fair, the final version is much improved due to some reviewer comments (alas, not all reviewer comments were constructive).

There are two main ideas in this paper. We could perhaps have broken this up into two minimum publishable units, but the first idea immediately brings up a question. We went ahead and answered that too.

The main point of the paper is that:

1. The evaluation of bioimage classification systems (in the context of subcellular classification, but others too) has underestimated the problem.

Almost all evaluations have used the following mode [1]:

  1. Define the classes of interest, such as the organelles: nuclear, Golgi, mitochondria, …
  2. For each of these, select a representative marker (e.g., DAPI for the nuclear class, etc.).
  3. Collect multiple images of different cells tagged with the representative marker for each class.
  4. Test whether a system trained on some images of that marker can recognise other images of the same marker.
  5. Use cross-validation over these images. Get good results. Publish!

Here is the point of this paper: by using a single marker (a tagged protein or other fluorescent marker) for each class, we are unable to distinguish between two hypotheses: (a) the system is good at distinguishing the classes and (b) the system is good at distinguishing the markers. We show empirically that, in many cases, you are distinguishing markers and not locations!
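One way to avoid rewarding marker recognition is to split train and test data by marker rather than by image, so that no marker ever appears on both sides. Here is a hedged stdlib sketch of that idea (the function name, data layout, and the 25% test fraction are my own choices for illustration, not the paper's code):

```python
import random

def split_by_marker(image_markers, test_fraction=0.25, seed=0):
    """Train/test split where every image of a given marker lands on the
    same side, so a classifier cannot score by recognising the marker."""
    markers = sorted(set(image_markers.values()))
    rng = random.Random(seed)
    rng.shuffle(markers)
    n_test = max(1, int(len(markers) * test_fraction))
    test_markers = set(markers[:n_test])
    train = [img for img, m in image_markers.items() if m not in test_markers]
    test = [img for img, m in image_markers.items() if m in test_markers]
    return train, test

# Toy example: the split keeps each marker whole.
images = {"img1": "DAPI", "img2": "DAPI", "img3": "Hoechst",
          "img4": "GFP-Golgi", "img5": "GFP-Golgi"}
train, test = split_by_marker(images)
train_markers = {images[i] for i in train}
test_markers = {images[i] for i in test}
print(train_markers & test_markers)  # → set(): no marker on both sides
```

With such a split, good test accuracy can only come from recognising the location pattern, not the marker.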

This is a complex idea, and I will have at least another post just on this idea.

The natural follow-up question is how can we get better results in this new problem?

2. Local features work very well for bioimage analysis. Using SURF and an adaptation of SURF we obtained a large accuracy boost. The code is available in my library mahotas.

I had pointed out in my review of Liscovitch et al. that we had similarly obtained good results with local features.

I will have a few posts on this paper, including at least one on things that we left out because they did not work very well.

[1] All the evaluations that I know of. I may be biased towards the subcellular location literature (which I know very well); other literatures may have been aware of this problem. Add a comment below if you know of something.

Paper Review: FuncISH: learning a functional representation of neural ISH images

Noa Liscovitch, Uri Shalit, & Gal Chechik (2013). FuncISH: learning a functional representation of neural ISH images Bioinformatics DOI: 10.1093/bioinformatics/btt207

This is part of the ISMB 2013 Proceedings series, which I am interested in as I’ll be going to Berlin; it is also a Bioimage Informatics paper, which I’m keen to cover. So it was only natural that I’d review it here.


The authors are analysing in-situ hybridization (ISH) images from the Allen Brain Atlas. Figure 1 in the paper shows an example:



The authors use the images as input to a functional classifier. The input to this classifier is an image, and the output is a set of functional GO terms, or at least a confidence level for each GO term in the vocabulary.

You can read the details in Section 3.1, but the system does predict functional GO terms, especially, as one would expect, neuronal categories. This is very interesting, and I hope that the authors (or others) will pick up on the specific biology being predicted here and see if it can be used further. [1]

Alternatively, you can see this model as a dimensionality reduction approach, whereby images are projected into the space of GO terms. For this, one considers the continuous confidence levels rather than binary classifications.

In this space, it is possible to compute similarity scores between images, which operate at a functional rather than simply appearance level. The results are much better than simply comparing the image features directly (see Figure 4 for details). There is a lot of added value in considering the functional annotations rather than simple appearance.
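Comparing images in GO-term space rather than raw-feature space amounts to comparing their confidence vectors; cosine similarity is one natural choice. A minimal sketch (the GO-confidence numbers below are made up; the paper may well use a different similarity measure):

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length confidence vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Made-up GO-term confidence vectors for two images:
image_a = [0.9, 0.1, 0.7, 0.0]
image_b = [0.8, 0.2, 0.6, 0.1]
print(round(cosine_similarity(image_a, image_b), 3))  # → 0.989
```

Two images of genes with similar predicted functions score high here even if their raw appearance features are quite different, which is exactly the point of moving to the functional space.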


I was very interested in the methods and the details, as the authors used SIFT and a bag-of-words approach. I have a paper coming out showing that SURF plus bag-of-words works very well for subcellular location determination. This paper provides additional evidence that this family of techniques works well in bioimage analysis, even if the problem areas are different.

They make a few interesting remarks, which I’ll highlight here:

Although their name suggests otherwise, SIFT descriptors at several scales capture different types of patterns.

The original SIFT descriptors were developed for natural image matching, where the scale is unknown and may even vary within the same image (if one person is standing close by and another far away, they will appear at different scales). However, this is not the case in bioimage analysis.


Interestingly, the four visual words with the highest contribution to classification were the words counting the zero descriptors in each scale. This means that the highest information content lies in ‘least informative’ descriptors, and that overall expression levels (‘sparseness’ of expression) are important factors in functional prediction of genes based on their spatial expression.

This is interesting, although an alternative hypothesis is that the null descriptors capture a very different type of information, and, since there are only four of them, they capture all of this content between them. The other 2,000 words are often highly correlated with each other; as a group they carry high information content, but because of the L2-penalized regression, the weight is spread across the correlated values.
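The claim that an L2 penalty spreads weight across correlated predictors can be seen in a tiny closed-form example: duplicate a single predictor and ridge regression splits the weight evenly between the two copies, so neither copy looks individually important. A sketch using the normal equations for exactly two features (the helper name and toy data are my own):

```python
def ridge_2feat(X, y, lam):
    """Closed-form ridge regression for exactly two features:
    w = (X'X + lam*I)^{-1} X'y, via the 2x2 matrix inverse."""
    a = sum(x[0] * x[0] for x in X) + lam
    b = sum(x[0] * x[1] for x in X)
    d = sum(x[1] * x[1] for x in X) + lam
    t0 = sum(x[0] * yi for x, yi in zip(X, y))
    t1 = sum(x[1] * yi for x, yi in zip(X, y))
    det = a * d - b * b
    return ((d * t0 - b * t1) / det, (a * t1 - b * t0) / det)

# y = 2*x, and the second feature is an exact copy of the first.
xs = [1.0, 2.0, 3.0, 4.0]
X = [(x, x) for x in xs]
y = [2.0 * x for x in xs]
w1, w2 = ridge_2feat(X, y, lam=0.1)
print(round(w1, 3), round(w2, 3))  # → 0.998 0.998: the weight ~2 is split evenly
```

With 2,000 correlated visual words instead of two copies, the same mechanism dilutes each word's individual coefficient, which is why the four uncorrelated null-descriptor words can end up with the largest weights.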


Finally, I agree with this statement:

Combining local and global patterns of expression is, therefore, an important topic for further research.

[1] Unfortunately, my understanding of neuroscience does not go much beyond if I drink too much coffee, I get a headache. So, I cannot comment on whether these predictions make much sense.