Old Work: Unsupervised Subcellular Pattern Unmixing

Continuing down nostalgia lane, here is another old paper of mine:

Coelho, Luis Pedro, Tao Peng, and Robert F. Murphy. “Quantifying the Distribution of Probes Between Subcellular Locations Using Unsupervised Pattern Unmixing.” Bioinformatics 26.12 (2010): i7–i12. DOI: 10.1093/bioinformatics/btq220

I have already discussed the subcellular location determination problem: given images of a protein, can we assign it to an organelle?

This is, however, a simplified version of the world: many proteins are present in multiple organelles. They may move between organelles in response to a stimulus or as part of the cell cycle. For example, here is an image of mitochondria in green (nuclei in red):


Here is one of lysosomes:


And here is a mix of both:


This is a dataset constructed for the purpose of this work, so we know what is happening, but it simulates the situation where a protein is present in two locations simultaneously.

Thus, we can move beyond simple assignment of a protein to a single organelle and assign it to multiple organelles. In fact, some work (both from the Murphy group and others) has looked at subcellular location classification using multiple labels per image. This, however, is still not enough: we want to quantify the distribution.

This is the pattern unmixing problem. The goal is to go from an image (or a set of images) to something like the following: This is 30% nuclear and 70% cytoplasmic, which is very different from 70% nuclear and 30% cytoplasmic. The basic organelles can serve as the base patterns [1].

Before our paper, there was some work approaching this problem from a supervised perspective: given examples of the different organelles (i.e., of markers that localize to a single organelle), can we automatically build a system which, when given images of a protein distributed across multiple organelles, can figure out what fraction comes from each organelle?

Our paper extended this work to the unsupervised case: can you learn the mixture when you do not know what the base patterns are?
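The actual method works on automatically detected objects grouped into object types; as a rough illustration of the underlying idea only, here is a minimal sketch that unmixes synthetic per-image "object type" histograms with plain non-negative matrix factorization (the two base patterns and all the data below are made up, and the paper's algorithm is considerably more involved):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two hypothetical base patterns: distributions over 6 "object types"
# (stand-ins for object clusters learned from pure-marker images).
basis = np.array([
    [0.5, 0.3, 0.1, 0.1, 0.0, 0.0],   # pattern A (mitochondria-like)
    [0.0, 0.1, 0.1, 0.2, 0.3, 0.3],   # pattern B (lysosome-like)
])

# Simulate 40 images, each a noisy mixture of the two patterns.
fractions = rng.dirichlet([1.0, 1.0], size=40)       # true mixing fractions
X = np.clip(fractions @ basis + rng.normal(0, 0.01, (40, 6)), 0, None)

# Unsupervised unmixing: factor X ~ W @ H with W, H >= 0
# (classic multiplicative updates for Frobenius-norm NMF).
k = 2
W = rng.random((40, k))
H = rng.random((k, 6))
for _ in range(500):
    H *= (W.T @ X) / (W.T @ W @ H + 1e-9)
    W *= (X @ H.T) / (W @ H @ H.T + 1e-9)

err = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
frac_hat = W / W.sum(axis=1, keepdims=True)  # rough per-image fractions
print(f"relative reconstruction error: {err:.3f}")
```

The factorization is only identifiable up to permutation and scaling of the recovered patterns, which is one reason the unsupervised problem is genuinely harder than the supervised one.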


Peng, Tao, Ghislain M. C. Bonamy, Estelle Glory-Afshar, Daniel R. Rines, Sumit K. Chanda, and Robert F. Murphy. “Determining the Distribution of Probes Between Different Subcellular Locations Through Automated Unmixing of Subcellular Patterns.” PNAS 107.7 (2010): 2944–2949. DOI: 10.1073/pnas.0912090107

Zhao, T., M. Velliste, M.V. Boland, and R.F. Murphy. “Object Type Recognition for Automated Analysis of Protein Subcellular Location.” IEEE Transactions on Image Processing 14.9 (2005): 1351–1359.

[1] This is still a limited model because we are not sure even how many base patterns we should consider, but it will do for now.

Old papers: Structured Literature Image Finder (SLIF)

Still going down memory lane, I am presenting a couple of papers:

Coelho, L.P., A. Ahmed, A. Arnold, J. Kangas, A.S. Sheikh, E.P. Xing, W.W. Cohen, and R.F. Murphy. “Structured Literature Image Finder: Extracting Information from Text and Images in Biomedical Literature.” Linking Literature, Information, and Knowledge for Biology: 23–32. [DOI] [Murphylab PDF]

Ahmed, A., A. Arnold, L.P. Coelho, J. Kangas, A.S. Sheikh, E. Xing, W. Cohen, and R.F. Murphy. “Structured Literature Image Finder: Parsing Text and Figures in Biomedical Literature.” Web Semantics: Science, Services and Agents on the World Wide Web 8.2: 151–154. [DOI]

These papers refer to SLIF, which was the Subcellular Location Image Finder and later the Structured Literature Image Finder.

The initial goal of this project was to develop a system which parsed the scientific literature and extracted its figures (including their captions). Using text processing, the system attempted to infer what each image depicted; using computer vision, it attempted to interpret the image itself.

In particular, the focus was on subcellular image analysis for different proteins from fluorescence micrographs in the published literature.



Additionally, there was topic-model based navigation built on both the images and the caption text, which allowed for latent-model based browsing of the collection. Unfortunately, the site is currently offline, but our user study showed that it was a meaningful navigation model.


The final result was a proof-of-concept system. Most of the subsystems worked at reasonably high accuracy, but that was not sufficient for the overall inferences to be highly accurate: if an inference chain has six steps and each is 90% accurate, the end result is right only about half the time, which is much better than random guessing in large inference spaces, but still not directly trustable.
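The arithmetic behind that 50/50 figure, assuming the six steps fail independently:

```python
# End-to-end accuracy of a pipeline decays multiplicatively
# with its number of (roughly independent) steps.
p_step, n_steps = 0.90, 6
p_overall = p_step ** n_steps
print(round(p_overall, 3))  # → 0.531
```

Even modest per-step error rates compound quickly, which is why a chain of individually decent subsystems can still produce an untrustworthy overall inference.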

I think the vision is still valid and eventually the technology will be good enough. There is a lot of information in the biological literature which is not easy to get at, and much of it is in the form of images. SLIF was a first stab at getting at this data, complementing the better-known text-based approaches.


More information about SLIF (including references to the initial SLIF papers, of which I was not a part).

Is Cell Segmentation Needed for Cell Analysis?

Having just spent some posts discussing a paper on nuclear segmentation (all tagged posts), let me ask the question:

Is cell segmentation needed? Is this a necessary step in an analysis pipeline dealing with fluorescent cell images?

This is a common FAQ whenever I give a talk on my work which does not use segmentation, for example, using local features for classification (see the video). It is frequently asked because, for many people, it seems obvious that the answer is yes, of course you need cell segmentation. So, when they see me skip that step, they ask: shouldn’t you have segmented the cell regions?

Here is my answer:

Remember Vapnik’s dictum [1]: do not solve, as an intermediate step, a harder problem than the problem you really need to solve.

Thus the question becomes: does your scientific problem depend on cell segmentation? In the case, for example, of subcellular location determination, it does not: all the cells in the same field display the same phenotype, and your goal is to find out what that phenotype is. Therefore, you do not need an answer for each cell, only for the whole field.

In other problems, you may need a per-cell answer: for example, in some kinds of RNAi experiment, only a fraction of the cells in a field display the RNAi phenotype because the others did not take up the RNAi. There, segmentation may be necessary. Similarly, if a measurement such as the distance of fluorescent bodies to the cell membrane is meaningful by itself (as opposed to being used as a feature for classification), then you need segmentation.

However, sometimes you can get away without segmentation.


An important point to note is the following: while it would be good to have access to perfect segmentation, imperfect segmentation (i.e., the kind you actually get) may not help as much as the perfect kind.


To be clear, I was not the first person to notice that you do not need segmentation for subcellular location determination. I think this is the first reference:

Huang, Kai, and Robert F. Murphy. “Automated classification of subcellular patterns in multicell images without segmentation into single cells.” Biomedical Imaging: Nano to Macro, 2004. IEEE International Symposium on. IEEE, 2004. [Google scholar link]

[1] I’m quoting from memory, so it may be a bit off. It sounds obvious when put this way, but it is still often not respected in practice.

The “Label it twice” Principle

In many machine learning based applications, you need labeled data. This often means asking “experts” to label your data. I hereby introduce the label it twice principle:

Whenever you ask experts to label data, always get some of the data independently labeled by more than one of them.

I have seen projects where, because two people were deemed capable of labeling the data, the data was simply split 50/50 between them. This is a huge missed opportunity. If you cannot afford to have both labelers label all the data, split it 40-20-40: 40% for labeler 1, 40% for labeler 2, and 20% labeled by both.
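A minimal sketch of such a 40-20-40 assignment (the function name and interface are made up for illustration):

```python
import random

def label_it_twice_split(items, overlap=0.2, seed=0):
    """Split items so each labeler gets their own share plus a common
    overlap set, which can later be used to measure agreement."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_overlap = int(len(items) * overlap)
    n_each = (len(items) - n_overlap) // 2
    common = items[:n_overlap]
    only_a = items[n_overlap:n_overlap + n_each]
    only_b = items[n_overlap + n_each:]
    # Tasks for labeler 1 and labeler 2; both include the overlap set.
    return common + only_a, common + only_b

tasks1, tasks2 = label_it_twice_split(range(100))
print(len(tasks1), len(tasks2), len(set(tasks1) & set(tasks2)))  # → 60 60 20
```

The overlap items should be interleaved with each labeler's own share (and not flagged as shared), so that neither labeler treats them differently.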


There is so much evidence that human labelers can be unreliable and that inter-operator differences can be huge that it is always worth having some data with which to quantify this effect for your problem.
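With the overlap set in hand, agreement can be quantified with a chance-corrected statistic such as Cohen's kappa. A minimal sketch, with made-up organelle labels:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two labelers on the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if both labelers assigned labels at random
    # according to their own label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / n**2
    return (observed - expected) / (1 - expected)

a = ["mito", "mito", "lyso", "er", "mito", "lyso"]
b = ["mito", "lyso", "lyso", "er", "mito", "mito"]
print(round(cohens_kappa(a, b), 3))  # → 0.455
```

Raw percent agreement here is 67%, but kappa corrects for the agreement two labelers would reach by chance alone, which is why it is the more honest number to report.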


It often even works to the advantage of the automated method: when your method gets 90% accuracy, it is nice to be able to compare that to what a human could do.

In fact, in bioimage informatics, it is often the case that the conversation goes like this (rarely so clean and nice for my side, but it’s my blog and I’ll abridge if I want to):

Me: Our automated method gets 90% accuracy.

Audience member: Doesn’t that just show that it’s not ready for prime time? I mean, if it fails 10% of the time…

Me: The alternative right now is human visual analysis.

Audience member: Experts will know better.

Me: We measured, they don’t. You think this is an easy problem by picturing extreme phenotypes in your mind. Many real cells are actually much more subtle, especially in high throughput data.

Audience member: OK, that’s a fair point. How well do people do?

Me: 90%, give or take.

Audience member: Oh. And could I use this automated method on my problem?


References for unreliable labelers:

MacLeod, Norman, Mark Benfield, and Phil Culverhouse. “Time to Automate Identification.” Nature 467.7312 (2010): 154–155.

Nattkemper, T.W., T. Twellmann, H. Ritter, and W. Schubert. “Human vs. Machine: Evaluation of Fluorescence Micrographs.” Computers in Biology and Medicine 33.1 (2003): 31–43.

What is a Gene? The Definitive Answer

I think the GenBank file spec gets the definition just right:

gene: A region of biological interest identified as a gene and for which a name has been assigned.

That’s basically it. If people call it a gene, it’s a gene.


When people call something a gene, they could mean:

  • a region in the genome that gets transcribed (or translated; but are introns no longer part of it?)
  • the nucleotide or amino-acid code in those regions
  • the “reference” nucleotide code that is expected in that region
  • the homologs of that (or orthologs or paralogs, or purposefully remaining fuzzy because it’s hard to say what’s what)
  • the regions of genome that cluster together across different organisms
  • a higher level concept that groups several proteins together through inferred orthology
  • (or perhaps even convergent evolution)
  • the protein encoded by the gene (or the general cluster of proteins)

In many discussions, gene is a good word to rationalist taboo. It clears up many mistakes when people are obliged to say what they mean by this tricky word.


Another good word to taboo is species, at least when the organisms are bacteria.

To even use the same word as we have for animals is probably a mistake. We need a word for “bacteria whose rRNA clusters together in nucleotide space” without all of the baggage of species.
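As a sketch of what such a computational definition looks like in practice, here is a toy greedy clustering at an identity threshold, in the spirit of (but far cruder than) OTU-clustering tools; the position-wise identity metric and the thresholds below are illustrative only:

```python
def identity(a, b):
    """Fraction of matching positions (toy metric; real pipelines
    use alignment-based identity)."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

def greedy_cluster(seqs, threshold=0.97):
    """Greedy clustering: each sequence joins the first cluster whose
    representative it matches at >= threshold identity, else it
    founds a new cluster. The classic rRNA cutoff is around 97%."""
    reps, clusters = [], []
    for s in seqs:
        for i, r in enumerate(reps):
            if identity(s, r) >= threshold:
                clusters[i].append(s)
                break
        else:
            reps.append(s)
            clusters.append([s])
    return clusters

seqs = ["ACGTACGTAC", "ACGTACGTAT", "TTTTGGGGCC"]
print(len(greedy_cluster(seqs, threshold=0.9)))  # → 2
```

Note that the result depends on the input order and on an arbitrary threshold, which is exactly the kind of baggage-free, operational definition being argued for: no claim about reproductive isolation or shared biology, just clusters in nucleotide space.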

And so we would replace a venerable biological concept with a computational definition: progress!