Classifying protists into 155 (hierarchically organized) classes

An important component of my recent paper (previous post) on imaging protist (micro-eukaryote) communities is a classifier that assigns each individual object to one of 155 classes. These classes are organized hierarchically: the first level separates living from non-living objects; living objects are then classified into phyla, and so on. This is the graphical representation we have in the paper:

Using a large training set (>18,000 objects), we built a classifier capable of assigning objects to one of these 155 classes with >82% accuracy.

What is the ML architecture we use? In the end, we use the traditional system: we compute many features and use a random forest trained on the full 155 classes. Why a random forest?

A random forest should be the first thing you try on a supervised classification problem (and perhaps also the last, lest you overfit). I did spend a few weeks trying different variations on this idea and none of them beat this simplest possible system. Random forests are also very fast to train (especially if you have a machine with many cores, as each tree can be learned independently).
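For concreteness, here is a minimal sketch of this kind of setup with scikit-learn. The feature matrix and labels below are random stand-ins, not the paper's data:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
import numpy as np

# Random stand-ins: in the real setting, X holds ~480 features per object
# and y holds one of the 155 (hierarchically organized) class labels.
rng = np.random.RandomState(0)
X = rng.rand(18000, 480)
y = rng.randint(155, size=18000)

# Each tree is learned independently, so n_jobs=-1 uses all available cores.
clf = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())
```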

As usual, the features were where the real work went. A reviewer astutely asked whether we really needed so many features (we compute 480 of them). The answer is yes. Even when selecting just the best features (which we wouldn't know a priori, but let's assume we had an oracle), it seems that we really do need a lot of features:

(This is Figure 3 — supplement 4: https://elifesciences.org/articles/26066/figures#fig3s4sdata1)

We need at least 200 features and accuracy never really saturates. Furthermore, features are computed in groups (Haralick features, Zernike features, …), so we would not gain much by computing only a subset of them.

In terms of implementation, features were computed with mahotas (paper) and machine learning was done with scikit-learn (paper).
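As a rough illustration of the feature-computation side, here is a much-reduced sketch with mahotas, showing only two of the feature families mentioned above (the real pipeline computes ~480 features across several channels):

```python
import numpy as np
import mahotas as mh

def compute_features(image):
    # Much-reduced sketch: Haralick textures and Zernike moments only.
    image = mh.stretch(image)  # rescale to 0..255 integer values
    haralick = mh.features.haralick(image).ravel()
    zernike = mh.features.zernike_moments(image, radius=min(image.shape) // 2)
    return np.concatenate([haralick, zernike])

features = compute_features(np.random.rand(128, 128))
```

Stacking these vectors into a matrix and calling `fit` on the random forest, as sketched above, is essentially all there is on the classifier side.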

§

What about Deep Learning? Could we have used CNNs? Maybe, maybe not. We have a fair amount of data (>18,000 labeled samples), but some of the classes are not as well represented (in the pie chart above, the width of the classes represents how many objects are in the training set). A priori, it’s not clear it would have helped much.

Also, we may already be at the edge of what’s possible. Accuracy above 80% is already similar to human performance (unlike some of the more traditional computer vision problems, where humans perform with almost no mistakes and computers had very high error rates prior to the neural network revolution).


New papers I: imaging environmental samples of micro eukaryotes

This week, I had two first author papers published:

  1. Quantitative 3D-imaging for cell biology and ecology of environmental microbial eukaryotes 
  2. Jug: Software for Parallel Reproducible Computation in Python

I intend to post on both of them over the next week or so, but I will start with the first one.

The basic idea is that just as metagenomics was the application of lab techniques (sequencing) that had been developed for pure cultures to environmental samples, we are moving from imaging cell cultures (the type of work I did during my PhD and shortly afterwards) to imaging environmental samples. These are, thus, mixed samples of microbes (micro-eukaryotes, not bacteria, but remember: protists are microbes too).

Figure 1 from the paper, depicting the process (a) and the results (b & c).

The result is a phenotypic view of the whole community, not just the elements that are easy to grow in the lab. As it is not known a priori which organisms will be present, we use generic eukaryotic dyes, tagging DNA, membranes, and the exterior. In addition, chlorophyll is auto-fluorescent, so we get a free extra channel.

With automated microscopes and automated analysis, we obtained images of 300,000 organisms, which were classified into 155 classes. A simple machine-learning system can perform this classification with 82% accuracy, which is similar to (or better than) the inter-operator variability in similar problems.

The result is both a very large set of images and a large set of features, which can be exploited to understand the microbial community.

Paper Review: Approaches to automatic parameter fitting in a microscopy image segmentation pipeline: An exploratory parameter space analysis

Held C, Nattkemper T, Palmisano R, Wittenberg T. Approaches to automatic parameter fitting in a microscopy image segmentation pipeline: An exploratory parameter space analysis. J Pathol Inform 2013;4:5. DOI: 10.4103/2153-3539.109831

I once heard Larry Wasserman claim that all problems in statistics are solved, except one: how to set λ. By this he meant (or I understood, or I remember; in fact, he may not even have claimed this and I am just assigning a nice quip to a famous name) that we have methods that work very well in most settings, but they tend to come with parameters, and adjusting these parameters (often called λ₁, λ₂, … in statistics) is what remains hard.

In traditional image processing, parameters abound too. Thresholds and weights are abundant in the published literature. Often, tuning them to a specific dataset is an unfortunate necessity. It also makes the published results from different authors almost incomparable as they often tune their own algorithms much harder than those of others.

In this paper, the problem of setting the parameters is viewed as an optimization problem using a supervised machine learning approach where the goal is to set parameters that reproduce a gold standard.
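To make the framing concrete, here is an illustrative sketch of the idea (not the authors' pipeline): fix a gold-standard mask, expose a pipeline's free parameter, and search for the value that maximizes agreement with the gold standard.

```python
import numpy as np

def jaccard(candidate, gold):
    # Agreement between a candidate segmentation and the gold standard.
    candidate, gold = candidate.astype(bool), gold.astype(bool)
    union = np.logical_or(candidate, gold).sum()
    return np.logical_and(candidate, gold).sum() / union if union else 1.0

def segment(image, threshold):
    # Stand-in for a segmentation pipeline with a single free parameter.
    return image > threshold

def fit_parameter(image, gold, candidates):
    # Exhaustive search over a 1-D parameter space; for larger spaces the
    # paper considers smarter strategies (e.g., genetic algorithms).
    scores = [jaccard(segment(image, t), gold) for t in candidates]
    return candidates[int(np.argmax(scores))]
```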

The setup is interesting and it's definitely a good idea to explore this way of thinking. Unfortunately, the paper is very short (just as it's getting good, it ends). Thus, there aren't a lot of results, beyond the observations that local minima can be a problem and that genetic algorithms do pretty well, at a high computational cost. For example, there is a short discussion of human behaviour in parameter tuning, and one hopes for an experimental validation of these speculations (particularly given that the second author is well-known for earlier work on this theme).

I will be looking out for follow-up work from the same authors.

Paper Review: Automated prior knowledge-based quantification of neuronal patterns in the spinal cord of zebrafish

Automated prior knowledge-based quantification of neuronal patterns in the spinal cord of zebrafish by Johannes Stegmaier, Maryam Shahid, Masanari Takamiya, Lixin Yang, Sepand Rastegar, Markus Reischl, Uwe Strähle, and Ralf Mikut. in Bioinformatics (2013) [DOI]

It’s been a while since I’ve had a paper review, even though one of my goals is to give more space to bioimage informatics. So, I will try to make up for it in the next few weeks. This is a paper which is not exactly hot off the press (it came out two months ago), but still very recent.

The authors are working with zebrafish. Unfortunately, I am unable to evaluate the biological results as I do not know much about zebrafish, but I can appreciate the methodological contributions. I will illustrate some of the methods based on a figure (Fig 2) from the paper:

Figure 2 from the paper.

The top panel is the data (a fish spinal cord, cropped out of a larger field); the next two show a binarization of the same data and a line fit (in red). Finally, the bottom panel shows the effect of straightening the image to a line. This allows for comparison between different images by morphing them all to a common template. The alignment is performed on only one of the channels, while the others can carry complementary information.
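Here is a very crude sketch of the straightening step, assuming a binarized mask of the structure (the paper does a proper fit and full morphing; here each column is simply shifted so the fitted centre line becomes horizontal):

```python
import numpy as np

def straighten(image, mask, degree=3):
    # Fit a polynomial centre line to the binarized structure.
    ys, xs = np.nonzero(mask)
    centre = np.poly1d(np.polyfit(xs, ys, degree))
    # Shift each column so that the centre line lands on the middle row.
    out = np.zeros_like(image)
    mid = image.shape[0] // 2
    for x in range(image.shape[1]):
        out[:, x] = np.roll(image[:, x], mid - int(round(centre(x))))
    return out
```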

§

This is very similar to work that has been done on straightening C. elegans images (e.g., Peng et al., 2008), both in intent and in some of the general methods (although there you often morph the whole space instead of just a band of interest). It is a bit unfortunate that the bioimage informatics literature is sometimes organized by model system when many methods can profitably be used across problems.

§

Finally, I really like this visualization, but I need to give you a bit of background to explain it (if I understood it correctly). Once a profile has been straightened (panel D in the figure above), you can summarize it by averaging along the horizontal dimension to get the average intensity at each location (where zero is the centre of the spinal cord) [1]. You can then stack these profiles (analogously to what you'd do to obtain a kymograph) as a function of your perturbation (in this case, a drug concentration):

Figure 6 from the paper.

The effect of the drug (and its saturation) becomes obvious.
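If I understood the construction correctly, the figure is built roughly like this (a minimal numpy sketch with stand-in data):

```python
import numpy as np

# Stand-ins: one straightened image per drug concentration
# (rows: distance from the centre line, columns: position along the cord).
straightened = [np.random.rand(81, 400) for _ in range(6)]

# Average along the cord to get one intensity profile per concentration,
# then stack the profiles into a single heat-map-like array.
profiles = np.stack([img.mean(axis=1) for img in straightened])
# profiles[i, j] is the mean intensity at distance j from the centre
# for the i-th concentration.
```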

§

As a final note, I’ll leave you with this quote from the paper, which validates some of what I said before: the quality of human evaluation is consistently over-estimated:

Initial tests unveiled intra-expert and inter-expert variations of the extracted values, leading to the conclusion that even a trained evaluator is not able to satisfactorily reproduce results.

[1] The authors average a different marker than the one used for straightening, but since I know little about zebrafish biology, I focus on the methods.

FAQ: How Many Clusters Did You Use?

Luis Pedro Coelho, Joshua D. Kangas, Armaghan Naik, Elvira Osuna-Highley, Estelle Glory-Afshar, Margaret Fuhrman, Ramanuja Simha, Peter B. Berget, Jonathan W. Jarvik, and Robert F. Murphy, Determining the subcellular location of new proteins from microscope images using local features in Bioinformatics, 2013 [Advanced Access]  [Previous discussion on this blog]

Coelho, Luis Pedro, Tao Peng, and Robert F. Murphy. “Quantifying the Distribution of Probes Between Subcellular Locations Using Unsupervised Pattern Unmixing.” Bioinformatics 26.12 (2010): i7–i12. DOI: 10.1093/bioinformatics/btq220  [Previous discussion on this blog]

Both of my Bioinformatics papers above use the concept of bag of visual words. The first for classification, the second for pattern unmixing.

Visual words are formed by clustering local appearance descriptors. The descriptors may have different origins (see the papers above and the references below) and the visual words are used differently, but the clustering is a common intermediate step.

A common question when I present this work is: how many clusters do I use? Here's the answer: it does not matter too much.

I used to just pick a round number like 256 or 512, but for the local features paper, I decided to look at the issue a bit closer. This is one of the panels from the paper, showing accuracy (y-axis) as a function of the number of clusters (x-axis):

(Panel from the paper: accuracy vs. number of clusters.)

As you can see, if you use enough clusters, you’ll do fine. If I had extended the results rightwards, then you’d see a plateau (read the full paper & supplements for these results) and then a drop-off. The vertical line shows N/4, where N is the number of images in the study. This seems like a good heuristic across several datasets.

One very interesting result is that choosing clusters by minimising AIC can be counter-productive! Here is the killer data (remember, we would be minimizing the AIC):

(Panel from the paper: accuracy vs. AIC.)

Minimizing the AIC leads to lower accuracy! AIC was never intended to be used in this context, of course, but it is often used as a criterion to select the number of clusters. I’ve done it myself.

Punchline: if doing classification using visual words, minimising the AIC may be detrimental; try using N/4 clusters (N = number of images).
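In code, the heuristic amounts to something like the following sketch (the descriptors here are random placeholders standing in for the local appearance descriptors used in the papers):

```python
import numpy as np
from sklearn.cluster import KMeans

def visual_word_histograms(descriptors_per_image):
    # Codebook size: N/4 visual words, where N is the number of images.
    n_clusters = max(1, len(descriptors_per_image) // 4)
    codebook = KMeans(n_clusters=n_clusters, random_state=0)
    codebook.fit(np.vstack(descriptors_per_image))
    # Represent each image as a histogram over the visual words.
    return np.array([np.bincount(codebook.predict(d), minlength=n_clusters)
                     for d in descriptors_per_image])

# Placeholder descriptors: 40 images, each with 100 local 64-d descriptors.
rng = np.random.RandomState(0)
histograms = visual_word_histograms([rng.rand(100, 64) for _ in range(40)])
```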

Other References

This paper (reviewed before on this blog) presents supporting data too:

Noa Liscovitch, Uri Shalit, & Gal Chechik (2013). FuncISH: learning a functional representation of neural ISH images Bioinformatics DOI: 10.1093/bioinformatics/btt207

Unsupervised subcellular pattern unmixing. Part II

On Friday, I presented the pattern unmixing problem. Today, I’ll discuss how we solved it.

Coelho, Luis Pedro, Tao Peng, and Robert F. Murphy. “Quantifying the Distribution of Probes Between Subcellular Locations Using Unsupervised Pattern Unmixing.” Bioinformatics 26.12 (2010): i7–i12. DOI: 10.1093/bioinformatics/btq220

The first step is to extract objects. For this, we use a combination of global and local thresholding: a pixel is on if it is both above a global threshold (which separates the cells from the background) and above a local threshold (which identifies subcellular objects [1]).
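A rough sketch of this combined thresholding with mahotas (the smoothing scale and other details here are illustrative guesses, not the paper's exact settings):

```python
import mahotas as mh
import numpy as np

def detect_objects(image):
    image = mh.stretch(image)              # rescale to 0..255
    global_mask = image > mh.otsu(image)   # cells vs. background
    # Local criterion: brighter than a smoothed estimate of the
    # surrounding in-cell background.
    local_mask = image > mh.gaussian_filter(image, 8)
    labelled, n_objects = mh.label(global_mask & local_mask)
    return labelled, n_objects

labelled, n_objects = detect_objects(np.random.rand(256, 256))
```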

We then group the detected objects using k-means clustering. Here is what we obtain for a lysosomal image (different colours mean different clusters) [2]:

(Objects in the lysosomal image, coloured by cluster.)

and the equivalent for the mitochondrial image:

(Objects in the mitochondrial image, coloured by cluster.)

You will see that the mitochondrial image has many green objects and fewer dark purple ones, but both the mitochondrial and lysosomal images contain all of the groups. Now (and this is an important point): we do not attempt to classify each individual object, only to estimate the mixture.

§

Of course, if we had the identity of each object, the mixture would be trivially estimated. But we do not need to identify each object. In fact, to attempt to do so would be a gross violation of Vapnik’s Dictum (which says do not solve, as an intermediate step, a harder problem than the one you are trying to solve). It is easier to just estimate the mixtures [3].

In this formulation it might not even matter much that some of the objects we detect correspond to multiple biological objects!

§

How do we solve the mixture problem? Latent Dirichlet allocation or basis pursuit. The details are in the paper, but I will jump to the punchline.
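To give a flavour of the estimation step, here is a simplified non-negative least-squares version (not the actual LDA or basis pursuit formulations from the paper): represent each condition as a histogram of object-cluster counts and solve for the mixture coefficients over the base patterns.

```python
import numpy as np
from scipy.optimize import nnls

def estimate_mixture(mixed_counts, basis_counts):
    # mixed_counts: object-cluster histogram for the mixed condition
    # basis_counts: one histogram per base pattern (rows)
    basis = basis_counts.T / basis_counts.sum(axis=1)   # normalize each pattern
    target = mixed_counts / mixed_counts.sum()
    coeffs, _ = nnls(basis, target)
    return coeffs / coeffs.sum()

# Toy example: two base patterns over five object clusters.
basis = np.array([[40., 5., 30., 20., 5.],
                  [5., 50., 10., 5., 30.]])
mixed = 0.3 * basis[0] + 0.7 * basis[1]
print(estimate_mixture(mixed, basis))   # roughly [0.3, 0.7]
```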

We tested the method using a dataset where we had manipulated the cell tagging so we know the ground truth (but the algorithm, naturally, does not see it). On the graph below, the x-axis is the (hidden) truth and the y-axis is the automated estimate. In green, the perfect diagonal; and each dot represents one condition:

(Figure from the paper: estimated vs. true mixture fractions.)

§

I will note that each individual dot in the above plot aggregates several images from a single condition. At the single-image (or single-cell) level, the prediction is not as accurate. Only by aggregating a large number of objects can the model predict well.

This also points to why it may be very difficult for humans to perform this task (nobody has actually tried it, though).

[1] A global threshold did not appear to be sufficient for this because there is a lot of in-cell background light (auto-fluorescence and out-of-focus light).
[2] For this picture, I used 5 clusters to get 5 different colours. The real process used a larger number, obtained by minimising BIC.
[3] Sure, we can then reverse engineer and obtain a probability distribution for each individual object, but that is not the goal.

Old Work: Unsupervised Subcellular Pattern Unmixing

Continuing down nostalgia lane, here is another old paper of mine:

Coelho, Luis Pedro, Tao Peng, and Robert F. Murphy. “Quantifying the Distribution of Probes Between Subcellular Locations Using Unsupervised Pattern Unmixing.” Bioinformatics 26.12 (2010): i7–i12. DOI: 10.1093/bioinformatics/btq220

I have already discussed the subcellular location determination problem. This is: given images of a protein, can we assign it to an organelle?

This is, however, a simplified version of the world: many proteins are present in multiple organelles. They may move between organelles in response to a stimulus or as part of the cell cycle. For example, here is an image of mitochondria in green (nuclei in red):

(Mitochondrial image.)

Here is one of lysosomes:

(Lysosomal image.)

And here is a mix of both!

(Mixed image.)

This is a dataset constructed for the purpose of this work, so we know what is happening, but it simulates the situation where a protein is present in two locations simultaneously.

Thus, we can move beyond simple assignment of a protein to an organelle to assigning it to multiple organelles. In fact, some work (both from the Murphy group and others) has looked at subcellular location classification using multiple labels per image. This, however, is still not enough: we want to quantify the mixture.

This is the pattern unmixing problem. The goal is to go from an image (or a set of images) to a statement like: this is 30% nuclear and 70% cytoplasmic, which is very different from 70% nuclear and 30% cytoplasmic. The basic organelles can serve as the base patterns [1].

Before our paper, there was some work approaching this problem from a supervised perspective: given examples of individual organelles (i.e., of markers that localize to a single organelle), can we automatically build a system which, when given images of a protein distributed across multiple organelles, figures out what fraction comes from each organelle?

Our paper extended this work to the unsupervised case: can you learn a mixture when you do not know what the base patterns are?

References

Tao Peng, Ghislain M. C. Bonamy, Estelle Glory-Afshar, Daniel R. Rines, Sumit K. Chanda, and Robert F. Murphy. "Determining the distribution of probes between different subcellular locations through automated unmixing of subcellular patterns." PNAS 107 (7): 2944-2949 (2010). DOI: 10.1073/pnas.0912090107

T. Zhao, M. Velliste, M. V. Boland, and R. F. Murphy. "Object type recognition for automated analysis of protein subcellular location." IEEE Transactions on Image Processing 14 (9): 1351-1359 (2005).

[1] This is still a limited model because we are not sure even how many base patterns we should consider, but it will do for now.