Classifying protists into 155 (hierarchically organized) classes

An important component of my recent paper (previous post) on imaging protist (micro-eukaryote) communities is a classifier that assigns each individual object to one of 155 classes. These classes are organized hierarchically: the first level separates living from non-living objects; living objects are then classified into phyla, and so on. This is the graphical representation we have in the paper:

Using a large training set (>18,000 labeled objects), we built a classifier capable of assigning objects to one of these 155 classes with >82% accuracy.
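Because the classes are hierarchical, a prediction at the finest level also implies a prediction at every coarser level. Here is a minimal sketch of that roll-up. The class names and paths are made up for illustration (they are not the 155 classes from the paper), and `roll_up` and `accuracy_at_level` are hypothetical helpers, not code from the paper:

```python
# Hypothetical class paths, root first; the real hierarchy has 155 leaves.
HIERARCHY = {
    "diatom_chain": ("living", "Bacillariophyta", "diatom_chain"),
    "diatom_single": ("living", "Bacillariophyta", "diatom_single"),
    "dinoflagellate": ("living", "Dinoflagellata", "dinoflagellate"),
    "detritus": ("non-living", "detritus"),
    "bubble": ("non-living", "bubble"),
}


def roll_up(leaf, level):
    """Return the ancestor of `leaf` at depth `level` (0 = living/non-living)."""
    path = HIERARCHY[leaf]
    # Shallow branches stop early: reuse their last node for deeper levels
    return path[min(level, len(path) - 1)]


def accuracy_at_level(true_leaves, predicted_leaves, level):
    """Fraction of objects whose prediction matches the truth at `level`."""
    hits = sum(
        roll_up(t, level) == roll_up(p, level)
        for t, p in zip(true_leaves, predicted_leaves)
    )
    return hits / len(true_leaves)


true = ["diatom_chain", "diatom_single", "dinoflagellate", "detritus"]
pred = ["diatom_single", "diatom_single", "detritus", "bubble"]

for level in range(3):
    print(level, accuracy_at_level(true, pred, level))
```

In this toy example, accuracy is 0.75 at the living/non-living level, 0.5 one level down, and 0.25 at the leaves: a mistake between two sibling classes only costs you at the finest level.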

What is the ML architecture we use? In the end, we use the traditional system: we compute many features and use a random forest trained on the full 155 classes. Why a random forest?

A random forest should be the first thing you try on a supervised classification problem (and perhaps also the last, lest you overfit). I did spend a few weeks trying different variations on this idea and none of them beat this simplest possible system. Random forests are also very fast to train (especially if you have a machine with many cores, as each tree can be learned independently).
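To make the "many cores" point concrete, here is a minimal scikit-learn sketch. It is not the code from the paper: the data are synthetic (generated with `make_classification`), and the sizes and the number of trees are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real feature matrix (objects x features)
X, y = make_classification(
    n_samples=600, n_features=40, n_informative=15,
    n_classes=5, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0,
)

# n_jobs=-1 asks scikit-learn to use all available cores to grow the trees
forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
forest.fit(X_train, y_train)

accuracy = forest.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Since the trees do not depend on each other, the only change needed to go from one core to all of them is the `n_jobs` argument.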

As usual, the features were where the real work went. A reviewer astutely asked whether we really needed so many features (we compute 480 of them). The answer is yes. Even when selecting just the best features (which we wouldn’t know a priori, but let’s assume we had an oracle), it seems that we really do need a lot of features:

(This is Figure 3 — supplement 4: https://elifesciences.org/articles/26066/figures#fig3s4sdata1)

We need at least 200 features and accuracy never really saturates. Furthermore, features are computed in groups (Haralick features, Zernike features, …), so we would not gain much by dropping a few individual features.
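For readers who want to run this kind of analysis on their own data, here is a rough sketch of an accuracy-versus-number-of-features curve. It is an assumption-laden stand-in, not the procedure from the paper: the data are synthetic, the "oracle" ranking is taken from random forest feature importances computed on all the data, and `accuracy_with_top` is a hypothetical helper.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real feature matrix (objects x features)
X, y = make_classification(
    n_samples=400, n_features=60, n_informative=20,
    n_classes=4, random_state=0,
)

# Rank features by importance from a forest fit on all the data.
# This peeks at the labels, so it plays the role of the "oracle".
ranker = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
ranker.fit(X, y)
order = np.argsort(ranker.feature_importances_)[::-1]


def accuracy_with_top(k):
    """Cross-validated accuracy using only the k highest-ranked features."""
    clf = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
    return cross_val_score(clf, X[:, order[:k]], y, cv=3).mean()


curve = {k: accuracy_with_top(k) for k in (5, 10, 20, 40, 60)}
for k, acc in curve.items():
    print(f"{k:3d} features: {acc:.2f}")
```

Because the ranking has seen the labels, the resulting curve is optimistic, which is exactly the point of an oracle: if accuracy still keeps climbing under these favourable conditions, the extra features are doing real work.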

In terms of implementation, features were computed with mahotas (paper) and machine learning was done with scikit-learn (paper).

§

What about Deep Learning? Could we have used CNNs? Maybe, maybe not. We have a fair amount of data (>18,000 labeled samples), but some of the classes are not as well represented (in the pie chart above, the width of the classes represents how many objects are in the training set). A priori, it’s not clear it would have helped much.

Also, we may already be at the edge of what’s possible. Accuracy above 80% is already similar to human performance (unlike some of the more traditional computer vision problems, where humans perform with almost no mistakes and computers had very high error rates prior to the neural network revolution).

New papers I: imaging environmental samples of micro eukaryotes

This week, I had two first-author papers published:

  1. Quantitative 3D-imaging for cell biology and ecology of environmental microbial eukaryotes 
  2. Jug: Software for Parallel Reproducible Computation in Python

I intend to post on both of them over the next week or so, but I will start with the first one.

The basic idea is that just as metagenomics was the application of lab techniques (sequencing) that had been developed for pure cultures to environmental samples, we are moving from imaging cell cultures (the type of work I did during my PhD and shortly afterwards) to imaging environmental samples. These are, thus, mixed samples of microbes (micro-eukaryotes, not bacteria, but remember: protists are microbes too).

Figure 1 from the paper depicting the process (a) and the results (b & c).

The result is a phenotypic view of the whole community, not just the elements that you can easily grow in the lab. As it is not known a priori which organisms will be present, we use generic eukaryotic dyes, tagging DNA, membranes, and the exterior. In addition, chlorophyll is auto-fluorescent, so we get a free extra channel.

With automated microscopes and automated analysis, we obtained images of 300,000 organisms, which were classified into 155 classes. A simple machine-learning system can perform this classification with 82% accuracy, which is similar to (or better than) the inter-operator variability in similar problems.

The result is both a very large set of images and a large set of features, which can be exploited for understanding the microbial community.