Friday Links

1. DNA-driven breeding as an alternative to standard GMOs

2. Taxing the rich could boost growth by giving us fewer bankers and more teachers & scientists

I, for one, am happy that someone is trying to solve the problem of too many job openings for scientists and too few scientists by thinking up ways to make any alternative less appealing.

3. Better than human face recognition. Naturally, we generally do face recognition based on a 3D presence, but this objection is like the traditional definition of artificial intelligence: real intelligence is whatever computers cannot yet do.

4. Changing my mind, a personal history

Mahotas 1.1.0 Released!


I released mahotas 1.1.0 yesterday.

Use pip install mahotas --upgrade to upgrade.

Mahotas is my computer vision library for Python.

Summary of Changes

This release adds the functions resize_to and resize_rgb_to, which can be used like this:

import mahotas as mh
lena = mh.demos.load('lena')
big = mh.resize.resize_rgb_to(lena, [1024, 1024])

It also adds remove_regions_where, which is useful for handling labeled images:

import mahotas as mh
nuclear = mh.demos.load('nuclear')
nuclear = mh.gaussian_filter(nuclear, 2)
labeled,_ = mh.label(nuclear > nuclear.mean())

# Ok, now remove small regions:

sizes = mh.labeled.labeled_size(labeled)

labeled = mh.labeled.remove_regions_where(
        labeled, sizes < 100)

Moments computation can now be done in a normalized mode, which is robust against scale changes:

import mahotas as mh
lena = mh.demos.load('lena', as_grey=1)
print mh.features.moments.moments(lena, 1, 2, normalize=1)
print mh.features.moments.moments(lena[::2], 1, 2, normalize=1)
print mh.features.moments.moments(lena[::2,::3], 1, 2, normalize=1)

prints 126.609789161 126.618233592 126.640228523

You can even spell the keyword argument “normalise”!

print mh.features.moments.moments(lena[::2,::3], 1, 2, normalise=1)

This release also contains some bugfixes to SLIC superpixels and to convolutions of very small images.

(If you like and use mahotas, please cite the software paper.)

FAQ: How Many Clusters Did You Use?

Luis Pedro Coelho, Joshua D. Kangas, Armaghan Naik, Elvira Osuna-Highley, Estelle Glory-Afshar, Margaret Fuhrman, Ramanuja Simha, Peter B. Berget, Jonathan W. Jarvik, and Robert F. Murphy. “Determining the subcellular location of new proteins from microscope images using local features.” Bioinformatics, 2013 [Advance Access]  [Previous discussion on this blog]

Coelho, Luis Pedro, Tao Peng, and Robert F. Murphy. “Quantifying the Distribution of Probes Between Subcellular Locations Using Unsupervised Pattern Unmixing.” Bioinformatics 26.12 (2010): i7–i12. DOI: 10.1093/bioinformatics/btq220  [Previous discussion on this blog]

Both of my Bioinformatics papers above use the concept of a bag of visual words: the first for classification, the second for pattern unmixing.

Visual words are formed by clustering local appearance descriptors. The descriptors may have different origins (see the papers above and the references below) and the visual words are used differently, but the clustering is a common intermediate step.

A common question when I present this work is: how many clusters do you use? Here is the answer: it does not matter too much.

I used to just pick a round number like 256 or 512, but for the local features paper, I decided to look at the issue a bit more closely. This is one of the panels from the paper, showing accuracy (y-axis) as a function of the number of clusters (x-axis):

[Figure: classification accuracy (y-axis) as a function of the number of clusters (x-axis), RT widefield dataset]

As you can see, if you use enough clusters, you'll do fine. Had I extended the plot rightwards, you would see a plateau (read the full paper & supplements for those results) and then a drop-off. The vertical line marks N/4, where N is the number of images in the study; this seems like a good heuristic across several datasets.
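
For concreteness, here is a minimal sketch of the clustering step with the N/4 heuristic, using scikit-learn's MiniBatchKMeans. This is my illustration, not the code from the paper, and the descriptor extraction is left abstract:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

def visual_word_histograms(descriptors_per_image):
    # descriptors_per_image: a list with one (n_i x d) array per image,
    # holding that image's local appearance descriptors (e.g., SURF)
    n_images = len(descriptors_per_image)
    k = max(2, n_images // 4)  # the N/4 heuristic discussed above

    # cluster all descriptors together to define the visual words
    km = MiniBatchKMeans(n_clusters=k, random_state=0)
    km.fit(np.vstack(descriptors_per_image))

    # represent each image as a histogram of visual word counts
    return np.array([np.bincount(km.predict(d), minlength=k)
                     for d in descriptors_per_image])

These histograms then go into a standard classifier; as the plot suggests, the exact value of k matters little once it is large enough.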

One very interesting result is that choosing clusters by minimising AIC can be counter-productive! Here is the killer data (remember, we would be minimizing the AIC):

[Figure: classification accuracy as a function of the AIC, RT widefield dataset]

Minimizing the AIC leads to lower accuracy! AIC was never intended to be used in this context, of course, but it is often used as a criterion to select the number of clusters. I’ve done it myself.
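
For concreteness, this is the kind of AIC-driven selection I mean (a sketch using scikit-learn's Gaussian mixtures; the experiments in the paper used k-means on visual descriptors, so treat this only as an illustration of the procedure being criticised):

import numpy as np
from sklearn.mixture import GaussianMixture

def pick_k_by_aic(descriptors, candidate_ks=(32, 64, 128, 256, 512)):
    # fit a mixture for each candidate k and keep the one with the lowest
    # AIC; the point of this post is that the k chosen this way tends to
    # be too small for good downstream classification accuracy
    best_k, best_aic = None, np.inf
    for k in candidate_ks:
        gmm = GaussianMixture(n_components=k, covariance_type='diag',
                              random_state=0).fit(descriptors)
        aic = gmm.aic(descriptors)
        if aic < best_aic:
            best_k, best_aic = k, aic
    return best_k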

Punchline: if you are doing classification with visual words, minimising the AIC may be detrimental; try using N/4 clusters instead (N = number of images).

Other References

This paper (reviewed before on this blog) presents supporting data too:

Noa Liscovitch, Uri Shalit, & Gal Chechik (2013). “FuncISH: learning a functional representation of neural ISH images.” Bioinformatics. DOI: 10.1093/bioinformatics/btt207

Old Work: Unsupervised Subcellular Pattern Unmixing

Continuing down nostalgia lane, here is another old paper of mine:

Coelho, Luis Pedro, Tao Peng, and Robert F. Murphy. “Quantifying the Distribution of Probes Between Subcellular Locations Using Unsupervised Pattern Unmixing.” Bioinformatics 26.12 (2010): i7–i12. DOI: 10.1093/bioinformatics/btq220

I have already discussed the subcellular location determination problem: given images of a protein, can we assign it to an organelle?

This is, however, a simplified version of the world: many proteins are present in multiple organelles. They may move between organelles in response to a stimulus or as part of the cell cycle. For example, here is an image of mitochondria in green (nuclei in red):

[Image: mitochondria in green, nuclei in red]

Here is one of lysosomes:

[Image: lysosomes]

And here is a mix of both:

[Image: a mixture of the mitochondrial and lysosomal patterns]

This is a dataset constructed for the purpose of this work, so we know what is happening, but it simulates the situation where a protein is present in two locations simultaneously.

Thus, we can move beyond simple assignment of a protein to a single organelle and assign it to multiple organelles. In fact, some work (both from the Murphy group and others) has looked at subcellular location classification using multiple labels per image. This, however, is still not enough: we want to quantify the mixture.

This is the pattern unmixing problem. The goal is to go from an image (or a set of images) to a statement like “this is 30% nuclear and 70% cytoplasmic”, which is very different from 70% nuclear and 30% cytoplasmic. The basic organelles can serve as the base patterns [1].

Before our paper, there had been some work approaching this problem from a supervised perspective: given examples of the different base patterns (i.e., of markers that locate to a single organelle), can we automatically build a system which, when given images of a protein distributed across multiple organelles, figures out what fraction comes from each organelle?
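
To make the supervised version of the problem concrete, here is a toy sketch (my illustration, not the method of either paper): represent each base pattern and the mixed image as L1-normalized visual word histograms and estimate the mixture fractions by non-negative least squares.

import numpy as np
from scipy.optimize import nnls

def unmix(mixed_histogram, base_histograms):
    # base_histograms: (k x n_patterns) matrix with one column per base
    # pattern (e.g., mitochondrial, lysosomal); mixed_histogram: (k,)
    # histogram computed from images of the protein of interest
    coefficients, _ = nnls(base_histograms, mixed_histogram)
    return coefficients / coefficients.sum()  # mixture fractions

# hypothetical usage: a 30/70 mix of two base patterns
# mito, lyso = ...  # histograms from single-organelle markers
# print(unmix(0.3*mito + 0.7*lyso, np.column_stack([mito, lyso])))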

Our paper extended this work to the unsupervised case: can you learn a mixture when you do not know what the basic patterns are?

References

Tao Peng, Ghislain M. C. Bonamy, Estelle Glory-Afshar, Daniel R. Rines, Sumit K. Chanda, and Robert F. Murphy. “Determining the distribution of probes between different subcellular locations through automated unmixing of subcellular patterns.” PNAS 107 (7): 2944-2949, 2010. DOI: 10.1073/pnas.0912090107

T. Zhao, M. Velliste, M. V. Boland, and R. F. Murphy. “Object type recognition for automated analysis of protein subcellular location.” IEEE Transactions on Image Processing 14 (9): 1351-1359.

[1] This is still a limited model because we are not sure even how many base patterns we should consider, but it will do for now.

Paper review: Assessing the efficacy of low-level image content descriptors for computer-based fluorescence microscopy image analysis


Assessing the efficacy of low-level image content descriptors for computer-based fluorescence microscopy image analysis by L. Shamir in Journal of Microscopy, 2011 [DOI]

This is an excellent, simple paper [1]. I will jump straight to the punchline (slightly edited by me for brevity):

This paper demonstrates that microscopy images that were previously used for developing and assessing the performance of bioimage classification algorithms can be classified even when the biological content is removed from the images [by replacing them with white squares], showing that previously reported results might be biased, and that the computer analysis could be driven by artefacts rather than by the actual biological content.

Here is an example of what the author means:

[Figure: example images from the paper, with the cells replaced by white squares]

Basically, the author shows that even after the images are modified by drawing white boxes where the cells are, classifiers still manage to perform apparently well. They are, therefore, probably picking up on artefacts instead of signal.

This is (and this analogy is from the paper, although not exactly in this form) like a face recognition system which seems to work very well because all of the images it has of me have me wearing the same shirt. It can perform very well on the training data, but will be fooled by anyone who wears the same shirt.

§

This is very important work, as it suggests that many previously reported results were probably inflated. Looking at the dates, it was done at about the same time that I was working on my own paper on the evaluation of subcellular location determination (it just took a while for that one to appear in print).

I expect that my proposed stricter evaluation protocol (train and test on separate images) would be better protected against this sort of effect [2]: we are then modeling the real problem instead of a proxy problem.
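
To illustrate what training and testing on separate images means in practice, here is a sketch with scikit-learn (not the code from my paper): the cross-validation folds are split by image identity, so that rows coming from the same image never appear on both sides of the split.

import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.linear_model import LogisticRegression

def image_level_accuracy(features, labels, image_ids, n_splits=5):
    # features has one row per cell (or per region); image_ids records
    # which image each row came from, so image-specific artefacts cannot
    # leak from the training fold into the test fold
    accuracies = []
    splitter = GroupKFold(n_splits=n_splits)
    for train, test in splitter.split(features, labels, groups=image_ids):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(features[train], labels[train])
        accuracies.append(clf.score(features[test], labels[test]))
    return np.mean(accuracies)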

§

I believe two things about image analysis of biological samples:

  1. Computers can be much better than humans at this task.
  2. Some (most? much of?) published literature overestimates how well computers do with the method being presented.

Note that there is no contradiction between the two, except that point 2, if widely believed, can make it harder to convince people of point 1.

(There is also a third point: most people overestimate how well humans do.)

[1] Normally, I'd review only recent papers, but not only had this one escaped my attention when it came out (in my defense, it appeared just as I was trying to finish my PhD thesis), it also deals with themes I have blogged about before.
[2] I tried a bit of testing along these lines, but it is hard to automate the blocking out of the cells. Automatic thresholding does not work because it depends on the shape of the signal! This is why the author of the paper drew the squares by hand.

Is Cell Segmentation Needed for Cell Analysis?

Having just spent some posts discussing a paper on nuclear segmentation (all tagged posts), let me ask the question:

Is cell segmentation needed? Is this a necessary step in an analysis pipeline dealing with fluorescent cell images?

This question comes up whenever I give a talk on work of mine that does not use segmentation, for example, using local features for classification (see the video). It comes up because, for many people, it seems obvious that the answer is yes, you need cell segmentation. So, when they see me skip that step, they ask: shouldn't you have segmented the cell regions first?

Here is my answer:

Remember Vapnik's dictum [1]: do not solve, as an intermediate step, a harder problem than the problem you really need to solve.

Thus the question becomes: does your scientific problem depend on cell segmentation? In the case of subcellular location determination, for example, it does not: all the cells in the same field display the same phenotype, and your goal is to find out what that phenotype is. Therefore, you do not need an answer for each cell, only for the whole field.

In other problems, you may need a per-cell answer: for example, in some kinds of RNAi experiment, only a fraction of the cells in a field display the RNAi phenotype, because the others did not take up the RNAi. There, segmentation may be necessary. Similarly, if a measurement such as the distance of fluorescent bodies to the cell membrane is meaningful in itself (as opposed to being used as a feature for classification), then you need segmentation.

However, sometimes you can get away without segmentation.

§

An important point to note is the following: while it may be good to have access to a perfect segmentation, an imperfect segmentation (i.e., the kind you actually get) may not help as much as the perfect kind would.

§

To be clear, I was not the first person to notice that you do not need segmentation for subcellular location determination. I think this is the first reference:

Huang, Kai, and Robert F. Murphy. “Automated classification of subcellular patterns in multicell images without segmentation into single cells.” Biomedical Imaging: Nano to Macro, 2004. IEEE International Symposium on. IEEE, 2004. [Google scholar link]

[1] I'm quoting from memory, so it may be a bit off. It sounds obvious when you put it this way, but it is still often not respected in practice.

Mahotas-imread Now Accepts Options When Writing

This week, I committed some code to mahotas-imread to allow for setting options when saving:

from imread import imsave
image = ...
imsave('file.jpeg', image, opts={ 'jpeg:quality': 95 })

This saves the image array to the file file.jpeg with JPEG quality 95 (out of 100).

§

This is only available in the version from github (at the moment), but I will probably put up a new release soon.

(If you like and use mahotas, please cite the software paper.)

Why Pixel Counting is not Adequate for Evaluating Segmentation

Let me illustrate what I was trying to say in a comment to João Carriço:

Consider the following three shapes:

[Figure: three segmentation masks: the reference (red), a slightly fatter version of it (green), and a version with an extra blob (blue)]

If the top (red) image is your reference and green and blue are two candidate solutions, then pixel counting (which forms the basis of the Rand and Jaccard indices) will say that green is worse than blue: green differs from the reference by 558 pixels, while blue differs by only 511 pixels.

However, the green image is simply a fatter version of the red one (with a border circa 2 pixels wide). Since boundaries cannot really be drawn at pixel level anyway (the border between background and foreground is fuzzy), this is not an important difference. The blue image, however, has an extra blob and is therefore qualitatively different.

The Hausdorff distance or my own normalized sum of distances, on the other hand, will say that green is very much like red while blue is more different; they capture the important differences better than pixel counting does. I think this is why we found these to be better measures than Rand or Jaccard (or Dice) for evaluating segmentation.
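
Here is a sketch (not the code from the paper) of the two kinds of measure for binary masks, to make the contrast concrete: Jaccard only counts disagreeing pixels, while the Hausdorff distance asks how far apart the two foregrounds can get.

import numpy as np
from scipy.spatial.distance import directed_hausdorff

def jaccard(reference, candidate):
    # pixel counting: intersection over union of two boolean masks
    intersection = np.logical_and(reference, candidate).sum()
    union = np.logical_or(reference, candidate).sum()
    return intersection / union

def hausdorff(reference, candidate):
    # symmetric Hausdorff distance between the foreground pixel coordinates:
    # a border that is 2 pixels fatter barely moves it, while an extra blob
    # far from the reference moves it a lot
    ref_points = np.argwhere(reference)
    cand_points = np.argwhere(candidate)
    return max(directed_hausdorff(ref_points, cand_points)[0],
               directed_hausdorff(cand_points, ref_points)[0])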

(Thanks João for prompting this example. I used this when I gave a talk or two about this paper, but it was lost in the paper because of page limits.)

Reference

Nuclear segmentation in microscope cell images: a hand-segmented dataset and comparison of algorithms by Luis Pedro Coelho, Aabid Shariff, and Robert F. Murphy, in Biomedical Imaging: From Nano to Macro, 2009 (ISBI ’09), IEEE International Symposium on. DOI: 10.1109/ISBI.2009.5193098 [PubMed Central open access version]

Mahotas software paper published

I got a new paper published [1]:

Mahotas: open source software for scriptable computer vision by Luis Pedro Coelho in Journal of Open Research Software [DOI]

This is about my computer vision software package, mahotas. It started as a way to do bioimage informatics, but the software is actually generic to computer vision.

[Figure 1 from the paper]

Earlier, I posted a tutorial on image segmentation which used mahotas.

The journal calls these metapapers, in that they are not the work itself but a reference to the work, i.e., the software. This is an interesting new initiative to reward scientific software development. As I argued before, the release of scientific software is a collective action problem: it is better for science to have software released, but not for the individual researcher. I also wrote that it would be a good idea to change the incentives to make it more profitable to do the right thing. The Journal of Open Research Software is exactly one such step.

It has already been used in a few publications, both by myself and others.

Notably, not all of these are bioimage informatics papers, which is sort of nice, as it means that the software is actually useful outside of the field in which it was initially developed.

[1] Yes, it has been a good month for publications.