Paper Review: Approaches to automatic parameter fitting in a microscopy image segmentation pipeline: An exploratory parameter space analysis

Held C, Nattkemper T, Palmisano R, Wittenberg T. Approaches to automatic parameter fitting in a microscopy image segmentation pipeline: An exploratory parameter space analysis. J Pathol Inform 2013;4:5. DOI: 10.4103/2153-3539.109831

I once heard Larry Wasserman claim that all problems in statistics are solved, except one: how to set λ. By which he meant (or I understood, or I remember; in fact, he may not even have claimed this and I am just assigning a nice quip to a famous name) that we have methods that work very well in most settings, but they tend to come with parameters, and adjusting these parameters (often called λ₁, λ₂, … in statistics) is the part that is actually hard.

In traditional image processing, parameters abound too: thresholds and weights are everywhere in the published literature. Tuning them to a specific dataset is often an unfortunate necessity. It also makes published results from different authors almost incomparable, as authors tend to tune their own algorithms much harder than those of others.

In this paper, setting the parameters is viewed as an optimization problem, approached with supervised machine learning: the goal is to find the parameter values that best reproduce a gold standard.

The setup is interesting and it's definitely a good idea to explore this way of thinking. Unfortunately, the paper is very short (just as it's getting good, it ends). Thus, there aren't a lot of results, beyond the observations that local minima can be a problem and that genetic algorithms do pretty well, albeit at a high computational cost. For example, there is a short discussion of human behaviour in parameter tuning, and one hopes for an experimental validation of these speculations (particularly given that the second author is well-known for earlier work on this theme).
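To make the setup concrete, here is a tiny, entirely made-up sketch of the simplest possible case: a one-parameter pipeline (an intensity threshold) fitted by exhaustive search against a gold-standard mask. With several interacting parameters, exhaustive search stops being feasible and local minima start to bite, which is where the genetic algorithms come in.

```python
import numpy as np

def segment(image, threshold):
    # A one-parameter "pipeline": plain intensity thresholding.
    return image > threshold

def jaccard(segmentation, gold):
    # Agreement between a proposed segmentation and the gold standard.
    intersection = np.logical_and(segmentation, gold).sum()
    union = np.logical_or(segmentation, gold).sum()
    return intersection / union if union else 1.0

def fit_threshold(image, gold, candidates):
    # Exhaustive search over candidate values; with more parameters this
    # blows up and smarter optimizers (e.g. genetic algorithms) are needed.
    scores = [jaccard(segment(image, t), gold) for t in candidates]
    return candidates[int(np.argmax(scores))]

# Toy usage: a random image whose "gold standard" is pixels above 0.5.
rng = np.random.default_rng(0)
image = rng.random((64, 64))
gold = image > 0.5
print("best threshold:", fit_threshold(image, gold, np.linspace(0, 1, 101)))
```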

I will be looking out for follow-up work from the same authors.

Paper Review: Unsupervised Clustering of Subcellular Protein Expression Patterns in High-Throughput Microscopy Images Reveals Protein Complexes and Functional Relationships between Proteins

Handfield, L., Chong, Y., Simmons, J., Andrews, B., & Moses, A. (2013). Unsupervised Clustering of Subcellular Protein Expression Patterns in High-Throughput Microscopy Images Reveals Protein Complexes and Functional Relationships between Proteins. PLoS Computational Biology, 9(6). DOI: 10.1371/journal.pcbi.1003085

This is an excellent paper that came out in PLoS CompBio last week.

The authors present a high-throughput analysis of yeast fluorescent microscopy images of tagged proteins. Figure 8, panel B (doi:10.1371/journal.pcbi.1003085.g008) shows a few example images from their collection:

Figure 8

One interesting aspect is that they get at the dynamics of protein distributions using only snapshots. I was previously involved in a similar project (ref. 18 in the paper [1]) and so I was happy to see others working in this fashion.

Budding yeast, as the name says, buds. A mother cell will create a new bud, that bud will grow and eventually it will split off and become a new daughter cell.

By leveraging the bud size as a marker of cell stage, the authors can build dynamic protein profiles and cluster these. This avoids the need for either (i) chemical synchronization [which has other side-effects on the cell] or (ii) movie acquisition [which, besides taking longer, damages the cells through phototoxicity].
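As a rough illustration of this pseudo-time idea (my own toy sketch, not the authors' pipeline): if segmentation gives you a bud size and some protein-localization measurement per cell, ordering the cells by bud size and averaging within bins already yields a dynamic-looking profile from a single snapshot.

```python
import numpy as np

def pseudo_time_profile(bud_sizes, feature_values, n_bins=10):
    # Order cells by bud size (a proxy for cell-cycle stage) and average a
    # per-cell feature within bins: a "dynamic" profile from one snapshot.
    bud_sizes = np.asarray(bud_sizes)
    feature_values = np.asarray(feature_values)
    order = np.argsort(bud_sizes)
    bins = np.array_split(feature_values[order], n_bins)
    return np.array([b.mean() for b in bins])

# Toy usage: a feature that drifts as the bud grows, plus noise.
rng = np.random.default_rng(1)
bud = rng.random(500)
feature = 2.0 * bud + rng.normal(scale=0.3, size=500)
print(pseudo_time_profile(bud, feature).round(2))
```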

In all of the examples above, you can see a change in protein distribution as the bud grows.

§

They perform an unsupervised analysis of their data, noting that

Unsupervised analysis also has the advantage that it is unbiased by prior ‘expert’ knowledge, such as the arbitrary discretization of protein expression patterns into easily recognizable classes.

Part of my research goals is to move beyond supervised/unsupervised into mixed models (take the supervision, but take it with a grain of salt). However, this is not yet something that we can do with current machine learning technologies.

The clusters obtained are found to group together functionally similar genes (details in the paper).

§

The authors are Bayesian about their estimates in a very interesting way. They evaluate their segmentations against training data, which gives them a confidence measure:

Our confidence measure allows us to distinguish correctly identified cells from artifacts and misidentified objects, without specifying what the nature of artifacts might be.

This is because their measure is a density estimate over shape features, learned from the training data. Now comes the nice Bayesian point:

This allows us to weight probabilistically data points according to the posterior probability. For classes of cells where our model does not fit as well, such as very early non-ellipsoidal buds, we expect to downweight all the data points, but we can still include information from these data points in our analysis. This is in contrast to the situation where we used a hard threshold to exclude artifacts.

(emphasis mine)
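A minimal sketch of how I read this (hypothetical feature names and priors; the model in the paper is more involved): fit a density to the shape features of confirmed cells, combine it with a flat artifact density to get a posterior probability of being a real cell, and use that posterior as a per-object weight instead of a hard cutoff.

```python
import numpy as np
from scipy.stats import gaussian_kde

def posterior_weights(train_features, new_features,
                      artifact_density=0.05, prior_cell=0.9):
    # train_features: (n_features, n_train) shape features of confirmed cells.
    # new_features:   (n_features, n_new) features of newly segmented objects.
    # Artifacts are not modelled explicitly: they simply get a flat density.
    kde = gaussian_kde(train_features)        # estimate of p(x | real cell)
    p_cell = kde(new_features)
    num = prior_cell * p_cell
    return num / (num + (1 - prior_cell) * artifact_density)

# Toy usage with two made-up shape features; the last ten objects sit far
# from the training data and get low weights instead of being thrown away.
rng = np.random.default_rng(2)
train = rng.normal(size=(2, 300))
new = np.hstack([rng.normal(size=(2, 50)), rng.normal(loc=5.0, size=(2, 10))])
weights = posterior_weights(train, new)
print(weights.round(2))
# Downstream estimates then become weighted averages, e.g.
# np.average(measurements, weights=weights), rather than hard-thresholded ones.
```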

§

Unlike the authors, I do not tend to care so much about interpretable features in my work. However, it is interesting that such a small number (seven) of features got such good results.

There is more in the paper which I did not mention here, such as the image processing pipeline (which is fairly standard if you're familiar with the field, but this unglamorous aspect of the business is where you always spend a lot of time).

§

One of my goals is to raise the profile of Bioimage Informatics, so I will try to have more papers in this field on the blog.

[1] We worked on mammalian cells, not budding yeast. Their cell cycles are very different and the methods that work in one do not necessarily work in the other.

Paper Review: Dual Host-Virus Arms Races Shape an Essential Housekeeping Protein

Demogines, A., Abraham, J., Choe, H., Farzan, M., & Sawyer, S. (2013). Dual Host-Virus Arms Races Shape an Essential Housekeeping Protein. PLoS Biology, 11(5). DOI: 10.1371/journal.pbio.1001571

This paper is not really related to my research, but I always enjoy a good cell biology story. My review is thus mostly a retelling of what I think were the highlights of the story.

In wild rodent populations, the retrovirus MMTV and New World arenaviruses both exploit Transferrin Receptor 1 (TfR1) to enter the cells of their hosts. Here we show that the physical interactions between these viruses and TfR1 have triggered evolutionary arms race dynamics that have directly modified the sequence of TfR1 and at least one of the viruses involved.

What is most interesting is that TfR1 is a housekeeping gene involved in iron uptake, which is essential for survival. Thus, it is probably highly constrained in its defensive evolution as even a small loss of function can be deleterious for the host.

The authors looked at the specific residues which seem to mutate rapidly across rodent species, and these map to known virus/protein contact regions (known from X-ray crystallography).

Interestingly, the same evolutionary patterns are visible in rodent species for which no known virus uses this entry point. However (and this is cool), we can find viral fossils in the genomes of these rodents (i.e., we can find parts of the viral sequence in the genome, which indicate that somewhere in the evolutionary past of these animals a retrovirus integrated into their genome).

§

This process also explains why some viruses infect some species and not others: divergent evolution of the virus itself, as it tries to catch up with the defensive evolution of different hosts, makes it unable to infect across species. Thus, whenever the host mutates, the virus faces an awkward choice: does it chase the new host surface and specialize to this species, or let this species go as a possible target?

How Long Does PLoS Take to Review a Paper? All PLoS Journals Now

Due to popular demand (at least two people asked, surely that's demand), here is a generalization of Monday's work to include a few more PLoS journals. (This was mostly because it was easy to generalize my scripts to process any PLoS journal.)

Here are the images for all PLoS journals. PLoS One at the end is the same figure I posted on Monday.

[Figures: time-from-submission-to-acceptance plots for PLoS Pathogens, PLoS Neglected Tropical Diseases, PLoS Medicine, PLoS Genetics, PLoS CompBio, PLoS Biology, and PLoS One.]

For all PLoS journals except PLoS Medicine, the average acceptance time seems to be around 5 months; for PLoS Medicine, it is around 7 months.

PLoS Medicine is the journal that takes the longest to review, PLoS One is the fastest (although some of the papers may have been reviewed in another PLoS journal before, speeding up the process).

§

Scripts are on github.
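For illustration, the core computation is just date arithmetic over each paper's submission and acceptance dates. This toy sketch uses made-up dates; the real scripts pull the dates from the PLoS article metadata.

```python
from datetime import date

def months_to_acceptance(received, accepted):
    # Time from submission to acceptance, in (approximate) months.
    return (accepted - received).days / 30.44

# Made-up (received, accepted) pairs; the real scripts extract these
# per paper and per journal.
papers = [
    (date(2012, 1, 10), date(2012, 6, 2)),
    (date(2012, 3, 5), date(2012, 7, 20)),
    (date(2012, 2, 14), date(2012, 9, 1)),
]
times = [months_to_acceptance(r, a) for r, a in papers]
print("mean months to acceptance: %.1f" % (sum(times) / len(times)))
```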

Paper Review: Probabilistic Inference of Biochemical Reactions in Microbial Communities from Metagenomic Sequences

 

Paper: Probabilistic Inference of Biochemical Reactions in Microbial Communities from Metagenomic Sequences by Dazhi Jiao, Yuzhen Ye, and Haixu Tang

This is a recent paper in PLoS CompBio which looks at the following problem:

  1. Metagenomics (the sequencing of genomes in mixed population samples) gives us many genes. Many of these genes can be mapped to enzymes.
  2. Those enzymes can be mapped (with a database such as KEGG) to reactions, but these assignments are ambiguous.
  3. Can we obtain a consensus interaction network?

The authors approach the problem probabilistically, by computing a probability for each possible interaction network. Their model is based on the notion that real interaction networks probably involve fewer metabolites than all possible combinations would. From the paper (my emphasis):

However, if the product of a gene is annotated to catalyze multiple reactions, some of these reactions may be excluded from the sampled subnetwork, as long as at least one of these reactions is included. We note that, according to this condition, each sampled subnetwork represents a putative reconstruction of the collective metabolic network of the metagenome, among which we assume the subnetworks containing fewer metabolites are more likely to represent the actual metabolism of the microbial community than the ones containing more metabolites.

From this idea, using standard MCMC, they are able to assign to each reaction a probability that it is part of the community's metabolic network.
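To give a flavour of the kind of sampler described (a deliberately simplified sketch of my own, not the authors' actual model): each gene carries a set of candidate reactions, a state keeps at least one reaction per gene, states touching fewer distinct metabolites score higher, and averaging over the Metropolis samples gives each reaction a marginal probability of being in the network.

```python
import math
import random
from collections import Counter

# Made-up toy input: gene -> candidate reactions, reaction -> metabolites.
GENE_TO_REACTIONS = {"geneA": ["R1", "R2"], "geneB": ["R2", "R3"], "geneC": ["R4"]}
REACTION_METABOLITES = {"R1": {"m1", "m2"}, "R2": {"m2", "m3"},
                        "R3": {"m4", "m5"}, "R4": {"m1"}}

def n_metabolites(state):
    # Distinct metabolites touched by the currently included reactions.
    mets = set()
    for reactions in state.values():
        for r in reactions:
            mets |= REACTION_METABOLITES[r]
    return len(mets)

def log_score(state, lam=1.0):
    # Subnetworks using fewer metabolites are assumed more plausible.
    return -lam * n_metabolites(state)

def sample_marginals(n_steps=20000, seed=0):
    rng = random.Random(seed)
    # Start with every candidate reaction included.
    state = {g: set(rs) for g, rs in GENE_TO_REACTIONS.items()}
    counts = Counter()
    for _ in range(n_steps):
        gene = rng.choice(list(GENE_TO_REACTIONS))
        reaction = rng.choice(GENE_TO_REACTIONS[gene])
        proposal = {g: set(rs) for g, rs in state.items()}
        proposal[gene] ^= {reaction}          # toggle the reaction in or out
        delta = log_score(proposal) - log_score(state)
        # Metropolis step; each gene must keep at least one of its reactions.
        if proposal[gene] and rng.random() < math.exp(min(0.0, delta)):
            state = proposal
        counts.update(set().union(*state.values()))
    return {r: counts[r] / n_steps for r in sorted(REACTION_METABOLITES)}

print(sample_marginals())
```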

They validate their method using clustering. They show that using their probability assignments results in better separation of samples than relying on naïve assignments of all enzymes to all possible reactions. The result is nice and clean.

To reduce this to bare essentials, the point is that their method (on the right of the figure) separates the different types of samples (represented by different symbols) better than any of the alternatives.

[Figure: hierarchical clustering of 44 IMG/M metagenomic samples, represented as dendrograms.]
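Schematically, the comparison amounts to something like the following (standard hierarchical clustering with scipy on made-up numbers, not the authors' exact procedure): cluster the samples once using the posterior reaction probabilities as features and once using the naïve all-candidates assignment, and see which grouping recovers the sample types.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def cluster_samples(profiles, n_clusters):
    # profiles: (n_samples, n_reactions) matrix, either posterior reaction
    # probabilities or naive 0/1 "every candidate reaction" assignments.
    tree = linkage(profiles, method="average", metric="euclidean")
    return fcluster(tree, t=n_clusters, criterion="maxclust")

# Toy data: six samples of two types, four reactions.
prob_profiles = np.array([[0.9, 0.1, 0.8, 0.0],
                          [0.8, 0.2, 0.9, 0.1],
                          [0.9, 0.2, 0.7, 0.1],
                          [0.1, 0.9, 0.1, 0.8],
                          [0.2, 0.8, 0.0, 0.9],
                          [0.1, 0.7, 0.2, 0.9]])
naive_profiles = (prob_profiles > 0).astype(float)  # all candidates "present"
print(cluster_samples(prob_profiles, 2))   # separates the two sample types
print(cluster_samples(naive_profiles, 2))  # the 0/1 profiles blur the distinction
```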

They also suggest that they are able to extract differentially present reactions better than the baseline methods. Unfortunately, due to the lack of a validated result, it is really impossible to know whether they just got more false positives. I do not really know how to do it better, though. This is just one of those fundamental problems in the field: the lack of validated information to build upon.

However, it is good to be able to even talk of differentially expressed reactions instead of just genes or orthologous groups.

Overall, the authors present an interesting formulation of a hard problem. I always like the idea of handling uncertainty probabilistically, and it is good to see that it really does work.

This is the sort of paper that immediately opens up a bunch of questions about extensions:

  • Can similar methods handle uncertainty in the basic gene assignments?
  • Or KEGG annotations?

Currently, they assume that all enzymes are actually present and perform one of the functions listed, but neither of these statements is always true.

Another direction in which their method could be taken is to move up from computing marginal probabilities of single reactions to computing probabilities of small subnetworks. I hope that the authors are exploring some of these questions and will present us with some follow-up work in the near future.