A few interesting statistical facts

1. This one has gotten a lot of press recently, so it’s not exactly new; bear with me if you’ve heard it before:

For almost everyone, your friends have more friends than you do.

It remains true if you substitute most other types of contacts for friends: most of the people you follow on Twitter have more followers than you do.

(The fact that most of your sexual partners had more sexual partners than you had is another reason to practice safe sex.)
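A quick way to convince yourself is to simulate it. The sketch below builds a random friendship graph and compares the average number of friends a person has with the average number of friends their friends have. The graph model (friendships drawn uniformly at random), the sizes, and the function names are all assumptions made for illustration; real social networks are far more skewed, which makes the effect stronger.

```python
import random
from statistics import mean

def random_friendships(n_people, n_friendships, seed=0):
    """Build an undirected random friendship graph as {person: set of friends}."""
    rng = random.Random(seed)
    friends = {p: set() for p in range(n_people)}
    added = 0
    while added < n_friendships:
        a, b = rng.sample(range(n_people), 2)  # two distinct people
        if b not in friends[a]:
            friends[a].add(b)
            friends[b].add(a)
            added += 1
    return friends

def friendship_paradox(friends):
    """Return (mean friends per person, mean friends per friend)."""
    own = {p: len(fs) for p, fs in friends.items()}
    mean_own = mean(own.values())
    # Every friendship contributes each endpoint once to the other's friend
    # list, so popular people are counted many times here.
    mean_of_friends = mean(own[f] for fs in friends.values() for f in fs)
    return mean_own, mean_of_friends

def fraction_outnumbered(friends):
    """Fraction of people (with at least one friend) whose friends have,
    on average, more friends than they do."""
    own = {p: len(fs) for p, fs in friends.items()}
    with_friends = [p for p in friends if friends[p]]
    outnumbered = [p for p in with_friends
                   if mean(own[f] for f in friends[p]) > own[p]]
    return len(outnumbered) / len(with_friends)

graph = random_friendships(n_people=200, n_friendships=600)
mean_own, mean_of_friends = friendship_paradox(graph)
print(f"average number of friends:           {mean_own:.2f}")
print(f"average number of friends of friends: {mean_of_friends:.2f}")
print(f"fraction with more popular friends:   {fraction_outnumbered(graph):.2f}")
```

The second average can never be smaller than the first: it weights each person by how many friends they have, so it equals the mean of the squared friend counts divided by the mean friend count.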

2. I don’t know if this one has a name, but it’s about full busses, so we can call it the bus paradox: most people ride in busses that are fuller than average.

Let’s say that there are two types of bus: 25% of busses are completely full, the other 75% are empty (except for the driver). Then, riders will always experience full busses even if most busses are empty. This is true even if only 1% of busses are full, but the 25% number is a bit closer to reality. During rush hour, half the busses are very full (those going into town in the morning; out of town in the evening), even though the typical bus is pretty empty.
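The arithmetic of the two-type example can be checked in a few lines. The sketch below assumes a full bus carries 40 riders (a number picked purely for illustration) and compares the average load per bus with the average load as experienced by a rider.

```python
def mean_load_per_bus(loads):
    """Average number of riders per bus (what the operator sees)."""
    return sum(loads) / len(loads)

def mean_load_per_rider(loads):
    """Average number of riders on the bus of a randomly chosen rider.

    A bus carrying n riders is experienced by n people, so each bus is
    weighted by its own load.
    """
    riders = sum(loads)
    return sum(n * n for n in loads) / riders

# 25% of buses are full (assumed capacity: 40), the other 75% are empty.
loads = [40] * 25 + [0] * 75

print(mean_load_per_bus(loads))    # 10.0
print(mean_load_per_rider(loads))  # 40.0
```

The operator sees buses that are a quarter full on average, while every single rider sees a full bus.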

It comes up in other contexts, of course: restaurants are on average emptier than is experienced by the typical patron. It can even have public policy implications: the expensive publicly-funded football stadium is less used than the typical visitor realizes (“every time I go there, it’s full, so it must have been a good investment” is wrong).

3. (This one is true in some countries in Europe): Most families only have a single child, but most children have siblings.

This is a variation on the bus paradox above. Let’s say 66% of families have 1 child, and 34% of families have more than 1. Then, most of the children come from that 34% of families with several children (at least 68 children for every 66 only children, probably more) and they’ll have siblings.

Larger families are over-represented in the next cohort (in a country with a birthrate of 1.2 to 1.3, a family with 5 children is about four times over-represented in the younger cohort).
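Both numbers follow from weighting each family by its number of children. The sketch below makes the most conservative assumption for the example (every family with more than one child has exactly two) and takes 1.25 children per family as the average for the second calculation; the function names are made up for illustration.

```python
def share_of_children_with_siblings(family_shares):
    """family_shares maps number of children -> share of families.

    Returns the share of children who live in a family with 2+ children.
    """
    total_children = sum(k * share for k, share in family_shares.items())
    with_siblings = sum(k * share for k, share in family_shares.items() if k >= 2)
    return with_siblings / total_children

def over_representation(n_children, mean_children):
    """How much more often a family of this size shows up when sampling
    children rather than families."""
    return n_children / mean_children

# 66% one-child families, 34% two-child families (the minimum for "more than 1").
print(share_of_children_with_siblings({1: 0.66, 2: 0.34}))  # just above one half

# A five-child family when the average is 1.25 children per family.
print(over_representation(5, 1.25))  # 4.0
```

If some of the larger families have three or more children, the share of children with siblings only goes up.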

What all of these have in common is that being an observer makes you biased. They also remind us that it is a mistake to generalize too much from our own experience, as the very fact that we are observing something can itself be a confounding effect.

Paper Review: Probabilistic Inference of Biochemical Reactions in Microbial Communities from Metagenomic Sequences


Paper: Probabilistic Inference of Biochemical Reactions in Microbial Communities from Metagenomic Sequences by Dazhi Jiao, Yuzhen Ye, and Haixu Tang

This is a recent paper in PLoS Computational Biology which looks at the following problem:

  1. Metagenomics (the sequencing of genomes from mixed-population samples) gives us many genes. Many of these genes can be mapped to enzymes.
  2. Those enzymes can be mapped (with a database such as KEGG) to reactions, but these assignments are ambiguous.
  3. Can we obtain a consensus interaction network?

The authors approach the problem probabilistically by computing, for each possible interaction network, a probability. Their model is based on the notion that real interaction networks probably have fewer metabolites than all possible combinations. From the paper (my emphasis):

However, if the product of a gene is annotated to catalyze multiple reactions, some of these reactions may be excluded from the sampled subnetwork, as long as at least one of these reactions is included. We note that, according to this condition, each sampled subnetwork represents a putative reconstruction of the collective metabolic network of the metagenome, among which we assume the subnetworks containing fewer metabolites are more likely to represent the actual metabolism of the microbial community than the ones containing more metabolites.

From this idea, using standard MCMC they are able to assign to each reaction a probability that it is part of a community.
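To make the idea concrete, here is a toy Metropolis sampler in that spirit. It is not the authors' implementation: the scoring rule (probability proportional to exp(-penalty × number of metabolites)), the single-reaction flip proposal, and the tiny gene, reaction, and metabolite tables are all assumptions made for illustration. The only parts taken from the paper's description are the constraint that every gene keeps at least one of its annotated reactions and the preference for subnetworks with fewer metabolites.

```python
import math
import random

def sample_reaction_marginals(gene_reactions, reaction_metabolites,
                              penalty=1.0, n_steps=20000, seed=0):
    """Estimate, for each reaction, how often it appears in sampled subnetworks."""
    rng = random.Random(seed)
    reactions = sorted(reaction_metabolites)

    def is_valid(selected):
        # Every gene must keep at least one of its candidate reactions.
        return all(any(r in selected for r in candidates)
                   for candidates in gene_reactions.values())

    def n_metabolites(selected):
        return len(set().union(*(reaction_metabolites[r] for r in selected)))

    state = set(reactions)            # all reactions included: always valid
    current = n_metabolites(state)
    counts = {r: 0 for r in reactions}

    for _ in range(n_steps):
        flipped = rng.choice(reactions)
        proposal = state ^ {flipped}  # add the reaction if absent, else drop it
        if is_valid(proposal):
            proposed = n_metabolites(proposal)
            # Metropolis rule for a symmetric proposal; invalid networks
            # have probability zero and are never accepted.
            if rng.random() < math.exp(-penalty * (proposed - current)):
                state, current = proposal, proposed
        for r in state:
            counts[r] += 1

    return {r: counts[r] / n_steps for r in reactions}

# Hypothetical toy data: gene g1 is ambiguous between R1 and R2, g2 maps to R3.
toy_genes = {"g1": ["R1", "R2"], "g2": ["R3"]}
toy_reactions = {"R1": {"A", "B"}, "R2": {"C", "D"}, "R3": {"A", "B"}}

marginals = sample_reaction_marginals(toy_genes, toy_reactions)
print(marginals)
```

In this toy case R1 shares its metabolites with R3 while R2 brings in two new ones, so R1 ends up with a much higher marginal probability than R2, and R3 is always present because its gene has no alternative.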

They validate their method using clustering. They show that using their probability assignments results in better separation of samples than relying on naïve assignments of all enzymes to all possible reactions. The result is nice and clean.

To reduce this to bare essentials, the point is that their method (on the right) gets the separation between the different types of samples (represented by different symbols) better than any alternatives.

Hierarchical clustering of 44 IMG/M metagenomics samples represented in dendrograms.

They also suggest that they are able to extract differentially present reactions better than the baseline methods. Unfortunately, due to the lack of a validated result, it is really impossible to know whether they just got more false positives. I do not really know how to do it better, though. This is just one of those fundamental problems in the field: the lack of validated information to build upon.

However, it is good to be able to even talk of differentially expressed reactions instead of just genes or orthologous groups.

Overall, the authors present an interesting formulation of a hard problem. I always like the idea of handling uncertainty probabilistically and it is good to see that it really does work.

This is the sort of paper that immediately opens up a bunch of questions about extensions:

  • Can similar methods handle uncertainty in the basic gene assignments?
  • Or KEGG annotations?

Currently, they assume that all enzymes are actually present and perform one of the functions listed, but neither of these statements is always true.

Another direction in which their method could be taken is to move up from computing marginal probabilities of single reactions to computing probabilities of small subnetworks. I hope that the authors are exploring some of these questions and will present us with some follow-up work in the near future.