What is a Gene? The Definitive Answer

I think the GenBank file spec gets the definition just right:

gene: A region of biological interest identified as a gene and for which a name has been assigned.

That’s basically it. If people call it a gene, it’s a gene.


When people say gene, they could mean:

  • a region in the genome that gets transcribed (or translated; but are introns no longer part of it?)
  • the nucleotide or amino-acid code in those regions
  • the “reference” nucleotide code that is expected in that region
  • the homologs of that (or orthologs or paralogs, or purposefully remaining fuzzy because it’s hard to say what’s what)
  • the regions of genome that cluster together across different organisms
  • a higher level concept that groups several proteins together through inferred orthology
  • (or perhaps even convergent evolution)
  • the protein encoded by the gene (or the general cluster of proteins)

In many discussions, gene is a good word to rationalist taboo. It clears up many mistakes when people are obliged to say what they mean by this tricky word.


Another good word to taboo is species when the organisms are bacteria.

To even use the same word as we have for animals is probably a mistake. We need a word for “bacteria whose rRNA clusters together in nucleotide space” without all of the baggage of species.

And so we would replace a venerable biological concept with a computational definition: progress!
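That computational definition can be sketched in a few lines. The toy identity function, the greedy single-pass clustering, and the 97% cutoff below are my illustrative assumptions, not any standard tool:

```python
# Toy sketch of "bacteria whose rRNA clusters together in nucleotide space":
# greedily group sequences whenever they exceed an identity threshold.

def identity(a: str, b: str) -> float:
    """Fraction of matching positions (assumes pre-aligned sequences)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def cluster_rrna(seqs: list[str], threshold: float = 0.97) -> list[list[str]]:
    """Join each sequence to the first cluster whose representative
    (its first member) it matches above the threshold; else start a
    new cluster."""
    clusters: list[list[str]] = []
    for s in seqs:
        for c in clusters:
            if identity(s, c[0]) >= threshold:
                c.append(s)
                break
        else:
            clusters.append([s])
    return clusters
```

Real tools such as CD-HIT or UCLUST do this at scale with proper alignment, but the concept is the same: "species" becomes an identity threshold on rRNA sequences.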

Paper Review: Probabilistic Inference of Biochemical Reactions in Microbial Communities from Metagenomic Sequences


Paper: Probabilistic Inference of Biochemical Reactions in Microbial Communities from Metagenomic Sequences by Dazhi Jiao, Yuzhen Ye, and Haixu Tang

This is a recent paper in PLoS Computational Biology which looks at the following problem:

  1. Metagenomics (the sequencing of genomes in mixed-population samples) gives us many genes. Many of these genes can be mapped to enzymes.
  2. Those enzymes can be mapped (with a database such as KEGG) to reactions, but these assignments are ambiguous.
  3. Can we obtain a consensus metabolic network?

The authors approach the problem probabilistically by computing a probability for each possible metabolic network. Their model is based on the notion that real metabolic networks probably involve fewer metabolites than the union of all possible assignments. From the paper (my emphasis):

However, if the product of a gene is annotated to catalyze multiple reactions, some of these reactions may be excluded from the sampled subnetwork, as long as at least one of these reactions is included. We note that, according to this condition, each sampled subnetwork represents a putative reconstruction of the collective metabolic network of the metagenome, among which we assume the subnetworks containing fewer metabolites are more likely to represent the actual metabolism of the microbial community than the ones containing more metabolites.

From this idea, using standard MCMC, they are able to assign to each reaction a probability that it is part of the community’s metabolic network.
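The sampling idea can be sketched as a toy Metropolis–Hastings run. The gene-to-reaction table, the exp(−β · #metabolites) weight, and the single-reaction flip proposal below are my illustrative assumptions, not the paper’s actual model:

```python
import math
import random
from collections import Counter

# Made-up data: gene -> candidate reactions; reaction -> metabolites touched.
GENE_REACTIONS = {"g1": {"r1", "r2"}, "g2": {"r2", "r3"}}
REACTION_METABOLITES = {"r1": {"A", "B"}, "r2": {"B", "C"}, "r3": {"D", "E"}}

def n_metabolites(state):
    return len(set().union(*(REACTION_METABOLITES[r] for r in state)))

def is_valid(state):
    # every gene must keep at least one of its candidate reactions
    return all(state & rs for rs in GENE_REACTIONS.values())

def sample_marginals(steps=20000, beta=1.0, seed=0):
    """Sample subnetworks favouring fewer metabolites; return, for each
    reaction, the fraction of sampled states that contain it."""
    rng = random.Random(seed)
    reactions = list(REACTION_METABOLITES)
    state = set(reactions)          # start from the full network
    counts = Counter()
    for _ in range(steps):
        r = rng.choice(reactions)
        proposal = state ^ {r}      # flip one reaction in or out
        if proposal and is_valid(proposal):
            # Metropolis acceptance with weight exp(-beta * #metabolites)
            delta = n_metabolites(proposal) - n_metabolites(state)
            if delta <= 0 or rng.random() < math.exp(-beta * delta):
                state = proposal
        counts.update(state)
    return {r: counts[r] / steps for r in reactions}
```

In this toy example, reaction r2 covers both genes while touching only two metabolites, so it should come out with the highest marginal probability; that is the paper’s parsimony intuition in miniature.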

They validate their method using clustering. They show that using their probability assignments results in better separation of samples than relying on naïve assignments of all enzymes to all possible reactions. The result is nice and clean.

To reduce this to the bare essentials, the point is that their method (on the right) achieves better separation between the different types of samples (represented by different symbols) than the alternatives.

Hierarchical clustering of 44 IMG/M metagenomic samples, represented as dendrograms.
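The validation step can be sketched like this: represent each sample by its vector of per-reaction probabilities and cluster the samples. The sample names and probability values below are made up for illustration, and the naive single-linkage routine is just a stand-in for a proper hierarchical-clustering library:

```python
import math

# Made-up per-sample reaction-probability vectors; real input would be
# the marginal probabilities inferred for each sample.
SAMPLES = {
    "gut_1":  [0.90, 0.80, 0.10],
    "gut_2":  [0.85, 0.75, 0.20],
    "soil_1": [0.10, 0.20, 0.90],
    "soil_2": [0.15, 0.10, 0.95],
}

def single_linkage(samples):
    """Naive agglomerative clustering: repeatedly merge the two clusters
    whose closest members are nearest; return the merge order."""
    clusters = {name: [name] for name in samples}
    merges = []
    while len(clusters) > 1:
        items = list(clusters.items())
        pairs = [(min(math.dist(samples[a], samples[b])
                      for a in ca for b in cb), ka, kb)
                 for i, (ka, ca) in enumerate(items)
                 for kb, cb in items[i + 1:]]
        _, ka, kb = min(pairs)
        clusters[ka + "+" + kb] = clusters.pop(ka) + clusters.pop(kb)
        merges.append((ka, kb))
    return merges
```

With sharper probability vectors, samples of the same type merge first, which is the separation the dendrograms in the figure are showing.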

They also suggest that they are able to extract differentially present reactions better than the baseline methods. Unfortunately, without a validated ground truth, it is impossible to know whether they just got more false positives. I do not really know how to do better, though. This is one of those fundamental problems in the field: the lack of validated information to build upon.

However, it is good to be able to even talk of differentially expressed reactions instead of just genes or orthologous groups.

Overall, the authors present an interesting formulation of a hard problem. I always like the idea of handling uncertainty probabilistically, and it is good to see that it really does work.

This is the sort of paper that opens up a bunch of questions immediately on extensions:

  • Can similar methods handle uncertainty in the basic gene assignments?
  • Or KEGG annotations?

Currently, they assume that all enzymes are actually present and perform one of the functions listed, but neither of these statements is always true.

Another direction in which their method could be taken is to move up from computing marginal probabilities of single reactions to computing probabilities of small subnetworks. I hope that the authors are exploring some of these questions and will present us with some follow-up work in the near future.