Quick followups: NGLess benchmark & Notebooks as papers

A quick follow-up on two earlier posts:

We finalized the benchmark for ngless that I had discussed earlier:

As you can see, NGLess performs much better than either MOCAT or htseq-count. We tried to use featureCounts too, but that completely failed to produce results for some of the samples (we gave it a whopping 1TB of RAM, but it used it all up before crashing).

It also reveals that although ngless was developed in the context of our metagenomics work, it would also likely do well on the type of problems for which htseq-count is currently being used, in the domain of RNA-seq.


Earlier, I also  wrote skeptically about the idea of replacing papers with Jupyter notebooks:

Is it even a good idea to have the presentation of the results mixed with their computation?

I do see the value in companion Jupyter notebooks for many cases, but as a replacement for the main paper, I am not even sure it is a good idea.

There is a lot of accidental complexity in code. A script that generates a publication plot may easily have 50 lines that do nothing more than set up the plot just right: (1) set up the subplots, (2) set x- and y-labels, (3) fix colours, (4) scale the points, (5) reset the font sizes, &c. What value is there in keeping all of this in the main presentation of the results?

The little script that generates the plot above is an excellent example of this. It is available online (on github: plot-comparison.py). It’s over 100 lines and, even then, the final result required some minor aesthetic manipulations in inkscape (so that, if you run it, the result is slightly uglier: in particular, the legend is absent).

Would it really add anything to the presentation of the manuscript to have those 100 lines of code be intermingled with the presentation of ngless as a metagenomics profiler?

In this case, I am confident that a Jupyter notebook would be worse than the current solution of a PDF as a main presentation with the data table and plotting scripts as supplemental material.


Journal subscriptions are negotiated, but article processing charges are fixed prices

Scientific publishing is moving to open access. This means that it’s moving from a subscription, reader-pays, to an author-pays model (normally termed article processing charges, or APCs).

This will have a couple of impacts:

1. Institutions which publish more and read less lose to institutions which read more and publish less. Thus, research universities will probably lose out to the pharmaceutical industry (as I’m sure that the pharmaceutical industry is proportionally reading a lot of papers, but not publishing as much).

This is not a big deal, but I thought I would mention it. Some people seem to be very angry at pharmaceutical companies all the time for not paying their fair share, but it’s a tough business and part of the point of publicly funded research is to enable downstream users (like pharmaceuticals) to flourish. Moving to APCs seems to be another move in supporting pharma (probably smaller biotech upstarts being the biggest beneficiary). Teaching-focused universities will also benefit.

2. More importantly, though, the move to APC (article processing charges) instead of subscriptions is also a move from a product that is sold at a variable price to one that is bought a fixed price.

Variable pricing (or price discrimination) is a natural feature of many of the markets that look like subscriptions, where fixed costs are more important than marginal costs (in this case, the extra cost of allowing access to the journal once the journal is done is, basically, zero).

Plane tickets and hotel rooms are less extreme cases where the price will fluctuate to attempt to charge more to those willing to pay more (typically, the wealthier; but sometimes the people willing to pay more and those counting their pennies are the same people, just in different situations: sometimes I really want a specific flight, other times I don’t even care exactly where I am going).

So, in the subscription model, some institutions will pay more. Maybe it’s not fair, but hedge funds such as Harvard will get bilked, while poorer institutions will be able to negotiate down. In the APC model, everyone will pay roughly the same (there may be some discounts for bulk purchasing, but not the variation you have today and it may even be the large institutions which will negotiate down, while the smaller players will pay full price).

Many publishers have policies to favor publications from very poor countries, charging them lower APCs. Naturally this is a good policy, but it will not be fine-grained. Universities in Bulgaria (EU member-state with a GDP per capita of 7,350 USD) are considered as wealthy as private American research universities. They will be expected to pay the same for their PlOS Biology papers.

NGLess timing benchmarks

As part of finalizing a manuscript on NGLess, we have run some basic timing benchmarks comparing NGLess to MOCAT2 (our previous tool) and another alternative for profiling a community based on a gene catalog, namely htseq-count.

The task being profiled is that performed in the NGLess tutorials for the human gut and the ocean: 3 metagenomes are functionally profiled by using a gene catalog as a reference. The time reported is for completing all 3 samples (repeated 3 times to get some variability measure).

The results are that NGLess is overall much faster than the alternatives (note that the Y-axis measures the number of seconds in log-scale).  For the gut dataset, MOCAT takes 2.5x, while for the ocean (tara) one, it takes 4x longer.


The Full  column contains the result of running the whole pipeline, where it is clear that NGLess is much faster than MOCAT2. The other elements are in MOCAT nomenclature:

  • ReadTrimFilter: preprocessing the FastQ files
  • Screen: mapping to the catalog
  • Filter: postprocessing the BAM files
  • Profile: generating feature counts from the BAM files

Htseq-count works well even for this settings which is outside of its original domain (it was designed for RNA-seq, where you have thousands of genes, as opposed to metagenomics, where millions are common). NGLess is still much faster, though.

Note too that for MOCAT, the time it takes for the Full step is simply the addition of the other steps, but in the case of NGLess, when running a complete pipeline, the interpreter can save time.

The htseq-count benchmark is still running, so final results will only be available next week.

I also tried to profile using featureCounts (website), but that tool crashed after using up 800GB of RAM. I might still try it on the larger machines (2TiB of RAM), but it seems pointless.

The scripts and preprocessed data for this benchmark are at https://github.com/BigDataBiology/ngless2018benchmark

A day has ~10⁵ seconds

This is trivial, but I make use of this all the time when doing some back-of-the-envelope calculation on whether some computation is going to take a few days or a few years:

A day has about 10⁵ seconds (24*60*60 = 86,400 ≈ 100,000).

So, if an operation takes 1 ms of CPU time, then you can do 10⁸ operations per day on one CPU and ~10¹¹ on 1,000 CPUs.

Using Gehan’s Test for Microbiome Longitudinal Studies

One of the nice things about blogging about my work is that I can use the blog to dig deep into a technical aspect that may would otherwise be buried in a little note in the Methods section. This is exactly such a post.

In our recent paper on the dog microbiome (see previous blog summary), we had a simple experimental design: All dogs ate the same food (Base diet) for 4 weeks. At the end of this period, we obtained a first sample from each dog. Then, dogs were randomly assigned to one of two diets (HPLC or LPHC) for 4 weeks, and we obtained a second sample.


Thus, for each dog, we had two samples: one baseline sample and one post-intervention sample. We can use tools like MOCAT2 or NGLess to obtain taxonomic (or functional) profiles of each of these samples (this project was done when we were still using MOCAT2; nowadays, we would use NGLess, but I digress). Half of these samples are from the HPLC diet, the other half from LPHC.

Now, the question is: how do we detect features (taxa) that are differentially affected by diet?

Attempt #1: Just use the endpoint values

The simplest way is to ignore the first (baseline) timepoint and test the effects on the second (post-intervention) sample. For example, we can use the Wilcoxon test and compare relative abundances (there are other methods too, but Mann-Whitney/Wilcoxon is still great).

In pseudo-code, this looks like:

data = data[timepoint == 'post-intervention']
for g in data.columns:
    data = data[g]
    print(stats.mannwhitneyu(g[diet == 'LPHC'], g[diet == 'HPLC'])


Attempt #2: Test on the ratios

Still, intuitively, it should be more powerful (in the technical sense of statistical power) to use the baseline samples in addition to the just the end point. Thus, let’s do a simple transformation: for each dog and each feature, we compute the difference in abundance between the initial timepoint and the end point.

normed = data[timepoint == 'post-intervention'] - data[timepoint == 'baseline']
for g in normed.columns:
    g = normed[g]
    print(stats.mannwhitneyu(g[diet == 'LPHC'], g[diet == 'HPLC'])

Most often, in biology, we care about ratios rather than differences. We report things as “10-fold more abundant.” So, it is a good idea to convert the data to logs before running the test.

The problem is (as usual): How do you handle the zeros? For many problems, there is a trivial solution, namely adding a pseudo count:

data10 = log10(pseudo_count + data)
normed = data10[timepoint == 'post-intervention'] - \
            data10[timepoint == 'baseline']

However, this does not work well in our dataset! We observed a loss of power (many fewer taxa were considered differentially abundant) and a little bit of investigation and thought revealed the issue:

Consider a subject where the abundance of a feature goes from 10⁻² to 10⁻⁴ when comparing baseline to post-intervention. On this subject, the intervention had an effect of -2 log-units. We can say that this was a more negative effect than a subject where the effect was 10⁻⁴ to 10⁻⁵. What about a subject where the abundance of that feature went from 10⁻⁴ to undetected? Well, if your pseudo-count is 10⁻⁶, then you would be claiming that the effect was -1 log unit, whilst if you used a lower pseudo-count, you would be claiming a larger effect.

Here is a visual representation of the effect (this is real data, but let’s abstract from the details by just saying that it’s Diet 1 and Diet 2 and we are comparing the effect on a particular taxon).gehan.png

You can probably intuit that the effect of Diet 2 is much stronger. However, if you test on the differences between post and baseline, you would find a non-significant effect (after multiple hypothesis correction). Comparing just the end-point, though, gives a strong signal. The problematic profiles are coloured blue, as they involve points below the detection limit on the post-intervention time point, which we set to 10⁻⁷.

Attempt #3 (solution): Use Gehan’s test

Before introducing Gehan’s Test, let’s review how Wilcoxon works:

The Wilcoxon test can be seen in the following way: take two samples A and B and compare all of the elements of A to all of the elements of B. Every time the object in A is greater than the object in B, you count +1 (you count 0.5 for draws). Add all of these values together and you get a score, U. If A and B are drawn from the same population, then U is roughly normally distributed with the mean being roughly half the comparisons.

Now, we can easily generalize this to take into account the fact that our data is only partially observed. Now, whenever we have zero in our data, we take that into account. If we estimate our detection limit to be 10⁻⁵, then when a subject goes from 10⁻⁴ to undetected, we know that the effect was at least -1 log units, but may be even more negative. The result is that we can claim that this effect is more negative than a subject where the abundance went up (or just slightly down). However, when we compare the effect to another subject where the effect was -2 units, then the comparison is indetermined.

Thus, we can adapt Wilcoxon to work in a slightly different way. We still compare all the objects in A to all the objects in B, however, the result can now be one of three options: (1) greater-than, (2) equal-to, (3) smaller-than, (4) indeterminate. If you work through this intuition to develop a proper statistical test, you get the Gehan Test (Gehan 1965a & Gehan 1965b)! Now, the test is empirically much more powerful when you have many samples close to detection limit!


Similarity of the dog and human gut microbiomes in gene content and response to diet

My paper Similarity of the dog and human gut microbiomes in gene content and response to diet was published yesterday in Microbiome. It was a long time in the making (and almost a year in the review process: submitted 11 May 2017), but now it’s finally published! It has been picking up quite a bit of press, which is nice too.

It’s open access, so everyone can read it, but here’s a basic summary:

  1. We built a non-redundant gene catalog for the dog gut microbiome. We then compared it to the equivalent gene catalogs for humans, mice, and pigs (since all these catalogs were built on Illumina data using MOCAT, they were easily comparable). Somewhat surprisingly, we found a high overlap between the dog microbiome genes and that of the human microbiome (higher than for the other non-human animals).
    Fig1f.pngWe can also map a much higher fraction of short reads from the dog gut microbiome to the human gut gene catalog than for the other non-human hosts.


  2. When we used metaSNV to analyse the SNV, we saw strain separation between the human and dog strains (for the same species). Thus, we do not share organisms with our dogs! Only similar species. I have presented this conclusion (in talks and informally) and different people in the field told be both that “of course strains are host specific” and “I was expecting that we’d be getting bacteria from our dogs all the time, this is not what I expected at all.”
  3. Diet shifts the microbiome of the dogs (the dogs were randomly assigned to either a high-protein or a low-protein diet). We had two samples from each dog: one after they ate the baseline diet (which was a low-protein diet) and a second after the random switch to a high-protein diet. We could thus  the microbiome of overweight dogs changes more than that of their healthier counterparts (see the Anna Karenina hypothesis: unhappy microbiomes are less alike).

    (in the Figure, HPLC/OW refers to overweight dogs on the high-protein diet).

  4. We also saw that some taxa dramatically changed their prevalence in response to the diet. In particular, Lactobacillus ruminis was completely absent in the high-protein diet. Not just lower abundant, but undetectable.

Those are the highlights, the full paper has a bit more. I’ll try to post a couple of extra posts with some interesting technical tidbits. For example, how we used the little-known Gehan statistic.

Why does natural evolution use genetic algorithms when it’s not a very good optimization method?

[Epistemic status: waiting for jobs to finish on the cluster, so I am writing down random ideas. Take it as speculation.]

Genetic algorithms are a pretty cool idea: when you have an objective function, you can optimize it by starting with some random guessing and then use mutations and cross-over to improve your guess. At each generation, you keep the most promising examples and eventually you will converge on a good solution. Just like evolution does.

Unfortunately, in practice, this idea does not pan out: genetic algorithms are not that effective. In fact, I am not aware of any general problem area where they are considered the best option. For example, in machine learning, the best methods tend to be variations of stochastic gradient descent.

And, yet, evolution uses genetic algorithms. Why doesn’t evolution use stochastic gradient descent or something better than genetic algorithms?

What would evolutionary gradient descent even look like?

First of all, let’s assure ourselves that we are not just using vague analogies to pretend we have deep thoughts. Evolutionary gradient descent is at least conceptually possible.

To be able to do gradient descent, a bacterium reproducing would need two pieces of information to compare itself to its mother cell: (1) how does it differ in genotype and (2) how much better is it doing than its parent. Here is one possible implementation of this idea: (1) tag the genome where it differs from the the mother cell (epigenetics!) and (2) have some memory of how fast it could grow. When reproducing, if we are performing better than our mother, then introduce more mutations in the regions where we differ from mother.

Why don’t we see something like this in nature? Here are some possible answers

Almost all mutations are deleterious

Almost always (viruses might be an exception), higher mutation rates are bad. Even in a controlled way (just the regions that seem to matter), adding more mutations will make things worse rather than better.

The signal is very noisy

Survival or growth is a very noisy signal of how good a genome is. Maybe we just got luckier than our parents in being born at a time of glucose plenty. If the environment is not stable, over reacting to a single observation may be the wrong thing to do.

The relationship between phenotype and genotype is very tenuous

What we’d really like to do is something like “well, it seems that in this environment, it is a good idea if membranes are more flexible, so I will mutate membrane-stiffness more”. However, the relationship between membrane stiffness and the genotype is complex. There is no simple “mutate membrane-stiffness” option for a bacterium. Epistatic effects are killers for simple ideas like this one.

On the other hand, the relationship between any particular weight in a deep CNN and the output is very weak. Yet, gradient descent still works there.

The cost of maintaining the information for gradient descent is too high

Perhaps, it’s just not worth keeping all this accounting information. Especially because it’s going to be yet another layer of noise.

Maybe there are examples of natural gradient descent, we just haven’t found them yet

There are areas of genomes that are more recombination prone than others (and somatic mutation in the immune system is certainly a mechanism of controlled chaos). Viruses may be another entity where some sort of gradient descent could be found. Maybe plants with their weird genomes are using all those extra copies to transmit information across generations like this.

As I said, this is a post of random speculation while I wait for my jobs to finish….