Using Gehan’s Test for Microbiome Longitudinal Studies

One of the nice things about blogging about my work is that I can use the blog to dig deep into a technical aspect that may would otherwise be buried in a little note in the Methods section. This is exactly such a post.

In our recent paper on the dog microbiome (see previous blog summary), we had a simple experimental design: All dogs ate the same food (Base diet) for 4 weeks. At the end of this period, we obtained a first sample from each dog. Then, dogs were randomly assigned to one of two diets (HPLC or LPHC) for 4 weeks, and we obtained a second sample.

Fig2a.png

Thus, for each dog, we had two samples: one baseline sample and one post-intervention sample. We can use tools like MOCAT2 or NGLess to obtain taxonomic (or functional) profiles of each of these samples (this project was done when we were still using MOCAT2; nowadays, we would use NGLess, but I digress). Half of these samples are from the HPLC diet, the other half from LPHC.

Now, the question is: how do we detect features (taxa) that are differentially affected by diet?

Attempt #1: Just use the endpoint values

The simplest way is to ignore the first (baseline) timepoint and test the effects on the second (post-intervention) sample. For example, we can use the Wilcoxon test and compare relative abundances (there are other methods too, but Mann-Whitney/Wilcoxon is still great).

In pseudo-code, this looks like:

data = data[timepoint == 'post-intervention']
for g in data.columns:
    data = data[g]
    print(stats.mannwhitneyu(g[diet == 'LPHC'], g[diet == 'HPLC'])

 

Attempt #2: Test on the ratios

Still, intuitively, it should be more powerful (in the technical sense of statistical power) to use the baseline samples in addition to the just the end point. Thus, let’s do a simple transformation: for each dog and each feature, we compute the difference in abundance between the initial timepoint and the end point.

normed = data[timepoint == 'post-intervention'] - data[timepoint == 'baseline']
for g in normed.columns:
    g = normed[g]
    print(stats.mannwhitneyu(g[diet == 'LPHC'], g[diet == 'HPLC'])

Most often, in biology, we care about ratios rather than differences. We report things as “10-fold more abundant.” So, it is a good idea to convert the data to logs before running the test.

The problem is (as usual): How do you handle the zeros? For many problems, there is a trivial solution, namely adding a pseudo count:

data10 = log10(pseudo_count + data)
normed = data10[timepoint == 'post-intervention'] - \
            data10[timepoint == 'baseline']

However, this does not work well in our dataset! We observed a loss of power (many fewer taxa were considered differentially abundant) and a little bit of investigation and thought revealed the issue:

Consider a subject where the abundance of a feature goes from 10⁻² to 10⁻⁴ when comparing baseline to post-intervention. On this subject, the intervention had an effect of -2 log-units. We can say that this was a more negative effect than a subject where the effect was 10⁻⁴ to 10⁻⁵. What about a subject where the abundance of that feature went from 10⁻⁴ to undetected? Well, if your pseudo-count is 10⁻⁶, then you would be claiming that the effect was -1 log unit, whilst if you used a lower pseudo-count, you would be claiming a larger effect.

Here is a visual representation of the effect (this is real data, but let’s abstract from the details by just saying that it’s Diet 1 and Diet 2 and we are comparing the effect on a particular taxon).gehan.png

You can probably intuit that the effect of Diet 2 is much stronger. However, if you test on the differences between post and baseline, you would find a non-significant effect (after multiple hypothesis correction). Comparing just the end-point, though, gives a strong signal. The problematic profiles are coloured blue, as they involve points below the detection limit on the post-intervention time point, which we set to 10⁻⁷.

Attempt #3 (solution): Use Gehan’s test

Before introducing Gehan’s Test, let’s review how Wilcoxon works:

The Wilcoxon test can be seen in the following way: take two samples A and B and compare all of the elements of A to all of the elements of B. Every time the object in A is greater than the object in B, you count +1 (you count 0.5 for draws). Add all of these values together and you get a score, U. If A and B are drawn from the same population, then U is roughly normally distributed with the mean being roughly half the comparisons.

Now, we can easily generalize this to take into account the fact that our data is only partially observed. Now, whenever we have zero in our data, we take that into account. If we estimate our detection limit to be 10⁻⁵, then when a subject goes from 10⁻⁴ to undetected, we know that the effect was at least -1 log units, but may be even more negative. The result is that we can claim that this effect is more negative than a subject where the abundance went up (or just slightly down). However, when we compare the effect to another subject where the effect was -2 units, then the comparison is indetermined.

Thus, we can adapt Wilcoxon to work in a slightly different way. We still compare all the objects in A to all the objects in B, however, the result can now be one of three options: (1) greater-than, (2) equal-to, (3) smaller-than, (4) indeterminate. If you work through this intuition to develop a proper statistical test, you get the Gehan Test (Gehan 1965a & Gehan 1965b)! Now, the test is empirically much more powerful when you have many samples close to detection limit!

 

Advertisements

Similarity of the dog and human gut microbiomes in gene content and response to diet

My paper Similarity of the dog and human gut microbiomes in gene content and response to diet was published yesterday in Microbiome. It was a long time in the making (and almost a year in the review process: submitted 11 May 2017), but now it’s finally published! It has been picking up quite a bit of press, which is nice too.

It’s open access, so everyone can read it, but here’s a basic summary:

  1. We built a non-redundant gene catalog for the dog gut microbiome. We then compared it to the equivalent gene catalogs for humans, mice, and pigs (since all these catalogs were built on Illumina data using MOCAT, they were easily comparable). Somewhat surprisingly, we found a high overlap between the dog microbiome genes and that of the human microbiome (higher than for the other non-human animals).
    Fig1f.pngWe can also map a much higher fraction of short reads from the dog gut microbiome to the human gut gene catalog than for the other non-human hosts.

    Fig1e.png

  2. When we used metaSNV to analyse the SNV, we saw strain separation between the human and dog strains (for the same species). Thus, we do not share organisms with our dogs! Only similar species. I have presented this conclusion (in talks and informally) and different people in the field told be both that “of course strains are host specific” and “I was expecting that we’d be getting bacteria from our dogs all the time, this is not what I expected at all.”
  3. Diet shifts the microbiome of the dogs (the dogs were randomly assigned to either a high-protein or a low-protein diet). We had two samples from each dog: one after they ate the baseline diet (which was a low-protein diet) and a second after the random switch to a high-protein diet. We could thus  the microbiome of overweight dogs changes more than that of their healthier counterparts (see the Anna Karenina hypothesis: unhappy microbiomes are less alike).

    (in the Figure, HPLC/OW refers to overweight dogs on the high-protein diet).

  4. We also saw that some taxa dramatically changed their prevalence in response to the diet. In particular, Lactobacillus ruminis was completely absent in the high-protein diet. Not just lower abundant, but undetectable.

Those are the highlights, the full paper has a bit more. I’ll try to post a couple of extra posts with some interesting technical tidbits. For example, how we used the little-known Gehan statistic.

Why does natural evolution use genetic algorithms when it’s not a very good optimization method?

[Epistemic status: waiting for jobs to finish on the cluster, so I am writing down random ideas. Take it as speculation.]

Genetic algorithms are a pretty cool idea: when you have an objective function, you can optimize it by starting with some random guessing and then use mutations and cross-over to improve your guess. At each generation, you keep the most promising examples and eventually you will converge on a good solution. Just like evolution does.

Unfortunately, in practice, this idea does not pan out: genetic algorithms are not that effective. In fact, I am not aware of any general problem area where they are considered the best option. For example, in machine learning, the best methods tend to be variations of stochastic gradient descent.

And, yet, evolution uses genetic algorithms. Why doesn’t evolution use stochastic gradient descent or something better than genetic algorithms?

What would evolutionary gradient descent even look like?

First of all, let’s assure ourselves that we are not just using vague analogies to pretend we have deep thoughts. Evolutionary gradient descent is at least conceptually possible.

To be able to do gradient descent, a bacterium reproducing would need two pieces of information to compare itself to its mother cell: (1) how does it differ in genotype and (2) how much better is it doing than its parent. Here is one possible implementation of this idea: (1) tag the genome where it differs from the the mother cell (epigenetics!) and (2) have some memory of how fast it could grow. When reproducing, if we are performing better than our mother, then introduce more mutations in the regions where we differ from mother.

Why don’t we see something like this in nature? Here are some possible answers

Almost all mutations are deleterious

Almost always (viruses might be an exception), higher mutation rates are bad. Even in a controlled way (just the regions that seem to matter), adding more mutations will make things worse rather than better.

The signal is very noisy

Survival or growth is a very noisy signal of how good a genome is. Maybe we just got luckier than our parents in being born at a time of glucose plenty. If the environment is not stable, over reacting to a single observation may be the wrong thing to do.

The relationship between phenotype and genotype is very tenuous

What we’d really like to do is something like “well, it seems that in this environment, it is a good idea if membranes are more flexible, so I will mutate membrane-stiffness more”. However, the relationship between membrane stiffness and the genotype is complex. There is no simple “mutate membrane-stiffness” option for a bacterium. Epistatic effects are killers for simple ideas like this one.

On the other hand, the relationship between any particular weight in a deep CNN and the output is very weak. Yet, gradient descent still works there.

The cost of maintaining the information for gradient descent is too high

Perhaps, it’s just not worth keeping all this accounting information. Especially because it’s going to be yet another layer of noise.

Maybe there are examples of natural gradient descent, we just haven’t found them yet

There are areas of genomes that are more recombination prone than others (and somatic mutation in the immune system is certainly a mechanism of controlled chaos). Viruses may be another entity where some sort of gradient descent could be found. Maybe plants with their weird genomes are using all those extra copies to transmit information across generations like this.

As I said, this is a post of random speculation while I wait for my jobs to finish….

The Scientific Paper of the Future is Probably a PDF

I do not mean to say the scientific paper of the future should be a PDF, I just mean that it will mostly likely be a PDF or some PDF-derived format. By future, I mean around 2040 (so, in 20-25 years).

I just read James Somers in the Atlantic, arguing that The Scientific Paper Is Obsolete (Here’s what’s next). In that article, he touts Mathematica notebooks as a model of what should be done and Jupyter as the current embodiment of this concept.

I will note that Mathematica came out in 1988 (a good 5 years before the PDF format) and has yet failed to take the world by storm (the article claims that “the program soon became as ubiquitous as Microsoft Word”, a claim which is really hard to reconcile with reality). Perhaps Mathematica was held back because it’s expensive and closed source (but so is Microsoft Word, and Word has taken the world by storm).

How long did it take to get to HTML papers?

For a very long time, the future of the scientific paper was going to be some smart version of HTML. We did eventually get to the point where most journals have decent HTML versions of their papers, but it’s mostly dumb HTML.

As far as I can tell, none of the ideas of having a semantically annotated paper panned out. About 10 years ago, the semantic web was going to revolutionize science. That didn’t happen and it’s even been a while since I heard someone arguing that that would be the future of the scientific paper.

Tools like Read Cube or Paperpile still parse the PDFs and try to infer what’s going on instead of relying on fancy semantic annotations.

What about future proofing the system?

About a week ago, I tweeted:

This is about a paper which is now in press. It’s embargoed, but I’ll post about it when it comes out in 2 weeks.

I have complained before about the lack of backwards compatibility in the Python ecosystem. I can open and print a PDF from 20 years ago (or a PostScript file from the early 1980s) without any issues, but I have trouble running a notebook from last year.

At this point, someone will say docker! and, yes, I can build a docker image (or virtual machine) with all my dependencies and freeze that, but who can commit to hosting/running these over a long period? What about the fact that even tech-savvy people struggle to keep all these things properly organized? I can barely get co-authors to move beyond the “let’s email Word files back and forth.”

With less technical co-authors, can you really imagine them downloading a docker container and properly mounting all the filesystems with OverlayFS to send me back edits? Sure, there are a bunch of cool startups with nicer interfaces, but will they be here in 2 years (let alone 20)?

Is it even a good idea to have the presentation of the results mixed with their computation?

I do see the value in companion Jupyter notebooks for many cases, but as a replacement for the main paper, I am not even sure it is a good idea.

There is a lot of accidental complexity in code. A script that generates a publication plot may easily have 50 lines that do nothing more than set up the plot just right: (1) set up the subplots, (2) set x- and y-labels, (3) fix colours, (4) scale the points, (5) reset the font sizes, &c. What value is there in keeping all of this in the main presentation of the results?

Similarly, all the file paths and the 5 arguments you need to pass to pandas.read_table to read the data correctly: why should we care when we are just trying to get the gist of the results? One of our goals in NGLess is to try to separate some of this accidental complexity from the main processing pipeline, but this also limits what we can do with it (this is the tradeoff, it’s a domain specific tool; it’s hard to achieve the same with a general purpose tool like Jupyter/Python).

I do really like Jupyter for tutorials as the mix of code and text are a good fit. I will work to make sure I have something good for the students, but I don’t particularly enjoy working with the notebook interface, so I need to be convinced before I jump on the bandwagon more generally.

§

I actually do think that the future will eventually be some sort of smarter thing than simulated paper, but I also think that (1) it will take much longer than 20 years and (2) it probably won’t be Jupyter getting us there. It’s a neat tool for many things, but it’s not a PDF killer.

Bug-for-bug backwards compatibility in NGLess

Recently, I found a bug in NGLess. In some rare conditions, it would mess up and reads could be lost. Obviously, I fixed it.

If you’ve used NGLess before (or read about it), you’ll know that every ngless script starts with a version declaration:

ngless "x.y"

This indicates which version of NGLess should be running the code. Since the bug changed the results, I needed to make a new version (we are now at version 0.8).

The question is what should NGLess do when it runs a script that uses an older version declaration? I see three options:

1. Silently update everyone to the new behavior

This is the typical software behavior: the new system is better, why wouldn’t you want to upgrade? Because we’d be breaking our promise to make ngless reproducible. The whole point of having the version line is to ensure that you will always get the same results. We also don’t want to make people afraid of upgrading.

2. Refuse to run older scripts and force everyone to upgrade

This is another option: we could just refuse to run old code. Now, at the very least, there would be no silent changes. It’s still possible to install older versions (and bioconda/biocontainers makes this easy), so if you really needed to, you could still run the older scripts.

3. Emulate the old (buggy) behavior when the user requests the old versions

In the end, I went with this option.

The old behavior is not that awful. Some reads are handled completely wrong, but the reason why the bug was able to persist for so long is that it only shows up in a few reads in a million. Thus, while this means that NGLess will sometimes knowingly output results that are suboptimal, I found it the best solution. A warning is printed, asking the user to upgrade.

Python’s Weak Performance Matters

Here is an argument I used to make, but now disagree with:

Just to add another perspective, I find many “performance” problems in
the real world can often be attributed to factors other than the raw
speed of the CPython interpreter. Yes, I’d love it if the interpreter
were faster, but in my experience a lot of other things dominate. At
least they do provide low hanging fruit to attack first.

[…]

But there’s something else that’s very important to consider, which
rarely comes up in these discussions, and that’s the developer’s
productivity and programming experience.[…]

This is often undervalued, but shouldn’t be! Moore’s Law doesn’t apply
to humans, and you can’t effectively or cost efficiently scale up by
throwing more bodies at a project. Python is one of the best languages
(and ecosystems!) that make the development experience fun, high
quality, and very efficient.

(from Barry Warsaw)

I used to make this argument. Some of it is just a form of utilitarian programming: having a program that runs 1 minute faster but takes 50 extra hours to write is not worth it unless you run it >3000 times. For code that is written as part of data analysis, this is rarely the case. Now I think it is not as strong of an argument as I previously thought. Now I believe that the fact that CPython (the only widely used Python interpreter) is slow is a major disadvantage of the language and not just a small tradeoff for faster development time.

What changed in my reasoning?

First of all, I’m working on other problems. Whereas I used to do a lot of work that was very easy to map to numpy operations (which are fast as they use compiled code), now I write a lot of code which is not straight numerics. And, then, if I have to write it in standard Python, it is slow as molasses. I don’t mean slower in the sense of “wait a couple of seconds”, I mean “wait several hours instead of 2 minutes.”

At the same time, data keeps getting bigger and computers come with more and more cores (which Python cannot easily take advantage of), while single-core performance is only slowly getting better. Thus, Python is a worse and worse solution, performance-wise.

Other languages have also demonstrated that it is possible to get good performance with high-level code (using JIT or very aggressive compile-time optimizations). Looking from afar, the core Python development group seems uninterested in these ideas. They regularly pop-up in side projects: psyco, unladen swallow, stackless, shedskin, and pypy; the last one being the only one that is in active development; however, for all the buzz they generate they never make into CPython, which is still using the same basic bytecode stack-machine strategy that it used 20 years ago. Yes, optimizing a very dynamic language it’s not a trivial problem, but Javascript is at least as dynamic as Python is and it has several JIT-based implementations.

It is true that programmer time is more valuable than computer time, but waiting for results to finish computing is also a waste of my time (I suppose I could do something else in the meanwhile, but context switches are such a killer of my performance that I often just wait).

I have also sometimes found that, in order to make something fast in Python, I end up with complex code, almost unreadable, code. See this function for an example. The first time we wrote it, it was a loop based function, directly translating the formula it is computing. It took hours on a medium sized problem (it would take weeks on the real-life problems we want to tackle!). Now, it’s down to a few seconds, but unless you are much smarter than me, it’s not trivial to read out the underlying formula from the code.

The result is that I find myself doing more and more things in Haskell, which lets me write high-level code with decent performance (still slower than what I get if I go all the way down to C++, but with very good libraries). In fact, part of the reason that NGLess is written in Haskell and not Python is performance. I still use Jug (Python-based) to glue it all together, but it is calling Haskell code to do all the actual work.

I now sometimes prototype in Python, then do a kind of race: I start running the analysis on the main dataset, while at the same time reimplementing the whole thing in Haskell. Then, I start the Haskell version and try to make it finish before the Python-analysis completes. Many times, the Haskell version wins (even counting development time!).

Update: Here is a “fun” Python performance bug that I ran into the other day: deleting a set of 1 billion strings takes >12 hours. Obviously, this particular instance can be fixed, but this exactly the sort of thing that I would never have done a few years ago. A billion strings seemed like a lot back then, but now we regularly discuss multiple Terabytes of input data as “not a big deal”. This may not apply for your settings, but it does for mine.

Update 2: Based on a comment I made on hackernews, this is how I summarize my views:

The main motivation is to minimize total time, which is is TimeToWriteCode + TimeToRunCode.

Python has the lowest TimeToWriteCode, but very high TimeToRunCodeTimeToWriteCode is fixed as it is a human factor (after the initial learning curve, I am not getting that much smarter). However, as datasets grow and single-core performance does not get better TimeToRunCode keeps increasing, so that it is more and more worth it to spend more time writing code to decrease TimeToRunCode. C++ would give me the lowest TimeToRunCode, but at too high a cost in TimeToWriteCode (not so much the language, as the lack of decent libraries and package management). Haskell is (for me) a good tradeoff.

This is applicable to my work, where we do use large datasets as inputs. YMMV.

Bray-Curtis dissimilarity on relative abundance data is the Manhattan distance (aka L₁ distance)

Warning: super-technical post ahead, but I have made this point in oral discussions at least a few times, so I thought I would write it up. It is a trivial algebraic manipulation, but because “ℓ₁ norm” sounds too mathy for ecologists while “Bray-curtis” sounds too ecological and ad-hoc for mathematically minded people, it’s good to see that it’s the same thing on normalized data.

Assuming you have two feature vectors, Xᵢ, Xⱼ, if they have been normalized to sum to 1, then the Bray-Curtis dissimilarity is just their ℓ₁ distance, aka Manhattan distance (times ½, which is a natural normalization so that the result is between zero and one).

This is the Wikipedia definition of the Bray-Curtis dissimiliarity (there are a few other, equivalent, definitions around, but we’ll use this one):

BC = 1 – 2 Cᵢⱼ/(Sᵢ + Sⱼ), where Cᵢⱼ =Σₖmin(Xᵢₖ, Xⱼₖ)  and Sᵢ = ΣₖXᵢₖ.

While the Manhattan distance is D₁ = Σₖ|Xᵢₖ – Xⱼₖ|

We are assuming that they sum to 1, so Sᵢ=Sⱼ=1. Thus,

BC = 1 – Σₖmin(Xᵢₖ, Xⱼₖ)

Now, this still does not look like the Manhattan distance (D₁, above). But for any a and b ≥0, it holds that

min(a,b) = (a + b)/2 – |a – b|/2

(this is easiest to see graphically:  start at the midpoint, (a+b)/2 and subtract half of the difference, |a-b|/2).

Thus, BC = 1 -Σₖmin(Xᵢₖ, Xⱼₖ) = 1 – Σₖ{(Xᵢₖ + Xⱼₖ)/2 – |Xᵢₖ + Xⱼₖ|/2}

Because, we assumed that the vectors were normalized,  Σₖ(Xᵢₖ + Xⱼₖ)/2 =(ΣₖXᵢₖ +ΣₖXⱼₖ)/2 = 1, so

BC = 1 – 1 + Σₖ|Xᵢₖ + Xⱼₖ|/2 = D₁/2.