One of the nice things about blogging about my work is that I can use the blog to dig deep into a technical aspect that may would otherwise be buried in a little note in the Methods section. This is exactly such a post.

In our recent paper on the dog microbiome (see previous blog summary), we had a simple experimental design: All dogs ate the same food (Base diet) for 4 weeks. At the end of this period, we obtained a first sample from each dog. Then, dogs were randomly assigned to one of two diets (HPLC or LPHC) for 4 weeks, and we obtained a second sample.

Thus, *for each dog, we had two samples: one *baseline* sample and one *post-intervention *sample*. We can use tools like MOCAT2 or NGLess to obtain taxonomic (or functional) profiles of each of these samples (this project was done when we were still using MOCAT2; nowadays, we would use NGLess, but I digress). Half of these samples are from the HPLC diet, the other half from LPHC.

Now, the question is: how do we detect features (taxa) that are differentially affected by diet?

**Attempt #1: Just use the endpoint values**

The simplest way is to ignore the first (baseline) timepoint and test the effects on the second (post-intervention) sample. For example, we can use the Wilcoxon test and compare relative abundances (there are other methods too, but Mann-Whitney/Wilcoxon is still great).

In pseudo-code, this looks like:

data = data[timepoint == 'post-intervention']
for g in data.columns:
data = data[g]
print(stats.mannwhitneyu(g[diet == 'LPHC'], g[diet == 'HPLC'])

**Attempt #2: Test on the ratios**

Still, intuitively, it should be more powerful (in the technical sense of statistical power) to use the baseline samples in addition to the just the end point. Thus, let’s do a simple transformation: for each dog and each feature, we compute the difference in abundance between the initial timepoint and the end point.

normed = data[timepoint == 'post-intervention'] - data[timepoint == 'baseline']
for g in normed.columns:
g = normed[g]
print(stats.mannwhitneyu(g[diet == 'LPHC'], g[diet == 'HPLC'])

Most often, in biology, we care about *ratios rather than differences*. We report things as “10-fold more abundant.” So, it is a good idea to convert the data to logs before running the test.

The problem is (as usual): *How do you handle the zeros?* For many problems, there is a trivial solution, namely adding a pseudo count:

data10 = log10(pseudo_count + data)
normed = data10[timepoint == 'post-intervention'] - \
data10[timepoint == 'baseline']

However, **this does not work well in our dataset**! We observed a loss of power (many fewer taxa were considered differentially abundant) and a little bit of investigation and thought revealed the issue:

Consider a subject where the abundance of a feature goes from 10⁻² to 10⁻⁴ when comparing baseline to post-intervention. On this subject, the intervention had an effect of -2 log-units. We can say that this was a more negative effect than a subject where the effect was 10⁻⁴ to 10⁻⁵. What about a subject where the abundance of that feature went from 10⁻⁴ to undetected? Well, if your pseudo-count is 10⁻⁶, then you would be claiming that the effect was -1 log unit, whilst if you used a lower pseudo-count, you would be claiming a larger effect.

Here is a visual representation of the effect (this is real data, but let’s abstract from the details by just saying that it’s *Diet 1 *and *Diet 2* and we are comparing the effect on a particular taxon).

You can probably intuit that the effect of Diet 2 is much stronger. However, if you test on the differences between post and baseline, you would find a non-significant effect (after multiple hypothesis correction). Comparing just the end-point, though, gives a strong signal. The problematic profiles are coloured blue, as they involve points below the detection limit on the post-intervention time point, which we set to 10⁻⁷.

**Attempt #3 (solution): Use Gehan’s test**

Before introducing Gehan’s Test, let’s review how Wilcoxon works:

The Wilcoxon test can be seen in the following way: take two samples A and B and compare all of the elements of A to all of the elements of B. Every time the object in A is greater than the object in B, you count +1 (you count 0.5 for draws). Add all of these values together and you get a score, *U*. If A and B are drawn from the same population, then U is roughly normally distributed with the mean being roughly half the comparisons.

Now, we can easily generalize this to take into account the fact that our data is only partially observed. Now, whenever we have zero in our data, we take that into account. If we estimate our detection limit to be 10⁻⁵, then when a subject goes from 10⁻⁴ to undetected, we know that the effect was **at least -1 log units, but may be even more negative**. The result is that we can claim that this effect is more negative than a subject where the abundance went up (or just slightly down). However, when we compare the effect to another subject where the effect was -2 units, then the comparison is indetermined.

Thus, we can adapt Wilcoxon to work in a slightly different way. We still compare all the objects in A to all the objects in B, however, the result can now be one of three options: (1) greater-than, (2) equal-to, (3) smaller-than, (4) indeterminate. If you work through this intuition to develop a proper statistical test, you get the Gehan Test (Gehan 1965a & Gehan 1965b)! Now, the test is empirically much more powerful when you have many samples close to detection limit!