Using Gehan’s Test for Microbiome Longitudinal Studies

One of the nice things about blogging about my work is that I can use the blog to dig deep into a technical aspect that may would otherwise be buried in a little note in the Methods section. This is exactly such a post.

In our recent paper on the dog microbiome (see previous blog summary), we had a simple experimental design: All dogs ate the same food (Base diet) for 4 weeks. At the end of this period, we obtained a first sample from each dog. Then, dogs were randomly assigned to one of two diets (HPLC or LPHC) for 4 weeks, and we obtained a second sample.

Fig2a.png

Thus, for each dog, we had two samples: one baseline sample and one post-intervention sample. We can use tools like MOCAT2 or NGLess to obtain taxonomic (or functional) profiles of each of these samples (this project was done when we were still using MOCAT2; nowadays, we would use NGLess, but I digress). Half of these samples are from the HPLC diet, the other half from LPHC.

Now, the question is: how do we detect features (taxa) that are differentially affected by diet?

Attempt #1: Just use the endpoint values

The simplest way is to ignore the first (baseline) timepoint and test the effects on the second (post-intervention) sample. For example, we can use the Wilcoxon test and compare relative abundances (there are other methods too, but Mann-Whitney/Wilcoxon is still great).

In pseudo-code, this looks like:

data = data[timepoint == 'post-intervention']
for g in data.columns:
    data = data[g]
    print(stats.mannwhitneyu(g[diet == 'LPHC'], g[diet == 'HPLC'])

 

Attempt #2: Test on the ratios

Still, intuitively, it should be more powerful (in the technical sense of statistical power) to use the baseline samples in addition to the just the end point. Thus, let’s do a simple transformation: for each dog and each feature, we compute the difference in abundance between the initial timepoint and the end point.

normed = data[timepoint == 'post-intervention'] - data[timepoint == 'baseline']
for g in normed.columns:
    g = normed[g]
    print(stats.mannwhitneyu(g[diet == 'LPHC'], g[diet == 'HPLC'])

Most often, in biology, we care about ratios rather than differences. We report things as “10-fold more abundant.” So, it is a good idea to convert the data to logs before running the test.

The problem is (as usual): How do you handle the zeros? For many problems, there is a trivial solution, namely adding a pseudo count:

data10 = log10(pseudo_count + data)
normed = data10[timepoint == 'post-intervention'] - \
            data10[timepoint == 'baseline']

However, this does not work well in our dataset! We observed a loss of power (many fewer taxa were considered differentially abundant) and a little bit of investigation and thought revealed the issue:

Consider a subject where the abundance of a feature goes from 10⁻² to 10⁻⁴ when comparing baseline to post-intervention. On this subject, the intervention had an effect of -2 log-units. We can say that this was a more negative effect than a subject where the effect was 10⁻⁴ to 10⁻⁵. What about a subject where the abundance of that feature went from 10⁻⁴ to undetected? Well, if your pseudo-count is 10⁻⁶, then you would be claiming that the effect was -1 log unit, whilst if you used a lower pseudo-count, you would be claiming a larger effect.

Here is a visual representation of the effect (this is real data, but let’s abstract from the details by just saying that it’s Diet 1 and Diet 2 and we are comparing the effect on a particular taxon).gehan.png

You can probably intuit that the effect of Diet 2 is much stronger. However, if you test on the differences between post and baseline, you would find a non-significant effect (after multiple hypothesis correction). Comparing just the end-point, though, gives a strong signal. The problematic profiles are coloured blue, as they involve points below the detection limit on the post-intervention time point, which we set to 10⁻⁷.

Attempt #3 (solution): Use Gehan’s test

Before introducing Gehan’s Test, let’s review how Wilcoxon works:

The Wilcoxon test can be seen in the following way: take two samples A and B and compare all of the elements of A to all of the elements of B. Every time the object in A is greater than the object in B, you count +1 (you count 0.5 for draws). Add all of these values together and you get a score, U. If A and B are drawn from the same population, then U is roughly normally distributed with the mean being roughly half the comparisons.

Now, we can easily generalize this to take into account the fact that our data is only partially observed. Now, whenever we have zero in our data, we take that into account. If we estimate our detection limit to be 10⁻⁵, then when a subject goes from 10⁻⁴ to undetected, we know that the effect was at least -1 log units, but may be even more negative. The result is that we can claim that this effect is more negative than a subject where the abundance went up (or just slightly down). However, when we compare the effect to another subject where the effect was -2 units, then the comparison is indetermined.

Thus, we can adapt Wilcoxon to work in a slightly different way. We still compare all the objects in A to all the objects in B, however, the result can now be one of three options: (1) greater-than, (2) equal-to, (3) smaller-than, (4) indeterminate. If you work through this intuition to develop a proper statistical test, you get the Gehan Test (Gehan 1965a & Gehan 1965b)! Now, the test is empirically much more powerful when you have many samples close to detection limit!

 

Advertisements

What surprised me in 2016

2016 made me reassess an important component of my view of the world. No, not Brexit or Trump becoming President (although, it’s not unrelated).

At the end of 2016, I realized that almost all psychology is pseudo-science. Not hyperbole, not oversold, but pseudo-science.

People used to joke that Parapsychology is the control group for science: i.e., a bunch of people ostentatiously following the scientific method in a situation where every result should come out negative. It’s a null field: the null hypothesis (that there is no effect) is true. Thus, the fact that you can still get positive effects should be worrisome. Turns out the true joke was that psychology is the true control group. Parapsychology was a bad control as most scientists were already predisposed to disbelieve them. Psychology is a much better control.

I had heard of the “Replication Crisis” before, but had not delved into the details. I thought psychology was like microbiome studies: over-hyped but, fundamentally, correct. We may see reports that the microbiome makes you be rude to your uber driver or whatever silly effect. We often read about the effects of the microbiome on obesity, as if it didn’t matter that our diets are not as healthy as they should be and it was all down to microbes. Jonathan Eisen collects these as overselling the microbiome.

Still, to say that people oversell the microbiome is not to say that there is no effect. The microbes do not single-handedly cause obesity, but they have an impact on the margin (a few BMI points up or down), which is enough to be significant for the population. They may not cause nor cure cancer, but they seem to influence the effect of immunotherapy enough that we may need to adjust dosages/drug combinations. And so on…

I thought that when it came to psychology, the same was true: sure, a lot of hype, but I thought there was a there there. There isn’t.

My basic mistake was that I had shared Daniel Kahneman’s view of the situation:

My position […] was that if a large body of evidence published in reputable journals supports an initially implausible conclusion, then scientific norms require us to believe that conclusion. Implausibility is not sufficient to justify disbelief, and belief in well-supported scientific conclusions is not optional. This position still seems reasonable to me – it is why I think people should believe in climate change.

This was exactly my position until I read this long Andrew Gelman post. Since then, I started to read up on this and find that psychology (as a field) has sold us a bill of goods.

No, computers are not setting us up for disaster

Yesterday, the Guardian published a long essay by Tim Harford on the dangers of automation. The argument is not new (I first heard it on the econtalk episode with David Mindell), and the characteristic example is that of the Air France flight that crashed in the middle of the ocean after the autopilot handed control back to the human pilots who immediately proceeded to crash the plane. As I read it the argument runs as follows: (a) full automation is impossible, (b) partial automation erodes skills, therefore (c) we should be wary of over-automating.

On twitter, I responded with the snark that that medium encourages:

But I feel I should make a longer counter-argument.

1. Despite being a good visual (a plane crash is dramatic), the example of an airplane crash in 2009 is a terrible one. Commercial civil aviation is incredibly safe. Commercial aviation is so safe, I wouldn’t be surprised to read a contrarian Marginal Revolution post arguing it’s now too safe and we need more dangerous planes. I would be very careful in arguing that somehow whatever the aviation community does, is not working based on a single incident that happened 7 years ago. If this was happening every 7 weeks, then it would be a very worrying problem, but it doesn’t.

2. Everything I’ve heard and read about that Air France accident seems to agree that the pilots were deeply incompetent. I have also gotten the distinct impression that if the system had not handed back control to the humans, they would not have crashed the plane. It is simply asserted that we cannot have completely autonomous planes, but without evidence. Perhaps at the very least, it should be harder for the humans to override the automated control. Fully automated planes would also not be hijackable in a 9/11 way nor by their own pilots committing suicide (which given how safe planes are, may now be a significant fraction of airplane deaths!).

3. Even granting the premise of the article, that (a) full automation is impossible and (b) partial automation can lead to skill erosion, the conclusion that “the database and the algorithm, like the autopilot, should be there to support human decision-making” is a non sequitor. It assumes that the human is always a better decision maker, which is completely unproven. In fact, I rather feel that the conclusion is the opposite: the pilot should be there (if a pilot is needed, but let’s grant that) to support the autopilot. Now, we should ask: what’s the best way for pilots to support automated systems? If it is to intervene in times of rare crisis, then pilots should perhaps train like other professionals who are there for crises: a lot of simulations and war games for the cases that we hope never happen. Perhaps, we’ll get to a world where success is measured by having pilots spend their whole careers without ever flying a plane, much like a Secret Service agent trains for the worst, but hopes to never have to throw themselves in front of a bullet.

4. Throughout the essay, it is taken as a given that humans are better and computers are there to save on effort. There is another example, that of meteorologists who now trust the computer instead of being able to intuit when the computer has screwed up, which is what used to happen, but I don’t see an argument that their intuition is better than the computer. If you tell me that the veteran meteorologists can beat the modern systems, I’ll buy that, but I would also think that maybe it’s because the veteran meteorologists were working when the automated systems weren’t as good as the modern ones.

5. The essay as a whole needs to be more quantitative. Even if computers do cause different types of accident, we need to have at least an estimate of whether the number of deaths is larger or smaller than using other systems (humans). I understand that authors do not always choose their titles, but I wouldn’t have responded if title of the essay had been “It won’t be perfect: how automated systems will still have accidents”.

6. The skill erosion effect is interesting per se and there is some value in discussing it and being aware of it. However, I see no evidence that it completely erases the gains from automation (rather than being a small “tax” or clawback on the benefits of automation) and that the solution involves less automation rather than either more automation or a different kind of human training.

7. My horse riding skills are awful.

At the Olympics, the US is underwhelming, Russia still overperforms, and what’s wrong with Southern Europe (except Italy)?

Russia is doing very well. The US and China, for all their dominance of the raw medal tables are actually doing just as well as you’d expect.

Portugal, Spain, and Greece should all be upset at themselves, while the fourth little piggy, Italy, is doing quite alright.

What determines medal counts?

I decided to play a data game with Olympic Gold medals and ask not just “Which countries get the most medals?” but a couple of more interesting questions.

My first guess of what determines medal counts was total GDP. After all, large countries should get more medals, but economic development should also matter. Populous African countries do not get that many medals after all and small rich EU states still do.

Indeed, GDP (at market value), does correlate quite well with the weighted medal count (an artificial index where gold counts 5 points, silver 3, and bronze just 1)

Much of the fit is driven by the two left-most outliers: US and China, but the fit explains 64% of the variance, while population explains none.

Adding a few more predictors, we can try to improve, but we don’t actually do that much better. I expect that as the Games progress, we’ll see the model fits become tighter as the sample size (number of medals) increases. In fact, the model is already performing better today than it was yesterday.

Who is over/under performing?

The US and China are right on the fit above. While they have more medals than anybody else, it’s not surprising. Big and rich countries get more medals.

The more interesting question is: which are the countries that are getting more medals than their GDP would account for?

Top 10 over performers

These are the 10 countries which have a bigger ratio of actual total medals to their predicted number of medals:

                delta  got  predicted     ratio
Russia       6.952551   10   3.047449  3.281433
Italy        5.407997    9   3.592003  2.505566
Australia    3.849574    7   3.150426  2.221921
Thailand     1.762069    4   2.237931  1.787366
Japan        4.071770   10   5.928230  1.686844
South Korea  1.750025    5   3.249975  1.538473
Hungary      1.021350    3   1.978650  1.516185
Kazakhstan   0.953454    3   2.046546  1.465884
Canada       0.538501    4   3.461499  1.155569
Uzbekistan   0.043668    2   1.956332  1.022322

Now, neither the US nor China are anywhere to be seen. Russia’s performance validates their state-funded sports program: the model predicts they’d get around 3 medals, they’ve gotten 10.

Italy is similarly doing very well, which surprised me a bit. As you’ll see, all the other little piggies perform poorly.

Australia is less surprising: they’re a small country which is very much into sports.

After that, no country seems to get more than twice as many medals as their GDP would predict, although I’ll note how Japan/Thailand/South Kore form a little Eastern Asia cluster of overperformance.

Top 10 under performers

This brings up the reverse question: who is underperforming? Southern Europe, it seems: Spain, Portugal, and Greece are all there with 1 medal against predictions of 9, 6, and 6.

France is country which is missing the most medals (12 predicted vs 3 obtained)! Sometimes France does behave like a Southern European country after all.

                delta  got  predicted     ratio
Spain       -8.268615    1   9.268615  0.107891
Poland      -6.157081    1   7.157081  0.139722
Portugal    -5.353673    1   6.353673  0.157389
Greece      -5.342835    1   6.342835  0.157658
Georgia     -4.814463    1   5.814463  0.171985
France      -9.816560    3  12.816560  0.234072
Uzbekistan  -3.933072    2   5.933072  0.337093
Denmark     -3.566784    3   6.566784  0.456845
Philippines -3.557424    3   6.557424  0.457497
Azerbaijan  -2.857668    3   5.857668  0.512149
The Caucasus (Georgia, Uzbekistan, Azerbaijan) may show up as their wealth is mostly due to natural resources and not development per se (oil and natural gas do not win medals, while human capital development does).
§
I expect that these lists will change as the Games go on as maybe Spain is just not as good at the events that come early in the schedule. Expect an updated post in a week.
Technical details

The whole analysis was done as a Jupyter notebook, available on github. You can use mybinder to explore the data. There, you will even find several little widgets to play around.

Data for medal counts comes from the medalbot.com API, while GDP/population data comes from the World Bank through the wbdata package.

I cannot find a decent offline email client

I spend a lot of time on trains and airplanes without internet connection and keep struggling with the fact that it has become significantly harder to use email without an internet connection. I would like to be able to write a few emails, which would be sent out when I reached civilization and had enough wireless coverage to tether the computer to my phone. Ten years ago, this would have been easy; today, not so much.

(Of course, if there was decent wireless data coverage everywhere, this would not be an issue [1]; except perhaps for data charges. However, as the world stands today, I can get neither a good data connection nor decent software to handle this.)

§

The first email clients were on networked machines are were online only. When dialup came along, they became desktop applications: your mail would be stored on your machine and minimizing the amount of time spent online was often a goal: you would write your messages, save in the Outbox, and then go online to get new messages and send your stored messages. As things moved to the cloud (especially with the appearance of Hotmail and the concept of webmail), desktop email started losing users (it probably never had many customers) and is slowly dying.

I would happily pay for a decent desktop offline-capable email client, but cannot find one. I tried Postbox, but it’s just a reheated version of thunderbird. Thunderbird itself is no longer being developed (as Mozilla is focusing on other priorities [2]). I used to use Kmail, which was not great, but worked in KDE3; completely stopped working with the advent of KDE4. Opera Mail is also no longer begin developed.

I could try to use mutt or another command line client, but the command line has not adapted to rich displays [3] and HTML is a nuisance in those [4]. I also enjoy drag&dropping attachments.

§

It is interesting to see how even with software, functionality can be lost if not continuously maintained.

[1] Perhaps part of the issue is that the US does have better wireless data coverage than Germany does. Since most technology is developed for that market…
[2] I also suspect the technology hit a local maximum and could not easily be improved and the many bugs encountered are very hard to fix.
[3] There is no technical reason to not have colour and HTML on the command line, but it’s hard to move these very old technology and the track record is that they’ll be with us for a few more decades.
[4] I still remember when it was considered rude by some to send HTML email. I too had a phase like that.

Self-driving cars kill, but they can improve; humans just kill

A Tesla self-driving car killed its “driver”:

According to Tesla’s account of the crash, the car’s sensor system, against a bright spring sky, failed to distinguish a large white 18-wheel truck and trailer crossing the highway.

Pretty awful failure of compute vision.

§

I have had more than one person tell me that while they agree with me that self-driving cars can be safer, they will never be allowed by the regulators because the politics are not working in their favor. While I understand some of the concerns, I don’t understand how you can believe such a thing and not turn into a full-blown libertarian.

If you believe that government regulations cause the unnecessary death of circa 92 people a day in the US (many more globally) how can you not become a libertarian raging against government intervention?

I actually have a better opinion of both regulators (who can delay life saving products, sure, but will probably not stop such a technology all together) and lobbyists (lobbyist can and do influence regulations).

§

There is a point to this post though: I’m pretty confident that pretty soon (in a matter of days, perhaps), Tesla will roll out an update that will make sure that white truck trailers against a bright sun are properly detected. So, this particular type of accident will be less likely to ever happen again. Self-driving car accidents expose flaws in the system which can be fixed.

Oh, but there are many other types of accidents, I hear you say. Of course, thousands of them, but if every time one of them happens, that particular type of accident gets fixed, the death rate will keep going down.

Compare with how “we” improve human drivers: educational campaigns and fines for every bad behaviour we uncover (and how many people still text while driving). Sure, it has an effect, but it’s not half as efficient as a patch.

Self-driving cars will have accidents, but unlike human drivers, they can improve rapidly. Until eventually, driving will be as safe as flying.