Scipy’s mannwhitneyu function

Without looking it up, can you say what the following code does:

import numpy as np
from scipy import stats
a = np.arange(25)
b = np.arange(25)+4
print(stats.mannwhitneyu(a , b))

You probably guessed that it computes the Mann-Whitney test between two samples, but exactly which test? The two-sided or the one-sided test?

You can’t tell from the code because it depends on which version of scipy you are running and it has gone back and forth between the two! Pre-0.17.0 it used the one-sided test with the side being decided based on the input data. This was obviously the wrong thing to do. Then, the API was fixed in 0.17.0 to do the two-sided test. This was considered a bad thing because it broke backwards compatibility and now it’s back to performing the one-sided test! I wish I was making this up. 

Reading through the github issues (#4933, #6034,  #6062, #6100)  is an example of how open source projects can stagnate. There is a basic, simple, solution to the issue: create a corrected version of the function with a new name and deprecate the old one. This keeps backwards compatibility while allowing the project to fix its API. Once the issue had been identified, this should have been a 20 minute job. Reading through the issues, this simple solution is proposed, discussed, seemingly agreed to. Instead, something else happens and at this point, it’d take me longer than 20 minutes to just read through the whole discussions.

This is not the first time I have run into numpy/scipy’s lack of respect for backwards compatibility either. Fortunately, there is a solution to this case, which is to use the full version:

stats.mannwhitneyu(a, b, alternative='two-sided')

Repost: BLAST deserves a Nobel Prize

Given that tomorrow (Monday October 3) the 2016 Nobel Prize in Physiology or Medicine will be announced, I am linking to my 2-year old post arguing that BLAST deserves a Nobel Prize:

In terms of impact in the field, it’s undeniable that BLAST has been huge. These people created a verb! What modern biologist does not know what “blasting a sequence” means? The BLAST paper was, at one point, the most highly cited paper in history. The impact on physiology is undeniable.

Lipman and Gene Myers stand out for their contributions to the computational processing of biological sequences. (See how I phrased that in a Nobel Committee way).

[…]

[One] counterargument I’ve heard is that BLAST is mostly a method, but so was GFP […] Does anybody believe that just the 1962 discovery of a jellyfish protein would have sufficed for a Nobel?

[…]

BLAST was definitely one of the most largest advances in the field of physiology in the last few decades. For this reason, David Lipman and Gene Myers should get a Physiology and Medicine Nobel Prize.

I also add that current favorite CRISPR is also mostly a method (the CRISPR Prize, when it comes, will be awarded for the method not the discovery of some DNA processing mechanism in a Streptococcus species.

Anscombe’s Quartet Animated

Anscombe’s Quartet is a set of four 2D datasets which have the same mean and variance in both X & Y as well as the same relationship between the two variables, even though they look very different.

I built a little animation to show all four datasets and a smooth transition between them:

Animation showing Anscombe's Quartet

Animation showing Anscombe’s Quartet

The black line is the mean Y value and the two dotted lines represent the mean ± std dev., the blue line is the least square regression between x and y. These are recomputed at each frame. In a sense, all the frames are like Anscombe sets.

*

The script for generating these is on github. I enjoyed playing around with theano for easy automatic differentiation (these type of derivatives are easy, but somehow I always get a sign wrong or a factor of 2 missing in the first try).

The UK medal count really is impressive, the US is just as expected

Repeating my analysis from last week on medal counts. To recap, let’s look for models that predict a country’s medal count based on GDP/population and then check which countries over- or under-perform their size and wealth. In the end, simple total GDP at market rates was the best predictor.

Measured by ratio of obtained medals to predicted medals, Russia was still overperforming (and they got banned from several of the events, so that is very impressive, although we’ll never know of which of the other sports they should have gotten banned from). Interestingly, we also see several caucusian countries showing up. And the big winner of this year’s Olympics, Great Britain, does show up as getting many more medals than their GDP predicts.

Finally, note that France, which was underperfoming at the beginning of the Olympics, not only caught up, but made it to the over-performers table (très bien, la France!).

Over performing countries
=========================
                   delta  got  predicted     ratio
Russia         41.352770   56  14.647230  3.823248
Azerbaijan     11.717356   18   6.282644  2.865036
Great Britain  42.256925   67  24.743075  2.707828
New Zealand    10.872079   18   7.127921  2.525280
Kazakhstan      9.783999   17   7.216001  2.355875
Hungary         8.201280   15   6.798720  2.206298
Kenya           6.573411   13   6.426589  2.022846
Uzbekistan      6.550104   13   6.449896  2.015536
Australia      13.899721   29  15.100279  1.920494
France         19.639542   42  22.360458  1.878316

Note that neither the US nor China show up. If anything, they are performing slightly below expectations.

Now, for the bottom half:

Under performing countries
==========================
                          delta  got  predicted     ratio
India                -18.620390    2  20.620390  0.096991
Nigeria               -8.497863    1   9.497863  0.105287
Austria               -7.754915    1   8.754915  0.114222
United Arab Emirates  -7.728790    1   8.728790  0.114563
Singapore             -7.190392    1   8.190392  0.122094
Philippines           -7.185019    1   8.185019  0.122174
Finland               -6.753509    1   7.753509  0.128974
Portugal              -6.539120    1   7.539120  0.132641
Qatar                 -6.316772    1   7.316772  0.136672
Puerto Rico           -5.873931    1   6.873931  0.145477

India got two medals (neither of which gold, one silver and one bronze) even though they are on track to becoming one of the world’s largest economies (right now, their GDP is comparable to Italy’s, but growing fast, while Italy is stagnant).

Several oil countries (unearned wealth) are listed there. The 3 richest countries not to win a medal at all are Saudi ArabiaPakistan, and Chile; another trio of resource rich countries.

*

You can run the whole analysis on a mybinder repo.

At the Olympics, the US is underwhelming, Russia still overperforms, and what’s wrong with Southern Europe (except Italy)?

Russia is doing very well. The US and China, for all their dominance of the raw medal tables are actually doing just as well as you’d expect.

Portugal, Spain, and Greece should all be upset at themselves, while the fourth little piggy, Italy, is doing quite alright.

What determines medal counts?

I decided to play a data game with Olympic Gold medals and ask not just “Which countries get the most medals?” but a couple of more interesting questions.

My first guess of what determines medal counts was total GDP. After all, large countries should get more medals, but economic development should also matter. Populous African countries do not get that many medals after all and small rich EU states still do.

Indeed, GDP (at market value), does correlate quite well with the weighted medal count (an artificial index where gold counts 5 points, silver 3, and bronze just 1)

Much of the fit is driven by the two left-most outliers: US and China, but the fit explains 64% of the variance, while population explains none.

Adding a few more predictors, we can try to improve, but we don’t actually do that much better. I expect that as the Games progress, we’ll see the model fits become tighter as the sample size (number of medals) increases. In fact, the model is already performing better today than it was yesterday.

Who is over/under performing?

The US and China are right on the fit above. While they have more medals than anybody else, it’s not surprising. Big and rich countries get more medals.

The more interesting question is: which are the countries that are getting more medals than their GDP would account for?

Top 10 over performers

These are the 10 countries which have a bigger ratio of actual total medals to their predicted number of medals:

                delta  got  predicted     ratio
Russia       6.952551   10   3.047449  3.281433
Italy        5.407997    9   3.592003  2.505566
Australia    3.849574    7   3.150426  2.221921
Thailand     1.762069    4   2.237931  1.787366
Japan        4.071770   10   5.928230  1.686844
South Korea  1.750025    5   3.249975  1.538473
Hungary      1.021350    3   1.978650  1.516185
Kazakhstan   0.953454    3   2.046546  1.465884
Canada       0.538501    4   3.461499  1.155569
Uzbekistan   0.043668    2   1.956332  1.022322

Now, neither the US nor China are anywhere to be seen. Russia’s performance validates their state-funded sports program: the model predicts they’d get around 3 medals, they’ve gotten 10.

Italy is similarly doing very well, which surprised me a bit. As you’ll see, all the other little piggies perform poorly.

Australia is less surprising: they’re a small country which is very much into sports.

After that, no country seems to get more than twice as many medals as their GDP would predict, although I’ll note how Japan/Thailand/South Kore form a little Eastern Asia cluster of overperformance.

Top 10 under performers

This brings up the reverse question: who is underperforming? Southern Europe, it seems: Spain, Portugal, and Greece are all there with 1 medal against predictions of 9, 6, and 6.

France is country which is missing the most medals (12 predicted vs 3 obtained)! Sometimes France does behave like a Southern European country after all.

                delta  got  predicted     ratio
Spain       -8.268615    1   9.268615  0.107891
Poland      -6.157081    1   7.157081  0.139722
Portugal    -5.353673    1   6.353673  0.157389
Greece      -5.342835    1   6.342835  0.157658
Georgia     -4.814463    1   5.814463  0.171985
France      -9.816560    3  12.816560  0.234072
Uzbekistan  -3.933072    2   5.933072  0.337093
Denmark     -3.566784    3   6.566784  0.456845
Philippines -3.557424    3   6.557424  0.457497
Azerbaijan  -2.857668    3   5.857668  0.512149
The Caucasus (Georgia, Uzbekistan, Azerbaijan) may show up as their wealth is mostly due to natural resources and not development per se (oil and natural gas do not win medals, while human capital development does).
§
I expect that these lists will change as the Games go on as maybe Spain is just not as good at the events that come early in the schedule. Expect an updated post in a week.
Technical details

The whole analysis was done as a Jupyter notebook, available on github. You can use mybinder to explore the data. There, you will even find several little widgets to play around.

Data for medal counts comes from the medalbot.com API, while GDP/population data comes from the World Bank through the wbdata package.

Update on CBT-vs-Medication

Update a recent post: Should “we” prefer more expensive medications?

Today I ran across this paper, which assessed what happens if you ask people if they prefer CBT (talk therapy) or medication. The number of patients is too small for any strong conclusions, but it seems that getting the treatment of your choice has some beneficial effects, particularly for talk therapy (it has no statistically significant impact on the case on medication, but, again, the number of patients is very small).

HT @CoyneoftheRealm

I cannot find a decent offline email client

I spend a lot of time on trains and airplanes without internet connection and keep struggling with the fact that it has become significantly harder to use email without an internet connection. I would like to be able to write a few emails, which would be sent out when I reached civilization and had enough wireless coverage to tether the computer to my phone. Ten years ago, this would have been easy; today, not so much.

(Of course, if there was decent wireless data coverage everywhere, this would not be an issue [1]; except perhaps for data charges. However, as the world stands today, I can get neither a good data connection nor decent software to handle this.)

§

The first email clients were on networked machines are were online only. When dialup came along, they became desktop applications: your mail would be stored on your machine and minimizing the amount of time spent online was often a goal: you would write your messages, save in the Outbox, and then go online to get new messages and send your stored messages. As things moved to the cloud (especially with the appearance of Hotmail and the concept of webmail), desktop email started losing users (it probably never had many customers) and is slowly dying.

I would happily pay for a decent desktop offline-capable email client, but cannot find one. I tried Postbox, but it’s just a reheated version of thunderbird. Thunderbird itself is no longer being developed (as Mozilla is focusing on other priorities [2]). I used to use Kmail, which was not great, but worked in KDE3; completely stopped working with the advent of KDE4. Opera Mail is also no longer begin developed.

I could try to use mutt or another command line client, but the command line has not adapted to rich displays [3] and HTML is a nuisance in those [4]. I also enjoy drag&dropping attachments.

§

It is interesting to see how even with software, functionality can be lost if not continuously maintained.

[1] Perhaps part of the issue is that the US does have better wireless data coverage than Germany does. Since most technology is developed for that market…
[2] I also suspect the technology hit a local maximum and could not easily be improved and the many bugs encountered are very hard to fix.
[3] There is no technical reason to not have colour and HTML on the command line, but it’s hard to move these very old technology and the track record is that they’ll be with us for a few more decades.
[4] I still remember when it was considered rude by some to send HTML email. I too had a phase like that.