NIXML: nix + YAML for easy reproducible environments

The rise and fall of bioconda

A year ago, I remember a conversation which went basically like this:

Them: So, to distribute my package, what do you think I should use?

Me: You should use bioconda.

Them: OK, that’s interesting, but what about …?

Me: No, you should use bioconda.

Them: I will definitely look into it, but maybe it doesn’t fit my package and maybe I will just …

Me: No, you should use bioconda.

That was a year ago. Mind you, I knew there were some issues with conda, but it had a decent UX (user-experience), and more importantly, the community was growing.

Since then, conda has hit scalability problems, which means that running it is increasingly frustrating: it is slow (an acknowledged issue, but I have had multiple instances of waiting 20 minutes for an error message, which didn’t even help me solve the problem); mysterious errors are not uncommon; and things that used to work now fail (I have run into this more and more recently).

Thus, I no longer recommend bioconda so enthusiastically. What before seemed like some esoteric concerns about guaranteed correctness are now biting us.

The nix model

nix is a Linux distribution with a focus on declarative, reproducible builds.

You write a little file (often called default.nix) which describes exactly what you want and the environment is generated from this, exactly the same each time. It has a lot going for it in terms of potential for science:

  1. Can be reproducible to a fault (Byte-for-Byte reproducibility, almost).
  2. Declarative means that the best practice of storing your environment for later use is very easy to implement [1]

Unfortunately, the UX of nix is not great, and making the environments reproducible, although possible, is not trivial (although it is now much easier). Nix is very powerful, but it uses a complicated domain-specific language and a semi-documented, ever-evolving set of build conventions, which makes it hard for even experienced users to use it directly. There is no way that I can recommend it for general use.

The stack model

Stack is a tool for Haskell which uses the following concept for reproducible environments:

  1. The user specifies a list of packages that they want to use
  2. The user specifies a snapshot of the package directory.

The snapshot determines the versions of all of the packages, which automated testing has revealed to work together (at least up to the limits of unit testing). Furthermore, there is no need to say “version X.Y.Z of package A; version Q.R.S of package B,…”: you specify a single, globally encompassing version (note that this is one of the principles we adopted in NGLess, as we describe in the manuscript).

I really like this UX:

  • Want to update all your packages? just change this one number.
  • Didn’t work? just change it back: you are back where you started. This is the big advantage of declarative approaches: what you did before does not matter, only the current state of the project.
  • Want to recreate an environment? just use this easy to read text file (for technical reasons, two files, but you get the drift).

Enter NIXML

https://github.com/luispedro/nixml

This is an afternoon hack, but the idea is to combine nix’s power with stack’s UX by allowing you to specify a set of packages in nix, using YAML.

For example, start with this env.nlm file,

nixml: v0.0
snapshot: stable-19.03
packages:
  - lang: python
    version: 2
    modules:
      - numpy
      - scipy
      - matplotlib
      - mahotas
      - jupyter
      - scikitlearn
  - lang: nix
    modules:
      - vim

Now, running

nixml shell

returns a shell with the packages listed. Running

nixml shell --pure

returns a shell with only the packages listed, so you can be sure to not rely on external packages.

Internally, this just creates a nix file and runs it, but it adds the stack-like interface:

  1. it is always automatically pinned: you see the stable-19.03 thing? That means the version of these packages that was available in the stable branch in March 2019.
  2. the syntax is simple, no need to know about python2.withPackages or any other nix internals like that. This means a loss of power for the user, but it will be a better trade-off 99% of the time.
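As a rough sketch of that internal step (this is illustrative Python, not the actual nixml code; it ignores the snapshot pinning, and the use of PyYAML and the file names are assumptions), the translation from the env.nlm above into a nix shell expression could look something like:

import subprocess
import yaml  # PyYAML; an assumption, not necessarily what nixml uses

def to_nix(spec):
    """Translate the parsed YAML into a (non-pinned) nix shell expression."""
    inputs = []
    for pkg in spec['packages']:
        if pkg['lang'] == 'python':
            py = 'python2' if pkg.get('version') == 2 else 'python3'
            mods = ' '.join(pkg['modules'])
            inputs.append('({}.withPackages (ps: with ps; [ {} ]))'.format(py, mods))
        elif pkg['lang'] == 'nix':
            inputs.extend(pkg['modules'])
    template = ('with import <nixpkgs> {{}};\n'
                'mkShell {{ buildInputs = [ {} ]; }}\n')
    return template.format(' '.join(inputs))

with open('env.nlm') as f:
    spec = yaml.safe_load(f)
with open('shell.nix', 'w') as out:
    out.write(to_nix(spec))
subprocess.check_call(['nix-shell', 'shell.nix'])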

Very much a work in progress right now, but I am putting it out there as it is already usable for Python-based projects.


  1. There are two types of best-practices advice: the type that most people adopt once they try it out; and the type that you need to keep hammering into people’s heads. The second type should be seen as a failure of the tool: “best practices” are a user-experience smell.

Quick followups: NGLess benchmark & Notebooks as papers

A quick follow-up on two earlier posts:

We finalized the benchmark for ngless that I had discussed earlier:

As you can see, NGLess performs much better than either MOCAT or htseq-count. We tried to use featureCounts too, but that completely failed to produce results for some of the samples (we gave it a whopping 1TB of RAM, but it used it all up before crashing).

It also reveals that although ngless was developed in the context of our metagenomics work, it would also likely do well on the type of problems for which htseq-count is currently being used, in the domain of RNA-seq.

§

Earlier, I also wrote skeptically about the idea of replacing papers with Jupyter notebooks:

Is it even a good idea to have the presentation of the results mixed with their computation?

I do see the value in companion Jupyter notebooks for many cases, but as a replacement for the main paper, I am not even sure it is a good idea.

There is a lot of accidental complexity in code. A script that generates a publication plot may easily have 50 lines that do nothing more than set up the plot just right: (1) set up the subplots, (2) set x- and y-labels, (3) fix colours, (4) scale the points, (5) reset the font sizes, &c. What value is there in keeping all of this in the main presentation of the results?
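As a made-up illustration (none of this is taken from the actual script discussed below), the kind of purely cosmetic code in question looks something like this:

import matplotlib.pyplot as plt

# Purely presentational set-up: nothing here touches the actual results.
fig, axes = plt.subplots(1, 2, figsize=(8, 3), sharey=True)  # lay out the subplots
for ax in axes:
    ax.set_xlabel('sample')                    # axis labels
    ax.set_ylabel('wall-clock time (s)')
    ax.spines['top'].set_visible(False)        # cosmetic tweaks
    ax.spines['right'].set_visible(False)
    ax.tick_params(labelsize=8)                # font sizes
fig.tight_layout()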

The little script that generates the plot above is an excellent example of this. It is available online (on github: plot-comparison.py). It’s over 100 lines and, even then, the final result required some minor aesthetic manipulations in inkscape (so that, if you run it, the result is slightly uglier: in particular, the legend is absent).

Would it really add anything to the presentation of the manuscript to have those 100 lines of code be intermingled with the presentation of ngless as a metagenomics profiler?

In this case, I am confident that a Jupyter notebook would be worse than the current solution of a PDF as a main presentation with the data table and plotting scripts as supplemental material.

Python’s Weak Performance Matters

Here is an argument I used to make, but now disagree with:

Just to add another perspective, I find many “performance” problems in
the real world can often be attributed to factors other than the raw
speed of the CPython interpreter. Yes, I’d love it if the interpreter
were faster, but in my experience a lot of other things dominate. At
least they do provide low hanging fruit to attack first.

[…]

But there’s something else that’s very important to consider, which
rarely comes up in these discussions, and that’s the developer’s
productivity and programming experience.[…]

This is often undervalued, but shouldn’t be! Moore’s Law doesn’t apply
to humans, and you can’t effectively or cost efficiently scale up by
throwing more bodies at a project. Python is one of the best languages
(and ecosystems!) that make the development experience fun, high
quality, and very efficient.

(from Barry Warsaw)

I used to make this argument. Some of it is just a form of utilitarian programming: having a program that runs 1 minute faster but takes 50 extra hours to write is not worth it unless you run it >3000 times, and for code that is written as part of a data analysis, that is rarely the case. But I now think it is not as strong an argument as I previously thought: the fact that CPython (the only widely used Python interpreter) is slow is a major disadvantage of the language, and not just a small tradeoff for faster development time.

What changed in my reasoning?

First of all, I’m working on other problems. Whereas I used to do a lot of work that was very easy to map to numpy operations (which are fast, as they use compiled code), I now write a lot of code which is not straight numerics. If I have to write that in standard Python, it is slow as molasses. I don’t mean slower in the sense of “wait a couple of seconds”, I mean “wait several hours instead of 2 minutes.”

At the same time, data keeps getting bigger and computers come with more and more cores (which Python cannot easily take advantage of), while single-core performance is only slowly getting better. Thus, Python is a worse and worse solution, performance-wise.

Other languages have also demonstrated that it is possible to get good performance with high-level code (using JIT compilation or very aggressive compile-time optimizations). Looking from afar, the core Python development group seems uninterested in these ideas. They regularly pop up in side projects: psyco, unladen swallow, stackless, shedskin, and pypy; the last being the only one still in active development. However, for all the buzz they generate, they never make it into CPython, which still uses the same basic bytecode stack-machine strategy that it used 20 years ago. Yes, optimizing a very dynamic language is not a trivial problem, but Javascript is at least as dynamic as Python and it has several JIT-based implementations.

It is true that programmer time is more valuable than computer time, but waiting for results to finish computing is also a waste of my time (I suppose I could do something else in the meanwhile, but context switches are such a killer of my performance that I often just wait).

I have also sometimes found that, in order to make something fast in Python, I end up with complex, almost unreadable code. See this function for an example. The first time we wrote it, it was a loop-based function, directly translating the formula it computes. It took hours on a medium-sized problem (it would take weeks on the real-life problems we want to tackle!). Now, it’s down to a few seconds, but unless you are much smarter than me, it’s not trivial to read the underlying formula out of the code.

The result is that I find myself doing more and more things in Haskell, which lets me write high-level code with decent performance (still slower than what I get if I go all the way down to C++, but with very good libraries). In fact, part of the reason that NGLess is written in Haskell and not Python is performance. I still use Jug (Python-based) to glue it all together, but it is calling Haskell code to do all the actual work.

I now sometimes prototype in Python, then do a kind of race: I start running the analysis on the main dataset while, at the same time, reimplementing the whole thing in Haskell. Then, I start the Haskell version and try to make it finish before the Python analysis completes. Many times, the Haskell version wins (even counting development time!).

Update: Here is a “fun” Python performance bug that I ran into the other day: deleting a set of 1 billion strings takes >12 hours. Obviously, this particular instance can be fixed, but it is exactly the sort of thing that I would never have done a few years ago. A billion strings seemed like a lot back then, but now we regularly discuss multiple terabytes of input data as “not a big deal”. This may not apply to your settings, but it does to mine.

Update 2: Based on a comment I made on hackernews, this is how I summarize my views:

The main motivation is to minimize total time, which is TimeToWriteCode + TimeToRunCode.

Python has the lowest TimeToWriteCode, but a very high TimeToRunCode. TimeToWriteCode is essentially fixed, as it is a human factor (after the initial learning curve, I am not getting that much smarter). However, as datasets grow and single-core performance does not get better, TimeToRunCode keeps increasing, so it is more and more worth it to spend extra time writing code in order to decrease TimeToRunCode. C++ would give me the lowest TimeToRunCode, but at too high a cost in TimeToWriteCode (not so much the language as the lack of decent libraries and package management). Haskell is (for me) a good tradeoff.

This is applicable to my work, where we do use large datasets as inputs. YMMV.

Numpy/scipy backwards stability debate (and why freezing versions is not the solution)

This week, a discussion broke out about the stability of the Python scientific ecosystem. It was triggered by a blogpost from Konrad Hinsen, which led to several twitter follow ups.

First of all, let me say that numpy/scipy are great. I use them and recommend them all the time. I am not disparaging the projects as a whole or the people who work on them. It’s just that I would prefer if they were more stable. Given twitter’s limitations, perhaps this was not as clear as I would like in my own twitter response:

I pointed out that I have been bitten by API changes:

All of these are individually minor (and can be worked around), but these are just the issues that I have personally run into and which caused enough problems for me to remember them. The most serious was the mannwhitneyu change, which was silent (i.e., the function started returning a different result rather than raising an exception or another type of error).

*

Konrad had pointed out the Linux kernel project as one extreme version of “we never break user code”:

The other extreme is the Silicon-Valley-esque “move fast and break stuff”, which is appropriate for a new project. These are not binary decisions, but two extremes of a continuum. I would hope to see numpy move more towards the “APIs are contracts we respect” side of the spectrum as I think it currently behaves too much like a startup.

Numpy does not use semantic versioning, but if it did almost all its releases would be major releases as they almost always break some code. We’d be at Numpy 14.0 by now. Semantic versioning would allow for a smaller number of “large, possibly-breaking releases” (every few years) instead of a constant stream of minor backwards-incompatible changes. We’d have Numpy 4.2 now, and a list of deprecated features to be removed by 5.0.

Some of the problems that have arisen could have been solved by (1) introducing a new function with the better behaviour, (2) deprecating the old one, (3) waiting a few years and removing the original version (in a major release, for example). This would avoid the most important problem, silent changes.
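As a sketch of that pattern (the function names here are invented for illustration; this is not actual numpy/scipy code):

import warnings
import numpy as np

def center_v2(x):
    """New function, with the corrected behaviour, under a new name."""
    return np.asarray(x) - np.mean(x)

def center(x):
    """Old entry point: behaviour is unchanged, but callers are warned and
    given a long deprecation period before removal in a major release."""
    warnings.warn(
        "center() is deprecated; use center_v2() instead",
        DeprecationWarning, stacklevel=2)
    return np.asarray(x) - np.mean(x)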

*

A typical response is to say “well, just use anaconda (or similar) to freeze your dependencies”. Again, let me be clear, I use and recommend anaconda for everything. It’s great. But, in the context of the backwards compatibility problem, I don’t think this recommendation is really thought through as it only solves a fraction of the problem at hand (again, an important fraction but it’s not a magic wand).  (See also this post by Titus Brown).

What does anaconda not solve? It does not solve the problem of the intermediate layer, libraries which use numpy, but are to be used by final users. What is the expectation here? That I release my computer vision code (mahotas) with a note: Only works on Numpy 1.11? What if I want a project that uses both mahotas and scikit-learn, but scikit-learn is for Numpy 1.12 only? Is the conclusion that you cannot mix mahotas and scikit-learn? This would be worse than the Python 2/3 split. A typical project of mine might use >5 different numpy-dependent libraries. What are the chances that they all expect the exact same numpy version?

Right now, the solution I use in my code is “if I know that this numpy function has changed behaviour, either work around it, avoid it, or reimplement it (perhaps by copying and pasting from numpy)”. For example, some functions return views or copies depending on which version of numpy you have. To handle that, just add a “copy()” statement to all of them and now you always have a copy. It’s computationally inefficient, but avoiding even a single bug over a few years probably saves more time in the end.
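For instance (np.diagonal is used here only as an illustration of a function whose view-vs-copy behaviour has changed across versions):

import numpy as np

m = np.arange(9).reshape((3, 3))
d = np.diagonal(m).copy()  # force a copy; never rely on view-vs-copy behaviour
d[0] = -1                  # safe on any numpy version: `m` is never touched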

It also happens all the time that I have an anaconda environment, add a new package, and numpy is upgraded/downgraded. Is this to be considered buggy behaviour by anaconda? Anaconda currently does not upgrade everything to Python 3 when you request a package that is not available on Python 2, nor does it downgrade from 3 to 2; why should it treat numpy any differently if there is no guarantee that behaviour is consistent across numpy versions?

Sometimes the code at hand is not even an officially released library, but some code from another project. Let’s say that I have code that takes a metagenomics abundance matrix, does some preprocessing, and computes stats and plots. I might have written it originally for a paper a few years back, but now want to do the same analysis on new data. Is the recommendation that I always rewrite it from scratch because it’s a new numpy version? What if it’s someone else asking me for the code? Is the recommendation that I ask “are you still on numpy 1.9, because I only really tested it there”? Note that Python’s dynamic nature actually makes this problem worse than in statically checked languages.

What about training materials? As I also wrote on twitter, it’s when teaching Python that I suffer most from Python 2-vs-Python 3 issues. Is the recommendation that training materials clearly state “This is a tutorial for numpy 1.10 only. Please downgrade to that version or search for a more up to date tutorial”? Note that beginners are the ones most likely to struggle with these issues. I can perfectly understand what it means that “array == None and array != None do element-wise comparison” (from the numpy 1.13 release notes). But if I was just starting out, would I understand it immediately?

Freezing the versions solves some problems, but does not solve the whole issue of backwards compatibility.

ANN: Diskhash. Disk-based, persistent hash tables

A few weeks ago, I decided to finally scratch an itch I’ve had for a while: I had a few days off from work and implemented a persistent, disk-based, hash table. Funnily enough, I’m now intensively using it at work, but a priori it felt more like a side project than a work one (it’s often a fuzzy border).

A disk-based hashtable

The idea is very simple: it’s a basic hash table which runs on mmap()ed memory, so that it can be loaded from disk with a single system call. I’ve heard this type of system referred to as “baked data”: you build structures in memory that can be written to/read from disk without any need for parsing/converting.

I implemented it all in C (because it is the lowest common denominator), but there are interfaces in C++, Python, and Haskell. The disk format is fixed, so all these interfaces can work with the same tables. You can jump to the bottom of the post to see code examples.

Performance

My usage is mostly to build the hashtable once and then reuse it many times. Several design choices reflect this bias and so does performance. Building the hash table can take a while. A big (roughly 1 billion entries) table took almost 1 hour to build. This compares to about 10 minutes for building a Python hashtable of the same size.

On disk, this table takes up 32GB (just the keys and data use up 21GB so I find the overhead acceptable). This compares with almost 200GB for the Python version. Additionally, several processes on the same machine can share the memory map (the operating system will do this automatically for you), further reducing memory usage when more than one process is running.

Using the C++ interface, I measured lookups as taking circa 10-20 microseconds per lookup. When doing the same from Python, it takes 400-800 microseconds. The big difference depends on whether the cache is hot or cold (doing the same lookup twice is much faster than two different lookups as the memory is already in cache). A raw Python hash table takes ca. 40 microseconds. My guess is that the extra overhead of diskhash in Python is boxing/unboxing of types, while the Python version uses boxed types (which is also responsible for the extra memory usage). Still, this is very acceptable.

Design

The format on disk is pretty simple:

[HEADER]
    - magic number (versioned)
    - options
    - size of table
    - number of used slots
[TABLE OF INDICES]
    - integer indices into data table [with value 0 representing NULL and other indices in 1-based format]
[DATA TABLE]
    - [key/value] pairs

The format on disk is the same as the format in memory, thus loading is simply a call to mmap(). Conflicts are handled with linear probing (the table load is kept below 50%). When it is necessary to expand the table, a completely new table is built (1.7x as large as the current one), all the elements are inserted into it, and then we switch to the new table. This can be quite expensive, but it is amortized, so insertions are still O(1), and it is possible to pre-allocate a large table if desired.

The indirection (there is a table of indices pointing to a data table) keeps disk space down at the cost of an extra step (and probably an extra memory access) at lookup time. The code is smart enough to switch from 32 to 64 bit indices as the table grows.
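To make the design concrete, here is a toy, in-memory Python rendition of the same layout (for exposition only; the real implementation is C, working directly on the mmap()ed file and switching between 32- and 64-bit indices):

class ToyDiskHash:
    """Toy version of the layout above: a table of 1-based indices into a
    flat data table, with linear probing and growth when the load gets high."""
    def __init__(self, slots=8):
        self.index = [0] * slots          # 0 means "empty slot" (NULL)
        self.data = []                    # flat table of (key, value) pairs

    def _slot(self, key):
        h = hash(key) % len(self.index)
        while True:
            ix = self.index[h]
            if ix == 0 or self.data[ix - 1][0] == key:
                return h                  # empty slot or the key itself
            h = (h + 1) % len(self.index) # linear probing on conflicts

    def insert(self, key, value):
        if 2 * (len(self.data) + 1) >= len(self.index):
            self._grow()                  # keep the load below 50%
        h = self._slot(key)
        if self.index[h] != 0:
            return False                  # key already present: not inserted
        self.data.append((key, value))
        self.index[h] = len(self.data)    # 1-based index into the data table
        return True

    def lookup(self, key):
        ix = self.index[self._slot(key)]
        return None if ix == 0 else self.data[ix - 1][1]

    def _grow(self):
        # Rebuild a larger index table (the real code grows by ~1.7x) and reinsert
        old = self.data
        self.__init__(slots=int(1.7 * len(self.index)) + 1)
        for k, v in old:
            self.insert(k, v)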

There is currently no support for deleting keys.

Experience coding this

C is a pain, but compiling C is fast

I had actually not written any C code in many years. I often use C++, but raw C is very different. Making sure that every cleanup path is correct leads to a lot of boilerplate and copy&pasting. Without exceptions and destructors, checking the return value of every function you call is a pain. It is not hard, but it sure is tedious.

One thing that was very cool is how fast compilation is. The first time I ran gcc, I thought there must have been something wrong as the command was instantaneous.

Nope, compilation of the library and the test driver takes <0.2s (slightly slower if you use optimizations; it goes all the way up to 0.3s).

This means that compiling and running C is about as fast as starting an interpreter.

Writing a disk based hash is easy, packaging the code is hard

The two hardest things in computer science are not naming things or cache invalidation but installing packages on Linux and solving packaging errors.

I first wrote a Python wrapper using ctypes, but while it was trivial to write and it worked well, I could not find a way to package it. Finally, I decided it was easier to just use the raw C API instead of figuring out how to convince setuptools to do what I wanted.

The Haskell packaging was slightly easier, but it still required a few tries until all the right files were correctly included in the package (which is why there were 3 releases until it worked: the code is the same, it was just me fiddling with packaging).

Examples

The following examples all create a hashtable to store longs (int64_t), then set the value associated with the key "key" to 9. In the current API, the maximum size of the keys needs to be pre-specified, which is the value 15 below.

Raw C

#include <stdio.h>
#include <inttypes.h>
#include <fcntl.h>    /* for O_RDWR and O_CREAT */
#include "diskhash.h"

int main(void) {
    HashTableOpts opts;
    opts.key_maxlen = 15;
    opts.object_datalen = sizeof(int64_t);
    char* err = NULL;
    HashTable* ht = dht_open("testing.dht", opts, O_RDWR|O_CREAT, &err);
    if (!ht) {
        if (!err) err = "Unknown error";
        fprintf(stderr, "Failed opening hash table: %s.\n", err);
        return 1;
    }
    long i = 9;
    dht_insert(ht, "key", &i);
    
    long* val = (long*) dht_lookup(ht, "key");
    printf("Looked up value: %l\n", *val);

    dht_free(ht);
    return 0;
}

Haskell

In Haskell, you have different types/functions for read-write and read-only hashtables.

Read write example:

import Data.DiskHash
import Data.Int
main = do
    ht <- htOpenRW "testing.dht" 15
    htInsertRW ht "key" (9 :: Int64)
    val <- htLookupRW "key" ht
    print val

Read only example (htLookupRO is pure in this case):

import Data.DiskHash
import Data.Int
main = do
    ht <- htOpenRO "testing.dht" 15
    let val :: Int64
        val = htLookupRO "key" ht
    print val

Python

Python’s interface is more limited and only integers are supported as values in the hash table (they are stored as 64-bit integers).

import diskhash
tb = diskhash.Str2int("testing.dht", 15)
tb.insert("key", 9)
print(tb.lookup("key"))

The Python interface is currently Python 3 only. Patches to extend it to 2.7 are welcome, but it’s not a priority.

C++

In C++, a simple wrapper is defined, which provides a modicum of type-safety. You use the DiskHash<T> template. Additionally, errors are reported through exceptions (both std::bad_alloc and std::runtime_error can be thrown) and not return codes.

#include <cstdint>
#include <iostream>
#include <string>

#include <diskhash.hpp>

int main() {
    const int key_maxlen = 15;
    dht::DiskHash<uint64_t> ht("testing.dht", key_maxlen, dht::DHOpenRW);
    std::string line;
    uint64_t ix = 0;
    while (std::getline(std::cin, line)) {
        if (line.length() > key_maxlen) {
            std::cerr << "Key too long: '" << line << "'. Aborting.\n";
            return 2;
        }
        const bool inserted = ht.insert(line.c_str(), ix);
        if (!inserted) {
            std::cerr  << "Found repeated key '" << line << "' (ignored).\n";
        }
        ++ix;
    }
    return 0;
}

Scipy’s mannwhitneyu function

Without looking it up, can you say what the following code does:

import numpy as np
from scipy import stats
a = np.arange(25)
b = np.arange(25)+4
print(stats.mannwhitneyu(a, b))

You probably guessed that it computes the Mann-Whitney test between two samples, but exactly which test? The two-sided or the one-sided test?

You can’t tell from the code because it depends on which version of scipy you are running and it has gone back and forth between the two! Pre-0.17.0 it used the one-sided test with the side being decided based on the input data. This was obviously the wrong thing to do. Then, the API was fixed in 0.17.0 to do the two-sided test. This was considered a bad thing because it broke backwards compatibility and now it’s back to performing the one-sided test! I wish I was making this up. 

Reading through the github issues (#4933, #6034, #6062, #6100) is an example of how open source projects can stagnate. There is a basic, simple solution to the issue: create a corrected version of the function with a new name and deprecate the old one. This keeps backwards compatibility while allowing the project to fix its API. Once the issue had been identified, this should have been a 20-minute job. Reading through the issues, this simple solution is proposed, discussed, and seemingly agreed to. Instead, something else happens, and at this point it’d take me longer than 20 minutes just to read through the whole discussion.

This is not the first time I have run into numpy/scipy’s lack of respect for backwards compatibility either. Fortunately, there is a solution to this case, which is to use the full version:

stats.mannwhitneyu(a, b, alternative='two-sided')

Anscombe’s Quartet Animated

Anscombe’s Quartet is a set of four 2D datasets which have the same mean and variance in both X & Y, as well as the same linear relationship between the two variables, even though they look very different.

I built a little animation to show all four datasets and a smooth transition between them:

[Animation showing Anscombe’s Quartet]

The black line is the mean Y value, and the two dotted lines represent the mean ± std. dev.; the blue line is the least squares regression of y on x. These are recomputed at each frame, so, in a sense, all the frames are like Anscombe sets.
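A small sketch of those per-frame computations (illustrative only, not taken from the actual script linked below):

import numpy as np

def frame_summaries(x, y):
    """Summary lines recomputed from the points of a single frame."""
    mu, sd = y.mean(), y.std()
    slope, intercept = np.polyfit(x, y, 1)   # least squares fit of y on x
    return {
        'mean': mu,                          # the solid black line
        'band': (mu - sd, mu + sd),          # the two dotted lines
        'regression': (slope, intercept),    # the blue line
    }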

*

The script for generating these is on github. I enjoyed playing around with theano for easy automatic differentiation (these derivatives are easy, but somehow I always get a sign wrong or miss a factor of 2 on the first try).

At the Olympics, the US is underwhelming, Russia still overperforms, and what’s wrong with Southern Europe (except Italy)?

Russia is doing very well. The US and China, for all their dominance of the raw medal tables, are actually doing just as well as you’d expect.

Portugal, Spain, and Greece should all be upset at themselves, while the fourth little piggy, Italy, is doing quite alright.

What determines medal counts?

I decided to play a data game with Olympic Gold medals and ask not just “Which countries get the most medals?” but a couple of more interesting questions.

My first guess of what determines medal counts was total GDP. After all, large countries should get more medals, but economic development should also matter: populous African countries do not get that many medals, while small, rich EU states still do.

Indeed, GDP (at market value) correlates quite well with the weighted medal count (an artificial index where gold counts 5 points, silver 3, and bronze just 1).

Much of the fit is driven by the two left-most outliers: US and China, but the fit explains 64% of the variance, while population explains none.

Adding a few more predictors, we can try to improve, but we don’t actually do that much better. I expect that as the Games progress, we’ll see the model fits become tighter as the sample size (number of medals) increases. In fact, the model is already performing better today than it was yesterday.
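Roughly, the kind of computation involved looks like this (a hedged sketch, not the actual notebook; the 5/3/1 weighting comes from the text above, while the simple linear fit and the data-frame names are assumptions):

import numpy as np
import pandas as pd

def medal_index(gold, silver, bronze):
    # artificial index: gold counts 5 points, silver 3, bronze 1
    return 5 * gold + 3 * silver + 1 * bronze

def over_under_performers(index, gdp):
    """`index` and `gdp` are assumed to be pandas Series indexed by country."""
    slope, intercept = np.polyfit(gdp, index, 1)   # simple least squares fit
    predicted = slope * gdp + intercept
    return pd.DataFrame({
        'delta': index - predicted,
        'got': index,
        'predicted': predicted,
        'ratio': index / predicted,
    }).sort_values('ratio', ascending=False)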

Who is over/under performing?

The US and China are right on the fit above. While they have more medals than anybody else, it’s not surprising. Big and rich countries get more medals.

The more interesting question is: which are the countries that are getting more medals than their GDP would account for?

Top 10 overperformers

These are the 10 countries with the highest ratio of actual total medals to their predicted number of medals:

                delta  got  predicted     ratio
Russia       6.952551   10   3.047449  3.281433
Italy        5.407997    9   3.592003  2.505566
Australia    3.849574    7   3.150426  2.221921
Thailand     1.762069    4   2.237931  1.787366
Japan        4.071770   10   5.928230  1.686844
South Korea  1.750025    5   3.249975  1.538473
Hungary      1.021350    3   1.978650  1.516185
Kazakhstan   0.953454    3   2.046546  1.465884
Canada       0.538501    4   3.461499  1.155569
Uzbekistan   0.043668    2   1.956332  1.022322

Now, neither the US nor China are anywhere to be seen. Russia’s performance validates their state-funded sports program: the model predicts they’d get around 3 medals, they’ve gotten 10.

Italy is similarly doing very well, which surprised me a bit. As you’ll see, all the other little piggies perform poorly.

Australia is less surprising: they’re a small country which is very much into sports.

After that, no country seems to get more than twice as many medals as their GDP would predict, although I’ll note how Japan/Thailand/South Korea form a little Eastern Asian cluster of overperformance.

Top 10 underperformers

This brings up the reverse question: who is underperforming? Southern Europe, it seems: Spain, Portugal, and Greece are all there with 1 medal against predictions of 9, 6, and 6.

France is the country which is missing the most medals (12 predicted vs 3 obtained)! Sometimes France does behave like a Southern European country after all.

                delta  got  predicted     ratio
Spain       -8.268615    1   9.268615  0.107891
Poland      -6.157081    1   7.157081  0.139722
Portugal    -5.353673    1   6.353673  0.157389
Greece      -5.342835    1   6.342835  0.157658
Georgia     -4.814463    1   5.814463  0.171985
France      -9.816560    3  12.816560  0.234072
Uzbekistan  -3.933072    2   5.933072  0.337093
Denmark     -3.566784    3   6.566784  0.456845
Philippines -3.557424    3   6.557424  0.457497
Azerbaijan  -2.857668    3   5.857668  0.512149
The Caucasus and Central Asia (Georgia, Uzbekistan, Azerbaijan) may show up because their wealth is mostly due to natural resources rather than development per se (oil and natural gas do not win medals, while human capital development does).

§

I expect that these lists will change as the Games go on, as maybe Spain is just not as good at the events that come early in the schedule. Expect an updated post in a week.

Technical details

The whole analysis was done as a Jupyter notebook, available on github. You can use mybinder to explore the data. There, you will even find several little widgets to play around with.

Data for medal counts comes from the medalbot.com API, while GDP/population data comes from the World Bank through the wbdata package.

Mounting My Phone as a Filesystem

After a lot of time spent trying to find the right app/software for getting things off my phone into my computer (I have spent probably days now, accumulated over my lifetime of phone-ownership; this shouldn’t be so hard), I just gave up and wrote a little FUSE script to mount the phone as a directory in Linux.

It is very basic and relies on adb (the Android Debug Bridge) being available on the PATH, but if it is, I was able to just type:

$ mkdir -p phone
$ python android-fuse.py phone &

To get the phone mounted as a directory:

$ cd phone
$ ls -l
total 656 
drwxr-xr-x 1 root      root           0 Jan  1 00:44 acct 
drwxrwx--- 1 luispedro      2001      0 Jan  6 12:30 cache[...]

Now, I could browse the phone’s directories and files from the computer (uploading files to the phone is not available, as I don’t need it).

What was nice was that, using fusepy (which also needs to be available), the code wasn’t too hard to write. I was then able to see where my phone’s disk space was going (I keep running out of it) and delete a few gigabytes of pictures I had anyway already saved somewhere else (by making sure the hashes matched).
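For flavour, here is a bare-bones sketch of the approach (this is not the actual android-fuse code: it only lists directories by shelling out to adb and pretends everything is a directory):

import stat
import subprocess
import sys

from fuse import FUSE, Operations   # fusepy

class AdbFS(Operations):
    def _adb_ls(self, path):
        # Ask the phone for a directory listing via adb
        out = subprocess.run(['adb', 'shell', 'ls', path or '/'],
                             capture_output=True, text=True)
        return out.stdout.split()

    def readdir(self, path, fh):
        return ['.', '..'] + self._adb_ls(path)

    def getattr(self, path, fh=None):
        # The real code would parse `adb shell ls -l` to fill in sizes,
        # permissions, and file-vs-directory; here everything is a directory.
        return {'st_mode': stat.S_IFDIR | 0o755, 'st_nlink': 2}

if __name__ == '__main__':
    FUSE(AdbFS(), sys.argv[1], foreground=True, ro=True)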

It’s available at: https://github.com/luispedro/android-fuse. But rely on it at your own risk! It is best-effort code and works on my phone, but I didn’t vet it 100%. It’s also kind of slow to list directories and such.

(It may also work on Mac, as Mac also has FUSE; but I cannot test it).