How Notebooks Should Work

Joel Grus’ presentation on why he does not like notebooks sparked a flurry of notebook-related discussion.

I like the idea of notebooks more than I like actual notebooks. I tried to use them in my analyses for a long time, but eventually gave up as there are too many small annoyances (some that the talk goes over, others that it does not, such as the fact that they do not integrate well with git).

Here is how I think they should work instead:

There is no hidden state. Cells are always run from top to bottom.
If you change a cell in the middle, you immediately clear its output and all those below and the whole thing is run from the top.

For example:

[1] : Code
Output

[2] : Code
Output

[3] : Code
Output

[4] : Code
Output

[5] : Code
Output

Now, if you edit Cell 3, you would get:

[1] : Code
Output

[2] : Code
Output

[3] : New Code
New Output

[ ] : Code

[ ] : Code

If you want, you can run the whole thing now and get the full output:

[1] : Code
Output

[2] : Code
Output

[3] : New Code
New Output

[4] : Code
New Output

[5] : Code
New Output

This way, the whole notebook is always up to date.

But won’t this be incredibly slow if you always have to run it from the top?

Yes, if you implement it naïvely where the kernel really does always re-run from the top, which is not likely to be usable, but you could do a bit of smart caching and keep some intermediate states alive. It would require some engineering, but I think you could keep a few live kernels in intermediate states to make the experience usable so that if you edit cell number 35, it does not need to go back to the first cell, but maybe there is a cached kernel that has the state of cell 30 and only 31 and onwards would need to be rerun.

It would take a lot of engineering and it may even be impossible with the current structure of jupyter kernels, but, from a human point-of-view, I think this would be a better user experience.

Quick followups: NGLess benchmark & Notebooks as papers

A quick follow-up on two earlier posts:

We finalized the benchmark for ngless that I had discussed earlier:

As you can see, NGLess performs much better than either MOCAT or htseq-count. We tried to use featureCounts too, but that completely failed to produce results for some of the samples (we gave it a whopping 1TB of RAM, but it used it all up before crashing).

It also reveals that although ngless was developed in the context of our metagenomics work, it would also likely do well on the type of problems for which htseq-count is currently being used, in the domain of RNA-seq.

Earlier, I also wrote skeptically about the idea of replacing papers with Jupyter notebooks:

Is it even a good idea to have the presentation of the results mixed with their computation?

I do see the value in companion Jupyter notebooks for many cases, but as a replacement for the main paper, I am not even sure it is a good idea.

There is a lot of accidental complexity in code. A script that generates a publication plot may easily have 50 lines that do nothing more than set up the plot just right: (1) set up the subplots, (2) set x- and y-labels, (3) fix colours, (4) scale the points, (5) reset the font sizes, &c. What value is there in keeping all of this in the main presentation of the results?

The little script that generates the plot above is an excellent example of this. It is available online (on github: plot-comparison.py). It’s over 100 lines and, even then, the final result required some minor aesthetic manipulations in inkscape (so that, if you run it, the result is slightly uglier: in particular, the legend is absent).

Would it really add anything to the presentation of the manuscript to have those 100 lines of code be intermingled with the presentation of ngless as a metagenomics profiler?

In this case, I am confident that a Jupyter notebook would be worse than the current solution of a PDF as a main presentation with the data table and plotting scripts as supplemental material.

The Scientific Paper of the Future is Probably a PDF

I do not mean to say the scientific paper of the future should be a PDF, I just mean that it will mostly likely be a PDF or some PDF-derived format. By future, I mean around 2040 (so, in 20-25 years).

I just read James Somers in the Atlantic, arguing that The Scientific Paper Is Obsolete (Here’s what’s next). In that article, he touts Mathematica notebooks as a model of what should be done and Jupyter as the current embodiment of this concept.

I will note that Mathematica came out in 1988 (a good 5 years before the PDF format) and has yet failed to take the world by storm (the article claims that “the program soon became as ubiquitous as Microsoft Word”, a claim which is really hard to reconcile with reality). Perhaps Mathematica was held back because it’s expensive and closed source (but so is Microsoft Word, and Word has taken the world by storm).

How long did it take to get to HTML papers?

For a very long time, the future of the scientific paper was going to be some smart version of HTML. We did eventually get to the point where most journals have decent HTML versions of their papers, but it’s mostly dumb HTML.

As far as I can tell, none of the ideas of having a semantically annotated paper panned out. About 10 years ago, the semantic web was going to revolutionize science. That didn’t happen and it’s even been a while since I heard someone arguing that that would be the future of the scientific paper.

Tools like Read Cube or Paperpile still parse the PDFs and try to infer what’s going on instead of relying on fancy semantic annotations.

What about future proofing the system?

About a week ago, I tweeted:

A comment on both publication delays and the Python data science ecosystem:

Code that ran with package versions that were current when the paper was submitted will not run with versions that are current when the paper was accepted.

— Luis Pedro Coelho (@luispedrocoelho) April 4, 2018

This is about a paper which is now in press. It’s embargoed, but I’ll post about it when it comes out in 2 weeks.

I have complained before about the lack of backwards compatibility in the Python ecosystem. I can open and print a PDF from 20 years ago (or a PostScript file from the early 1980s) without any issues, but I have trouble running a notebook from last year.

At this point, someone will say docker! and, yes, I can build a docker image (or virtual machine) with all my dependencies and freeze that, but who can commit to hosting/running these over a long period? What about the fact that even tech-savvy people struggle to keep all these things properly organized? I can barely get co-authors to move beyond the “let’s email Word files back and forth.”

With less technical co-authors, can you really imagine them downloading a docker container and properly mounting all the filesystems with OverlayFS to send me back edits? Sure, there are a bunch of cool startups with nicer interfaces, but will they be here in 2 years (let alone 20)?

Is it even a good idea to have the presentation of the results mixed with their computation?

I do see the value in companion Jupyter notebooks for many cases, but as a replacement for the main paper, I am not even sure it is a good idea.

There is a lot of accidental complexity in code. A script that generates a publication plot may easily have 50 lines that do nothing more than set up the plot just right: (1) set up the subplots, (2) set x- and y-labels, (3) fix colours, (4) scale the points, (5) reset the font sizes, &c. What value is there in keeping all of this in the main presentation of the results?

Similarly, all the file paths and the 5 arguments you need to pass to pandas.read_table to read the data correctly: why should we care when we are just trying to get the gist of the results? One of our goals in NGLess is to try to separate some of this accidental complexity from the main processing pipeline, but this also limits what we can do with it (this is the tradeoff, it’s a domain specific tool; it’s hard to achieve the same with a general purpose tool like Jupyter/Python).

I do really like Jupyter for tutorials as the mix of code and text are a good fit. I will work to make sure I have something good for the students, but I don’t particularly enjoy working with the notebook interface, so I need to be convinced before I jump on the bandwagon more generally.

I actually do think that the future will eventually be some sort of smarter thing than simulated paper, but I also think that (1) it will take much longer than 20 years and (2) it probably won’t be Jupyter getting us there. It’s a neat tool for many things, but it’s not a PDF killer.

At the Olympics, the US is underwhelming, Russia still overperforms, and what’s wrong with Southern Europe (except Italy)?

Russia is doing very well. The US and China, for all their dominance of the raw medal tables are actually doing just as well as you’d expect.

Portugal, Spain, and Greece should all be upset at themselves, while the fourth little piggy, Italy, is doing quite alright.

What determines medal counts?

I decided to play a data game with Olympic Gold medals and ask not just “Which countries get the most medals?” but a couple of more interesting questions.

My first guess of what determines medal counts was total GDP. After all, large countries should get more medals, but economic development should also matter. Populous African countries do not get that many medals after all and small rich EU states still do.

Indeed, GDP (at market value), does correlate quite well with the weighted medal count (an artificial index where gold counts 5 points, silver 3, and bronze just 1)

Much of the fit is driven by the two left-most outliers: US and China, but the fit explains 64% of the variance, while population explains none.

Adding a few more predictors, we can try to improve, but we don’t actually do that much better. I expect that as the Games progress, we’ll see the model fits become tighter as the sample size (number of medals) increases. In fact, the model is already performing better today than it was yesterday.

Who is over/under performing?

The US and China are right on the fit above. While they have more medals than anybody else, it’s not surprising. Big and rich countries get more medals.

The more interesting question is: which are the countries that are getting more medals than their GDP would account for?

Top 10 over performers

These are the 10 countries which have a bigger ratio of actual total medals to their predicted number of medals:

                delta  got  predicted     ratio
Russia       6.952551   10   3.047449  3.281433
Italy        5.407997    9   3.592003  2.505566
Australia    3.849574    7   3.150426  2.221921
Thailand     1.762069    4   2.237931  1.787366
Japan        4.071770   10   5.928230  1.686844
South Korea  1.750025    5   3.249975  1.538473
Hungary      1.021350    3   1.978650  1.516185
Kazakhstan   0.953454    3   2.046546  1.465884
Canada       0.538501    4   3.461499  1.155569
Uzbekistan   0.043668    2   1.956332  1.022322

Now, neither the US nor China are anywhere to be seen. Russia’s performance validates their state-funded sports program: the model predicts they’d get around 3 medals, they’ve gotten 10.

Italy is similarly doing very well, which surprised me a bit. As you’ll see, all the other little piggies perform poorly.

Australia is less surprising: they’re a small country which is very much into sports.

After that, no country seems to get more than twice as many medals as their GDP would predict, although I’ll note how Japan/Thailand/South Kore form a little Eastern Asia cluster of overperformance.

Top 10 under performers

This brings up the reverse question: who is underperforming? Southern Europe, it seems: Spain, Portugal, and Greece are all there with 1 medal against predictions of 9, 6, and 6.

France is country which is missing the most medals (12 predicted vs 3 obtained)! Sometimes France does behave like a Southern European country after all.

                delta  got  predicted     ratio
Spain       -8.268615    1   9.268615  0.107891
Poland      -6.157081    1   7.157081  0.139722
Portugal    -5.353673    1   6.353673  0.157389
Greece      -5.342835    1   6.342835  0.157658
Georgia     -4.814463    1   5.814463  0.171985
France      -9.816560    3  12.816560  0.234072
Uzbekistan  -3.933072    2   5.933072  0.337093
Denmark     -3.566784    3   6.566784  0.456845
Philippines -3.557424    3   6.557424  0.457497
Azerbaijan  -2.857668    3   5.857668  0.512149

The Caucasus (Georgia, Uzbekistan, Azerbaijan) may show up as their wealth is mostly due to natural resources and not development per se (oil and natural gas do not win medals, while human capital development does).

I expect that these lists will change as the Games go on as maybe Spain is just not as good at the events that come early in the schedule. Expect an updated post in a week.

Technical details

The whole analysis was done as a Jupyter notebook, available on github. You can use mybinder to explore the data. There, you will even find several little widgets to play around.

Data for medal counts comes from the medalbot.com API, while GDP/population data comes from the World Bank through the wbdata package.

	PLOS ONE：我該不該投這本期刊？… on How Long Does Plos One Take to…
	Ravindra Puli on ANN : Diskhash. Disk-based, pe…
	Covid-falcões, covid… on Did anyone in Santa Clara Coun…
	Marcelo Huerta on Grit: a non-existent alternati…
	luispedro on Friday Links (one day late,…