What’s the deal with computational irreproducibility?

How can computational results be so hard to reproduce? Even with the same input and the same code one can get different results. Shouldn’t computers always return the same results for the same computation?

Let’s look at a few classes of problems, from the easiest to solve to the most complicated.

Different, but equivalent, results

Two different gzip files can uncompress to the same result.

This is obviously a meaningless difference. When we promise that we return a certain result, we should not bound ourselves to specific ways of encoding it.

On NGLess, we did learn this the hard way because some of our tests seemed to be flaky at some point as we were comparing the compressed files. So, depending on the machine, tests would either pass or fail. Now, we test the uncompressed versions

Incompletely specified results

What does a sort algorithm return? Well, obviously it should return a sorted version of its inputs. The problem comes when there are “equivalent” (but not identical) items in the set: in which order should they be returned?

In this case, one can use stable sorting, which preserve the order of “equivalent” input elements. Unfortunately, the fastest sorting algorithms are not stable and use randomness (see below). Alternatively, one can use some tie breaking system so that no two elements compare equal. Now, the results are fully specified by the inputs. This can be done even on attributes that would otherwise be meaningless: for example, if you want to display the results of your processing so that the highest scoring sequences come first, you can sort by scores and, if the scores are identical, break ties using the sequence itself (it’s pretty meaningless that sequences starting with Alanines should come before those starting with Valines, but it means that the output is completely specified).

Another problem is when results depend on the environment. For example, if your tool sorts strings, then it will depend on the environment how this sorting is done! This is a huge rabbit hole and arguably a big mistake in API design that the default sort function in programming languages is not a pure function but depends on some deeply hidden state, but we have seen it cause problems where partial results were sorted in incompatible ways. In NGLess, we always use UTF-8 and we always sort in the same way (our results matrices are sorted by row name and those always use the same sort). The cost is that we will not respect all the nuances in how sorting “should” be done differently in Canadian French vs. European French. The gain is that Canadians and French will get the same results.


Many algorithms use random numbers. In practice, one rarely needs truly random numbers and can use pseudo-random numbers, which only behave random, but are perfectly reproducible.

Furthermore, these methods can take a seed which sets their internal machinery to known values so that one can obtain the same sequence every time.

As an alternative to setting it to the same value (which is not appropriate for every situation), one can also set it to a data-dependent value. For example, if you process sequences by batches of 100 sequences, it may be inappropriate to reuse the same seed for every new batch as this could easily create biases. Instead, one can set the seed based on a simple computation from the input data itself (a quick hash of the inputs). Now, each batch will be (1) reproducible and (2) have a different pseudo-random pattern.

Non-deterministic results

Some processes can become truly non-deterministic, not just pseudo-random. This can easily happen if, for example, threads are used.

In the example above, I mentioned resetting the seed for each batch. In a sequential system, this would be complete overkill, but if batches are being processed by separate threads, it is the only way.

Even with these tricks, one can easily have non-deterministic results if there is any state shared between batches or if the order in which they are computed influences the result.

Heuristics and approximations

Now, we get into the really complicated cases: very often, we do not have a true solution to the problem. When we use a tool like bwa we are not really solving the problem of find all the best alignments given a specific definition. There is software that solves that (e.g., Swipe), but it is too slow. Instead, bwa employs a series of very smart heuristics that will give you a very good approximate solution at a small fraction of the (computational) cost.

Now, definition becomes the output is the result of running bwa. This seems qualitatively different from saying the output is a sorted version of the input.


If we now admit that our results are defined by this is the result of running program X on the data as opposed to a more classical mathematical definition, then it becomes imperative that we specify which program it is. Furthermore, we need to specify which version of the program it is. In fact, specifying version is a well-recognized best practice in computational software.

The problem is that it is very hard to version the full stack. You may write in your manuscript that you are using version 1.6.3 of tool X, but if that tool depends on Numpy and Python, you may need to define the full version of those as well (and even that may not be enough). So, while it may be true that computers return the same result for the same computation, this means that we need to present the computer with the same computation all the way from the script code we wrote through to the device drivers.

In this respect, R is a bit better than Python at keeping compatibility, but even R has changed elements such as the random number generator it uses so that even if you were setting the seed to a fixed value as we discussed above, it would give you different results.

My preference is that, if people are going to be providing versions, that they that they provide a machine-readable way to generate the full environment (e.g., a default.nix file, a environment.yml conda file, …). Otherwise, while it is not completely useless, it is often not that informative either.

Nonetheless, this comes with costs: it becomes harder to compose. If tool 1 needs Python 3.6.4, tool 2 needs Python 3.5.3, and tool 3 needs Python 3.5.1, we must have all of them available and switch between them. We do have more and more infrastructure to make this switches fast-enough, but we still end installing Gigabytes of dependencies to run a script of 230 lines.

This also points to another direction: the more we can move away from this is the result of running X v1.2.3 to having outputs be defined by their inputs, the less dependent on specific versions of the tools we become. It may be impossible to get this 100%, but maybe we can get better than we have now. With NGLess, we have tried to move that way in minor ways so that the result does not depend on the version of the tool being run, but we’re still not 100% there.

How Notebooks Should Work

Joel Grus’ presentation on why he does not like notebooks sparked a flurry of notebook-related discussion.

I like the idea of notebooks more than I like actual notebooks. I tried to use them in my analyses for a long time, but eventually gave up as there are too many small annoyances (some that the talk goes over, others that it does not, such as the fact that they do not integrate well with git).

Here is how I think they should work instead:

  1. There is no hidden state. Cells are always run from top to bottom.
  2. If you change a cell in the middle, you immediately clear its output and all those below and the whole thing is run from the top.

For example:

[1] : Code

[2] : Code

[3] : Code

[4] : Code

[5] : Code

Now, if you edit Cell 3, you would get:

[1] : Code

[2] : Code

[3] : New Code
New Output

[ ] : Code

[ ] : Code

If you want, you can run the whole thing now and get the full output:

[1] : Code

[2] : Code

[3] : New Code
New Output

[4] : Code
New Output

[5] : Code
New Output

This way, the whole notebook is always up to date.

But won’t this be incredibly slow if you always have to run it from the top?

Yes, if you implement it naïvely where the kernel really does always re-run from the top, which is not likely to be usable, but you could do a bit of smart caching and keep some intermediate states alive. It would require some engineering, but I think you could keep a few live kernels in intermediate states to make the experience usable so that if you edit cell number 35, it does not need to go back to the first cell, but maybe there is a cached kernel that has the state of cell 30 and only 31 and onwards would need to be rerun.

It would take a lot of engineering and it may even be impossible with the current structure of jupyter kernels, but, from a human point-of-view, I think this would be a better user experience.

The Ecosystem of Unix and the Difficulty of Teaching It

Plos One published an awful paper comparing Word vs LaTeX where the task was to copy down a piece of text. Because Word users did better than LaTeX users at this task, the authors conclude that Word is more efficient.

First of all, this fits perfectly with my experience: Word [1] is faster for single page documents, where I don’t care about precise formatting, such as a letter. It says nothing about how it performs on large documents which are edited over months (or years). The typical Word failure mode are “you paste some text here and your image placement is now screwed up seven pages down” or “how do I copy & paste between these two documents without messing up the formatting?” This does not happen so much with a single page document.

Of course, the authors are not happy with the conclusion that Word is better for copying down a short piece of predefined text and instead generalize to “that even experienced LaTeX users may suffer a loss in productivity when LaTeX is used, relative to other document preparation systems.” This is a general failure mode of psychological research: here is a simple, well-defined experimental result in a very artificial setting. Now, let me completely overgeneralize to the real world. The authors of the paper actually say this in their defense: “We understand that people who are not familiar with the experimental methods of psychology (and usability testing) are surprised about our decision to use predefined texts.” That is to say, in our discipline, we are always sort of sloppy, but reviewers in the discipline do the same, so it’s fine.


Now, why waste time bashing a Plos One paper in usability research?

Because one interesting aspect of the discussion is that several people have pointed out that Word is better for collaboration because of the Track Changes features. For many of us, this is laughable because one of the large advantages of LaTeX is that you can use version control on the files. You can easily compare the text written today with a version from two months ago, it makes it easier to have multiple people working, &c.[2] In Word, using Track Changes is still “pass the baton” collaboration, whereby you email stuff around and say “now it’s your turn to edit it” [3].

However, this is only valid if you know how to use version control. In that case, it’s clear that using a text-based format is a good idea and it makes collaboration easier. The same way, I actually think that some of the test subjects in the paper had with LaTeX was simply that they did not use an editor with a spell-checker.

The underlying concept is that LaTeX works in an ecosystem of tools working together, which is a concept that we do not, in general, teach people. I have been involved with Software Carpentry and even before that I was trying teach people who are not trained in computers about these sort of tools, but we do not do that great of a job at teaching this concept, of the ecosystem. It is abstract and not directly clear to students why it is useful.

Spending a few hours going through the basic Unix commands seems like a brain-dead activity when people cannot connect this to their other knowledge or pressing needs.

On the other hand, it is very frustrating when somebody comes to me with a problem they have been struggling with for days and, in a minute, I can give them a solution because it’s often “oh, you can grep in extended mode and pipe it to gawk” (or worse, before they finish the description, I’ll say “run dos2unix and it will fix it” or “the problem you are describing is the exact use case of this excellent Python package, so you don’t need to code it from scratch”). Then they ask “how could I learn that? Is there a book/course?” and I just don’t have an answer better than “do this for 10 years and you’ll slowly get it”.

It’s hard to teach the whole ecosystem at once, which means that it’s hard to teach the abstractions behind it. Or maybe, I just have not yet figured out how it would be possible.


Finally, let me just remark that LaTeX is a particularly crappy piece of software. It is so incredibly bad that it only survives because the alternatives manage to be even worse. It’s even sadder when you realise that LaTeX is now over 30 years old, while Word is an imitation of even older technology We still have not been able to come up with something that is clearly better.


This flawed paper probably had better altmetrics than anything I’ll ever write in science, again showing what a bad idea altmetrics are.

[1] feel free to read “Word or Word-like software” in this and subsequent sentences. I actually often use Google Docs nowadays.
[2] Latexdiff is also pretty helpful in generating diffed versions.
[3] Actually, for collaboration, the Google Docs model is vastly superior as you don’t have to email back-n-forth. It also includes a bit of version control.

Friday Links

1. Should Science put up with sloppiness?

It is interesting that correct papers get retracted for ethical reasons, but rarely just because they are wrong.

2.Fraud & Science

The proposal, submitted some years earlier to a funding agency on a different continent, was copied by one of the reviewers, a highly recognized scientist, and then submitted to the ERC. It was pure chance that the former applicant detected the fraud.

[…] However, the larger legal framework of the European Commission (EC) under which the ERC operates links “frauds” only to financial aspects. The ERC is then obliged to report any (accomplished or attempted) misbehavior to OLAF, the European antifraud police. Financial fraud, however, causes the least headaches. In the above case, the ERC was unable to take action against the mischievous applicant.

Contrast with how standard scientific practice becomes criminal when money is involved:

“If you applied this rule to scientists, a sizable proportion of them might be in jail today,” said Steven N. Goodman, a pediatrician and biostatistician at Stanford University who submitted a statement supporting Harkonen’s appeal.

3. Another weird science fraud case:

The authors of a paper published in July […] are not only unknown at the institution listed on the paper, but no trace of them as researchers can be found.

The paper […] is not the kind of prank that journals have encountered before, in which hoaxsters have submitted dummy papers to highlight weaknesses in the peer-review process. The paper’s reported findings […] are, in fact, true.

Bruce Spiegelman […] says that he has presented similar findings at about six research meetings, and is preparing to submit them to a journal. He suspects that the [paper by unknown authors] was intended as a spoiler of his own lab’s work.

4. More non-computational thinking, the stupid snob edition

David Salz’s thoroughly researched assault on USB’s sonic handicaps delivers a relaxed, well-defined, dynamically evocative, and rhythmically taut performance. The Silver Starlight projects strings without screechiness, which cannot be said of most USB cables. For those seeking a mid-priced USB cable with obviously high build-quality and performance, the Silver Starlight is a solid choice.

The Silver Starlight USB digital cable costs $275/m!

My point stands: it is easier to identify non-computational thinking than to define what computational thinking is.

/ht Carter T Schonwald (@cartazio) on twitter

Non-Computational Thinking

There is a lot of talk about Computational Thinking, typically followed by the observation that we don’t know what it is or cannot define it.

Which I think it is true, but it is perhaps easier if we try to watch out for non-computational thinking instead.


Recently, the MLA defined how to cite a tweet in their (widely used) style:

Begin the entry in the works-cited list with the author’s real name and, in parentheses, user name, if both are known and they differ. If only the user name is known, give it alone.

Next provide the entire text of the tweet in quotation marks, without changing the capitalization. Conclude the entry with the date and time of the message and the medium of publication (Tweet). For example:

Athar, Sohaib (ReallyVirtual). “Helicopter hovering above Abbottabad at 1AM (is a rare event).” 1 May 2011, 3:58 p.m. Tweet.


I think this is an example of non-computational thinking: tweets have unique numeric IDs, so that you can link to them but they are not mentioned. They do discuss that the time stamp is time-zone dependent.

(Although, truth be told, twitter does not make it super-obvious how to get at the tweet ID; not sure why. You need to click on the tweet date to get to the tweet link and read the URL. [I updated this parenthetical paragraph in response to a comment below by Cheng H. Lee])