How can computational results be so hard to reproduce? Even with the same input and the same code one can get different results. Shouldn’t computers always return the same results for the same computation?
Let’s look at a few classes of problems, from the easiest to solve to the most complicated.
Different, but equivalent, results
Two different gzip files can uncompress to the same result.
This is obviously a meaningless difference. When we promise that we return a certain result, we should not bound ourselves to specific ways of encoding it.
On NGLess, we did learn this the hard way because some of our tests seemed to be flaky at some point as we were comparing the compressed files. So, depending on the machine, tests would either pass or fail. Now, we test the uncompressed versions
Incompletely specified results
What does a sort algorithm return? Well, obviously it should return a sorted version of its inputs. The problem comes when there are “equivalent” (but not identical) items in the set: in which order should they be returned?
In this case, one can use stable sorting, which preserve the order of “equivalent” input elements. Unfortunately, the fastest sorting algorithms are not stable and use randomness (see below). Alternatively, one can use some tie breaking system so that no two elements compare equal. Now, the results are fully specified by the inputs. This can be done even on attributes that would otherwise be meaningless: for example, if you want to display the results of your processing so that the highest scoring sequences come first, you can sort by scores and, if the scores are identical, break ties using the sequence itself (it’s pretty meaningless that sequences starting with Alanines should come before those starting with Valines, but it means that the output is completely specified).
Another problem is when results depend on the environment. For example, if your tool sorts strings, then it will depend on the environment how this sorting is done! This is a huge rabbit hole and arguably a big mistake in API design that the default sort
function in programming languages is not a pure function but depends on some deeply hidden state, but we have seen it cause problems where partial results were sorted in incompatible ways. In NGLess, we always use UTF-8 and we always sort in the same way (our results matrices are sorted by row name and those always use the same sort). The cost is that we will not respect all the nuances in how sorting “should” be done differently in Canadian French vs. European French. The gain is that Canadians and French will get the same results.
Pseudo-randomness
Many algorithms use random numbers. In practice, one rarely needs truly random numbers and can use pseudo-random numbers, which only behave random, but are perfectly reproducible.
Furthermore, these methods can take a seed which sets their internal machinery to known values so that one can obtain the same sequence every time.
As an alternative to setting it to the same value (which is not appropriate for every situation), one can also set it to a data-dependent value. For example, if you process sequences by batches of 100 sequences, it may be inappropriate to reuse the same seed for every new batch as this could easily create biases. Instead, one can set the seed based on a simple computation from the input data itself (a quick hash of the inputs). Now, each batch will be (1) reproducible and (2) have a different pseudo-random pattern.
Non-deterministic results
Some processes can become truly non-deterministic, not just pseudo-random. This can easily happen if, for example, threads are used.
In the example above, I mentioned resetting the seed for each batch. In a sequential system, this would be complete overkill, but if batches are being processed by separate threads, it is the only way.
Even with these tricks, one can easily have non-deterministic results if there is any state shared between batches or if the order in which they are computed influences the result.
Heuristics and approximations
Now, we get into the really complicated cases: very often, we do not have a true solution to the problem. When we use a tool like bwa we are not really solving the problem of find all the best alignments given a specific definition. There is software that solves that (e.g., Swipe), but it is too slow. Instead, bwa employs a series of very smart heuristics that will give you a very good approximate solution at a small fraction of the (computational) cost.
Now, definition becomes the output is the result of running bwa. This seems qualitatively different from saying the output is a sorted version of the input.
Versions
If we now admit that our results are defined by this is the result of running program X on the data as opposed to a more classical mathematical definition, then it becomes imperative that we specify which program it is. Furthermore, we need to specify which version of the program it is. In fact, specifying version is a well-recognized best practice in computational software.
The problem is that it is very hard to version the full stack. You may write in your manuscript that you are using version 1.6.3 of tool X, but if that tool depends on Numpy and Python, you may need to define the full version of those as well (and even that may not be enough). So, while it may be true that computers return the same result for the same computation, this means that we need to present the computer with the same computation all the way from the script code we wrote through to the device drivers.
In this respect, R is a bit better than Python at keeping compatibility, but even R has changed elements such as the random number generator it uses so that even if you were setting the seed to a fixed value as we discussed above, it would give you different results.
My preference is that, if people are going to be providing versions, that they that they provide a machine-readable way to generate the full environment (e.g., a default.nix file, a environment.yml conda file, …). Otherwise, while it is not completely useless, it is often not that informative either.
Nonetheless, this comes with costs: it becomes harder to compose. If tool 1 needs Python 3.6.4, tool 2 needs Python 3.5.3, and tool 3 needs Python 3.5.1, we must have all of them available and switch between them. We do have more and more infrastructure to make this switches fast-enough, but we still end installing Gigabytes of dependencies to run a script of 230 lines.
This also points to another direction: the more we can move away from this is the result of running X v1.2.3 to having outputs be defined by their inputs, the less dependent on specific versions of the tools we become. It may be impossible to get this 100%, but maybe we can get better than we have now. With NGLess, we have tried to move that way in minor ways so that the result does not depend on the version of the tool being run, but we’re still not 100% there.