What’s the deal with computational irreproducibility?

How can computational results be so hard to reproduce? Even with the same input and the same code one can get different results. Shouldn’t computers always return the same results for the same computation?

Let’s look at a few classes of problems, from the easiest to solve to the most complicated.

Different, but equivalent, results

Two different gzip files can uncompress to the same result.

This is obviously a meaningless difference. When we promise that we return a certain result, we should not bound ourselves to specific ways of encoding it.

On NGLess, we did learn this the hard way because some of our tests seemed to be flaky at some point as we were comparing the compressed files. So, depending on the machine, tests would either pass or fail. Now, we test the uncompressed versions

Incompletely specified results

What does a sort algorithm return? Well, obviously it should return a sorted version of its inputs. The problem comes when there are “equivalent” (but not identical) items in the set: in which order should they be returned?

In this case, one can use stable sorting, which preserve the order of “equivalent” input elements. Unfortunately, the fastest sorting algorithms are not stable and use randomness (see below). Alternatively, one can use some tie breaking system so that no two elements compare equal. Now, the results are fully specified by the inputs. This can be done even on attributes that would otherwise be meaningless: for example, if you want to display the results of your processing so that the highest scoring sequences come first, you can sort by scores and, if the scores are identical, break ties using the sequence itself (it’s pretty meaningless that sequences starting with Alanines should come before those starting with Valines, but it means that the output is completely specified).

Another problem is when results depend on the environment. For example, if your tool sorts strings, then it will depend on the environment how this sorting is done! This is a huge rabbit hole and arguably a big mistake in API design that the default sort function in programming languages is not a pure function but depends on some deeply hidden state, but we have seen it cause problems where partial results were sorted in incompatible ways. In NGLess, we always use UTF-8 and we always sort in the same way (our results matrices are sorted by row name and those always use the same sort). The cost is that we will not respect all the nuances in how sorting “should” be done differently in Canadian French vs. European French. The gain is that Canadians and French will get the same results.


Many algorithms use random numbers. In practice, one rarely needs truly random numbers and can use pseudo-random numbers, which only behave random, but are perfectly reproducible.

Furthermore, these methods can take a seed which sets their internal machinery to known values so that one can obtain the same sequence every time.

As an alternative to setting it to the same value (which is not appropriate for every situation), one can also set it to a data-dependent value. For example, if you process sequences by batches of 100 sequences, it may be inappropriate to reuse the same seed for every new batch as this could easily create biases. Instead, one can set the seed based on a simple computation from the input data itself (a quick hash of the inputs). Now, each batch will be (1) reproducible and (2) have a different pseudo-random pattern.

Non-deterministic results

Some processes can become truly non-deterministic, not just pseudo-random. This can easily happen if, for example, threads are used.

In the example above, I mentioned resetting the seed for each batch. In a sequential system, this would be complete overkill, but if batches are being processed by separate threads, it is the only way.

Even with these tricks, one can easily have non-deterministic results if there is any state shared between batches or if the order in which they are computed influences the result.

Heuristics and approximations

Now, we get into the really complicated cases: very often, we do not have a true solution to the problem. When we use a tool like bwa we are not really solving the problem of find all the best alignments given a specific definition. There is software that solves that (e.g., Swipe), but it is too slow. Instead, bwa employs a series of very smart heuristics that will give you a very good approximate solution at a small fraction of the (computational) cost.

Now, definition becomes the output is the result of running bwa. This seems qualitatively different from saying the output is a sorted version of the input.


If we now admit that our results are defined by this is the result of running program X on the data as opposed to a more classical mathematical definition, then it becomes imperative that we specify which program it is. Furthermore, we need to specify which version of the program it is. In fact, specifying version is a well-recognized best practice in computational software.

The problem is that it is very hard to version the full stack. You may write in your manuscript that you are using version 1.6.3 of tool X, but if that tool depends on Numpy and Python, you may need to define the full version of those as well (and even that may not be enough). So, while it may be true that computers return the same result for the same computation, this means that we need to present the computer with the same computation all the way from the script code we wrote through to the device drivers.

In this respect, R is a bit better than Python at keeping compatibility, but even R has changed elements such as the random number generator it uses so that even if you were setting the seed to a fixed value as we discussed above, it would give you different results.

My preference is that, if people are going to be providing versions, that they that they provide a machine-readable way to generate the full environment (e.g., a default.nix file, a environment.yml conda file, …). Otherwise, while it is not completely useless, it is often not that informative either.

Nonetheless, this comes with costs: it becomes harder to compose. If tool 1 needs Python 3.6.4, tool 2 needs Python 3.5.3, and tool 3 needs Python 3.5.1, we must have all of them available and switch between them. We do have more and more infrastructure to make this switches fast-enough, but we still end installing Gigabytes of dependencies to run a script of 230 lines.

This also points to another direction: the more we can move away from this is the result of running X v1.2.3 to having outputs be defined by their inputs, the less dependent on specific versions of the tools we become. It may be impossible to get this 100%, but maybe we can get better than we have now. With NGLess, we have tried to move that way in minor ways so that the result does not depend on the version of the tool being run, but we’re still not 100% there.

Jug as nix-for-Python

In this post, I want to show how Jug can be understood as nix for Python pipelines.

What is Jug?

Jug is a framework for Python which enables parallelization, memoization of results, and generally facilitates reproducibility of results.

Consider a very classical problem framework: you want to process a set of files (in a directory called data/) and then summarize the results

from glob import glob

def count(f):
    # Imagine a long running computation
    n = 0
    for _ in open(f):
        n += 1
    return n

def mean(partials):
    final = sum(partials)/len(partials)
    with open('results.txt', 'wt') as out:
        out.write(f'Final result: {final}\n')

inputs = glob('data/*.txt')
partials = [count(f) for f in inputs]

This works well, but if the count function takes a while (which would not be the case in this example), it would be great to be able to take advantage of multiple processors (or even a computer cluster) as the problem is embarassingly parallel (this is an actual technical term, by the way).

With Jug, the code looks just a bit differently and we get parallelism for free:

from glob import glob
from jug import TaskGenerator

def count(f):
    # Long running computation
    n = 0
    for _ in open(f):
        n += 1
    return n

def mean(partials):
    final = sum(partials)
    with open('results.txt', 'wt') as out:
        out.write(f'Final result: {final}\n')

inputs = glob('data/*.txt')
partials = [count(f) for f in inputs]

Now, we can use Jug to obtain parallelism, memoization and all the other goodies.

Please see the Jug documentation for more info on how to do this.

What is nix?

Nix is a package management system, similar to those used in Linux distributions or conda.

What makes nix almost unique (Guix shares similar ideas) is that nix attempts perfect reproducibility using hashing tricks. Here’s an example of a nix package:

{ numpy, bottle, pyyaml, redis, six , zlib }:

buildPythonPackage rec {
  pname = "Jug";
  version = "2.0.0";
  buildInputs = [ numpy ];
  propagatedBuildInputs = [

  src = fetchPypi {
    pname = "Jug";
    version = "2.0.0";
    sha256 = "1am73pis8qrbgmpwrkja2qr0n9an6qha1k1yp87nx6iq28w5h7cv";

This is a simplified version of the Jug package itself and (the full thing is in the official repo). Nix language is a bit hard to read in detail. For today, what matters is that this is a package that depends on other packages (numpybottle,…) and is a standard Python package obtained from Pypi (nix has library support for these common use-cases).

The result of building this package is a directory with a name like /nix/store/w8d485y2vrj9wylkd5w4k4gpnf7qh3qk-python3.6-Jug-2.0.0

You may be able to guess that the bit in the middle there w8d485y2vrj9wylkd5w4k4gpnf7qh3qk is a computed hash of some sort. In fact, this is the hash of code to build the package.

If you change the source code for the package or how it is built, then the hash will change. If you change any dependency, then the hash will also change. So, the final result identifies exactly what was used to the get there.

Jug as nix-for-Python pipelines

Above, I did not present the internals of how Jug works, but it is very similar to nix. Let’s unpack the magic a bit

def count(f):

def mean(partials):
inputs = glob('data/*.txt')
partials = [count(f) for f in inputs]

This can be seen as an embedded domain-specific language for specifying the dependency graph:

partials = [Task(count, f)
                for f in inputs]
Task(mean, partials)

Now, Task(count, f) will get repeatedly instantiated with a particular value for f. For example, if the files in the data directory are name 0.txt1.txt,…

From the jug manuscript

Jug works by hashing together count and the values of f to uniquely identify the results of each of these tasks. If you’ve used jug, you will have certainly noticed the appearance of a magic directory jugfile.jugdata with files named such as jugfile.jugdata/37/b4f7f68c489a6cf3e62fdd7536e1a70d0d2b87. This is equivalent to the /nix/store/w8d485y2vrj9wylkd5w4k4gpnf7qh3qk-python3.6-Jug-2.0.0 path above: it uniquely identifies the result of some computational process so that, if anything changes, the path will change.

Like nix, it works recursively, so that Task(mean, partials), which expands to Task(mean, [Task(count, "0.txt"), Task(count, "1.txt"), Task(count, "2.txt")]) (assuming 3 files, called 0.txt,…) has a hash value that depends on the hash values of all the dependencies.

So, despite the completely different origins and implementations, in the end, Jug and nix share many of the same conceptual models to achieve something very similar: reproducible computations.

NIXML: nix + YAML for easy reproducible environments

The rise and fall of bioconda

A year ago, I remember a conversation which went basically like this:

Them: So, to distribute my package, what do you think I should use?

Me: You should use bioconda.

Them: OK, that’s interesting, but what about …?

Me: No, you should use bioconda.

Them: I will definitely look into it, but maybe it doesn’t fit my package and maybe I will just …

Me: No, you should use bioconda.

That was a year ago. Mind you, I knew there were some issues with conda, but it had a decent UX (user-experience), and more importantly, the community was growing.

Since then, conda has hit scalability concerns, which means that running it is increasingly frustrating: it is slow (an acknowledged issue, but I have had multiple instances of wait 20 minutes for an error message, which didn’t even help me solve the problem); mysterious errors are not uncommon, things that used to work now fail (I have had this more and more recently).

Thus, I no longer recommend bioconda so enthusiastically. What before seemed like some esoteric concerns about guaranteed correctness are now biting us.

The nix model

nix is a Linux distribution with a focus on declarativereproducible builds.

You write a little file (often called default.nix) which describes exactly what you want and the environment is generated from this, exactly the same each time. It has a lot going for it in terms of potential for science:

  1. Can be reproducible to a fault (Byte-for-Byte reproducibility, almost).
  2. Declarative means that the best practice of store your environment for later use is very easy to implement1

Unfortunately, the UX of nix is not great and making the environments reproducible, although possible is not so trivial (although it is now much easier). Nix is very powerful, but it uses a complicated domain-specific language and a semi-documented, ever evolving, set of build conventions which makes it hard for even experienced users to use it directly. There is no way that I can recommend it for general use.

The stack model

Stack is a tool for Haskell which uses the following concept for reproducible environments:

  1. The user specifies a list of packages that they want to use
  2. The user specifies a snapshot of the package directory.

The snapshot determines the versions of all of the packages, which automated testing has revealed to work together (at least up to the limits of unit testing). Furthermore, there is no need to say “version X.Y.Z of package A; version Q.R.S of package B,…”: you specify a single, globally encompassing version (note that this is one of the principles we adopted in NGLess, as we describe in the manuscript).

I really like this UX:

  • Want to update all your packages? just change this one number.
  • Didn’t work? just change it back: you are back where you started. This  is the big advantage of declarative approaches: what you did before does not matter, only the current state of the project.
  • Want to recreate an environment? just use this easy to read text file (for technical reasons, two files, but you get the drift).



This is an afternoon hack, but the idea is to combine nix’s power with stack‘s UX by allowing you specify a set of packages in nix, using YaML

For example, start with this env.nlm file,

nixml: v0.0
snapshot: stable-19.03
  - lang: python
    version: 2
      - numpy
      - scipy
      - matplotlib
      - mahotas
      - jupyter
      - scikitlearn
  - lang: nix
      - vim

Now, running

nixml shell

returns a shell with the packages listed. Running

nixml shell –pure

returns a shell with only the packages listed, so you can be sure to not rely on external packages.

Internally, this just creates a nix file and runs it, but it adds the stack-like interface:

  1. it is always automatically pinned: you see the stable-19.03 thing? That means, the version of these packages that was available in the stable branch on March 2019.
  2. the syntax is simple, no need to know about python2.withPackages or any other nix internals like that. This means a loss of power for the user, but it will be a better trade-off 99% of the time.

Very much a Work in progress right now, but I am putting it out there as it is already usable for Python-based projects.

  1. There are two types of best practices advice: the type that, most people, once they try it out, adopt; and the type that you need to keep hammering into people’s heads. The second type should be seen as a failure of the tool: “best practices” are a user-experience smell.

Numpy/scipy backwards stability debate (and why freezing versions is not the solution)

This week, a discussion broke out about the stability of the Python scientific ecosystem. It was triggered by a blogpost from Konrad Hinsen, which led to several twitter follow ups.

First of all, let me  say that numpy/scipy are great. I use them and recommend them all the time. I am not disparaging the projects as a whole or the people who work on them. It’s just that I would prefer if they were stabler. Given twitter’s limitations, perhaps this was not as clear as I would like on my own twitter response:

I pointed out that I have been bit by API changes:

All of these are individually minor (and can be worked around), but these are just the issues that I have personally ran into and caused enough problems for me to remember them. The most serious was the mannwhitneyu change, which was a silent change (i.e., the function started returning a different result rather than raising an exception or another type of error).


Konrad had pointed out the Linux kernel project as one extreme version of “we never break user code”:

The other extreme is the Silicon-Valley-esque “move fast and break stuff”, which is appropriate for a new project. These are not binary decisions, but two extremes of a continuum. I would hope to see numpy move more towards the “APIs are contracts we respect” side of the spectrum as I think it currently behaves too much like a startup.

Numpy does not use semantic versioning, but if it did almost all its releases would be major releases as they almost always break some code. We’d be at Numpy 14.0 by now. Semantic versioning would allow for a smaller number of “large, possibly-breaking releases” (every few years) instead of a constant stream of minor backwards-incompatible changes. We’d have Numpy 4.2 now, and a list of deprecated features to be removed by 5.0.

Some of the problems that have arisen could have been solved by (1) introducing a new function with the better behaviour, (2) deprecating the old one, (3) waiting a few years and removing the original version (in a major release, for example). This would avoid the most important problem, silent changes.


A typical response is to say “well, just use anaconda (or similar) to freeze your dependencies”. Again, let me be clear, I use and recommend anaconda for everything. It’s great. But, in the context of the backwards compatibility problem, I don’t think this recommendation is really thought through as it only solves a fraction of the problem at hand (again, an important fraction but it’s not a magic wand).  (See also this post by Titus Brown).

What does anaconda not solve? It does not solve the problem of the intermediate layer, libraries which use numpy, but are to be used by final users. What is the expectation here? That I release my computer vision code (mahotas) with a note: Only works on Numpy 1.11? What if I want a project that uses both mahotas and scikit-learn, but scikit-learn is for Numpy 1.12 only? Is the conclusion that you cannot mix mahotas and scikit-learn? This would be worse than the Python 2/3 split. A typical project of mine might use >5 different numpy-dependent libraries. What are the chances that they all expect the exact same numpy version?

Right now, the solution I use in my code is “if I know that this numpy function has changed behaviour, either work around it, avoid it, or reimplement it (perhaps by copying and pasting from numpy)”. For example, some functions return views or copies depending on which version of numpy you have. To handle that, just add a “copy()” statement to all of them and now you always have a copy. It’s computationally inefficient, but avoiding even a single bug over a few years probably saves more time in the end.

It also happens all the time that I have an anaconda environment, add a new package and numpy is upgraded/downgraded. Is this to be considered buggy behaviour by anaconda? Anaconda currently does not upgrade everything to Python 3 when you request a package that is not available on Python 2, nor does it downgrade from 3 to 2; why should it treat numpy any differently if there is no guarantee that behaviour is consistent across numpy verions?

Sometimes the code at hand is not even an officially released library, but some code from another project. Let’s say that I have code that takes a metagenomics abundance matrix, does some preprocessing and computes stats and plots. I might have written it originally for a paper a few years back, but now want to do the same analysis on new data. Is the recommendation that I always write from scratch because it’s a new numpy version? What if it’s someone else asking me for the code? Is the recommendation that I ask “are you still on numpy 1.9, because I only really tested it there”. Note that Python’s dynamic nature actually makes this problem worse than in statically checked languages.

What about training materials? As I also wrote on twitter, it’s when teaching Python that I suffer most from Python 2-vs-Python 3 issues. Is the recommendation that training materials clearly state “This is a tutorial for numpy 1.10 only. Please downgrade to that version or search for a more up to date tutorial”? Note that beginners are the ones most likely to struggle with these issues. I can perfectly understand what it means that: “array == None and array != None do element-wise comparison”(from the numpy 1.13 release notes). But if I was just starting out, would I understand it immediately?

Freezing the versions solves some problems, but does not solve the whole issue of backwards compatibility.

How NGLess uses its version declaration

NGLess is my metagenomics tool, which is based on a domain specific language. So, NGLess is both a language and a tool (which implements the language).

Since the beginning, ngless has had a focus on reproducibility and one the small ways in which this was implemented was that ngless requires a version declaration. Every ngless script is required to start with a version declaration:

    ngless "0.5"

This was always intended to enable the language to change while keeping perfect reproducibility of past scripts. Until recently, though, this was just hypothetical.

In October, I taught a course on NGLess and it became clear that one of the minor inconsistencies in the previous version of the language (at the time, version “0.0”) was indeed confusing. In the previous version of the language, the preprocess function modified its arguments. No other function did this.

In version “0.5” (which was released on November 1st), preprocess is now a pure function, so that you must assign its output to a value.

However, and this is where the version declaration comes into play, the newer executable still accepts scripts with the version declaration ngless "0.0". Furthermore, if you declare your script as using ngless 0.0, then the old behaviour is used. Thus, we fixed the language, but nobody needs to update their scripts.

Implementation note (which shouldn’t concern the user, but may be interesting to others): before interpretation, ngless will transform the input script, adding checks and optimizing it. A new pass (which is only enabled is the user requested version “0.0”), simply transforms the older code into its newer counterpart. Then, the rest of the process proceeds as if the user had typed in the newer version.

NGless Miscellania [5/5]

NOTE: As of Apr 2016, ngless is available only as a pre-release to allow for testing of early versions by fellow scientists and to discusss ideas. We do not consider it /released/ and do not recommend use in production as some functionality is still in testing. Please if you are interested in using ngless in your projects.

This is the last in a series of five posts introducing ngless.

  1. Introduction to ngless
  2. Perfect reproducibility using ngless
  3. Fast and high quality error detection
  4. Extending and interacting with other projects
  5. Miscellaneous [this post]

Ngless has a few not so visible details that can come in handy.

Local installation

ngless relies on a few third-party utilities (bwa and samtools, besides any other modules you install) as well as possibly reference information. However, it does not require either (1) a super user install nor (2) fiddling with PATH variables or such. It is happy to install its data into your home directory and run from there.

You can also install it globally, of course, but in many academic settings, you need to ask permission to install a package globally, while you can do whatever you want in your home directory. NGless is designed with this in mind.

On the fly QC (quality control)

All FastQ files are automatically passed through a QC analysis when you load them and again after any preprocessing step. You do not need to specify QC as a separate step, it just happens. In fact, if possible, ngless will run it on the fly for efficiency reasons.

Best practices should be easy and QC is a best practice.

Subsample mode

Subsample mode simply throws away 99% of the data.

Why would anyone ever want to do this?

This allows you to quickly check whether your pipeline works as expected and the output files are as expected. For example:

ngless --subsample script.ngl

will run script.ngl in subsample mode, which will probably run much faster than the full pipeline, allowing to quickly spot any issues with your code. A 10 hour pipeline will finish in a few minutes when running in subsample mode.

Subsample mode also changes all your write() so that the output files include the subsample extension. That is, a call such as

write(output, ofile='results.txt')

will automatically get rewritten to

write(output, ofile='results.txt.subsample')

This ensures that you do not confuse subsampled results with the real thing. NGless is all about making sure your results are correct, so it tries to avoid confusing you as much as possible (this is similar to how it always writes output files with the atomic protocol so that you never get a partial results file).

Parallel processing & speed

The main goal of ngless is to save bioinformaticians time while improving the results. However, as a side benefit of having a well-defined language, the interpreter can take automatic advantage of multiple processors.

Consider the following script:

ngless '0.0'

input = fastq('input.fq.gz')
preprocess(input) using |r|:
    r = substrim(r, min_quality=45)
    if len(r) < 45:
mapped = map(input, reference='hg19')
counted = count(mapped, features=['gene'])
write(counted, ofile='genes.txt')

Almost all the steps in the pipeline can take advantage of multiple processors:

  1. QC is performed on the fly as the file ‘input.fq.gz’ is being read.
  2. preprocess takes advantage of mulitple processors by processing reads in parallel
  3. map calls bwa which makes use of threads
  4. count again processes the output of mapping in parallel.

To use more than one core in ngless, just use the option -j with the number of threads you want. For example:

ngless -j8 pipeline.ngl

Will run with 8 cores, speeding the processing considerably.

Perfect reproducibility using ngless [2/5]

NOTE: As of Feb 2016, ngless is available only as a pre-release to allow for testing of early versions by fellow scientists and to discusss ideas. We do not consider it /released/ and do not recommend use in production as some functionality is still in testing. Please if you are interested in using ngless in your projects.

This is the second of a series of five posts introducing ngless.

  1. Introduction to ngless
  2. Perfect reproducibility [this post]
  3. Fast and high quality error detection
  4. Extending and interacting with other projects
  5. Miscellaneous

Perfect reproducibility using ngless

With ngless, your analysis is perfectly reproducible forever. In particular, this is achieved by the use of explicit version strings in the ngless scripts. Let us the first two lines in the example we used before:

ngless "0.0"
import OceanMicrobiomeReferenceGeneCatalog version "1.0"

The first line notes the version of ngless that is to be used to run this script. At the moment, ngless is in a pre-release phase and so the version string is “0.0”. In the future, however, this will enable ngless to keep improving while still allowing all past scripts to work exactly as intendended. No more, “I updated some software package and now all my scripts give me different results.” Everything is explicitly versioned.

There are several command line options for ngless, which can change the way that it works internally (e.g., where it keeps its temporary files, whether it is verbose or not, how many cores it should use, &c). You can also use a configuration file to set these options. However, no command line or configuration option change the final output of the analysis. Everything you need to know about the results is in the script.

Reproducible, extendable, and reviewable

It’s not just about reproducibility. In fact, reproducibility is often not that great per se: if you have a big virtual machine image with 50 dependencies, which runs a 10,000 line hairy script to reproduce the plots in a paper, that’s reproducible, but not very useful (except if you want to really dig in). Ngless scripts, however, are easily extendable and even inspectable. Recall the rest of the script (the bits that do actual work):

input = paired('data/data.1.fq.gz', 'data/data.2.fq.gz')
preprocess(input) using |read|:
    read = substrim(read, min_quality=30)
    if len(read) < 45:

mapped = map(input, reference='omrgc')
summary = count(mapped, features=['ko', 'cog'])
write(summary, ofile='output/functional.summary.txt')

If you have ever worked a bit with NGS data, you can probably get the gist of what is going on. Except for maybe some of the details of what substrim does (it trims the read by finding the largest sustring where all nucleotides are of at least the given quality, see docs), your guess of what is going on would be pretty accurate.

It is easily extendable: If you want to add another functional table, perhaps using kegg modules, you just need to edit the features argument to the count function (you’d need to know which ones are available, but after looking that up, it’s a trivial change).

If you now wanted to perform a similar analysis on your data, I bet you could easily adapt the script for your settings.


A few weeks ago, I asked on twitter/facebook:

Science people: anyone know of any data on how often reviewers check submitted software/scripts if they are available? Thanks.


I didn’t get an answer for the original question, but there was a bit of discussion and as far as I know nobody really checks code submitted along with papers (which is, by itself, not a enough of a reason to not demand code). However, if you were reviewing a paper and the supplemental material had the little script above, you could easily check it out and make sure the authors used the right settings and databases. The resulting code is inspectable and reviewable.

Why you should learn version control: a personal story

As part of software carpentry instructor training, we were asked to write a motivational story for learning one of the themes of the material. This is mine, which I posted on the software carpentry website:

This is my motivational story to use version control (I do not have yet a story of when I was not motivated to learn in mind, so I will post that one later).

A few years ago, I submitted a paper somewhere and it got accepted. In the meanwhile, a few months had passed and we were now asked to provide the camera ready versions.

In order to generate the higher quality version of the figures, I reran the script which did so, after changing just a few of the output parameters to get a high-resolution version. However, the resulting figure did not look like the one we had submitted! Not so different as to warrant a different conclusion, mind you, but you could see that there was a slight shift in the plots when you looked at them side-by-side.

First, I calmed reviewed the code to see if something obvious popped, then I worriedly reran the whole pipeline to re-generate intermediate results, and finally I started to panic. What was wrong? Had I submitted a paper with a result which I did not know how to reproduce?

Because I keep all my code under version, I rolled back the code to the version which had been available at the time of submission. Now, it regenerated the figure exactly as we had submitted it. Relief.

Using binary search (which git has built in), I was able to isolate the exact code change which caused the difference. It turned out to be a very minor change in the way that a certain computation is made, which was mathematically equivalent but not numerically equivalent (i.e., it would have been the same if computers had infinite precision, but because we round, we obtain different results). This meant that an almost arbitrary decision at one point of the algorithm was done differently and then the results shifted enough to be visible.

Thus, because of version control, I was both able to (1) reproduce the figure at the necessary higher resolution and (2) understand why the results had changed. There was much rejoicing.