The rise and fall of bioconda
A year ago, I remember a conversation which went basically like this:
Them: So, to distribute my package, what do you think I should use?
Me: You should use bioconda.
Them: OK, that’s interesting, but what about …?
Me: No, you should use bioconda.
Them: I will definitely look into it, but maybe it doesn’t fit my package and maybe I will just …
Me: No, you should use bioconda.
That was a year ago. Mind you, I knew there were some issues with conda, but it had a decent UX (user experience), and more importantly, the community was growing.
Since then, conda has hit scalability problems, which means that running it is increasingly frustrating. It is slow (an acknowledged issue, but I have had multiple instances of waiting 20 minutes for an error message, which didn't even help me solve the problem); mysterious errors are not uncommon; and things that used to work now fail (I have had this more and more recently).
Thus, I no longer recommend bioconda so enthusiastically. What once seemed like esoteric concerns about guaranteed correctness are now biting us.
The nix model
nix is a package manager (and the basis of the NixOS Linux distribution) with a focus on declarative, reproducible builds.
You write a little file (often called default.nix) which describes exactly what you want, and the environment is generated from this, exactly the same each time. It has a lot going for it in terms of potential for science:
- Can be reproducible to a fault (byte-for-byte reproducibility, almost).
- Declarative means that the best practice of storing your environment for later use is very easy to implement.
Unfortunately, the UX of nix is not great, and making the environments reproducible, although possible, is not trivial (although it is now much easier than it used to be). Nix is very powerful, but it uses a complicated domain-specific language and a semi-documented, ever-evolving set of build conventions, which makes it hard for even experienced users to use it directly. There is no way that I can recommend it for general use.
Stack is a tool for Haskell which uses the following concept for reproducible environments:
- The user specifies a list of packages that they want to use
- The user specifies a snapshot of the package directory.
The snapshot determines the versions of all of the packages, which automated testing has revealed to work together (at least up to the limits of unit testing). Furthermore, there is no need to say “version X.Y.Z of package A; version Q.R.S of package B,…”: you specify a single, globally encompassing version (note that this is one of the principles we adopted in NGLess, as we describe in the manuscript).
I really like this UX:
- Want to update all your packages? Just change this one number.
- Didn't work? Just change it back: you are back where you started. This is the big advantage of declarative approaches: what you did before does not matter, only the current state of the project.
- Want to recreate an environment? Just use this easy-to-read text file (for technical reasons, two files, but you get the drift).
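The snapshot idea can be illustrated with a small sketch. This is not how stack is implemented; the snapshot names, package names (A, B, C) and version numbers below are all made up for illustration. It only shows why a single snapshot identifier is enough to determine every version, and why changing it back restores the earlier state.

```python
# Hypothetical snapshots: each one maps every package to the single
# version that was tested together with the rest of that snapshot.
# All names and version numbers here are invented for illustration.
SNAPSHOTS = {
    "snapshot-1": {"A": "1.2.0", "B": "0.9.1", "C": "3.0.0"},
    "snapshot-2": {"A": "1.3.0", "B": "1.0.0", "C": "3.1.0"},
}


def resolve(snapshot, packages):
    """Return {package: version} for the requested packages.

    The user never states individual versions: the snapshot decides them.
    """
    versions = SNAPSHOTS[snapshot]
    missing = [p for p in packages if p not in versions]
    if missing:
        raise KeyError(f"not in {snapshot}: {missing}")
    return {p: versions[p] for p in packages}


# The whole project description: a snapshot plus a list of package names.
project = {"snapshot": "snapshot-1", "packages": ["A", "B"]}

before = resolve(project["snapshot"], project["packages"])

# Updating everything means changing one identifier...
project["snapshot"] = "snapshot-2"
after = resolve(project["snapshot"], project["packages"])

# ...and reverting means changing it back; history does not matter.
project["snapshot"] = "snapshot-1"
restored = resolve(project["snapshot"], project["packages"])

print(before)
print(after)
print(restored == before)
```

Because the result depends only on the current project description, not on the sequence of changes that led to it, the revert is exact.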
nixml is an afternoon hack, but the idea is to combine nix's power with stack's UX by allowing you to specify a set of packages in nix, using YAML.
For example, start with this
- lang: python
- lang: nix
Running
nixml shell
returns a shell with the packages listed. Running
nixml shell --pure
returns a shell with only the packages listed, so you can be sure to not rely on external packages.
Internally, this just creates a nix file and runs it, but it adds the following:
- it is always automatically pinned: you see the stable-19.03 thing? That means the versions of these packages that were available in the stable branch in March 2019.
- the syntax is simple, no need to know about python2.withPackages or any other nix internals like that. This means a loss of power for the user, but it will be a better trade-off 99% of the time.
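To make "creates a nix file" concrete, here is a minimal sketch of what such a generator could look like. It is not nixml's actual code: the function name, its parameters, and the revision placeholder are assumptions made for this example, and how a name like stable-19.03 would be mapped to a specific nixpkgs revision is left out entirely. The generated expression uses builtins.fetchTarball, pkgs.mkShell and withPackages, which are standard nix/nixpkgs features.

```python
def render_shell_nix(nixpkgs_rev, python_attr, python_packages, nix_packages):
    """Render a nix expression for a shell pinned to one nixpkgs revision.

    Hypothetical helper, not part of nixml. `nixpkgs_rev` is whatever
    commit the chosen snapshot stands for; resolving it is out of scope.
    """
    url = f"https://github.com/NixOS/nixpkgs/archive/{nixpkgs_rev}.tar.gz"
    py_list = " ".join(python_packages)
    extra = "".join(f"    pkgs.{name}\n" for name in nix_packages)
    return (
        "let\n"
        # Pinning: every package comes from this one nixpkgs tarball.
        f'  pkgs = import (builtins.fetchTarball "{url}") {{}};\n'
        "in pkgs.mkShell {\n"
        "  buildInputs = [\n"
        # The user only listed package names; the nix idiom is generated.
        f"    (pkgs.{python_attr}.withPackages (ps: with ps; [ {py_list} ]))\n"
        f"{extra}"
        "  ];\n"
        "}\n"
    )


# "REVISION-PLACEHOLDER" stands in for a real nixpkgs commit hash.
expr = render_shell_nix(
    "REVISION-PLACEHOLDER", "python2", ["numpy", "pandas"], ["vim"]
)
print(expr)
```

The point of the design is visible in the call at the bottom: the user supplies a snapshot and plain package names, and everything nix-specific stays inside the generator.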
This is very much a work in progress right now, but I am putting it out there as it is already usable for Python-based projects.