How Notebooks Should Work

Joel Grus’ presentation on why he does not like notebooks sparked a flurry of notebook-related discussion.

I like the idea of notebooks more than I like actual notebooks. I tried to use them in my analyses for a long time, but eventually gave up as there are too many small annoyances (some that the talk goes over, others that it does not, such as the fact that they do not integrate well with git).

Here is how I think they should work instead:

  1. There is no hidden state. Cells are always run from top to bottom.
  2. If you change a cell in the middle, you immediately clear its output and all those below and the whole thing is run from the top.

For example:

[1] : Code
Output

[2] : Code
Output

[3] : Code
Output

[4] : Code
Output

[5] : Code
Output

Now, if you edit Cell 3, you would get:

[1] : Code
Output

[2] : Code
Output

[3] : New Code
New Output

[ ] : Code

[ ] : Code

If you want, you can run the whole thing now and get the full output:

[1] : Code
Output

[2] : Code
Output

[3] : New Code
New Output

[4] : Code
New Output

[5] : Code
New Output

This way, the whole notebook is always up to date.
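
As a rough illustration of the two rules above, here is a minimal Python sketch. The `Notebook` class, its method names, and the convention that a cell's output is whatever it assigns to `out` are all made up for this example; none of this is how Jupyter works.

```python
class Notebook:
    """Toy notebook: no hidden state, cells always run from the top."""

    def __init__(self, sources):
        self.sources = list(sources)
        self.outputs = [None] * len(self.sources)  # None marks a cleared cell

    def run_through(self, last):
        """Run cells 0..last in a fresh namespace and record their outputs."""
        namespace = {}
        for i in range(last + 1):
            exec(self.sources[i], namespace)
            # Convention of this sketch: a cell's output is its `out` variable.
            self.outputs[i] = namespace.pop("out", None)

    def run_all(self):
        self.run_through(len(self.sources) - 1)
        return list(self.outputs)

    def edit(self, index, new_source):
        """Change a cell, clear it and everything below, re-run up to it."""
        self.sources[index] = new_source
        for i in range(index, len(self.outputs)):
            self.outputs[i] = None
        self.run_through(index)
        return list(self.outputs)


nb = Notebook(["x = 1; out = x", "y = x + 1; out = y", "z = y * 10; out = z"])
print(nb.run_all())                      # [1, 2, 20]
print(nb.edit(1, "y = x + 5; out = y"))  # [1, 6, None]
print(nb.run_all())                      # [1, 6, 60]
```

Because every run starts from an empty namespace, an output can never depend on a cell that has since been edited or deleted.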

But won’t this be incredibly slow if you always have to run it from the top?

Yes, if you implement it naïvely, so that the kernel really does re-run everything from the top; that is not likely to be usable. But you could do a bit of smart caching and keep a few live kernels in intermediate states. Then, if you edit cell number 35, the notebook does not need to go back to the first cell: maybe there is a cached kernel that holds the state after cell 30, and only cells 31 onwards would need to be rerun.
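
The caching idea can be sketched in a few lines of Python. This is only a toy under strong assumptions: the `CheckpointedNotebook` class is hypothetical, and `copy.deepcopy` of a plain namespace stands in for "keeping a live kernel in an intermediate state", which would be much harder for a real kernel (open files, modules, and GPU memory cannot be deep-copied).

```python
import copy


class CheckpointedNotebook:
    def __init__(self, sources, every=2):
        self.sources = list(sources)
        self.every = every            # snapshot the state every `every` cells
        self.checkpoints = {0: {}}    # cells executed so far -> saved state
        self.executed = []            # log of the cells that were really run

    @staticmethod
    def _snapshot(namespace):
        # Drop the builtins entry that exec() adds; exec() restores it later.
        state = {k: v for k, v in namespace.items() if k != "__builtins__"}
        return copy.deepcopy(state)

    def run(self, edited=0):
        """Re-run after cell `edited` changed, from the nearest checkpoint."""
        # Snapshots taken after the edited cell ran are stale: drop them.
        self.checkpoints = {n: s for n, s in self.checkpoints.items()
                            if n <= edited}
        start = max(self.checkpoints)
        namespace = copy.deepcopy(self.checkpoints[start])
        for i in range(start, len(self.sources)):
            exec(self.sources[i], namespace)
            self.executed.append(i)
            if (i + 1) % self.every == 0:
                self.checkpoints[i + 1] = self._snapshot(namespace)
        return namespace

    def edit(self, index, new_source):
        self.sources[index] = new_source
        self.executed.clear()
        return self.run(edited=index)


nb = CheckpointedNotebook(
    ["a = 1", "b = a + 1", "c = b + 1", "d = c + 1", "e = d + 1"])
nb.run()
state = nb.edit(3, "d = c + 10")
print(nb.executed)   # [2, 3, 4]: cells 0 and 1 came from the checkpoint
print(state["e"])    # 14
```

From the user's side, the result is the same as re-running from the top; the checkpoints only change how much work is actually done.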

It would take a lot of engineering and it may even be impossible with the current structure of Jupyter kernels, but, from a human point of view, I think this would be a better user experience.


The European Court of Justice’s decision that CRISPR’d plants are GMOs is the right interpretation of a law that is bonkers

The European Court of Justice (think of it as the European Supreme Court) declared that CRISPR’d plants count as GMOs.

I think the Court is correct: CRISPR’d plants are GMOs. The EU does not have a tradition of “legislation by judicial decision” like the US’s Supreme Court (although there have been some instances of such, as in the Uber case). Thus, even though I wish the decision had gone the other way as a matter of legislation, as a matter of legal interpretation, it seems clear that the intent of the law was to ban modern biotechnology as scary, and I don’t see how CRISPR does not fill that role.

The decision is scientifically bonkers, in that it says that older atomic gardening plants are kosher, but the exact same organism would be illegal if it were to be obtained by bioengineering methods. According to this decision, you can use CRISPR to obtain and test a mutation. At this point, it’s a GMO, so you cannot sell it in most of Europe. Then you use atomic gardening, PCR, and cross-breeding and you obtain exactly the same genotype. However, now, it’s not a GMO, so it’s fine to sell it. Being a GMO is not a property of the plant, but a property of its history. Some plants may carry markers which will identify them as GMOs, but there may be many cases where you can have two identical plants, only one of which is a GMO. This has irked some scientists (see this NYT article), but frankly, it is the original GMO law that is bonkers in that it regulates a method of how to obtain a plant instead of regulating the end result.

On this one, blame the lawmakers, not the court.

HT/ @PhilippBayer

NGLess preprint is up

We have posted a preprint describing NG-meta-profiler and NGLess in general:

NG-meta-profiler: fast processing of metagenomes using NGLess, a domain-specific language. Luis Pedro Coelho, Renato Alves, Paulo Monteiro, Jaime Huerta-Cepas, Ana Teresa Freitas, Peer Bork

My initial goal was to develop a tool that (1) used a domain-specific language to describe computation and (2) was actually used in production. I did not want a proof-of-concept, as one of the major arguments for developing a domain-specific language (DSL) is that it is more usable than just writing a traditional library in another language. As I am skeptical that you can fully evaluate how good a tool is without long-term, real-world usage, I wanted NGLess to be used in my day-to-day research.

NGLess has been a long time cooking but is now a tool that we use every day to produce real results. In that sense, at least, our objectives have been achieved.

Now, we hope that others find it as useful as we do.

Why NGLess took so long to become a robust tool (but now IS a robust tool)

Titus Brown posted that good research software takes 2-3 years to produce. As we are close to submitting a manuscript for our own NGLess, which took a bit longer than that, I will add some examples of why it took so long to get to this stage.

There is a component of why it took so long that is due to people issues and to the fact that NGLess was mostly developed as we needed to process real data (and, while I was working on other projects, rather than on NGLess). But even if this had been someone’s full time project, it would have taken a long time to get to where it is today.

It does not take so long because there are so many Big Ideas in there (I wish). NGLess contains just one Big Idea: a domain-specific language that results in a tool that is not just a proof of concept but is a better tool because it uses a DSL; everything else follows from that.

Rather, what takes a long time is to find all the weird corner cases. Most of these are issues the majority of users will never encounter, but collectively they make the tool so much more robust. Here are some examples:

  • Around Feb 2017, a user reported that some samples would crash ngless. The user did not seem to be doing anything wrong, but half-way through the processing, memory usage would start growing until the interpreter crashed. It took me the better part of two days to realize that their input files were malformed: they consisted of a few million well-formed reads, then a multi-Gigabyte long series of zero Bytes. Their input FastQs were, in effect, a gzip bomb.

    There is a kind of open source developer that would reply to this situation by saying well, knuckle-head, don’t feed my perfect software your crappy data, but this is not the NGLess way (whose goal is to minimize the effort of real-life people), so we considered this a bug in NGLess and fixed it so that it now (correctly) complains of malformed input and exits.

  • Recently, we realized that if you use the motus module in a system with a badly working locale, ngless could crash. The reason is that, when using that module, we print out a reference for the paper, which includes some authors with non-ASCII characters in their names. Because of some weird combination of the Haskell runtime system and libiconv (which seems to generally be a mess), it crashes if the locale is not installed correctly.

    Again, there is a kind of developer who would respond to this by well, fix your locale installation, knuckle-head, but we added a workaround.

  • When I taught the first ngless workshop in late 2017, I realized that one of the inconsistencies in the language was causing a lot of confusion for the learners. So, the next release fixed that issue.
  • There are two variants of FastQ files, depending on whether the qualities are encoded by adding 33 or 64. It is generally trivial to infer which one is being used, though, so NGLess heuristically does so. In Feb 2017, a user reported that the heuristics were failing on one particular (well-formed) example, so we improved the heuristics.
  • There are 25 commits which say they produce “better error messages”. Most of these resulted from a confused debugging session.
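
To make the quality-encoding item in the list above concrete, here is a minimal Python sketch of one way to guess the offset. It is not the heuristic that NGLess uses (NGLess is written in Haskell, and its rules are more careful); the function names and the threshold are choices made for this example. The idea is that a quality character below ASCII 59 cannot occur in the +64 variants, so seeing one settles the question; otherwise the sketch assumes +64.

```python
def fastq_quality_lines(fastq_text):
    """Return the quality line of every 4-line FastQ record."""
    return fastq_text.splitlines()[3::4]


def guess_quality_offset(quality_lines):
    """Guess whether qualities are encoded with offset 33 or 64."""
    lowest = min((ord(ch) for line in quality_lines for ch in line),
                 default=None)
    if lowest is None:
        raise ValueError("no quality values to inspect")
    # Characters below ';' (ASCII 59) only appear with the +33 encoding.
    return 33 if lowest < 59 else 64


modern = "@read1\nACGT\n+\n!+5I\n"
older = "@read1\nACGT\n+\nhhhf\n"
print(guess_quality_offset(fastq_quality_lines(modern)))  # 33
print(guess_quality_offset(fastq_quality_lines(older)))   # 64
```

A rule this simple misclassifies a +33 file in which every base has a quality of 26 or higher, which is exactly the kind of well-formed corner case that only shows up once real users feed the tool real data.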

None of these issues took that long to fix, but they only emerge through a prolonged beta use period.

You need users to try all types of bad input files, you need to try to teach the tool to understand where the pain points for new users are, you need someone to try it out in a system with a mis-installed locale, &c.

One possible conclusion is that for certain kinds of scientific software, it is actually better if it is done as a side-project: you can keep publishing other stuff, you can apply it on several problems, and the long gestation period catches all these minor issues, even while you are being productive elsewhere. (This was also true of Jug: it was never really a project per se, but after a long time it became usable and its own paper).

Quick followups: NGLess benchmark & Notebooks as papers

A quick follow-up on two earlier posts:

We finalized the benchmark for ngless that I had discussed earlier:

As you can see, NGLess performs much better than either MOCAT or htseq-count. We tried to use featureCounts too, but that completely failed to produce results for some of the samples (we gave it a whopping 1TB of RAM, but it used it all up before crashing).

It also reveals that although ngless was developed in the context of our metagenomics work, it would also likely do well on the type of problems for which htseq-count is currently being used, in the domain of RNA-seq.

§

Earlier, I also wrote skeptically about the idea of replacing papers with Jupyter notebooks:

Is it even a good idea to have the presentation of the results mixed with their computation?

I do see the value in companion Jupyter notebooks for many cases, but as a replacement for the main paper, I am not even sure it is a good idea.

There is a lot of accidental complexity in code. A script that generates a publication plot may easily have 50 lines that do nothing more than set up the plot just right: (1) set up the subplots, (2) set x- and y-labels, (3) fix colours, (4) scale the points, (5) reset the font sizes, &c. What value is there in keeping all of this in the main presentation of the results?

The little script that generates the plot above is an excellent example of this. It is available online (on github: plot-comparison.py). It’s over 100 lines and, even then, the final result required some minor aesthetic manipulations in inkscape (so that, if you run it, the result is slightly uglier: in particular, the legend is absent).

Would it really add anything to the presentation of the manuscript to have those 100 lines of code be intermingled with the presentation of ngless as a metagenomics profiler?

In this case, I am confident that a Jupyter notebook would be worse than the current solution of a PDF as a main presentation with the data table and plotting scripts as supplemental material.

Journal subscriptions are negotiated, but article processing charges are fixed prices

Scientific publishing is moving to open access. This means that it’s moving from a subscription, reader-pays, to an author-pays model (normally termed article processing charges, or APCs).

This will have a couple of impacts:

1. Institutions which publish more and read less lose to institutions which read more and publish less. Thus, research universities will probably lose out to the pharmaceutical industry (as I’m sure that the pharmaceutical industry is proportionally reading a lot of papers, but not publishing as much).

This is not a big deal, but I thought I would mention it. Some people seem to be very angry at pharmaceutical companies all the time for not paying their fair share, but it’s a tough business and part of the point of publicly funded research is to enable downstream users (like pharmaceuticals) to flourish. Moving to APCs seems to be another move in supporting pharma (probably smaller biotech upstarts being the biggest beneficiary). Teaching-focused universities will also benefit.

2. More importantly, though, the move to APCs (article processing charges) instead of subscriptions is also a move from a product that is sold at a variable price to one that is bought at a fixed price.

Variable pricing (or price discrimination) is a natural feature of many of the markets that look like subscriptions, where fixed costs are more important than marginal costs (in this case, the extra cost of allowing access to the journal once the journal is done is, basically, zero).

Plane tickets and hotel rooms are less extreme cases where the price will fluctuate to attempt to charge more to those willing to pay more (typically, the wealthier; but sometimes the people willing to pay more and those counting their pennies are the same people, just in different situations: sometimes I really want a specific flight, other times I don’t even care exactly where I am going).

So, in the subscription model, some institutions will pay more. Maybe it’s not fair, but hedge funds such as Harvard will get bilked, while poorer institutions will be able to negotiate down. In the APC model, everyone will pay roughly the same (there may be some discounts for bulk purchasing, but not the variation you have today and it may even be the large institutions which will negotiate down, while the smaller players will pay full price).

Many publishers have policies to favor publications from very poor countries, charging them lower APCs. Naturally this is a good policy, but it will not be fine-grained. Universities in Bulgaria (EU member-state with a GDP per capita of 7,350 USD) are considered as wealthy as private American research universities. They will be expected to pay the same for their PLOS Biology papers.

NGLess timing benchmarks

As part of finalizing a manuscript on NGLess, we have run some basic timing benchmarks comparing NGLess to MOCAT2 (our previous tool) and another alternative for profiling a community based on a gene catalog, namely htseq-count.

The task being profiled is that performed in the NGLess tutorials for the human gut and the ocean: 3 metagenomes are functionally profiled by using a gene catalog as a reference. The time reported is for completing all 3 samples (repeated 3 times to get some variability measure).
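
The "repeat 3 times and report the variability" part of this setup is easy to sketch in Python. This is not the benchmark code (that lives in the repository linked at the end of this post); `time_task` is a hypothetical helper, and the lambda below is a stand-in for running a real pipeline on the three samples.

```python
import statistics
import time


def time_task(task, repeats=3):
    """Run `task` several times; return mean and standard deviation (seconds)."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        task()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)


# Stand-in workload; a real benchmark would launch the profiling pipeline here.
mean_s, sd_s = time_task(lambda: sum(range(100_000)))
print(f"{mean_s:.4f}s +/- {sd_s:.4f}s over 3 runs")
```

With only three repeats the standard deviation is a rough indication of the noise, not a precise estimate.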

The results are that NGLess is overall much faster than the alternatives (note that the Y-axis measures the number of seconds in log-scale). For the gut dataset, MOCAT takes 2.5x longer, while for the ocean (tara) one, it takes 4x longer.

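[Figure: run time in seconds (log-scale) for NGLess, MOCAT2, and htseq-count on the gut and ocean datasets; file: ngless-mocat-htseq-count-compare.2.svg.png]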

The Full column contains the result of running the whole pipeline, where it is clear that NGLess is much faster than MOCAT2. The other elements are in MOCAT nomenclature:

  • ReadTrimFilter: preprocessing the FastQ files
  • Screen: mapping to the catalog
  • Filter: postprocessing the BAM files
  • Profile: generating feature counts from the BAM files

Htseq-count works well even for this setting, which is outside of its original domain (it was designed for RNA-seq, where you have thousands of genes, as opposed to metagenomics, where millions are common). NGLess is still much faster, though.

Note too that for MOCAT, the time it takes for the Full step is simply the addition of the other steps, but in the case of NGLess, when running a complete pipeline, the interpreter can save time.

The htseq-count benchmark is still running, so final results will only be available next week.

I also tried to profile using featureCounts (website), but that tool crashed after using up 800GB of RAM. I might still try it on the larger machines (2TiB of RAM), but it seems pointless.

The scripts and preprocessed data for this benchmark are at https://github.com/BigDataBiology/ngless2018benchmark