NGLess preprint is up

We have posted a preprint describing NG-meta-profiler and NGLess in general:

NG-meta-profiler: fast processing of metagenomes using NGLess, a domain-specific language, by Luis Pedro Coelho, Renato Alves, Paulo Monteiro, Jaime Huerta-Cepas, Ana Teresa Freitas, and Peer Bork

My initial goal was to develop a tool that (1) used a domain-specific language to describe computation and (2) was actually used in production. I did not want a proof of concept, as one of the major arguments for developing a domain-specific language (DSL) is that it is more usable than a traditional library in another language. As I am skeptical that you can fully evaluate how good a tool is without long-term, real-world usage, I wanted NGLess to be used in my day-to-day research.

NGLess has been a long time cooking, but it is now a tool that we use every day to produce real results. In that sense, at least, our objectives have been achieved.

Now, we hope that others find it as useful as we do.


Quick followups: NGLess benchmark & Notebooks as papers

A quick follow-up on two earlier posts:

We finalized the benchmark for ngless that I had discussed earlier:

As you can see, NGLess performs much better than either MOCAT or htseq-count. We tried to use featureCounts too, but that completely failed to produce results for some of the samples (we gave it a whopping 1TB of RAM, but it used it all up before crashing).

It also reveals that although ngless was developed in the context of our metagenomics work, it would also likely do well on the type of problems for which htseq-count is currently being used, in the domain of RNA-seq.

§

Earlier, I also wrote skeptically about the idea of replacing papers with Jupyter notebooks:

Is it even a good idea to have the presentation of the results mixed with their computation?

I do see the value in companion Jupyter notebooks for many cases, but as a replacement for the main paper, I am not even sure it is a good idea.

There is a lot of accidental complexity in code. A script that generates a publication plot may easily have 50 lines that do nothing more than set up the plot just right: (1) set up the subplots, (2) set x- and y-labels, (3) fix colours, (4) scale the points, (5) reset the font sizes, &c. What value is there in keeping all of this in the main presentation of the results?

The little script that generates the plot above is an excellent example of this. It is available online (on GitHub: plot-comparison.py). It’s over 100 lines and, even then, the final result required some minor aesthetic manipulations in Inkscape (so that, if you run it, the result is slightly uglier: in particular, the legend is absent).

Would it really add anything to the presentation of the manuscript to have those 100 lines of code be intermingled with the presentation of ngless as a metagenomics profiler?

In this case, I am confident that a Jupyter notebook would be worse than the current solution of a PDF as a main presentation with the data table and plotting scripts as supplemental material.

Why does natural evolution use genetic algorithms when it’s not a very good optimization method?

[Epistemic status: waiting for jobs to finish on the cluster, so I am writing down random ideas. Take it as speculation.]

Genetic algorithms are a pretty cool idea: when you have an objective function, you can optimize it by starting with some random guessing and then use mutations and cross-over to improve your guess. At each generation, you keep the most promising examples and eventually you will converge on a good solution. Just like evolution does.
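To make the recipe concrete, here is a minimal sketch of a genetic algorithm on bit strings. Everything in it is an illustrative assumption rather than anything from a real project: the function name `genetic_maximize`, the population size, the mutation rate, and the toy objective (count the ones in the string).

```python
import random


def genetic_maximize(fitness, n_genes=10, pop_size=30, generations=50,
                     mutation_rate=0.1, seed=0):
    """Toy genetic algorithm over bit strings (all parameters are illustrative)."""
    rng = random.Random(seed)
    # Start with random guesses.
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the most promising half of the population.
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            # Cross-over: splice two surviving parents at a random cut point.
            mother, father = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_genes)
            child = mother[:cut] + father[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = survivors + children
    return max(population, key=fitness)


# Toy objective: the number of ones in the string.
best = genetic_maximize(sum)
```

Because the survivors are carried over unchanged, the best score in the population can never get worse from one generation to the next; whether it gets better quickly is another matter.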

Unfortunately, in practice, this idea does not pan out: genetic algorithms are not that effective. In fact, I am not aware of any general problem area where they are considered the best option. For example, in machine learning, the best methods tend to be variations of stochastic gradient descent.

And, yet, evolution uses genetic algorithms. Why doesn’t evolution use stochastic gradient descent or something better than genetic algorithms?

What would evolutionary gradient descent even look like?

First of all, let’s assure ourselves that we are not just using vague analogies to pretend we have deep thoughts. Evolutionary gradient descent is at least conceptually possible.

To be able to do gradient descent, a bacterium reproducing would need two pieces of information to compare itself to its mother cell: (1) how does it differ in genotype and (2) how much better is it doing than its parent. Here is one possible implementation of this idea: (1) tag the genome where it differs from the mother cell (epigenetics!) and (2) have some memory of how fast it could grow. When reproducing, if we are performing better than our mother, then introduce more mutations in the regions where we differ from her.

Why don’t we see something like this in nature? Here are some possible answers:
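As a toy illustration of this bookkeeping, here is a rough sketch in Python. It is not real gradient descent and certainly not a biological model: the `divide` function, the dictionary layout, the mutation rates, and the `growth_rate` objective are all made up for the example. Each cell carries tags marking where it differs from its mother and a flag recording whether it grew faster than she did.

```python
import random


def divide(mother, fitness, rng, base_rate=0.05, boosted_rate=0.5, step=0.1):
    """One cell division in the speculative scheme (all values hypothetical).

    `mother` is a dict with:
      'genome'   - list of floats
      'tags'     - set of positions where she differs from her own mother
      'improved' - True if she grew faster than her own mother
    """
    genome, tags = [], set()
    for pos, gene in enumerate(mother["genome"]):
        # Mutate more at the tagged positions, but only if those changes paid off.
        boosted = mother["improved"] and pos in mother["tags"]
        rate = boosted_rate if boosted else base_rate
        if rng.random() < rate:
            gene += rng.gauss(0, step)
            tags.add(pos)  # remember where the daughter differs from her mother
        genome.append(gene)
    # The "memory of how fast it could grow": compare daughter and mother.
    improved = fitness(genome) > fitness(mother["genome"])
    return {"genome": genome, "tags": tags, "improved": improved}


def growth_rate(genome):
    # Hypothetical objective: genomes closer to all-zeros grow faster.
    return -sum(g * g for g in genome)


rng = random.Random(0)
cell = {"genome": [1.0] * 5, "tags": set(), "improved": False}
daughter = divide(cell, growth_rate, rng)
```

Note that this only amplifies the amount of change in regions that recently helped; it has no notion of direction, which is one way in which it falls short of an actual gradient.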

Almost all mutations are deleterious

Almost always (viruses might be an exception), higher mutation rates are bad. Even in a controlled way (just the regions that seem to matter), adding more mutations will make things worse rather than better.

The signal is very noisy

Survival or growth is a very noisy signal of how good a genome is. Maybe we just got luckier than our parents in being born at a time of glucose plenty. If the environment is not stable, overreacting to a single observation may be the wrong thing to do.

The relationship between phenotype and genotype is very tenuous

What we’d really like to do is something like “well, it seems that in this environment, it is a good idea if membranes are more flexible, so I will mutate membrane-stiffness more”. However, the relationship between membrane stiffness and the genotype is complex. There is no simple “mutate membrane-stiffness” option for a bacterium. Epistatic effects are killers for simple ideas like this one.

On the other hand, the relationship between any particular weight in a deep CNN and the output is very weak. Yet, gradient descent still works there.

The cost of maintaining the information for gradient descent is too high

Perhaps, it’s just not worth keeping all this accounting information. Especially because it’s going to be yet another layer of noise.

Maybe there are examples of natural gradient descent, we just haven’t found them yet

There are areas of genomes that are more recombination prone than others (and somatic mutation in the immune system is certainly a mechanism of controlled chaos). Viruses may be another entity where some sort of gradient descent could be found. Maybe plants with their weird genomes are using all those extra copies to transmit information across generations like this.

As I said, this is a post of random speculation while I wait for my jobs to finish….

The Scientific Paper of the Future is Probably a PDF

I do not mean to say the scientific paper of the future should be a PDF, I just mean that it will most likely be a PDF or some PDF-derived format. By future, I mean around 2040 (so, in 20-25 years).

I just read James Somers in the Atlantic, arguing that The Scientific Paper Is Obsolete (Here’s what’s next). In that article, he touts Mathematica notebooks as a model of what should be done and Jupyter as the current embodiment of this concept.

I will note that Mathematica came out in 1988 (a good 5 years before the PDF format) and has so far failed to take the world by storm (the article claims that “the program soon became as ubiquitous as Microsoft Word”, a claim which is really hard to reconcile with reality). Perhaps Mathematica was held back because it’s expensive and closed source (but so is Microsoft Word, and Word has taken the world by storm).

How long did it take to get to HTML papers?

For a very long time, the future of the scientific paper was going to be some smart version of HTML. We did eventually get to the point where most journals have decent HTML versions of their papers, but it’s mostly dumb HTML.

As far as I can tell, none of the ideas of having a semantically annotated paper panned out. About 10 years ago, the semantic web was going to revolutionize science. That didn’t happen and it’s even been a while since I heard someone arguing that that would be the future of the scientific paper.

Tools like ReadCube or Paperpile still parse the PDFs and try to infer what’s going on instead of relying on fancy semantic annotations.

What about future proofing the system?

About a week ago, I tweeted:

This is about a paper which is now in press. It’s embargoed, but I’ll post about it when it comes out in 2 weeks.

I have complained before about the lack of backwards compatibility in the Python ecosystem. I can open and print a PDF from 20 years ago (or a PostScript file from the early 1980s) without any issues, but I have trouble running a notebook from last year.

At this point, someone will say docker! and, yes, I can build a docker image (or virtual machine) with all my dependencies and freeze that, but who can commit to hosting/running these over a long period? What about the fact that even tech-savvy people struggle to keep all these things properly organized? I can barely get co-authors to move beyond the “let’s email Word files back and forth.”

With less technical co-authors, can you really imagine them downloading a docker container and properly mounting all the filesystems with OverlayFS to send me back edits? Sure, there are a bunch of cool startups with nicer interfaces, but will they be here in 2 years (let alone 20)?

Is it even a good idea to have the presentation of the results mixed with their computation?

I do see the value in companion Jupyter notebooks for many cases, but as a replacement for the main paper, I am not even sure it is a good idea.

There is a lot of accidental complexity in code. A script that generates a publication plot may easily have 50 lines that do nothing more than set up the plot just right: (1) set up the subplots, (2) set x- and y-labels, (3) fix colours, (4) scale the points, (5) reset the font sizes, &c. What value is there in keeping all of this in the main presentation of the results?

Similarly, all the file paths and the 5 arguments you need to pass to pandas.read_table to read the data correctly: why should we care when we are just trying to get the gist of the results? One of our goals in NGLess is to try to separate some of this accidental complexity from the main processing pipeline, but this also limits what we can do with it (this is the tradeoff, it’s a domain specific tool; it’s hard to achieve the same with a general purpose tool like Jupyter/Python).

I do really like Jupyter for tutorials, as the mix of code and text is a good fit. I will work to make sure I have something good for the students, but I don’t particularly enjoy working with the notebook interface, so I need to be convinced before I jump on the bandwagon more generally.

§

I actually do think that the future will eventually be some sort of smarter thing than simulated paper, but I also think that (1) it will take much longer than 20 years and (2) it probably won’t be Jupyter getting us there. It’s a neat tool for many things, but it’s not a PDF killer.

Bug-for-bug backwards compatibility in NGLess

Recently, I found a bug in NGLess. In some rare conditions, it would mess up and reads could be lost. Obviously, I fixed it.

If you’ve used NGLess before (or read about it), you’ll know that every ngless script starts with a version declaration:

ngless "x.y"

This indicates which version of NGLess should be running the code. Since the bug changed the results, I needed to make a new version (we are now at version 0.8).

The question is what should NGLess do when it runs a script that uses an older version declaration? I see three options:

1. Silently update everyone to the new behavior

This is the typical software behavior: the new system is better, why wouldn’t you want to upgrade? Because we’d be breaking our promise to make ngless reproducible. The whole point of having the version line is to ensure that you will always get the same results. We also don’t want to make people afraid of upgrading.

2. Refuse to run older scripts and force everyone to upgrade

This is another option: we could just refuse to run old code. Now, at the very least, there would be no silent changes. It’s still possible to install older versions (and bioconda/biocontainers makes this easy), so if you really needed to, you could still run the older scripts.

3. Emulate the old (buggy) behavior when the user requests the old versions

In the end, I went with this option.

The old behavior is not that awful. Some reads are handled completely wrong, but the reason why the bug was able to persist for so long is that it only shows up in a few reads in a million. Thus, while this means that NGLess will sometimes knowingly output results that are suboptimal, I found it the best solution. A warning is printed, asking the user to upgrade.
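The logic of option 3 can be sketched in a few lines. This is a hypothetical illustration in Python, not the actual NGLess code: the names `FIXED_IN`, `parse_version` and `select_behavior` are made up for the example, and it assumes that 0.8 is the first version with the corrected behavior.

```python
import warnings

# Assumption for this sketch: 0.8 is the first version with the fix.
FIXED_IN = (0, 8)


def parse_version(declared):
    """Turn a declaration such as "0.8" into a comparable tuple, (0, 8)."""
    return tuple(int(part) for part in declared.split("."))


def select_behavior(declared):
    """Pick the read-handling mode that matches the declared version."""
    if parse_version(declared) < FIXED_IN:
        # Old declaration: reproduce the old (buggy) results, but warn.
        warnings.warn(
            f'script declares version "{declared}": emulating the old '
            "behavior; please consider upgrading the version declaration"
        )
        return "legacy"
    return "fixed"


mode = select_behavior("0.8")
```

Comparing tuples of integers rather than strings matters here: as strings, "0.10" would sort before "0.8".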

Bray-Curtis dissimilarity on relative abundance data is the Manhattan distance (aka L₁ distance)

Warning: super-technical post ahead, but I have made this point in oral discussions at least a few times, so I thought I would write it up. It is a trivial algebraic manipulation, but because “ℓ₁ norm” sounds too mathy for ecologists while “Bray-Curtis” sounds too ecological and ad-hoc for mathematically minded people, it’s good to see that it’s the same thing on normalized data.

Assuming you have two feature vectors, Xᵢ, Xⱼ, if they have been normalized to sum to 1, then the Bray-Curtis dissimilarity is just their ℓ₁ distance, aka Manhattan distance (times ½, which is a natural normalization so that the result is between zero and one).

This is the Wikipedia definition of the Bray-Curtis dissimilarity (there are a few other, equivalent, definitions around, but we’ll use this one):

BC = 1 – 2 Cᵢⱼ/(Sᵢ + Sⱼ), where Cᵢⱼ = Σₖmin(Xᵢₖ, Xⱼₖ) and Sᵢ = ΣₖXᵢₖ.

While the Manhattan distance is D₁ = Σₖ|Xᵢₖ – Xⱼₖ|

We are assuming that they sum to 1, so Sᵢ=Sⱼ=1. Thus,

BC = 1 – Σₖmin(Xᵢₖ, Xⱼₖ)

Now, this still does not look like the Manhattan distance (D₁, above). But for any a and b ≥0, it holds that

min(a,b) = (a + b)/2 – |a – b|/2

(this is easiest to see graphically: start at the midpoint, (a+b)/2, and subtract half of the difference, |a–b|/2).

Thus, BC = 1 – Σₖmin(Xᵢₖ, Xⱼₖ) = 1 – Σₖ{(Xᵢₖ + Xⱼₖ)/2 – |Xᵢₖ – Xⱼₖ|/2}

Because we assumed that the vectors were normalized, Σₖ(Xᵢₖ + Xⱼₖ)/2 = (ΣₖXᵢₖ + ΣₖXⱼₖ)/2 = 1, so

BC = 1 – 1 + Σₖ|Xᵢₖ – Xⱼₖ|/2 = D₁/2.
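For those who prefer to check algebra numerically, here is a small sketch in plain Python. The function names and the two example vectors are made up for the illustration; the formulas are the Bray-Curtis and Manhattan definitions given above.

```python
import math


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: 1 - 2*C/(S_x + S_y)."""
    c = sum(min(a, b) for a, b in zip(x, y))
    return 1 - 2 * c / (sum(x) + sum(y))


def manhattan(x, y):
    """Manhattan (L1) distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))


def normalize(v):
    """Convert raw counts to relative abundances (sums to 1)."""
    total = sum(v)
    return [a / total for a in v]


# Two made-up count vectors, converted to relative abundances.
x = normalize([10, 0, 5, 5])   # [0.5, 0.0, 0.25, 0.25]
y = normalize([2, 6, 0, 2])    # [0.2, 0.6, 0.0, 0.2]

# On normalized data, the two quantities agree (both are 0.6 here).
assert math.isclose(bray_curtis(x, y), manhattan(x, y) / 2)
```

On the raw (unnormalized) counts, the two numbers generally differ, which is why the normalization assumption matters.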

Against Science Communication

Science communication, science outreach, is widely seen as a good thing, perhaps even a necessary one (Scientists: do outreach or your science dies). Let me disagree: For the most part, cutting-edge science should not be communicated to the public in mass media and doing so harms both the public and science.

At the end of the post, I provide some alternatives to the state of the world.

What is wrong with science communication?

For all the complaints that the public has lost faith in science, I often anecdotally find that members of the public have more faith in scientific results than most scientists. Most scientists know to take any paper with a grain of salt. The public does not always realize that: they don’t have the time (or the training) to dig into the paper and are left with a clickbait headline, which they must either take at face value or else reject “scientific results” altogether.

Most of the time, science communication has to simplify the underlying science so much as to be basically meaningless. Most science doesn’t make any sense even to scientists working in adjacent fields: We publish a new method which is important to predict the structure of a protein that is important because other studies have shown that it is important in a pathway that other studies have shown is active in response to … and so on. At the very bottom, we have things that the public cares about (and which is why we get funded), but the relationships are not trivial and we should stop pretending otherwise.

When I thumb through the latest Nature edition, only a minority of titles are meaningful to me. Sometimes, I genuinely have no idea what they are talking about (One-pot growth of two-dimensional lateral heterostructures via sequential edge-epitaxy); other times, I get the message (Structure of the glucagon receptor in complex with a glucagon analogue) but I don’t have the context of that field to understand why this is important. To pretend that we can explain this to a member of the public in 500 words (or 90 seconds on radio/TV) is idiotic. Instead, we explain a butchered version.

Science outreach harms the public. Look at this (admittedly, not so recent) story: The pill is linked to depression – and doctors can no longer ignore it (The Guardian, one of my favorite newspapers). The study was potentially interesting, but in no way conclusive: it’s observational with obvious confounders: the population of women who start taking the pill at a young age is not the same as the population that starts taking it later. However, reading any of the panicked articles (Guardian again, the BBC, Quartz: this was not just the Oprah Winfrey Show) may lead some women to stop taking the pill for no good reason, which is not a positive for them. (To be fair, some news outlets did publish a skeptical view of it, e.g., The Spectator, Jezebel).

Science outreach discredits science. The public now has what amounts to a healthy skepticism of any nutritional result, given the myriad “X food gives you cancer” or “Chocolate makes you thin” stories. If you think that there are no spillovers from publicizing shaky results into other fields of science (say, climate science), then you are being naïve.

Valuing science outreach encourages bad science. If anything, the scientific process is already too tipped towards chasing the jazzy new finding instead of solid research. It’s the worst papers that make the news, let’s not reward that. The TED-talk giving, book writing, star of pop psychology turned out to be a charlatan. Let’s not reward the same silliness in other fields.

What is the alternative?

The alternative is very simple: instead of communicating iffy, cutting-edge stuff, communicate settled science. Avoid almost all recent papers and focus on established results and bodies of literature. Books that summarize a body of work can do a good job.

But this is not news! Frankly, who cares? Journalists do, but I think the public is happy to read non-news stuff. In fact, most people find that type of reporting more interesting than the flashy study of the week.

Or use opportunities to tie it to current events. For example, when major prizes are given out (like the Nobel prize, but also the tiers below it), publish your in-depth reporting on the topics. Since often the topics are predictable, you can prepare your reporting carefully and publish it when ready. You don’t even need the topic/person to actually win. In the weeks prior to Nobel Prize week, you can easily put up a few articles on “the contenders and their achievements”.

Finally, let me remark that, for the members of the public who do wish to be informed of cutting-edge science and are willing (perhaps even eager) to put in the mental work of understanding it, there are excellent resources out there. For example, I regularly listen to very good science podcasts on the microbe.tv website. They do a good job of going through the cutting edge stuff in an informed way: they take 30 minutes to go through a paper, not 30 seconds. They won’t ever be mainstream, but that’s the way it should be.