Bray-Curtis dissimilarity on relative abundance data is the Manhattan distance (aka L₁ distance)

Warning: super-technical post ahead, but I have made this point in oral discussions at least a few times, so I thought I would write it up. It is a trivial algebraic manipulation, but because “ℓ₁ norm” sounds too mathy for ecologists while “Bray-curtis” sounds too ecological and ad-hoc for mathematically minded people, it’s good to see that it’s the same thing on normalized data.

Assuming you have two feature vectors, Xᵢ, Xⱼ, if they have been normalized to sum to 1, then the Bray-Curtis dissimilarity is just their ℓ₁ distance, aka Manhattan distance (times ½, which is a natural normalization so that the result is between zero and one).

This is the Wikipedia definition of the Bray-Curtis dissimiliarity (there are a few other, equivalent, definitions around, but we’ll use this one):

BC = 1 – 2 Cᵢⱼ/(Sᵢ + Sⱼ), where Cᵢⱼ =Σₖmin(Xᵢₖ, Xⱼₖ)  and Sᵢ = ΣₖXᵢₖ.

While the Manhattan distance is D₁ = Σₖ|Xᵢₖ – Xⱼₖ|

We are assuming that they sum to 1, so Sᵢ=Sⱼ=1. Thus,

BC = 1 – Σₖmin(Xᵢₖ, Xⱼₖ)

Now, this still does not look like the Manhattan distance (D₁, above). But for any a and b ≥0, it holds that

min(a,b) = (a + b)/2 – |a – b|/2

(this is easiest to see graphically:  start at the midpoint, (a+b)/2 and subtract half of the difference, |a-b|/2).

Thus, BC = 1 -Σₖmin(Xᵢₖ, Xⱼₖ) = 1 – Σₖ{(Xᵢₖ + Xⱼₖ)/2 – |Xᵢₖ + Xⱼₖ|/2}

Because, we assumed that the vectors were normalized,  Σₖ(Xᵢₖ + Xⱼₖ)/2 =(ΣₖXᵢₖ +ΣₖXⱼₖ)/2 = 1, so

BC = 1 – 1 + Σₖ|Xᵢₖ + Xⱼₖ|/2 = D₁/2.


Against Science Communication

Science communication, science outreach, is widely seen as a good thing, perhaps even a necessary one (Scientists: do outreach or your science dies). Let me disagree: For the most part, cutting-edge science should not be communicated to the public in mass media and doing so harms both the public and science.

At the end of the post, I provide some alternatives to the state of the world.

What is wrong with science communication?

For all the complaints that the public has lost faith in science, I often anecdotally find that members of the public have more faith in scientific results than most scientists. Most scientists know to take any paper with a grain of salt. The public does not always realize that, they don’t have the time (or the training) to dig into the paper and are left with a click bait headline which they need to take at face value or reject “scientific results”.

Most of the time, science communication has to simplify the underlying science so much as to be basically meaningless. Most science doesn’t make any sense even to scientists working in adjacent fields: We publish a new method which is important to predict the structure of a protein that is important because other studies have shown that it is important in a pathway that other studies have shown is active in response to … and so on. At the very bottom, we have things that the public cares about (and which is why we get funded), but the relationships are not trivial and we should stop pretending otherwise.

When I thumb through the latest Nature edition, only a minority of titles are meaningful to me. Sometimes, I genuinely have no idea what they are talking about (One-pot growth of two-dimensional lateral heterostructures via sequential edge-epitaxy); other times, I get the message (Structure of the glucagon receptor in complex with a glucagon analogue) but I don’t have the context of that field to understand why this is important. To pretend that we can explain this to a member of the public in 500 words (or 90 seconds on radio/TV) is idiotic. Instead, we explain a butchered version.

Science outreach harms the public. Look at this (admittedly, not so recent) story: The pill is linked to depression – and doctors can no longer ignore it (The Guardian, one of my favorite newspapers). The study was potentially interesting, but in no way conclusive: it’s observational with obvious confounders: the population of women who start taking the pill at a young age is not the same as that that takes the pill at a later time. However, reading any of the panicked articles (Guardian again, the BBCQuartz: this was not just the Oprah Winfrey Show) may lead some women to stop taking the pill for no good reason, which is not a positive for them. (To be fair, some news outlets did publish a skeptical view of it, e.g., The Spectator, Jezebel).

Science outreach discredits science. The public now has what amounts to a healthy skepticism of any nutritional result, given the myriad “X food gives you cancer” or “Chocolate makes you thin“. If you think that there are no spillovers from publicizing shaky results into other fields of science (say, climate science), then you are being naïve.

Valuing science outreach encourages bad science. If anything, the scientific process is already too tipped towards chasing the jazzy new finding instead of solid research. It’s the worst papers that make the news, let’s not reward that. The TED-talk giving, book writing, star of pop psychology turned out to be a charlatan. Let’s not reward the same sillyness in other fields.

What is the alternative?

The alternative is very simple: instead of communicating iffy, cutting-edge stuff, communicate settled science. Avoid almost all recent papers and focus on established results and bodies of literature. Books that summarize a body of work can do a good job.

But this is not news! Frankly, who cares? Journalists do, but I think the public is happy to read non-news stuff. In fact, most people find that type of reporting more interesting that the flashy study of the week.

Or use opportunities to tie it to current events. For example, when major prizes are given out (like the Nobel prize, but also the tiers below it), publish your in-depth reporting the topics. Since often the topics are predictable, you can prepare your reporting carefully and publish it when ready. You don’t even need the topic/person to actually win. In the weeks prior to Nobel Prize week, you can easily put up a few articles on “the contenders and their achievements”.

Finally, let me remark that, for the members of the public who do wish to be informed of cutting-edge science and are willing (perhaps even eager) to put in the mental work of understanding it, there are excellent resources out there. For example, I regularly listen to very good science podcasts on the website. They do a good job of going through the cutting edge stuff in an informed way: they take 30 minutes to go through a paper, not 30 seconds. They won’t ever be mainstream, but that’s the way it should be.

Numpy/scipy backwards stability debate (and why freezing versions is not the solution)

This week, a discussion broke out about the stability of the Python scientific ecosystem. It was triggered by a blogpost from Konrad Hinsen, which led to several twitter follow ups.

First of all, let me  say that numpy/scipy are great. I use them and recommend them all the time. I am not disparaging the projects as a whole or the people who work on them. It’s just that I would prefer if they were stabler. Given twitter’s limitations, perhaps this was not as clear as I would like on my own twitter response:

I pointed out that I have been bit by API changes:

All of these are individually minor (and can be worked around), but these are just the issues that I have personally ran into and caused enough problems for me to remember them. The most serious was the mannwhitneyu change, which was a silent change (i.e., the function started returning a different result rather than raising an exception or another type of error).


Konrad had pointed out the Linux kernel project as one extreme version of “we never break user code”:

The other extreme is the Silicon-Valley-esque “move fast and break stuff”, which is appropriate for a new project. These are not binary decisions, but two extremes of a continuum. I would hope to see numpy move more towards the “APIs are contracts we respect” side of the spectrum as I think it currently behaves too much like a startup.

Numpy does not use semantic versioning, but if it did almost all its releases would be major releases as they almost always break some code. We’d be at Numpy 14.0 by now. Semantic versioning would allow for a smaller number of “large, possibly-breaking releases” (every few years) instead of a constant stream of minor backwards-incompatible changes. We’d have Numpy 4.2 now, and a list of deprecated features to be removed by 5.0.

Some of the problems that have arisen could have been solved by (1) introducing a new function with the better behaviour, (2) deprecating the old one, (3) waiting a few years and removing the original version (in a major release, for example). This would avoid the most important problem, silent changes.


A typical response is to say “well, just use anaconda (or similar) to freeze your dependencies”. Again, let me be clear, I use and recommend anaconda for everything. It’s great. But, in the context of the backwards compatibility problem, I don’t think this recommendation is really thought through as it only solves a fraction of the problem at hand (again, an important fraction but it’s not a magic wand).  (See also this post by Titus Brown).

What does anaconda not solve? It does not solve the problem of the intermediate layer, libraries which use numpy, but are to be used by final users. What is the expectation here? That I release my computer vision code (mahotas) with a note: Only works on Numpy 1.11? What if I want a project that uses both mahotas and scikit-learn, but scikit-learn is for Numpy 1.12 only? Is the conclusion that you cannot mix mahotas and scikit-learn? This would be worse than the Python 2/3 split. A typical project of mine might use >5 different numpy-dependent libraries. What are the chances that they all expect the exact same numpy version?

Right now, the solution I use in my code is “if I know that this numpy function has changed behaviour, either work around it, avoid it, or reimplement it (perhaps by copying and pasting from numpy)”. For example, some functions return views or copies depending on which version of numpy you have. To handle that, just add a “copy()” statement to all of them and now you always have a copy. It’s computationally inefficient, but avoiding even a single bug over a few years probably saves more time in the end.

It also happens all the time that I have an anaconda environment, add a new package and numpy is upgraded/downgraded. Is this to be considered buggy behaviour by anaconda? Anaconda currently does not upgrade everything to Python 3 when you request a package that is not available on Python 2, nor does it downgrade from 3 to 2; why should it treat numpy any differently if there is no guarantee that behaviour is consistent across numpy verions?

Sometimes the code at hand is not even an officially released library, but some code from another project. Let’s say that I have code that takes a metagenomics abundance matrix, does some preprocessing and computes stats and plots. I might have written it originally for a paper a few years back, but now want to do the same analysis on new data. Is the recommendation that I always write from scratch because it’s a new numpy version? What if it’s someone else asking me for the code? Is the recommendation that I ask “are you still on numpy 1.9, because I only really tested it there”. Note that Python’s dynamic nature actually makes this problem worse than in statically checked languages.

What about training materials? As I also wrote on twitter, it’s when teaching Python that I suffer most from Python 2-vs-Python 3 issues. Is the recommendation that training materials clearly state “This is a tutorial for numpy 1.10 only. Please downgrade to that version or search for a more up to date tutorial”? Note that beginners are the ones most likely to struggle with these issues. I can perfectly understand what it means that: “array == None and array != None do element-wise comparison”(from the numpy 1.13 release notes). But if I was just starting out, would I understand it immediately?

Freezing the versions solves some problems, but does not solve the whole issue of backwards compatibility.

New papers I: imaging environmental samples of micro eukaryotes

This week, I had two first author papers published:

  1. Quantitative 3D-imaging for cell biology and ecology of environmental microbial eukaryotes 
  2. Jug: Software for Parallel Reproducible Computation in Python

I intend to post on both of them over the next week or so, but I will start with the first one.

The basic idea is that just as metagenomics was the application of lab techniques (sequencing) that had been developed for pure cultures to environmental samples, we are moving from imaging cell cultures (the type of work I did during my PhD and shortly afterwards) to imaging environmental samples. These are, thus, mixed samples of microbes (micro-eukaryotes, not bacteria, but remember: protists are microbes too).

Figure 1 from paper

Figure 1 from the paper depicting the process (a) and the results (b & c).

The result is a phenotypic view of the whole community, not just the elements that you can easily grow in the lab. As it is not known apriori which organisms will be present, we use generic eukaryotic dyes, tagging DNA, membranes, and the exterior. In addition, chlorophyll is auto-fluorescence, so we get a free extra channel.

With automated microscopes and automated analysis, we obtained images of 300,000 organisms, which were classified into 155 classes. A simple machine-learning system can perform this classification with 82% accuracy, which is similar to (or better than) the inter-operator variability in similar problems.

The result is both a very large set of images as well as a large set of features, which can be exploited for understanding the microbial community.

When you say you are pro-science, what do you mean you are in favor of?

In the last few weeks, with the March for Science coming up, there have been a few discussion of what being pro-science implies. I just want to ask

When you say you are pro-science, what do you mean you are in favor of?

Below, I present a few different answers.

Science as a set of empirically validated statements

Science can mean facts such as:

  • the world is steadily warming and will continue to do as CO2 concentrations go up
  • nuclear power is safe
  • vaccines are safe
  • GMOs are safe

This is the idea behind the there is no alternative to facts rhetoric. The four statements above can be quibbled with (there are some risks to some vaccines, GMO refers to a technique not a product so even calling it safe or unsafe is not the right discussion, nuclear accidents have happened, and there is a lot of uncertainty on both the amount of warming and its downstream effects), but, when understood in general terms, they are facts and those who deny them, deny reality.

When people say that science is not political, they mean that these facts are independent of one’s values. I’d add the Theory of Evolution to the above four, but evolution (like Quantum Mechanics or Relativity) is even more undeniable.

Science and technology as a positive force

The above were “value-less facts”; let’s slowly get into values.

The facts above do not have any consequences for policy or behaviour on their own. They do constrain the set of possible outcomes, but for a decision, you need a set of values on top.

It’s still perfectly consistent with the facts and claim the following: Vaccines are safe, but a person’s bodily autonomy cannot be breached in the name of utilitarianism. In the case of children, the parents’ autonomy should be paramount. This is a perfectly intellectually consistent libertarian position. As long as you are willing to accept that children will die as a consequence, then I cannot really say you are denying the scientific evidence. This may seem a shocking trade-off when said out loud but it also happens to be the de facto policy of the Western world for the 10-20 past years: vaccines are recommended, but most jurisdictions will not enforce them anymore.

Similar statements can be made about all of the above:

  • The world is getting warmer, but fossil fuels bring human beings wealth and so, are worth the price to the natural environment. The rest should be dealt with mitigation and geo-engineering. What is important is finding the lowest cost solution for people.
  • Nuclear power is safe, but storing nuclear waste destroys pristine environments and that is a cost not worth paying.
  • GMOs are safe, but messing with Nature/God’s work is immoral.

Empirical facts can provide us with the set of alternatives that are possible, but do not help us weigh alternatives against each other (note how often cost/benefit shows up in the above, but the costs are not all material costs). Still, often being pro-science is understood as being pro technological progress and, thus, anti-GMO or anti-nuclear activism is anti-science.

Science as a community and set of practices

This meaning of “being pro-Science”, science as the community of scientists, is also what leads to views such as being pro-Science means being pro-inclusive Science. Or, on the other side, bringing up Dr. Mengele.

Although it is true that empirically validated facts are shared across humanity, there are areas of knowledge that impact certain people more than others. If there is no effort to uncover the mechanisms underlying a particular disease that affect people in poorer parts of the world, then the efforts of scientists will have a differential impact in the world.

Progress in war is fueled by science as much as progress in any other area and scientists have certainly played (and continue to play) important roles in figuring out ways of killing more people faster and cheaper.

The scientific enterprise is embedded in the societies around it and has, indeed, in the past resorted to using slaves or prisoners. Even in the modern enlightened world, the scientific community has had its share of unethical behaviours, in ways both big and small.

To drive home the point: does supporting science mean supporting animal experiments? Obviously, yes, if you mean supporting the scientific enterprise as it exists. And, obviously, no, if it means supporting empirically validated statements!

The cluster of values that scientists typically share

Scientists tend to share a particular set of values (at least openly). We are pro-progress (in technological and social sense), socially liberal, cosmopolitan, and egalitarian. This is the view behind science is international and people sharing photos of their foreign colleagues on social media.

There is nothing empirically grounded of why these values would be better than others, except that they seem to be statistically more abundant in the minds of professional scientists. Some of this may really be explained by the fact that open minded people will both like science and share this type of values, but a lot of it is more arbitrary. Some of it is selection: given the fact that the career mandates travel and the English language, there is little appeal to individuals who prefer a more rooted life. Some of it is socialization (spend enough time in a community where these values are shared and you’ll start to share them). Some of it is preference falsification (in response to PC, people are afraid to come out and say what they really believe).

In any case, we must recognition that there is no objective sense in which these values are better than the alternative. Note that I do share them. If anything, their arbitrariness is often salient to me because I am even more cosmopolitan than the average scientist, so I see how the barrier between the “healthy nationalism” that is accepted and the toxic variety is a pretty arbitrary line in the sand.

What is funny too is that science is often funded exactly for the opposite reasons: It’s a prestige project for countries to show themselves superior to others, like funding the arts, or the Olympics team. (This is not the only reason to fund science, but it is certainly one of the reasons). You also hear it in Science is what made America great.

Science as an interest group

Science can be an interest group like any other: we want more subsidies & lower taxes (although there is little room for improvement there: most R&D is already tax-exempt). We want to get rid of pesky regulation, and the right to self-regulate (even though there is little evidence that self-regulation works). Science is an interest group.

Being pro-science

All these views of “What do I mean when I am pro-science?” interact and blend into each other: a lot of the resistance to things like GMOs does come from an empirically wrong view of the world and correcting this view thus assuage concerns about GMOs. Similarly, if you accept that science generally results in good things, you will be more in favor of funding it.

Sometimes, though, they diverge. The libertarian view that mixes a strong empiricism and defense of empirically validated facts with an opposition to public funding of science is a minority overall, but over-represented in certain intellectual circles.

On the other hand, I have met many people who support science as a force for progress and as an interest group, but who end up defending all sorts of pseudo-scientific nonsense and rejecting the consensus on the safety of nuclear power or GMOs. This is why I work at a major science institution whose health insurance covers homeopathy: the non-scientific staff will say they are pro-science, but will cherish their homeopathic “remedies”. I also suspect that many people declare themselves as pro-science because they see it as their side versus the religious views they disagree with, even though you can perfectly well be religious and pro-science in accepting the scientific facts.  I would never claim that Amish people are pro-progress and I hazard no guess on their views on public-science funding, but many are happy to grow GMOs as they accept the empirical fact of their safety. In that sense, they are more pro-science than your typical Brooklyn hipster.

Sometimes, these meanings of being pro-science blend into each other by motivated reasoning. So, instead of saying that vaccines are so overwhelmingly safe and that herd immunity is so important that I support mandating them (my view), I can be tempted to say “there is zero risk from vaccines” (which is not true for every vaccine, but I sure wish it were). I can be tempted to downplay the uncertainty about the harder-to-disentangle areas of economic policy and cite the empirical studies that agree with my ideology, and to call those who disagree “anti-scientific.” I might deny that values even come into play at all. We like to pretend there are no trade-offs. This is why anti-GMO groups often start by discussing intellectual property and land-use issues and end up trying to shut down high-school science biology classes.

In an ideal world, we’d reserve the opprobrium of “being anti-science” for those who deny empirical facts and well-validated theories, while discussing all the other issues as part of the traditional political debates (is it worth investing public money in science or should we invest more in education and new public housing? or lowering taxes?). In the real world, we often borrow credibility from empiricism to support other values. The risk, however, is that, what we borrow, we often have to pay back with interest.

Scott Sumner on what is a science

And don’t embarrass yourself by arguing macroeconomics is not a science.  Of course it’s a science.  It’s failed science, but then so are some of the other sciences, at least based on what I’ve read about the crisis in replication.  The term ‘science’ is not a compliment, it’s not some sort of award given to a field, like a Nobel Prize.  It’s simply a descriptive term for a field that builds models that try to explain how the world works.  Saying that science must be successful to be viewed as science is as silly as saying that a work of art must be good to be considered art.

Scott Sumner

Psychology is a failed science.

Repost: BLAST deserves a Nobel Prize

Given that tomorrow (Monday October 3) the 2016 Nobel Prize in Physiology or Medicine will be announced, I am linking to my 2-year old post arguing that BLAST deserves a Nobel Prize:

In terms of impact in the field, it’s undeniable that BLAST has been huge. These people created a verb! What modern biologist does not know what “blasting a sequence” means? The BLAST paper was, at one point, the most highly cited paper in history. The impact on physiology is undeniable.

Lipman and Gene Myers stand out for their contributions to the computational processing of biological sequences. (See how I phrased that in a Nobel Committee way).


[One] counterargument I’ve heard is that BLAST is mostly a method, but so was GFP […] Does anybody believe that just the 1962 discovery of a jellyfish protein would have sufficed for a Nobel?


BLAST was definitely one of the most largest advances in the field of physiology in the last few decades. For this reason, David Lipman and Gene Myers should get a Physiology and Medicine Nobel Prize.

I also add that current favorite CRISPR is also mostly a method (the CRISPR Prize, when it comes, will be awarded for the method not the discovery of some DNA processing mechanism in a Streptococcus species.