Python’s Weak Performance Matters

Here is an argument I used to make, but now disagree with:

Just to add another perspective, I find many “performance” problems in
the real world can often be attributed to factors other than the raw
speed of the CPython interpreter. Yes, I’d love it if the interpreter
were faster, but in my experience a lot of other things dominate. At
least they do provide low hanging fruit to attack first.

[…]

But there’s something else that’s very important to consider, which
rarely comes up in these discussions, and that’s the developer’s
productivity and programming experience.[…]

This is often undervalued, but shouldn’t be! Moore’s Law doesn’t apply
to humans, and you can’t effectively or cost efficiently scale up by
throwing more bodies at a project. Python is one of the best languages
(and ecosystems!) that make the development experience fun, high
quality, and very efficient.

(from Barry Warsaw)

I used to make this argument. Some of it is just a form of utilitarian programming: having a program that runs 1 minute faster but takes 50 extra hours to write is not worth it unless you run it more than 3,000 times (50 hours is 3,000 minutes). For code written as part of a data analysis, that is rarely the case. But I no longer think the argument is as strong as I used to. I now believe that the fact that CPython (the only widely used Python interpreter) is slow is a major disadvantage of the language, not just a small tradeoff for faster development time.

What changed in my reasoning?

First of all, I’m working on different problems. I used to do a lot of work that mapped easily onto numpy operations (which are fast, as they call into compiled code); now I write a lot of code which is not straight numerics, and if it has to run as standard Python, it is slow as molasses. I don’t mean slow in the sense of “wait a couple of seconds”, I mean “wait several hours instead of 2 minutes.”

At the same time, data keeps getting bigger and computers come with more and more cores (which Python cannot easily take advantage of), while single-core performance is only slowly getting better. Thus, Python is a worse and worse solution, performance-wise.
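
To make the point about cores concrete, here is a toy illustration (a sketch, not a benchmark from the post): on CPython, running a CPU-bound pure-Python function in several threads buys essentially nothing because of the GIL, while separate processes can actually use the extra cores.

    import time
    from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

    def cpu_bound(n):
        # Pure-Python, CPU-bound work: exactly what the GIL serializes.
        total = 0
        for i in range(n):
            total += i * i
        return total

    def timed(executor_cls, workers, n=2_000_000):
        start = time.time()
        with executor_cls(max_workers=workers) as pool:
            list(pool.map(cpu_bound, [n] * workers))
        return time.time() - start

    if __name__ == "__main__":
        # On CPython, the threaded run typically takes about as long as running
        # the tasks sequentially; the process-based run can use all the cores.
        print("threads:  ", timed(ThreadPoolExecutor, 4))
        print("processes:", timed(ProcessPoolExecutor, 4))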

Other languages have also demonstrated that it is possible to get good performance from high-level code (using JITs or very aggressive compile-time optimization). From afar, the core Python development group seems uninterested in these ideas. They regularly pop up in side projects: psyco, unladen swallow, stackless, shedskin, and pypy; the last being the only one still in active development. However, for all the buzz they generate, they never make it into CPython, which still uses the same basic bytecode stack-machine strategy it used 20 years ago. Yes, optimizing a very dynamic language is not a trivial problem, but JavaScript is at least as dynamic as Python and it has several JIT-based implementations.

It is true that programmer time is more valuable than computer time, but waiting for results to finish computing is also a waste of my time (I suppose I could do something else in the meanwhile, but context switches are such a killer of my performance that I often just wait).

I have also sometimes found that, in order to make something fast in Python, I end up with complex, almost unreadable, code. See this function for an example. The first version we wrote was a loop-based function, directly translating the formula it computes. It took hours on a medium-sized problem (it would take weeks on the real-life problems we want to tackle!). Now, it runs in a few seconds, but unless you are much smarter than I am, it is not trivial to read the underlying formula back out of the code.
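
The pattern is easy to illustrate with a toy example (this is not the function linked above, just a sketch of the same phenomenon): the loop-based version reads like the formula, the vectorized one runs orders of magnitude faster but hides it.

    import numpy as np

    # Direct translation of the formula: readable, but slow as pure Python loops.
    def pairwise_sqdist_loop(X):
        n = len(X)
        D = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                D[i, j] = np.sum((X[i] - X[j]) ** 2)
        return D

    # Vectorized version: much faster, but the underlying formula
    # ||x_i - x_j||^2 = ||x_i||^2 + ||x_j||^2 - 2 x_i.x_j is no longer obvious.
    def pairwise_sqdist_fast(X):
        sq = (X ** 2).sum(axis=1)
        return sq[:, None] + sq[None, :] - 2 * X @ X.T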

The result is that I find myself doing more and more things in Haskell, which lets me write high-level code with decent performance (still slower than what I get if I go all the way down to C++, but with very good libraries). I still use Jug (Python-based) to glue it all together, but it is calling Haskell code to do all the actual work.

I now sometimes prototype in Python, then do a kind of race: I start running the analysis on the main dataset and, at the same time, reimplement the whole thing in Haskell. Then I start the Haskell version and try to make it finish before the Python analysis completes. Many times, the Haskell version wins (even counting development time!).

Update: Here is a “fun” Python performance bug that I ran into the other day: deleting a set of 1 billion strings takes >12 hours. Obviously, this particular instance can be fixed, but this is exactly the sort of thing that I would never have done a few years ago. A billion strings seemed like a lot back then, but now we regularly discuss multiple terabytes of input data as “not a big deal”. This may not apply to your setting, but it does to mine.
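
If you want to see the effect for yourself, the measurement is simple. This is a scaled-down sketch with 10 million strings, not a reproduction of the billion-string case above:

    import gc
    import time

    # Build a large set of distinct strings, then time how long freeing it takes.
    strings = {"read_%012d" % i for i in range(10_000_000)}

    start = time.time()
    del strings        # drop the only reference...
    gc.collect()       # ...and force collection so the timing includes the cleanup
    print("Deleting the set took %.1fs" % (time.time() - start))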

Update 2: Based on a comment I made on hackernews, this is how I summarize my views:

The main motivation is to minimize total time, which is TimeToWriteCode + TimeToRunCode.

Python has the lowest TimeToWriteCode, but a very high TimeToRunCode. TimeToWriteCode is roughly fixed, as it is a human factor (after the initial learning curve, I am not getting that much smarter). However, as datasets grow and single-core performance does not get better, TimeToRunCode keeps increasing, so it is more and more worth spending extra time writing code to decrease TimeToRunCode. C++ would give me the lowest TimeToRunCode, but at too high a cost in TimeToWriteCode (not so much the language itself as the lack of decent libraries and package management). Haskell is (for me) a good tradeoff.

This is applicable to my work, where we do use large datasets as inputs. YMMV.

Bray-Curtis dissimilarity on relative abundance data is the Manhattan distance (aka L₁ distance)

Warning: super-technical post ahead, but I have made this point in oral discussions at least a few times, so I thought I would write it up. It is a trivial algebraic manipulation, but because “ℓ₁ norm” sounds too mathy for ecologists while “Bray-Curtis” sounds too ecological and ad hoc for mathematically minded people, it is good to see that they are the same thing on normalized data.

Assuming you have two feature vectors, Xᵢ, Xⱼ, if they have been normalized to sum to 1, then the Bray-Curtis dissimilarity is just their ℓ₁ distance, aka Manhattan distance (times ½, which is a natural normalization so that the result is between zero and one).

This is the Wikipedia definition of the Bray-Curtis dissimilarity (there are a few other, equivalent, definitions around, but we’ll use this one):

BC = 1 – 2Cᵢⱼ/(Sᵢ + Sⱼ), where Cᵢⱼ = Σₖ min(Xᵢₖ, Xⱼₖ) and Sᵢ = Σₖ Xᵢₖ.

While the Manhattan distance is D₁ = Σₖ |Xᵢₖ – Xⱼₖ|.

We are assuming that they sum to 1, so Sᵢ=Sⱼ=1. Thus,

BC = 1 – Σₖ min(Xᵢₖ, Xⱼₖ)

Now, this still does not look like the Manhattan distance (D₁, above). But for any a and b ≥0, it holds that

min(a,b) = (a + b)/2 – |a – b|/2

(this is easiest to see graphically:  start at the midpoint, (a+b)/2 and subtract half of the difference, |a-b|/2).

Thus, BC = 1 – Σₖ min(Xᵢₖ, Xⱼₖ) = 1 – Σₖ {(Xᵢₖ + Xⱼₖ)/2 – |Xᵢₖ – Xⱼₖ|/2}

Because we assumed that the vectors are normalized, Σₖ (Xᵢₖ + Xⱼₖ)/2 = (Σₖ Xᵢₖ + Σₖ Xⱼₖ)/2 = 1, so

BC = 1 – 1 + Σₖ |Xᵢₖ – Xⱼₖ|/2 = D₁/2.
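
If you prefer code to algebra, here is a quick numerical sanity check using scipy’s built-in implementations (braycurtis and cityblock from scipy.spatial.distance):

    import numpy as np
    from scipy.spatial.distance import braycurtis, cityblock

    rng = np.random.default_rng(0)
    x = rng.random(100)
    y = rng.random(100)
    x /= x.sum()   # normalize so that each vector sums to 1
    y /= y.sum()

    print(braycurtis(x, y))       # Bray-Curtis dissimilarity
    print(cityblock(x, y) / 2.0)  # half the Manhattan (ℓ₁) distance: same value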

Against Science Communication

Science communication, science outreach, is widely seen as a good thing, perhaps even a necessary one (Scientists: do outreach or your science dies). Let me disagree: For the most part, cutting-edge science should not be communicated to the public in mass media and doing so harms both the public and science.

At the end of the post, I suggest some alternatives to the current state of affairs.

What is wrong with science communication?

For all the complaints that the public has lost faith in science, I often find, anecdotally, that members of the public have more faith in scientific results than most scientists do. Most scientists know to take any single paper with a grain of salt. The public does not always realize that: they don’t have the time (or the training) to dig into the paper and are left with a clickbait headline that they must either take at face value or reject as “scientific results”.

Most of the time, science communication has to simplify the underlying science so much as to be basically meaningless. Most science doesn’t make any sense even to scientists working in adjacent fields: We publish a new method which is important to predict the structure of a protein that is important because other studies have shown that it is important in a pathway that other studies have shown is active in response to … and so on. At the very bottom, we have things that the public cares about (and which is why we get funded), but the relationships are not trivial and we should stop pretending otherwise.

When I thumb through the latest issue of Nature, only a minority of titles are meaningful to me. Sometimes, I genuinely have no idea what they are talking about (One-pot growth of two-dimensional lateral heterostructures via sequential edge-epitaxy); other times, I get the message (Structure of the glucagon receptor in complex with a glucagon analogue) but I don’t have the context of that field to understand why it is important. To pretend that we can explain this to a member of the public in 500 words (or 90 seconds on radio/TV) is idiotic. Instead, we explain a butchered version.

Science outreach harms the public. Look at this (admittedly, not so recent) story: The pill is linked to depression – and doctors can no longer ignore it (The Guardian, one of my favorite newspapers). The study was potentially interesting, but in no way conclusive: it is observational with obvious confounders; the population of women who start taking the pill at a young age is not the same as the population that starts taking it later. However, reading any of the panicked articles (the Guardian again, the BBC, Quartz: this was not just the Oprah Winfrey Show) may lead some women to stop taking the pill for no good reason, which is not a positive for them. (To be fair, some news outlets did publish a skeptical view of it, e.g., The Spectator and Jezebel.)

Science outreach discredits science. The public now has what amounts to a healthy skepticism of any nutritional result, given the myriad “X food gives you cancer” or “Chocolate makes you thin” stories. If you think that there are no spillovers from publicizing shaky results into other fields of science (say, climate science), then you are being naïve.

Valuing science outreach encourages bad science. If anything, the scientific process is already too tipped towards chasing the jazzy new finding instead of solid research. It’s the worst papers that make the news; let’s not reward that. The TED-talk-giving, book-writing star of pop psychology turned out to be a charlatan. Let’s not reward the same silliness in other fields.

What is the alternative?

The alternative is very simple: instead of communicating iffy, cutting-edge stuff, communicate settled science. Avoid almost all recent papers and focus on established results and bodies of literature. Books that summarize a body of work can do a good job.

But this is not news! Frankly, who cares? Journalists do, but I think the public is happy to read non-news material. In fact, most people find that type of reporting more interesting than the flashy study of the week.

Or use opportunities to tie it to current events. For example, when major prizes are given out (like the Nobel Prize, but also the tiers below it), publish your in-depth reporting on the topics. Since the topics are often predictable, you can prepare your reporting carefully and publish it when ready. You don’t even need the topic/person to actually win: in the weeks prior to Nobel Prize week, you can easily put up a few articles on “the contenders and their achievements”.

Finally, let me remark that, for the members of the public who do wish to be informed of cutting-edge science and are willing (perhaps even eager) to put in the mental work of understanding it, there are excellent resources out there. For example, I regularly listen to very good science podcasts on the microbe.tv website. They do a good job of going through the cutting edge stuff in an informed way: they take 30 minutes to go through a paper, not 30 seconds. They won’t ever be mainstream, but that’s the way it should be.

The Dark Looking Glass (Black Mirror fan fiction)

I’m currently in the middle of Black Mirror’s Season 4, and I feel that, despite it probably being the best show on TV, it is starting to repeat the same themes in a way that makes it predictable: brain implants, immersive video games, Siri’s sister, &c.

Thus, I decided to think up a few episode ideas myself. I felt that the show has not really explored the possibility of better mind-altering drugs, so a couple of random ideas follow.

§

The Dark Looking Glass Episode 1
Title: What happens in Vegas, stays in Vegas.

A new drug has the following effect: you take it and it has no effects until you go to sleep. Then, it erases the memory of what happened that day, except for vague feelings and emotional affect.

People start taking it for parties. Las Vegas hotels have these organized parties where everyone has to take the drug so that nobody will remember what happens (except for hotel security, who stays sober, but is sworn to secrecy). Society becomes more and more conservative, with wild behaviour confined to these forgetful parties, which themselves become wilder. You fly to Vegas for drug-fueled, sexual bacchanals every few months. Then you forget the details of everything that happened, but still feel liberated and relaxed so you live a conservative lifestyle the rest of the year.

At one of these parties, a group of friends (in their 50s, typical middle-class Americans) have their regular sexual orgy (“do you think this is what we do every time?” a woman asks a man who is not her husband while they have sex).

Then, a freak accident kills the sober security so that everyone at the party is under the forgetful drug.

Once they realize that they are now truly and completely free, a husband kills a wife; a wife kills her lover. Another man is left for dead, but survives. People run, scramble, hide. Eventually, everyone collapses of exhaustion, triggering the drug’s effect. The next day, they wake up afraid, terrified, but nobody knows why. The bodies are discovered and the police are called. Nobody knows who killed whom, or who beat up the barely alive man on the floor. The police cannot make any headway either: the whole house is a mess, everyone was there, the murder weapons are at the bottom of the pool.

The orgy participants know that some of them killed others, that somebody beat somebody up, but nobody knows who the killers are (if there is more than one). As local police are both stuck and embarrassed (they were supposed to provide security and failed), the survivors are let go from Vegas and fly back to suburban Atlanta. To help their teenage kids with their homework.

The Dark Looking Glass, Episode 2
Title: Japan

Open with Lithium, by Nirvana.

Society adds a drug to the water supply that makes everyone nicer. Crime becomes almost non-existent, wars disappear (diplomatic solutions are sought and found), things are good. A group of natural-water advocates, however, starts drinking normal, non-drugged water. Think organic-eating yoga mom, not gun-wielding libertarian. They are accepted by society, as everyone is so nice and tolerant.

As this group of hippies lives without the drug, they talk about having more complete feelings and about the euphoria. It takes months to years for the drug to fully wash out of your system, though. As it does, it becomes clear that adults having strong emotions without a lifetime of learning how to manage them is not a good idea, and they become violent. Society is completely unprepared for it. The police are not armed, not even with a stick, as disputes were always settled by reasoning before. Psychologically, nobody knows what to do.

The only solution is to force everyone to take the drug. After thousands more deaths, the makeshift police force does overwhelm the rebels and drug them.

Peace is restored.

Numpy/scipy backwards stability debate (and why freezing versions is not the solution)

This week, a discussion broke out about the stability of the Python scientific ecosystem. It was triggered by a blog post from Konrad Hinsen, which led to several twitter follow-ups.

First of all, let me say that numpy/scipy are great. I use them and recommend them all the time. I am not disparaging the projects as a whole or the people who work on them. It’s just that I would prefer if they were more stable. Given twitter’s limitations, perhaps this was not as clear as I would have liked in my own twitter response:

I pointed out that I have been bitten by API changes:

All of these are individually minor (and can be worked around), but they are just the issues that I have personally run into and that caused enough problems for me to remember them. The most serious was the mannwhitneyu change, which was a silent change (i.e., the function started returning a different result rather than raising an exception or some other kind of error).

*

Konrad had pointed out the Linux kernel project as one extreme version of “we never break user code”:

The other extreme is the Silicon-Valley-esque “move fast and break stuff”, which is appropriate for a new project. These are not binary decisions, but two extremes of a continuum. I would hope to see numpy move more towards the “APIs are contracts we respect” side of the spectrum as I think it currently behaves too much like a startup.

Numpy does not use semantic versioning, but if it did, almost all of its releases would be major releases, as they almost always break some code. We’d be at Numpy 14.0 by now. Semantic versioning would allow for a smaller number of “large, possibly-breaking releases” (every few years) instead of a constant stream of minor backwards-incompatible changes. We’d have Numpy 4.2 now, and a list of deprecated features to be removed by 5.0.

Some of the problems that have arisen could have been solved by (1) introducing a new function with the better behaviour, (2) deprecating the old one, and (3) waiting a few years and removing the original version (in a major release, for example). This would avoid the most important problem: silent changes.
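
A minimal sketch of that (1)–(3) pattern, with made-up function names (this is not numpy’s or scipy’s actual API):

    import warnings

    def compute_stat_v2(x, y):
        """Step (1): new function with the corrected behaviour."""
        return sum(x) - sum(y)          # placeholder computation

    def compute_stat(x, y):
        """Step (2): the old name keeps its old behaviour, but warns loudly."""
        warnings.warn(
            "compute_stat is deprecated and will be removed in the next major "
            "release; use compute_stat_v2, which returns the corrected value",
            DeprecationWarning,
            stacklevel=2,
        )
        return abs(sum(x) - sum(y))     # old (placeholder) behaviour, unchanged

    # Step (3): after a few years, compute_stat is deleted in a major release.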

*

A typical response is to say “well, just use anaconda (or similar) to freeze your dependencies”. Again, let me be clear: I use and recommend anaconda for everything. It’s great. But, in the context of the backwards compatibility problem, I don’t think this recommendation is really thought through, as it only solves a fraction of the problem at hand (an important fraction, but it’s not a magic wand). (See also this post by Titus Brown.)

What does anaconda not solve? It does not solve the problem of the intermediate layer: libraries which use numpy but are meant to be used by end users. What is the expectation here? That I release my computer vision code (mahotas) with a note: “Only works on Numpy 1.11”? What if I want a project that uses both mahotas and scikit-learn, but scikit-learn is for Numpy 1.12 only? Is the conclusion that you cannot mix mahotas and scikit-learn? This would be worse than the Python 2/3 split. A typical project of mine might use >5 different numpy-dependent libraries. What are the chances that they all expect the exact same numpy version?
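
To make the conflict concrete, imagine the “freeze your dependencies” advice applied at the library level. This is a hypothetical setup.py, not mahotas’s real one:

    # Hypothetical setup.py for an intermediate-layer library that encodes
    # "only works on Numpy 1.11" as a hard pin.
    from setuptools import setup

    setup(
        name="my-vision-library",
        version="1.0",
        install_requires=["numpy>=1.11,<1.12"],
    )

    # If another library pins "numpy>=1.12,<1.13", no environment can satisfy
    # both, and the two can never be used in the same project.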

Right now, the solution I use in my own code is: if I know that a numpy function has changed behaviour, either work around it, avoid it, or reimplement it (perhaps by copying and pasting from numpy). For example, some functions return views or copies depending on which version of numpy you have. To handle that, I just add a .copy() call after all of them, so I always have a copy. It is computationally inefficient, but avoiding even a single bug over a few years probably saves more time in the end.
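
For example (np.diagonal is one function whose copy-vs-view behaviour has shifted across numpy versions; the exact function matters less than the pattern):

    import numpy as np

    x = np.arange(16).reshape((4, 4))

    # Whether np.diagonal returns a copy or a (read-only) view depends on the
    # numpy version; defensively copying means we no longer have to care.
    d = np.diagonal(x).copy()
    d[0] = -1   # safe: we own the memory, whatever this numpy version returned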

It also happens all the time that I have an anaconda environment, add a new package, and numpy is upgraded or downgraded. Is this to be considered buggy behaviour by anaconda? Anaconda currently does not upgrade everything to Python 3 when you request a package that is not available on Python 2, nor does it downgrade from 3 to 2; why should it treat numpy any differently if there is no guarantee that behaviour is consistent across numpy versions?

Sometimes the code at hand is not even an officially released library, but some code from another project. Let’s say that I have code that takes a metagenomics abundance matrix, does some preprocessing, and computes stats and plots. I might have written it originally for a paper a few years back, but now want to do the same analysis on new data. Is the recommendation that I always rewrite it from scratch because there is a new numpy version? What if it’s someone else asking me for the code? Is the recommendation that I ask “are you still on numpy 1.9? Because I only really tested it there”? Note that Python’s dynamic nature actually makes this problem worse than in statically checked languages.

What about training materials? As I also wrote on twitter, it is when teaching Python that I suffer most from Python 2-vs-Python 3 issues. Is the recommendation that training materials clearly state “This is a tutorial for numpy 1.10 only. Please downgrade to that version or search for a more up-to-date tutorial”? Note that beginners are the ones most likely to struggle with these issues. I can perfectly well understand what it means that “array == None and array != None do element-wise comparison” (from the numpy 1.13 release notes). But if I were just starting out, would I understand it immediately?
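
For reference, this is what that release note means in practice (a tiny example, assuming numpy 1.13 or later):

    import numpy as np

    arr = np.arange(3)

    print(arr == None)   # element-wise comparison: array([False, False, False])
    print(arr is None)   # the identity check a beginner probably meant: False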

Freezing the versions solves some problems, but does not solve the whole issue of backwards compatibility.

How NGLess uses its version declaration

NGLess is my metagenomics tool, which is based on a domain specific language. So, NGLess is both a language and a tool (which implements the language).

Since the beginning, ngless has had a focus on reproducibility, and one of the small ways in which this was implemented is that every ngless script is required to start with a version declaration:

    ngless "0.5"

This was always intended to enable the language to change while keeping perfect reproducibility of past scripts. Until recently, though, this was just hypothetical.

In October, I taught a course on NGLess and it became clear that one of the minor inconsistencies in the previous version of the language (at the time, version “0.0”) was indeed confusing. In the previous version of the language, the preprocess function modified its arguments. No other function did this.

In version “0.5” (which was released on November 1st), preprocess is now a pure function, so that you must assign its output to a value.

However, and this is where the version declaration comes into play, the newer executable still accepts scripts with the version declaration ngless "0.0". Furthermore, if you declare your script as using ngless 0.0, then the old behaviour is used. Thus, we fixed the language, but nobody needs to update their scripts.

Implementation note (which shouldn’t concern the user, but may be interesting to others): before interpretation, ngless transforms the input script, adding checks and optimizing it. A new pass (which is only enabled if the user requested version “0.0”) simply transforms the older code into its newer counterpart. Then, the rest of the process proceeds as if the user had written the newer version.
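
NGLess itself is implemented in Haskell and works on a proper AST, but the dispatch idea can be sketched in a few lines of toy Python (the string matching below is purely illustrative):

    # Toy sketch: the declared version selects which passes run before interpretation.
    def compat_0_0_pass(lines):
        """Rewrite old-style 'preprocess(x)' statements into 'x = preprocess(x)'."""
        out = []
        for line in lines:
            if line.startswith("preprocess(") and "=" not in line:
                arg = line[len("preprocess("):line.index(")")]
                line = "%s = %s" % (arg, line)
            out.append(line)
        return out

    def transform(lines, declared_version):
        passes = []
        if declared_version == "0.0":
            passes.append(compat_0_0_pass)   # only enabled for old scripts
        # ...followed by the usual checking and optimization passes...
        for p in passes:
            lines = p(lines)
        return lines

    old_script = ['ngless "0.0"', 'preprocess(input) using |read|:']
    print(transform(old_script, "0.0"))
    # ['ngless "0.0"', 'input = preprocess(input) using |read|:']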

Classifying protists into 155 (hierarchically organized) classes

An important component of my recent paper (previous post) on imaging protist (micro-eukaryote) communities is a classifier that assigns each individual object to one of 155 classes. These classes are organized hierarchically: the first level corresponds to whether the object is living or non-living; then, if living, it is classified into phyla; and so on. This is the graphical representation we have in the paper:

Using a large training set (>18,000 objects), we built a classifier capable of assigning objects to one of these 155 classes with >82% accuracy.

What is the ML architecture we use? In the end, we use the traditional system: we compute many features and use a random forest trained on the full 155 classes. Why a random forest?

A random forest should be the first thing you try on a supervised classification problem (and perhaps also the last, lest you overfit). I did spend a few weeks trying different variations on this idea, and none of them beat this simplest possible system. Random forests are also very fast to train (especially if you have a machine with many cores, as each tree can be learned independently).

As usual, the features were where the real work went. A reviewer astutely asked whether we really needed so many features (we compute 480 of them). The answer is yes. Even when selecting just the best features (which we wouldn’t know a priori, but let’s assume we had an oracle), it seems that we really do need a lot of features:

(This is Figure 3 — supplement 4: https://elifesciences.org/articles/26066/figures#fig3s4sdata1)

We need at least 200 features and the curve never really saturates. Furthermore, features are computed in groups (Haralick features, Zernike features, …), so we would not gain much running time by dropping only a handful of individual features.
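
For the curious, a curve like that can be produced with a few lines of scikit-learn. This is only a sketch, not the paper’s actual evaluation code; here the “oracle” ranking is simply the forest’s own feature importances:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def accuracy_vs_nr_features(features, labels, ks=(10, 50, 100, 200, 400)):
        # Rank features with a forest trained on everything...
        oracle = RandomForestClassifier(n_estimators=100, random_state=0)
        ranking = np.argsort(oracle.fit(features, labels).feature_importances_)[::-1]
        # ...then measure cross-validated accuracy using only the top-k features.
        scores = {}
        for k in ks:
            clf = RandomForestClassifier(n_estimators=100, random_state=0)
            scores[k] = cross_val_score(clf, features[:, ranking[:k]], labels, cv=5).mean()
        return scores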

In terms of implementation, features were computed with mahotas (paper) and machine learning was done with scikit-learn (paper).
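
A minimal sketch of that pipeline (only two of the feature groups are shown; the real feature set is much larger, 480 features in total):

    import mahotas
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def compute_features(image):
        # `image` is a 2-D greyscale (integer-valued) array.
        haralick = mahotas.features.haralick(image).ravel()
        zernike = mahotas.features.zernike_moments(image, radius=32)
        return np.concatenate([haralick, zernike])

    def train_classifier(images, labels):
        features = np.array([compute_features(im) for im in images])
        # n_jobs=-1: each tree is learned independently, so all cores are used.
        clf = RandomForestClassifier(n_estimators=200, n_jobs=-1)
        return clf.fit(features, labels)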

§

What about Deep Learning? Could we have used CNNs? Maybe, maybe not. We have a fair amount of data (>18,000 labeled samples), but some of the classes are not as well represented (in the pie chart above, the width of the classes represents how many objects are in the training set). A priori, it’s not clear it would have helped much.

Also, we may already be at the edge of what’s possible. Accuracy above 80% is already similar to human performance (unlike some of the more traditional computer vision problems, where humans perform with almost no mistakes and computers had very high error rates prior to the neural network revolution).