Python’s Weak Performance Matters

Here is an argument I used to make, but now disagree with:

Just to add another perspective, I find many “performance” problems in
the real world can often be attributed to factors other than the raw
speed of the CPython interpreter. Yes, I’d love it if the interpreter
were faster, but in my experience a lot of other things dominate. At
least they do provide low hanging fruit to attack first.

[…]

But there’s something else that’s very important to consider, which
rarely comes up in these discussions, and that’s the developer’s
productivity and programming experience.[…]

This is often undervalued, but shouldn’t be! Moore’s Law doesn’t apply
to humans, and you can’t effectively or cost efficiently scale up by
throwing more bodies at a project. Python is one of the best languages
(and ecosystems!) that make the development experience fun, high
quality, and very efficient.

(from Barry Warsaw)

I used to make this argument. Some of it is just a form of utilitarian programming: a program that runs 1 minute faster but takes 50 extra hours to write is not worth it unless you run it more than 3,000 times, and for code written as part of a data analysis, that is rarely the case. But I no longer find the argument as strong as I once did. I now believe that the fact that CPython (the only widely used Python interpreter) is slow is a major disadvantage of the language, not just a small tradeoff for faster development time.

What changed in my reasoning?

First of all, I’m working on other problems. Whereas I used to do a lot of work that mapped easily onto numpy operations (which are fast, as they run compiled code), I now write a lot of code which is not straight numerics. If I have to write it in standard Python, it is slow as molasses. I don’t mean slow in the sense of “wait a couple of seconds”, I mean “wait several hours instead of 2 minutes.”
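As a rough illustration (this is toy code, not from any of my projects), compare a computation that maps onto numpy with the same logic written as a plain Python loop; the interpreted version is typically orders of magnitude slower:

import time
import numpy as np

# Toy comparison: the same computation written against numpy (compiled loops)
# and as a plain Python loop (interpreted).
data = np.random.rand(10_000_000)

start = time.time()
mean = data.mean()
result_np = float(np.sum((data - mean) ** 2))      # runs in compiled code
print("numpy:       %.2fs" % (time.time() - start))

values = data.tolist()                             # plain Python floats
start = time.time()
mean = sum(values) / len(values)
result_py = sum((x - mean) ** 2 for x in values)   # every step is interpreted
print("pure Python: %.2fs" % (time.time() - start))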

At the same time, data keeps getting bigger and computers come with more and more cores (which Python cannot easily take advantage of), while single-core performance is only slowly getting better. Thus, Python is a worse and worse solution, performance-wise.

Other languages have also demonstrated that it is possible to get good performance with high-level code (using JIT compilation or very aggressive compile-time optimizations). Looking from afar, the core Python development group seems uninterested in these ideas. They regularly pop up in side projects: psyco, unladen swallow, stackless, shedskin, and pypy; the last one being the only one still in active development. However, for all the buzz they generate, they never make it into CPython, which still uses the same basic bytecode stack-machine strategy it used 20 years ago. Yes, optimizing a very dynamic language is not a trivial problem, but JavaScript is at least as dynamic as Python and it has several JIT-based implementations.

It is true that programmer time is more valuable than computer time, but waiting for results to finish computing is also a waste of my time (I suppose I could do something else in the meanwhile, but context switches are such a killer of my performance that I often just wait).

I have also sometimes found that, in order to make something fast in Python, I end up with complex, almost unreadable, code. See this function for an example. The first time we wrote it, it was a loop-based function, directly translating the formula it computes. It took hours on a medium-sized problem (it would take weeks on the real-life problems we want to tackle!). Now, it’s down to a few seconds, but unless you are much smarter than me, it’s not trivial to read the underlying formula out of the code.

The result is that I find myself doing more and more things in Haskell, which lets me write high-level code with decent performance (still slower than what I get if I go all the way down to C++, but with very good libraries). In fact, part of the reason that NGLess is written in Haskell and not Python is performance. I still use Jug (Python-based) to glue it all together, but it is calling Haskell code to do all the actual work.

I now sometimes prototype in Python, then run a kind of race: I start running the analysis on the main dataset while, at the same time, reimplementing the whole thing in Haskell. Then I start the Haskell version and try to make it finish before the Python analysis completes. Many times, the Haskell version wins (even counting development time!).

Update: Here is a “fun” Python performance bug that I ran into the other day: deleting a set of 1 billion strings takes >12 hours. Obviously, this particular instance can be fixed, but this is exactly the sort of thing that I would never have done a few years ago. A billion strings seemed like a lot back then, but now we regularly discuss multiple terabytes of input data as “not a big deal”. This may not apply to your settings, but it does to mine.

Update 2: Based on a comment I made on Hacker News, this is how I summarize my views:

The main motivation is to minimize total time, which is TimeToWriteCode + TimeToRunCode.

Python has the lowest TimeToWriteCode, but a very high TimeToRunCode. TimeToWriteCode is roughly fixed, as it is a human factor (after the initial learning curve, I am not getting that much smarter). However, as datasets grow and single-core performance does not improve, TimeToRunCode keeps increasing, so it becomes more and more worthwhile to spend extra time writing code in order to decrease TimeToRunCode. C++ would give me the lowest TimeToRunCode, but at too high a cost in TimeToWriteCode (not so much because of the language itself as because of the lack of decent libraries and package management). Haskell is (for me) a good tradeoff.
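To make the tradeoff concrete, here is a toy back-of-the-envelope calculation (all numbers are made up, purely for illustration):

# Toy numbers: total time is the one-off cost of writing the code plus the
# per-run cost multiplied by the number of runs.
def total_time(time_to_write, time_to_run, n_runs):
    return time_to_write + n_runs * time_to_run

for n_runs in (1, 10, 100):
    quick_to_write = total_time(2, 5.0, n_runs)   # e.g., a quick prototype (hours)
    fast_to_run = total_time(10, 0.5, n_runs)     # e.g., a slower-to-write rewrite (hours)
    print(n_runs, quick_to_write, fast_to_run)

As the number of runs (or the dataset size) grows, the version that was more expensive to write wins.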

This is applicable to my work, where we do use large datasets as inputs. YMMV.

ANN: Diskhash. Disk-based, persistent hash tables

A few weeks ago, I decided to finally scratch an itch I’ve had for a while: I had a few days off from work and implemented a persistent, disk-based, hash table. Funnily enough, I’m now intensively using it at work, but a priori it felt more like a side project than a work one (it’s often a fuzzy border).

A disk-based hashtable

The idea is very simple: it’s a basic hash table which lives in mmap()ed memory, so that it can be loaded from disk with a single system call. I’ve heard this type of system referred to as “baked data”: you build structures in memory that can be written to and read from disk without any need for parsing/converting.
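As a toy illustration of the “baked data” idea (this shows only the concept, not the diskhash on-disk format):

import mmap
import struct

# Toy illustration of "baked data" (not the diskhash format): a fixed binary
# layout is written once; "loading" it later is a single mmap() call, with
# values read in place and no parsing step.
RECORD = struct.Struct("<q")                      # one little-endian int64 per slot

with open("baked.bin", "wb") as f:
    for value in (10, 20, 30, 40):
        f.write(RECORD.pack(value))

with open("baked.bin", "rb") as f:
    buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    (third,) = RECORD.unpack_from(buf, 2 * RECORD.size)   # read slot 2 directly
    print(third)                                  # prints 30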

I implemented it all in C (because it is the lowest common denominator), but there are interfaces in C++, Python, and Haskell. The disk format is fixed, so all these interfaces can work with the same tables. You can jump to the bottom of the post to see code examples.

Performance

My usage is mostly to build the hashtable once and then reuse it many times. Several design choices reflect this bias and so does performance. Building the hash table can take a while. A big (roughly 1 billion entries) table took almost 1 hour to build. This compares to about 10 minutes for building a Python hashtable of the same size.

On disk, this table takes up 32GB (just the keys and data use up 21GB so I find the overhead acceptable). This compares with almost 200GB for the Python version. Additionally, several processes on the same machine can share the memory map (the operating system will do this automatically for you), further reducing memory usage when more than one process is running.

Using the C++ interface, I measured lookups at circa 10-20 microseconds each. When doing the same from Python, it takes 400-800 microseconds. The big range depends on whether the cache is hot or cold (doing the same lookup twice is much faster than two different lookups, as the memory is already in cache). A raw Python hash table takes ca. 40 microseconds. My guess is that the extra overhead of diskhash in Python comes from boxing/unboxing the values on each call, while the native Python hash table keeps boxed objects around (which is also responsible for its extra memory usage). Still, this is very acceptable.

Design

The format on disk is pretty simple:

[HEADER]
    - magic number (versioned)
    - options
    - size of table
    - number of used slots
[TABLE OF INDICES]
    - integer indices into data table [with value 0 representing NULL and other indices in 1-based format]
[DATA TABLE]
    - [key/value] pairs

The format on disk is the same as the format in memory, so loading is simply a call to mmap(). Conflicts are handled with linear probing (the table load is kept below 50%). When it becomes necessary to expand the table, a completely new table (1.7x as large as the current one) is built, all the elements are inserted into it, and then we switch over to it. This can be quite expensive, but the cost is amortized, so insertions are still O(1), and it is possible to pre-allocate a large table if desired.

The indirection (there is a table of indices pointing to a data table) keeps disk space down at the cost of an extra step (and probably an extra memory access) at lookup time. The code is smart enough to switch from 32 to 64 bit indices as the table grows.
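Here is a toy, in-memory Python sketch of the strategy described above (indirection through an index table, 0 as NULL with 1-based indices, linear probing, <50% load, and 1.7x growth). It is only an illustration, not the actual C implementation:

# Toy in-memory sketch of the design described above; not the real diskhash code.
class ToyDiskHash:
    def __init__(self, slots=7):
        self.indices = [0] * slots      # 0 means NULL/empty; stored indices are 1-based
        self.data = []                  # data table: list of (key, value) pairs

    def _slot(self, key, indices):
        # find the slot for `key` using linear probing
        h = hash(key) % len(indices)
        while indices[h] != 0 and self.data[indices[h] - 1][0] != key:
            h = (h + 1) % len(indices)
        return h

    def insert(self, key, value):
        if 2 * (len(self.data) + 1) >= len(self.indices):  # keep table load below 50%
            self._grow()
        slot = self._slot(key, self.indices)
        if self.indices[slot] == 0:                        # ignore repeated keys
            self.data.append((key, value))
            self.indices[slot] = len(self.data)            # 1-based index into data table

    def _grow(self):
        # build a completely new index table, 1.7x as large, and re-insert everything
        new_indices = [0] * int(len(self.indices) * 1.7 + 1)
        for i, (key, _) in enumerate(self.data, start=1):
            h = hash(key) % len(new_indices)
            while new_indices[h] != 0:
                h = (h + 1) % len(new_indices)
            new_indices[h] = i
        self.indices = new_indices

    def lookup(self, key):
        slot = self._slot(key, self.indices)
        if self.indices[slot] == 0:
            return None
        return self.data[self.indices[slot] - 1][1]

ht = ToyDiskHash()
ht.insert("key", 9)
print(ht.lookup("key"))   # 9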

There is currently no support for deleting keys.

Experience coding this

C is a pain, but compiling C is fast

I had actually not written any C code in many years. I often use C++, but raw C is very different. Making sure that every cleanup path is correct leads to a lot of boilerplate and copy&pasting. Without exceptions and destructors, checking the return value of every function you call is a pain. It is not hard, but it sure is tedious.

One thing that was very cool is how fast compilation is. The first time I ran gcc, I thought there must have been something wrong as the command was instantaneous.

Nope, compilation of the library and the test driver takes <0.2s (slightly slower if you use optimizations; it goes all the way up to 0.3s).

This means that compiling and running C is about as fast as starting an interpreter.

Writing a disk based hash is easy, packaging the code is hard

The two hardest things in computer science are not naming things or cache invalidation but installing packages on Linux and solving packaging errors.

I first wrote a Python wrapper using ctypes, but while it was trivial to write and it worked well, I could not find a way to package it. Finally, I decided it was easier to just use the raw C API instead of figuring out how to convince setuptools to do what I wanted.

The Haskell packaging was slightly easier, but it still required a few tries until all the right files were correctly included in the package (which is why there were 3 releases before it worked: the code is the same, it was just me fiddling with the packaging).

Examples

The following examples all create a hashtable to store longs (int64_t), then set the value associated with the key "key" to 9. In the current API, the maximum size of the keys needs to be pre-specified, which is the value 15 below.

Raw C

#include <stdio.h>
#include <inttypes.h>
#include <fcntl.h> /* for O_RDWR and O_CREAT */
#include "diskhash.h"

int main(void) {
    HashTableOpts opts;
    opts.key_maxlen = 15;
    opts.object_datalen = sizeof(int64_t);
    char* err = NULL;
    HashTable* ht = dht_open("testing.dht", opts, O_RDWR|O_CREAT, &err);
    if (!ht) {
        if (!err) err = "Unknown error";
        fprintf(stderr, "Failed opening hash table: %s.\n", err);
        return 1;
    }
    int64_t i = 9;
    dht_insert(ht, "key", &i);

    int64_t* val = (int64_t*) dht_lookup(ht, "key");
    printf("Looked up value: %" PRId64 "\n", *val);

    dht_free(ht);
    return 0;
}

Haskell

In Haskell, you have different types/functions for read-write and read-only hashtables.

Read write example:

import Data.DiskHash
import Data.Int
main = do
    ht <- htOpenRW "testing.dht" 15
    htInsertRW ht "key" (9 :: Int64)
    val <- htLookupRW "key" ht
    print val

Read only example (htLookupRO is pure in this case):

import Data.DiskHash
import Data.Int
main = do
    ht <- htOpenRO "testing.dht" 15
    let val :: Int64
        val = htLookupRO "key" ht
    print val

Python

Python’s interface is more limited and only integers are supported as values in the hash table (they are stored as 64-bit integers).

import diskhash
tb = diskhash.Str2int("testing.dht", 15)
tb.insert("key", 9)
print(tb.lookup("key"))

The Python interface is currently Python 3 only. Patches to extend it to 2.7 are welcome, but it’s not a priority.

C++

In C++, a simple wrapper is defined, which provides a modicum of type-safety. You use the DiskHash<T> template. Additionally, errors are reported through exceptions (both std::bad_alloc and std::runtime_error can be thrown) and not return codes.

#include <iostream>
#include <string>
#include <cstdint> // for uint64_t

#include <diskhash.hpp>

int main() {
    const int key_maxlen = 15;
    dht::DiskHash<uint64_t> ht("testing.dht", key_maxlen, dht::DHOpenRW);
    std::string line;
    uint64_t ix = 0;
    while (std::getline(std::cin, line)) {
        if (line.length() > key_maxlen) {
            std::cerr << "Key too long: '" << line << "'. Aborting.\n";
            return 2;
        }
        const bool inserted = ht.insert(line.c_str(), ix);
        if (!inserted) {
            std::cerr  << "Found repeated key '" << line << "' (ignored).\n";
        }
        ++ix;
    }
    return 0;
}

I tried Haskell for 5 years and here’s how it was

One blogpost style which I find almost completely useless is “I tried Programming Language X for 5 days and here’s how it was.” Most of the time, the first impression is superficial, discussing syntax and whether you could get Hello World to run.

This blogpost is I tried Haskell for 5 years and here’s how it was.

In the last few years, I have been (with others) developing ngless, a domain specific language and interpreter for next-generation sequencing. For partly accidental reasons, the interpreter is written in Haskell. Even though I kept using other languages (mostly Python and C++), I have now used Haskell quite extensively for a serious, medium-sized project (11,270 lines of code). Here are some scattered notes on Haskell:

There is a learning curve

Haskell is a different type of language. It takes a while to fully get used to it if you’re coming from a more traditional background.

I have debugged code in Java, even though I never really learned (or wrote) any Java. Java is just a C++ pidgin language.

The same is not true of Haskell. If you have never looked at Haskell code, you may have difficulty following even simple functions.

Once you learn it, though, you get it.

Haskell has some very nice libraries

There really are very nice libraries, written by people doing really useful things.

Conduit and Parsec are the basis of a lot of ngless code.

Here is an excellent curated list of the Haskell library world (added May 4).

Haskell libraries are sometimes hard to figure out

I like to think that you need both hard documentation and soft documentation.

Hard documentation is where you describe every argument to a function and its effects. It is like a reference work (think of man pages). Soft documentation consists of tutorials, examples, and more descriptive text. Well-documented software and libraries will have both (there is no need for anything in between; I don’t want soft-serve documentation).

Haskell libraries often have extremely hard documentation: they will explain the details of functions, but little in the way of soft documentation. This makes it very hard to understand why a function could be useful in the first place and in which contexts to use this library.

This is exacerbated by the often extremely abstract nature of some of the libraries. A case in point is the very useful MonadBaseControl class. Trust me, this is useful. However, because it is so generic, it is hard to immediately grasp what it does.

I do not wish to over-generalize. Conduit, mentioned above, has tutorials, blogposts, as well as hard documentation.

Haskell sometimes feels like C++

Like C++, Haskell is (in part) a research project with a single initial Big Idea and a few smaller ones. In Haskell’s case, the Big Idea was purely functional lazy evaluation (or, if you want to be pedantic, call it “non-strict” instead of lazy). In C++’s case, the Big Idea was high level object orientation without loss of performance compared to C.

Both C++ and Haskell are happy to incorporate academic suggestions into real-world computer languages. This doesn’t need elaboration in the case of Haskell, but C++ has also been happy to be at the cutting edge. For example, 20 years ago, you could already use C++ templates to perform (limited) programming with dependent types. C++ really pioneered the mechanism of generics and templates.

Like C++, Haskell is a huge language, where there are many ways to do something. You have multiple ways to represent strings, you have accidents of history kept for backwards compatibility. If you read an article from 10 years ago about the best way to do something in the language, that article is probably outdated by two generations.

Like C++, Haskell’s error messages take a while to get used to.

Like C++, there is a tension in the community between the purists and the practitioners.

Performance is hard to figure out

Haskell and GHC generally let me get good performance, but it is not always trivial to figure out a priori which code will run faster and in less memory.

In some trivial sense, you always depend on the compiler to make your code faster (i.e., if the compiler was infinitely smart, any two programs that produce the same result would compile to the same highly efficient code).

In practice, of course, compilers are not infinitely smart and so there is faster and slower code. Still, in many languages you can look at two pieces of code and reasonably guess which one will be faster, at least within an order of magnitude.

Not so with Haskell. Even very smart people struggle with very simple examples. This is because the most generic implementation of the code tends to be very inefficient. However, GHC can be very smart and make your software very fast. This works 90% of the time, but sometimes you write code that does not trigger all the right optimizations and your function suddenly becomes 1,000x slower. I have once or twice written two almost identical versions of a function with large differences in performance (orders of magnitude).

This leads to the funny situation that Haskell is (partially correctly) seen as an academic language used by purists obsessed with elegance; while in practice, a lot of effort goes into making the code as compiler-friendly as possible.

For the most part, though, this is not a big issue. Most of the code will run just fine and you optimize the inner loops at the end (just like in any other language), but it’s a pitfall to watch out for.

The easy is hard, the hard is easy

For minor tasks (converting between two file formats, for example), I will not use Haskell; I’ll do it in Python: it has a better REPL environment, there is no need to set up a cabal file, and it is easier to express simple loops, &c. The easy things are often a bit harder to do in Haskell.

However, in Haskell, it is trivial to add some multithreading capability to a piece of code with complete assurance of correctness. The line that “if it compiles, it’s probably correct” is often true.

Stack changed the game

Before stack came along, it was painful to make sure you had all the right libraries installed in a compatible way. Since stack was released, working in Haskell really has become much nicer. Tooling matters.

The really big missing piece is the equivalent of ccache for Haskell.

Summary

Haskell is a great programming language. It requires some effort at the beginning, but you get to learn a very different way of thinking about your problems. At the same time, the ecosystem matured significantly (hopefully signalling a trend) and the language can be great to work with.