Similarity of the dog and human gut microbiomes in gene content and response to diet

My paper Similarity of the dog and human gut microbiomes in gene content and response to diet was published yesterday in Microbiome. It was a long time in the making (and almost a year in the review process: submitted 11 May 2017), but now it’s finally published! It has been picking up quite a bit of press, which is nice too.

It’s open access, so everyone can read it, but here’s a basic summary:

  1. We built a non-redundant gene catalog for the dog gut microbiome. We then compared it to the equivalent gene catalogs for humans, mice, and pigs (since all these catalogs were built on Illumina data using MOCAT, they were easily comparable). Somewhat surprisingly, we found a high overlap between the dog microbiome genes and that of the human microbiome (higher than for the other non-human animals).
    Fig1f.pngWe can also map a much higher fraction of short reads from the dog gut microbiome to the human gut gene catalog than for the other non-human hosts.

    Fig1e.png

  2. When we used metaSNV to analyse the SNV, we saw strain separation between the human and dog strains (for the same species). Thus, we do not share organisms with our dogs! Only similar species. I have presented this conclusion (in talks and informally) and different people in the field told be both that “of course strains are host specific” and “I was expecting that we’d be getting bacteria from our dogs all the time, this is not what I expected at all.”
  3. Diet shifts the microbiome of the dogs (the dogs were randomly assigned to either a high-protein or a low-protein diet). We had two samples from each dog: one after they ate the baseline diet (which was a low-protein diet) and a second after the random switch to a high-protein diet. We could thus  the microbiome of overweight dogs changes more than that of their healthier counterparts (see the Anna Karenina hypothesis: unhappy microbiomes are less alike).

    (in the Figure, HPLC/OW refers to overweight dogs on the high-protein diet).

  4. We also saw that some taxa dramatically changed their prevalence in response to the diet. In particular, Lactobacillus ruminis was completely absent in the high-protein diet. Not just lower abundant, but undetectable.

Those are the highlights, the full paper has a bit more. I’ll try to post a couple of extra posts with some interesting technical tidbits. For example, how we used the little-known Gehan statistic.

Advertisements

Bug-for-bug backwards compatibility in NGLess

Recently, I found a bug in NGLess. In some rare conditions, it would mess up and reads could be lost. Obviously, I fixed it.

If you’ve used NGLess before (or read about it), you’ll know that every ngless script starts with a version declaration:

ngless "x.y"

This indicates which version of NGLess should be running the code. Since the bug changed the results, I needed to make a new version (we are now at version 0.8).

The question is what should NGLess do when it runs a script that uses an older version declaration? I see three options:

1. Silently update everyone to the new behavior

This is the typical software behavior: the new system is better, why wouldn’t you want to upgrade? Because we’d be breaking our promise to make ngless reproducible. The whole point of having the version line is to ensure that you will always get the same results. We also don’t want to make people afraid of upgrading.

2. Refuse to run older scripts and force everyone to upgrade

This is another option: we could just refuse to run old code. Now, at the very least, there would be no silent changes. It’s still possible to install older versions (and bioconda/biocontainers makes this easy), so if you really needed to, you could still run the older scripts.

3. Emulate the old (buggy) behavior when the user requests the old versions

In the end, I went with this option.

The old behavior is not that awful. Some reads are handled completely wrong, but the reason why the bug was able to persist for so long is that it only shows up in a few reads in a million. Thus, while this means that NGLess will sometimes knowingly output results that are suboptimal, I found it the best solution. A warning is printed, asking the user to upgrade.

How NGLess uses its version declaration

NGLess is my metagenomics tool, which is based on a domain specific language. So, NGLess is both a language and a tool (which implements the language).

Since the beginning, ngless has had a focus on reproducibility and one the small ways in which this was implemented was that ngless requires a version declaration. Every ngless script is required to start with a version declaration:

    ngless "0.5"

This was always intended to enable the language to change while keeping perfect reproducibility of past scripts. Until recently, though, this was just hypothetical.

In October, I taught a course on NGLess and it became clear that one of the minor inconsistencies in the previous version of the language (at the time, version “0.0”) was indeed confusing. In the previous version of the language, the preprocess function modified its arguments. No other function did this.

In version “0.5” (which was released on November 1st), preprocess is now a pure function, so that you must assign its output to a value.

However, and this is where the version declaration comes into play, the newer executable still accepts scripts with the version declaration ngless "0.0". Furthermore, if you declare your script as using ngless 0.0, then the old behaviour is used. Thus, we fixed the language, but nobody needs to update their scripts.

Implementation note (which shouldn’t concern the user, but may be interesting to others): before interpretation, ngless will transform the input script, adding checks and optimizing it. A new pass (which is only enabled is the user requested version “0.0”), simply transforms the older code into its newer counterpart. Then, the rest of the process proceeds as if the user had typed in the newer version.

Classifying protists into 155 (hierarchically organized) classes

An important component of my recent paper (previous post) on imaging protist (micro-eukaryotes) communities is a classifier that classifies each individual object into one of 155 classes. These classes are organized hierarchically, so that the first level corresponds to living/non-living object; then, if living, classifies it into phyla, and so on. This is the graphical representation we have in the paper:

Using a large training set (>18,000), we built a classifier capable of classifying objects into one these 155 classes with >82%.

What is the ML architecture we use? In the end, we use the traditional system: we compute many features and use a random forest trained on the full 155 classes. Why a random forest?

A random forest should be the first thing you try on a supervised classification problem (and perhaps also the last, lest you overfit). I did spent a few weeks trying different variations on this idea and none of them beat this simplest possible system. Random forests are also very fast to train (especially if you have a machine with many cores, as each tree can be learned independently).

As usual, the features were where the real work went. A reviewer astutely asked whether we really needed so many features (we compute 480 of them). The answer is yes. Even when selecting just the best features (which we wouldn’t know apriori, but let’s assume we had an oracle), it seems that we really do need a lot of features:

(This is Figure 3 — supplement 4: https://elifesciences.org/articles/26066/figures#fig3s4sdata1)

We need at least 200 features and it never really saturates. Furthermore, features are computed in groups (Haralick features, Zernike features, …), so we would not gain much

In terms of implementation, features were computed with mahotas (paper) and machine learning was done with scikit-learn (paper).

§

What about Deep Learning? Could we have used CNNs? Maybe, maybe not. We have a fair amount of data (>18,000 labeled samples), but some of the classes are not as well represented (in the pie chart above, the width of the classes represents how many objects are in the training set). A priori, it’s not clear it would have helped much.

Also, we may already be at the edge of what’s possible. Accuracy above 80% is already similar to human performance (unlike some of the more traditional computer vision problems, where humans perform with almost no mistakes and computers had very high error rates prior to the neural network revolution).

New papers I: imaging environmental samples of micro eukaryotes

This week, I had two first author papers published:

  1. Quantitative 3D-imaging for cell biology and ecology of environmental microbial eukaryotes 
  2. Jug: Software for Parallel Reproducible Computation in Python

I intend to post on both of them over the next week or so, but I will start with the first one.

The basic idea is that just as metagenomics was the application of lab techniques (sequencing) that had been developed for pure cultures to environmental samples, we are moving from imaging cell cultures (the type of work I did during my PhD and shortly afterwards) to imaging environmental samples. These are, thus, mixed samples of microbes (micro-eukaryotes, not bacteria, but remember: protists are microbes too).

Figure 1 from paper

Figure 1 from the paper depicting the process (a) and the results (b & c).

The result is a phenotypic view of the whole community, not just the elements that you can easily grow in the lab. As it is not known apriori which organisms will be present, we use generic eukaryotic dyes, tagging DNA, membranes, and the exterior. In addition, chlorophyll is auto-fluorescence, so we get a free extra channel.

With automated microscopes and automated analysis, we obtained images of 300,000 organisms, which were classified into 155 classes. A simple machine-learning system can perform this classification with 82% accuracy, which is similar to (or better than) the inter-operator variability in similar problems.

The result is both a very large set of images as well as a large set of features, which can be exploited for understanding the microbial community.

ANN : Diskhash. Disk-based, persistent hash tables

A few weeks ago, I decided to finally scratch an itch I’ve had for a while: I had a few days off from work and implemented a persistent, disk-based, hash table. Funnily enough, I’m now intensively using it at work, but a priori it felt more like a side project than a work one (it’s often a fuzzy border).

A disk based hashtable

The idea is very simple: it’s a basic hash table which is run on mmap()ed memory so that it can be loaded from disk with a single system call. I’ve heard this type of system to be referred to as “baked data”: you build structures in memory that can be written from/to disk without any need for parsing/converting.

I implemented it all in C (because it is the lowest-common denominator), but there are interfaces in C++, Python, and Haskell. The disk format is fixed, so all these interfaces can work with the same tables. You can jump to the bottom of the post to see code examples. 

Performance

My usage is mostly to build the hashtable once and then reuse it many times. Several design choices reflect this bias and so does performance. Building the hash table can take a while. A big (roughly 1 billion entries) table took almost 1 hour to build. This compares to about 10 minutes for building a Python hashtable of the same size.

On disk, this table takes up 32GB (just the keys and data use up 21GB so I find the overhead acceptable). This compares with almost 200GB for the Python version. Additionally, several processes on the same machine can share the memory map (the operating system will do this automatically for you), further reducing memory usage when more than one process is running.

Using the C++ interface, I measured lookups as taking circa 10-20 microseconds per lookup. When doing the same from Python, it takes 400-800 microseconds. The big difference depends on whether the cache is hot or cold (doing the same lookup twice is much faster than two different lookups as the memory is already in cache). A raw Python hash table takes ca. 40 microseconds. My guess is that the extra overhead of diskhash in Python is boxing/unboxing of types, while the Python version uses boxed types (which is also responsible for the extra memory usage). Still, this is very acceptable.

Design

The format on disk is pretty simple:

[HEADER]
    - magic number (versioned)
    - options
    - size of table
    - number of used slots
[TABLE OF INDICES]
    - integer indices into data table [with value 0 representing NULL and other indices in 1-based format]
[DATA TABLE]
    - [key/value] pairs

The format on disk is the same as the format on memory, thus loading is simply calling mmap(). Conflicts are handled using linear indexing (table load is kept at <50%). When it is necessary to expand the table, a completely new table is built (that is 1.7x as large as the current one), all the elements are inserted into this table and, then, we switch to that table. This can be quite expensive, but is amortized so, insertions are still O(1) and it is possible to pre-allocate a large table if desired.

The indirection (there is a table of indices pointing to a data table) keeps disk space down at the cost of an extra step (and probably an extra memory access) at lookup time. The code is smart enough to switch from 32 to 64 bit indices as the table grows.

There is currently no support for deleting keys.

Experience coding this

C is a pain, but compiling C is fast

I had actually not written any C code in many years. I often use C++, but raw C code is very different. Making sure that the every cleanup path is correct leads to a lot of boilerplate and copy&pasting. Without exceptions and destructors, checking the return value of all functions that we call is a pain. It is not hard, but it sure is tedious.

One thing that was very cool is how fast compilation is. The first time I ran gcc, I thought there must have been something wrong as the command was instantaneous.

Nope, compilation of the library and the test driver takes <0.2s (slightly slower if you use optimizations; it goes all the way up to 0.3s).

This means that compiling and running C is about as fast as starting an interpreter.

Writing a disk based hash is easy, packaging the code is hard

The two hardest things in computer science are not naming things or cache invalidation but installing packages on Linux and solving packaging errors.

I first wrote a Python wrapper using ctypes, but while it was trivial to write and it worked well, I could not find a way to package it. Finally, I decided it was easier to just use the raw C API instead of figuring out how to convince setuptools to do what I wanted.

The haskell packaging was slightly easier, but it still required a few tries until all the right files were correctly included in the package (which is why there were 3 releases until it worked: the code is the same, it was just me fiddling with packaging).

Examples

The following examples all create a hashtable to store longs (int64_t), then set the value associated with the key "key" to 9. In the current API, the maximum size of the keys needs to be pre-specified, which is the value 15 below.

Raw C

#include <stdio.h>
#include <inttypes.h>
#include "diskhash.h"

int main(void) {
    HashTableOpts opts;
    opts.key_maxlen = 15;
    opts.object_datalen = sizeof(int64_t);
    char* err = NULL;
    HashTable* ht = dht_open("testing.dht", opts, O_RDWR|O_CREAT, &err);
    if (!ht) {
        if (!err) err = "Unknown error";
        fprintf(stderr, "Failed opening hash table: %s.\n", err);
        return 1;
    }
    long i = 9;
    dht_insert(ht, "key", &i);
    
    long* val = (long*) dht_lookup(ht, "key");
    printf("Looked up value: %l\n", *val);

    dht_free(ht);
    return 0;
}

Haskell

In Haskell, you have different types/functions for read-write and read-only hashtables.

Read write example:

import Data.DiskHash
import Data.Int
main = do
    ht <- htOpenRW "testing.dht" 15
    htInsertRW ht "key" (9 :: Int64)
    val <- htLookupRW "key" ht
    print val

Read only example (htLookupRO is pure in this case):

import Data.DiskHash
import Data.Int
main = do
    ht <- htOpenRO "testing.dht" 15
    let val :: Int64
        val = htLookupRO "key" ht
    print val

Python

Python’s interface is more limited and only integers are supported as values in the hash table (they are stored as 64-bit integers).

import diskhash
tb = diskhash.Str2int("testing.dht", 15)
tb.insert("key", 9)
print(tb.lookup("key"))

The Python interface is currently Python 3 only. Patches to extend it to 2.7 are welcome, but it’s not a priority.

C++

In C++, a simple wrapper is defined, which provides a modicum of type-safety. You use the DiskHash<T> template. Additionally, errors are reported through exceptions (both std::bad_alloc and std::runtime_error can be thrown) and not return codes.

#include <iostream>
#include <string>

#include <diskhash.hpp>

int main() {
    const int key_maxlen = 15;
    dht::DiskHash<uint64_t> ht("testing.dht", key_maxlen, dht::DHOpenRW);
    std::string line;
    uint64_t ix = 0;
    while (std::getline(std::cine, line)) {
        if (line.length() > key_maxlen) {
            std::cerr << "Key too long: '" << line << "'. Aborting.\n";
            return 2;
        }
        const bool inserted = ht.insert(line.c_str(), ix);
        if (!inserted) {
            std::cerr  << "Found repeated key '" << line << "' (ignored).\n";
        }
        ++ix;
    }
    return 0;
}

I tried Haskell for 5 years and here’s how it was

One blogpost style which I find almost completely useless is “I tried Programming Language X for 5 days and here’s how it was.” Most of the time, the first impression is superficial discussing syntax and whether you could get Hello World to run.

This blogpost is I tried Haskell for 5 years and here’s how it was.

In the last few years, I have been (with others) developing ngless, a domain specific language and interpreter for next-generation sequencing. For partly accidental reasons, the interpreter is written in Haskell. Even though I kept using other languages (most Python and C++), I have now used Haskell quite extensively for a serious, medium-sized project (11,270 lines of code). Here are some scattered notes on Haskell:

There is a learning curve

Haskell is a different type of language. It takes a while to fully get used to it if you’re coming from a more traditional background.

I have debugged code in Java, even though I never really learned (or wrote) any Java. Java is just a C++ pidgin language.

The same is not true of Haskell. If you have never looked at Haskell code, you may have difficulty following even simple functions.

Once you learn it, though, you get it.

Haskell has some very nice libraries

You really have very nice libraries, written by people doing really useful things.

Conduit and Parsec are the basis of a lot of ngless code.

Here is an excellent curated list of Haskell library world (added May 4)

Haskell libraries are sometimes hard to figure out

I like to think that you need both hard documentation and soft documentation.

Hard documentation is where you describe every argument to a function and its effects. It is like a reference work (think of man pages). Soft documentation are tutorials and examples and more descriptive text. Well documented software and libraries will have both (there no need for anything in between, I don’t want soft serve documentation).

Haskell libraries often have extremely hard documentation: they will explain the details of functions, but little in the way of soft documentation. This makes it very hard to understand why a function could be useful in the first place and in which contexts to use this library.

This is exacerbated by the often extremely abstract nature of some of the libraries. Case in point, is the very useful MonadBaseControl class. Trust me, this is useful. However, because it is so generic, it is hard to immediately grasp what it does.

I do not wish to over-generalize. Conduit, mentioned above, has tutorials, blogposts, as well as hard documentation.

Haskell sometimes feels like C++

Like C++, Haskell is (in part) a research project with a single initial Big Idea and a few smaller ones. In Haskell’s case, the Big Idea was purely functional lazy evaluation (or, if you want to be pedantic, call it “non-strict” instead of lazy). In C++’s case, the Big Idea was high level object orientation without loss of performance compared to C.

Both C++ and Haskell are happy to incorporate academic suggestions into real-world computer languages. This doesn’t need elaboration in the case of Haskell, but C++ has also been happy to be at the cutting edge. For example, 20 years ago, you could already use C++ templates to perform (limited) programming with dependent types. C++ really pioneered the mechanism of generics and templates.

Like C++, Haskell is a huge language, where there are many ways to do something. You have multiple ways to represent strings, you have accidents of history kept for backwards compatibility. If you read an article from 10 years ago about the best way to do something in the language, that article is probably outdated by two generations.

Like C++, Haskell’s error messages take a while to get used to.

Like C++, there is a tension in the community between the purists and the practitioners.

Performance is hard to figure out

Haskell and GHC generally let me get good performance, but it is not always trivial to figure out a priori which code will run faster and in less memory.

In some trivial sense, you always depend on the compiler to make your code faster (i.e., if the compiler was infinitely smart, any two programs that produce the same result would compile to the same highly efficient code).

In practice, of course, compilers are not infinitely smart and so there faster and slower code. Still, in many languages you can look at two pieces of code and reasonably guess which one will be faster, at least within an order of magnitude.

Not so with Haskell. Even very smart people struggle with very simple examples. This is because the most generic implementation of the code tends to be very inefficient. However, GHC can be very smart and make your software very fast. This works 90% of the time, but sometimes you write code that does not trigger all the right optimizations and your function suddenly becomes 1,000x slower. I have once or twice written two almost identical versions of a function with large differences in performance (orders of magnitude).

This leads to the funny situation that Haskell is (partially correctly) seen as an academic language used by purists obsessed with elegance; while in practice, a lot of effort goes into making the code written as compiler-friendly as possible.

For the most part, though, this is not a big issue. Most of the code will run just fine and you optimize the inner loops at the end (just like in any other language), but it’s a pitfall to watch out for.

The easy is hard, the hard is easy

For minor tasks (converting between two file formats, for example), I will not use Haskell; I’ll do it Python: It has a better REPL environment, no need to set up a cabal file, it is easier to express simple loops, &c. The easy things are often a bit harder to do in Haskell.

However, in Haskell, it is trivial to add some multithreading capability to a piece of code with complete assurance of correctness. The line that if it compiles, it’s probably correct is often true.

Stack changed the game

Before stack came on the game, it was painful to make sure you had all the right libraries installed in a compatible way. Since stack was released, working in Haskell really has become much nicer. Tooling matters.

The really big missing piece is the equivalent of ccache for Haskell.

Summary

Haskell is a great programming language. It requires some effort at the beginning, but you get to learn a very different way of thinking about your problems. At the same time, the ecosystem matured significantly (hopefully signalling a trend) and the language can be great to work with.