ANN : Diskhash. Disk-based, persistent hash tables

A few weeks ago, I decided to finally scratch an itch I’ve had for a while: I had a few days off from work and implemented a persistent, disk-based, hash table. Funnily enough, I’m now intensively using it at work, but a priori it felt more like a side project than a work one (it’s often a fuzzy border).

A disk based hashtable

The idea is very simple: it’s a basic hash table which is run on mmap()ed memory so that it can be loaded from disk with a single system call. I’ve heard this type of system to be referred to as “baked data”: you build structures in memory that can be written from/to disk without any need for parsing/converting.

I implemented it all in C (because it is the lowest-common denominator), but there are interfaces in C++, Python, and Haskell. The disk format is fixed, so all these interfaces can work with the same tables. You can jump to the bottom of the post to see code examples. 

Performance

My usage is mostly to build the hashtable once and then reuse it many times. Several design choices reflect this bias and so does performance. Building the hash table can take a while. A big (roughly 1 billion entries) table took almost 1 hour to build. This compares to about 10 minutes for building a Python hashtable of the same size.

On disk, this table takes up 32GB (just the keys and data use up 21GB so I find the overhead acceptable). This compares with almost 200GB for the Python version. Additionally, several processes on the same machine can share the memory map (the operating system will do this automatically for you), further reducing memory usage when more than one process is running.

Using the C++ interface, I measured lookups as taking circa 10-20 microseconds per lookup. When doing the same from Python, it takes 400-800 microseconds. The big difference depends on whether the cache is hot or cold (doing the same lookup twice is much faster than two different lookups as the memory is already in cache). A raw Python hash table takes ca. 40 microseconds. My guess is that the extra overhead of diskhash in Python is boxing/unboxing of types, while the Python version uses boxed types (which is also responsible for the extra memory usage). Still, this is very acceptable.

Design

The format on disk is pretty simple:

[HEADER]
    - magic number (versioned)
    - options
    - size of table
    - number of used slots
[TABLE OF INDICES]
    - integer indices into data table [with value 0 representing NULL and other indices in 1-based format]
[DATA TABLE]
    - [key/value] pairs

The format on disk is the same as the format on memory, thus loading is simply calling mmap(). Conflicts are handled using linear indexing (table load is kept at <50%). When it is necessary to expand the table, a completely new table is built (that is 1.7x as large as the current one), all the elements are inserted into this table and, then, we switch to that table. This can be quite expensive, but is amortized so, insertions are still O(1) and it is possible to pre-allocate a large table if desired.

The indirection (there is a table of indices pointing to a data table) keeps disk space down at the cost of an extra step (and probably an extra memory access) at lookup time. The code is smart enough to switch from 32 to 64 bit indices as the table grows.

There is currently no support for deleting keys.

Experience coding this

C is a pain, but compiling C is fast

I had actually not written any C code in many years. I often use C++, but raw C code is very different. Making sure that the every cleanup path is correct leads to a lot of boilerplate and copy&pasting. Without exceptions and destructors, checking the return value of all functions that we call is a pain. It is not hard, but it sure is tedious.

One thing that was very cool is how fast compilation is. The first time I ran gcc, I thought there must have been something wrong as the command was instantaneous.

Nope, compilation of the library and the test driver takes <0.2s (slightly slower if you use optimizations; it goes all the way up to 0.3s).

This means that compiling and running C is about as fast as starting an interpreter.

Writing a disk based hash is easy, packaging the code is hard

The two hardest things in computer science are not naming things or cache invalidation but installing packages on Linux and solving packaging errors.

I first wrote a Python wrapper using ctypes, but while it was trivial to write and it worked well, I could not find a way to package it. Finally, I decided it was easier to just use the raw C API instead of figuring out how to convince setuptools to do what I wanted.

The haskell packaging was slightly easier, but it still required a few tries until all the right files were correctly included in the package (which is why there were 3 releases until it worked: the code is the same, it was just me fiddling with packaging).

Examples

The following examples all create a hashtable to store longs (int64_t), then set the value associated with the key "key" to 9. In the current API, the maximum size of the keys needs to be pre-specified, which is the value 15 below.

Raw C

#include <stdio.h>
#include <inttypes.h>
#include "diskhash.h"

int main(void) {
    HashTableOpts opts;
    opts.key_maxlen = 15;
    opts.object_datalen = sizeof(int64_t);
    char* err = NULL;
    HashTable* ht = dht_open("testing.dht", opts, O_RDWR|O_CREAT, &err);
    if (!ht) {
        if (!err) err = "Unknown error";
        fprintf(stderr, "Failed opening hash table: %s.\n", err);
        return 1;
    }
    long i = 9;
    dht_insert(ht, "key", &i);
    
    long* val = (long*) dht_lookup(ht, "key");
    printf("Looked up value: %l\n", *val);

    dht_free(ht);
    return 0;
}

Haskell

In Haskell, you have different types/functions for read-write and read-only hashtables.

Read write example:

import Data.DiskHash
import Data.Int
main = do
    ht <- htOpenRW "testing.dht" 15
    htInsertRW ht "key" (9 :: Int64)
    val <- htLookupRW "key" ht
    print val

Read only example (htLookupRO is pure in this case):

import Data.DiskHash
import Data.Int
main = do
    ht <- htOpenRO "testing.dht" 15
    let val :: Int64
        val = htLookupRO "key" ht
    print val

Python

Python’s interface is more limited and only integers are supported as values in the hash table (they are stored as 64-bit integers).

import diskhash
tb = diskhash.Str2int("testing.dht", 15)
tb.insert("key", 9)
print(tb.lookup("key"))

The Python interface is currently Python 3 only. Patches to extend it to 2.7 are welcome, but it’s not a priority.

C++

In C++, a simple wrapper is defined, which provides a modicum of type-safety. You use the DiskHash<T> template. Additionally, errors are reported through exceptions (both std::bad_alloc and std::runtime_error can be thrown) and not return codes.

#include <iostream>
#include <string>

#include <diskhash.hpp>

int main() {
    const int key_maxlen = 15;
    dht::DiskHash<uint64_t> ht("testing.dht", key_maxlen, dht::DHOpenRW);
    std::string line;
    uint64_t ix = 0;
    while (std::getline(std::cine, line)) {
        if (line.length() > key_maxlen) {
            std::cerr << "Key too long: '" << line << "'. Aborting.\n";
            return 2;
        }
        const bool inserted = ht.insert(line.c_str(), ix);
        if (!inserted) {
            std::cerr  << "Found repeated key '" << line << "' (ignored).\n";
        }
        ++ix;
    }
    return 0;
}

5 thoughts on “ANN : Diskhash. Disk-based, persistent hash tables

Leave a comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.