Big Data Biology Lab Software Tool Commitments

Cross-posting from our group website.

Preamble. We produce two types of code artefacts: (i) code that supports the results of a results-driven paper and (ii) software tools intended for widespread use.

For an example of the first type, see the Code Ocean capsule linked from (Coelho et al., 2018). The main goal of this type of code release is to serve as an Extended Methods section for the paper. Hopefully, it will be useful for the small minority of readers who really want to dig into the methods or build upon the results, but the work aims at biological results.

This document focuses on the second type of code release: tools that are intended for widespread use. We’ve released a few of these in the last few years: Jug, NGLess, and Macrel. Here, the whole point is that others use the code. We also use these tools internally, but if nobody else ever adopts the tools, we will have fallen short.

The Six Commitments

  1. Five-year support (from date of publication) If we publish a tool as a paper, then we commit to supporting it for at least five years from the date of publication. We may stop developing new features, but if there are bugs in the released version, we will assume responsibility and fix them. We will also do any minor updates to keep the tool running (for example, if a new Python version breaks something in one of our Python-based tools, we will fix it). Typically, support is provided if you open an issue on the respective GitHub page and/or post to the respective mailing list.
  2. Standard, easy to install, packages Right now, this means: we provide conda packages. In the future, if the community moves to another system, we may move too.
  3. High-quality code with continuous integration All our published packages have continuous integration and try to follow best practices in coding.
  4. Complete documentation We provide documentation for the tools, including tutorials, example data, and reference manuals.
  5. Work well, fail well We strive to make our tools not only work well, but also “fail well”: that is, when the user provides erroneous input, we attempt to provide good quality error messages and to never produce bad output (including never producing partial outputs if the process terminates part-way through processing; one simple way to achieve this is sketched right after this list).
  6. Open source, open communication Not only do we provide the released versions of our tools as open source, but all the development is done in the open as well.
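For the “fail well” point above, the core trick for never producing partial outputs is simple enough to sketch. The following is an illustrative Python sketch (not code taken from any of our tools): output is written to a temporary file and only renamed into place once it is complete, so a crash part-way through never leaves a partial result behind.

import os
import tempfile

# Illustrative sketch only: write output to a temporary file in the
# destination directory and atomically rename it into place, so that a
# crash mid-write never leaves a partial output file.
def write_atomically(path, data):
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, 'wb') as out:
            out.write(data)
        os.replace(tmp_path, path)   # atomic on POSIX filesystems
    except BaseException:
        os.unlink(tmp_path)          # clean up; never expose a partial file
        raise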

Note for group members: This is a commitment from the group and, at the end of the day, the responsibility is Luis’. If you leave the group, you are not on the hook for five years; your responsibility is just the basic responsibility of any author: to be responsive to queries about what was described in the manuscript, but nothing beyond that. What this does mean is that we will not be submitting papers on tools that risk being difficult to maintain. In fact, while the goals above are phrased as outward-facing, they are also internally important so that we can keep working effectively even as group members move on.

Why NGLess took so long to become a robust tool (but now IS a robust tool)

Titus Brown posted that good research software takes 2-3 years to produce. As we are close to submitting a manuscript for our own NGLess, which took a bit longer than that, I will add some examples of why it took so long to get to this stage.

Part of why it took so long is due to people issues and to the fact that NGLess was mostly developed as we needed to process real data (and while I was working on other projects, rather than on NGLess full-time). But even if this had been someone’s full-time project, it would have taken a long time to get to where it is today.

It does not take so long because there are so many Big Ideas in there (I wish). NGLess contains just one Big Idea: a domain specific language that results in a tool that is not just a proof of concept, but is a better tool because it uses a DSL; everything else follows from that.

Rather, what takes a long time is to find all the weird corner cases. Most of these are issues the majority of users will never encounter, but collectively they make the tool so much more robust. Here are some examples:

  • Around Feb 2017, a user reported that some samples would crash ngless. The user did not seem to be doing anything wrong, but half-way through the processing, memory usage would start growing until the interpreter crashed. It took me the better part of two days to realize that their input files were malformed: they consisted of a few million well-formed reads, followed by a multi-gigabyte-long series of zero bytes. Their input FastQs were, in effect, a gzip bomb.

    There is a kind of open source developer that would reply to this situation by saying well, knuckle-head, don’t feed my perfect software your crappy data, but this is not the NGLess way (whose goal is to minimize the effort of real-life people), so we considered this a bug in NGLess and fixed it so that it now (correctly) complains of malformed input and exits.

  • Recently, we realized that if you use the motus module in a system with a badly working locale, ngless could crash. The reason is that, when using that module, we print out a reference for the paper, which includes some authors with non-ASCII characters in their names. Because of some weird combination of the Haskell runtime system and libiconv (which seems to generally be a mess), it crashes if the locale is not installed correctly.

    Again, there is a kind of developer who would respond to this by saying well, fix your locale installation, knuckle-head, but we added a workaround.

  • When I taught the first ngless workshop in late 2017, I realized that one of the inconsistencies in the language was causing a lot of confusion for the learners. So, the next release fixed that issue.
  • There are two variants of FastQ files, depending on whether the qualities are encoded by adding 33 or 64. It is generally trivial to infer which one is being used, so NGLess does so heuristically. In Feb 2017, a user reported that the heuristics were failing on one particular (well-formed) example, so we improved them (a simplified sketch of this kind of heuristic appears right after this list).
  • There are 25 commits which say they produce “better error messages”. Most of these resulted from a confused debugging session.
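As mentioned in the FastQ item above, here is a simplified Python sketch of that kind of quality-offset heuristic (a hypothetical illustration; the actual NGLess implementation is in Haskell and is more careful):

def guess_fastq_quality_offset(quality_lines):
    # Hypothetical, simplified heuristic; not the actual NGLess code.
    # Offset-64 encodings never use characters below chr(59) (';'), so any
    # lower character proves the file uses the offset-33 encoding.
    lowest = min(min(ord(c) for c in line) for line in quality_lines if line)
    if lowest < 59:
        return 33
    if lowest >= 64:
        return 64
    return 33   # ambiguous range: default to the modern offset-33 encoding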

None of these issues took that long to fix, but they only emerge through a prolonged beta use period.

You need users to try all types of bad input files, you need to try teaching the tool to learn where the pain points for new users are, you need someone to try it out on a system with a mis-installed locale, &c.

One possible conclusion is that, for certain kinds of scientific software, it is actually better if it is done as a side-project: you can keep publishing other things, you can apply it to several problems, and the long gestation period catches all these minor issues, even while you are being productive elsewhere. (This was also true of Jug: it was never really a project per se, but after a long time it became usable and got its own paper.)

How NGLess uses its version declaration

NGLess is my metagenomics tool, which is based on a domain specific language. So, NGLess is both a language and a tool (which implements the language).

Since the beginning, ngless has had a focus on reproducibility, and one of the small ways in which this is implemented is that every ngless script is required to start with a version declaration:

    ngless "0.5"

This was always intended to enable the language to change while keeping perfect reproducibility of past scripts. Until recently, though, this was just hypothetical.

In October, I taught a course on NGLess and it became clear that one of the minor inconsistencies in the previous version of the language (at the time, version “0.0”) was indeed confusing. In the previous version of the language, the preprocess function modified its arguments. No other function did this.

In version “0.5” (which was released on November 1st), preprocess is now a pure function, so that you must assign its output to a value.

However, and this is where the version declaration comes into play, the newer executable still accepts scripts with the version declaration ngless "0.0". Furthermore, if you declare your script as using ngless 0.0, then the old behaviour is used. Thus, we fixed the language, but nobody needs to update their scripts.

Implementation note (which shouldn’t concern the user, but may be interesting to others): before interpretation, ngless will transform the input script, adding checks and optimizing it. A new pass (which is only enabled if the user requested version “0.0”) simply transforms the older code into its newer counterpart. Then, the rest of the process proceeds as if the user had typed in the newer version.
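As a toy Python illustration of such a version-gated rewrite pass (the statement encoding here is made up; the real implementation is Haskell code operating on the NGLess AST):

def upgrade_script(statements, declared_version):
    # Toy illustration, not the real code. Each statement is either
    # ('call', name, arg) or ('assign', target, expr).
    if declared_version != "0.0":
        return statements            # newer scripts already use the pure form
    upgraded = []
    for statement in statements:
        if statement[0] == 'call' and statement[1] == 'preprocess':
            _, name, arg = statement
            # Old semantics: preprocess(x) modified x in place.
            # New semantics: the result must be assigned, x = preprocess(x).
            upgraded.append(('assign', arg, ('call', name, arg)))
        else:
            upgraded.append(statement)
    return upgraded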

ANN: Diskhash. Disk-based, persistent hash tables

A few weeks ago, I decided to finally scratch an itch I’ve had for a while: I had a few days off from work and implemented a persistent, disk-based, hash table. Funnily enough, I’m now intensively using it at work, but a priori it felt more like a side project than a work one (it’s often a fuzzy border).

A disk based hashtable

The idea is very simple: it’s a basic hash table which is run on mmap()ed memory so that it can be loaded from disk with a single system call. I’ve heard this type of system referred to as “baked data”: you build structures in memory that can be written to and read from disk without any need for parsing/converting.
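As a tiny Python illustration of the “baked data” idea (with a made-up header layout; the actual diskhash format is described below), loading amounts to an mmap() call plus reading fields at fixed offsets, with no parsing step:

import mmap
import struct

# Hypothetical layout for illustration only: an 8-byte magic number followed
# by two 64-bit counters. Nothing is parsed or converted: the bytes on disk
# already are the in-memory representation.
with open('testing.dht', 'rb') as f:
    buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    magic, n_slots, n_used = struct.unpack_from('<8sQQ', buf, 0)
    print(magic, n_slots, n_used)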

I implemented it all in C (because it is the lowest-common denominator), but there are interfaces in C++, Python, and Haskell. The disk format is fixed, so all these interfaces can work with the same tables. You can jump to the bottom of the post to see code examples. 

Performance

My usage is mostly to build the hashtable once and then reuse it many times. Several design choices reflect this bias and so does performance. Building the hash table can take a while. A big (roughly 1 billion entries) table took almost 1 hour to build. This compares to about 10 minutes for building a Python hashtable of the same size.

On disk, this table takes up 32GB (just the keys and data use up 21GB so I find the overhead acceptable). This compares with almost 200GB for the Python version. Additionally, several processes on the same machine can share the memory map (the operating system will do this automatically for you), further reducing memory usage when more than one process is running.

Using the C++ interface, I measured lookups at circa 10-20 microseconds each. When doing the same from Python, it takes 400-800 microseconds. The big difference depends on whether the cache is hot or cold (doing the same lookup twice is much faster than two different lookups, as the memory is already in cache). A raw Python hash table takes ca. 40 microseconds. My guess is that the extra overhead of diskhash in Python comes from boxing/unboxing values, whereas the native Python dict keeps its values boxed throughout (which is also what is responsible for its extra memory usage). Still, this is very acceptable.

Design

The format on disk is pretty simple:

[HEADER]
    - magic number (versioned)
    - options
    - size of table
    - number of used slots
[TABLE OF INDICES]
    - integer indices into data table [with value 0 representing NULL and other indices in 1-based format]
[DATA TABLE]
    - [key/value] pairs

The format on disk is the same as the format in memory, so loading is simply a call to mmap(). Collisions are handled with linear probing (the table load is kept below 50%). When it is necessary to expand the table, a completely new table is built (1.7x as large as the current one), all the elements are inserted into the new table, and then we switch to it. This can be quite expensive, but the cost is amortized, so insertions are still amortized O(1); it is also possible to pre-allocate a large table if desired.

The indirection (there is a table of indices pointing to a data table) keeps disk space down at the cost of an extra step (and probably an extra memory access) at lookup time. The code is smart enough to switch from 32 to 64 bit indices as the table grows.

There is currently no support for deleting keys.
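To make the lookup path concrete, here is a minimal in-memory Python sketch of the scheme described above (illustrative only; the real code is C and walks the mmap()ed buffers directly):

def lookup(key, index_table, data_table):
    # index_table holds 1-based indices into data_table; 0 marks an empty
    # slot. Collisions are resolved by linear probing, as described above.
    n_slots = len(index_table)
    slot = hash(key) % n_slots
    while True:
        ix = index_table[slot]
        if ix == 0:
            return None                      # empty slot: the key is absent
        stored_key, value = data_table[ix - 1]
        if stored_key == key:
            return value
        slot = (slot + 1) % n_slots          # probe the next slot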

Experience coding this

C is a pain, but compiling C is fast

I had actually not written any C code in many years. I often use C++, but raw C code is very different. Making sure that every cleanup path is correct leads to a lot of boilerplate and copy&pasting. Without exceptions and destructors, checking the return value of every function that we call is a pain. It is not hard, but it sure is tedious.

One thing that was very cool is how fast compilation is. The first time I ran gcc, I thought there must have been something wrong as the command was instantaneous.

Nope, compilation of the library and the test driver takes <0.2s (slightly slower if you use optimizations; it goes all the way up to 0.3s).

This means that compiling and running C is about as fast as starting an interpreter.

Writing a disk based hash is easy, packaging the code is hard

The two hardest things in computer science are not naming things or cache invalidation but installing packages on Linux and solving packaging errors.

I first wrote a Python wrapper using ctypes, but while it was trivial to write and it worked well, I could not find a way to package it. Finally, I decided it was easier to just use the raw Python C API instead of figuring out how to convince setuptools to do what I wanted.

The Haskell packaging was slightly easier, but it still required a few tries until all the right files were correctly included in the package (which is why there were 3 releases before it worked: the code is the same, it was just me fiddling with packaging).

Examples

The following examples all create a hashtable to store longs (int64_t), then set the value associated with the key "key" to 9. In the current API, the maximum size of the keys needs to be pre-specified, which is the value 15 below.

Raw C

#include <stdio.h>
#include <inttypes.h>
#include <fcntl.h>      /* for O_RDWR and O_CREAT */
#include "diskhash.h"

int main(void) {
    HashTableOpts opts;
    opts.key_maxlen = 15;
    opts.object_datalen = sizeof(int64_t);
    char* err = NULL;
    HashTable* ht = dht_open("testing.dht", opts, O_RDWR|O_CREAT, &err);
    if (!ht) {
        if (!err) err = "Unknown error";
        fprintf(stderr, "Failed opening hash table: %s.\n", err);
        return 1;
    }
    long i = 9;
    dht_insert(ht, "key", &i);
    
    long* val = (long*) dht_lookup(ht, "key");
    printf("Looked up value: %ld\n", *val);

    dht_free(ht);
    return 0;
}

Haskell

In Haskell, you have different types/functions for read-write and read-only hashtables.

Read write example:

import Data.DiskHash
import Data.Int
main = do
    ht <- htOpenRW "testing.dht" 15
    htInsertRW ht "key" (9 :: Int64)
    val <- htLookupRW "key" ht
    print val

Read only example (htLookupRO is pure in this case):

import Data.DiskHash
import Data.Int
main = do
    ht <- htOpenRO "testing.dht" 15
    let val :: Int64
        val = htLookupRO "key" ht
    print val

Python

Python’s interface is more limited and only integers are supported as values in the hash table (they are stored as 64-bit integers).

import diskhash
tb = diskhash.Str2int("testing.dht", 15)
tb.insert("key", 9)
print(tb.lookup("key"))

The Python interface is currently Python 3 only. Patches to extend it to 2.7 are welcome, but it’s not a priority.

C++

In C++, a simple wrapper is defined, which provides a modicum of type-safety. You use the DiskHash<T> template. Additionally, errors are reported through exceptions (both std::bad_alloc and std::runtime_error can be thrown) and not return codes.

#include <cstdint>
#include <iostream>
#include <string>

#include <diskhash.hpp>

int main() {
    const int key_maxlen = 15;
    dht::DiskHash<uint64_t> ht("testing.dht", key_maxlen, dht::DHOpenRW);
    std::string line;
    uint64_t ix = 0;
    while (std::getline(std::cin, line)) {
        if (line.length() > key_maxlen) {
            std::cerr << "Key too long: '" << line << "'. Aborting.\n";
            return 2;
        }
        const bool inserted = ht.insert(line.c_str(), ix);
        if (!inserted) {
            std::cerr  << "Found repeated key '" << line << "' (ignored).\n";
        }
        ++ix;
    }
    return 0;
}

Python “lists” are not lists. A history

Python “lists” are not lists. They are arrays.
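If you want to convince yourself of this empirically, here is a quick (and admittedly crude) check: appending at the end of a list is cheap, while inserting at the front has to shift every element, which is exactly what you expect from an array, not from a linked list.

from timeit import timeit

# Quick illustration, not a rigorous benchmark: front insertion is far
# slower than appending, because the underlying storage is a contiguous
# array that must be shifted.
setup = 'xs = list(range(100000))'
print(timeit('xs.append(0)', setup=setup, number=10000))
print(timeit('xs.insert(0, 0)', setup=setup, number=10000))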

§

This seems like the sort of rookie mistake that starting scientists make: they use terms imprecisely, they write like journalists not scientists.

I can totally imagine myself telling a student starting out: a list is a list and an array is an array, these are well defined computer science terms, you cannot use them interchangeably (and I can imagine their mental reaction: this guy is a pedantic prick [1]).

I tell the students to think like a politician: everything you write must still be defensible even when taken out of context [2].

Calling an array a list is wrong.

§

I wondered whether they had been lists in the past, so I went back and checked the earliest version of Python available, which happens to be Python 1.0.1.

After a minor bug fix [3], it actually compiled and ran on my Ubuntu 13.04 machine! It is somewhat crashy on errors, but otherwise works.

§

First find: already in Python 1.0.1, lists were arrays! This was already a years-old code base, so maybe at an even earlier time point, men were men and lists were lists. But by 1994, lists were arrays.

§

A surprising thing about the code: if you’re familiar with recent Python code, this will feel very familiar. Here is how you set a list member:

int
setlistitem(op, i, newitem)
    register object *op;
    register int i;
    register object *newitem;
{
    register object *olditem;
    if (!is_listobject(op)) {
        XDECREF(newitem);
        err_badcall();
        return -1;
    }
    if (i < 0 || i >= ((listobject *)op) -> ob_size) {
        XDECREF(newitem);
        err_setstr(IndexError, "list assignment index out of range");
        return -1;
    }
    olditem = ((listobject *)op) -> ob_item[i];
    ((listobject *)op) -> ob_item[i] = newitem;
    XDECREF(olditem);
    return 0;
}

This feels very similar to recent Python.

§

Even more surprising, here is the inner loop:

switch (opcode) {

case POP_TOP:
    v = POP();
    DECREF(v);
    break;

case ROT_TWO:
    v = POP();
    w = POP();
    PUSH(v);
    PUSH(w);
    break;

and so on… This is almost exactly the same as the most recent Python. We are still using fundamentally the same implementation of Python that was out 20 years ago.
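(If you want to see what the modern counterpart of that inner loop is chewing on, the dis module will disassemble any function into these opcodes; the exact opcode names vary between Python versions.)

import dis

# Disassembling a trivial function shows the same stack-machine style of
# bytecode that the 1994 interpreter loop above is processing.
def swap(a, b):
    a, b = b, a
    return a, b

dis.dis(swap)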

Update: See the next post for a response to the notion that I just didn’t get what a list refers to.

[1] Student: I meant list as a general one-thing-after-another-in-order. Wise older researcher: You mean a sequence. Student: Yeah, that. Wise older researcher: Then you should write sequence. At this point, the student attains enlightenment and starts applying for jobs in industry.
[2] A common complaint I have heard after several MS thesis defenses is the jury member took this one sentence from my introduction out of context and made a big deal out of it. It wasn’t even that related to my work.
[3] There is a function getline() in the standard library which was not there in the early 1990s. So, Python’s use of an internal function with the same name gets the compiler confused. Renaming the internal function to python_getline fixes it.

Story of Two Bugs

A while back, I got a bug report for mahotas-imread [1]:

PNG reads in inverted: This image loads in imread 0.3.1 incorrectly. It looks right if I xor it with 255, but I don’t think that’s all that’s wrong.

(the 16-bit image from the bug report)

The first thing I noticed was that this was a 16 bit image. If you’ve been coding for as long as I have, you’ll immediately think: it’s a byte-endianness issue [2]. So, I tested:

imshow(f.byteswap())

and saw the following:

(the byteswapped image: still wrong)

Not good. Then I looked at the hint that the original poster provided and it did seem to be true: imshow(~f) worked reasonably well. My working hypothesis was thus that there was a flag whereby the PNG data needed to be interpreted after flipping the bits. I also noticed another thing, though:

max_16bit_value = 2**16-1
imshow(max_16bit_value - f)

Also looks decent.

The TIFF format does allow you to specify whether zero is supposed to be white or black. Maybe PNG has a similar “feature.”

I read through the libpng documentation (which is not great), a bit through its source, and through online descriptions of the PNG format. Along the way, I noticed that converting the image to TIFF (with ImageMagick) and loading it with imread also gave the wrong result. Perhaps the TIFF reader had the same bug, or perhaps ImageMagick did [3].

Eventually, I realised that PNG files are in network order (i.e., in big-endian format) and the code did not convert them to little-endian. Thus, my initial intuition had been right: it was a byte-swapping error!

But in this case, why did imshow(f.byteswap()) result in a mangled image?

I started to suspect that matplotlib had a bug. I tried to do:

imshow(f.byteswap() / 2.**16)

and it resulted in the correct image being shown.

As it turned out, matplotlib does not do the right thing when given 16 bit files. Thus, I had this figured out. I fixed imread and released version 0.3.2 and closed the bug report.

§

A single bug is often easy to debug, but when you have multiple bugs interacting, it is much more than twice as hard.

§

Hairy details: You may want to stop reading now.

Consider the following identities (for a 16-bit value f):

255 == 0xff == -1 (mod 2**8)
-f == (f ^ 0xffff) + 1 (two's complement negation)
2**16 - f == -f + 2**16 == -f (because of overflow)

Thus, it should not be surprising that flipping the bits or subtracting the image from the maximum value gave similar-looking results: in two's complement, ~f is roughly -f. Not exactly the same, but similar enough that, by eye, the two are hard to tell apart.

Finally, it all makes sense when you realise that matplotlib assumes that non-8 bit images are floating point and does:

final_image = (input_image * 255)           # scale an assumed [0, 1] float image to [0, 255]
final_image = final_image.astype(np.uint8)  # for integer input, this just keeps the low byte

And what is multiplying by 255? Modulo 256, it is the same as multiplying by -1! So matplotlib would multiply by -1 and then keep the low-order bits, which means it showed a correct image if you pre-multiplied it by -1 (or flipped the bits) and gave it a byteswapped image!
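If you want to check that arithmetic with numpy itself (an illustrative check of the argument, not matplotlib’s actual code path):

import numpy as np

# For 16-bit data, multiplying by 255 and keeping only the low byte is the
# same as negating (multiplying by -1 modulo 2**16) and keeping the low byte.
f = np.arange(2**16, dtype=np.uint16)
via_mul = (f * np.uint16(255)).astype(np.uint8)
via_neg = (np.uint16(0) - f).astype(np.uint8)
assert np.array_equal(via_mul, via_neg)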

[1] People don’t always appreciate how valuable good bug reports are. Seriously, they are a huge help: you are testing the software for me. Unfortunately, either shyness or past bad experiences will often cause people who see something wrong to not report it.
[2] I now have over 15 years of experience coding (having had a relatively late start [I didn’t care much about computers until I was close to college age], I’ve caught up). If there is an area where I really feel that my experience shines through, it is debugging: I’ve seen enough mistakes and errors that my guesses as to what the bug is are more and more accurate (this is true even in code I have not seen).
[3] One of the reasons I started mahotas-imread was that I had not found a single library that could read the formats I wanted without a lot of bugs. So, I trust no one on this. In this case, the paranoia was unwarranted, as we’ll see.

Why Python is Better than Matlab for Scientific Software

This is an argument I made at EuBIAS when arguing for the use of mahotas and the Python numpy-stack for bioimage informatics. I had a few conversations around this and decided to write a longer post summarizing the whole argument.

0. My argument mostly applies to new projects

If you have legacy code in MATLAB, then it may make sense to just continue using it. If it works, don’t fix it. However, if your Matlab code keeps causing you pain, Python might be a solution.

Note too that porting code is not the same as writing from scratch. You can often convert code from MATLAB to Python in a small fraction of the time it would take you to start from scratch.

1. Python has caught up with Matlab and is in the process of overtaking it.

This is my main argument: the momentum is in Python’s direction. Even two or three years ago, Python was behind. Now, it’s sailing past Matlab.

[graph: number of lines of code over time in numpy, matplotlib, mahotas, skimage, and sklearn]

This graph shows the number of lines of code in some important projects for bioimage informatics/science in general (numpy, matplotlib, mahotas, skimage, and sklearn). As you can see, the base projects on the top (numpy and matplotlib) have been stable for some years, while the more applied packages at the bottom have exploded in recent years.

Depending on what you are doing, Python may even support it better. It is now Matlab which is playing catch-up with open source software (for example, Matlab is now introducing its own version of dataframes, which Python has through Pandas [itself a version of R’s dataframe object]).

The Python projects are also newer and tend, therefore, to be programmed in a more modern way: it is typical to find automated testing, excellent and complete documentation, a dedicated peer-reviewed publication, &c. This ain’t your grandfather’s open source with a dump on sourceforge and a single README file full of typos.

As an example of the large amount of activity going on in the Python world, just this week, Yhat released ggplot for Python [1]. So, while last week I was still pointing to plotting as one of the weaknesses of Python, that may no longer be true.

2. Python is a real programming language

Matlab is not; it is a linear algebra package. This means that if you need to add some non-numerical capabilities to your application, it gets hairy very fast.

For scientific purposes, when writing a small specialized script, Python may often be the second-best choice: for linear algebra, Matlab may have nicer syntax; for statistics, R is probably nicer; for heavy regular expression usage, Perl (ugh) might still be nicer; if you want speed, Fortran or C(++) may be a better choice. To design a webpage, perhaps you want node.js. Python is not perfect for any of these, but is acceptable for all of them.

In every area, specialized languages are the best choice, but Python is the second best in more areas [2].

3. Python can easily interface with other languages

Python can interface with any language which can be interfaced through C, which is most languages. There is a missing link to some important Java software, but some work is being done to address that too. Technically, the same is true of Matlab.

However, the Python community and especially the scientific Python community has been extremely active in developing tools to make this as easy as you’d like (e.g., Cython).  Therefore, many tools/libraries in C-based languages are already wrapped in Python for you. I semi-routinely get sent little snippets of R code to do something. I will often just import rpy2 and use it from Python without having to change the rest of my code.
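For instance, something along these lines (a minimal sketch, assuming rpy2 and R are installed) is usually all it takes:

import rpy2.robjects as robjects

# Evaluate a snippet of R code and use the result from Python.
result = robjects.r('mean(c(1.0, 2.0, 4.0))')
print(list(result))   # an R numeric vector, iterable from Python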

4. With Python, you can have a full open-source stack

This means that you are allowed to, for example, ship a whole virtual machine image with your paper. You can also look at all of the code that your computation depends on. No black boxes.

5. Matlab Licensing issues are a pain. And expensive.

Note that I left the word expensive for the end, although in some contexts it may be crucial. Besides the basic Matlab licenses, you will often need to buy a few more licenses for specific toolboxes. If you need to run the code on a cluster, that will often mean even more licenses.

However, even when you do have the money, this does not make the problem go away: now you need admin for the licensing. When I was at CMU, we had campus-wide licenses and, yet, it took a while to configure the software on every new user’s computer (with some annoying minor issues like the fact that the username on your private computer needed to match the username you had on campus), you couldn’t run it outside the network (unless you set up a VPN, but this still means you need network access to run a piece of software), &c. Every so often, the license server would go down and stop everybody’s work. These secondary costs can be as large as the licensing costs.

Furthermore, using Python means you can more easily collaborate with people who don’t have access to Matlab. That includes a future version of yourself who decides to economize on Matlab licenses (or, if the number of users shrinks and your institution drops the campus licensing, you will not be one of the few groups forced to buy it out of pocket).

By the way, if you do want support, there are plenty of options for purchasing it [3]: larger companies as well as individual consultants available in any city. The issue of support is orthogonal to the licensing of the software itself.

§

Python does have weaknesses, of course; but that’s for another post.

[1] Yhat is a commercial company releasing open-source software, by the way; for those keeping track at home.
[2] I remember people saying this about C++ back in the day.
[3] However, because science is a third-world economy, it may be easier to spend 10k on Matlab licenses which come with phone support than to spend 1k on a local Python consultant.

Merging directories without loss of data

A problem I often have is two directories which are probably mostly the same, but maybe not completely, as some of the files might be newer (edited) versions of the others.

For example, directory A:

A/
A/document.txt
A/blogpost.txt
A/photo.jpg
A/me.jpg

and B:

B/
B/document.txt
B/photo.jpg
B/me.jpg
B/you.jpg

Now, I want to merge A and B. With only this small number of files, I could easily check by hand whether document.txt is the same on both sides, &c. However, in a large directory, this becomes impossible, so I wrote up a small utility to do so:

mergedirs B A

This will go through all of the files in B and check whether an equivalent file exists in A. If so, it will check the contents (and flags, depending on the command-line arguments used) and refuse to remove any file for which you do not have a copy.

Another cute thing it can do is compute a hash of a directory with all its files:

mergedirs --mode=hash

Prints out (for a directory called merge):

merge                    4a44a8706698da50f41fef5fdcffd163

This can be useful to check whether two directories in different computers are exactly the same (in terms of file contents, flags &c).
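Conceptually, the directory hash is the kind of thing you could compute like this (a rough Python sketch of the idea, not necessarily what mergedirs does internally; in particular, this version ignores flags):

import hashlib
import os

# Hash every file's relative path and contents in a deterministic order, so
# that identical trees produce identical digests.
def hash_directory(root):
    digest = hashlib.md5()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()                      # make the walk order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            digest.update(os.path.relpath(path, root).encode('utf-8'))
            with open(path, 'rb') as handle:
                digest.update(handle.read())
    return digest.hexdigest()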

It’s mostly a tool I wrote to scratch my own itch. I have no plans to develop it beyond my needs, but it might be useful for others too.

Working Around Bugs in Third Party Libraries

This is another story of continuous improvement, this time in mahotas’ little brother imread.

Last week, Volker Hilsenstein here at EMBL had a few problems with imread on Windows. This is one of those very hard issues: how to help someone on a different platform, especially one which you know nothing about?

In the end, the problem was not Windows per se, but an old version of libtiff. In that version, there is a logic error (literally, there is a condition which is miswritten and always false) and the code will attempt to read a TIFF header from a file even when writing. Mahotas-imread was not ready for this.

Many (especially in the research open-source world, unfortunately) would just say: well, I won’t support broken versions of libtiff; if your code does not adhere to the spec, then my code simply won’t work for you. See this excellent old essay by Joel Spolsky on this sort of thing.

In my case, I prefer to work around the bug: when libtiff tries to read in write mode, we return no data, which it handles correctly. I wrote the following data-reading function to pass to libtiff:

/* Dummy read callback passed to libtiff: report that no data is available. */
tsize_t tiff_no_read(thandle_t, void*, tsize_t) {
        return 0;
}

The purpose of this code is simply to make imread work even on a broken, 5 year old version of a third party library.

§

In the meanwhile, we also fixed compilation in Cygwin as well as a code path which led to a hard crash.

The possibility of a hard crash, especially, made me decide that this was important enough to merit a new release.

Extended Depth of Field In Python using Mahotas

I wanted to implement an extended depth of field algorithm last week for visualisation.

The idea of extended depth of focus is that you have a stack of images, taken at different focal points, and you build a single artificial image so that you get everything in focus. Some simple methods work OK, some very advanced methods work better. We were happy with OK for our setting. Here’s how to do it with mahotas:

Start with standard imports:

import numpy as np
import mahotas as mh

We are going to assume that you have an image object, which has three dimensions: the stack, height, and width:

stack,h,w = image.shape

We use mh.sobel as the measure of infocusness for each pixel [1]:

focus = np.array([mh.sobel(t, just_filter=True) for t in image])

Now, we select the best slice at each pixel location:

best = np.argmax(focus, 0)

So far, very easy. The next part is the hard part. We want to do the following:

r = np.zeros((h,w))-1
for y in range(h):
    for x in range(w):
        r[y,x] = image[best[y,x], y, x]

But this is very slow (never run nested loops in Python if you can avoid it). We get the same result with a slightly less legible, but faster manipulation [2]:

image = image.reshape((stack,-1)) # image is now (stack, nr_pixels)
image = image.transpose() # image is now (nr_pixels, stack)
r = image[np.arange(len(image)), best.ravel()] # Select the right pixel at each location
r = r.reshape((h,w)) # reshape to get final result

Et voilà!

§

Here is an example, from a stack of plankton images. This is the maximum intensity projection:

Maximum Intensity Projection

This is the most in-focus slice (using the sobel operator):

Most in-focus slice

And this is the extended depth of field result:

Extended Depth of Field

It is clearly sharper (look at some of the organelles on the top-left: they are a blur in the other two images), at the expense of some possible noise. I actually played around with blurring the image a little bit and it did improve things ever so slightly.

§

Cite the mahotas paper if using this in a scientific publication.

[1] Other methods simply use a different measure here.
[2] I am not 100% convinced that this is the best. After all we create an array of size len(image) just to index. I would be happy to find an alternative.