Fast and useful errors with ngless [3/5]

NOTE: As of Feb 2016, ngless is available only as a pre-release to allow for testing of early versions by fellow scientists and to discusss ideas. We do not consider it /released/ and do not recommend use in production as some functionality is still in testing. Please if you are interested in using ngless in your projects.

This is the first of a series of five posts introducing ngless.

  1. Introduction to ngless
  2. Perfect reproducibility using ngless
  3. Fast and high quality error detection [this post]
  4. Extending and interacting with other projects
  5. Miscellaneous

If you are the rare person who just writes code without bugs (if your favorite editor is cat), then you can skip this post as it only concerns those of us who make mistakes. Otherwise, I will assume that /your code will have bugs/. Your code will have silly typos and large mistakes.

Too many tools work well, but fail badly; that is, if all their dependencies are there, all the files are exactly perfect and the user specificies all the right options, then the tool will work perfectly; but any mistake and you will get a bizarre error, which will be hard to fix. Thus,the tool is bad at failing. Ngless promises to work well and fail well.

Make some errors impossible

Let us recall our running example:

ngless "0.0"
import OceanMicrobiomeReferenceGeneCatalog version "1.0"

input = paired('data/data.1.fq.gz', 'data/data.2.fq.gz')
preprocess(input) using |read|:
    read = substrim(read, min_quality=30)
    if len(read) < 45:

mapped = map(input, reference='omrgc')
summary = count(mapped, features=['ko', 'cog'])
write(summary, ofile='output/functional.summary.txt')

Note that we do not specify paths for the ‘omrgc’ reference or the functional map file. We also do not specify files for intermediate files. This is all implicit and you cannot mess it up. The Fastq encoding is auto-detected, removing one more opportunity for you to mess up (although you can specify the encoding if you really want to).

Ngless always uses the three step output safe writing pattern:

  1. write the output to a temp file,
  2. sync the file and its directory to disk,
  3. rename the temp file to the final output name.

The final step is atomic. That is, the operating system garantees that it either fully completes or never executes even if there is an error, so that you never get a partial file. Thus, if there is an output file, you know that ngless finished without errors (up to that point, at least) and that the output is correct. No more asking “did the cluster crash affect this file? Maybe I need to recompute or maybe I count the number of lines to make sure it’s complete”. None of that: if the file is there, it is complete.

Side-note: programming languages (or their standard libraries) should have support for this safe-output writing pattern. I do not know of any language that does.

Make error detection fast

Have you ever run a multi-step pipeline where the very last step (often saving the results) has a silly typo and everything fails disastrously at that point wasting you hours of compute time? I know I have. Ngless tries as hard as possible to make sure that doesn’t happen.

Although ngless is interpreted rather than compiled, it performs an initial pass over your script to check for all sorts of possible errors.

Ngless is a typed language and all types are checked so that if you try to run the count function without first maping, you will get an error message.

All arguments to functions are also checked. This even checks some rules that would be hard to impose using a more general purpose programming language: for example, when you call count, either (1) you are using a built-in reference which has its own annotation files or (2) you have to pass in the path to a GTF or gene map file so that the output of the mapping can be annotated and summarized. This constraint would be hard to express in, for example, Java or C++, but ngless can check this type of condition easily.

The initial check makes sure that all necessary input files exist and can be read and even that any directories used for output are present and can be written to (in the script above, if a directory named output is not present, you will get an immediate error). If you are using your own functional map, it will read the file header to check that any features you use are indeed present (in the example above, it checks that the ‘ko’ and ‘cog’ features exist in the built-in ocean microbiome catalog).

All typos and other similar errors are caught immediately. If you mistype the name of your output directory, ngless will let you know in 0.2 seconds rather than after hours of computation.

You can also just run the tests with ngless -n script-name.ngl: it does nothing except run all the validation steps.

Again, this is an idea that could be interesting to explore in the context of general purpose languages.

Make error messages helpful

An unknown error occurred
An unhelpful error message

As much as possible, when an error is detected, the message should help you make sense of it and fix it. A tool cannot always read your mind, but as much as possible, ngless error messages are descriptive.

For example, if you used an illegal argument/parameter to a function, ngless will remind you of what the legal arguments are. If it cannot write to an output file it will say it cannot write to an output file (and not just “IO Error”). If a file is missing, it will tell you which file (and it will tell you in about 0.2 seconds.


Ngless is designed to make some errors impossible, while trying hard to give you good error messages for the errors that will inevitably crop up.

Mounting My Phone as a Filesystem

After a lot of time spent trying to find the right app/software for getting things off my phone into my computer (I have spent probably days now, accumulated over my lifetime of phone-ownership; this shouldn’t be so hard), I just gave up and wrote a little FUSE script to mount the phone as a directory in Linux.

It is very basic and relies on adb (android debug bridge) being installed on the PATH, but if it is present, I was able to just type:

$ mkdir -p phone
$ python phone &

To get the phone mounted as a directory:

$ cd phone
$ ls -l
total 656 
drwxr-xr-x 1 root      root           0 Jan  1 00:44 acct 
drwxrwx--- 1 luispedro      2001      0 Jan  6 12:30 cache[...]

Now, I could navigate the directories and files from the phone to the computer (uploading files in is not available as I don’t need it).

What was nice was that, using fusepy (which also needs to be available), the code wasn’t too hard to write even. Then I was able to see where my phone hard drive disk was going (I keep running out of disk space) and delete a few Gigabytes of pictures I had anyone already saved somewhere else (by making sure the hashes matched).

It’s available at: But rely on it at your own risk! It a best-attempt code and works on my phone, but I didn’t vet it 100%. It’s also kind of slow to list directories and such.

(It may also work on Mac, as Mac also has FUSE; but I cannot test it).

Hard and Soft Documentation

A good poker player polarizes his hands. This means that, for example, they might play a check-raise (this means you first refrain from betting and then come over the top on your opponent—This is normally done to give the impression that you have a strong hand) when they do have a very strong hand or when they completely missed the flop (they have very bad cards and are just bluffing). Intermediate hands are played differently [1]

I think good software documentation is often also polarized: you should have hard documentation and soft documentation, but nothing in the middle.

Hard Documentation

This is low level documentation, generally. This is of the kind that a Unix manpage gets you. This tells you exactly what each function and each argument does. If it is good, it will often be very succint.

Mahotas has always excelled at this level. Here is the sobel edge function:

def sobel(img, just_filter=False):
    edges = sobel(img, just_filter=False)

    Compute edges using Sobel's algorithm

    `edges` is a binary image of edges computed according to Sobel's algorithm.

    This implementation is tuned to match MATLAB's implementation.

    img : Any 2D-ndarray
    just_filter : boolean, optional
        If true, then return the result of filtering the image with the sobel
        filters, but do not threashold (default is False).    Returns
    edges : ndarray
        Binary image of edges, unless `just_filter`, in which case it will be
        an array of floating point values.

This is because I can remember the general ideas behind each function, but I might like to look up the exact arguments. So, every little detail is documented.

Soft Documentation

Soft documentation are tutorials and other higher level guides. They do not pertain to a single function or a single object, but to the overall structure and thinking behind the software.

Mahotas has not had so much of these, but I have been trying to add some over the past few months (Finding Wally, for example). Some more mahotas blogging might also help.

The Intermediate Level

I don’t care so much for intermediate level documentation. I rarely find that level helpful. Unfortunately, this is the level at which too much bad documentation is written. Stuff like:

This function is part of the image segmentation pipeline. It can be used after pre-filtering or directly on the raw image data.

Ok, sort of helpful, but not really.

[1] The reason for the randomness is that if you always do a single thing, people will catch on and exploit it (if you bluff a lot, people will call it; if you always have a strong hand, then you won’t get the added benefit of having someone try to call your bluff and give you even more chips). Intermediate hands should not be played like this because if the opponent pushes back, they probably have something that beats your intermediate hand. As always in poker, YMMV.

Extended Depth of Field In Python using Mahotas

I wanted to implement an extended depth of field algorithm last week for visualisation.

The idea of extended depth of focus is that you have a stack of images, taken at different focal points, and you build a single artificial image so that you get everything in focus. Some simple methods work OK, some very advanced methods work better. We were happy with OK for our setting. Here’s how to do it with mahotas:

Start with standard imports:

import numpy as np
import mahotas as mh

We are going to assume that you have an image object, which has three dimensions: the stack, height, and width:

stack,h,w = image.shape

We use mh.sobel as the measure of infocusness for each pixel [1]:

focus = np.array([mh.sobel(t, just_filter=True) for t in image])

Now, we select the best slice at each pixel location:

best = np.argmax(focus, 0)

So far, very easy. The next part is the hard part. We want to do the following:

r = np.zeros((h,w))-1
for y in xrange(h):
    for x in xrange(w):
        r[y,x] = image[best[y,x], y, x]

But this is very slow (never run nested loops in Python if you can avoid it). We get the same result with a slightly less legible, but faster manipulation [2]:

image = image.reshape((stack,-1)) # image is now (stack, nr_pixels)
image = image.transpose() # image is now (nr_pixels, stack)
r = image[np.arange(len(image)), best.ravel()] # Select the right pixel at each location
r = r.reshape((h,w)) # reshape to get final result

Et voilà!


Here is an example, from a stack of plankton images. This is is the maximum intensity projection:

Maximum Intensity Projection

This is the most in-focus slice (using the sobel operator):

Most in-focus slice

And this is the extended depth of field result:

Extended Depth of Field

It is clearly sharper (look at some of the organelles on the top-left: they are a blur in the other two images),at the expense of some possible noise. I actually played around with blurring the image a little bit and it did improve things ever so slightly.


Cite the mahotas paper if using this in a scientific publication.

[1] Other methods simply use a different measure here.
[2] I am not 100% convinced that this is the best. After all we create an array of size len(image) just to index. I would be happy to find an alternative.

Jug now outputs metadata on the computation

This week-end, I added new functionality to jug (previous related posts). Jug can now output the final result of a computation including metadata on all the intermediate inputs!

For example:

from jug import TaskGenerator
from import write_metadata # <---- THIS IS THE NEW STUFF

def double(x):
    return 2*x

x = double(2)
x2 = double(x)write_metadata(x2, 'x2.meta.yaml')

When you execute this script (with jug execute), the write_metadata function will write a YAML description of the computation to the file x.meta.yaml!

This file will look like this:

- args: [2]
  meta: {completed: 'Sat Aug  3 18:31:42 2013', computed: true}
  name: jugfile.double
meta: {completed: 'Sat Aug  3 18:31:42 2013', computed: true}
name: jugfile.double

It tells you that a computation named jugfile.double was computated at 18h31 on Saturday August 3. It also gives the same information recursively for all intermediate results.


This is the result of  a few conversations I had at BOSC 2013.


This is part of the new release, 0.9.6, which I put out yesterday.

People Agreeing With Me on the Internet

A couple of things I noticed lately that relate to a few of my posts/ideas.

§ One

Many “scientists should do x” conversations need to be reframed as”the culture of science needs to support x.”

— Jacquelyn Gill (@JacquelynGill) May 16, 2013

This is a very nice way to phrase what I wrote about collective action problems in sharing code

§ Two

First of all, a really cool thing, Thomas J. Webb commented on his own paper with a blog update.

In the blog post, he writes (emphasis is mine):

With some trepidation, I opened up the file of R code I’d used for the original analysis. And got a pleasant surprise: it was readable! Largely this was because I submitted it as an appendix to our paper, and so had taken more care than usual to annotate it carefully. I think this demonstrates an under-apprecaiated virtue of sharing data and code: in preparing it such that it is comprehensible to others, it becomes much more useful to your future self. This point is nicely made in a new paper by Ethan White and colleagues on making using and reusing data easier.

Exactly: sharing code makes you write better code.

(hat tip to @mattjhodgkinson on twitter).

Introduction to Jug: Parallel Tasks in Python

Next Tuesday, I’m giving a short talk on Jug for the Heidelberg Python Meetup.

If you miss it, you can hear it in Berlin at the BOSC2013 (Bioinformatics Open Source Conference) in July. I will take this opportunity to write a couple of posts about jug.

Jug is a cross between the venerable make and Python. In Make tradition, you write a Perhaps, this is best illustrated by an example.

We are going to implement the dumb algorithm for finding all primes under 100. We write a function to check whether a number is prime:

def is_prime(n):
    from time import sleep
    # Sleep a little bit so that this does not run ridiculously fast
    for j in xrange(2,n-1):
        if (n % j) == 0:
            return False
    return True

Then we build tasks out of this function:

from jug import Task
primes100 = [Task(is_prime, n) for n in range(2,101))

Each of these tasks is of the form call “is_prime“ with argument “n“. So far, we have only built the tasks, nothing has been executed. One important point to note is that the tasks are all independent.

You can run jug execute on the command line and jug will start executing tasks:

jug execute &

The nice thing is that it is fine to run multiple of these at the same time:

jug execute &
jug execute &
jug execute &
jug execute &

They will all execute in parallel. We can use jug status to check what is happening:

jug status

Which prints out:

Task name                                    Waiting       Ready    Finished     Running
primes.is_prime                                    0          74          20           5
Total:                                             0          74          20           5

74 is_prime tasks are still in the Ready state, 5 are currently running (which is what we expected, right?) and 20 are done.

Wait a little bit and check again:

Task name                                    Waiting       Ready    Finished     Running
primes.is_prime                                    0           0          99           0
Total:                                             0           0          99           0

Now every task is finished. If we now run jug execute, it will do nothing, because there is nothing for it to do!


The introduction above has a severe flaw: this is not how you should compute all primes smaller than 100. Also, I have not shown how to get the prime values. On Monday, I will post a more realistic example.

It will also include a processing pipeline where later tasks depend on the results of earlier tasks.


(Really weird thing: as I am typing this, WordPress suggests I link to posts on feminism and Australia. Probably some Australian reference that I am missing here.)