How NGLess uses its version declaration

NGLess is my metagenomics tool, which is based on a domain specific language. So, NGLess is both a language and a tool (which implements the language).

Since the beginning, ngless has had a focus on reproducibility and one the small ways in which this was implemented was that ngless requires a version declaration. Every ngless script is required to start with a version declaration:

    ngless "0.5"

This was always intended to enable the language to change while keeping perfect reproducibility of past scripts. Until recently, though, this was just hypothetical.

In October, I taught a course on NGLess and it became clear that one of the minor inconsistencies in the previous version of the language (at the time, version “0.0”) was indeed confusing. In the previous version of the language, the preprocess function modified its arguments. No other function did this.

In version “0.5” (which was released on November 1st), preprocess is now a pure function, so that you must assign its output to a value.

However, and this is where the version declaration comes into play, the newer executable still accepts scripts with the version declaration ngless "0.0". Furthermore, if you declare your script as using ngless 0.0, then the old behaviour is used. Thus, we fixed the language, but nobody needs to update their scripts.

Implementation note (which shouldn’t concern the user, but may be interesting to others): before interpretation, ngless will transform the input script, adding checks and optimizing it. A new pass (which is only enabled is the user requested version “0.0”), simply transforms the older code into its newer counterpart. Then, the rest of the process proceeds as if the user had typed in the newer version.

Advertisements

Upcoming Travels

I have quite a bit of upcoming travel. If you, dear Reader, happen to be around any of these events (or just in the same cities), just get in touch and we might be able to get a coffee.

July 2017

  • I will be in Valencia for FEMS 2017, July 9-13
  • I will be in Lisbon on July 20 for LxMLS 2017
  • I will be in Prague on July 21-25 for ECCB/ISMB 2017. I have poster about ngless at BOSC, but I will be around for the whole conference.

September 2017

November 2017

I tried Haskell for 5 years and here’s how it was

One blogpost style which I find almost completely useless is “I tried Programming Language X for 5 days and here’s how it was.” Most of the time, the first impression is superficial discussing syntax and whether you could get Hello World to run.

This blogpost is I tried Haskell for 5 years and here’s how it was.

In the last few years, I have been (with others) developing ngless, a domain specific language and interpreter for next-generation sequencing. For partly accidental reasons, the interpreter is written in Haskell. Even though I kept using other languages (most Python and C++), I have now used Haskell quite extensively for a serious, medium-sized project (11,270 lines of code). Here are some scattered notes on Haskell:

There is a learning curve

Haskell is a different type of language. It takes a while to fully get used to it if you’re coming from a more traditional background.

I have debugged code in Java, even though I never really learned (or wrote) any Java. Java is just a C++ pidgin language.

The same is not true of Haskell. If you have never looked at Haskell code, you may have difficulty following even simple functions.

Once you learn it, though, you get it.

Haskell has some very nice libraries

You really have very nice libraries, written by people doing really useful things.

Conduit and Parsec are the basis of a lot of ngless code.

Here is an excellent curated list of Haskell library world (added May 4)

Haskell libraries are sometimes hard to figure out

I like to think that you need both hard documentation and soft documentation.

Hard documentation is where you describe every argument to a function and its effects. It is like a reference work (think of man pages). Soft documentation are tutorials and examples and more descriptive text. Well documented software and libraries will have both (there no need for anything in between, I don’t want soft serve documentation).

Haskell libraries often have extremely hard documentation: they will explain the details of functions, but little in the way of soft documentation. This makes it very hard to understand why a function could be useful in the first place and in which contexts to use this library.

This is exacerbated by the often extremely abstract nature of some of the libraries. Case in point, is the very useful MonadBaseControl class. Trust me, this is useful. However, because it is so generic, it is hard to immediately grasp what it does.

I do not wish to over-generalized. Conduit, mentioned above, has tutorials, blogposts, as well as hard documentation.

Haskell sometimes feels like C++

Like C++, Haskell is (in part) a research project with a single initial Big Idea and a few smaller ones. In Haskell’s case, the Big Idea was purely functional lazy evaluation (or, if you want to be pedantic, call it “non-strict” instead of lazy). In C++’s case, the Big Idea was high level object orientation without loss of performance compared to C.

Both C++ and Haskell are happy to incorporate academic suggestions into real-world computer languages. This doesn’t need elaboration in the case of Haskell, but C++ has also been happy to be at the cutting edge. For example, 20 years ago, you could already use C++ templates to perform (limited) programming with dependent types. C++ really pioneered the mechanism of generics and templates.

Like C++, Haskell is a huge language, where there are many ways to do something. You have multiple ways to represent strings, you have accidents of history kept for backwards compatibility. If you read an article from 10 years ago about the best way to do something in the language, that article is probably outdated by two generations.

Like C++, Haskell’s error messages take a while to get used to.

Like C++, there is a tension in the community between the purists and the practitioners.

Performance is hard to figure out

Haskell and GHC generally let me get good performance, but it is not always trivial to figure out a priori which code will run faster and in less memory.

In some trivial sense, you always depend on the compiler to make your code faster (i.e., if the compiler was infinitely smart, any two programs that produce the same result would compile to the same highly efficient code).

In practice, of course, compilers are not infinitely smart and so there faster and slower code. Still, in many languages you can look at two pieces of code and reasonably guess which one will be faster, at least within an order of magnitude.

Not so with Haskell. Even very smart people struggle with very simple examples. This is because the most generic implementation of the code tends to be very inefficient. However, GHC can be very smart and make your software very fast. This works 90% of the time, but sometimes you write code that does not trigger all the right optimizations and your function suddenly becomes 1,000x slower. I have once or twice written two almost identical versions of a function with large differences in performance (orders of magnitude).

This leads to the funny situation that Haskell is (partially correctly) seen as an academic language used by purists obsessed with elegance; while in practice, a lot of effort goes into making the code written as compiler-friendly as possible.

For the most part, though, this is not a big issue. Most of the code will run just fine and you optimize the inner loops at the end (just like in any other language), but it’s a pitfall to watch out for.

The easy is hard, the hard is easy

For minor tasks (converting between two file formats, for example), I will not use Haskell; I’ll do it Python: It has a better REPL environment, no need to set up a cabal file, it is easier to express simple loops, &c. The easy things are often a bit harder to do in Haskell.

However, in Haskell, it is trivial to add some multithreading capability to a piece of code with complete assurance of correctness. The line that if it compiles, it’s probably correct is often true.

Stack changed the game

Before stack came on the game, it was painful to make sure you had all the right libraries installed in a compatible way. Since stack was released, working in Haskell really has become much nicer. Tooling matters.

The really big missing piece is the equivalent of ccache for Haskell.

Summary

Haskell is a great programming language. It requires some effort at the beginning, but you get to learn a very different way of thinking about your problems. At the same time, the ecosystem matured significantly (hopefully signalling a trend) and the language can be great to work with.

Utilitarian Scientific Software Development

Yesterday, I added this new feature to ngless: if the user asks it to run a non-existent script, it will try it give an error message with a guess of what you probably really meant.

For example, if you type Profiles.ngl, but the script is actually called profile.ngl:

$ ngless Profiles.ngl

Exiting after fatal error:
File `Profiles.ngl` does not exist. Did you mean 'profile.ngl' (closest match)?

Previously, it would just say Profiles.ngl not found, without a suggestion.

It took me about 10-15 minutes to implement this (actually most of the functionality was already present as ngless already implemented similar functionality in other context). Is it worth it?

When is it worth it to add bells & whistles to research software?

I think we should think about it, in an ideal world, using the utilitarian principle of research software development: software should reduce the overall human effort. If this feature saves more time overall than it took to write, then it’s worth it.

This Utilitarian Principle says that these 15 minutes were well invested if (and only if) this ngless features saves more than 15 minutes for all its users over its lifetime. I expect that every time an user triggers this error, they’ll save a few seconds (say 2 seconds). 15 minutes is 900 seconds. Thus, this feature is worth it if it is triggered over 450 times. Given that we hope that ngless will be widely used, this feature is certainly worth it.

This principle also makes the argument that it would not be worth to add such a feature to a script that is only used in an internal analysis. So, code that was only meant to be used by myself or by myself and a small number of others, should have fewer bells & whistles.

In a non-ideal world, we need to take into account the incentives of the scientific (or commercial) world and the possibility of market failure: the market does not always reward the most selfless behaviour (this includes the “market” for scientific rewards where shiny new things are “paid” more than boring old maintenance).

Eager Error Detection in Ngless: A big advantage of a DSL

One of the advantages of ngless is its error detection. For example, consider the following ngless script:

ngless "0.0"
input = fastq("input.fq.gz")
mapped = map(input, ref='hg19')
write(mapped, ofile="output/mapped.bam")

If the directory output does not exist (maybe you meant to write outputs; I know I make this sort of mistake all the time), then ngless will immediately give you an error message:

Line 4: File name ‘outputs/output.sam’ used as output, but directory outputs does not exist.

This is a big advantage compared with traditional tools which would run the pipeline until the last step and then fail. Until last week, though, it would not check the following code:

ngless "0.0"
import "parallel" version "0.0"
sample = lock1(readlines("samples.txt"))
input = fastq(sample + ".fq.gz")
mapped = map(input, ref='hg19')
write(mapped, ofile="output/" + sample + ".mapped.bam")

The parallel module adds the lock1 function which will take the list of samples (in this case read from a file using the readlines function) and select one using a locking mechanism so that several ngless processes can run at the same time and each one will work on a different sample. Now, the output name is being formed depending on inputs. So, ngless could not check it before it starts interpreting the script.

With a commit last week, ngless will now check the script by performing the following transformation:

ngless "0.0"
import "parallel" version "0.0"
sample = lock1(readlines("samples.txt"))
__check_ofile("output/" + sample + ".mapped.bam")
input = fastq(sample + ".fq.gz")
mapped = map(input, ref='hg19')
write(mapped, ofile="output/" + sample + ".mapped.bam")

Now, immediately after the variable sample is set, ngless will build the output path and check that it is available with the right permissions. In this case, readlines and lock1 are very fast functions, so any errors will be reported within a few miliseconds of starting ngless before any expensive computation is performed.

This is only possible because we are working with a domain specific language.

How to get FastQ reads from a SAM/BAM file

Using ngless, of course. Just run the following on the command line to take the sequences from

ngless -e 'write(as_reads(samfile("input.bam")), ofile="sequences.fq.gz")'

Let me unpack that a bit. This is equivalent to writing the following script:

ngless "0.0"

# Load a SAM or BAM file
sam_or_bam = samfile("input.bam")

# Get the reads out
reads = as_reads(sam_or_bam)

# Write them out
write(reads, ofile="sequences.fq.gz")

The exact behaviour will depend on whether you have paired or single-ended reads. If you have single-ended reads, they will be saved to sequences.fq.gz. If you have paired-end reads, they will be saved to sequences.pair.1.fq.gz and sequences.pair.2.fq.gz (and if there are any single-ends mixed in, sequences.singles.fq.gz.

Packaging ngless with nix & brew

I spent a little bit of time packaging ngless in these last few days. I currently have packages for nix and homebrew. I used homebrew because I wanted to get a package for Mac OS, but there is also an easy-to-install Linux port, so this could be an option for Linux users too.

Because ngless is Haskell based, I was afraid this was going to be complicated. Before stack appeared, getting all your Haskell dependencies right for a Haskell project was a non-trivial affair (The google query cabal hell returns >500k results).

Contrary to expectations, it was surprisingly easy to package ngless with both nix and brew.

Nix

The result looks like this:

{ nixpkgs ? import  {}, compiler ? "ghc801" }:
nixpkgs.pkgs.haskell.packages.${compiler}.callPackage ./ngless.nix { }

plus the file ngless.nix, which was 95% autogenerated with cabal2nix.

Now, with nix, you can install ngless with a single command:

nix-env -i -f https://github.com/luispedro/ngless-nix/archive/master.tar.gz

At the moment, you need to be on the unstable channel for this to work (because some of the dependencies were only added recently).

The only non-trivial bit was tweaking the dependencies so that ngless can be compiled with GHC 8. I am still using the previous version (7.10) for development and could have just used it as well for this distribution (in fact you can still use it by specifying --arg compiler '"ghc7103"' on the command line).

However, (1) eventually GHC 8 will take over so we might as well see the issues now and (2) ngless compiled with GHC 8 runs 10~20% faster (in part because of a change in GHC introduced to make ngless faster¹).

Homebrew

Since I already use nix on my laptop and have experience navigating its quirks, I was confident that I could get nix to work, but I had never used brew.

Still, about 30 minutes after starting to google for brew, I had a formula that mostly worked. One or two bug fixes later and it just works. I was able to install ngless on both a Linux and a Mac machine using it. Below I show the whole thing so you can see how small the resulting file is.

On the one hand, it was much simpler than nix; dependencies are implicit and there were no weird error messages from the nix language or any need to follow a long trail of references to figure out where something is defined. On the other hand, though, I did run into a few issues which would not have existed with nix related to versions and caching.

After brew downloads the file corresponding to a specific version, it will not redownload even if you change its URL/SHA256. It just assumes that the previous file is correct and you have to manually delete it from the cache. Not a big deal, but nix would not let you make this mistake.

This seems to be the nix tradeoff: a bit rough on the user experience over the short term but easier to understand and fewer errors over the long term. If the efforts to improve nix’s usability bear fruit, it’ll be superior to alternatives. Right now, brew is still nicer for casual users.

Here is the complete brew formula:

require "language/haskell"

class Ngless < Formula
include Language::Haskell::Cabal

desc "Domain Specific Language for NGS Processing"
homepage "http://ngless.readthedocs.io/"
url "https://github.com/luispedro/ngless/archive/34e7e79be62f76b1a4d61d94a59a1e42d9111308.zip"
sha256 "1ac04e535390b0a6189bcfc1c486b54443dc46b7f6b74bf5221b8e4c98c463bc"
version "0.0.0"

head "https://github.com/luispedro/ngless.git"

depends_on "ghc" => :build
depends_on "cabal-install" => :build

depends_on "bwa" => :install
depends_on "samtools" => :install

def install
system "m4 NGLess.cabal.m4 > NGLess.cabal"
install_cabal_package
end
end

¹ This change really did fix a performance bug in GHC, which should help all multi-threaded programmes, but it was ngless that triggered it.