Utilitarian Scientific Software Development

Yesterday, I added this new feature to ngless: if the user asks it to run a non-existent script, it will try it give an error message with a guess of what you probably really meant.

For example, if you type Profiles.ngl, but the script is actually called profile.ngl:

$ ngless Profiles.ngl

Exiting after fatal error:
File `Profiles.ngl` does not exist. Did you mean 'profile.ngl' (closest match)?

Previously, it would just say Profiles.ngl not found, without a suggestion.

It took me about 10-15 minutes to implement this (actually most of the functionality was already present as ngless already implemented similar functionality in other context). Is it worth it?

When is it worth it to add bells & whistles to research software?

I think we should think about it, in an ideal world, using the utilitarian principle of research software development: software should reduce the overall human effort. If this feature saves more time overall than it took to write, then it’s worth it.

This Utilitarian Principle says that these 15 minutes were well invested if (and only if) this ngless features saves more than 15 minutes for all its users over its lifetime. I expect that every time an user triggers this error, they’ll save a few seconds (say 2 seconds). 15 minutes is 900 seconds. Thus, this feature is worth it if it is triggered over 450 times. Given that we hope that ngless will be widely used, this feature is certainly worth it.

This principle also makes the argument that it would not be worth to add such a feature to a script that is only used in an internal analysis. So, code that was only meant to be used by myself or by myself and a small number of others, should have fewer bells & whistles.

In a non-ideal world, we need to take into account the incentives of the scientific (or commercial) world and the possibility of market failure: the market does not always reward the most selfless behaviour (this includes the “market” for scientific rewards where shiny new things are “paid” more than boring old maintenance).

Eager Error Detection in Ngless: A big advantage of a DSL

One of the advantages of ngless is its error detection. For example, consider the following ngless script:

ngless "0.0"
input = fastq("input.fq.gz")
mapped = map(input, ref='hg19')
write(mapped, ofile="output/mapped.bam")

If the directory output does not exist (maybe you meant to write outputs; I know I make this sort of mistake all the time), then ngless will immediately give you an error message:

Line 4: File name ‘outputs/output.sam’ used as output, but directory outputs does not exist.

This is a big advantage compared with traditional tools which would run the pipeline until the last step and then fail. Until last week, though, it would not check the following code:

ngless "0.0"
import "parallel" version "0.0"
sample = lock1(readlines("samples.txt"))
input = fastq(sample + ".fq.gz")
mapped = map(input, ref='hg19')
write(mapped, ofile="output/" + sample + ".mapped.bam")

The parallel module adds the lock1 function which will take the list of samples (in this case read from a file using the readlines function) and select one using a locking mechanism so that several ngless processes can run at the same time and each one will work on a different sample. Now, the output name is being formed depending on inputs. So, ngless could not check it before it starts interpreting the script.

With a commit last week, ngless will now check the script by performing the following transformation:

ngless "0.0"
import "parallel" version "0.0"
sample = lock1(readlines("samples.txt"))
__check_ofile("output/" + sample + ".mapped.bam")
input = fastq(sample + ".fq.gz")
mapped = map(input, ref='hg19')
write(mapped, ofile="output/" + sample + ".mapped.bam")

Now, immediately after the variable sample is set, ngless will build the output path and check that it is available with the right permissions. In this case, readlines and lock1 are very fast functions, so any errors will be reported within a few miliseconds of starting ngless before any expensive computation is performed.

This is only possible because we are working with a domain specific language.

How to get FastQ reads from a SAM/BAM file

Using ngless, of course. Just run the following on the command line to take the sequences from

ngless -e 'write(as_reads(samfile("input.bam")), ofile="sequences.fq.gz")'

Let me unpack that a bit. This is equivalent to writing the following script:

ngless "0.0"

# Load a SAM or BAM file
sam_or_bam = samfile("input.bam")

# Get the reads out
reads = as_reads(sam_or_bam)

# Write them out
write(reads, ofile="sequences.fq.gz")

The exact behaviour will depend on whether you have paired or single-ended reads. If you have single-ended reads, they will be saved to sequences.fq.gz. If you have paired-end reads, they will be saved to sequences.pair.1.fq.gz and sequences.pair.2.fq.gz (and if there are any single-ends mixed in, sequences.singles.fq.gz.

Packaging ngless with nix & brew

I spent a little bit of time packaging ngless in these last few days. I currently have packages for nix and homebrew. I used homebrew because I wanted to get a package for Mac OS, but there is also an easy-to-install Linux port, so this could be an option for Linux users too.

Because ngless is Haskell based, I was afraid this was going to be complicated. Before stack appeared, getting all your Haskell dependencies right for a Haskell project was a non-trivial affair (The google query cabal hell returns >500k results).

Contrary to expectations, it was surprisingly easy to package ngless with both nix and brew.


The result looks like this:

{ nixpkgs ? import  {}, compiler ? "ghc801" }:
nixpkgs.pkgs.haskell.packages.${compiler}.callPackage ./ngless.nix { }

plus the file ngless.nix, which was 95% autogenerated with cabal2nix.

Now, with nix, you can install ngless with a single command:

nix-env -i -f https://github.com/luispedro/ngless-nix/archive/master.tar.gz

At the moment, you need to be on the unstable channel for this to work (because some of the dependencies were only added recently).

The only non-trivial bit was tweaking the dependencies so that ngless can be compiled with GHC 8. I am still using the previous version (7.10) for development and could have just used it as well for this distribution (in fact you can still use it by specifying --arg compiler '"ghc7103"' on the command line).

However, (1) eventually GHC 8 will take over so we might as well see the issues now and (2) ngless compiled with GHC 8 runs 10~20% faster (in part because of a change in GHC introduced to make ngless faster¹).


Since I already use nix on my laptop and have experience navigating its quirks, I was confident that I could get nix to work, but I had never used brew.

Still, about 30 minutes after starting to google for brew, I had a formula that mostly worked. One or two bug fixes later and it just works. I was able to install ngless on both a Linux and a Mac machine using it. Below I show the whole thing so you can see how small the resulting file is.

On the one hand, it was much simpler than nix; dependencies are implicit and there were no weird error messages from the nix language or any need to follow a long trail of references to figure out where something is defined. On the other hand, though, I did run into a few issues which would not have existed with nix related to versions and caching.

After brew downloads the file corresponding to a specific version, it will not redownload even if you change its URL/SHA256. It just assumes that the previous file is correct and you have to manually delete it from the cache. Not a big deal, but nix would not let you make this mistake.

This seems to be the nix tradeoff: a bit rough on the user experience over the short term but easier to understand and fewer errors over the long term. If the efforts to improve nix’s usability bear fruit, it’ll be superior to alternatives. Right now, brew is still nicer for casual users.

Here is the complete brew formula:

require "language/haskell"

class Ngless < Formula
include Language::Haskell::Cabal

desc "Domain Specific Language for NGS Processing"
homepage "http://ngless.readthedocs.io/"
url "https://github.com/luispedro/ngless/archive/34e7e79be62f76b1a4d61d94a59a1e42d9111308.zip"
sha256 "1ac04e535390b0a6189bcfc1c486b54443dc46b7f6b74bf5221b8e4c98c463bc"
version "0.0.0"

head "https://github.com/luispedro/ngless.git"

depends_on "ghc" => :build
depends_on "cabal-install" => :build

depends_on "bwa" => :install
depends_on "samtools" => :install

def install
system "m4 NGLess.cabal.m4 > NGLess.cabal"

¹ This change really did fix a performance bug in GHC, which should help all multi-threaded programmes, but it was ngless that triggered it.

NGless Miscellania [5/5]

NOTE: As of Apr 2016, ngless is available only as a pre-release to allow for testing of early versions by fellow scientists and to discusss ideas. We do not consider it /released/ and do not recommend use in production as some functionality is still in testing. Please if you are interested in using ngless in your projects.

This is the last in a series of five posts introducing ngless.

  1. Introduction to ngless
  2. Perfect reproducibility using ngless
  3. Fast and high quality error detection
  4. Extending and interacting with other projects
  5. Miscellaneous [this post]

Ngless has a few not so visible details that can come in handy.

Local installation

ngless relies on a few third-party utilities (bwa and samtools, besides any other modules you install) as well as possibly reference information. However, it does not require either (1) a super user install nor (2) fiddling with PATH variables or such. It is happy to install its data into your home directory and run from there.

You can also install it globally, of course, but in many academic settings, you need to ask permission to install a package globally, while you can do whatever you want in your home directory. NGless is designed with this in mind.

On the fly QC (quality control)

All FastQ files are automatically passed through a QC analysis when you load them and again after any preprocessing step. You do not need to specify QC as a separate step, it just happens. In fact, if possible, ngless will run it on the fly for efficiency reasons.

Best practices should be easy and QC is a best practice.

Subsample mode

Subsample mode simply throws away 99% of the data.

Why would anyone ever want to do this?

This allows you to quickly check whether your pipeline works as expected and the output files are as expected. For example:

ngless --subsample script.ngl

will run script.ngl in subsample mode, which will probably run much faster than the full pipeline, allowing to quickly spot any issues with your code. A 10 hour pipeline will finish in a few minutes when running in subsample mode.

Subsample mode also changes all your write() so that the output files include the subsample extension. That is, a call such as

write(output, ofile='results.txt')

will automatically get rewritten to

write(output, ofile='results.txt.subsample')

This ensures that you do not confuse subsampled results with the real thing. NGless is all about making sure your results are correct, so it tries to avoid confusing you as much as possible (this is similar to how it always writes output files with the atomic protocol so that you never get a partial results file).

Parallel processing & speed

The main goal of ngless is to save bioinformaticians time while improving the results. However, as a side benefit of having a well-defined language, the interpreter can take automatic advantage of multiple processors.

Consider the following script:

ngless '0.0'

input = fastq('input.fq.gz')
preprocess(input) using |r|:
    r = substrim(r, min_quality=45)
    if len(r) < 45:
mapped = map(input, reference='hg19')
counted = count(mapped, features=['gene'])
write(counted, ofile='genes.txt')

Almost all the steps in the pipeline can take advantage of multiple processors:

  1. QC is performed on the fly as the file ‘input.fq.gz’ is being read.
  2. preprocess takes advantage of mulitple processors by processing reads in parallel
  3. map calls bwa which makes use of threads
  4. count again processes the output of mapping in parallel.

To use more than one core in ngless, just use the option -j with the number of threads you want. For example:

ngless -j8 pipeline.ngl

Will run with 8 cores, speeding the processing considerably.

Extending ngless and interacting with other projects [4/5]

NOTE: As of Apr 2016, ngless is available only as a pre-release to allow for testing of early versions by fellow scientists and to discusss ideas. We do not consider it /released/ and do not recommend use in production as some functionality is still in testing. Please if you are interested in using ngless in your projects.

This is the first of a series of five posts introducing ngless.

  1. Introduction to ngless
  2. Perfect reproducibility using ngless
  3. Fast and high quality error detection
  4. Extending and interacting with other projects [this post]
  5. Miscellaneous

Extending and interacting with other projects

A frequently asked question about ngless is whether the language is extensible. Yes, it is. You can add modules using a simple text-only format (YaML). These modules can then add new functions to ngless. Behind the scenes, this results in command line calls to scripts you write.

For example, to integrate motus into ngless, I used a simple configuration file, which I am going to describe it here.

Every module has a name and a version:

name: 'motus'
version: '0.0.0'

You can add a citation text. This will be shown to all users of your module (citing the software you use is a best practice, so we support it):

citation: "Metagenomic species profiling using universal phylogenetic marker genes"

You can add an init command. This will run before anything else runs at the start of the interpretation. It should be quick and check that things are OK. For example, in this case, we check that Python is installed. Thus, if there is a problem, the user gets a fast error message before anything else is run.

    init_cmd: './check-python.sh'

Now, we list the functions we are implementing:


In this case, there is just one, corresponding to the ngless function motus.

    nglName: "motus"

arg0 is the command to run (which implements this function):

    arg0: './run-python.sh'

In ngless functions have a single unnamed argument and any number of named arguments. So, we specify first arg1 which is a special

        filetype: "tsv"
        can_gzip: true

The can_gzip flag lets ngless know that it is OK to pass a compressed file to your script. Now, we list any additional arguments. In this case, there is a required argument:

            atype: 'str'
            name: 'ofile'
            def: ''
            required: true

The argument is a string, without a default. That’s it. Now, we can use the motus function in a ngless script:

ngless "0.0"
import "motus" version "0.0.0"

input = paired('data/reads.1.fq.gz', 'data/reads.2.fq.gz')
preprocess(input, keep_singles=False) using |read|:
    read = substrim(read, min_quality=25)
    if len(read) < 45:

mapped = map(input, ref='motus')
mapped = select(mapped) using |mread|:
    mread = mread.filter(min_identity_pc=97)
counted = count(mapped, gff_file='motus.gtf.gz', features=['gene'], multiple={dist1})
motus(counted, ofile='motus-counts.txt')

What can modules do?

An external module can

  • add new functions (will result in a call to a script, which will often be a wrapper around some tool).
  • add new reference information (new catalogs, &c). This can even be downloaded on demand (currently [Apr 2016], the module init script must do this itself; in the future, ngless will support just a URL).
  • add a citation so that all users of the module will see the citation message. This ensures that if you develop a package which gets wrapped into an ngless module, those final users will still see your citation.