NGless Miscellanea [5/5]

NOTE: As of Apr 2016, ngless is available only as a pre-release to allow for testing of early versions by fellow scientists and to discuss ideas. We do not consider it /released/ and do not recommend use in production as some functionality is still in testing. Please get in touch if you are interested in using ngless in your projects.

This is the last in a series of five posts introducing ngless.

  1. Introduction to ngless
  2. Perfect reproducibility using ngless
  3. Fast and high quality error detection
  4. Extending and interacting with other projects
  5. Miscellaneous [this post]

Ngless has a few not-so-visible details that can come in handy.

Local installation

ngless relies on a few third-party utilities (bwa and samtools, besides any other modules you install) as well as possibly reference information. However, it requires neither (1) a superuser install nor (2) fiddling with PATH variables or such. It is happy to install its data into your home directory and run from there.

You can also install it globally, of course, but in many academic settings, you need to ask permission to install a package globally, while you can do whatever you want in your home directory. NGless is designed with this in mind.

On the fly QC (quality control)

All FastQ files are automatically passed through a QC analysis when you load them and again after any preprocessing step. You do not need to specify QC as a separate step: it just happens. In fact, if possible, ngless will run it on the fly for efficiency reasons.

Best practices should be easy and QC is a best practice.
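
As a rough illustration of what "on the fly" means, here is a minimal Python sketch in which statistics are accumulated while the reads stream past, so that no separate pass over the file is needed. This is not how ngless is implemented; the names QCStats and with_qc are made up for the example, and Phred+33 quality encoding is assumed.

```python
class QCStats:
    """Running statistics, updated as reads stream past."""

    def __init__(self):
        self.n_reads = 0
        self.n_bases = 0
        self.quality_sum = 0

    def mean_quality(self):
        return self.quality_sum / self.n_bases if self.n_bases else 0.0


def with_qc(reads, stats, offset=33):
    """Yield (sequence, quality_string) pairs unchanged, updating stats.

    The default offset of 33 assumes Phred+33 encoded quality strings.
    """
    for seq, quals in reads:
        stats.n_reads += 1
        stats.n_bases += len(seq)
        stats.quality_sum += sum(ord(q) - offset for q in quals)
        yield seq, quals


stats = QCStats()
reads = [("ACGT", "IIII"), ("GG", "!!")]
# The downstream consumer just iterates; QC happens as a side effect.
passed = list(with_qc(reads, stats))
```

The consumer of the reads does not need to know that QC is happening, which is the point: the statistics come for free with the read loop.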

Subsample mode

Subsample mode simply throws away 99% of the data.

Why would anyone ever want to do this?

This allows you to quickly check whether your pipeline works as expected and the output files are as expected. For example:

ngless --subsample script.ngl

will run script.ngl in subsample mode, which will probably run much faster than the full pipeline, allowing you to quickly spot any issues with your code. A 10 hour pipeline will finish in a few minutes when running in subsample mode.

Subsample mode also changes all your write() calls so that the output files include the subsample extension. That is, a call such as

write(output, ofile='results.txt')

will automatically get rewritten to

write(output, ofile='results.txt.subsample')

This ensures that you do not confuse subsampled results with the real thing. NGless is all about making sure your results are correct, so it tries to avoid confusing you as much as possible (this is similar to how it always writes output files with the atomic protocol so that you never get a partial results file).
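
The two ideas behind subsample mode (keep about 1% of the data, and mark the outputs so they cannot be mistaken for real results) can be sketched in Python as follows. How ngless actually picks which reads to keep is not described here, so the fixed stride of one in every 100 is an assumption, and the function names are invented for this sketch.

```python
import itertools


def subsample(records, keep_every=100):
    """Keep one record out of every `keep_every`, discarding the rest.

    Using a fixed stride is an assumption made for this sketch.
    """
    return itertools.islice(records, 0, None, keep_every)


def subsample_name(ofile):
    """Mark an output file name as coming from a subsampled run."""
    return ofile + ".subsample"


reads = ["read%d" % i for i in range(1000)]
kept = list(subsample(reads))
```

Because subsample works on any iterator, it can sit directly on top of a file reader without loading the whole file.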

Parallel processing & speed

The main goal of ngless is to save bioinformaticians time while improving the results. However, as a side benefit of having a well-defined language, the interpreter can take automatic advantage of multiple processors.

Consider the following script:

ngless '0.0'

input = fastq('input.fq.gz')
preprocess(input) using |r|:
    r = substrim(r, min_quality=45)
    if len(r) < 45:
        discard
mapped = map(input, reference='hg19')
counted = count(mapped, features=['gene'])
write(counted, ofile='genes.txt')

Almost all the steps in the pipeline can take advantage of multiple processors:

  1. QC is performed on the fly as the file ‘input.fq.gz’ is being read.
  2. preprocess takes advantage of multiple processors by processing reads in parallel
  3. map calls bwa which makes use of threads
  4. count again processes the output of mapping in parallel.

To use more than one core in ngless, just use the option -j with the number of threads you want. For example:

ngless -j8 pipeline.ngl

This will run with 8 cores, speeding up the processing considerably.
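
To give a feel for the kind of read-level parallelism mentioned for preprocess, here is a small Python sketch that splits the reads into chunks and filters the chunks concurrently. It only illustrates the structure: ngless is not written this way, Python threads will not actually speed up CPU-bound pure-Python work, and filter_chunk is a stand-in for a real preprocessing step.

```python
from concurrent.futures import ThreadPoolExecutor


def chunks(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def filter_chunk(chunk, min_len=45):
    """Stand-in for a preprocessing step: drop reads shorter than min_len."""
    return [read for read in chunk if len(read) >= min_len]


def parallel_preprocess(reads, n_threads=8, chunk_size=1000):
    """Process chunks concurrently; map() yields results in input order."""
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        results = pool.map(filter_chunk, chunks(reads, chunk_size))
        return [read for chunk in results for read in chunk]
```

Since each read is processed independently of the others, the output is the same regardless of the number of threads, which is what makes this step safe to parallelize automatically.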

Extending ngless and interacting with other projects [4/5]

NOTE: As of Apr 2016, ngless is available only as a pre-release to allow for testing of early versions by fellow scientists and to discuss ideas. We do not consider it /released/ and do not recommend use in production as some functionality is still in testing. Please get in touch if you are interested in using ngless in your projects.

This is the fourth of a series of five posts introducing ngless.

  1. Introduction to ngless
  2. Perfect reproducibility using ngless
  3. Fast and high quality error detection
  4. Extending and interacting with other projects [this post]
  5. Miscellaneous

Extending and interacting with other projects

A frequently asked question about ngless is whether the language is extensible. Yes, it is. You can add modules using a simple text-only format (YAML). These modules can then add new functions to ngless. Behind the scenes, this results in command line calls to scripts you write.

For example, to integrate motus into ngless, I used a simple configuration file, which I describe here.

Every module has a name and a version:

name: 'motus'
version: '0.0.0'

You can add a citation text. This will be shown to all users of your module (citing the software you use is a best practice, so we support it):

citation: "Metagenomic species profiling using universal phylogenetic marker genes"

You can add an init command. This will run before anything else runs at the start of the interpretation. It should be quick and check that things are OK. For example, in this case, we check that Python is installed. Thus, if there is a problem, the user gets a fast error message before anything else is run.

init:
    init_cmd: './check-python.sh'

Now, we list the functions we are implementing:

function:

In this case, there is just one, corresponding to the ngless function motus.

    nglName: "motus"

arg0 is the command to run (which implements this function):

    arg0: './run-python.sh'

In ngless, functions have a single unnamed argument and any number of named arguments. So, we first specify arg1, which is special: it is the unnamed argument.

    arg1:
        filetype: "tsv"
        can_gzip: true

The can_gzip flag lets ngless know that it is OK to pass a compressed file to your script. Now, we list any additional arguments. In this case, there is a required argument:

    additional:
        -
            atype: 'str'
            name: 'ofile'
            def: ''
            required: true

The argument is a string, without a default. That’s it. Now, we can use the motus function in a ngless script:

ngless "0.0"
import "motus" version "0.0.0"

input = paired('data/reads.1.fq.gz', 'data/reads.2.fq.gz')
preprocess(input, keep_singles=False) using |read|:
    read = substrim(read, min_quality=25)
    if len(read) < 45:
        discard

mapped = map(input, ref='motus')
mapped = select(mapped) using |mread|:
    mread = mread.filter(min_identity_pc=97)
counted = count(mapped, gff_file='motus.gtf.gz', features=['gene'], multiple={dist1})
motus(counted, ofile='motus-counts.txt')
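
On the script side, declaring can_gzip means that the wrapped script has to cope with both plain and compressed input. Here is a minimal Python sketch of what such a script could look like. The calling convention used (input file as the first argument, named arguments passed as --name=value) is an assumption for this example and should be checked against the ngless documentation; the script body is a placeholder that only copies the table through.

```python
import argparse
import gzip


def open_maybe_gzip(path):
    """Open a text file, decompressing transparently if it is gzipped."""
    with open(path, "rb") as raw:
        is_gzip = raw.read(2) == b"\x1f\x8b"  # gzip magic number
    return gzip.open(path, "rt") if is_gzip else open(path, "rt")


def parse_args(argv):
    # Assumed calling convention: <script> INPUT --ofile=OUTPUT
    parser = argparse.ArgumentParser()
    parser.add_argument("input")
    parser.add_argument("--ofile", required=True)
    return parser.parse_args(argv)


def main(argv):
    args = parse_args(argv)
    with open_maybe_gzip(args.input) as fin, open(args.ofile, "w") as fout:
        for line in fin:
            # A real wrapper would call the wrapped tool here; this
            # placeholder only copies the table through.
            fout.write(line)
```

A real script would end with something like main(sys.argv[1:]). Checking the magic number rather than the file extension means the script does not depend on how the intermediate file happens to be named.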

What can modules do?

An external module can

  • add new functions (will result in a call to a script, which will often be a wrapper around some tool).
  • add new reference information (new catalogs, &c). This can even be downloaded on demand (currently [Apr 2016], the module init script must do this itself; in the future, ngless will support just a URL).
  • add a citation so that all users of the module will see the citation message. This ensures that if you develop a package which gets wrapped into an ngless module, those final users will still see your citation.

Fast and useful errors with ngless [3/5]

NOTE: As of Feb 2016, ngless is available only as a pre-release to allow for testing of early versions by fellow scientists and to discuss ideas. We do not consider it /released/ and do not recommend use in production as some functionality is still in testing. Please get in touch if you are interested in using ngless in your projects.

This is the third of a series of five posts introducing ngless.

  1. Introduction to ngless
  2. Perfect reproducibility using ngless
  3. Fast and high quality error detection [this post]
  4. Extending and interacting with other projects
  5. Miscellaneous

If you are the rare person who just writes code without bugs (if your favorite editor is cat), then you can skip this post as it only concerns those of us who make mistakes. Otherwise, I will assume that /your code will have bugs/. Your code will have silly typos and large mistakes.

Too many tools work well, but fail badly; that is, if all their dependencies are there, all the files are exactly perfect and the user specifies all the right options, then the tool will work perfectly; but any mistake and you will get a bizarre error, which will be hard to fix. Thus, the tool is bad at failing. Ngless promises to work well and fail well.

Make some errors impossible

Let us recall our running example:

ngless "0.0"
import OceanMicrobiomeReferenceGeneCatalog version "1.0"

input = paired('data/data.1.fq.gz', 'data/data.2.fq.gz')
preprocess(input) using |read|:
    read = substrim(read, min_quality=30)
    if len(read) < 45:
        discard

mapped = map(input, reference='omrgc')
summary = count(mapped, features=['ko', 'cog'])
write(summary, ofile='output/functional.summary.txt')

Note that we do not specify paths for the ‘omrgc’ reference or the functional map file. We also do not specify files for intermediate files. This is all implicit and you cannot mess it up. The Fastq encoding is auto-detected, removing one more opportunity for you to mess up (although you can specify the encoding if you really want to).

Ngless always uses the three-step safe output writing pattern:

  1. write the output to a temp file,
  2. sync the file and its directory to disk,
  3. rename the temp file to the final output name.

The final step is atomic. That is, the operating system guarantees that it either fully completes or never executes even if there is an error, so that you never get a partial file. Thus, if there is an output file, you know that ngless finished without errors (up to that point, at least) and that the output is correct. No more asking “did the cluster crash affect this file? Maybe I need to recompute or maybe I count the number of lines to make sure it’s complete”. None of that: if the file is there, it is complete.
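
For reference, this is roughly what the pattern looks like when written by hand in Python. It is a sketch of the general technique, not ngless code, and atomic_write is a name made up for the example. The directory sync is done after the rename here (a common ordering, so that the rename itself reaches the disk) and is skipped on platforms without O_DIRECTORY.

```python
import os
import tempfile


def atomic_write(path, data):
    """Write `data` (a str) to `path` so that no partial file is ever visible."""
    directory = os.path.dirname(os.path.abspath(path))
    # 1. Write to a temp file in the same directory (so that the rename
    #    below does not cross filesystems).
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(data)
            tmp.flush()
            # 2. Sync the file contents to disk.
            os.fsync(tmp.fileno())
        # 3. Atomically rename the temp file to the final name.
        os.replace(tmp_path, path)
    except BaseException:
        # On failure, remove the temp file so no debris is left behind.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    # Sync the directory too, so that the rename itself is durable
    # (POSIX only; directories cannot be opened this way on Windows).
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(directory, os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

Note that mkstemp creates the file readable only by its owner, so a real tool would probably also want to set the permissions before renaming.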

Side-note: programming languages (or their standard libraries) should have support for this safe-output writing pattern. I do not know of any language that does.

Make error detection fast

Have you ever run a multi-step pipeline where the very last step (often saving the results) has a silly typo and everything fails disastrously at that point wasting you hours of compute time? I know I have. Ngless tries as hard as possible to make sure that doesn’t happen.

Although ngless is interpreted rather than compiled, it performs an initial pass over your script to check for all sorts of possible errors.

Ngless is a typed language and all types are checked so that if you try to run the count function without first mapping, you will get an error message.

All arguments to functions are also checked. This even checks some rules that would be hard to impose using a more general purpose programming language: for example, when you call count, either (1) you are using a built-in reference which has its own annotation files or (2) you have to pass in the path to a GTF or gene map file so that the output of the mapping can be annotated and summarized. This constraint would be hard to express in, for example, Java or C++, but ngless can check this type of condition easily.

The initial check makes sure that all necessary input files exist and can be read and even that any directories used for output are present and can be written to (in the script above, if a directory named output is not present, you will get an immediate error). If you are using your own functional map, it will read the file header to check that any features you use are indeed present (in the example above, it checks that the ‘ko’ and ‘cog’ features exist in the built-in ocean microbiome catalog).

All typos and other similar errors are caught immediately. If you mistype the name of your output directory, ngless will let you know in 0.2 seconds rather than after hours of computation.
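
The file and directory checks described above can be sketched in a few lines of Python. This is only an illustration of the idea of validating everything before starting any computation; check_paths and its messages are invented for the example and are not what ngless prints.

```python
import os


def check_paths(input_files, output_files):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    for path in input_files:
        if not os.path.isfile(path):
            problems.append("Input file '%s' does not exist." % path)
        elif not os.access(path, os.R_OK):
            problems.append("Input file '%s' cannot be read." % path)
    for path in output_files:
        directory = os.path.dirname(path) or "."
        if not os.path.isdir(directory):
            problems.append(
                "Output directory '%s' (for '%s') does not exist." % (directory, path))
        elif not os.access(directory, os.W_OK):
            problems.append("Output directory '%s' is not writable." % directory)
    return problems
```

Collecting all the problems into a list, rather than stopping at the first one, lets the user fix several mistakes in a single round.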

You can also just run the tests with ngless -n script-name.ngl: it does nothing except run all the validation steps.

Again, this is an idea that could be interesting to explore in the context of general purpose languages.

Make error messages helpful

[Image: “An unknown error occurred”. Caption: An unhelpful error message]

As much as possible, when an error is detected, the message should help you make sense of it and fix it. A tool cannot always read your mind, but as much as possible, ngless error messages are descriptive.

For example, if you used an illegal argument/parameter to a function, ngless will remind you of what the legal arguments are. If it cannot write to an output file it will say it cannot write to an output file (and not just “IO Error”). If a file is missing, it will tell you which file (and it will tell you in about 0.2 seconds).
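
As an illustration of the first case, here is a small Python sketch of an argument check that lists the legal arguments in its error message. The wording of the message is invented, and the "Did you mean" suggestion is my own addition for the sketch (using difflib from the standard library); it is not a claim about what ngless prints.

```python
import difflib


def check_arguments(function, given, legal):
    """Return an error message for the first illegal argument, or None."""
    for name in given:
        if name not in legal:
            message = "Function '%s' has no argument '%s'. Legal arguments are: %s." % (
                function, name, ", ".join(sorted(legal)))
            # Suggest the closest legal name, if there is a plausible one.
            close = difflib.get_close_matches(name, legal, n=1)
            if close:
                message += " Did you mean '%s'?" % close[0]
            return message
    return None
```

The important part is that the message names the function, the offending argument, and the alternatives, so that the user can fix the script without opening the documentation.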

Summary

Ngless is designed to make some errors impossible, while trying hard to give you good error messages for the errors that will inevitably crop up.

Perfect reproducibility using ngless [2/5]

NOTE: As of Feb 2016, ngless is available only as a pre-release to allow for testing of early versions by fellow scientists and to discuss ideas. We do not consider it /released/ and do not recommend use in production as some functionality is still in testing. Please get in touch if you are interested in using ngless in your projects.

This is the second of a series of five posts introducing ngless.

  1. Introduction to ngless
  2. Perfect reproducibility [this post]
  3. Fast and high quality error detection
  4. Extending and interacting with other projects
  5. Miscellaneous

Perfect reproducibility using ngless

With ngless, your analysis is perfectly reproducible forever. In particular, this is achieved by the use of explicit version strings in the ngless scripts. Let us look at the first two lines in the example we used before:

ngless "0.0"
import OceanMicrobiomeReferenceGeneCatalog version "1.0"

The first line notes the version of ngless that is to be used to run this script. At the moment, ngless is in a pre-release phase and so the version string is “0.0”. In the future, however, this will enable ngless to keep improving while still allowing all past scripts to work exactly as intended. No more, “I updated some software package and now all my scripts give me different results.” Everything is explicitly versioned.

There are several command line options for ngless, which can change the way that it works internally (e.g., where it keeps its temporary files, whether it is verbose or not, how many cores it should use, &c). You can also use a configuration file to set these options. However, no command line or configuration option changes the final output of the analysis. Everything you need to know about the results is in the script.

Reproducible, extendable, and reviewable

It’s not just about reproducibility. In fact, reproducibility is often not that great per se: if you have a big virtual machine image with 50 dependencies, which runs a 10,000 line hairy script to reproduce the plots in a paper, that’s reproducible, but not very useful (except if you want to really dig in). Ngless scripts, however, are easily extendable and even inspectable. Recall the rest of the script (the bits that do actual work):

input = paired('data/data.1.fq.gz', 'data/data.2.fq.gz')
preprocess(input) using |read|:
    read = substrim(read, min_quality=30)
    if len(read) < 45:
        discard

mapped = map(input, reference='omrgc')
summary = count(mapped, features=['ko', 'cog'])
write(summary, ofile='output/functional.summary.txt')

If you have ever worked a bit with NGS data, you can probably get the gist of what is going on. Except for maybe some of the details of what substrim does (it trims the read by finding the largest substring where all nucleotides are of at least the given quality, see docs), your guess of what is going on would be pretty accurate.
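
To make the description of substrim concrete, here is a Python sketch that follows the one-sentence definition above: find the longest contiguous stretch in which every base has at least the given quality. It is written from that description, not from the ngless source, so details such as how ties are broken (here: the leftmost stretch wins) are assumptions.

```python
def substrim(sequence, qualities, min_quality):
    """Return the longest stretch where every base has quality >= min_quality.

    `qualities` is a list of integer Phred scores, one per base. On ties,
    the leftmost stretch is returned (an arbitrary choice in this sketch).
    """
    best_start, best_len = 0, 0
    start = 0
    for i, q in enumerate(qualities):
        if q < min_quality:
            start = i + 1  # a bad base ends the current stretch
        elif i + 1 - start > best_len:
            best_start, best_len = start, i + 1 - start
    end = best_start + best_len
    return sequence[best_start:end], qualities[best_start:end]
```

For instance, with qualities [10, 30, 35, 5, 40, 40, 40, 2] and min_quality=30 there are two candidate stretches (positions 1-2 and positions 4-6), and the longer one is kept. The script then discards the read if what is left is shorter than 45 bases.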

It is easily extendable: If you want to add another functional table, perhaps using kegg modules, you just need to edit the features argument to the count function (you’d need to know which ones are available, but after looking that up, it’s a trivial change).

If you now wanted to perform a similar analysis on your data, I bet you could easily adapt the script for your settings.

§

A few weeks ago, I asked on twitter/facebook:

Science people: anyone know of any data on how often reviewers check submitted software/scripts if they are available? Thanks.

I didn’t get an answer for the original question, but there was a bit of discussion and as far as I know nobody really checks code submitted along with papers (which is, by itself, not enough of a reason to not demand code). However, if you were reviewing a paper and the supplemental material had the little script above, you could easily check it out and make sure the authors used the right settings and databases. The resulting code is inspectable and reviewable.

Introduction to ngless [1/5]

NOTE: As of Feb 2016, ngless is available only as a pre-release to allow for testing of early versions by fellow scientists and to discuss ideas. We do not consider it /released/ and do not recommend use in production as some functionality is still in testing. Please get in touch if you are interested in using ngless in your projects.

This is the first of a series of five posts introducing ngless.

  1. Introduction to ngless [this post]
  2. Perfect reproducibility
  3. Fast and high quality error detection
  4. Extending and interacting with other projects
  5. Miscellaneous

Introduction to ngless

Ngless is both a domain specific language and a tool for processing next generation sequencing data, with a focus (for the moment) on handling metagenomics data.

This is best explained by an example.

Let us say you have your paired end Illumina sequence data in two files data/data.1.fq.gz and data/data.2.fq.gz and want to build a functional profile using our gene catalog. Instead of downloading it yourself (from the excellent companion website) and figuring out all the necessary bits and pieces, you could just write into a file profile.ngl:

ngless "0.0"
import OceanMicrobiomeReferenceGeneCatalog version "1.0"

input = paired('data/data.1.fq.gz', 'data/data.2.fq.gz')
preprocess(input) using |read|:
    read = substrim(read, min_quality=30)
    if len(read) < 45:
        discard

mapped = map(input, reference='omrgc')
summary = count(mapped, features=['ko', 'cog'])
write(summary, ofile='output/functional.summary.txt')

and, after downloading ngless, run on the command line

$ ngless profile.ngl

Go get some coffee and after a while, the file functional.summary.txt will contain functional abundances for your sample (most of the time is spent on aligning the data to the catalog, and this depends on how big your dataset is).

The initial ngless download is pretty small (32MB right now) and so does not include the gene catalog (or any other database). However, the first time you run a script which uses the ocean catalog, it will download it (including all the functional annotation files). Ngless also includes all its own dependencies (it internally uses bwa for mapping), so you don’t need anything other than itself.

Why should you use ngless?

  • The analysis is perfectly reproducible (if I have the same dataset on my computer, I’ll get exactly the same output)
  • The script is easy to read and inspect. Given a script like this, it will be easy for you to adapt it to your data without reading through pages of docs.
  • There is a lot of implicit information (where is the catalog FASTA file, for example), which decreases the chance for errors. If you want to, you can configure where these files are stored and other such details, but you do not have to.
  • We will also see that ngless is very good about finding errors in your scripts before it runs them, thus speeding up the time it takes to analyse your data. No more shell scripts that run for a few hours and then fail at the last step because of a silly typo or missing argument.
  • Finally, there is nothing special about that OceanMicrobiomeReferenceGeneCatalog import: it’s just a text file which specifies where ngless should download the ‘omrgc’ reference. If you want to package your own references (for others or even just your internal usage), this is just a few lines in a configuration file.

The next few posts will go into these points in more depth.

Note on status (as of 8 Feb 2016): The example above does not work with the current version of ngless because although the code is all there, there is no public download source for the gene catalog. The following will work, though, if you manually download the data to the directory catalog:

ngless "0.0"

input = paired('data/data.1.fq.gz', 'data/data.2.fq.gz')
preprocess(input) using |read|:
    read = substrim(read, min_quality=30)
    if len(read) < 45:
        discard

mapped = map(input, fafile='catalog/omrgc.fna')
summary = count(mapped, features=['ko', 'cog'], functional_map='catalog/omrgc.map')
write(summary, ofile='output/functional.summary.txt')

Now, the first time you run this code, it will index the catalog.

Mounting My Phone as a Filesystem

After a lot of time spent trying to find the right app/software for getting things off my phone into my computer (I have spent probably days now, accumulated over my lifetime of phone-ownership; this shouldn’t be so hard), I just gave up and wrote a little FUSE script to mount the phone as a directory in Linux.

It is very basic and relies on adb (Android Debug Bridge) being installed and on the PATH; with that in place, I was able to just type:

$ mkdir -p phone
$ python android-fuse.py phone &

To get the phone mounted as a directory:

$ cd phone
$ ls -l
total 656 
drwxr-xr-x 1 root      root           0 Jan  1 00:44 acct 
drwxrwx--- 1 luispedro      2001      0 Jan  6 12:30 cache[...]

Now, I could navigate the directories and copy files from the phone to the computer (uploading files to the phone is not available as I don’t need it).

What was nice was that, using fusepy (which also needs to be available), the code wasn’t even too hard to write. Then I was able to see where my phone disk space was going (I keep running out of disk space) and delete a few gigabytes of pictures I had already saved somewhere else anyway (by making sure the hashes matched).
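
The hash check is easy to do once the phone is just a directory. Here is a Python sketch of the idea: hash every file in the backup, then list the files on the phone whose contents are already there. It is not part of android-fuse; the function names are made up, and the choice of SHA-256 is an assumption (any strong hash would do).

```python
import hashlib
import os


def file_hash(path, block_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def already_backed_up(phone_dir, backup_dir):
    """List files under phone_dir whose contents also exist under backup_dir."""
    backup_hashes = set()
    for root, _dirs, files in os.walk(backup_dir):
        for name in files:
            backup_hashes.add(file_hash(os.path.join(root, name)))
    duplicates = []
    for root, _dirs, files in os.walk(phone_dir):
        for name in files:
            path = os.path.join(root, name)
            if file_hash(path) in backup_hashes:
                duplicates.append(path)
    return sorted(duplicates)
```

Comparing by content rather than by file name means that renamed or reorganized backups are still recognized. Given how slow directory listings over adb are, expect this to take a while on a full phone.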

It’s available at: https://github.com/luispedro/android-fuse. But rely on it at your own risk! It is best-effort code and works on my phone, but I didn’t vet it 100%. It’s also kind of slow to list directories and such.

(It may also work on Mac, as Mac also has FUSE; but I cannot test it).