NOTE: As of Feb 2016, ngless is available only as a pre-release to allow for testing of early versions by fellow scientists and to discuss ideas. We do not consider it /released/ and do not recommend use in production as some functionality is still in testing. Please get in touch if you are interested in using ngless in your projects.
This is the first of a series of five posts introducing ngless.
- Introduction to ngless [this post]
- Perfect reproducibility
- Fast and high quality error detection
- Extending and interacting with other projects
Introduction to ngless
Ngless is both a domain-specific language and a tool for processing next-generation sequencing data, with a focus (for the moment) on metagenomics data.
This is best explained by an example.
Let us say you have your paired-end Illumina sequence data in two files, data/data.1.fq.gz and data/data.2.fq.gz, and want to build a functional profile using our gene catalog. Instead of downloading it yourself (from the excellent companion website) and figuring out all the necessary bits and pieces, you could just write the following into a file (here called profile.ngl):
    ngless "0.0"
    import OceanMicrobiomeReferenceGeneCatalog version "1.0"

    input = paired('data/data.1.fq.gz', 'data/data.2.fq.gz')
    preprocess(input) using |read|:
        read = substrim(read, min_quality=30)
        if len(read) < 45:
            discard

    mapped = map(input, reference='omrgc')
    summary = count(mapped, features=['ko', 'cog'])
    write(summary, ofile='output/functional.summary.txt')
and, after downloading ngless, run on the command line
$ ngless profile.ngl
Go get some coffee and, after a while, the file output/functional.summary.txt will contain the functional abundances for your sample (most of the time is spent aligning the data to the catalog, so the running time depends on how big your dataset is).
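To make the preprocess block in the script above more concrete, here is a minimal Python sketch of what trimming and filtering a single read amounts to. It assumes that substrim keeps the longest contiguous stretch of bases whose quality is at least min_quality. The function preprocess_read is a hypothetical name used only for this illustration, and none of this is ngless's actual implementation.

```python
from typing import List, Optional, Tuple


def substrim(seq: str, quals: List[int], min_quality: int) -> Tuple[str, List[int]]:
    """Keep the longest contiguous run of bases with quality >= min_quality.

    Assumption: this mirrors the behaviour described for ngless's substrim;
    if several runs tie for longest, the first one is kept.
    """
    best_start, best_len = 0, 0
    start = 0  # start of the current run of good bases
    for i, q in enumerate(quals):
        if q < min_quality:
            # A low-quality base ends the current run; the next run starts after it.
            start = i + 1
        elif i - start + 1 > best_len:
            best_start, best_len = start, i - start + 1
    end = best_start + best_len
    return seq[best_start:end], quals[best_start:end]


def preprocess_read(
    seq: str, quals: List[int], min_quality: int = 30, min_length: int = 45
) -> Optional[Tuple[str, List[int]]]:
    """Trim a read, then discard it (return None) if what remains is too short."""
    trimmed_seq, trimmed_quals = substrim(seq, quals, min_quality)
    if len(trimmed_seq) < min_length:
        return None
    return trimmed_seq, trimmed_quals
```

With the defaults, a read survives only if it contains at least 45 consecutive bases of quality 30 or better, which matches the two lines inside the preprocess block.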
The initial ngless download is pretty small (32MB right now) and so does not include the gene catalog (or any other database). However, the first time you run a script which uses the ocean catalog, ngless will download it (including all the functional annotation files). Ngless also bundles all its own dependencies (it internally uses bwa for mapping), so you do not need to install anything else.
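The "download on first use" behaviour described above is a common caching pattern. The Python sketch below shows the general idea only: the function ensure_reference, the cache directory layout, and the fetch callback are all hypothetical names chosen for this illustration, and ngless's real download logic may well differ.

```python
from pathlib import Path
from typing import Callable


def ensure_reference(name: str, cache_dir: Path, fetch: Callable[[str, Path], None]) -> Path:
    """Return the local path of a reference, fetching it only if it is not cached yet.

    `fetch(name, destination)` is a caller-supplied function (an assumption of
    this sketch) that writes the reference called `name` to `destination`.
    """
    target = cache_dir / name
    if not target.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        # Fetch into a temporary name first so that an interrupted download
        # is never mistaken for a complete reference on the next run.
        partial = cache_dir / (name + ".partial")
        fetch(name, partial)
        partial.replace(target)
    return target
```

The first call pays the cost of the download; every later call finds the file in the cache and returns immediately.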
Why should you use ngless?
- The analysis is perfectly reproducible (if I have the same dataset on my computer, I’ll get exactly the same output)
- The script is easy to read and inspect. Given a script like this, it will be easy for you to adapt it to your data without reading through pages of docs.
- There is a lot of implicit information (where is the catalog FASTA file, for example), which decreases the chance for errors. If you want to, you can configure where these files are stored and other such details, but you do not have to.
- We will also see that ngless is very good at finding errors in your scripts before it runs them, thus reducing the time it takes to analyse your data. No more shell scripts that run for a few hours and then fail at the last step because of a silly typo or missing argument.
- Finally, there is nothing special about that OceanMicrobiomeReferenceGeneCatalog import: it’s just a text file which specifies where ngless should download the ‘omrgc’ reference from. If you want to package your own references (for others or even just your internal usage), this is just a few lines in a configuration file.
The next few posts will go into these points in more depth.
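As a rough illustration of the "find errors before running" point in the list above, the following Python sketch collects problems that can be detected up front, before any expensive step starts. The specific checks (input files exist, output directories exist and are writable) and the name preflight_errors are assumptions made for this example; they are not a description of what ngless actually checks.

```python
import os
from typing import List


def preflight_errors(input_files: List[str], output_files: List[str]) -> List[str]:
    """Collect every problem that can be detected before heavy computation starts."""
    errors = []
    for path in input_files:
        if not os.path.isfile(path):
            errors.append(f"input file not found: {path}")
    for path in output_files:
        # An output path without a directory part is written to the current directory.
        out_dir = os.path.dirname(path) or "."
        if not os.path.isdir(out_dir):
            errors.append(f"output directory does not exist: {out_dir}")
        elif not os.access(out_dir, os.W_OK):
            errors.append(f"output directory is not writable: {out_dir}")
    return errors
```

Reporting all problems at once, rather than stopping at the first, means a mistyped output path is caught in seconds instead of after hours of alignment.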
Note on status (as of 8 Feb 2016): The example above does not work with the current version of ngless because, although the code is all there, there is no public download source for the gene catalog. The following will work, though, if you manually download the data to the paths used in the script (catalog/omrgc.fna and catalog/omrgc.map):
    ngless "0.0"

    input = paired('data/data.1.fq.gz', 'data/data.2.fq.gz')
    preprocess(input) using |read|:
        read = substrim(read, min_quality=30)
        if len(read) < 45:
            discard

    mapped = map(input, fafile='catalog/omrgc.fna')
    summary = count(mapped, features=['ko', 'cog'], functional_map='catalog/omrgc.map')
    write(summary, ofile='output/functional.summary.txt')
Now, the first time you run this code, it will index the catalog.