Extending ngless and interacting with other projects [4/5]

NOTE: As of Apr 2016, ngless is available only as a pre-release to allow for testing of early versions by fellow scientists and to discuss ideas. We do not consider it /released/ and do not recommend use in production as some functionality is still in testing. Please get in touch if you are interested in using ngless in your projects.

This is the fourth of a series of five posts introducing ngless.

  1. Introduction to ngless
  2. Perfect reproducibility using ngless
  3. Fast and high quality error detection
  4. Extending and interacting with other projects [this post]
  5. Miscellaneous

Extending and interacting with other projects

A frequently asked question about ngless is whether the language is extensible. Yes, it is. You can add modules using a simple text-only format (YAML). These modules can then add new functions to ngless. Behind the scenes, calling one of these functions results in a command line call to a script you write.

For example, to integrate motus into ngless, I used a simple configuration file, which I am going to describe here.

Every module has a name and a version:

name: 'motus'
version: '0.0.0'

You can add a citation text. This will be shown to all users of your module (citing the software you use is a best practice, so we support it):

citation: "Metagenomic species profiling using universal phylogenetic marker genes"

You can add an init command. This will run before anything else runs at the start of the interpretation. It should be quick and check that things are OK. For example, in this case, we check that Python is installed. Thus, if there is a problem, the user gets a fast error message before anything else is run.

    init_cmd: './check-python.sh'

Now, we list the functions we are implementing:


In this case, there is just one, corresponding to the ngless function motus.

    nglName: "motus"

arg0 is the command to run (which implements this function):

    arg0: './run-python.sh'

In ngless, functions have a single unnamed argument and any number of named arguments. So, we first specify arg1, which is a special

        filetype: "tsv"
        can_gzip: true

The can_gzip flag lets ngless know that it is OK to pass a compressed file to your script. Now, we list any additional arguments. In this case, there is a required argument:

            atype: 'str'
            name: 'ofile'
            def: ''
            required: true

The argument is a string, without a default. That’s it. Now, we can use the motus function in an ngless script:

ngless "0.0"
import "motus" version "0.0.0"

input = paired('data/reads.1.fq.gz', 'data/reads.2.fq.gz')
preprocess(input, keep_singles=False) using |read|:
    read = substrim(read, min_quality=25)
    if len(read) < 45:
        discard

mapped = map(input, ref='motus')
mapped = select(mapped) using |mread|:
    mread = mread.filter(min_identity_pc=97)
counted = count(mapped, gff_file='motus.gtf.gz', features=['gene'], multiple={dist1})
motus(counted, ofile='motus-counts.txt')
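
On the script side, the module author has to honour what the configuration promises. The post does not show the script behind run-python.sh, so the following is only a minimal sketch in Python of what such a script could look like. Everything in it is an assumption for illustration: the names open_maybe_gzip and main, the convention of a positional input file plus an --ofile option (how ngless actually passes arguments is not described above), and the placeholder processing that just copies rows. The one point it is meant to show is how a script can accept a TSV file that may or may not be gzip-compressed, which is what can_gzip: true allows ngless to hand over.

```python
import argparse
import gzip


def open_maybe_gzip(path):
    """Open a text file that may be gzip-compressed.

    Compression is detected from the two gzip magic bytes rather than
    from the file extension.
    """
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == b"\x1f\x8b"
    if is_gzip:
        return gzip.open(path, "rt")
    return open(path, "rt")


def main(argv):
    # ASSUMPTION: the input file arrives as a positional argument and the
    # named argument as --ofile; the real ngless calling convention may differ.
    parser = argparse.ArgumentParser()
    parser.add_argument("counts")
    parser.add_argument("--ofile", required=True)
    args = parser.parse_args(argv)

    n_rows = 0
    with open_maybe_gzip(args.counts) as src, open(args.ofile, "w") as dst:
        for line in src:
            # Placeholder processing: copy every row through unchanged.
            dst.write(line)
            n_rows += 1
    return n_rows
```

A real script would call main(sys.argv[1:]) at the bottom and replace the copy loop with the actual computation. Checking the magic bytes means the same code path works whether or not the file was compressed.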

What can modules do?

An external module can

  • add new functions (will result in a call to a script, which will often be a wrapper around some tool).
  • add new reference information (new catalogs, &c). This can even be downloaded on demand (currently [Apr 2016], the module init script must do this itself; in the future, ngless will support just a URL).
  • add a citation so that all users of the module will see the citation message. This ensures that if you develop a package which gets wrapped into an ngless module, those final users will still see your citation.

New Paper: Metagenomic insights into the human gut resistome and the forces that shape it

Metagenomic insights into the human gut resistome and the forces that shape it by Kristoffer Forslund, Shinichi Sunagawa, Luis P. Coelho, Peer Bork in Bioessays (2014) DOI:10.1002/bies.201300143

This is a new paper which I was a part of. I will let Kristoffer Forslund (the first author) introduce it:

“Everyone knows” that feeding antibiotics to food animals is putting us on a path to the resistant bacterial apocalypse. However, published studies on the matter are less clear, with some authors arguing it is and others defending current use practices. So far progress in finding a definite answer has been limited due to experimental methods being expensive and cumbersome. Metagenomics offers new possibilities for understanding the evolution of antibiotic resistance and its causes, and in this review we both summarize the fledgling subfield, and present some new results of our own describing the distribution of antibiotic resistance genes in human gut microbial genomes, which we find reflects both medical and food production antibiotic use.

Notes on #ISMBECCB Highlights Session (Sunday Morning)

(I missed the first half of the first talk, so I won’t include it. Also, the internet is not good enough for me to get all the links. Sorry)

Of Men and Not Mice: Comparative Genomic Analysis of Human Diseases and Mouse Models by Wenzhong Xiao

Wenzhong Xiao presented an empirical study of the correlation between the immune responses of mice and men. The correlations were very low, which is a warning to be careful in interpreting results from animal models. Money quote: “Mice are not human. There are several reasons for that.”

An audience member raised the possibility of using humanized models, which was a great point. I’ll add that the immune system and immune system dysfunction may be where mice and men differ the most and results there do not invalidate results in other areas of study.

Impact of genetic dynamics and single-cell heterogeneity on development of nonstandard personalized medicine strategies for cancer by Chen-Hsiang Yeang

Simulation study of different strategies for cancer treatment in the presence of resistant mutations. “The current system is often like a greedy algorithm: do X until resistance to X emerges, switch to Y. Repeat. Better strategies are possible.”

Very interesting points come out of simple models, but it felt like the start of a conversation rather than an answer.

Interesting presentation detail: author used references to video games as one would use references to literature 100 years ago.

Systems-based metatranscriptomic analysis by Xuejian Xiong.

Original Paper: He, D., Miao, M., Sitarz, E.E., Muiznieks, L.D., Reichheld, S., Stahl, R.J., Keeley, F.W. and Parkinson, J. (2012) Polymorphisms in the Human Tropoelastin Gene Modify in vitro Self-Assembly and Mechanical Properties of Elastin-like Polypeptides. PLoS ONE. 7(9): e46130

Study on non-obese diabetic mice with Illumina sequencing. They projected their reads into enzyme space to perform analysis at the metabolic network level.

Interesting technical points: they use alignment in peptide space instead of nucleotide space to get around variability in codon encoding. They also found that Trinity worked best for their data.
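
To make the codon point concrete, here is a small sketch in Python (my own illustration, not code from the talk). The two nucleotide sequences are invented and differ only at synonymous positions; the helper names translate and identity are likewise made up. The codon table is the standard genetic code. This is not an aligner: it only shows why comparing in peptide space hides differences that look like mismatches in nucleotide space.

```python
import itertools

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in itertools.product(BASES, repeat=3)]
CODON_TABLE = dict(zip(CODONS, AMINO_ACIDS))


def translate(nt):
    """Translate an upper-case DNA string in frame 0, ignoring a trailing partial codon."""
    usable = len(nt) - len(nt) % 3
    return "".join(CODON_TABLE[nt[i:i + 3]] for i in range(0, usable, 3))


def identity(a, b):
    """Fraction of identical positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Two invented sequences that differ only by synonymous codon choices.
seq_a = "ATGGCTCTGAAA"
seq_b = "ATGGCCTTAAAG"

nucleotide_identity = identity(seq_a, seq_b)
peptide_identity = identity(translate(seq_a), translate(seq_b))
print(nucleotide_identity, peptide_identity)
```

Both sequences encode the same four-residue peptide, so the peptide-level identity is perfect even though several nucleotide positions differ.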

Paper Review: Probabilistic Inference of Biochemical Reactions in Microbial Communities from Metagenomic Sequences


Paper: Probabilistic Inference of Biochemical Reactions in Microbial Communities from Metagenomic Sequences by Dazhi Jiao, Yuzhen Ye, and Haixu Tang

This is a recent paper in PLoS Computational Biology which looks at the following problem:

  1. Metagenomics (the sequencing of genomes in mixed population samples) gives us many genes. Many of these genes can be mapped to enzymes.
  2. Those enzymes can be mapped (with a database such as KEGG) to reactions, but these assignments are ambiguous.
  3. Can we obtain a consensus interaction network?

The authors approach the problem probabilistically by computing, for each possible interaction network, a probability. Their model is based on the notion that real interaction networks probably have fewer metabolites than all possible combinations. From the paper (my emphasis):

However, if the product of a gene is annotated to catalyze multiple reactions, some of these reactions may be excluded from the sampled subnetwork, as long as at least one of these reactions is included. We note that, according to this condition, each sampled subnetwork represents a putative reconstruction of the collective metabolic network of the metagenome, among which we assume the subnetworks containing fewer metabolites are more likely to represent the actual metabolism of the microbial community than the ones containing more metabolites.

From this idea, using standard MCMC they are able to assign to each reaction a probability that it is part of a community.
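
To make the idea concrete, here is a toy sketch in Python. It is my own illustration and not the authors' model: I assume a subnetwork's weight is exp(-beta * number of metabolites), use a plain Metropolis sampler that toggles one reaction at a time, and reject any move that would leave a gene without a reaction. The function names, the beta parameter, and the tiny example network are all made up.

```python
import math
import random


def n_metabolites(included, reaction_metabolites):
    """Number of distinct metabolites touched by the included reactions."""
    mets = set()
    for r in included:
        mets |= reaction_metabolites[r]
    return len(mets)


def is_valid(included, gene_reactions):
    """Every gene must keep at least one of its annotated reactions."""
    return all(any(r in included for r in rs) for rs in gene_reactions.values())


def reaction_marginals(gene_reactions, reaction_metabolites,
                       beta=1.0, n_steps=20000, burn_in=2000, seed=0):
    """Estimate P(reaction is in the network) with a Metropolis sampler.

    ASSUMPTION: target weight of a subnetwork is exp(-beta * #metabolites).
    """
    rng = random.Random(seed)
    reactions = sorted(reaction_metabolites)
    current = set(reactions)  # the full network always satisfies the constraint
    cur_size = n_metabolites(current, reaction_metabolites)
    counts = dict.fromkeys(reactions, 0)
    kept = 0
    for step in range(n_steps):
        # Symmetric proposal: toggle one uniformly chosen reaction.
        proposal = current ^ {rng.choice(reactions)}
        if is_valid(proposal, gene_reactions):
            new_size = n_metabolites(proposal, reaction_metabolites)
            if rng.random() < math.exp(-beta * (new_size - cur_size)):
                current, cur_size = proposal, new_size
        if step >= burn_in:
            kept += 1
            for r in current:
                counts[r] += 1
    return {r: counts[r] / kept for r in reactions}


# Toy network: gene g2 is ambiguous between reactions R2 and R3.
gene_reactions = {"g1": ["R1"], "g2": ["R2", "R3"]}
reaction_metabolites = {
    "R1": {"A", "B"},
    "R2": {"B", "C"},
    "R3": {"X", "Y", "Z"},
}
marginals = reaction_marginals(gene_reactions, reaction_metabolites)
print(marginals)
```

In this toy network, R2 shares a metabolite with R1 while R3 brings in three new ones, so the sampler spends most of its time in subnetworks that contain R2 and R3 ends up with a low marginal.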

They validate their method using clustering. They show that using their probability assignments results in better separation of samples than relying on naïve assignments of all enzymes to all possible reactions. The result is nice and clean.

To reduce this to bare essentials, the point is that their method (on the right) gets the separation between the different types of samples (represented by different symbols) better than any alternatives.

Hierarchical clustering of 44 IMG/M metagenomics samples represented in dendrograms.

They also suggest that they are able to extract differentially present reactions better than the baseline methods. Unfortunately, due to the lack of a validated result, it is really impossible to know whether they just got more false positives. I do not really know how to do it better, though. This is just one of those fundamental problems in the field: the lack of validated information to build upon.

However, it is good to be able to even talk of differentially expressed reactions instead of just genes or orthologous groups.

Overall, the authors present an interesting formulation of a hard problem. I always like the idea of handling uncertainty probabilistically and it is good to see that it really does work.

This is the sort of paper that opens up a bunch of questions immediately on extensions:

  • Can similar methods handle uncertainty in the basic gene assignments?
  • Or KEGG annotations?

Currently, they assume that all enzymes are actually present and perform one of the functions listed, but neither of these statements is always true.

Another direction in which their method could be taken is to move up from computing marginal probabilities of single reactions to computing probabilities of small subnetworks. I hope that the authors are exploring some of these questions and will present us with some follow-up work in the near future.