Perfect reproducibility using ngless [2/5]

NOTE: As of Feb 2016, ngless is available only as a pre-release to allow for testing of early versions by fellow scientists and to discusss ideas. We do not consider it /released/ and do not recommend use in production as some functionality is still in testing. Please if you are interested in using ngless in your projects.

This is the second of a series of five posts introducing ngless.

  1. Introduction to ngless
  2. Perfect reproducibility [this post]
  3. Fast and high quality error detection
  4. Extending and interacting with other projects
  5. Miscellaneous

Perfect reproducibility using ngless

With ngless, your analysis is perfectly reproducible forever. In particular, this is achieved by the use of explicit version strings in the ngless scripts. Let us the first two lines in the example we used before:

ngless "0.0"
import OceanMicrobiomeReferenceGeneCatalog version "1.0"

The first line notes the version of ngless that is to be used to run this script. At the moment, ngless is in a pre-release phase and so the version string is “0.0”. In the future, however, this will enable ngless to keep improving while still allowing all past scripts to work exactly as intendended. No more, “I updated some software package and now all my scripts give me different results.” Everything is explicitly versioned.

There are several command line options for ngless, which can change the way that it works internally (e.g., where it keeps its temporary files, whether it is verbose or not, how many cores it should use, &c). You can also use a configuration file to set these options. However, no command line or configuration option change the final output of the analysis. Everything you need to know about the results is in the script.

Reproducible, extendable, and reviewable

It’s not just about reproducibility. In fact, reproducibility is often not that great per se: if you have a big virtual machine image with 50 dependencies, which runs a 10,000 line hairy script to reproduce the plots in a paper, that’s reproducible, but not very useful (except if you want to really dig in). Ngless scripts, however, are easily extendable and even inspectable. Recall the rest of the script (the bits that do actual work):

input = paired('data/data.1.fq.gz', 'data/data.2.fq.gz')
preprocess(input) using |read|:
    read = substrim(read, min_quality=30)
    if len(read) < 45:
        discard

mapped = map(input, reference='omrgc')
summary = count(mapped, features=['ko', 'cog'])
write(summary, ofile='output/functional.summary.txt')

If you have ever worked a bit with NGS data, you can probably get the gist of what is going on. Except for maybe some of the details of what substrim does (it trims the read by finding the largest sustring where all nucleotides are of at least the given quality, see docs), your guess of what is going on would be pretty accurate.

It is easily extendable: If you want to add another functional table, perhaps using kegg modules, you just need to edit the features argument to the count function (you’d need to know which ones are available, but after looking that up, it’s a trivial change).

If you now wanted to perform a similar analysis on your data, I bet you could easily adapt the script for your settings.

§

A few weeks ago, I asked on twitter/facebook:

Science people: anyone know of any data on how often reviewers check submitted software/scripts if they are available? Thanks.

//platform.twitter.com/widgets.js

I didn’t get an answer for the original question, but there was a bit of discussion and as far as I know nobody really checks code submitted along with papers (which is, by itself, not a enough of a reason to not demand code). However, if you were reviewing a paper and the supplemental material had the little script above, you could easily check it out and make sure the authors used the right settings and databases. The resulting code is inspectable and reviewable.

Introduction to ngless [1/5]

NOTE: As of Feb 2016, ngless is available only as a pre-release to allow for testing of early versions by fellow scientists and to discusss ideas. We do not consider it /released/ and do not recommend use in production as some functionality is still in testing. Please if you are interested in using ngless in your projects.

This is the first of a series of five posts introducing ngless.

  1. Introduction to ngless [this post]
  2. Perfect reproducibility
  3. Fast and high quality error detection
  4. Extending and interacting with other projects
  5. Miscellaneous

Introduction to ngless

Ngless is both a domain specific language and a tool for processing next generation with a focus (for the moment) on handling metagenomics data.

This is best explained by an example.

Let us say you have your paired end Illumina sequence data in two files data/data.1.fq.gz and data/data.2.fq.gz and want to build a functional profile using our gene catalog. Instead of downloading it yourself (from the excellent companion website) and figuring out all the necessary bits and pieces, you could just write into a file profile.ngl:

ngless "0.0"
import OceanMicrobiomeReferenceGeneCatalog version "1.0"

input = paired('data/data.1.fq.gz', 'data/data.2.fq.gz')
preprocess(input) using |read|:
    read = substrim(read, min_quality=30)
    if len(read) < 45:
        discard

mapped = map(input, reference='omrgc')
summary = count(mapped, features=['ko', 'cog'])
write(summary, ofile='output/functional.summary.txt')

and, after downloading ngless, run on the command line

$ ngless profile.ngl

Go get some coffee and after a while, the file functional.summary.txt will contain functional abundances for your sample (most of the time is spent on aligning the data to the catalog, and this depends on how big your dataset is).

The initial ngless download is pretty small (32MB right now) and so does not include the gene catalog (or any other database). However, the first time you run a script which uses the ocean catalog, it will download it (including all the functional annotation files). Ngless also includes all its own dependencies (it internally uses bwa for mapping), so you don’t need anything other than itself.

Why should you use ngless?

  • The analysis is perfectly reproducible (if I have the same dataset on my computer, I’ll get exactly the same output)
  • The script is easy to read and inspect. Given a script like this, it will be easy for you to adapt it to your data without reading through pages of docs.
  • There is a lot of implicit information (where is the catalog FASTA file, for example), which decreases the chance for errors. If you want to, you can configure where these files are stored and other such details, but you do not have to.
  • We will also see that ngless is very good about finding errors in your scripts before it runs them, thus speeding up the time it takes to analyse your data. No more shell scripts that run for a few hours and then fail at the last step because of a silly typo or missing argument.
  • Finally, there is nothing special about that OceanMicrobiomeReferenceGeneCatalog import: it’s just a text file which specifies where ngless should download the ‘omrgc’ reference. If you want to package your own references (for others or even just your internal usage), this is just a few lines in a configuration file.

The next few posts will go into these points in more depth.

Note on status (as of 8 Feb 2016): The example above does not work with the current version of ngless because although the code is all there, there is no public download source for the gene catalog. The following will work, though, if you manually download the data to the directory catalog:

ngless "0.0"

input = paired('data/data.1.fq.gz', 'data/data.2.fq.gz')
preprocess(input) using |read|:
    read = substrim(read, min_quality=30)
    if len(read) < 45:
        discard

mapped = map(input, fafile='catalog/omrgc.fna')
summary = count(mapped, features=['ko', 'cog'], functional_map='catalog/omrgc.map')
write(summary, ofile='output/functional.summary.txt')

Now, the first time you run this code, it will index the catalog.

Mounting My Phone as a Filesystem

After a lot of time spent trying to find the right app/software for getting things off my phone into my computer (I have spent probably days now, accumulated over my lifetime of phone-ownership; this shouldn’t be so hard), I just gave up and wrote a little FUSE script to mount the phone as a directory in Linux.

It is very basic and relies on adb (android debug bridge) being installed on the PATH, but if it is present, I was able to just type:

$ mkdir -p phone
$ python android-fuse.py phone &

To get the phone mounted as a directory:

$ cd phone
$ ls -l
total 656 
drwxr-xr-x 1 root      root           0 Jan  1 00:44 acct 
drwxrwx--- 1 luispedro      2001      0 Jan  6 12:30 cache[...]

Now, I could navigate the directories and files from the phone to the computer (uploading files in is not available as I don’t need it).

What was nice was that, using fusepy (which also needs to be available), the code wasn’t too hard to write even. Then I was able to see where my phone hard drive disk was going (I keep running out of disk space) and delete a few Gigabytes of pictures I had anyone already saved somewhere else (by making sure the hashes matched).

It’s available at: https://github.com/luispedro/android-fuse. But rely on it at your own risk! It a best-attempt code and works on my phone, but I didn’t vet it 100%. It’s also kind of slow to list directories and such.

(It may also work on Mac, as Mac also has FUSE; but I cannot test it).

XKCD Explains why food in San Sebastian is so good

tl;dr: If you like food and you have not yet been to San Sebastian, you’re doing life wrong.

I think this xkcd comic explains why the food in San Sebastian is so good:

XKCD

This is a place where many bars serve elaborate pintxos (small dishes) to a wide audience, food items which would be fancy and expensive cuisine anywhere else:

pork hand with liver

On the TV, there was an ad for a place that sells molecular cuisine equipment.

Here they are cooking elaborate food all the time. So, when you are good here, you are at the top of the world. This is how this small town has two restaurants in the top 20 of Restaurant Magazine. This town has a total of 17 Michelin stars (including three 3-star restaurants). It’s absurd.

§

PS: The James Joyce reference in the XKCD seems to be real. Here one collection of his letters, a portrait of the artist as a pervert

Friday Links

1. This Sokalization of the nutrition literature is brilliant. A journalist got a paper published showing that chocolate helps you lose weight just to show how easy it is to get random results published (emphasis min):

Other than [concealing my identity], the study was 100 percent authentic. My colleagues and I recruited actual human subjects in Germany. We ran an actual clinical trial, with subjects randomly assigned to different diet regimes. And the statistically significant benefits of chocolate that we reported are based on the actual data. It was, in fact, a fairly typical study for the field of diet research. Which is to say: It was terrible science. The results are meaningless, and the health claims that the media blasted out to millions of people around the world are utterly unfounded.

[…]

Or as one prescient reader of the 4 April story in the Daily Express put it, “Every day is April Fool’s in nutrition.”

Long time readers will know that the awful state of nutrition “research” is a pet peeve of mine and that I think it undermines trust in science more generally.

2. If you didn’t see it, we got published in Science last week. If you don’t have time to read the paper, just listen to the rap summary (which features a plot I did).

3. There is a vaccine for Lyme Disease, which is not used because of anti-vaccination concerns. Here is the Nature News writeup (/ht ScienceBabe).

Also, I learned that there are more cases in Europe than in the US, the vaccine was initially based on work from the University of Heidelberg, but the vaccine is not offered here either.

Thoughts on driverless cars

Since I’m getting a lot of hits from marginal revolution readers, let me write down a few thoughts I’ve had for a while on a favorite topic there: driverless cars.

1. It will be a wonderful thing. Besides the 1.2 million killed per year (if that is something you want to shrug off with a Besides), driving is a major source of stress in people’s lives. For many, their commute is the absolutely worst part of their day. There are also a not-insignificant number of people who, for one reason or another, do not drive which limits their choices in all sorts of minor ways (which jobs they can take, where they can live, who they can meet socially…)

2. Regulatory issues will be solved. As I wrote, millions of people die in car crashes every year and so the idea that regulators will stop this technology is, even for me, who am not generally a big believer in the abilities of regulatory authorities, is too horrific to contemplate.  It is an interesting reflection on the Great Stagnation that some people think that there may be technology to save a million lives per year (about twice as much as malaria), but our public regulators are so dysfunctional that they will stop it.

Of course, it’s a fallacy to assume that just because something is horrific, it will not happen. However, I have a bit more trust in the power of large corporations to obtain regulatory approval for their products than the pundits who argue this will be a stopping block. There are also intermediate steps that may be taken so that slowly the world moves from the current state to the new, better, driverless cars, state. The introduction of driverless trucks only on some roads is exactly one such step. In the same way, the liability issues will eventually be solved. Regulations may delay the introduction of driverless cars¹, but not stop it.

You just need a one or two jurisdictions to take that first step. It will probably be one of the most enlightened polities like Sweden, Singapore, or Nevada (the UK?). Then, slowly, the rest of the world will copy it.

3. (Human) driving will be illegal. Besides the fact that it kills so many, it will just be possible to have humans drive through intersections like this one. There won’t be any more need for traffic lights, no parking signs, &c If a city wants to cut off a specific street for construction work, or reroute traffic, just update the online map and broadcast it: the driverless cars will then know to avoid that street. So, pretty quickly it will become impossible for humans to drive the streets.

4. Driverless cars will change citiesThey will enable much higher density. Have you ever driven through an empty large city in the middle of the night and felt amazed at how fast you got around without all the traffic? Even Manhattan could be crossed in less than half an hour from one end to the other. It will be like this all the time. Except that there may be second-order effects where even more people move into dense city centres. At the same time, streets can be narrower, parking can be taken off the street (just have the car drive you to your door and then go park itself somewhere far), so there is more buildable area. This line of reasoning leads us to expect much higher densities.

They will also enable much lower density. It is nice to live out in the country, but long commutes are horrible. But if the car is driving itself while you catch up on email or the blogs… This line of reasoning leads us to expect much lower densities and gigantic sprawl.

Maybe will get both: very dense city centres for the young and hip, with a huge suburban rings. Average density is over.

In reality, it’s hard to predict how all of these forces will play out as they push and pull in different directions. However, cities will probably not look the same at all.

5. Driverless taxis and mini-busses will be the future of public transport.

6. What is the time frame? Pretty soon, we’ll have to buy a new car for my wife. I don’t yet worry about the fact that eventually driverless cars will drive down [sorry] the resale value of driver-needed cars, but at some point, yes, this will happen.

However, for a city project, the time frame may be already close enough that any city starting a mass transit project should consider the possibility of driverless cars already. A new tram line whose planning is starting today may only open 10 or 15 years in the future. Is it really a good idea if the possibility of it being obsolete before it even opens is so real? This feels like trolling (imagine showing up at a public hearing and asking your city council what happens with driverless cars), but mass transit may just not make too much sense anymore.

7. It won’t be good for the environment. Yes, it’s possible to have more efficient cars, car sharing as the norm, &c. But lowering the cost of moving around so enormously will certainly make people move around more. Why not go visit someone who lives 1000 miles away on the week-end if the car can drive itself during the night. Fall asleep here, wake up there (without all the hassle of a public sleeping car in a train).

Perhaps if there is a major breakthrough in nuclear fusion, we can replace all the cars by fusion-charged electrics, but until that happens, the small gains in efficiency will most likely be outweighed by the increase in consumption.

¹ Thousands of people died in this sentence, maybe millions. Even a public hearing which delays decision a few weeks, will cause thousands of deaths. Note that this still happens even if the driverless car roll-out process is spread out over many years, so that the additional deaths do not happen in a  one time event. It’s the way that regulatory agencies kill: they delay the introduction of life-saving technologies and the deaths caused are statistical and invisible.