Big Data Biology Lab Software Tool Commitments

Cross-posting from our group website.

Preamble. We produced two types of code artefacts: (i) code that is supportive of results in a results-driven paper and (ii) software tools intended for widespread use.

For an example of the first type, see the Code Ocean capsule that is linked to (Coelho et al., 2018). The main goal of this type of code release is to serve as an Extended Methods section to the paper. Hopefully, it will be useful for the small minority of readers of the paper who really want to dig into the methods or build upon the results, but the work aims at biological results.

This document focuses on the second type of code release: tools that are intended for widespread use. We’ve released a few of these in the last few years: JugNGLess, and Macrel. Here, the whole point is that others use the code. We also use these tools internally, but if nobody else ever adopts the tools, we will have fallen short.

The Six Commitments

  1. Five-year support (from date of publication) If we publish a tool as a paper, then we commit to supporting it for at least five years from the date of publication. We may stop developing new features, but if there are bugs in the released version, we will assume responsibility and fix them. We will also do any minor updates to keep the tool running (for example, if a new Python version breaks something in one of our Python-based tools, we will fix it). Typically, support is provided if you open an issue on the respective Github page and/or post to the respective mailing-list.
  2. Standard, easy to install, packages Right now, this means: we provide conda packages. In the future, if the community moves to another system, we may move too.
  3. High-quality code with continuous integration All our published packages have continuous integration and try to follow best practices in coding.
  4. Complete documentation We provide documentation for the tools, including tutorials, example data, and reference manuals.
  5. Work well, fail well We strive to make our tools not only work well, but also “fail well”: that is, when the user provides erroneous input, we attempt to provide good quality error messages and to never produce bad output (including never producing partial outputs if the process terminates part-way through processing).
  6. Open source, open communication Not only do we provide the released versions of our tools as open source, but all the development is done in the open as well.

Note for group members: This is a commitment from the group and, at the end of the day, the responsibility is Luis’ responsibility. If you leave the group, you don’t have to be responsible for 5 years. If you leave, your responsibility is just the basic responsibility of any author: to be responsive to queries about what was described in the manuscript, but not anything beyond that. What it does mean is that we will not be submitting papers on tools that risk being difficult to maintain. In fact, while the goals above are phrased as outside-focused, they are also internally important so that we can keep working effectively even as group members move on.

Extending ngless and interacting with other projects [4/5]

NOTE: As of Apr 2016, ngless is available only as a pre-release to allow for testing of early versions by fellow scientists and to discusss ideas. We do not consider it /released/ and do not recommend use in production as some functionality is still in testing. Please if you are interested in using ngless in your projects.

This is the first of a series of five posts introducing ngless.

  1. Introduction to ngless
  2. Perfect reproducibility using ngless
  3. Fast and high quality error detection
  4. Extending and interacting with other projects [this post]
  5. Miscellaneous

Extending and interacting with other projects

A frequently asked question about ngless is whether the language is extensible. Yes, it is. You can add modules using a simple text-only format (YaML). These modules can then add new functions to ngless. Behind the scenes, this results in command line calls to scripts you write.

For example, to integrate motus into ngless, I used a simple configuration file, which I am going to describe it here.

Every module has a name and a version:

name: 'motus'
version: '0.0.0'

You can add a citation text. This will be shown to all users of your module (citing the software you use is a best practice, so we support it):

citation: "Metagenomic species profiling using universal phylogenetic marker genes"

You can add an init command. This will run before anything else runs at the start of the interpretation. It should be quick and check that things are OK. For example, in this case, we check that Python is installed. Thus, if there is a problem, the user gets a fast error message before anything else is run.

init:
    init_cmd: './check-python.sh'

Now, we list the functions we are implementing:

function:

In this case, there is just one, corresponding to the ngless function motus.

    nglName: "motus"

arg0 is the command to run (which implements this function):

    arg0: './run-python.sh'

In ngless functions have a single unnamed argument and any number of named arguments. So, we specify first arg1 which is a special

    arg1:
        filetype: "tsv"
        can_gzip: true

The can_gzip flag lets ngless know that it is OK to pass a compressed file to your script. Now, we list any additional arguments. In this case, there is a required argument:

    additional:
        -
            atype: 'str'
            name: 'ofile'
            def: ''
            required: true

The argument is a string, without a default. That’s it. Now, we can use the motus function in a ngless script:

ngless "0.0"
import "motus" version "0.0.0"

input = paired('data/reads.1.fq.gz', 'data/reads.2.fq.gz')
preprocess(input, keep_singles=False) using |read|:
    read = substrim(read, min_quality=25)
    if len(read) < 45:
        discard

mapped = map(input, ref='motus')
mapped = select(mapped) using |mread|:
    mread = mread.filter(min_identity_pc=97)
counted = count(mapped, gff_file='motus.gtf.gz', features=['gene'], multiple={dist1})
motus(counted, ofile='motus-counts.txt')

What can modules do?

An external module can

  • add new functions (will result in a call to a script, which will often be a wrapper around some tool).
  • add new reference information (new catalogs, &c). This can even be downloaded on demand (currently [Apr 2016], the module init script must do this itself; in the future, ngless will support just a URL).
  • add a citation so that all users of the module will see the citation message. This ensures that if you develop a package which gets wrapped into an ngless module, those final users will still see your citation.

Mahotas: A continuous improvement example

Last week, I tweeted:

Let me highlight a good example of continuous improvement in mahotas and the advantages of eating your own dog food.

I was using mahotas to compute some wavelets, but I couldn’t remember the possible parameter values. So I got some error:

[1]: mh.daubechies(im, 'db4')

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-6-2e939df57e6a> in <module>()
----> 1 mh.daubechies(im, 'db4')

/home/luispedro/work/mahotas/mahotas/convolve.pyc in daubechies(f, code, inline)
    492     '''
    493     f = _wavelet_array(f, inline, 'daubechies')
--> 494     code = _daubechies_codes.index(code)
    495     _convolve.daubechies(f, code)
    496     _convolve.daubechies(f.T, code)ValueError: 'db4' is not in list

I could have just looked up the right code and moved on. Instead, I considered the unhelpfulness of the error message a bug and fixed it (here is the commit)

Now we get a better error message:

ValueError: mahotas.convolve: Known daubechies codes are ['D2', 'D4', 'D6',
'D8', 'D10', 'D12', 'D14', 'D16', 'D18', 'D20']. You passed in db4.

You still get an error, but at least it tells you what you should be doing.

Good software works well. Excellent software fails well too.