Thoughts on “Revisiting authorship, and JOSS software publications”

This is a direct response to Titus’ post: Revisiting authorship, and JOSS software publications. In fact, I started writing it as a comment there and it became so long that I decided it was better as its own post.

I appreciate the tone of Titus’ post, more asking questions than answering them, so here are my two or three cents:

There is nothing special about software papers. Different fields have different criteria for what constitutes authorship. I believe that particle physics has an approach close to “everyone who committed to the git repo gets to be an author”, which leads to papers with >1000 authors (note that I am not myself a particle physicist or anything close to it). At that point, the currency of authorship and citations is diluted to the point where I seriously don’t know what the point is. What is maybe special about software papers is that, because they are newer, there hasn’t been time for a rough consensus to emerge on what the criteria should be (I guess this is being worked out in part in discussions like these). Having said that, even in well-established fields, you still have issues that are never resolved (in molecular biology, the question of whether technicians should be listed as authors brings out strong opinions on both sides).

My position is that not every positive contribution deserves authorship. Some positive contributions are significant enough to deserve authorship; others, an acknowledgement; others can even go unmentioned. Yes, it’s a judgement call what is “significant” and what is not, but I think someone who, for example, reports a bug that leads to a bugfix is a clear NO (even if it’s a good bug report). Even small code contributions should not lead to authorship (an acknowledgement is perhaps where my decision boundary lies at that point). People who proofread a manuscript also don’t get authorship, even if they do find a few typos or suggest a few wording changes.

Of course, contributions need not be code. Tutorials, design, &c, all count. But they should be significant. I also would not consider adding as an author someone who asked a good question during a seminar on the work, even though those questions sometimes turn out to be like good bug reports, in that you improve the work based on them. The fact that significance is a judgement call does not imply that we should drop significance as a criterion.

I think authorship is also about responsibility. If you are an author, then you must take responsibility for some part of the work (naturally, not all of it, but some of it). If there are issues later, it is your responsibility to, at the very least, explain what you did, or even to fix it, &c. You should be involved in the paper writing and if, for example, some work is needed during revision on that particular aspect of the code, you need to do it.

From my side, I have submitted several patches to projects; those were best efforts at the time, but I don’t want to take any responsibility for those projects beyond that. If the author of one of them now told me that I needed to redo my patch to work on Mac OS X because one of the reviewers complained about it, I’d tell them “sorry, I cannot help you”. I don’t think an author should get to say that.

I am getting off-topic here, but I also wish there were more of an explicit expectation that if you publish a software paper, you will provide minimal maintenance for 5-10 years. Too often, software papers are the obituary of a project rather than its announcement.

My answer to “Another question: does authorship keep accruing over versions? Should all the authors on sourmash 2.0 be authors on sourmash 3.0?” is a strong NO. You don’t get to double dip. If anything, I think it’s generally the authors of version 2 who lose out, as people benefit from their work but keep citing version 1.

Finally, a question of my own: if someone does something outside the project that clearly benefits it, should they be an author? For example, what if someone creates a bioconda package for your code? Or writes an excellent blog post or tutorial that brings you a very large number of users (or runs a tutorial on it)? My answer is that the contribution first needs to be significant, and authorship should not be automatic. It may be appropriate to invite them, but they should then commit to (at the very least) reading the manuscript and keeping up with the development of the project over the medium term.


Vaccinate even if it probably won’t make a difference to you personally

From the ongoing series “how do statistics feel to me” (previous episode)


Vaccines create adults

This is a strong slogan, but it’s trivially false and, ultimately, stupid. There were plenty of adults before vaccines. Almost every kid who gets one of the diseases we vaccinate against will survive just fine. Many other pro-vaccination slogans are equally alarmist, equally false, and (I believe) equally counter-productive.

Mind you, I agree that vaccines are good and I think they should be mandatory, because children cannot make informed decisions and vaccination has public-health consequences (it’s not just your body, your choice; it’s our herd immunity, our choice). But even if all vaccines were to disappear, most of the time, the kids would be fine. In fact, the total mortality rate (deaths before the age of 5) might be no more than 1-2%.

Without vaccines, most families would still not have to explain to an older brother why his sister died; they would not have to mourn a dead child. Society would not materially change that much. There would be roughly as many adults around.

However, 1-2% of children dying would be horrible. There is no need to exaggerate to make it sound even worse: it would be bad enough. If you take your kids to a kindergarten of average size, that would mean one child funeral per year (or every other year).
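To make the arithmetic behind that claim explicit, here is a back-of-the-envelope sketch (in Python; the kindergarten size and years attended are assumptions picked for illustration, not data):

    # rough model of the claim above; every number here is an assumption
    mortality = 0.015        # the 1-2% figure from above, split down the middle
    kindergarten_size = 100  # assume an average-sized kindergarten
    years_attended = 2.5     # assume each child spends roughly 2-3 years enrolled

    # each year, roughly kindergarten_size / years_attended new children enter;
    # crudely attribute the full childhood mortality risk to that cohort
    funerals_per_year = (kindergarten_size / years_attended) * mortality
    print(f"{funerals_per_year:.1f}")  # 0.6: about one funeral every other year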

Without vaccines, your child would most likely be fine; the odds are in their favour (they’d have a 98% chance of making it). However, you would probably have to explain to them, at some point in their early childhood, that someone in their school died of a horrible disease. You’d regularly be saying “Good Morning” to at least one parent who had lost a child. If you are an early-education teacher, you’d have to expect to attend at least half a dozen funerals in your career, mourning little children.

This is a tremendous amount of pain and grief.

Most kids would be fine; most kids were fine before vaccines. But 1-2% is way too many. We’d see people developing defense mechanisms (Detachment Parenting: don’t get too close before you know they’ll make it, a New York Times best-seller for the post-vaccine era). We don’t need vaccines to create adults, but a world without them is a significantly worse world than the one we live in today.

It’s not “if we don’t vaccinate, all children will die”: that’s flashy, but false.

Rather, it’s “when you walk into kindergarten with your toddler tomorrow, think that, without vaccines, one child in there would die every other year. It’s probably not going to be yours, but, please, vaccinate them.”

I am a regional thinker: a review of “Stubborn Attachments”

Alternative title: Why I moved to China

Tyler Cowen likes to say that everyone is a regional thinker. It took me a while to understand how that applied to me. I have a split identity (and two passports to go with them). I am both Portuguese and British, with a German high-school education (in Lisbon). So, it’s not even clear which region I should be grouped with.

But Tyler’s statement (as I understand it) is not that everyone can be classified as belonging to a regional school of thought, but rather that your thinking, even on the most abstract of subjects, will have been molded by where you spent your formative years.

I grew up in Portugal, coming of age in the 1990s. The narrative I learned at the time goes as follows: Portugal was held back by the dictatorship, which turned its back on development and the world. Salazar, dictator from 1926 to 1968, famously said “Portugal will be proudly alone”. Fortunately for the country (unfortunately for the regime), people were still learning about the outside world and eventually got fed up: after the revolution in 1974, the country turned to Europe, joined the EU, and started to catch up with the rest of the West.

Having been born in a democratic Portugal, we were the first generation to grow up in a modern country, and we’d have lives that were European and turned to the future, not the past. Everyone complained about all the construction that was taking place as Lisbon was transformed into a modern city, but it was also a sign of progress. The high point of this period was the 1998 World Expo in Lisbon, which brought a large expansion of the subway network and a shiny new bridge across the Tagus, at the time the longest in Europe.

I could now point to where this narrative was a bit too simplistic (in particular, there was a lot of economic growth in the 1950s and 60s, and the true victims of the dictatorship were in the African colonies), but the point is that this is how we thought of the situation: Portugal had been held back by an accident of history, and we could see slightly richer Spain across the border as an example of where the country would be a few years later, and the core of the EU as where we could expect to be within our lifetimes.

In 1998, the World Expo in Lisbon was a huge success (after a few initial hiccups): a whole new modern-looking neighbourhood was built on what had been industrial land. António Guterres, now UN Secretary-General, was prime minister, one of the darlings of the international “Third Way” movement. Portugal had gotten rid of the nasty right wing in 1974; now it also had a modern Left.

Then came nothing. Economic growth stumbled. Guterres resigned a few years later (curiously, for someone who was supposed to bring a modern social democracy, Guterres was, and probably still is, a social conservative). Outside Lisbon, things kept improving for a while as the other cities caught up to the capital, but eventually that petered out too. Portugal has now had two lost decades. Adjusting for inflation, GDP per capita grew 7% between 2000 and 2008. I mean it grew 7% over that whole period, not on a yearly basis. Then it fell during the crisis, and only last year did it get back to 2008 levels, so that between 2000 and 2017, total growth was 7%. Nobody believes that today’s 20-year-olds will have a European lifestyle (and I don’t even mean a Nordic lifestyle, just a French/German one).

A few months ago, Noah Smith tweeted something like “people compare themselves to others in their society, so saying that ‘things are getting better’ doesn’t help; nobody compares themselves to people in 2318”. When I read that, I thought: why not? I might not look 300 years into the future, but I certainly compare our world to the world of 2038 and think we’re failing. (Noah may have phrased it better; I couldn’t find the original tweet.)

The idea of a Great Stagnation has always been deeply intuitive to me, and I frankly cannot understand people who say that technology is moving too fast. I grew up seeing real change around me, saw it suddenly stop, and felt short-changed by Portugal. Eventually, I left. I didn’t leave because growth stopped, but I stayed away because growth stopped.

I see a lot of superficial changes all the time, but that’s like fashion: now we wear tight jeans, we used to wear bell-bottoms. It’s change, and it may even be a good thing, in that it is fun and breaks the monotony, but it is not progress.

To be sure, stagnation may not be so bad. Germany is the stagnant country par excellence and it’s a nice place to live, certainly one of the best countries in the world in terms of quality of life. But Germany is stagnant and, once you see it, the gap between what could easily be and what actually is becomes too large to ignore. As time goes on, the gap will only get larger. I guess if you grow up without large changes for decades, you start to expect stagnation, maybe even to enjoy it. You compare yourself to the Joneses next door and not to 2038, because there is no picture in your mind’s eye of what 2038 should look like and how it should be better.

People who lived in Portugal through the last 10 years now get excited over 2.2% year-on-year growth. After so many years of nothing, mediocre growth feels amazing. Still, if you cross the border into Spain, it no longer feels like “this is what Portugal will be in 2021”. Compared to Portugal, Spain now feels like a much wealthier, qualitatively different, better economy. Portugal could have been that but, at least in my lifetime, it probably won’t be. This is a lost opportunity, and it brings me sadness.

Maybe it’s not that I am a regional thinker, but a regional feeler. I have a visceral feel for what it means to “grow to the level of Greece and then stop there” that comes from lived experience.

In summary, this is why I recommend you read Stubborn Attachments.

How Notebooks Should Work

Joel Grus’ presentation on why he does not like notebooks sparked a flurry of notebook-related discussion.

I like the idea of notebooks more than I like actual notebooks. I tried to use them in my analyses for a long time, but eventually gave up: there are too many small annoyances (some of which the talk goes over, others it does not, such as the fact that they do not integrate well with git).

Here is how I think they should work instead:

  1. There is no hidden state. Cells are always run from top to bottom.
  2. If you change a cell in the middle and run it, its output is recomputed (conceptually, by re-running from the top) and the outputs of all cells below it are immediately cleared.

For example:

[1] : Code
Output

[2] : Code
Output

[3] : Code
Output

[4] : Code
Output

[5] : Code
Output

Now, if you edit Cell 3, you would get:

[1] : Code
Output

[2] : Code
Output

[3] : New Code
New Output

[ ] : Code

[ ] : Code

If you want, you can run the whole thing now and get the full output:

[1] : Code
Output

[2] : Code
Output

[3] : New Code
New Output

[4] : Code
New Output

[5] : Code
New Output

This way, the whole notebook is always up to date.

But won’t this be incredibly slow if you always have to run it from the top?

Yes, if you implement it naïvely, where the kernel really does always re-run from the top, it is not likely to be usable. But with a bit of smart caching, you could keep some intermediate states alive. It would require some engineering, but I think you could keep a few live kernels in intermediate states to make the experience usable, so that if you edit cell number 35, it does not need to go back to the first cell; maybe there is a cached kernel that has the state as of cell 30, and only cells 31 onwards need to be re-run.

It would take a lot of engineering, and it may even be impossible with the current structure of Jupyter kernels, but, from a human point of view, I think this would be a better user experience.
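To make the idea concrete, here is a toy sketch in plain Python (the class and its methods are made up for illustration; this is nothing like Jupyter’s actual kernel protocol, and it dodges the hard problems by deep-copying namespaces, which only works for simple values; real kernel state such as open files, sockets, and loaded modules is exactly what makes this difficult in practice):

    import copy

    class Notebook:
        # Toy model: each cell is a string of Python source; `snapshots` maps a
        # cell index to a copy of the namespace as it was after that cell ran.
        def __init__(self, cells):
            self.cells = list(cells)
            self.snapshots = {}

        def run(self, start=0):
            # resume from the nearest checkpoint before `start`, not from cell 0
            cached = [i for i in self.snapshots if i < start]
            resume = max(cached) + 1 if cached else 0
            ns = copy.deepcopy(self.snapshots[resume - 1]) if resume else {}
            # everything from the resume point down is now stale
            self.snapshots = {i: s for i, s in self.snapshots.items() if i < resume}
            for i in range(resume, len(self.cells)):
                exec(self.cells[i], ns)  # cells always run top to bottom
                self.snapshots[i] = copy.deepcopy(
                    {k: v for k, v in ns.items() if k != '__builtins__'})

        def edit(self, index, new_code):
            # editing a cell invalidates it and every cell below, then re-runs
            self.cells[index] = new_code
            self.run(start=index)

    nb = Notebook(["x = 2", "y = x * 10", "print(y)"])
    nb.run()                 # prints 20, checkpointing after every cell
    nb.edit(1, "y = x + 1")  # reuses cell 0's checkpoint; re-runs cells 1-2: prints 3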

The European Court of Justice’s decision that CRISPR’d plants are GMOs is the right interpretation of a law that is bonkers

The European Court of Justice (think of it as the European Supreme Court) declared that CRISPR’d plants count as GMOs.

I think the Court is correct: CRISPR’d plants are GMOs. The EU does not have a tradition of “legislation by judicial decision” like the US Supreme Court’s (although there have been some instances of it, as in the Uber case). Thus, even though I wish the decision had gone the other way as a matter of legislation, as a matter of legal interpretation it seems clear that the intent of the law was to ban modern biotechnology as scary, and I don’t see how CRISPR does not fit that description.

The decision is scientifically bonkers in that it says that older atomic-gardening plants are kosher, but the exact same organism would be illegal if it were obtained by bioengineering methods. According to this decision, you can use CRISPR to obtain and test a mutation. At this point, it’s a GMO, so you cannot sell it in most of Europe. Then you use atomic gardening, PCR, and cross-breeding to obtain exactly the same genotype. Now it’s not a GMO, so it’s fine to sell. Being a GMO is thus not a property of the plant, but a property of its history. Some plants may carry markers that identify them as GMOs, but there may be many cases where you have two identical plants, only one of which is a GMO. This has irked some scientists (see this NYT article) but, frankly, it is the original GMO law that is bonkers, in that it regulates the method by which a plant is obtained instead of regulating the end result.

On this one, blame the lawmakers, not the court.

HT/ @PhilippBayer

NGLess preprint is up

We have posted a preprint describing NG-meta-profiler and NGLess in general:

NG-meta-profiler: fast processing of metagenomes using NGLess, a domain-specific language
Luis Pedro Coelho, Renato Alves, Paulo Monteiro, Jaime Huerta-Cepas, Ana Teresa Freitas, Peer Bork

My initial goal was to develop a tool that (1) used a domain-specific language to describe computation and (2) was actually used in production. I did not want a proof of concept, as one of the major arguments for developing a domain-specific language (DSL) is that it is more usable than a traditional library in another language. As I am skeptical that you can fully evaluate how good a tool is without long-term, real-world usage, I wanted NGLess to be used in my day-to-day research.

NGLess was a long time cooking, but it is now a tool that we use every day to produce real results. In that sense, at least, our objectives have been achieved.

Now, we hope that others find it as useful as we do.

Why NGLess took so long to become a robust tool (but now IS a robust tool)

Titus Brown posted that good research software takes 2-3 years to produce. As we are close to submitting a manuscript for our own NGLess, which took a bit longer than that, I will add some examples of why it took so long to get to this stage.

Part of why it took so long comes down to people issues and to the fact that NGLess was mostly developed as we needed it to process real data, while I was working on other projects rather than on NGLess itself. But even if this had been someone’s full-time project, it would have taken a long time to get to where it is today.

It did not take so long because there are so many Big Ideas in there (I wish). NGLess contains just one Big Idea: a domain-specific language that results in a tool that is not just a proof of concept, but a better tool because it uses a DSL; everything else follows from that.

Rather, what takes a long time is to find all the weird corner cases. Most of these are issues the majority of users will never encounter, but collectively they make the tool so much more robust. Here are some examples:

  • Around Feb 2017, a user reported that some samples would crash ngless. The user did not seem to be doing anything wrong but, half-way through the processing, memory usage would start growing until the interpreter crashed. It took me the better part of two days to realize that their input files were malformed: they consisted of a few million well-formed reads, followed by a multi-gigabyte run of zero bytes. Their input FastQs were, in effect, a gzip bomb.

    There is a kind of open source developer who would reply to this situation by saying “well, knuckle-head, don’t feed my perfect software your crappy data”, but this is not the NGLess way (whose goal is to minimize the effort of real-life people), so we considered this a bug in NGLess and fixed it so that it now (correctly) complains of malformed input and exits (the first sketch after this list illustrates the kind of check involved).

  • Recently, we realized that if you use the motus module on a system with a badly configured locale, ngless could crash. The reason is that, when using that module, we print out a reference to the paper, which includes some authors with non-ASCII characters in their names. Because of some weird combination of the Haskell runtime system and libiconv (which seems to generally be a mess), this crashes if the locale is not installed correctly.

    Again, there is a kind of developer who would respond to this with “well, fix your locale installation, knuckle-head”, but we added a workaround.

  • When I taught the first ngless workshop in late 2017, I realized that one of the inconsistencies in the language was causing a lot of confusion for the learners, so the next release fixed that issue.
  • There are two variants of FastQ files, depending on whether the qualities are encoded by adding 33 or 64 to the ASCII value. It is generally trivial to infer which one is being used, so NGLess does so heuristically. In Feb 2017, a user reported that the heuristics were failing on one particular (well-formed) example, so we improved them (the second sketch after this list shows the flavour of such a heuristic).
  • There are 25 commits that say they produce “better error messages”. Most of these resulted from a confused debugging session.
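To illustrate the first bullet, here is a hedged sketch of the kind of defensive parsing that fix required (illustrative Python; NGLess itself is written in Haskell and this is not its actual code): stream FastQ records and fail fast on garbage such as an endless run of zero bytes, instead of buffering it.

    import gzip
    from itertools import islice

    MAX_LINE = 100_000  # a multi-gigabyte run of NUL bytes contains no newlines,
                        # so an unbounded readline() would buffer it all and crash

    def bounded_lines(f):
        # yield lines, but refuse any line longer than MAX_LINE characters
        while True:
            line = f.readline(MAX_LINE)
            if not line:
                return
            if len(line) == MAX_LINE and not line.endswith('\n'):
                raise ValueError("malformed input: line too long")
            yield line.rstrip('\n')

    def stream_fastq(path):
        # validate each 4-line record as it streams by, rather than trusting it
        with gzip.open(path, 'rt', errors='replace') as f:
            lines = bounded_lines(f)
            while True:
                record = list(islice(lines, 4))
                if not record:
                    return
                if (len(record) != 4 or not record[0].startswith('@')
                        or not record[2].startswith('+')):
                    raise ValueError("malformed FastQ record")
                yield record[0], record[1], record[3]  # header, sequence, qualities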
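And for the quality-encoding bullet, a sketch of the flavour of heuristic involved (again illustrative Python, not the Haskell code NGLess actually uses): any quality character below ';' can only occur with the 33 offset, while characters above the usual offset-33 ceiling point to 64.

    def guess_quality_offset(quality_strings):
        # collect every quality character observed across the reads
        seen = set()
        for q in quality_strings:
            seen.update(q)
        if not seen:
            return None
        if min(seen) < ';':  # chr(59): impossible under the 64-offset encodings
            return 33
        if max(seen) > 'K':  # above the usual offset-33 range, so assume 64
            return 64
        return None          # ambiguous: the data is consistent with both

    guess_quality_offset(["IIIIHHGG", "!!++55??"])  # -> 33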

None of these issues took that long to fix, but they only emerge through a prolonged beta use period.

You need users to try all types of bad input files; you need to teach with the tool to learn where the pain points for new users are; you need someone to try it out on a system with a mis-installed locale, &c.

One possible conclusion is that, for certain kinds of scientific software, it is actually better if it is done as a side project: you can keep publishing other things, you can apply it to several problems, and the long gestation period catches all these minor issues, even while you are being productive elsewhere. (This was also true of Jug: it was never really a project per se, but after a long time it became usable and got its own paper.)