Friday Links

0. An analysis of PLOS ONE acceptance times

This was a comment by Georg Walter on my previous PLOS ONE analysis, but he took the idea much further than I had.

1. Do women self-cite less?

2. A spammer script

3. Combating Climate Change with Neutrinos

Why didn’t anybody think of that before? (it’s a bogus paper, of course)

/ht Jeffrey Beall

4. A review of my book

I don’t think it is possible to write a better book on Machine Learning in Python, unless the ecosystem evolves with new algorithms. Which it will, and it will mean a new edition of the book! Neat!

[Amazon link]

People Agreeing With Me on the Internet

A couple of things I noticed lately that relate to a few of my posts/ideas.

§ One

Many “scientists should do x” conversations need to be reframed as “the culture of science needs to support x.”

— Jacquelyn Gill (@JacquelynGill) May 16, 2013

This is a very nice way to phrase what I wrote about collective action problems in sharing code.

§ Two

First of all, a really cool thing: Thomas J. Webb commented on his own paper with a blog update.

In the blog post, he writes (emphasis is mine):

With some trepidation, I opened up the file of R code I’d used for the original analysis. And got a pleasant surprise: it was readable! Largely this was because I submitted it as an appendix to our paper, and so had taken more care than usual to annotate it carefully. I think this demonstrates an under-appreciated virtue of sharing data and code: in preparing it such that it is comprehensible to others, it becomes much more useful to your future self. This point is nicely made in a new paper by Ethan White and colleagues on making using and reusing data easier.

Exactly: sharing code makes you write better code.

(hat tip to @mattjhodgkinson on twitter).

People are Right Not to Share Scientific Code…

… but we are wrong to let them get away with it.

Last week, a preprint discussing code sharing made the following remarks:

[I]t appears that the general agreement often expressed at relevant conferences that source codes should be shared (Allen et al., 2012c; Allen et al., 2013) does not coincide with the way most researchers act when their own code becomes the subject of the discussion.

On Hacker News, a working astrophysicist voiced his concerns:

For me, as an “experimentalist” (read, data-analyst), I also maintain job attractiveness by having sets of code that no one else has.

To summarise: in public, people seem to agree that code sharing is good; in private (or anonymously), they refuse to do it. We have what is called a collective action problem: there is wide agreement that it would be best for everyone if everyone shared, but each individual’s incentive is to defect (not share). The best outcome for an individual is to be able to use everyone else’s code while sharing none of their own. The result is that few share.
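To make the logic of defection concrete, here is a toy payoff table (the numbers are invented purely for illustration, nothing here is measured): whatever the rest of the community does, the individual researcher does better by hoarding, and yet universal sharing beats universal hoarding.

```python
# Toy model of the code-sharing dilemma. The payoff numbers are
# made up for illustration; only their ordering matters.
PAYOFFS = {
    # (my choice, everyone else's choice) -> my payoff
    ('share', 'share'): 3,  # everyone builds on everyone's code
    ('hoard', 'share'): 4,  # I use their code and keep my edge
    ('share', 'hoard'): 1,  # I give away my edge, get nothing back
    ('hoard', 'hoard'): 2,  # the status quo: nobody shares
}

for others in ('share', 'hoard'):
    best = max(('share', 'hoard'), key=lambda me: PAYOFFS[(me, others)])
    print(f"If everyone else {others}s, my best move is to {best}.")

# Hoarding dominates for the individual, even though
# PAYOFFS[('share', 'share')] > PAYOFFS[('hoard', 'hoard')]:
# everyone sharing beats nobody sharing.
```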

Yes, there are advantages to sharing. Overall, though, I do think that people are, for the most part, selfishly correct not to share. You may gain a little exposure and a few citations for papers based on the use of your tools (this is the standard “it’s good for you” spiel). But this is worth less than one of those collaborations where you just run your software for somebody and are made an author (which is what the astrophysicist above is talking about). Note that code which is not shared is probably of worse quality and less useful than code that is already shared (there was no random assignment of sharing decisions: people presumably choose to release their better code).

And this is not even mentioning one of the biggest risks of sharing code: that people will find out you made a mistake, or that your code is too unreliable to be trusted.

How do we handle a collective action problem? We could have the state pass a law that everyone must share. This will probably not happen (for one, there is no single state, as science is very international).

We can rely on people behaving morally. Frankly, I do science for mostly non-selfish reasons: I could make far more money working out how to make people click on ads, or tricking rich people out of their money at a hedge fund, but I really prefer to do something I find worthwhile (this is on the margin; it does not mean that I’d work for subsistence wages; I’m not a monk). Therefore, I find it a bit odd to then behave in a way that does not benefit the wider community. Sure, I’ll play the game around the edges, but if it were about extracting maximal personal benefit, I could do better than this.

Moral behaviour alone, however, is not enough. While 20th-century thinking was that collective action needed state intervention, Elinor Ostrom showed that communities can often solve these problems even without Leviathan.

In fact, science is exactly the sort of polycentric, overlapping authority system that her work studied. There are funders (public and private), journals, universities and other institutes, all the way down to individual PIs and their groups. All of these authorities make decisions that impact others; some are more important than others, but none is all-powerful (the NIH is probably the single most important player, but even they control only a fraction of world research). Interestingly, all of these authorities act on the basis of peer review and feedback; this is an ingrained part of the culture. Therefore, whatever the community consensus is, it will get filtered through.

Therefore, if the community consensus is that keeping code private is prejudicial to the whole scientific enterprise, decisions will be made that start to pressure people into sharing. But this is only true if, when acting on behalf of the wider community, researchers pressure other researchers into releasing code and reward those who do. Until that happens, we should stop pretending that it is in your best interest to release: it is in everybody’s interest but your own.

If you are evaluating a paper (as an editor or peer reviewer), you can ask about the code (or even just praise authors who do release code, implying that this makes you more supportive of publication; in certain marginal decisions, it should make the difference between publication and rejection). Or be hard on authors who do not cite the software they use.

The same is true of grant evaluations: do the applicants mention what will happen to their code? Particularly in a public funding setting, you are representing the interests of the taxpayer. If code is released as open source, this is probably a better contribution to society at large. The government funds scientists to produce intellectual public goods, not so that they can get a lot of high-profile papers (actually, in small countries, the government is often funding science for the prestige of high-profile papers, so this applies mostly to countries with high self-esteem).

The tension is not so clear in the case of hiring and tenure decisions, as the institution will often reap the same selfish benefits as the individual researcher from not sharing code (but if you are an external evaluator, you can consider whether that person has made a contribution that is valued by the whole community).

Until we align private and public incentives, we will have sub-optimal results for society. Denying that there is a tension between public and private interests, by claiming that it really is in your selfish interest to share, will fail. It will fail because it is not true.

Note too that, if everybody shares code, one of the largest classes of objections to doing so, the loss of competitive advantage, simply disappears: if everybody does something, there can be no competitive advantage in doing (or not doing) it.

Thus, we can separate two issues: (1) is it good for science (and the taxpayers paying the bills) that researchers share code? Here, I think the answer is clearly positive. (2) Is it good for an individual researcher, in competition with others, to share code? Here, I think the answer is negative.