Utilitarian Scientific Software Development

Yesterday, I added this new feature to ngless: if the user asks it to run a non-existent script, it will try it give an error message with a guess of what you probably really meant.

For example, if you type Profiles.ngl, but the script is actually called profile.ngl:

$ ngless Profiles.ngl

Exiting after fatal error:
File `Profiles.ngl` does not exist. Did you mean 'profile.ngl' (closest match)?

Previously, it would just say Profiles.ngl not found, without a suggestion.

It took me about 10-15 minutes to implement this (actually most of the functionality was already present as ngless already implemented similar functionality in other context). Is it worth it?

When is it worth it to add bells & whistles to research software?

I think we should think about it, in an ideal world, using the utilitarian principle of research software development: software should reduce the overall human effort. If this feature saves more time overall than it took to write, then it’s worth it.

This Utilitarian Principle says that these 15 minutes were well invested if (and only if) this ngless features saves more than 15 minutes for all its users over its lifetime. I expect that every time an user triggers this error, they’ll save a few seconds (say 2 seconds). 15 minutes is 900 seconds. Thus, this feature is worth it if it is triggered over 450 times. Given that we hope that ngless will be widely used, this feature is certainly worth it.

This principle also makes the argument that it would not be worth to add such a feature to a script that is only used in an internal analysis. So, code that was only meant to be used by myself or by myself and a small number of others, should have fewer bells & whistles.

In a non-ideal world, we need to take into account the incentives of the scientific (or commercial) world and the possibility of market failure: the market does not always reward the most selfless behaviour (this includes the “market” for scientific rewards where shiny new things are “paid” more than boring old maintenance).

Advertisements

Scott Sumner on what is a science

And don’t embarrass yourself by arguing macroeconomics is not a science.  Of course it’s a science.  It’s failed science, but then so are some of the other sciences, at least based on what I’ve read about the crisis in replication.  The term ‘science’ is not a compliment, it’s not some sort of award given to a field, like a Nobel Prize.  It’s simply a descriptive term for a field that builds models that try to explain how the world works.  Saying that science must be successful to be viewed as science is as silly as saying that a work of art must be good to be considered art.

Scott Sumner

Psychology is a failed science.

No, computers are not setting us up for disaster

Yesterday, the Guardian published a long essay by Tim Harford on the dangers of automation. The argument is not new (I first heard it on the econtalk episode with David Mindell), and the characteristic example is that of the Air France flight that crashed in the middle of the ocean after the autopilot handed control back to the human pilots who immediately proceeded to crash the plane. As I read it the argument runs as follows: (a) full automation is impossible, (b) partial automation erodes skills, therefore (c) we should be wary of over-automating.

On twitter, I responded with the snark that that medium encourages:

But I feel I should make a longer counter-argument.

1. Despite being a good visual (a plane crash is dramatic), the example of an airplane crash in 2009 is a terrible one. Commercial civil aviation is incredibly safe. Commercial aviation is so safe, I wouldn’t be surprised to read a contrarian Marginal Revolution post arguing it’s now too safe and we need more dangerous planes. I would be very careful in arguing that somehow whatever the aviation community does, is not working based on a single incident that happened 7 years ago. If this was happening every 7 weeks, then it would be a very worrying problem, but it doesn’t.

2. Everything I’ve heard and read about that Air France accident seems to agree that the pilots were deeply incompetent. I have also gotten the distinct impression that if the system had not handed back control to the humans, they would not have crashed the plane. It is simply asserted that we cannot have completely autonomous planes, but without evidence. Perhaps at the very least, it should be harder for the humans to override the automated control. Fully automated planes would also not be hijackable in a 9/11 way nor by their own pilots committing suicide (which given how safe planes are, may now be a significant fraction of airplane deaths!).

3. Even granting the premise of the article, that (a) full automation is impossible and (b) partial automation can lead to skill erosion, the conclusion that “the database and the algorithm, like the autopilot, should be there to support human decision-making” is a non sequitor. It assumes that the human is always a better decision maker, which is completely unproven. In fact, I rather feel that the conclusion is the opposite: the pilot should be there (if a pilot is needed, but let’s grant that) to support the autopilot. Now, we should ask: what’s the best way for pilots to support automated systems? If it is to intervene in times of rare crisis, then pilots should perhaps train like other professionals who are there for crises: a lot of simulations and war games for the cases that we hope never happen. Perhaps, we’ll get to a world where success is measured by having pilots spend their whole careers without ever flying a plane, much like a Secret Service agent trains for the worst, but hopes to never have to throw themselves in front of a bullet.

4. Throughout the essay, it is taken as a given that humans are better and computers are there to save on effort. There is another example, that of meteorologists who now trust the computer instead of being able to intuit when the computer has screwed up, which is what used to happen, but I don’t see an argument that their intuition is better than the computer. If you tell me that the veteran meteorologists can beat the modern systems, I’ll buy that, but I would also think that maybe it’s because the veteran meteorologists were working when the automated systems weren’t as good as the modern ones.

5. The essay as a whole needs to be more quantitative. Even if computers do cause different types of accident, we need to have at least an estimate of whether the number of deaths is larger or smaller than using other systems (humans). I understand that authors do not always choose their titles, but I wouldn’t have responded if title of the essay had been “It won’t be perfect: how automated systems will still have accidents”.

6. The skill erosion effect is interesting per se and there is some value in discussing it and being aware of it. However, I see no evidence that it completely erases the gains from automation (rather than being a small “tax” or clawback on the benefits of automation) and that the solution involves less automation rather than either more automation or a different kind of human training.

7. My horse riding skills are awful.

Scipy’s mannwhitneyu function

Without looking it up, can you say what the following code does:

import numpy as np
from scipy import stats
a = np.arange(25)
b = np.arange(25)+4
print(stats.mannwhitneyu(a , b))

You probably guessed that it computes the Mann-Whitney test between two samples, but exactly which test? The two-sided or the one-sided test?

You can’t tell from the code because it depends on which version of scipy you are running and it has gone back and forth between the two! Pre-0.17.0 it used the one-sided test with the side being decided based on the input data. This was obviously the wrong thing to do. Then, the API was fixed in 0.17.0 to do the two-sided test. This was considered a bad thing because it broke backwards compatibility and now it’s back to performing the one-sided test! I wish I was making this up. 

Reading through the github issues (#4933, #6034,  #6062, #6100)  is an example of how open source projects can stagnate. There is a basic, simple, solution to the issue: create a corrected version of the function with a new name and deprecate the old one. This keeps backwards compatibility while allowing the project to fix its API. Once the issue had been identified, this should have been a 20 minute job. Reading through the issues, this simple solution is proposed, discussed, seemingly agreed to. Instead, something else happens and at this point, it’d take me longer than 20 minutes to just read through the whole discussions.

This is not the first time I have run into numpy/scipy’s lack of respect for backwards compatibility either. Fortunately, there is a solution to this case, which is to use the full version:

stats.mannwhitneyu(a, b, alternative='two-sided')

Repost: BLAST deserves a Nobel Prize

Given that tomorrow (Monday October 3) the 2016 Nobel Prize in Physiology or Medicine will be announced, I am linking to my 2-year old post arguing that BLAST deserves a Nobel Prize:

In terms of impact in the field, it’s undeniable that BLAST has been huge. These people created a verb! What modern biologist does not know what “blasting a sequence” means? The BLAST paper was, at one point, the most highly cited paper in history. The impact on physiology is undeniable.

Lipman and Gene Myers stand out for their contributions to the computational processing of biological sequences. (See how I phrased that in a Nobel Committee way).

[…]

[One] counterargument I’ve heard is that BLAST is mostly a method, but so was GFP […] Does anybody believe that just the 1962 discovery of a jellyfish protein would have sufficed for a Nobel?

[…]

BLAST was definitely one of the most largest advances in the field of physiology in the last few decades. For this reason, David Lipman and Gene Myers should get a Physiology and Medicine Nobel Prize.

I also add that current favorite CRISPR is also mostly a method (the CRISPR Prize, when it comes, will be awarded for the method not the discovery of some DNA processing mechanism in a Streptococcus species.

Anscombe’s Quartet Animated

Anscombe’s Quartet is a set of four 2D datasets which have the same mean and variance in both X & Y as well as the same relationship between the two variables, even though they look very different.

I built a little animation to show all four datasets and a smooth transition between them:

Animation showing Anscombe's Quartet

Animation showing Anscombe’s Quartet

The black line is the mean Y value and the two dotted lines represent the mean ± std dev., the blue line is the least square regression between x and y. These are recomputed at each frame. In a sense, all the frames are like Anscombe sets.

*

The script for generating these is on github. I enjoyed playing around with theano for easy automatic differentiation (these type of derivatives are easy, but somehow I always get a sign wrong or a factor of 2 missing in the first try).

The UK medal count really is impressive, the US is just as expected

Repeating my analysis from last week on medal counts. To recap, let’s look for models that predict a country’s medal count based on GDP/population and then check which countries over- or under-perform their size and wealth. In the end, simple total GDP at market rates was the best predictor.

Measured by ratio of obtained medals to predicted medals, Russia was still overperforming (and they got banned from several of the events, so that is very impressive, although we’ll never know of which of the other sports they should have gotten banned from). Interestingly, we also see several caucusian countries showing up. And the big winner of this year’s Olympics, Great Britain, does show up as getting many more medals than their GDP predicts.

Finally, note that France, which was underperfoming at the beginning of the Olympics, not only caught up, but made it to the over-performers table (très bien, la France!).

Over performing countries
=========================
                   delta  got  predicted     ratio
Russia         41.352770   56  14.647230  3.823248
Azerbaijan     11.717356   18   6.282644  2.865036
Great Britain  42.256925   67  24.743075  2.707828
New Zealand    10.872079   18   7.127921  2.525280
Kazakhstan      9.783999   17   7.216001  2.355875
Hungary         8.201280   15   6.798720  2.206298
Kenya           6.573411   13   6.426589  2.022846
Uzbekistan      6.550104   13   6.449896  2.015536
Australia      13.899721   29  15.100279  1.920494
France         19.639542   42  22.360458  1.878316

Note that neither the US nor China show up. If anything, they are performing slightly below expectations.

Now, for the bottom half:

Under performing countries
==========================
                          delta  got  predicted     ratio
India                -18.620390    2  20.620390  0.096991
Nigeria               -8.497863    1   9.497863  0.105287
Austria               -7.754915    1   8.754915  0.114222
United Arab Emirates  -7.728790    1   8.728790  0.114563
Singapore             -7.190392    1   8.190392  0.122094
Philippines           -7.185019    1   8.185019  0.122174
Finland               -6.753509    1   7.753509  0.128974
Portugal              -6.539120    1   7.539120  0.132641
Qatar                 -6.316772    1   7.316772  0.136672
Puerto Rico           -5.873931    1   6.873931  0.145477

India got two medals (neither of which gold, one silver and one bronze) even though they are on track to becoming one of the world’s largest economies (right now, their GDP is comparable to Italy’s, but growing fast, while Italy is stagnant).

Several oil countries (unearned wealth) are listed there. The 3 richest countries not to win a medal at all are Saudi ArabiaPakistan, and Chile; another trio of resource rich countries.

*

You can run the whole analysis on a mybinder repo.