# Q2!

Thanks to Mario Figueiredo for pointing out that the measure whose name I had been hunting for is called Q2 in the literature.

It is sometimes spelled Q² or called R²-cv even though it is not a square of anything. In fact, it can even be negative.

# Evaluating Regression with Cross-Validation

I have been doing a bit of regression and have a few thoughts.

Most literature on regression comes from statisticians, not machine learning people. Perhaps for this reason, there is less emphasis on training/testing separation or cross-validation.

In particular, I have been thinking about how to get a general purpose measure of “how well can we predict this output.” Mean squared error, $\frac{1}{N} \sum_i (\hat{y}_i - y_i)^2$ (where $\hat{y}_i$ is the cross-validated prediction for input $i$), has some nice properties. However, as a raw number it is meaningless and it would be nice to normalize it.

What seems the most meaningful normalization is to use a null model which consists of outputting the mean of the training data. To make higher numbers be better, I first flip it around:

$1 - \frac{ \sum (\hat{y}_i - y_i)^2 }{\sum (y_i - \bar{y})^2}$

This is

$1 - \frac{\text{Model error}}{\text{Null model error}}$

The result can even be negative if the prediction is harmful. So, I actually want to use

$N(\hat{y}, y) = \max \{ 1 - \frac{ \sum (\hat{y}_i - y_i)^2 }{\sum (y_i - \bar{y})^2}, 0 \}$

This value is 0 for a meaningless prediction, 1 for a perfect one.

I am calling it N for normalized error reduction. In fact, I’ve tried looking and asking around for a name for it in the literature. So far, I have not found one.
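A minimal sketch of the measure in Python (the function name `normalized_error_reduction` and the use of NumPy are my choices, not from any library; strictly speaking, the null model should use the training-set mean, while this sketch uses the mean of the values being evaluated):

```python
import numpy as np

def normalized_error_reduction(y_pred, y_true):
    """N(y_pred, y_true): 1 minus the ratio of the model's squared error
    to the null model's squared error (predicting the mean), floored at 0."""
    y_pred = np.asarray(y_pred, dtype=float)
    y_true = np.asarray(y_true, dtype=float)
    model_error = np.sum((y_pred - y_true) ** 2)
    null_error = np.sum((y_true - y_true.mean()) ** 2)
    return max(1.0 - model_error / null_error, 0.0)
```

A perfect prediction gives 1, predicting the mean gives 0, and a harmful prediction (which would be negative) is clipped to 0.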

§

In the case of a positive correlation with no bias, this reduces to the R-squared between $\hat{y}$ and $y$, also known as the explained variance.

I like to look at the results in plots like this one:

On the x-axis, I have the underlying output, and on the y-axis, the cross-validated (or out-of-bag prediction) for each sample. I also plot the diagonal. In this case, the prediction is very good and there is only a little noise pushing points away from the diagonal.

§

However, it does not reduce to the R-squared in a few interesting cases:

1. The model does not predict at all.

Let’s say that your output cannot be explained by the input at all. To simplify things, let’s assume you don’t even have an input, just a set of outputs, $\{y_1, \ldots, y_N\}$, which you predict as the mean in the training set.

If you use leave-one-out cross-validation (LOOCV), then this null-prediction has perfect (negative) correlation with the output (see also here and here). Its R-squared is 1!

If you are using LOOCV, then you’ll probably see this and catch it, but it might slip by if you’re using 10 folds and you don’t have a lot of data, and you accidentally report it as low-but-significant (with 100 datapoints and uniformly sampled $y_i$, the R-squared is significantly different from zero, with p-value < 0.05, 90% of the time! 25% of the time, the p-value is below 0.0001).
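The LOOCV case is easy to reproduce; a small simulation (assuming NumPy) shows that the leave-one-out mean predictions are an exact decreasing linear function of the held-out values, so their correlation with the output is exactly -1:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.uniform(size=100)

# LOOCV null model: predict each y[i] as the mean of all the others
n = len(y)
loo_pred = (y.sum() - y) / (n - 1)

# loo_pred is linear in y with a negative slope, so the correlation
# is exactly -1 and the R-squared is 1
r = np.corrcoef(loo_pred, y)[0, 1]
print(r, r ** 2)  # -1 and 1, up to floating point
```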

2. Reversion-to-the-mean models.

This is an instance where your model is oversmoothed.

In this instance, your model predicts in the right direction, but it underpredicts. However, if you just report R-squared of the prediction, you’ll lead your readers to think you can predict very well.

It’s not always the case that you could just have gone back and multiplied all your coefficients by a number larger than 1. It may be that getting the larger coefficients would let noise into the output. This would not happen in traditional regression problems, but in $p > n$ settings, where penalized regression is necessary, it can and does happen (maybe it is an indication that you should try relaxed lasso).
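A toy example of this effect (a NumPy construction of mine, not from the text): a model that predicts in the right direction but shrinks every prediction half-way towards the mean has a perfect squared correlation, while N reports the underprediction:

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(size=200)

# Oversmoothed model: right direction, but only half the amplitude
y_pred = y.mean() + 0.5 * (y - y.mean())

r2 = np.corrcoef(y_pred, y)[0, 1] ** 2  # 1.0: R-squared looks perfect
n_val = 1 - np.sum((y_pred - y) ** 2) / np.sum((y - y.mean()) ** 2)
print(r2, n_val)  # 1.0 and 0.75 (the residual is half the deviation,
                  # so the error ratio is exactly 0.25)
```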

3. Biased models

If your models are biased, this naturally introduces a penalty. R-squared is invariant to adding a constant ($R^2(\hat{y} + B, y) = R^2(\hat{y}, y)$), but N is not. In fact, if the correlation is positive,

$R^2(\hat{y}, y) = \max_B N(\hat{y} + B, y).$

I most often see this when comparing human annotations. People have very similar trends, but one of the operators will consistently report a lower value.

In some cases this may not matter and in others it will (if Matt always needs to give a bigger discount than John to get a sale, this does not mean that Matt and John are equivalent salesmen).
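The same contrast shows up with a purely biased predictor (again a NumPy toy example of mine): a constant offset leaves the squared correlation at 1, but it costs N directly:

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(size=200)
bias = 1.0
y_pred = y + bias  # perfect trend, constant offset

r2 = np.corrcoef(y_pred, y)[0, 1] ** 2  # 1.0: R-squared ignores the bias
n_raw = 1 - np.sum((y_pred - y) ** 2) / np.sum((y - y.mean()) ** 2)
n_val = max(n_raw, 0.0)  # N penalizes the bias, here down to around zero
print(r2, n_raw, n_val)
```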

§

In all of these cases, the R-squared is excellent, but I think it is overselling how good the prediction is, whilst N is more honest.

# Friday Links: Science & Religion Edition

There’s a sign in the Durham store suggesting that shoppers [..] grind their organic coffees at home because the Whole Foods grinders process conventional coffee, too, and so might transfer some non-organic dust. “This slicer used for cutting both CONVENTIONAL and ORGANIC breads” warns a sign above the Durham location’s bread slicer. Synagogue kitchens are the only other places in which I’ve seen signs implying that level of food-separation purity.

2. Faith & Reason. Money quote:

[A] survey of 1,700 scientists at Harvard, MIT and other elite colleges [reported that a]bout a third were atheists (as opposed to fewer than one-in-20 ordinary Americans) [...]

[A] still-larger study into science and religion [...] sought out “rank-and-file” scientists: researchers in company labs, engineers, dentists and so on. [Surprisingly], Main Street scientists are only a bit less religious than the average American. Perhaps Ivy League scientists are ultra-secular because they are Ivy League, not because they are scientists?

# Average Work-week is Over, a few Thoughts on Productivity

I’ve lately seen some discussions of productivity and they often seem to refer to the widget-cranking model of productivity even whilst claiming not to do so!

§

This is the productivity model I typically see discussed (or assumed):

On the x-axis, I plot time spent working; on the y-axis I plot output (total output, not productivity, which would be the derivative).

In this model, when you work 20 hours a week, you are super productive and can output more than half of what you can output when you work 40 hours a week. As working time goes up, fatigue sets in and less is produced per hour until working more actually becomes counter-productive, including a steep decline as the worker burns out or just commits egregious mistakes [1].

This is the widget-cranking model and it applies to factory workers assembling iPhones or baristas at a coffee store; with small modifications (see below), it applies to office workers. It does not apply to certain types of highly-intellectual workers working in a modern organization [2].

§

My impression is that my own productivity is much more like this:

The first few hours of the week are zero-output. This is Maintenance. These include attending seminars, reading papers, reading department announcement emails, filling out paperwork, attending training sessions on how to fill out paperwork, some of the meetings I attend, &c. I could do this for years and nothing would come out. Sure, I’d be informed of the literature from reading papers, I’d know who all of the week’s speakers are and I’d have no unread emails in my inbox; but this would not get a single paper published, a single talk given, a single student taught.

Next, there are the Shallow Tasks, the stuff that produces some output, but is not really very challenging nor produces very high-impact output: answering work emails, (re-)giving talks on work already done, merging text edits from co-authors. These basic tasks can take a few hours every week. Sometimes they take up a whole month.

If all your tasks are of this form then the widget model may apply to you after the Maintenance phase, which in a well-run organization can be 5 hours per week or less [3]. I think this actually describes many office jobs, which are done with the brain, but do not require that much creativity, insight or deep knowledge of a field. It may even describe a lot of the work done by doctors seeing patients (although less now than it did in the past). I can certainly teach a class in shallow mode (probably won’t be my best work, but I can do it). However, if you have a really intellectually intensive job, which requires creativity, shallowing it will not do.

In my line of work, research, shallowing is not enough. At some point, you need more and faster progress. You need deep thinking and breakthroughs.

However, and this is the important point of this model, I cannot do deep thinking on a cold cache. I can only really get there when I have wrapped my head around the details of a project/problem. This is best achieved as a side-effect from working on shallow tasks or from failed attempts at breakthroughs. It takes some time and it does not lend itself to being partitioned into discrete tasks spread through a long period of time (a few hours every week).

When I switch projects to something I have not worked at for a while [4], it sometimes takes me a full week or more just to get the details back in my head. Even coming back from the week-end, it takes a few hours to get back to where I was on Friday [5]. I sometimes think I’ve got it and then make a silly mistake because I forgot that in this particular project, some aspect was done slightly differently from usual, so I waste a full day on something stupid; I spend more time looking up basic information, I make changes to code which need to be reverted because I forgot why the code was doing what it was doing, or I write some text which I delete without even sharing because I had forgotten an important aspect of the problem. (For the programmers in the audience, think about switching to a programming language you know well but have not used for a few months. You are now calling the size method instead of length to get the number of elements of a vector, looking up library functions you used to know by heart, your fingers will no longer automatically type build system commands, &c)

Only when I finally have the project in my head, does the typical widget model fully apply to me: breakthroughs are now easy and I am very productive for the first few hours of investment. I can manipulate the concepts in my head and translate them to actual analyses, I remember pitfalls automatically, do not fall through them, and things are good. I can try new things easily without breaking up everything else (of course, they are not all successful attempts, but I am iterating fast).

However, I cannot get to this phase without a preparatory phase. I often have my best ideas on the bus. I have been struggling with something for the whole afternoon, and on the bus on my way home, I finally see the solution. However, if I just rode the bus around town all day, I would not be very productive. Loading the project into memory is a vital phase of the process. Only then can I make the insightful leaps.

Later comes fatigue and breakdown when mistakes accumulate and I can’t spell anymore [6].

§

In this model, although we still have diminishing returns at the right end of the curve, we have increasing returns at the left end. Working half the time produces less than half of the output, working a quarter of the time produces almost nothing.

§

In this model, Average is Over. The 40 hour week is over, there will be those who work 20 to 30 hours (those who are on the top curve) and those who work 50 to 60 (those who are on the bottom curve).

In this model, it makes sense for widget makers to work fewer hours as society gets richer (they are cashing in on society’s wealth in the form of leisure), while the elites work more hours for much more money. You cannot be a part-time C-level executive, part-time quant trader, or part-time cutting-edge scientist at a big institution. You can, however, be a part-time barista or HR officer.

In this model, for certain careers, it is hard to cash in on society’s wealth by working fewer hours, unless you take advantage of a loophole: you take long breaks or vacations. Not some puny two or three weeks in the Summer every year or something 20th-century like that (which requires another week or so of catching-up time when you get back). Sure, you might do a week in Florida (Lanzarote, for Europeans) when the fancy strikes and visit the in-laws for the holidays, but I mean that you take some real time off, like a few months to go live in Asia (or volunteer in Africa). You take a year off to walk from Alaska to Peru. Then you go back and work 60-hour weeks at a hedge fund again, until you take your next six months off. Very often, you take these breaks in between jobs.

I think that this back-and-forth between apparent workaholism and long breaks is both more rational in this model and better describes the lifestyles of the modern elites.

The poor may work in the morning, fish in the afternoon, and criticize in the evening [7]; the rich will work one year, fish the next one, and criticize (go into politics) a decade later.

Updated: I added the sentence about breaks being in between jobs, which is what I observe to emphasize the point that these are not traditional vacations.

 [1] It may apply to certain types of gentleman scholar work such as a writer who writes his best work before lunch and takes the afternoons off.
 [2] I purposefully left out values out of these plots. Some have claimed that 40 hours is the peak output (on average). Perhaps that is true, but it feels a bit Panglossian to me (it would also mean that the historical fights for the 40 hour work-week were based on a mistake on the part of the employers fighting to get their employees to work longer hours: they’d get more output while paying them less by switching to 40 hours, but the unions had to fight them for it). On the other hand, I know my peak is way beyond 40 hours, so I might just be generalizing from N=1.
 [3] In a badly run organization, this can take much longer.
 [4] The common English idiom is working on, but many times research feels more like working at problems than working on them.
 [5] Which is why context switches can be so painful. Not interruptions per se, but context switches.
 [6] Actually, I can’t spell at all in any language at any time of day; but you get the point.
 [7] I mean material-poor relative to the very rich. This can apply to people with very rich lives who are part of the global 1% of income (you need 34k/year to be in the global 1%).

# Quote of the Day

Quote of the Day (long-form edition):

In De Motu, Galileo reported that the lighter body falls faster at the beginning, then the heavier body catches up and arrives at the ground slightly before the lighter one. Since this should not be true of the objects that Galileo used, a wooden sphere and an iron one, if they are released simultaneously, it has been inferred that Galileo was either a poor observer or making up his data. But in replications of Galileo’s procedure, it has been found that when a light wooden sphere and a heavy iron one are dropped by hand, the lighter wooden sphere does start out its journey a bit ahead—a natural, if misleading, consequence of the need to clutch the heavier iron ball more firmly than the wooden one. This causes the iron ball to be released slightly after the wooden ball even though the experimenter has the impression that he is opening his hands at the same time. Then, because of the differential effects of air resistance on objects of different weight, the iron ball catches up with and passes the wooden ball, just as Galileo reported. There is a satisfying irony in this finding. The modern critics of Galileo were making the same mistake that the ancients made, criticizing results on the basis of what “must be true” rather than going out and doing the work to find out what is true.

From Charles Murray, Human Accomplishment

The stone age did not end for lack of stones, the fossil fuel age will not end for lack of fossil fuels. It will end when there is a better technology.

Nuclear fusion (or something that can consistently deliver massive amounts of power without CO2 emissions [and without the huge costs of traditional nuclear fission]) is really the only thing that can stop global warming. More research is needed.

This seems great: a computational system that checks the scientific literature.

I don’t fully understand, though, why you’d want to verify the output of the programme and not just the code/reasoning behind it.

3. From the Oxford University Press, ladies and gentlemen, we have this insight into modern physics:

I am struck by the fact that treatises on particle physics never say what shape the particles have, and whether different kinds of particles might have different shapes. In diagrams they are usually depicted as spherical, but such a determination never plays a role in the theories of particles — unlike questions of charge and mass. Would it matter if an electron had a star shape?

Basic Structures of Reality by Colin McGinn

Sokal would be proud to publish there. Do read the full review in Mind [1], which summarises the book thus:

McGinn’s thesis, to repeat, is that our technical competence with physics ‘conceals vast chasms of ignorance’.

On the other hand, McGinn’s technical incompetence conceals nothing.

(h/t Anna Zielinska)

 [1] Before you lump all the ignorant philosophers into one bag, note that this review is from a philosophy journal.