Evaluating Regression with Cross-Validation

I have been doing a bit of regression and have a few thoughts.

Most literature on regression comes from statisticians, not machine learning people. Perhaps for this reason, there is less emphasis on training/testing separation or cross-validation.

In particular, I have been thinking about how to get a general-purpose measure of “how well can we predict this output.” Mean squared error, \frac{1}{N} \sum_i (\hat{y}_i - y_i)^2 (where \hat{y}_i is the cross-validated prediction for input i), has some nice properties. However, the raw number is meaningless on its own, and it would be nice to normalize it.

What seems the most meaningful normalization is to use a null model which consists of outputting the mean of the training data. To make higher numbers better, I first flip it around:

1 - \frac{ \sum (\hat{y}_i - y_i)^2 }{\sum (y_i - \bar{y})^2}

This is

1 - \frac{\text{Model error}}{\text{Null model error}}

The result can even be negative if the prediction is harmful. So, I actually want to use

N(\hat{y}, y) = \max \{ 1 - \frac{ \sum (\hat{y}_i - y_i)^2 }{\sum (y_i - \bar{y})^2}, 0 \}

This value is 0 for a meaningless prediction, 1 for a perfect one.

I am calling it N for normalized error reduction. In fact, I’ve tried looking and asking around for a literature name. So far, I have not found one.
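The metric above can be sketched in a few lines of NumPy. The function name `normalized_error_reduction` is my own, not from any library:

```python
# A minimal sketch of N (normalized error reduction), as defined above.
import numpy as np

def normalized_error_reduction(y_hat, y):
    """Return max(1 - SSE(model) / SSE(null), 0).

    0 means the prediction is no better than (or worse than)
    predicting the mean; 1 means a perfect prediction.
    """
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    model_error = np.sum((y_hat - y) ** 2)
    null_error = np.sum((y - y.mean()) ** 2)
    return max(1.0 - model_error / null_error, 0.0)

y = np.array([1.0, 2.0, 3.0, 4.0])
print(normalized_error_reduction(y, y))   # perfect prediction: 1.0
print(normalized_error_reduction(-y, y))  # harmful prediction, clipped: 0.0
```

Note that the clipping at zero means a harmful prediction and a useless one are scored the same.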


In the case of a positive correlation with no bias, this reduces to the R-squared between \hat{y} and y, also known as the explained variance.

I like to look at the results in plots like this one


On the x-axis, I have the underlying output, and on the y-axis, the cross-validated (or out-of-bag) prediction for each sample. I also plot the diagonal. In this case, the prediction is very good and there is only a little noise pushing points away from the diagonal.


However, it does not reduce to the R-squared in a few interesting cases:

  1. The model does not predict at all.


Let’s say that your output cannot be explained by the input at all. To simplify things, let’s assume you don’t even have an input, just a set of outputs, \{y_1, \ldots, y_N\}, which you predict as the mean of the training set.

If you use leave-one-out cross-validation (LOOCV), then this null prediction has perfect (negative) correlation with the output (see also here and here). Its R-squared is 1!

If you are using LOOCV, then you’ll probably see this and catch it, but it might slip by if you’re using 10 folds and you don’t have a lot of data, and you accidentally report it as low-but-significant (with 100 datapoints and uniformly sampled y_i, the R-squared is significantly different from zero, with p-value < 0.05, 90% of the time! 25% of the time, the p-value is below 0.0001).
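This pitfall is easy to demonstrate (the simulation below is my own, not from the post): under LOOCV, the prediction for each held-out point is the mean of all the others, which is an affine, decreasing function of that point’s value, so the correlation is exactly -1:

```python
# LOOCV null-model pitfall: predicting each held-out point as the mean of
# the remaining points yields predictions perfectly anti-correlated with y.
import numpy as np

rng = np.random.default_rng(0)
y = rng.uniform(size=100)

n = len(y)
# Leave-one-out mean: (n * ybar - y_i) / (n - 1), linear and decreasing in y_i
loo_pred = (n * y.mean() - y) / (n - 1)

r = np.corrcoef(y, loo_pred)[0, 1]
print(r)  # essentially -1, so the R-squared is essentially 1
```

No model whatsoever, and yet the R-squared comes out perfect; the correlation sign is the only giveaway.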

  2. Reversion-to-the-mean models.


This is an instance where your model was oversmoothed.

In this instance, your model predicts in the right direction, but it underpredicts. However, if you just report the R-squared of the prediction, you’ll lead your readers to think you can predict very well.

It’s not always the case that you could just have gone back and multiplied all your coefficients by a number larger than 1. It may be that larger coefficients would also let noise into the output. This would not happen in traditional regression problems, but in p > n settings, where penalized regression is necessary, it can and does happen (maybe it is an indication to try the relaxed lasso).
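A toy illustration of the oversmoothing case (the shrinkage factor is mine, chosen for effect): predictions that point in the right direction but shrink hard toward the mean still get a perfect correlation-based R-squared, while N reports the real loss:

```python
# Oversmoothed model: predictions shrink toward the mean but stay
# perfectly correlated with the truth.
import numpy as np

def n_metric(y_hat, y):
    """N, as defined in the post: max(1 - SSE(model) / SSE(null), 0)."""
    return max(1.0 - np.sum((y_hat - y) ** 2) / np.sum((y - y.mean()) ** 2), 0.0)

rng = np.random.default_rng(1)
y = rng.normal(size=200)
shrunk = y.mean() + 0.3 * (y - y.mean())  # severe reversion to the mean

r2 = np.corrcoef(y, shrunk)[0, 1] ** 2
print(round(r2, 3))                 # 1.0: R-squared looks perfect
print(round(n_metric(shrunk, y), 3))  # 0.51: N exposes the underprediction
```

Here N works out to 1 - 0.7^2 = 0.51, since the residual of the shrunk prediction is 0.7 times the deviation from the mean.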

  3. Biased models


If your models are biased, this naturally introduces a penalty. R-squared is invariant to addition (R^2(\hat{y} + B, y) = R^2(\hat{y}, y)), but N is not. In fact, if the correlation is positive,

R^2(\hat{y}, y) = \max_B N(\hat{y} + B, y).

I most often see this when comparing human annotations. People have very similar trends, but one of the operators will consistently report a lower value.

In some cases this may not matter and in others it will (if Matt always needs to give a bigger discount than John to get a sale, this does not mean that Matt and John are equivalent salesmen).
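The bias case can also be checked numerically (the noise level and offset below are arbitrary choices of mine): a constant shift leaves the R-squared untouched but can wipe out N entirely:

```python
# Additive bias: R-squared is invariant to a constant offset, N is not.
import numpy as np

def n_metric(y_hat, y):
    """N, as defined in the post: max(1 - SSE(model) / SSE(null), 0)."""
    return max(1.0 - np.sum((y_hat - y) ** 2) / np.sum((y - y.mean()) ** 2), 0.0)

rng = np.random.default_rng(2)
y = rng.normal(size=500)
y_hat = y + 0.1 * rng.normal(size=500)  # a good, unbiased predictor
biased = y_hat + 2.0                    # same predictor, constant offset

r2 = np.corrcoef(y, y_hat)[0, 1] ** 2
r2_biased = np.corrcoef(y, biased)[0, 1] ** 2
print(np.isclose(r2, r2_biased))                  # True: shift-invariant
print(n_metric(y_hat, y) > n_metric(biased, y))   # True: N penalizes the bias
```

With an offset this large relative to the spread of y, the biased predictor’s N is clipped to zero even though its R-squared is still near perfect.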


In all of these, the R-squared is excellent, but I think it is overselling how good the prediction is, whilst N is more honest.

3 thoughts on “Evaluating Regression with Cross-Validation”

  1. Luis, if I’m not missing any subtle point in your post, the MSE estimate based on the LOOCV (or other CV) is called PRESS (predictive error sum of squares) and what you call N (without the thresholding at zero) is usually called cross-validated R2 (or sometimes Q2). Take a look at this paper http://www.jstor.org/stable/1391469. There is a nice recent discussion on this theme in the following paper: http://onlinelibrary.wiley.com/doi/10.1002/cem.1290/abstract (I can email you the PDF).

    1. Thanks!

      I cannot get the paper [aarghh!], but it seems this is Q2. I think I missed this because of the squared in the name, while this is not a square (it can even be negative).

