I’ve complained before about the reliability and quality of the open source statistics package, R. Sometimes I get pushback, with people suggesting that I just don’t understand what R is trying to do, or that there is an obvious way to do things differently that answers my complaints – that R is idiosyncratic but generally trustworthy.

Well, try this exercise, which I stumbled on today while trying to teach basic programming in R (the steps are sketched in code just after the list):

  • Run a logistic regression model with any reasonable data set, assign the output to an object (let’s call it logit1)
  • Extract the Akaike Information Criterion (AIC) from this object, using the command logit1$aic. What is the value of the AIC?
  • Now extract basic information from the logistic model by typing its name (logit1). What is the value of the AIC?
  • Now extract more detailed information from the logistic model by typing summary(logit1). What is the value of the AIC?
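
In code, the exercise looks roughly like this – a hedged sketch, where mydata, outcome and x1 stand in for whatever reasonable data set and variables you use:

logit1 <- glm(outcome ~ x1, family = binomial, data = mydata)  # any reasonable logistic model
logit1$aic        # the AIC at full printing precision
logit1            # the basic output: note its AIC line
summary(logit1)   # the detailed output: note its AIC line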

When I did this today my AIC value was 54720.95. From the summary function it was 54721; from the basic output option it was 54720.

That’s right, depending on how you extract the information, R rounds the value of the AIC up, or truncates it. R truncates a numerical value without telling you.

Do you trust this package to conduct a maximum likelihood estimation procedure, when its developers not only can’t adhere to standard practice in rounding, but can’t even be internally consistent in their errors? And how can you convince someone who needs reliability in their numerical algorithms that they should use R, when R can’t even round numbers consistently?

I should point out that a decision to truncate a number is not a trivial decision. That isn’t something that happens because you didn’t change the default. Someone actually consciously programmed the basic output display method in R to truncate rather than round off. At some point they faced a decision between floor() and round() for a basic, essential part of the I/O for a statistics package, and they decided floor() was the better option. And no one has changed that decision ever since. I don’t think it’s a precision error either (the default precision of the summary function is 4 digits!) because the example I stumbled across today ended with the decimal digits .95. This was a conscious programming decision that no one bothered to fix.
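
For the value in question, the difference between the two choices is one line of arithmetic:

round(54720.95)   # 54721 - what summary() reported
floor(54720.95)   # 54720 - what the basic output reported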

The more I work with R, the more I believe that it is only good for automation, and all its output needs to be checked in a system with actual quality control. And that you should never, ever use it for any process anyone else is going to rely upon.

I stumbled on a problem with out-of-sample prediction in R, which I think is an example of idiosyncratic programming (and possibly insoluble) and, in the spirit of Andrew Gelman’s stats blog, I thought I’d put it on the internet to see if a) I’m just being stupid and b) it really is as much of a barrier to automation of statistical functions as I think it is. Perhaps someone with more clout than me will find this blog and notice this problem – or perhaps it’s not a problem and I’m just being stupid (or unfairly demanding of my prediction modeling software).

In case you haven’t worked this out already, if you aren’t interested in stats, R, or nerdy shit that wastes everyone’s time, you should stop reading right now. Everything beneath the dotted line is going to kill your love life, and probably your pets.

First, let’s consider the basic out-of-sample prediction process. We run a very simple linear model and then provide R with a set of new data – containing exactly the same variables, of the same types – and get it to work out the expected value of the fitted line at the new predictor values. Taking the example straight from the manual, here is the code for a trivial predictive model:

x <- rnorm(15)
y <- x + rnorm(15)
predict(lm(y ~ x))
new <- data.frame(x = seq(-3, 3, 0.5))
predict(lm(y ~ x), new, se.fit = TRUE)

This is very simple code: we produce a set of x values and make the y values equal to these x values plus a random amount[1]. Then we run the model and get the predicted values (just for shits and giggles, btw), and then we make a new data set that contains values ranging from -3 to 3 in steps of 0.5, and run the predictions on these new x values. In this code, the data frame called new is the out-of-sample data, and we get the expected value of y given the observed values in new, for the given model. This is stats 101, right?

In my case, the x values are

1.5313262  1.5307600 -0.5067223 -0.1366010 -0.6557527  1.4245267 -0.3917136  1.7941995  2.0511560  0.3602334 -0.8798598 -0.5755816 -1.6419118 -0.7885237  1.1478910

and the coefficients of the resulting linear model are

[0.367,1.047]

That is, my model has an intercept of 0.367 and a slope of 1.047. This means that, for example, for a new x value of -3 I would expect y to be 0.367 + (-3 × 1.047), or about -2.774. The correct predicted values for the new data set, based on these coefficients, produced by the 5th line of my code, are:

-2.7751002 -2.2514963 -1.7278925 -1.2042886 -0.6806847 -0.1570809  0.3665230  0.8901269  1.4137307  1.9373346  2.4609385  2.9845424  3.5081462
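
As a sanity check, here’s a short sketch that reproduces those numbers directly from the coefficients (assuming the fitted model is saved as fit1; in the code above it wasn’t assigned to anything):

fit1 <- lm(y ~ x)
b <- coef(fit1)          # intercept and slope
b[1] + b[2] * new$x      # matches the predicted values listed above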

Now, let’s produce our first example of a problem. Consider the following very simple code:

yy<-cbind(y,x)
pred.m1<-predict(lm(yy[,1]~yy[,2]))
pred.mnew<-predict(lm(yy[,1]~yy[,2]),new,se.fit=TRUE)

Here, I’ve bound my previous x and y values together to make a matrix, with y in column 1 and x in column 2. Then I’ve rerun the model using the perfectly natural R method of referring to each variable in the data set by its column reference ([,1] for the y, [,2] for the x). Then I’ve rerun the out-of-sample prediction on the same set of x values as I used before (the data frame new). This is the simplest possible non-trivial linear model, and out-of-sample prediction here should be as simple as it was for the previous model. So what happens? First, the following “warning”:

Warning message:
‘newdata’ had 13 rows but variable(s) found have 15 rows

Note this is a warning. So what values do we get? Here they are:

1.97013967  1.96954674 -0.16412052  0.22347339 -0.32018633  1.85829845 -0.04368247  2.24542264  2.51450943  0.74376222 -0.55487302 -0.23623050 -1.35289974 -0.45922514  1.56860335

These values are different. In fact, they’re the in-sample predictions. You can check it (a quick check is sketched after the list below) – they’re the result of applying the coefficients to the original observations of the predictor variable. This means that two important things happened:

  1. R failed to produce a solution to the simplest possible linear modeling problem, simply because I changed the input method for my regression model – even though the revised input methods (for the formula and the data) are entirely consistent with the R language
  2. The “warning” was not a warning: R didn’t just warn me about a problem, it failed to apply the command I told it to, and didn’t tell me about its decision. I could publish this shit, and not realize that I’ve been given my original values back.
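
Here’s the quick check I mean, reusing the objects defined above:

# the 'out-of-sample' predictions are identical to the in-sample ones...
all.equal(as.numeric(pred.mnew$fit), as.numeric(pred.m1))
# ...and to the coefficients applied to the ORIGINAL x values, not to new$x
b <- as.numeric(coef(lm(yy[,1] ~ yy[,2])))
all.equal(as.numeric(b[1] + b[2] * x), as.numeric(pred.mnew$fit))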

Note how thoroughly horrible this would be if I were running some kind of automated series of regression processes – I once did a capture/recapture model that was heavily dependent on automation in R, and I’m wondering if I fell for this crap then. R should not be declaring a “warning” when it is actually refusing to run the command I asked it to and producing completely different output. I know I’m only a statistician, but imagine if this kind of programming were allowed in Jumbo Jets. Unhappy punters we would all be, young Jedi.

Also let’s just pause and dwell on the meaningless nature of that warning. I’ve supplied a vector of 13 observations, and the vector of 1s (for the constant) is implicit in everything I’ve done to date. So this means I’m (reasonably) assuming R will construct a 13×2 matrix with a column of 1s and a column of (out-of-sample) x values. Then, we have a 2×1 vector of coefficients. Thus, R should be able to calculate a vector of outputs from the information I’ve supplied it. The only possible explanation is that it is expecting me to supply it with a full-rank matrix, i.e. it has decided arbitrarily to stop implicitly assuming the constant. So, shall we try that?

new.fr<-cbind(rep(1,13),new)
pred.frnew<-predict(lm(yy[,1]~yy[,2]),new.fr,se.fit=TRUE)

Same prediction process, only now the out-of-sample data that I provide is full-rank, containing a column of 1s and a column of (out-of-sample) x values[2].

The result? The same problem: the same warning in red, and in-sample expected values. So the problem is not that I’m providing a less-than-full-rank matrix of predictors on the assumption that one column of the design matrix is implicit. The problem is deeper than that.
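
Just to be clear that the matrix arithmetic itself poses no difficulty, here is a sketch of the calculation I’m expecting R to perform, using the objects above:

b <- as.numeric(coef(lm(yy[,1] ~ yy[,2])))   # 2 x 1 coefficient vector
X.new <- cbind(1, new$x)                     # 13 x 2 design matrix: a column of 1s, the new x values
as.numeric(X.new %*% b)                      # the 13 out-of-sample predictions I asked for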

I can, however, fix the problem in a weird way. If I assign column names to the original data, and also assign the new data a name that matches the name of the predictor in the original data, I get the correct output:

yy2<-data.frame(yy)
names(yy2)<-c("Outcome","Predictor")
new2<-new
names(new2)<-c("Predictor")
pred.mnamed<-predict(lm(Outcome~Predictor,data=yy2),new2,se.fit=TRUE)

In this case I’ve made a new data frame for the original data and the out-of-sample data, to avoid confusion, and I’ve given them all consistent names. Note now that in the prediction process the out-of-sample data, new2, is not full rank – the column of 1s is assumed in the calculation process. But it doesn’t cause any trouble: the previous warning about seeing only 13 observations doesn’t appear, and I get the correct out-of-sample predictions. I still only have 13 observations, but R doesn’t care. I get correct predictions and everything is fine.
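
A quick check, assuming the correct predictions from the first model were saved as pred.vnew (as in the full code at the end of this post):

all.equal(as.numeric(pred.mnamed$fit), as.numeric(pred.vnew$fit))   # TRUE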

This might seem like a trivial problem until you’re doing automation. I recently wrote a program to do a search through a large number of univariate models, and then combine the variables from the top 10 in a predictive model. The predictions were the only thing I wanted. But I spent hours trying to work out why R wouldn’t give me out-of-sample predictions: my training sample had about 50k records, and my validation sample had about 20k. I was using numeric column references for the purposes of automation (the process should be fairly obvious: you make a matrix of variable numbers with AIC values from the corresponding linear models, sort by AIC values, choose the variable numbers corresponding to the best 10, and then put them into a formula in a linear model) but it wasn’t working. Eventually I worked out this problem, made a vector of column names and used the paste() function to produce a formula.
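
The kind of thing I mean looks roughly like this – a hedged sketch, with hypothetical names: a training data frame dat whose outcome column is y and whose other columns are candidate predictors, and a validation data frame validation with the same column names:

# AIC for each candidate predictor in a univariate model of its own
aics <- sapply(names(dat)[names(dat) != "y"], function(v) {
  AIC(lm(reformulate(v, response = "y"), data = dat))
})
best10 <- names(sort(aics))[1:10]                                 # the 10 lowest-AIC predictors
form <- as.formula(paste("y ~", paste(best10, collapse = " + ")))
predict(lm(form, data = dat), newdata = validation)               # out-of-sample predictions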

But this workaround is stupid. In both cases – using column numbers or referring to names – the design matrix is implicit, not specified by the user. The underlying R code has to construct this design matrix either way. So why does it screw me around like this? Why provide two ways of referencing columns and then have one of them bork your most basic modeling function? What would Tibshirani say? Dividing data into training and validation sets is an essential part of modern machine learning practice, but R can’t handle it because – omfg – one has a different number of records to the other. Even though the matrix mathematics is in no way affected by the difference in the number of records.

That, dear reader, is called crap program design. Especially crap is the issuing of a warning that isn’t a warning at all, but an indication that the program has arbitrarily decided to produce a result different to the one you asked for! Rather than declaring a “warning,” R should say clearly “Catastrophic failure! Providing default prediction using original design matrix only!” Or better still it could say “What are you doing, Dave?” and refuse to let me out of the cargo bay doors.

The manual says the following about giving out-of-sample data to the program:

If the fit is rank-deficient, some of the columns of the design matrix will have been dropped. Prediction from such a fit only makes sense if newdata is contained in the same subspace as the original data. That cannot be checked accurately, so a warning is issued.

and

A warning will be given if the variables found are not of the same length as those in newdata if it was supplied.

This is not exactly the most informative help in the world. I don’t think that what is described in the first passage (to the extent that I understand the English) is what happens in practice, and the second is certainly misleading. In R, warnings usually indicate something you should be aware of that does not affect the numerical processing of results (see e.g. Andrew Gelman’s statement in the above blog link, “I don’t care about warning messages”). Except you should care, because sometimes R issues a message called a “warning” that is actually an indicator of an insouciant and catastrophic failure to enact the expected mathematical process.
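
If you’re automating and want to defend against this, one base-R option (a sketch, not a fix) is to promote warnings to errors around the risky call, so a script halts instead of silently falling back:

options(warn = 2)                                               # treat any warning as an error
pred.mnew <- predict(lm(yy[,1] ~ yy[,2]), new, se.fit = TRUE)   # now stops here instead of 'warning'
options(warn = 0)                                               # restore the default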

Open source software: it sucks. Except that R is far and away the best software for automation in stats – until you run into crap like this. Fix it, CRAN!

fn1: of course the R manual makes this part just slightly more obscure than it needs to be, because hey! R is a really pithy language, so why would you use comments to make the process easier to grasp, or separate the process of generating x from the process of generating epsilon? Anyone who knows what they’re doing can figure it out, and manuals are for people who already know what they’re doing – right?
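
For what it’s worth, here is roughly what I mean – the same example with the noise generated separately and a comment or two:

n <- 15
x <- rnorm(n)      # the predictor
eps <- rnorm(n)    # the random error term
y <- x + eps       # true intercept 0, true slope 1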

fn2: I guess I could do this more accurately by supplying a variable containing only 1s, and specifying “no constant.” That might work, I suppose. How horrid.
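
A sketch of what that might look like, for anyone who wants to try it:

ones <- rep(1, 15)
fit.nc <- lm(y ~ 0 + ones + x)                        # '0 +' suppresses the implicit constant
new.ones <- data.frame(ones = 1, x = seq(-3, 3, 0.5)) # out-of-sample data with an explicit column of 1s
predict(fit.nc, new.ones, se.fit = TRUE)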

Here is the full code to reproduce all these results:

## Predictions
x <- rnorm(15)
y <- x + rnorm(15)
pred.v1<-predict(lm(y ~ x))
new <- data.frame(x = seq(-3, 3, 0.5))
pred.vnew<-predict(lm(y ~ x), new, se.fit = TRUE)

## now make a data set yy that contains a column for x and a column for y

yy<-cbind(y,x)

pred.m1<-predict(lm(yy[,1]~yy[,2]))

pred.mnew<-predict(lm(yy[,1]~yy[,2]),new,se.fit=TRUE)

## now give the data set names and retry it
yy2<-data.frame(yy)
names(yy2)<-c("Outcome","Predictor")
new2<-new
names(new2)<-c("Predictor")

pred.mnamed<-predict(lm(Outcome~Predictor,data=yy2),new2,se.fit=TRUE)

# now try a full rank matrix
new.fr<-cbind(rep(1,13),new)
pred.frnew<-predict(lm(yy[,1]~yy[,2]),new.fr,se.fit=TRUE)

x.mat<-as.numeric(lm(yy[,1]~yy[,2])$coefficients)

# try again with matrix maths
new.fr<-as.matrix(cbind(rep(1,13),new))
pred.frnew2<-new.fr %*% x.mat