I’ve complained before about the reliability and quality of the open source statistics package, R. Sometimes I get pushback, with people suggesting that I just don’t understand what R is trying to do, or that there is an obvious way to do things differently that answers my complaints – that R is idiosyncratic but generally trustworthy.
Well, try this exercise, which I stumbled on today while trying to teach basic programming in R (the steps are sketched in code just below the list):
- Run a logistic regression model on any reasonable data set and assign the output to an object (let’s call it logit1)
- Extract the Akaike Information Criterion (AIC) from this object, using the command logit1$aic. What is the value of the AIC?
- Now extract basic information from the logistic model by typing its name (logit1). What is the value of the AIC?
- Now extract more detailed information from the logistic model by typing summary(logit1). What is the value of the AIC?
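For concreteness, here’s a minimal sketch of the exercise. I’m using the built-in mtcars data purely as a stand-in; any reasonable data set with a binary outcome will do:

    # Step 1: fit a logistic regression and store the result
    # (mtcars is just a stand-in; substitute any binary-outcome data)
    logit1 <- glm(am ~ mpg, data = mtcars, family = binomial)

    logit1$aic       # step 2: the stored AIC, at full printed precision
    logit1           # step 3: the basic display (an implicit print(logit1))
    summary(logit1)  # step 4: the detailed display, which formats the AIC again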
When I did this today, the AIC from logit1$aic was 54720.95; from the summary function it was 54721; and from the basic output option it was 54720.
That’s right: depending on how you extract the information, R either rounds the value of the AIC up or truncates it. R truncates a numerical value without telling you.
Do you trust this package to conduct a maximum likelihood estimation procedure, when its developers not only can’t adhere to standard practice in rounding, but can’t even be internally consistent in their errors? And how can you convince someone who needs reliability in their numerical algorithms that they should use R, when R can’t even round numbers consistently?
I should point out that a decision to truncate a number is not a trivial one. That isn’t something that happens because you didn’t change a default. Someone consciously programmed the basic output display method in R to truncate rather than round off. At some point they faced a decision between floor() and round() for a basic, essential part of the I/O of a statistics package, and they decided floor() was the better option. And no one has changed that decision since. I don’t think it’s a precision error either (the default precision of the summary function is 4 digits!), because the example I stumbled across today ended with the decimal digits .95. This was a conscious programming decision that no one has bothered to fix.
The more I work with R, the more I believe that it is only good for automation, and all its output needs to be checked in a system with actual quality control. And that you should never, ever use it for any process anyone else is going to rely upon.
December 2, 2014 at 7:18 am
Monsters I say!
December 3, 2014 at 8:01 am
You are complaining about the behavior of the print() function in R, which is what gets called any time you enter a bare variable name in R. summary(logit1) simply PRINTS a pretty-formatted value to the screen. That value is NOT what is stored under the hood, as summary(logit1)$aic demonstrates. Again, when you simply type logit1 into R, the value you see in the console comes from a generic version of print() for that model type. The rounding is done purely for pretty-printing and NOT to the underlying data! So your complaints are really not valid, because absolutely no truncation of actual data is done, and the fully precise AIC value is untouched.
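For example (a quick sketch, with logit1 being any fitted glm):

    # The stored value is untouched; only the console display is rounded
    summary(logit1)$aic == logit1$aic   # TRUE: identical under the hood
    print(logit1$aic, digits = 15)      # ask for more digits and you get them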
December 3, 2014 at 8:45 am
If the project is open source, open a pull request!
December 3, 2014 at 12:11 pm
Paul, there is a special place in hell for people who round down without good reason.
Charlie, thanks for commenting. Yes, I am comparing the print() and summary() functions in R. Note that when I type logit1$aic I am also invoking the print function, but it delivers completely different precision from the print(logit1) command or the summary(logit1) command (which presumably involves an implicit print() of its own). This seems to me to be another example of the inconsistency I am describing. Why should print(logit1) and print(logit1$aic) have different precision, and furthermore why should any print() function in a numerical analysis package have a precision of just 4 figures (which is what I assume print(logit1) is using)? There is no justification for curtailing numerical values this roughly in a modern computing environment, especially a numerical analysis one, and especially with a value like the AIC, which in most cases will have a large number of figures but, when used to compare models, requires high precision.
This is especially true since the print(logit1) function I apply here is essentially the basic, first instance of output that R gives you. It’s the output that most people using R for basic statistical analysis are most likely to read. Having inconsistent and highly imprecise rounding on this screen is obviously dangerous.
My point though is not that this is particularly dangerous (it is obvious that the value is rounded, so anyone with any sense will not report it regardless of the specific rounding method used). My point is that these basic I/O procedures have completely arbitrary degrees of precision for reporting the same value; they don’t show any warnings and they don’t have any coherence. It cannot be the case that the print() function contains information on the specific degree of precision expected for every single object it prints, so the information must be contained in the parameters of that object. How do you change, or even understand, those values? The R help suggests that there is a single global setting for this (accessed through the getOption() function), but this is clearly not true. The summary() function for glm also uses this option (via max(3, getOption("digits") - 3))[1], but the print function is using different values depending on the object it prints (contrast your print(logit1) with print(logit1$aic)). It may be possible to unravel this stuff, but why should anyone have to? And if we have to unravel something as simple as this because it doesn’t behave as the help tells us, why on god’s green earth would we trust the program’s maximum likelihood estimation process?
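To illustrate the kind of unravelling I mean (a sketch; the defaults may differ across R versions and setups):

    getOption("digits")               # the global default, 7 on my system
    max(3, getOption("digits") - 3)   # the 4 digits the glm display actually uses
    print(logit1$aic)                 # honours the global 7-digit default
    print(logit1)                     # uses the derived 4-digit default instead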
And, as I said above, in the modern computer age there is absolutely no reason to restrict precision in this way. The idea that a numerical analysis package should distinguish between “pretty-printing” and printing is kind of weird. This isn’t Excel!
Also note that the print() function is often used as part of debugging, and returning imprecise and idiosyncratically-rounded values is not a very productive property of this function when used in this way.
—
fn1: what on earth is a numerical analysis package doing setting the default minimum precision of its summary() function at 3? That is just ridiculous. Sure, if we were using a ZX Spectrum that would make sense, but most modern analysts work on computers with wide screens, multiple chips and multiple handfuls of RAM. Why on earth would we want a precision of 3 figures???
December 3, 2014 at 1:49 pm
While your post is quite interesting, it contains a number of inaccuracies:
-“they faced a decision between floor() and round()… and they decided floor() was the better option.” This is not true; rather, it was rounding to 4 digits of significance (which turns 54720.95 into 54720; see the illustration below this list).
-“Do you trust this package to conduct a maximum likelihood estimation procedure…” what maximum likelihood procedure could possibly be affected by how the output prints? logit1$aic and summary(logit1)$aic both return precisely the same value (accurate to however many digits you need), so any ML optimization would be on an unrounded value.
-“It cannot be the case that the print() function contains information on the specific degree of precision expected for every single object it prints, so the information must be contained in the parameters of that object. How do you change, or even understand, those values?” The information isn’t stored in the objects; it’s stored in default arguments to the print.glm and print.summary.glm methods. Both of those functions take a digits argument, so it can be controlled with (e.g.) print(logit1, digits = 7) or print(summary(logit1), digits = 7). Indeed, this is exactly what the help (“?print.summary.glm”) tells us.
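To see the difference concretely (a quick illustration of the rounding modes):

    signif(54720.95, digits = 4)   # 54720: significant-digit rounding
    round(54720.95)                # 54721: rounding to the nearest integer
    floor(54720.95)                # 54720: truncation, which merely coincides here

And making the precision explicit (with logit1 as in your post), both displays then agree:

    print(logit1, digits = 7)            # basic display at 7 significant digits
    print(summary(logit1), digits = 7)   # summary display at the same precision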
While your comment corrects some of the problems, it seems to me that the original post is misleading, and spreads the kind of misinformation that encourages voodoo programming (“Oh, don’t use the AIC in an optimization procedure, I once heard that R truncates that value.” “Oh, my algorithm’s not reaching an optimum, but it’s probably just because a value got truncated along the way.”) Do you plan to correct the errors for the benefit of future readers?
I have plenty of complaints about inconsistencies in R’s modeling and summary outputs (why name a column “Pr(>|z|)”??), some of which led me to create the broom package to convert model outputs to a consistent data frame. But how model summaries print isn’t high among my complaints. To almost anyone glancing at the model output, the fifth significant digit of the AIC isn’t going to matter – and for those to whom it does, passing a digits argument to print solves the problem.
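For example (a quick sketch, assuming broom is installed):

    library(broom)   # install.packages("broom") if needed
    glance(logit1)   # one-row data frame: AIC, deviance, etc., at full precision
    tidy(logit1)     # the coefficient table as a plain data frame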
Finally, I don’t see how this could serve as a criticism of open source software. Thanks to the fact that the code for print.glm and print.summary.glm is open source and available (simply by typing their names into an R console), I was able to discover the reason for the inconsistency in seconds. If this had been closed source software, I would have had to file a bug report (assuming I were paying an annual fee to earn such a privilege). And it’s certainly not as though closed source software is immune from bugs and rounding errors. To quote Douglas Adams, “the major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair.”
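For the record, here’s all the digging it took (a sketch; the methods aren’t exported from the stats namespace, so getS3method() pulls them out):

    # Read the source of the two print methods directly
    getS3method("print", "glm")
    getS3method("print", "summary.glm")
    # Both take a digits argument defaulting to max(3L, getOption("digits") - 3L);
    # in the version I'm reading, the summary method formats the AIC with one
    # extra digit, which is exactly the 54721-versus-54720 discrepancy.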
December 3, 2014 at 3:45 pm
Thanks for your comment safe4democracy. I realize that the issue was significant-digit rounding rather than floor(), though the two can have the same effect in the instance I described. My point about trust is not meant to imply that the output of the print procedure will affect maximum likelihood estimation; rather, it is that these kinds of arbitrary, counter-intuitive, hidden programming decisions in something as simple as print() make it very hard for me to believe that the MLE procedures (for example) are free of similar arbitrary conditions. And when I say “hidden” here I don’t mean inaccessible (great Douglas Adams quote btw, and I fully agree with you about this benefit of open source software); rather, by “hidden” I mean that strong and arbitrary decisions are not communicated up front. As I say in my comment above, it’s very obvious that the values are rounded, so no one is going to use them as they are produced. But idiosyncratic variations in precision that are unannounced up front are a sign of poor communication and quality control, and trying to figure them out is a waste of my time and everyone else’s. I cannot see any possible reason for making the precision of the AIC from print(logit1) different from that of print(logit1$aic) or summary(logit1), and more relevantly I fail to see any reason why I should have to anticipate such inconsistencies (and thus go looking for exact details about which one is right).
R does have examples of more serious programming problems that are not declared and are extremely frustrating – Andrew Gelman often mentions them. The most egregious is the default number of integration points in lme4, and the restrictions it places on models of more than 2 levels; there is also something wrong with the optimization procedure in glm that causes it to fail to converge in some trivial examples (Gelman also stumbled on this one). For a long time its reporting of p values in ARIMA models was apparently wrong, and of course in some other model outputs it doesn’t report p values at all, which is just frustrating and silly. But my point in this post is not about the more serious programming problems one can encounter in R; rather, it is about this capricious and idiosyncratic style.

I think some of the comments here contain examples of what I’m complaining about: you and Charlie are defending R from the reasonable vantage point that experienced programmers and users should be able to read these tiny details in the help. But my perspective comes from a long time working with people who are not experienced programmers and users, and I think all statistics packages should have some awareness of the ways in which they are actually used by the majority of people who need to do stats, rather than the ways they should be used by an elite of experienced programmers. The kind of arbitrary precision decisions made here (in a world of 64-bit computing and 21″ monitors!) are good examples of this. Why not just report everything to the same precision, as other stats packages do? And if you’re going to constrain precision, why constrain it to fewer figures than a standard AIC will return?? Anyone shifting to R from another package, or an inexperienced programmer attempting to learn how to use R for some complex problem (and in my experience a lot of people do this!), will fall afoul of these kinds of idiosyncrasies. Saying “oh, but your mistake is perfectly clear if you dig out the getOption command and properly understand the specific way that R handles precision differently every time it prints something” is not going to broaden the language’s appeal.
So yes I agree with you that R makes checking problems very easy and clear through its open source nature. But these quality control issues make it very dangerous for people attempting to use it without a very solid background in numerical analysis – and I am willing to bet the majority of people who use R are, in fact, not highly trained numerical analysts.
December 4, 2014 at 12:06 am
This is such a non-issue.
I would never use the print() outputs for anything more than visual inspection. Actual use of the data in further analysis would always use the full precision (or whatever default setting you have in options()) stored in the object.
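A quick sketch of what I mean:

    # Downstream comparisons use the stored values, not the printed ones
    AIC(logit1)               # the AIC generic returns the full-precision value
    logit1$aic - AIC(logit1)  # zero: print() rounding never touches the data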
The inconsistency is worth noting perhaps, but if it really bothers you, submit a bug report to the R team.
There are problems in R. The idiosyncrasies of different packages and functions are one of them; you just need to get used to it.
December 4, 2014 at 12:16 am
Idiosyncratic programming without guides, warnings or assistance is not a non-issue. Arbitrary decisions in basic functionality are not a non-issue. Do you think that SAS, Stata and SPSS would have achieved the popularity and success they have if they had arbitrary rounding, incomprehensible help functions, known issues in key packages that they refuse to resolve, an online help community that is rude and abrasive, and an assumption that everyone who uses the product is an advanced user who knows all the minutiae of the system and has only themselves to blame for missing the idiosyncrasies in its basic functions?
Or to put it another way: when you interact with another mathematician, do you take their judgment about serious mathematical processes seriously if they can’t even be consistent in their presentation of basic numbers? There’s already a comment here suggesting my opinion is flawed because I appear to have confused precision and rounding. Shouldn’t the same critical view be applied to the makers of R? If they can’t be consistent in their handling of basic details of presentation, and they seriously think that 7 figures is sufficient precision for serious numerical analysis, should their efforts at more important mathematical procedures be taken seriously? Is it any wonder that the glm function in R fails where it operates smoothly in Stata? Small flaws like this are evidence of a broader problem of quality control, so no, they are not a non-issue. Try getting published in a major journal if you can’t consistently round your estimates to a reasonable precision – it’s not a non-issue then!