Compromise and Conceit

Infernal adventuring…

about

An Australian statistician and role-player living in Japan, interested in the theory and practice of role-playing…

Problems with Out of Sample Prediction using R

February 16, 2012

I stumbled on a problem with out-of-sample prediction in R, which I think is an example of idiosyncratic programming (and possibly insoluble) and, in the spirit of Andrew Gelman’s stats blog, I thought I’d put it on the internet to see if a) I’m just being stupid and b) it really is as much of a barrier to automation of statistical functions as I think it is. Perhaps someone with more clout than me will find this blog and notice this problem – or perhaps it’s not a problem and I’m just being stupid (or unfairly demanding of my prediction modeling software).

In case you haven’t worked this out already, if you aren’t interested in stats, R, or nerdy shit that wastes everyone’s time, you should stop reading right now. Everything beneath the dotted line is going to kill your love life, and probably your pets.

—

First, let’s consider the basic out-of-sample prediction process. We run a very simple linear model and then we provide R with a set of new data – containing exactly the same variable and type of variable – and get it to work out what the expected value of the line is on the new predictor variables. Taking the example straight from the manual, here is the code for a trivial predictive model:

x <- rnorm(15)
y <- x + rnorm(15)
predict(lm(y ~ x))
new <- data.frame(x = seq(-3, 3, 0.5))
predict(lm(y ~ x), new, se.fit = TRUE)

This is very simple code: we produce a set of x values and make the y values equal to these x values plus a random amount[1]. Then we run the model and get the predicted values (just for shits and giggles, btw), and then we make a new data set that contains values ranging from -3 to 3 in steps of 0.5, and run the predictions on these new x values. In this code, the data frame called new is the out-of-sample data, and we get the expected value of y given the observed values in new, for the given model. This is stats 101, right?

In my case, the x values are

1.5313262 1.5307600 -0.5067223 -0.1366010 -0.6557527 1.4245267 -0.3917136 1.7941995 2.0511560 0.3602334 -0.8798598 -0.5755816 -1.6419118 -0.7885237 1.1478910

and the coefficients of the resulting linear model are

[0.367,1.047]

That is my model has an intercept of 0.367 and a slope of 1.047. This means that, e.g. for a new x value of -3 I would expect y to be 0.367+-3*1.047, or about -2.774. The correct predicted values for the new data set, based on these coefficients, produced by the 5th line of my code, are:

-2.7751002 -2.2514963 -1.7278925 -1.2042886 -0.6806847 -0.1570809 0.3665230 0.8901269 1.4137307 1.9373346 2.4609385 2.9845424 3.5081462

Now, let’s produce our first example of a problem. Consider the following very simple code:

yy<-cbind(y,x)
pred.m1<-predict(lm(yy[,1]~yy[,2]))
pred.mnew<-predict(lm(yy[,1]~yy[,2]),new,se.fit=TRUE)

Here, I’ve bound my previous x and y values together to make a matrix, with y in column 1 and x in column 2. Then I’ve rerun the model using the perfectly natural R method, of referring to each variable in the data set by its column reference ([,1] for the y, [,2] for the x). Then I’ve rerun the out-of-sample prediction on the same set of x values as I used before (the vector new). This is the simplest possible non-trivial linear model, and out-of-sample prediction in this case should be as simple as in the case of the previous model. So what happens? First, the following “warning”:

Warning message:
‘newdata’ had 13 rows but variable(s) found have 15 rows

Note this is a warning. So what values do we get? Here they are:

1.97013967 1.96954674 -0.16412052 0.22347339 -0.32018633 1.85829845 -0.04368247 2.24542264 2.51450943 0.74376222 -0.55487302 -0.23623050 -1.35289974 -0.45922514 1.56860335

These values are different. In fact, they’re the in-sample predictions. You can check it – they’re the result of applying the coefficients to the original observations of the predictor variable. This means that two important things happened:

R failed to produce a solution to the simplest possible linear modeling problem, simply because I changed the input method for my regression model – even though the revised input methods (for the formula and the data) are entirely consistent with the R language
The “warning” was not a warning: R didn’t just warn me about a problem, it failed to apply the command I told it to, and didn’t tell me about its decision. I could publish this shit, and not realize that I’ve been given my original values back.

Note how thoroughly horrible this would be if I were running some kind of automated series of regression processes – I once did a capture/recapture model that was heavily dependent on automation in R, and I’m wondering if I fell for this crap then. R should not be declaring a “warning” when it is actually refusing to run the command I asked it to and producing completely different output. I know I’m only a statistician, but imagine if this kind of programming were allowed in Jumbo Jets. Unhappy punters we would all be, young Jedi.

Also let’s just pause and dwell on the meaningless nature of that warning. I’ve supplied a vector of 13 observations, and the vector of 1s (for the constant) is implicit in everything I’ve done to date. So this means I’m (reasonably) assuming R will construct a 13×2 matrix with a column of 1s and a column of (out-of-sample) x values. Then, we have a 2×1 vector of coefficients. Thus, R should be able to calculate a vector of outputs from the information I’ve supplied it. The only possible explanation is that it is expecting me to supply it with a full-rank matrix, i.e. it has decided arbitrarily to stop implicitly assuming the constant. So, shall we try that?

new.fr<-cbind(rep(1,13),new)
pred.frnew<-predict(lm(yy[,1]~yy[,2]),new.fr,se.fit=TRUE)

Same prediction process, only now the out-of-sample data that I provide is full-rank, containing a column of 1s and a column of (out-of-sample) x values[2].

The result? The same problem: the same warning in red, and in-sample expected values. So the problem is not that I’m providing a less-than-full-rank matrix of predictors on the assumption that one column of the design matrix is implicit. The problem is deeper than that.

I can, however, fix the problem in a weird way. If I assign column names to the original data, and also assign the new data a name that matches the name of the predictor in the original data, I get the correct output:

yy2<-data.frame(yy)
names(yy2)<-c(“Outcome”,”Predictor”)
new2<-new
names(new2)<-c(“Predictor”)
pred.mnamed<-predict(lm(Outcome~Predictor,data=yy2),new2,se.fit=TRUE)

In this case I’ve made a new data frame for the original data and the out-of-sample data, to avoid confusion, and I’ve given them all consistent names. Note now that in the prediction process the out-of-sample data, new2, is not full rank – the column of 1s is assumed in the calculation process. But it doesn’t cause any trouble: the previous warning about seeing only 13 observations doesn’t appear, and I get the correct out-of-sample predictions. I still only have 13 observations, but R doesn’t care. I get correct predictions and everything is fine.

This might seem like a trivial problem until you’re doing automation. I recently wrote a program to do a search through a large number of univariate models, and then combine the variables from the top 10 in a predictive model. The predictions were the only thing I wanted. But I spent hours trying to work out why R wouldn’t give me out-of-sample predictions: my training sample had about 50k records, and my validation sample had about 20k. I was using numeric column references for the purposes of automation (the process should be fairly obvious: you make a matrix of variable numbers with AIC values from the corresponding linear models, sort by AIC values, choose the variable numbers corresponding to the best 10, and then put them into a formula in a linear model) but it wasn’t working. Eventually I worked out this problem, made a vector of column names and used the paste() function to produce a formula.

But this work around is stupid. In both cases – using column numbers or referring to names – the design matrix is implicit, not specified by the user. The underlying R code has to construct this design matrix. So why does it screw me around like this? Why provide two ways of referencing columns and then have one of them bork your most basic modeling function? What would Tibshirani say? Dividing data into training and validation sets is an essential part of modern machine learning principles, but R can’t handle it because – omfg – one has a different number of records to the other. Even though the matrix mathematics is in no way affected by the difference in the number of records.

That, dear reader, is called crap program design. Especially crap is the issuing of a warning that isn’t a warning at all, but an indication that the program has arbitrarily decided to produce a result different to the one you asked for! Rather than declaring a “warning,” R should say clearly “Catastrophic failure! Providing default prediction using original design matrix only!” Or better still it could say “What are you doing, Dave?” and refuse to let me out of the cargo bay doors.

The manual says the following about giving out-of-sample data to the program:

If the fit is rank-deficient, some of the columns of the design matrix will have been dropped. Prediction from such a fit only makes sense if newdata is contained in the same subspace as the original data. That cannot be checked accurately, so a warning is issued.

and

A warning will be given if the variables found are not of the same length as those in newdata if it was supplied.

This is not exactly the most informative help in the world. I don’t think that what is described in the first sentences (to the extent that I understand the English) is what happens in practice, and the last sentence is certainly misleading. In R, warnings usually indicate something you should be aware of that does not affect the numerical processing of results (see e.g. Andrew Gelman’s statement in the above blog link, “I don’t care about warning messages.” Except you should, because sometimes R issues a message called a “warning” that is actually an indicator of insouciant and catastrophic failure to enact the expected mathematical process.

Open source software: it sucks. Except that R is far and away the best software for automation in stats – until you run into crap like this. Fix it, CRAN!

—

fn1: of course the R manual makes this part just slightly more obscure than it needs to be, because hey! R is a really pithy language, so why would you use comments to make the process easier to grasp, or separate the process of generating x from the process of generating epsilon? Anyone who knows what they’re doing can figure it out, and manuals are for people who already know what they’re doing – right?

fn2: I guess I could do this more accurately by supplying a variable containing only 1s, and specifying “no constant.” That might work, I suppose. How horrid.

Here is the full code to reproduce all these results:

## Predictions
x <- rnorm(15)
y <- x + rnorm(15)
pred.v1<-predict(lm(y ~ x))
new <- data.frame(x = seq(-3, 3, 0.5))
pred.vnew<-predict(lm(y ~ x), new, se.fit = TRUE)

## now make a data set yy that contains a column for x and a column for y

yy<-cbind(y,x)

pred.m1<-predict(lm(yy[,1]~yy[,2]))

pred.mnew<-predict(lm(yy[,1]~yy[,2]),new,se.fit=TRUE)

## now give the data set names and retry it
yy2<-data.frame(yy)
names(yy2)<-c(“Outcome”,”Predictor”)
new2<-new
names(new2)<-c(“Predictor”)

pred.mnamed<-predict(lm(Outcome~Predictor,data=yy2),new2,se.fit=TRUE)

# now try a full rank matrix
new.fr<-cbind(rep(1,13),new)
pred.frnew<-predict(lm(yy[,1]~yy[,2]),new.fr,se.fit=TRUE)

x.mat<-as.numeric(lm(yy[,1]~yy[,2])$coefficients)

# try again with matrix maths
new.fr<-as.matrix(cbind(rep(1,13),new))
pred.frnew2<-new.fr %*% x.mat

Posted in Meat Life, Nerd Culture, off-topic ranting

23 responses to “Problems with Out of Sample Prediction using R”

fanguad

February 16, 2012

I don’t understand the stats part, but if your out-of-sample data contains the same number of rows as your sample data, and you use the borked method, do you get the expected results? That seems like an even more dangerous case to me: it gets the expected number of rows, so it doesn’t complain, but then goes ahead and ignores your input anyway.
faustusnotes

February 16, 2012

That would be truly disastrous, wouldn’t it? Fortunately it’s not that bad – it returns the original number of rows. Not that you would necessarily expect that of R – there are other situations where, if the data you provide to an argument isn’t sufficient for the dimensions of the matrix it is creating, it just cycles through the data again instead of inserting missing values. And in that case it gives no warning. But here, fortunately, it gives the original number of observations again, so if you are doing dimension checks you will notice quickly. I stumbled on the problem because, assuming I had the correct output (20k records for the validation data) I tried to bind it back to an identifier variable in the original data, and thus got non-conforming matrix errors.

So it at least has the saving grace of being consistently wrong on this occasion.
David

April 24, 2013

yy[,2] is a vector, not the named column of a matrix. when lm builds its model object, the resulting coefficient is for something called yy[,2]:

> lm(yy[,1]~yy[,2])

Call:
lm(formula = yy[, 1] ~ yy[, 2])

Coefficients:
(Intercept) yy[, 2]
0.1216 1.1643

As a result, predict.lm can’t associate it with variable x in data frame new.

Hope this helps.
faustusnotes

April 24, 2013

Thanks for commenting David. Don’t you think it’s a bit strange that R has this idiosyncratic distinction between vectors and matrices? To the best of my recollection matlab doesn’t do this. It seems natural that you can assume a single column of a matrix is equivalent to a vector. Am I being unreasonable in this assumption?

BTW, I think you commented at Crooked Timber to recommend RStudio? I downloaded it and it looks very shiny. Next time I use R, I will try and work with it, so thanks for the tip!
David

April 25, 2013

That was me recommending RStudio. I’m glad you like it so far. It was a relief to me when I found it.

I guess the vector/matrix distinction and type/class issues generally are just things I’ve gotten used to. The issue here is that while the columns of a matrix or data frame are themselves vectors, the name of the columns are property of the matrix, not of the vectors, and that extracting exactly one column of a matrix returns the vector and not a single column matrix. I guess R could do it differently, but there might be confusing results from that way too.

With regard to your stepwise problem, you may have been able to construct the matrix or data frame beforehand e.g. df2 <- df1[ , ], which would keep the column names, then use lm (y ~ ., df2) where . means all columns in df2 other than y. There’s also a function stepAIC in the MASS package — if you type MASS::stepAIC in RStudio and go to Code -> Go to Function Definition, you can see how they handled it. I don’t have time to look at it now unfortunately.
David

April 25, 2013

Oops, part of my reply that I put in disappeared. It should be:

df2 <- df1[ , (vector of column indices)]
faustusnotes

April 25, 2013

I would really like to be able to blame R for these complexities, David, but I have to say that part of the problem here is I just haven’t taken the time to really delve into how R handles this stuff. When I write automation in R I often have to put in exceptions for situations where the output of a process is a vector not a matrix, but the next step has to handle matrices. I think this is an automatic action on R’s part: if you extract a single column from a matrix it becomes a vector, which has a different type, but it seems to me to be unavoidable in automation that you will sometimes extract a vector from a matrix, and then have to change its type. For example if I am doing an automated model-building process (perhaps e.g. because I have a very large data set and I’m concerned with a predictive model rather than an explanatory model) then I’ll likely write code that is going to extract different numbers of columns (like stepAIC does) and it’s inevitable I’m going to sometimes have vectors and sometimes matrices. Matlab seems to handle this smoothly, and it pisses me of that R can’t. You’re right that stepAIC will provide an example of how to handle this, but because I don’t use R as the main package at my work any more, it’s unlikely that I’ll go to the trouble of finding and checking the code (especially since reading other people’s R code is like staring into the soul of Cthulhu). I think this might happen to a lot of people using R, because it’s just idiosyncratic enough and the language just dense enough that it sets barriers to continuing to work with it.

I guess these problems will never go away either: back compatibility is important in R, and you couldn’t change the way it handles vectors and matrices without potentially damaging a huge amount of historical code (including MASS, which is essential material!)
Jerzy

November 20, 2013

(1) As David said, predict.lm’s newdata argument is looking for something with the same name as the predictors you used lm. This took me a while to learn myself, and I agree the 13 vs 15 rows warning message is unhelpful for this problem. What the warning should say instead is that you fed it newdata that it just doesn’t know what to do with, so it’s going to ignore newdata and use the original data supplied to lm instead. I too am frustrated that it doesn’t.

If you do know the number of predictors you’ll be using, the simplest solution is to keep everything as dataframes before you feed it into lm or into the newdata argument of predict.lm.

(2) If you don’t know ahead of time how many arguments you will pass to lm or predict, you can use do.call(functionName, listContainingArguments). Your program can build a list of arbitrary length, and then do.call will pass it to functionName.

(3) If you want R to keep a one-column subset of a matrix as a matrix, not a vector, then you can use drop=FALSE when subsetting:

(a <- matrix(1:4,2,2))
# [,1] [,2]
# [1,] 1 3
# [2,] 2 4
a[,1]
# [1] 1 2
a[,1,drop=FALSE]
# [,1]
# [1,] 1
# [2,] 2

I suspect that R treats vectors differently from matrices because the dataframe is a list of same-length vectors, so that each column can have a different type (unlike in a matrix). So for building dataframes, it's useful to have a distinction between vectors and 1D matrices. MATLAB has nothing like dataframes AFAIK, so they have no need to distinguish vector vs matrix.
faustusnotes

November 20, 2013

Thanks for the comments Jerzy – these are very helpful. In order …
1) I understand this now but I really would have appreciated clearer information on this kind of thing, so that I didn’t have to make repeated mistakes trying to figure it out. I’m not a great programmer and I don’t use R as a first choice but I think I’m fairly competent with it and if I don’t know something this basic I think a lot of other people who use R a fair bit also wouldn’t – which makes me think there is a fair amount of published work out there that has problems based on this. Also, R is trying to make it possible to use column numbers (like a matrix package) and variable names (like a stats package) and if it’s going to do this it really would be better if the developers and the help file made clearer what the limitations are of this method, and/or made the software better able to handle both simultaneously.

2) & 3) I didn’t know about the do.call and drop=FALSE stuff, that’s extremely helpful. Currently I use a stupid little if command to check is.vector before I can proceed with my programs and I hate it. Thanks for that!

As you can see, I don’t use R because I am a programmer but because sometimes I need automated statistics. I think that R is actually really opaque to people like me, who I suspect are a majority of its actual users. People who are intimately familiar with the way the language operates and operate comfortably in it like a native package are rare, i suspect, but people who are using it to get around a challenging data manipulation problem in SAS or SPSS are very common (I use it because its automation methods are a little more accessible than those of Stata). More awareness of this user group would be helpful, I think. But if you venture onto R’s message boards, these people are treated with something very close to scorn … something that doesn’t happen as much with Stata.

One of the things I don’t like about matlab is its lack of proper dataframes, because it means you can’t treat it like a stats language. It’s useful for modelling and matrix problems but when you want to think of your data as data rather than as numbers, it is weaker. So then I like to use R… especially when i need easily accessible statistical results (matlab is often poor for this because it is not its main purpose). But I really wish that R would clean up the help and/or the implementation!
Jerzy

November 21, 2013

I agree that there’s a lot of room for improvement in the documentation and in the tone of the help mailing lists 🙂

For learning these “tricks” like drop=FALSE that aren’t obvious from the help files, here’s what I found helpful:
Teetor’s “R Cookbook” has well-explained ways to do a lot of common tasks, some of which I had been overcomplicating before.
Matloff’s “The Art of R Programming” gives just enough depth to explain *why* acts in such counterintuitive ways, without getting overly technical.
Cosma Shalizi’s course notes on Statistical Computing are also quite good.
And Stack Overflow’s [r] tag is often much more pleasant than the R help mailing lists.
Jane

March 1, 2014

Thanks so much for the details! I’m also facing with this problem now by R and got so confused.

Do you know how to include multiple regressors in “Predictor”?
Ben

August 12, 2014

This is a bit late, but a friend directed me here. Anyway, maybe I’ve just drank the R kool-aid, but I don’t see the issue here. The point of the formula in lm is to remember how you’re referring to the variables so you can substitute different variables later.

The first version worked because “predict(lm(y ~ x), data.frame(x=seq(…)))” tells R to regress y on x, and then you provided a substitute value for x via that data frame.

The second way doesn’t work because “predict(lm(yy[,1]~yy[,2]),new)” is telling lm to regress part of yy on a different part of yy. Then you don’t give it a new value of yy, so it just takes the current value from the environment, which returns to you the original (not out of sample) data.

I agree that R makes it easy to shoot yourself in the foot though, especially if you’re just learning it. But if you’re just learning you should follow basic style guidelines, for instance when using “lm(formula, data)” you should always specify data, and you should always make sure that the variables in “formula” refer to columns in “data”. Introductory tutorials should give you these basic guidelines, and I agree the official manual makes a poor tutorial—it’s more written for users who already know the basics.
faustusnotes

August 14, 2014

Hi Ben, thanks for commenting. Don’t get me wrong, I’m a big fan of R – currently I’m using it to boostrap confidence intervals for splines, so that I can generate a 95% confidence interval for an optimal value of a key epidemiological variable, and I’m doing it in R because there is absolutely no other package I can think of on this earth that can do it as quickly, simply or elegantly. It’s a thing of beauty. But it is also hideous, and this is an example of why.

The problem here is not that R has two ways of handling variables, it is that 1) it doesn’t warn you that what you asked for doesn’t work with either of its ways, 2) it produces a wrong result instead of failing, 3) it gives only a warning and 4) the warning is completely incomprehensible. It should fail at the first step and tell you – either use column numbers or variable names, but don’t mess around with both.

Another example of this weirdness from R is its idiosyncratic way of handling vectors and matrices. Why treat them as different things? That’s really frustrating – the matlab approach to matrices, vectors and lists (arrays in matlab) is much more coherent and accessible. Another example is the way that R handles difference in vector sizes when handling assignments. If the right hand side of the assignment is a different dimension to the left hand side it will fail and produce a warning – unless the right hand side is smaller than the left hand side, in which case it will just replicate it within the left hand side object, without even giving a warning. That is incredibly dangerous.

These kinds of problems make it really frustrating to work with, as do the completely meaningless error messages that give no information about what actually went wrong, and the incredibly rude and arrogant online help. You say that introductory tutorials should provide these guidelines, but I received formal training in S and that was not enough to warn me about some of these problems – no introductory tutorial can cover everything. Which is why the program needs comprehensible warnings and error messages, better runtime error handling (with more catastrophic failures to ensure careful coding) and a more accessible and better-mannered online community. Without these things, using R is frustrating and teaching it to non-statisticians is incredibly difficult!
Ben

August 15, 2014

I have to agree that R can be “hideous” in some respects. However, there is usually some kind of perverted logic behind it. Eventually most stuff will make sense 🙂

For instance, in the example you mentioned, actually R will produce an error sometimes if the assignment is too small:

> m m[1, ] <- c(1, 2)
Error in m[1, ] m m[1, ]

Why? Because one really useful case of the above is assigning a scalar. For instance, if I want to set the first row of the matrix to 0, I can do it like this:

> m m[1, ] m
[,1] [,2] [,3]
[1,] 0 0 0
[2,] NA NA NA
[3,] NA NA NA

The 0 is replicated across the first row. This is a natural use of assignment and I think most R users would agree there’s no need for a warning in this case.

About vectors and matrices, in R a matrix actually is a vector, along with some dimensional information tagged onto it. I don’t know matlab, but it wouldn’t surprise me if the Matlab implementation is easier to use. In R I can see how they came up with the scheme though: they thought “hey we just programmed vectors, which are the basic building blocks—computers internally don’t know how to handle multi-dimensional arrays. We can build matrices on top of these!”

About the online help, I agree some people on the R mailing list can be jerks, but there are a lot of helpful people too. Stackoverflow is also really good for R problems. In general I think the online help in R is really impressive—I bet there are more people sharing their R knowledge online than Matlab and Stata combined.

Anyway, not disagreeing with anything you said necessarily, just looking on the bright side 🙂
Ben

August 15, 2014

Oops, looks like a couple of lines of my code above didn’t survive posting. Anyway, in the first example I created a 3×3 matrix and tried to set the first row to c(1, 2). In the second example tried to set the first row just to 0.
Mahsa

October 24, 2014

Hi all,

I have the same problem and I don’t know how ti fix it.
I ran a k-means clustering and then split my data set into the representative clusters. C1 shows the training dataset. C2 shows the test dataset. I ran clustering in C1 and tried to fit a regression line to each cluster. I assigned a cluster membership to each case in the test dataset and and predicted my outcome (Death) for the test dataset for each cluster. I got the following warning:
In predict.lm(y[[i]], C2[[i]]) :
prediction from a rank-deficient fit may be misleading
I spend few days to figure out how to fix this but I couldn’t solve the issue. I apprecite if you could possibly help me with this 😦

Please find my code:

# Fit regression model to each cluster in C1 and predict Death for C2

y <- list()
length(y) <- k

vars <- list()
length(vars) <- k

f <- list()
length(f) <- k

for (i in 1:k) {

vars[[i]] =1) {

if (nrow(C1[[i]]) >= 100) {

f[[i]] <- as.formula(paste("Death ~", paste(vars[[i]], collapse= "+")))

y[[i]] <- lm(f[[i]], data=C1[[i]]) #probit,cloglog

C1[[i]] <- cbind(C1[[i]], fitted(y[[i]]))

C2[[i]] <- cbind(C2[[i]], predict(y[[i]], C2[[i]]))}}}
Mahsa

October 24, 2014

# Fit regression model to each cluster

y <- list() # list of clusters
length(y) <- k

vars <- list() # list of clusters
length(vars) <- k

f <- list()
length(f) <- k

for (i in 1:k) {

vars[[i]] =1) {

if (nrow(C1[[i]]) >= 100) {

f[[i]] <- as.formula(paste("Death ~", paste(vars[[i]], collapse= "+")))

y[[i]] <- lm(f[[i]], data=C1[[i]]) #probit,cloglog

C1[[i]] <- cbind(C1[[i]], fitted(y[[i]]))

C2[[i]] <- cbind(C2[[i]], predict(y[[i]], C2[[i]]))}}}
Mahsa

October 24, 2014

# Fit regression model to each cluster

y <- list() # list of clusters
length(y) <- k

vars <- list() # list of clusters
length(vars) <- k

f <- list()
length(f) <- k

for (i in 1:k) {

vars[[i]] =1) {

if (nrow(C1[[i]]) >= 100) {

f[[i]] <- as.formula(paste("Death ~", paste(vars[[i]], collapse= "+")))

y[[i]] <- lm(f[[i]], data=C1[[i]]) #probit,cloglog

C1[[i]] <- cbind(C1[[i]], fitted(y[[i]]))

C2[[i]] <- cbind(C2[[i]], predict(y[[i]], C2[[i]]))}}}
Mahsa

October 24, 2014

# Fit regression model to each cluster

y <- list() # list of clusters
length(y) <- k

vars <- list() # list of clusters
length(vars) <- k

f <- list()
length(f) <- k

for (i in 1:k) {

vars[[i]] <- names(corc[[i]][corc[[i]]!= "1"])

f[[i]] <- as.formula(paste("Death ~", paste(vars[[i]], collapse= "+")))

y[[i]] <- lm(f[[i]], data=C1[[i]]) #probit,cloglog

C1[[i]] <- cbind(C1[[i]], fitted(y[[i]]))

C2[[i]] <- cbind(C2[[i]], predict(y[[i]], C2[[i]]))}
Mahsa

October 24, 2014

Please ignore the previous codes and only consider this one:

# Fit regression model to each cluster

y <- list()
length(y) <- k

vars <- list()
length(vars) <- k

f <- list()
length(f) <- k

for (i in 1:k) {

vars[[i]] <- names(corc[[i]][corc[[i]]!= "1"])

f[[i]] <- as.formula(paste("Death ~", paste(vars[[i]], collapse= "+")))

y[[i]] <- lm(f[[i]], data=C1[[i]]) #probit,cloglog

C1[[i]] <- cbind(C1[[i]], fitted(y[[i]]))

C2[[i]] <- cbind(C2[[i]], predict(y[[i]], C2[[i]]))}
faustusnotes

October 25, 2014

Hi Mahsa, thanks for commenting. It looks like an interesting problem. Have you considered binding your C1 and C2 together, then running the regression model on the subset of the full matrix that corresponds with C1, and predicting on C2? I’m not particularly familiar with working on lists, but if you do that you will at least have the same variable names and structure for your training and test datasets. ALternatively, try rewriting the formula in terms of column numbers rather than variable names (I think this was the solution I found).

Good luck!
KELVIN TAN

September 14, 2015

is the issue solved? I am getting similar error/warning messages as what described above.
Warning messages:
1: ‘newdata’ had 331 rows but variables found have 773 rows
2: In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :
prediction from a rank-deficient fit may be misleading
faustusnotes

September 14, 2015

Hi Kelvin, thanks for commenting. I think it’s a problem with the way you define and reference data elements – a previous commenter has given the solution, I think!