I’m working on a couple of complex multi-level models at the moment, using Stata, and I’ve run into some big problems getting them to work. In the spirit of helping others to solve problems I’ve spent a lot of time attacking, in this post I’m going to describe the problems I encountered and my solutions. In searching the internet for these problems I haven’t found many solutions, so hopefully other people can approach the problems more easily with the help of this post.

The two main problems I have encountered are the “initial values not feasible” error, and the sheer time it takes to run the multi-level models. Solving the first problem raises another interesting and related problem that I think is a Stata bug, and that I have found a workaround for. Solving the second one might be related to the first, but largely involves exploiting some properties of binomial or Poisson distributions. I have also encountered a third problem, in which different multi-level modeling functions give different results, but I haven’t found a solution to that one. I will mention it here for completeness.

Obviously if you aren’t a statistician, everything that goes on from here down will seem like satanism, and you should probably stop reading.

The models

I am currently working on three unrelated and complex multi-level models, all of which use binomial- or Poisson-distributed responses and all of which have very large data sets. I’m not going to give details about the research projects, but the three models are listed here.

  1. An analysis of Demographic and Health Survey (DHS) data from multiple countries and years, with a binomial outcome and about 800,000 records. This model should have individuals clustered within households clustered within clusters within years within countries, but currently I’m not using any household level variables and I only have 1-4 years per country (mostly 1) so I’m ignoring the serial dependence in years. Even so, running a simple model in xtmelogit takes a fairly long time (perhaps 8 hours) on a 12-core PowerPC with Stata/MP 12 and 64Gb of RAM. Because I’m combining surveys across countries and years, I don’t think survey weights are valid, so I’m ignoring probability sampling effects (thank god!)
  2. An analysis of DHS data from a single country with more than a million records, with a binomial outcome. This model has individuals within households clustered within regions within states, so should be a four-level model. It also has a lot of variables (it is examining socioeconomic determinants of health outcomes). Running a single model in xtmelogit with the Laplace approximation (the fastest estimation option) on an 8-core Dell Precision platform with 32Gb of RAM takes … well, days, mostly. And currently I’m ignoring one level (region, I think) for simplicity.
  3. An analysis of NHS hospital admission data with about 5 million records, Poisson outcomes, and only two levels: individuals within regions. Fortunately it only has a couple of variables, though it has a massive interaction term, so it runs in a couple of hours. I should probably have crossed levels (hospital X region) but I think this would fry anything it touched (as well as my brain) and probably be unidentifiable (because of lots of regions sending single individuals to big hospitals). This is also running on the 8-core Dell.

I’ve tried running model 2 in GLLAMM, but models 1 and 3 so far have used only xtmelogit and xtpoisson respectively. Obviously I don’t have infinite time for trouble-shooting, since each model takes at least one night to run; and I can’t run tests on random sub-samples, because the problems often only occur in very large data sets, or they appear too often in poorly-chosen random sub-samples. In all the models I also have to fiddle with categorical variables to ensure that some combinations of predictors don’t become too rare (and thus non-identifiable) in some cluster/country/time combinations. Also, as a final note, the hierarchical structure is essential: without it the results are completely different, and/or some key variables end up being estimated at the regional level rather than the individual level.

Problem 1: Initial values not feasible

The first problem I have encountered is that xtmelogit crashes as soon as it begins initial parameter selection, and returns the following error:

initial values not feasible

This is really frustrating, and fairly unusual in statistical software: usually the starting values don’t matter that much. As an initial solution to this problem I tried using simplified models just to see if I could get it running, but I found I had to simplify them so much they became meaningless. I dug around on the internet and eventually found this kind of solution, which advocates using the -from()- option to load up your own initial values. The approach suggested there is to run a non-hierarchical model, extract the coefficients from that, and feed them into xtmelogit using the -from()- option. If you have k levels, Stata will be expecting k additional parameters, but apparently it can handle this automatically, so you just give it the output from logit and off it goes.

Unfortunately it doesn’t work. For example, if I run this code:

xi: logit outcome i.year i.education i.wealthIndex
mat a=e(b)
xi: xtmelogit outcome i.year i.education i.wealthIndex || country: || cluster:,laplace from(a)

I get the following error:

extra parameter outcome:_Iyear_2001 found
specify skip option if necessary

and again the model fails. I’ve never seen this error before. It basically stops me from adding the coefficients from logit into the xtmelogit command. The frustrating thing though is that if you display the matrices of coefficients from the two models, they are ordered the same way and have the same labels for variables, so there is no reason why Stata should find an “extra” parameter. What is happening? I searched for this online, and all I found was this useless advice, or (after much digging) a link I can’t find again, which suggested that the model might be “non-identifiable.” This problem is not arising through non-identifiability or complexity: I know this because if I can find the right starting values (see below) I can get the model to run just fine. Furthermore, this error doesn’t sound like a mathematical problem. It sounds like a programming problem.

In fact, the solution to this problem is remarkably simple: you simply add the “copy” option to the -from()- option. Without it, the -from()- option tells xtmelogit to insert values from the matrix of coefficients from logit into their corresponding places in the xtmelogit estimation by matching on the parameter names. When it runs out of names it is then supposed to make up additional starting values using defaults. This doesn’t work here, because for some reason xtmelogit doesn’t understand the matrix output from logit. If you use the copy option, however, xtmelogit inserts the coefficients from the matrix based only on their position in the matrix. This means you need to supply the k extra starting values for the error terms of the random effects yourself, but otherwise you’re good to go. You can supply these by guessing them, giving zeros (I don’t know if this is a good idea!), or by running an intercept-only hierarchical model and taking them from the results of that. The full code (with me supplying zeros in this case) is then:

xi: logit outcome i.year i.education i.wealthIndex
mat a=e(b)
* two extra zeros: starting values for the two random-effect parameters (country and cluster)
mat a1=(a,0,0)
xi: xtmelogit outcome i.year i.education i.wealthIndex || country: || cluster:, laplace from(a1, copy)

Do this and you won’t run into the extra parameter problem. But note that supplying starting values from logit isn’t always a good idea – they can be radically different to the true final coefficients, even differing in sign, and they can again lead to the initial values not feasible problem.

In this case the only solution I could find was to run a much simpler multi-level model and then extract the values from that. In fact, I found an intercept-only model was sufficient to provide functioning starting parameters for the full xtmelogit. So if you can’t get your logit starting values to work, try just running a simple intercept-only model with all the necessary levels, and supplying those starting values to xtmelogit.
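
For what it’s worth, here is a minimal sketch of how those starting values can be assembled, reusing the variable names from the earlier examples. The matrix manipulation is just one way of pulling the random-effect parameters out of the intercept-only fit and gluing them onto the logit coefficients, so treat it as illustrative rather than canonical:

* fixed-effect starting values from logit, random-effect starting values
* from an intercept-only hierarchical model (variable names as above)
xi: logit outcome i.year i.education i.wealthIndex
mat b_fixed = e(b)
xtmelogit outcome || country: || cluster:, laplace
mat b_null = e(b)
* the last two entries of b_null are the random-effect parameters
local k2 = colsof(b_null)
local k1 = `k2' - 1
mat b_re = (b_null[1, `k1'], b_null[1, `k2'])
mat b_start = (b_fixed, b_re)
xi: xtmelogit outcome i.year i.education i.wealthIndex || country: || cluster:, laplace from(b_start, copy)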

This bothers me for two reasons: first of all, the extra parameter error is obviously a programming error; and secondly, if supplying the results of an intercept-only model is enough to make the full model run, this suggests pretty extreme sensitivity to initial values. Is there maybe a more stable starting process for these maximization algorithms? It takes humans days to select good starting values, and if you don’t stumble on them immediately you have to do a search through a range of possibilities – isn’t it faster for the computer to do this? What starting values is it using?

Problem 2: Optimization time is just too long

As I mentioned in the introduction, these models can take a long time to run – between 6 hours and a couple of days depending on the model. I had hoped that finding new and better initial values would solve this problem at least partially, but it doesn’t help much, and the Stata manual admits that spending a long time looking for good initial values will have little impact on the overall estimation time. So what to do? The number of levels is a huge determinant of the time it takes to run (processor time depends on a factor of 2^k, I think), but if you can’t drop your levels, you’re in big trouble.

Fortunately, in some cases you can use a simple workaround. Because xtmelogit works on binomial data you can reduce the size of the dataset by calculating summary data at the lowest level: you collapse the data at this level into a data set of events and trials. Not all Stata logistic regression procedures accept the events/trials framework, but xtmelogit does. If you’re dealing with, e.g., a school with classrooms, each classroom will have only two ages and two sexes, so you may be able to reduce each classroom to just 4 records, containing the count of the number of students in each age/sex combination and the number of events. I tried this with model 1, and managed to reduce the data set to about 100,000 records and the processor time by a factor of about 8 or maybe more, and got exactly the same results. Of course, if you have a household level above the individual, this method will be largely impossible, but if you are working with larger low-level clusters it will work a treat. Note also that it doesn’t work where you have a genuinely continuous variable, or a lot of diversity in predictors. But it’s worth trying if you have a lot of reasonably-sized clusters, especially if you are hoping to get a more accurate estimate than the Laplace method provides.
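
To make this concrete, here is a minimal sketch of the collapse approach, reusing the variable names from model 1 above; I’m assuming all the predictors are categorical (so the collapse loses no information) and using xtmelogit’s binomial() option to declare the number of trials in each cell:

* collapse individual records into cells of events and trials, where a cell is
* defined by the cluster identifiers and every categorical predictor
collapse (sum) events=outcome (count) trials=outcome, by(country cluster year education wealthIndex)
* refit the same model on the much smaller events/trials dataset
xi: xtmelogit events i.year i.education i.wealthIndex || country: || cluster:, binomial(trials) laplace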

Problem 3: Different results in GLLAMM and xtmelogit

I’ve noticed as well that in some cases GLLAMM and xtmelogit produce remarkably different results for the same model. On about page 450 of Rabe-Hesketh’s text she mentions this problem but puts it down to choice of integration points: it appears to me that this isn’t the whole story. The Stata list seems to also think this. I have yet to work out the details of this, so will come back to it when I have a better idea. Right now I’m suspicious that GLLAMM and xtmelogit are doing … well, not quite the same thing.

A note on software comparisons

Note that this problem doesn’t just exist in Stata. I found PROC GLIMMIX in SAS to be horribly unstable, and apparently glmer in R uses the Laplace approximation as its default, and doesn’t allow anything else where there are more than two levels! Multi-level modeling with non-normal responses is one of those situations where you really have to be aware of the underlying engineering of the software, and cautious about any results until you have checked every aspect of the defaults. This can be very dangerous if you’re under pressure to produce results and each model is taking a couple of days. Indeed, Rabe-Hesketh recommends trying several different estimation settings (primarily different numbers of integration points) and checking that the estimates are stable before presenting a final model. That’s really pushing your luck if one model takes a couple of days to run and you have to go through 12 different integration point choices. I wonder if there are any methods for identifying likely optimization difficulty from the data itself, without having to run extremely time-consuming guesstimates?

Probably not.
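
For what it’s worth, the brute-force stability check that Rabe-Hesketh recommends looks something like the sketch below (reusing the variable names from model 1, and assuming you can spare the compute): refit the same model with increasing numbers of integration points and compare the estimates side by side.

* refit with 3, 5 and 7 integration points and compare the estimates
foreach ip in 3 5 7 {
    xi: xtmelogit outcome i.year i.education i.wealthIndex || country: || cluster:, intpoints(`ip')
    estimates store ip`ip'
}
estimates table ip3 ip5 ip7, se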

Conclusion

There are a couple of solutions to the initial value problem, and they don’t just involve reducing the number of variables. One of these, choosing initial values from a standard generalized linear model, produces an error that appears to be based entirely on a programming bug, but you can work around it. You can also work around the time constraints imposed by big multi-level models by collapsing data rather than removing or simplifying variables. But be careful about the choice of integration points, and the possibility that GLLAMM would give you a different result …

Good luck!


It’s all Greek to you, isn’t it?

I received a very interesting hospital dataset recently, in excel format and containing some basic variable names and values in Japanese. These included the sex of the patient, the specialty under which they were admitted to hospital, and all the variable names. In principle this would be reasonably easy to convert to English in excel before import, but it would require making a pivot table and fiddling a bit (my excel-fu is a bit rusty), and I also have address data which, though not important at this stage, may become so in the future. So, at some point, I’m going to have to import this data in its Japanese form, and I figured I should work out how to do it.

The problem is that a straight import of the data leads to garbled characters, completely illegible, and very little information appears to be available online about how to import Japanese-labeled data into Stata. A 2010 entry on the statalist suggests it is impossible:

Unfortunately Stata does not support Unicode and does not support other multi-byte character sets, such as those necessary for Far Eastern Languages. If you are working with a data set in which all of the strings are in a language that can be represented by single byte characters (all European languages) just choose the appropriate output encoding. However, if your dataset contains strings in Far Eastern langages or multiple languages that use different character sets, you will simply not be able to properly represent all of the strings and will need to live with underscores in your data.

This is more than a little unfortunate but it’s also not entirely correct: I know that my students with Japanese operating systems can import Stata data quite easily. So I figured there must be something basic going wrong with my computer that was stopping it from doing a simple import. In the spirit of sharing solutions to problems that I find with computers and stats software, here are some solutions to the problem of importing far Eastern languages for two different operating systems (Windows and Mac OS X), with a few warnings and potential bugs or problems I haven’t yet found a solution for.

Case 1: Japanese language, Windows OS

In this case there should be no challenge importing the data. I tried it on my student’s computer: you just import the data any old how, whether it’s in .csv or excel format. Then in your preferences, set the font for the data viewer and the results window to be any of the Japanese-language OS defaults: MS Mincho or Osaka, for example.

This doesn’t work if you’re in an English-language Windows, as far as I know, and it doesn’t work in Mac OS X (this I definitely know). In the latter case you are simply not able to choose the Japanese native fonts – Stata doesn’t use them. No matter what font you choose, the data will show up as gobbledygook. There is a solution for Mac OS X, however (see below).

Case 2: English language, Windows OS

This case is fiddly, but it has been solved and the solution can be found online through the helpful auspices of the igo, programming and economics blogger Shinobi. His or her solution only popped up when I did a search in Japanese, so I’m guessing that it isn’t readily available to the English-language Stata community. I’m also guessing that Shinobi solved the problem on an English-language OS, since it’s not relevant on a Japanese-language OS. Shinobi’s blog post has an English translation at the bottom (very helpful) and extends the solution to Chinese characters. The details are on Shinobi’s blog, but basically what you do is check your .csv file to see how it is encoded, then use a very nifty piece of software called iconv to convert the .csv file from its current encoding to one that can be read by Stata: in the example Shinobi gives (for Chinese) it is GB18030 encoding, but I think for Japanese Stata can read Shift-JIS (I found this explained somewhere online a few days ago but have lost the link).

Encoding is one of those weird things that most people who use computers (me included!) have never had to pay attention to, but it’s important in this case. Basically there are different ways to map the characters of Far Eastern languages onto underlying byte values (this is the encoding), and although excel and most text editors recognize many of them, Stata only recognizes one. So if you have a .csv file that is a basic export from, say, excel, it’s likely in an encoding that Stata doesn’t recognize on an English-language OS. Change the encoding of the file, and Stata should then recognize it.

Working out what encoding your .csv file is currently in can be fiddly, but basically if you open it in a text editor you should be able to access the preferences of the editor and find out what the encoding is; then you can use iconv to convert to a new one (see the commands for iconv in Shinobi’s blog).
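
For reference, here is roughly what the conversion looks like when driven from a Stata do-file, assuming iconv is installed and on your path, and that the original .csv is UTF-8 encoded (the file names are hypothetical, and the -f argument should be whatever encoding your text editor reports):

* convert the .csv to Shift-JIS with iconv, then read it into Stata
!iconv -f UTF-8 -t SHIFT_JIS hospital_utf8.csv > hospital_sjis.csv
insheet using "hospital_sjis.csv", comma clear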

Unfortunately this doesn’t work on Mac OS X: I know this, because I tried extensively. Mac OS X has iconv built in, so you can just open a terminal and run it. BUT, no matter how you change the encoding, Stata won’t read the resulting text file. You can easily interpret Shinobi’s solution for use on Mac but it won’t work. This may be because the native encoding of .csv files on Mac is unclear to the iconv software (there is a default “Mac” encoding that is hyper dodgy). However, given the simplicity of the solution I found for Mac (below), it seems more likely that the problem is something deep inside the way Stata and the OS interact.

Case 3: English-language, Mac OS X

This is, of course, something of a false case: there is no such thing as a single-language Mac OS X. Realizing this, and seeing that the task was trivial on a Japanese-language Windows but really fiddly on an English-language Windows, it occurred to me to just change the language of my OS (one of the reasons I use Apple is that I can do this). So, I used the language preferences to change the OS language to Japanese, and then imported the .csv file. Result? Stata could instantly read the Japanese. Then I just switched my OS back to English when I was done with Stata. This is a tiny bit fiddly in the sense that whenever you want to work on this file you have to switch OS languages, but doing so on Apple is really trivial – maybe 3 or 4 clicks.

When you do this though, if you aren’t actually able to read Japanese, you’ll be stuffed trying to get back. So, before you do this, make sure you change your system settings so that the language options are visible on the task bar (you will see a little flag corresponding to your default locale appear next to the date and time). Then, make sure you know the sequence of clicks to get back to the regional language settings (it’s the bottom option of the language options menu in your taskbar, then the left-most tab inside that setting). That way you can change back easily. Note also that you don’t, strictly speaking, have to change the actual characters on the screen into Japanese! This is because when you select to change your default OS language, a little window pops up saying that the change will apply to the OS next time you log in but will apply to individual programs next time you open them. So you can probably change the OS, open Stata, fiddle about, close Stata, then change the OS back to English, and so long as you don’t log out/restart, you should never see a single Japanese-language menu! Weird, and kind of trivial solution!

A final weird excel problem

Having used this trick in Mac OS X, I thought to try importing the data from its original excel format, rather than from the intermediate .csv file. To my surprise, this didn’t work! In programming terms, running insheet to import the .csv file translates the Japanese perfectly, but running import excel on the excel file fails to translate it properly! So either there is something inaccessible about excel’s encoding, or the import excel routine is broken in Stata. I don’t know which, but it does mean that if you receive a Japanese-language excel file and you’re using Mac OS X, you will need to export to .csv before you import to Stata. This is no big deal: before Stata 12, there was no direct excel import method for Stata anyway.
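
To be concrete, with the OS language switched to Japanese the first of these two commands gives me legible Japanese and the second doesn’t (the file names are hypothetical):

* works: read the intermediate .csv with insheet
insheet using "hospital_data.csv", comma clear
* garbles the Japanese on my machine: read the original excel file directly
import excel using "hospital_data.xlsx", firstrow clear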

A few final gripes

As a final aside, I take this as a sign that Stata need to really improve their support for Asian languages, and they also need to improve the way they handle excel. Given excel’s importance in the modern workplace, I think it would be a very good idea if Microsoft did more to make it fully open to other developers. It’s the default data transfer mechanism for people who are unfamiliar with databases and statistical software and it is absolutely essential that statisticians be able to work with it, whatever their opinions of its particular foibles or of the ethics of Microsoft. It also has better advanced programming and data manipulation properties than, say, OpenOffice, and this makes it all the more important that it match closely to standards that can be used across platforms. Excel has become a ubiquitous workplace tool, the numerical equivalent of a staple, and just as any company’s staplers can work with any other company’s staples if the standards match, so excel needs to be recognized as a public good, and made more open to developers at other companies. If that were the case I don’t think Stata would be struggling with Asian-language excel files but dealing fine with Asian-language .csv files.

And finally, I think this may also mean that both Apple and Microsoft need to drop their proprietary encoding systems and use an agreed, open standard. Microsoft also needs to grow up and offer support for multiple languages in all versions of Windows, not just the most expensive one.

Lastly, I hope this post helps someone out there with a Japanese-language import (or offers a way to import any other language that has a more extensive encoding than English).

I’ve been burnt twice in the past two weeks by a strange graphics handling problem in Stata and Microsoft Office. In the spirit of presenting workarounds and warnings for obscure software problems that I stumble upon, I think I should report it here.

The basic problem is simple and very nasty: charts produced in Stata and then exported into Microsoft proprietary formats don’t work properly across platforms and, possibly, across machines. The way in which they fail is insidious, as well, because it looks as if the operator has made an error: axis titles disappear, or parts of the graph are shaved off so that the graph doesn’t match the description one has written in text. Worse still, the person who originally made the chart can’t see the error, and it doesn’t appear in printouts from the afflicted person’s work. This means that you can’t easily convince the person at the other end that you’ve not done something wrong.

As an example, consider this insidious cock-up from this week. One of my students sent a draft paper to a colleague last week, and he sent it back with the cryptic message “fix the charts.” We didn’t know what he wanted changed, so we changed a few things and sent them back. This week we received an angry reply, demanding that we fix the charts and asking, specifically, why we had forgotten the y-axis labels. The day that we fixed the charts we were working on printouts, because we were rushing, and the y-axes were in the printouts – I had a distinct memory of correcting some text in the y-axes. So I asked my student to mail me the last version he’d sent to the colleague, thinking he’d stuffed up, and indeed I couldn’t see the y-axes in the charts. I asked him why he’d removed them after I painstakingly corrected them, and he told me he hadn’t, and that he could see them – but by now he was overseas and I couldn’t check in person. So I forwarded the document to my partner, who works on a PC, and she could see them. What was going on? My colleague and I, on Macs, couldn’t see the y-axes, but my student and my partner, on PCs, could. Weird.

I asked my student how he had put the graphs into Word, and he told me he had copied the figures directly from Stata and pasted them into Word, essentially following instructions that can be found all over the web (for example here) and also, I think, in the Stata help. I did some digging and discovered that when you do this, the file is converted automatically by Office into a new format – possibly .wmf? – and this can’t handle all of Stata’s graphics rendering; this leads to approximations in the encoding of some aspects of the graph. Mac graphics are handled in a different format – possibly .eps? – and the badly rendered parts of .wmf files are simply ignored when it opens them. One of the main things that the .wmf rendering stuffs up is rotated text – such as one finds in y-axis titles. When I realized this, I asked my student to redo the files by saving as .png, and everything was fine. The .png files looked hideous though, so we redid them in .tiff format, but we could at least see the details of the axis labels now.
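
The safer workflow is to export the graph to a file from inside Stata and then insert that file into the document, rather than copy-pasting. A minimal sketch, with hypothetical file names, is below; the width() option only applies to bitmap formats such as .png and .tif:

* export the current graph directly from Stata instead of copy-pasting
graph export "figure1.tif", replace width(3000)
graph export "figure1.eps", replace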

I’m not sure, however, that it’s just a platform issue. A few weeks ago I had a strange graphing problem with a journal, who mailed me to say that my text and the histograms I had provided didn’t match – specifically, parts of the range of values I had referred to in the text weren’t appearing in the histogram. I couldn’t understand this, because I could see the histograms clearly. I thought perhaps they were just being a bit weird, so I sent them hi-res images with an explanation, and they were fine. The original file had charts in it as .png files – I had included them as .png because they are low-res files, easy to produce, and a lot of journals like to receive low-res files until the production stage. But the hi-res files I sent were in .tif format. In light of what happened this week, I think that the same problem my student had also arose with the .png files in that article. I don’t know what platform the journal production staff were using, but I made the .png files on a Mac. So it’s possible that the problem also arises in reverse using .png files, or it’s possible that it occurs across machines as well as platforms.

The problem with this issue is that it is insidious, and when one works across email it’s impossible to work out what is happening. It also leads to questions about professionalism – leaving out y-axis labels is pretty shoddy undergraduate stuff – and those questions are exactly the kinds of issues that people try to blame on technical problems. It also creates conflict, because if you are repeatedly sending graphs that don’t work to a colleague (or a journal!) they start to get pissed. As do you, because you start to think they’re behaving like dickheads. The worst possibility is that, if everyone in your institution is working in Word, and the peer reviewers are, but the production staff at the journal are working on Macs, they may produce a final published version of your article that has no axis titles. Anyone reading that will think you are incompetent, when in fact it was purely a technical problem.

The simple solution to this is:

  • never copy and paste graphs from Stata into Word (this also reduces the risk of losing resolution)
  • don’t use .png or .wmf exports
  • only work with .tif or .eps files
  • if you get into a weird situation where you’re sure that you supplied the right file, don’t assume the other person is doing something wrong – check what platform they’re using and try sending a file in a different format

Preparing charts for journals can be a real hassle, and journals can be simultaneously picky about their figures and singularly unhelpful in advising non-experts on how to prepare them. This kind of cross-platform (and cross-format) silliness is really unhelpful in the production process, and it’s extremely difficult to find definitive information about it on the web. These problems don’t just arise from copy-paste laziness either, and understanding the details requires delving into the world of graphics rendering – a world that many people who work with stats and scientific data don’t know much about (nor should we have to). Stata and Microsoft and Apple all seem to be fairly silent on the issue, too. So be aware of it, and be ready to defend your work on technical grounds when colleagues or journals seem to be talking about a graph or figure that you’re sure has no resemblance to the one you sent them.

And if you’re reading this, Bill Gates – hurry up and move to a non-proprietary graphics handling format!

I have now had quite a bit of experience working with large datasets in Stata, and consistent with my previous efforts on this blog to publicize problems with statistical software and solutions to computer problems, I thought I’d explain how I do it and why it’s a good idea to use Stata for large data. I approached this problem in 2008, when I was living in London and working with National Health Service (NHS) data. At that time it was a seemingly insoluble problem and there wasn’t much information out there about how to solve it; I guess things have improved since then, but just in case the information is still thin on the ground, I thought I’d write this post.

What size is “large”?

When I researched solutions to the problem of analyzing large datasets in Stata, many of the people I contacted and the websites I looked at thought I meant data consisting of hundreds of thousands of records – a common size in statistical analysis of, e.g., schools data or pharmaceutical data. I was working with files of hundreds of millions of records, up to 30Gb in size, and back in 2008 very few people were working with data of this size. Even now, this is still pretty uncommon in epidemiology and health services research. Four years of outpatient data from the NHS will contain about 250 million records, and the chances are that the correct analysis for such data is a multi-level model (facility and patient being two levels) with binary outcomes. With this kind of data most health researchers make compromises and use the linear probability model, or other approximations and workarounds. Most researchers also use SAS, because SAS is the only software package capable of analyzing files that don’t fit into RAM. However, it takes an enormous amount of time to do a logistic regression on 250 million records with SAS – my colleague would leave it running all day, and work on a different computer while he waited for it to complete. This is not acceptable.

Why Stata?

I’m not a fascist about statistical software – I’ll use anything I need to to get the job done, and I see benefits and downsides in all of them. However, I’ve become increasingly partial to Stata since I started using it, for these reasons:

  • It is much, much faster than SAS
  • It is cheaper than SAS or SPSS
  • Its help is vastly superior to R, and the online help (on message boards, etc) is much, much politer – the R online help is a stinking pit of rude, sneering people
  • R can’t be trusted, as I’ve documented before, and R is also quite demanding on system resources
  • Much of the stuff that epidemiologists need is standardized in Stata first – for example, Stata leads the way on combining multilevel models and probability sampling
  • Stata’s programming language, while not as powerful as R, is still very flexible and is relatively standardized
  • Stata has very good graphics compared to the other packages
  • SAS is absolutely terrible to work with if you need automation or recursive programming
  • Stata/MP is designed to work with multi-core computers out of the box, whereas R has no support for modern chips, and SAS requires some kind of horrendous specialized set up that no one with a life can understand

So, while I’ll use R for automation and challenging, recursive tasks, I won’t go near it for work where I really need to get trustworthy results quickly, where I’m collaborating with non-statisticians, or where I need good quality output. I gave up on SAS in 2008 and won’t be going back unless I need something that only SAS can do, and I don’t think SPSS is a viable option for serious statistical analysis, though it has its uses (I could write a very glowing post on the benefits of SPSS for standardizing the analysis of probability surveys across large organizations).

The big problem with Stata is that, like R, it holds the whole dataset in memory, so you need to load the entire data file into RAM in order to be able to do any analysis on it. This means that if you want to analyze very large data sets, you need huge amounts of RAM – whereas in SPSS or SAS you can load the data piecewise and analyze it accordingly. Furthermore, until Windows 7 came along it was not possible to give more than 700Mb of RAM to any program (unless you were using Mac OS X/Unix), so you couldn’t load even medium-sized files into RAM. Sure, you could use Windows Professional 2000 or some such nightmare mutant package (which I tried to do), but it’s hell on earth to go there. Your best option was Mac OS and a huge amount of RAM.
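
As a quick illustration of how much headroom careful storage gives you, this is the check I run once a file is loaded (it assumes a dataset is already in memory): -describe, short- reports the size of the data in memory, and -compress- shrinks each variable to the smallest storage type that holds its values.

* report the size of the data in memory, shrink storage types, and report again
describe, short
compress
describe, short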

I’m going to now prove that it’s better to buy Stata and invest in 32 or 64 Gb of RAM, than to keep working with SAS. And I’m not going to fall back on hazy “productivity gains” to do so.

Conditions for analysis of large datasets

The core condition for analysis of large datasets is sufficient RAM to load the entire dataset – so if you expect your basic analysis file to be 12Gb in size, you’ll need a bit more than that in RAM. If the file arrives in a size larger than this, you’ll need a database package to access it – I use MS Access, but anything will do. If the file comes in text (e.g. .csv) format, you can break it into chunks in a text editor or database package and import these into Stata sequentially, appending them together. Also, don’t be discouraged by larger file sizes before you import – Stata has very efficient data storage, and by careful manipulation of variable types you can make your data files much smaller. If you are importing sequentially you can also drop variables you don’t need from each chunk before appending. For example, if you receive NHS data there will be a unique id derived from some encryption software that is about 32 characters long. Turn this into an integer and you save yourself about 16 bytes per record – and this adds up over 250 million records. Some spatial data is also repeated in the file, so you can delete it, and there’s lots of information that can be split into separate files and merged in later if needed – in Stata it’s the work of a few seconds to merge a 16Gb file with another 16Gb file if you have sufficient RAM, whereas working with a single bloated 25Gb file in SAS will take you a day. It’s also worth noting that SAS’s minimum sizes for a lot of variable types are bloated, and you can shave 30-40% off the file size when you convert to Stata.
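
To make that workflow concrete, here is a rough sketch with hypothetical file and variable names; it assumes the raw extract has already been split into four .csv chunks (with a text editor or database package), that only three variables are needed, and that secondary region-level information has been saved in a separate file (region_info.dta) keyed on region:

* build the analysis file chunk by chunk, keeping only what is needed
insheet using "extract_chunk1.csv", comma clear
keep patientid region outcome
compress
save analysis, replace
forvalues i = 2/4 {
    insheet using "extract_chunk`i'.csv", comma clear
    keep patientid region outcome
    compress
    append using analysis
    save analysis, replace
}
* replace the long encrypted id with a compact numeric id
egen long id = group(patientid)
drop patientid
compress
save analysis, replace
* merge in region-level secondary information only when it is needed
merge m:1 region using region_info, keep(match master) nogenerate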

So, loop through chunks to build up files containing only what is relevant, compress them to minimum sizes, and use a judiciously constructed master file of IDs as a reference file against which to merge data sets with secondary information. Then, buy lots of RAM. You’ll then have the dual benefits of a really, really nice computer and a fast statistical analysis package. If you were working with large datasets in SAS, you’ll have cut your analysis time from hours to seconds, increased the range of analyses you can conduct, and got yourself improved graphics. But how are you going to convince anyone to buy you that computer?

Stata and a large computer is cheaper

Obviously you should do your own cost calculations, but in general you’ll find it’s cheaper to buy Stata and a beast of a computer than to persist with SAS and a cheap computer. When I was in the UK I did the calculations, and they were fairly convincing. Using my rough memory of the figures at the time: SAS was about 1600 pounds a year, and a basic computer about 2000 pounds every three years, for a total cost of 6800 pounds every three years. Stata costs about 1500 pounds, with upgrades about every 2-3 years, and a computer with 32Gb of RAM and 4 processors was about 3000 pounds. So your total costs over 3 years are about 2300 pounds less. Even if you get a beast of an Apple workstation, at about 5000 pounds, you’ll end up about even on the upgrade cycle. The difference in personal satisfaction and working pace is huge, however.

Conclusion

If you work with large datasets, it’s worth your while to switch to Stata and a better computer rather than persist with slow, clunky, inflexible systems like SAS or SPSS. If you need to continue to interact closely with a large SQL backend then obviously these considerations don’t apply, but if your data importation and manipulation needs are primarily flat files that you receive in batches once or twice a year, you’ll get major productivity gains and possibly cost savings, even though you’ve bought yourself a better computer. There are very few tasks that Stata can’t handle in combination with Windows 7 or Mac OS X, so don’t hold back – make the case to your boss for the best workstation you can afford, and an upgrade to a stats package you can enjoy.