I have now had quite a bit of experience working with large datasets in Stata, and consistent with my previous efforts on this blog to publicize problems with statistical software and solutions to computer problems, I thought I’d explain how I do it and why it’s a good idea to use Stata for large data. I first tackled this problem in 2008, when I was living in London and working with National Health Service (NHS) data. At the time it was a seemingly insoluble problem and there wasn’t much information out there about how to solve it; I expect things have improved since then, but just in case the information is still thin on the ground, I thought I’d write this post.
What size is “large”?
When I researched solutions to the problem of analyzing large datasets in Stata, many of the people I contacted and the websites I looked at assumed I meant data consisting of hundreds of thousands of records – a common size in statistical analysis of, e.g., schools data or pharmaceutical data. I was working with files of hundreds of millions of records, up to 30 GB in size, and back in 2008 very few people were working at that scale. Even now, it is still pretty uncommon in epidemiology and health services research. Four years of outpatient data from the NHS contain about 250 million records, and the chances are that the correct analysis for such data is a multi-level model (facility and patient being two levels) with binary outcomes. With this kind of data most health researchers make compromises and use the linear probability model, or other approximations and workarounds. Most researchers also use SAS, because SAS is the only mainstream package capable of analyzing files that don’t fit into RAM. However, it takes an enormous amount of time to run a logistic regression on 250 million records in SAS – my colleague would leave it running all day and work on a different computer while he waited for it to complete. This is not acceptable.
Why Stata?
I’m not a fascist about statistical software – I’ll use whatever I need to get the job done, and I see benefits and downsides in all of them. However, I’ve become increasingly partial to Stata since I started using it, for these reasons:
- It is much, much faster than SAS
- It is cheaper than SAS or SPSS
- Its help is vastly superior to R’s, and the online help (on message boards, etc.) is much, much politer – the R online help is a stinking pit of rude, sneering people
- R can’t be trusted, as I’ve documented before, and R is also quite demanding on system resources
- Much of the stuff that epidemiologists need is standardized in Stata first – for example, Stata leads the way on combining multilevel models and probability sampling
- Stata’s programming language, while not as powerful as R, is still very flexible and is relatively standardized
- Stata has very good graphics compared to the other packages
- SAS is absolutely terrible to work with if you need automation or recursive programming
- Stata/MP is designed to work with multi-core computers out of the box, whereas R has no built-in multi-core support, and SAS requires some kind of horrendous specialized set-up that no one with a life can understand
So, while I’ll use R for automation and challenging recursive tasks, I won’t go near it for work where I really need trustworthy results quickly, where I’m collaborating with non-statisticians, or where I need good quality output. I gave up on SAS in 2008 and won’t be going back unless I need something that only SAS can do, and I don’t think SPSS is a viable option for serious statistical analysis, though it has its uses (I could write a very glowing post on the benefits of SPSS for standardizing the analysis of probability surveys across large organizations).
The big problem with Stata is that, like R, it holds the whole dataset in memory, so you need to load the entire data file into RAM before you can do any analysis on it. This means that if you want to analyze very large datasets you need huge amounts of RAM – whereas in SPSS or SAS you can load the data piecewise and analyze accordingly. Furthermore, until Windows 7 came along it was not possible to give more than about 700 MB of RAM to any program (unless you were using Mac OS X/Unix), so you couldn’t load even medium-sized files into RAM. Sure, you could use Windows 2000 Professional or some such nightmare mutant package (which I tried to do), but it’s hell on earth to go there. Your best option was Mac OS and a huge amount of RAM.
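As an aside for anyone still on an older setup: before Stata 12 you also had to reserve memory by hand before opening a big file, whereas Stata 12 and later allocate memory automatically. A minimal sketch, with a made-up file name:

```
* Stata 11 and earlier: reserve enough memory before loading the file.
* Stata 12 and later manage memory automatically and ignore this setting.
clear all
set memory 12g              // only needed on Stata 11 or earlier
use "outpatient_2008.dta"
describe, short             // confirm how many observations and variables loaded
```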
I’m now going to show that it’s better to buy Stata and invest in 32 or 64 GB of RAM than to keep working with SAS. And I’m not going to fall back on hazy “productivity gains” to do so.
Conditions for analysis of large datasets
The core requirement for analysis of large datasets is sufficient RAM to load the entire dataset – so if you expect your basic analysis file to be 12 GB in size, you’ll need a bit more than that in RAM. If the file arrives larger than this, you’ll need a database package to access it – I use MS Access, but anything will do. If the file comes in text (e.g. .csv) format you can break it into chunks in a text editor or database package and import these into Stata sequentially, appending them together. Also, don’t be discouraged by larger file sizes before you import – Stata has very efficient data storage, and by careful manipulation of variable types you can make your data files much smaller. Also, if you are importing sequentially you can drop variables you don’t need from each chunk before appending. For example, NHS data comes with a unique ID, derived from some encryption software, that is about 32 characters long. Turn this into a numeric ID and you save most of those 32 bytes on every record – which adds up over 250 million records. Some spatial data is also repeated in the file, so you can delete it, and there’s lots of information that can be split into separate files and merged back in later if needed – in Stata it’s the work of a few seconds to merge a 16 GB file with another 16 GB file if you have sufficient RAM, whereas working with a single bloated 25 GB file in SAS will take you a day. It’s also worth noting that SAS’s minimum sizes for a lot of variable types are bloated, so you can shave 30–40% off the file size when you convert to Stata.
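To make that concrete, here’s a rough sketch of the chunked import-and-append workflow. All the file names, variable names and the chunk count are invented for illustration; the commands themselves (insheet/import, keep, egen group(), compress, append) are standard Stata.

```
* Sketch of the chunked import described above; all names are invented.
clear all
set more off

* 1. Import each chunk, keeping only the variables you actually need
forvalues i = 1/10 {
    insheet using "outpatient_chunk`i'.csv", clear   // import delimited on Stata 13+
    keep patientid facility_code admission_date outcome
    compress                                         // shrink storage types
    save "chunk`i'.dta", replace
}

* 2. Append the chunks into one file
use "chunk1.dta", clear
forvalues i = 2/10 {
    append using "chunk`i'.dta"
}

* 3. Replace the 32-character encrypted ID with a compact numeric ID
*    (done once on the combined file so the mapping is consistent)
egen long patid = group(patientid)
drop patientid
compress
save "outpatient_all.dta", replace
```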
So, loop through the chunks to build up files containing only what is relevant, compress them to their minimum sizes, and use a judiciously constructed master file of IDs as a reference against which to merge datasets of secondary information. Then buy lots of RAM. You’ll then have the dual benefits of a really, really nice computer and a fast statistical analysis package. If you were working with large datasets in SAS, you’ll have cut your analysis time from hours to seconds, increased the range of analyses you can conduct, and gained much better graphics. But how are you going to convince anyone to buy you that computer?
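And a sketch of the master-file merge, again with invented file and variable names – the master analysis file is keyed on the compact numeric ID, and secondary files (spatial information, say) are merged in only when a particular analysis needs them:

```
* Sketch of merging secondary information against the master file; names invented.
use "outpatient_all.dta", clear                        // master file, keyed on patid

* pull in the spatial information that was split out earlier
merge m:1 patid using "patient_geography.dta", keep(master match) nogenerate
compress
save "outpatient_geo.dta", replace

* on Stata 13 or later the multilevel logit mentioned earlier would then be, e.g.:
* melogit outcome age_group || facility_code:
```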
Stata and a large computer is cheaper
Obviously you should do your own cost calculations, but in general you’ll find it’s cheaper to buy Stata and a beast of a computer than to persist with SAS and a cheap computer. When I was in the UK I did the calculations, and they were fairly convincing. Using my rough memory of the figures at the time: SAS was about 1,600 pounds a year and a basic computer about 2,000 pounds every three years, for a total of roughly 6,800 pounds per three-year cycle. Stata cost about 1,500 pounds, with upgrades every 2–3 years, and a computer with 32 GB of RAM and four processors was about 3,000 pounds – roughly 4,500 pounds in total, or about 2,300 pounds less over the same three years. Even if you get a beast of an Apple workstation, at about 5,000 pounds, you’ll end up about even over the upgrade cycle. The difference in personal satisfaction and working pace is huge, however.
Conclusion
If you work with large datasets, it’s worth switching to Stata and a better computer rather than persisting with slow, clunky, inflexible systems like SAS or SPSS. If you need to interact closely with a large SQL backend then obviously these considerations don’t apply, but if your data arrive primarily as flat files in batches once or twice a year, you’ll get major productivity gains and possibly cost savings, even after buying yourself a better computer. There are very few tasks that Stata can’t handle in combination with Windows 7 or Mac OS X, so don’t hold back – make the case to your boss for the best workstation you can afford, and an upgrade to a stats package you can enjoy.
August 1, 2013 at 2:34 am
Indeed it’s a great analysis and a very practical one. I used to work with all the statistical packages, including SAS and SPSS. However, I switched to Stata around 2008 and my need to use multiple packages has dropped dramatically, thanks to Stata’s vast and robust online support and trove of information for handling large datasets. I have an HP laptop with 8 processors and 32 GB of RAM, and running any statistical analysis on big data has been a breeze.
May 13, 2014 at 4:49 pm
I agree!
October 24, 2014 at 10:21 am
Great analysis here. ” …the R online help is a stinking pit of rude, sneering people”. I laughed out loud at this. You just characterized the last two years of my life dealing with the R community. I am new to Stata, but so far am finding it far superior for my needs. As you said, it’s much quicker and the support is much better. I may have to use R occasionally, but I am hoping to use Stata as much as possible, plus I am tired of bouncing between software packages every few months. Quick question: have you come across any plugins for taking the model output from a multi-level model in Stata and scoring a population in SQL Server? I’ve been looking for information on how to do that. If my model needs to be re-run weekly or monthly, I suppose the client I am working with may buy a copy of Stata if that’s the only way. Thanks for your advice.
October 25, 2014 at 11:45 am
Thanks for commenting, Rachel M. I pity you if you had to spend two years dealing with the R community … Your quick question is a nasty one! I actually don’t know anything about taking model output from Stata in an automatically exportable format, and I think this may be one of Stata’s downsides compared to R. Is there some way Mata could be programmed to export the equation? I really doubt there is a way to make the predict functions work outside of Stata, but you might at least be able to export the model coefficients in a form that SQL Server can recognize.
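For what it’s worth, here is a rough sketch of the coefficient-export idea in plain Stata – the model and variable names are invented, and whether SQL Server can do anything useful with the result is another question entirely:

```
* Rough sketch: write the fitted coefficients to a CSV that SQL Server
* (or anything else) can read. Model and variable names are invented.
logit outcome age_group i.region

matrix b = e(b)'                 // coefficient vector, transposed to a column
local names : rownames b         // the coefficient names
clear
svmat double b                   // one observation per coefficient, in variable b1
generate str80 term = ""
local i = 1
foreach n of local names {
    quietly replace term = "`n'" in `i'
    local ++i
}
export delimited term b1 using "model_coefficients.csv", replace
* (outsheet with the comma option does the same job on older versions)
```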
I feel very bad for you if you have spent the last two years working with R, the R community and SQL. The things we do for a living …
June 21, 2015 at 8:21 pm
I have posted quite a bit of advice on working with very large Stata files at http://www.nber.org/stata/efficient, based on our experience at NBER with Medicare billing record data – hundreds of millions of records.
June 21, 2015 at 8:33 pm
Thanks for your comment, feenberg. I like this piece of text from the link:
When I first started working in stats I think it was more the opposite, or at least that computers were more expensive than software. Have you found that there are datasets at the NBER that modern Stata can’t handle?
May 4, 2017 at 11:01 am
Great thoughts here! I have a question for the author… I am a newish Stata user and have begun delving into larger datasets (20–30+ GB files, 25 million+ observations, 200 variables, etc.), and am feeling the burden of being limited to 16 GB of RAM.
I’m looking to upgrade my computer but don’t want to break the bank, so which of these options would give me the biggest gains in terms of faster data cleaning and analysis? I am a capital markets researcher and much of what I analyze can be done with traditional OLS and MLE analysis, so nothing too fancy statistically speaking.
1) Upgrade to a computer with 64 GB of RAM (Dell Precision 7720) and continue using Stata IC?
2) Continue using my current computer (2016 MacBook Pro, 4 cores, i7, 16 GB of RAM) and buy Stata MP for 4 cores?
I would love to do both, but (1) will cost me ~$2500-3200 and (2) will cost me $1000. Doing both would realistically put me at or above $4k…
May 4, 2017 at 12:21 pm
Thanks for your comment, Andrew. First I should say that if you are using 20–30 GB files your RAM will likely be a hard limit – unless it has changed in the last few years, Stata needs more RAM than the file you are opening in order to load it into memory. This is independent of the flavour of Stata as far as I know. Of course you can likely use an external package (e.g. MySQL) to skim off half the variables and reduce the file size before you open it in Stata, but if you aim to do data manipulation in Stata you will likely need big RAM.

Regarding Stata flavour, first check online for the maximum number of variables and observations Stata IC can handle – it has a limit independent of file size, and if you have more observations than the limit you simply can’t use it (I have been burnt by this before). Stata has a helpful online table if you Google “which type of Stata should I use”.

Next, regarding MP: the main benefit of MP is the ability to use the full multi-core architecture, but this benefit is not actually so great – the maximum speed increase of four cores over one is a four-fold improvement, but this isn’t always observed, and for some processes (e.g. certain types of multilevel model) there is almost no benefit to more cores. Stata has performance diagnostics online so you can check this too. My memory is that OLS gets the full multiplicative benefit of more cores and so do most basic MLE techniques, but you will likely see less improvement for more exotic models (e.g. GLLAMM). You can check on the Stata website.

My guess is your best option is simply to get more RAM and stick with Stata IC. I doubt MP is worth forking out for, given IC will handle OLS models at reasonably good speed even with lots of observations, and the productivity gains for these types of model aren’t worth the money. But the key is to check the observation limit of IC, the likely RAM limits you will hit, and the speed improvements for your particular model type on the Stata website. I hope that helps (and sorry for the lack of links – I’m writing this on a phone).
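One base-Stata trick worth adding to the above, since you mention OLS on big flat files: you can inspect a .dta file and load only a subset of its variables without ever reading the whole thing into RAM, which takes some pressure off the memory limit. A quick sketch with invented file and variable names:

```
* Check the size of a .dta file without loading it
describe using "trades_2016.dta", short

* Load only the variables you need, then shrink the storage types
use permno date ret mktcap using "trades_2016.dta", clear
compress

* Time a simple OLS run to see whether speed is actually a problem
timer clear 1
timer on 1
regress ret mktcap
timer off 1
timer list 1
```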