Every year I have to teach a basic statistics class to new Master’s students, and every year I find my students come from very diverse mathematical and scientific backgrounds, without necessarily having the grounding needed to grasp a classical statistics course. I have one year to polish these students up to a level where they can complete a fairly demanding research thesis in their second year, and I also have to get them to understand the fundamental principles of statistics so that when they move on from my department they don’t embarrass themselves or others. Of course, I started teaching statistics to these students from the framework in which I learnt it, but I soon realized that the concepts just weren’t sticking – not just because the students didn’t understand some of the maths, but because translating ideas from mathematical notation into solid concepts is tough for people who know some maths but don’t have a really strong background in it. It’s like learning something in a second language – you can’t think about the language and grasp the concepts at the same time. But a lot of statistics ends up being done on computers, and in practice people don’t need to know the maths as much as they need a good grasp of the concepts.

In addition, I noticed that a lot of what I was teaching based on my classical experience of learning stats in the 1990s was basically deadweight, and some of this deadweight was tough to grasp. So I started thinking about changing the way that I taught the principles: to move away from unnecessary mathematics, to remove some of the historical details that crowd a basic stats course long after their expiry date, and to find new, practical ways to teach some of the core principles. Because when I sit back and think about the core principles of statistics, there are really only two parts that are tough, and it is those two parts that are, I think, most commonly taught in a clunky and old-fashioned way – but they’re also crucial components of the whole edifice of basic statistics, and I think the alternative to teaching them is often seen as jumping straight to computers, which is in many ways worse. So here I want to outline my revisions to teaching statistics, and the principles behind them.

In a nutshell, I have decided to teach distribution theory by starting with a practical class based on dice, and I have completely ditched the use of standard tables of distributions. I’m still thinking about what else in the classic statistics curriculum is unnecessary or needs to be radically re-taught.

Teaching distribution theory with dice

This year I trialed a class on distribution theory that I taught using 10-sided dice. My distribution theory class is 3 hours long, so I spent 1.5 hours on a practical with dice, and then I introduced the mathematics of the distributions, as an addendum really to playing with the dice. For the dice class, I divided the students into pairs and gave them 10d10 each. I also handed out an Excel spreadsheet that was pre-designed to enable them to generate probability distributions from counts of the values they rolled – they could have written this themselves at the beginning of class, but that always slows the class down, and I don’t want the students wasting time on, or getting confused by, something which is at this stage just a tool, so I prepared the basic spreadsheet for them.

The practical was then divided into these stages:

  1. Generate a uniform distribution: Choose ONE ten-sided die and roll it multiple times (I suggested 30 times), counting the number of times each face came up and entering the counts in the spreadsheet. Graph the resulting probabilities. Is the die loaded? [In fact most students had loaded dice – one group managed to roll almost entirely 1s and 4s (I think) and another group rolled very few high numbers]. Then show them the theoretical distribution on the board. This distribution is so simple that students immediately understand it. This is the key to linking a practical sense of numbers with the principles of distribution functions; we will build up to more complex distributions, and we will also be leveraging the question of whether the die is loaded for a bit of a Bayesian chat.
  2. Generate a Bernoulli distribution: Ask one person from each group to pick a number between 6 and 10; this is the threshold on their d10 for a success. Make sure they use the same d10 they just built a uniform distribution from. Again, get them to roll about 30 times, and generate the resulting distribution. This distribution is so trivial that the students will be wondering what the point is, but it gets you to a very simple couple of questions that bear on the nature of statistical tests. After about 30 rolls the proportion of successes will be pretty close to the “true” proportion – unless their die is loaded. So I asked the students what they thought the probability of success should be, and they all immediately calculated it as the sum of the relevant probabilities from the uniform distribution they had just rolled. I asked them what the theoretical probability should be, and again they could easily answer this trivial question – and then I asked them to suggest ways to test whether their die differed from the theoretical probability. This is all preparatory to talking about cumulative distribution functions, probability mass and (later) methods for statistical testing. Often at this stage in my class some students don’t even really know why we would do a statistical test, and by posing these questions I present a natural example of a test you might want to do. I also gave a brief explanation of Bayesian statistics here (in a very heuristic way), explaining the relationship between the Bernoulli distribution and the prior uniform distribution they had rolled, and pointing out how their knowledge about that prior affected their judgment of the true distribution of the Bernoulli. This is all so intuitive with the dice in your hand that it’s impossible to be confused by the theory. Whereas if I had started from Bayes’ theorem and the formula for a Bernoulli distribution, the students would be in great pain, even though the maths of each of these ideas is not complex in and of itself.
  3. Generate a sum of uniform distributions: roll 2d10 30 times and plot the resulting distribution. Of course this distribution is already halfway to being normal (it looks roughly bell-shaped), and although we haven’t introduced the maths of the normal distribution everyone knows it from popular culture, so when you say it looks a bit like the bell curve they immediately get it. You can also ask students the probability of rolling a total of 2, and compare it with the probability of, say, a 10; this helps everyone to see in a very practical way just how distorted the probability distribution becomes from adding two uniform distributions together (I have been doing stats for 20 years and I still think this is a really cool kind of magic!). They can see it in their distribution plot, and they can calculate the probabilities easily just by thinking about the dice. By now everyone is thinking about distributions in a natural and intuitive way, and we haven’t written down a single actual formula yet.
  4. Generate a binomial distribution: roll all 10 dice and count the successes, using the success rule from step 2. Again, this is an example of the Central Limit Theorem at work, and the probability calculations for the extreme values are even more potent examples of how adding together random variables makes them behave very differently. The students are also now building a genuinely interesting distribution, and can get a real sense of how probability distributions describe the probability of particular events. (A quick simulation of all four stages is sketched just after this list.)
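
For anyone who wants to replay the practical without physical dice, here is a minimal sketch of the four stages in Python – my students used Excel and real d10s, so numpy is standing in for both the dice and the tally spreadsheet here, and the success threshold of 7 is just an illustrative choice:

    import numpy as np

    rng = np.random.default_rng()

    def empirical_dist(values, support):
        # tally the rolled values into an empirical probability distribution
        counts = np.array([np.sum(values == v) for v in support])
        return counts / counts.sum()

    # Stage 1: uniform - one d10 rolled 30 times
    rolls = rng.integers(1, 11, size=30)
    print("uniform:", empirical_dist(rolls, range(1, 11)))

    # Stage 2: Bernoulli - success if the roll meets a chosen threshold
    threshold = 7
    print("Bernoulli p-hat:", (rolls >= threshold).mean(),
          "theoretical:", (11 - threshold) / 10)

    # Stage 3: sum of two uniforms - 2d10 rolled 30 times
    sums = rng.integers(1, 11, size=(30, 2)).sum(axis=1)
    print("2d10 sums:", empirical_dist(sums, range(2, 21)))

    # Stage 4: binomial - count the successes among all 10 dice, for 30 rolls of the handful
    hits = (rng.integers(1, 11, size=(30, 10)) >= threshold).sum(axis=1)
    print("binomial:", empirical_dist(hits, range(0, 11)))

Rolling a few thousand times instead of 30 makes a nice follow-up, because the empirical distributions then snap onto the theoretical ones.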

Finally, I had the students build cumulative distribution functions, and relate the calculation of probabilities in the cumulative distribution function for the uniform distribution to the calculation they performed in step 2. Having done all of this they were very comfortable with the concept and application of distribution functions. For the second 1.5 hours I introduced the equations for these distributions, then brought in the normal distribution, plotted it, and talked about its properties. Where previously they would have been looking at equations that are quite daunting for people without much mathematical background, now they were looking at equations for distributions they were already familiar with. Knowing the shape of these distributions and how they are formed, they could focus on the only important point: the relationship between x and the probability that comes out of the function.
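
In the same spirit, here is a small sketch of the cumulative-distribution step, again in Python rather than the spreadsheet, with made-up tallies standing in for one group’s 30 rolls; the step-2 success probability falls straight out of the CDF:

    import numpy as np

    counts = np.array([4, 2, 3, 5, 2, 3, 4, 2, 3, 2])  # hypothetical tallies for faces 1..10
    probs = counts / counts.sum()
    cdf = np.cumsum(probs)                              # F(x) = P(roll <= x) for x = 1..10

    threshold = 7                       # the illustrative success rule from step 2
    p_success = 1 - cdf[threshold - 2]  # P(roll >= 7) = 1 - F(6)
    print(p_success)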

Ditching tables

The next step of distribution theory in a traditional stats class is the tedious task of learning how to calculate cut points of distributions from tables. Having been through the dice exercise the students already have an intuitive feel for cut points and for cumulative distribution functions, so I don’t bother showing them tables. Instead, I give them an Excel spreadsheet that contains the functions to do these calculations, and we work through some examples together. I then explain why we used to have to use tables, and why we don’t anymore. I explain that the properties of the normal distribution (any normal variable can be shifted and rescaled to the standard normal) were useful back in the day when we only had one table to work from, but they’re not anymore. In the past I have noticed that this transformation of the normal distribution really kills a lot of students; it’s really hard for non-mathematicians to think about. But it’s just not important anymore to learn about tables. I have new textbooks which still have tables in the back. Why? When was the last time you used a statistical table?
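
For what it’s worth, the same “no tables” calculations translate directly out of the spreadsheet (presumably NORM.DIST, NORM.INV or similar) into any statistical language. A minimal sketch in Python with scipy, using arbitrary numbers for the mean, standard deviation and cut points:

    from scipy import stats

    dist = stats.norm(loc=100, scale=15)  # an arbitrary, non-standard normal distribution

    print(dist.cdf(120))          # probability below a cut point: P(X <= 120)
    print(dist.ppf(0.975))        # the cut point with 97.5% of the probability below it
    print(stats.norm.ppf(0.975))  # ~1.96, only needed if you insist on standardizing first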

Putting history in its proper place

This shift to teaching cut-points of distributions first practically and then using Excel is part of a move to dump some of the parts of statistics that are largely of historical value only. A lot of classical statistics was invented for a period when experiments were hard to do and very expensive, but many of those tools just aren’t as important anymore, or have been superseded. For example, no one uses correlation as a measure of the relationship between two variables anymore – we just use regression, because it’s much more flexible, and because associating the relationship between two variables directly with the line through their scatter plot forces students to think about the possibility that a linear model is inadequate. So why bother teaching correlation in this context at all? I teach correlation as a stepping stone to understanding the challenges of longitudinal modeling, and so that students can understand the concept of non-independent observations – not because correlation is a useful tool in its own right – but I think a lot of courses teach it as if it still had the importance it had when it was first used back in the day. I think we could probably even – as a whole community – rejig the way we write basic statistical tests (such as the Z test) so that they don’t rely on calculating a standardized test statistic. There’s no reason modern statistical software needs to calculate a test statistic standardized to N(0,1), but the need to standardize adds a layer of complexity to understanding the theory of testing. Could we rejig our statistical practice so that this standardization step is recognized as a throwback to a time when we only had tables of cut-off points for N(0,1)?
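
To make the standardization point concrete, here is a minimal sketch (in Python, with made-up numbers) of the same Z test done the classical way and done directly on the sampling distribution of the mean; the two p-values are identical, so the standardization step buys nothing except the table lookup:

    import numpy as np
    from scipy import stats

    mu0, sigma, n, xbar = 50.0, 10.0, 200, 51.8  # hypothesized mean, known SD, sample size, observed mean
    se = sigma / np.sqrt(n)

    # classical route: standardize, then compare to N(0,1)
    z = (xbar - mu0) / se
    p_standardized = 2 * stats.norm.sf(abs(z))

    # direct route: ask the sampling distribution of the mean the same question
    p_direct = 2 * stats.norm(loc=mu0, scale=se).sf(mu0 + abs(xbar - mu0))

    print(p_standardized, p_direct)  # identical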

Should we forget about the T-test and non-parametric tests?

Following on from these questions, I wonder about the T-test and non-parametric tests. If you are working in epidemiology it is highly unlikely that in the modern era you will only have 30 or 40 observations. You won’t get into the Lancet without doing a major multi-country study with tens or hundreds of thousands of observations. In this case, the difference between a T-test and a Z-test for a mean is going to be … irrelevant. Should we consider teaching T-tests as a historical oddity, something that you only really need to care about in a few rare fields of modern science? Every other field of physical science makes approximations all the time, but for some reason in statistics we insist on carefully distinguishing between Z- and T-tests, instead of saying “the assumptions of the Z-test don’t work in small samples (which you shouldn’t be relying on anyway).” I know this is not theoretically correct, but for students outside the physical/engineering sciences the distinction just adds extra confusion. I compromise by explaining the test in full as a basic test, but then pointing out how irrelevant the difference is in the modern world of massive samples.
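
As a rough illustration of how little the distinction matters at modern sample sizes, here is a small sketch (Python again, with an arbitrary test statistic) comparing the two-sided p-values from the t and normal distributions at a small n and at an epidemiology-scale n:

    from scipy import stats

    stat = 2.0                                # an arbitrary test statistic
    for n in (20, 50_000):
        p_t = 2 * stats.t.sf(stat, df=n - 1)  # T-test p-value
        p_z = 2 * stats.norm.sf(stat)         # Z-test p-value
        print(n, round(p_t, 4), round(p_z, 4))

    # at n = 20 the two differ noticeably; at n = 50,000 they agree to about
    # four decimal places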

I think the same might apply to non-parametric tests – we just don’t use them, and the theory of non-parametric statistics is so much richer and more profound than one would ever realize from studying the Wilcoxon Rank-Sum test. Should we bother with tests that are under-powered, and that get many students mired in confusion over when the Central Limit Theorem holds, and what test to use in what setting? Especially in epidemiology, where we will almost always be working with binary outcomes?

My students seemed to enjoy and benefit from the dice class. I certainly find they grasp the issue of critical points in the distributions more easily if they work from Excel than from tables, and I think it helps them to get a sense of what’s important if we teach some other aspects of the topic as being accidents of history rather than essential parts of theory. Are there other things that we can change? Are there other ways we can make this very beautiful, profound topic interesting and accessible to people with limited mathematical background and even more limited mathematical patience? I think there are, and we should strive to find them.