This week I have been engaged in a Bayesian Statistics Death March with two students and a colleague: four hours a day, locked in the seminar room teaching ourselves Bayesian statistics. We are working through the core chapters of the book Bayesian Data Analysis (3rd Edition) (BDA 3) by Gelman et al. This book is famous and much-loved, and Gelman himself is quite well known (for a statistician): he blogs on statistics and politics and is also a well-respected analyst of politics and social science, with a separate blog on this topic. So it is good to be learning from the best, and I have to say that the book is a really enjoyable read[1], with many clear explanations of the most difficult concepts, illuminating examples and a nice style. It’s also very comprehensive, and I won’t be able to cover all of it in one Death March – I’m restricting myself to the basic principles, hierarchical models, regression and a brief touch of simulation. And it’s difficult – I have been dredging up a lot of mathematical methods that had been sleeping, and having to do that thing you do when facing serious maths seriously, where you have to look at the gap between two equations and say to yourself, “okay, yes, I gotta do this … send a rescue team if I’m not back by dinnertime…” and then start tackling the actual derivations. My fellow travellers on this Death March don’t have undergraduate education in maths and physics (where you learn many of the tricks employed routinely in the book) so it’s me who has to sigh and head once more into the breach …

Anyway, in working through the basic principles I have been struck by how similar Bayesian methods are to some of the methods in Quantum Mechanics (QM), especially those methods related to bra-ket notation. It has been 20 years since I studied QM, so my observations are based on vague memory of what I did then rather than a deep understanding of the principles, but there seem to be a lot of similarities between those methods and Bayesian approaches. The first and most obvious is that QM and statistical mathematics both take place in a Hilbert space, but this is just because they are both based on probability. The more subtle similarities I noticed are in the techniques and tricks of the trade. First, consider the common trick in bra-ket notation of breaking up an inner product by inserting an outer product, integrated over the full space so that it is equivalent to the identity operator (i.e. it doesn’t change the original inner product at all).
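
For concreteness, here is roughly how that trick is usually written. This is a sketch from standard QM notation rather than anything specific to BDA 3: the continuous basis |x⟩ and the state names φ and ψ are assumptions chosen purely for illustration.

```latex
% Resolution of the identity: the integrated outer product acts as the identity operator
\int |x\rangle \langle x| \, dx = \hat{1}

% Inserting it into an inner product leaves the value unchanged
\langle \phi | \psi \rangle
  = \langle \phi | \left( \int |x\rangle \langle x| \, dx \right) | \psi \rangle
  = \int \langle \phi | x \rangle \, \langle x | \psi \rangle \, dx
```

The point is only that an integral over a complete set of intermediate states can be slipped into the middle of an expression without changing its value.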

In Bayesian statistics, there appears to be a similar technique based on conditional probabilities. For example, the conditional probability of a predicted value given the data might be written like this:

$$p(\tilde{y} \mid y) = \int p(\tilde{y} \mid \theta) \, p(\theta \mid y) \, d\theta$$

So here the role of the outer product is played by the expansion over the distribution of the parameter, theta (strictly speaking, its posterior distribution given the data).
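
To make the expansion over theta concrete, here is a minimal numerical sketch in Python. It is not an example from BDA 3: I am assuming a toy problem with y successes in n binomial trials, a uniform prior on theta, and a hypothetical function name `posterior_predictive_success`. It approximates the integral over theta on a grid and compares the result with the known closed form (y + 1) / (n + 2).

```python
import numpy as np

def posterior_predictive_success(y, n, grid_size=10_001):
    """Approximate p(y_new = 1 | y) by summing p(y_new = 1 | theta) * p(theta | y) over a grid."""
    theta = np.linspace(0.0, 1.0, grid_size)
    dtheta = theta[1] - theta[0]
    # Unnormalized posterior: binomial likelihood times a uniform (constant) prior.
    unnorm = theta**y * (1.0 - theta)**(n - y)
    # Normalize so the grid approximation of p(theta | y) integrates to 1.
    posterior = unnorm / (unnorm.sum() * dtheta)
    # For a single Bernoulli outcome, p(y_new = 1 | theta) is just theta.
    return float((theta * posterior).sum() * dtheta)

# Assumed toy data: 7 successes in 10 trials.
y, n = 7, 10
numeric = posterior_predictive_success(y, n)
exact = (y + 1) / (n + 2)  # closed form under the uniform prior (rule of succession)
print(numeric, exact)
```

The grid sum plays the part of the integral over the full space: every possible value of theta contributes, weighted by how plausible it is given the data.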

Another common phenomenon in the work I’m reading through is that of unnormalized probability distributions – that is, probability distributions whose total probability does not add up to 1. This seems to come about through a relatively lax process of not caring about constants in the various stages of the generation of posterior distributions, so that at the end of the process of constructing the final posterior distribution one often has to normalize it, sometimes by huge amounts. For example if you follow example 5.3 in the text, your unnormalized probability distributions will run into values of 10^(-200), but through normalization you’ll end up having them run from 10^(-12) to 0.04, and summing to 1. This is interesting because QM has a similar issue: wave functions also have to be normalized so that the total probability is 1, and on a much grander scale there is the process of renormalization.
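
A minimal sketch of that normalization step, assuming the unnormalized posterior has been evaluated on a grid and stored on the log scale. The function name and the numbers are hypothetical, not the values from example 5.3. Subtracting the maximum before exponentiating is a common way to avoid numerical underflow when the raw values are this tiny.

```python
import numpy as np

def normalize_log_posterior(log_unnorm):
    """Convert unnormalized log-density values on a grid into probabilities that sum to 1."""
    log_unnorm = np.asarray(log_unnorm, dtype=float)
    # Shift so the largest value is 0; the shift is a constant factor that cancels on normalizing.
    shifted = log_unnorm - log_unnorm.max()
    weights = np.exp(shifted)
    return weights / weights.sum()

# Hypothetical log-density values, far too small to exponentiate directly.
log_unnorm = np.array([-1000.0, -995.0, -991.0, -993.0, -1010.0])

naive = np.exp(log_unnorm)        # every entry underflows to 0.0 in double precision
probs = normalize_log_posterior(log_unnorm)
print(naive.sum(), probs.sum())
```

Dropping constants along the way is harmless precisely because of this last step: any constant factor, however huge or tiny, disappears when you divide by the total.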

Finally, and most interesting to me from a philosophical perspective, is the process of generating predicted values given data – that is, the process of generating the left hand side of the above equation. To me this looks a lot like collapsing the wave function in QM, and seems to have an analogous interpretation. When a wave function collapses it means that the probability distribution of possible states has been forced into a single observed state in the real world, and this observed state is restricted to an element in the physical space spanned by the basis functions that define the available physical states it can collapse into. To me this seems very similar conceptually to the process of Bayesian prediction shown in the formula above: conceptually, the observed data (y) describes the available space of physical states that the future observation can collapse into, and our distribution of theta defines the probability of collapse of a future observation into a particular state. So, the predicted value is a kind of wave function collapse onto a space spanned by the experimental data already available. The difference, I suppose, is that in QM the space the wave function collapses into is defined by a model, whereas for Bayesian stats the space is defined by observed data. In the QM case, a bad model means poor predictive properties of the wave function collapse; in Bayesian stats, a bad prior experiment means that the ability to predict the wave function collapse is poor.

Don’t these seem conceptually similar? A brief dig on the internet produced this blog post by the great John Baez, observing similarities between QM and Bayesian statistics. It also includes a discussion of some of the different interpretations of QM ideas, and of the contrast between Bayesian and Frequentist statistics. So, I’m wondering, is it possible that QM physicists developed some of the ideas of modern Bayesian statistics independently of Bayesian statisticians? Being focused on different goals, working in (essentially) different mathematical formalisms, and not concerned with the full totality of experimental processes, the two groups may not have realized what they were doing, or communicated with each other enough to see it. Obviously, Bayesian statisticians developed the basic framework of Bayesian stats first – in 1763, through the work of Bayes – but like many aspects of statistics, the full implications of Bayes’ Rule weren’t really developed until the last half of the 20th century, when computers had begun to be useful, simulation was becoming possible, and other aspects of the mathematical background had been developed. QM developed a little earlier, but really started to hit its stride in the 1960s. So could it be that these two fields were developing similar techniques with very different language and formalisms at approximately the same time, in parallel?

Or, has my knowledge of QM lapsed so much in the past 20 years that I am seeing similarities that are really very superficial, with no deeper similarity than the possibility of analogizing a few tricks?

fn1: for certain definitions of “enjoyable,” obviously – we are talking about Bayesian statistics here after all, not the entire collection of Hustler back issues.