By Kevin Gray
We owe London to Rome and, in a way, Statistics to Monaco. Quite a few of our earliest statisticians were mathematicians with an interest in games of chance and, in jest, I have even described myself as a professional gambler. (Attention wanders quickly, though, when I begin to describe what I really do…)
Some of you may have heard of MCMC – Markov Chain Monte Carlo – which is used extensively in Bayesian Statistics. There is actually a thick volume entirely devoted to that subject entitled Handbook of Markov Chain Monte Carlo (Brooks et al.). I won’t get into Bayesian methods in this post but Professor Andrew Gelman gives a nice overview on his blog in Bayesian statistics: What’s it all about?
So what are Monte Carlo studies? Though we may sometimes feel we’re drowning in data, truth be told, we frequently don’t have the data we need for specific purposes. This holds for university professors as well. Collecting and assembling new data takes time and money and the new data are not guaranteed to suit our objectives.
Say a statistician would like to evaluate a new (or existing) statistical procedure, machine learning algorithm or fit index. A single sample of real data will rarely be enough. Moreover, it would only be a sample and by definition subject to sampling error. What if, instead, we could create our own population and “sample” from it? Since we have designed the population ourselves, we know the answer in advance.
With today’s computers it’s easy to simulate thousands or even hundreds of thousands of samples based on hypothetical populations (which are often inspired by real data). Importantly, it’s possible to tweak these imaginary populations to see how well the new algorithm (or whatever) holds up under various experimental combinations of sample size, skewness, kurtosis, collinearity and what have you. Knowing the truth – because we’ve created it – allows us to accurately assess the algorithm’s performance and also compare it against alternatives.
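As a minimal sketch of this idea, here is a small simulation in Python (using NumPy; the choice of an exponential population and a 95% confidence interval for the mean is purely illustrative, not something from a specific study). We design a skewed population whose true mean we know, draw many samples from it, and check how often the usual normal-approximation interval actually covers the truth:

```python
import numpy as np

rng = np.random.default_rng(42)

def ci_covers_truth(sample, true_mean, z=1.96):
    """Return True if the normal-approximation 95% CI contains true_mean."""
    m = sample.mean()
    se = sample.std(ddof=1) / np.sqrt(len(sample))
    return (m - z * se) <= true_mean <= (m + z * se)

def coverage(n, n_sims=10_000, true_mean=1.0):
    """Fraction of simulated samples whose CI covers the known true mean.

    The population is exponential with mean 1.0 - deliberately skewed,
    so we can see how the 'classical' interval holds up.
    """
    hits = sum(
        ci_covers_truth(rng.exponential(true_mean, size=n), true_mean)
        for _ in range(n_sims)
    )
    return hits / n_sims

# Because we designed the population, we know the answer in advance:
# nominal coverage is 95%, and we can see how close each sample size gets.
for n in (10, 50, 200):
    print(f"n = {n:>3}: coverage ≈ {coverage(n):.3f}")
```

Because the truth is known by construction, any gap between the nominal 95% and the simulated coverage is directly attributable to the method, not to the data.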
Academic publications such as the Journal of the American Statistical Association and Structural Equation Modeling describe the results of Monte Carlo studies in every issue. One of my key takeaways is that many “classical” statistical methods are quite robust to violations of their assumptions and are nowhere near as fragile as often feared or claimed. Some do fall down badly, however, when they would have been expected to perform well. This holds for newer methods too. Applied statisticians like me need to keep our eyes peeled for new guidelines from the academic community. Not exactly like watching MythBusters, but it’s part of the job. We also occasionally perform Monte Carlo simulations to estimate the sample size required for statistical modeling we’re contemplating.
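The sample-size estimation mentioned above can be sketched in a few lines. This hypothetical example (my own illustration, not a procedure from the journals cited) estimates the power of a simple two-group comparison by simulation, walking the per-group sample size upward until the conventional 80% power is reached; the effect size of 0.5 and the z-test are assumptions of the sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulated_power(n, effect=0.5, alpha_z=1.96, n_sims=5_000):
    """Estimate power of a two-sample z-test for a given per-group n.

    Assumes unit-variance normal groups whose means differ by `effect`;
    power is the fraction of simulated studies that reject the null.
    """
    rejections = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, 1.0, n)
        b = rng.normal(effect, 1.0, n)
        se = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
        z = (b.mean() - a.mean()) / se
        if abs(z) > alpha_z:  # two-sided test at alpha = 0.05
            rejections += 1
    return rejections / n_sims

# Try increasing sample sizes until simulated power reaches ~0.80
for n in (20, 40, 64, 100):
    print(f"n per group = {n:>3}: power ≈ {simulated_power(n):.2f}")
```

For designs too complicated for closed-form power formulas – mixed models, planned missingness, clustered samples – this simulate-and-count approach is often the only practical option.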
Monte Carlo simulations are also conducted by some political pollsters and analysts. A few of you might also associate the term with “Wall Street Quants.” I have a book on this latter, financial application, Handbook in Monte Carlo Simulation (Brandimarte), and many others on the topic have been published, along with hundreds of papers. Here, the basic idea is to generate a large number of forecasts based on an econometric model (or ensemble of models) under varying economic conditions or events that could plausibly occur.
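To make that concrete, here is a toy version of the forecasting idea (my own sketch, not drawn from Brandimarte’s book): simulate many possible price paths under a simple geometric Brownian motion model and summarize the spread of outcomes. The starting price, drift, and volatility are made-up illustrative numbers, not calibrated to real data:

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_paths(s0=100.0, mu=0.05, sigma=0.2, years=1.0,
                   steps=252, n_paths=10_000):
    """Simulate geometric-Brownian-motion price paths.

    A stand-in for 'forecasts from an econometric model under varying
    conditions': each path is one plausible future. Returns an array
    of shape (n_paths, steps).
    """
    dt = years / steps
    # One log-return per step per path, then compound them
    shocks = rng.normal((mu - 0.5 * sigma**2) * dt,
                        sigma * np.sqrt(dt),
                        size=(n_paths, steps))
    return s0 * np.exp(shocks.cumsum(axis=1))

final = simulate_paths()[:, -1]  # price at the end of each simulated year
print(f"median outcome:      {np.median(final):.1f}")
print(f"5th-95th percentile: {np.percentile(final, [5, 95]).round(1)}")
```

Rather than a single point forecast, the analyst gets a whole distribution of outcomes, which is what makes this style of simulation useful for risk assessment.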
Monte Carlo studies have made R&D for statistics and machine learning much easier and faster. Real data are still needed, but no longer do we have to rely on mathematical theory and wait years for empirical studies to tell us how a new technique stacks up against established methods or other new kids on the block.
I hope you’ve found this interesting and helpful!