“Where Do I Start?” A Beginner’s Roadmap to Bayesian Statistics

One of the most common questions I get, especially from graduate students, is, “How do I get started with Bayesian statistics?” It’s a fair question, given that most university programs focus heavily on ‘traditional’ frequentist methods, leaving Bayes as a bit of an enigma. While this is slowly changing, for now, learning Bayesian statistics can feel like an overwhelming and lonely experience. When I started, I had no clue where to begin, so I read everything with the word ‘Bayes’ attached. That approach worked, eventually, but it involved a lot of trial and error. Drawing on that experience, here is a more streamlined roadmap to get you started.


What do you need to know to get started?

1) Learn to code.

There’s no way around this one—you need to know how to code. Whether it’s R, Python, or Julia, choose a language and get comfortable with it. Point-and-click software won’t cut it if you’re serious about Bayesian modeling. There are several reasons for this, with flexibility, reproducibility, and open access to the scientific community being chief among them. Once you get over the initial hurdle of typing commands, the benefits of coding will become obvious, and you’ll never turn back.

2) Understand Probability Theory.

This is fairly straightforward but cannot be emphasized enough. Bayesian statistics is fundamentally derived from Bayes’ Theorem, which provides a framework for updating our prior beliefs with new data. Because we’re dealing with uncertainty—we don’t know for sure what the answer is—we express our beliefs using probabilities. This allows us to quantify that uncertainty and update our beliefs as we gather more evidence. Understanding probability theory is essential for grasping this process.
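To make Bayes’ Theorem concrete, here’s a tiny worked example in Python. The numbers are my own illustration (not from any real diagnostic test): a condition with 1% prevalence and a test with 99% sensitivity and a 5% false-positive rate.

```python
# Worked Bayes' theorem example (illustrative numbers):
# how likely is the condition, given a positive test?

prior = 0.01                # P(condition), the prevalence
sensitivity = 0.99          # P(positive | condition)
false_positive_rate = 0.05  # P(positive | no condition)

# Law of total probability: overall chance of a positive result
p_positive = sensitivity * prior + false_positive_rate * (1 - prior)

# Bayes' theorem: P(condition | positive)
posterior = sensitivity * prior / p_positive

print(round(posterior, 3))  # → 0.167
```

Even with an accurate-sounding test, the low prior pulls the posterior down to about 17%—exactly the kind of belief-updating intuition that probability theory trains.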

3) Acquire a general understanding of MCMC sampling and the creation of a posterior distribution.

Here’s where things get interesting. Unlike traditional point-and-click statistics, where you press a button and get p-values, Bayesian statistics involves sampling from something called a posterior distribution, which is created via Markov chain Monte Carlo (MCMC). Not to get too far into the weeds just yet, but a posterior distribution is something we generate for each of our model parameters (think of a parameter as, say, a regression coefficient). In Bayesian statistics, we don’t assume we know exactly what our parameter values are. Instead, we create a distribution of all the possible values they may take, given the data and our prior beliefs. The resulting posterior distribution provides a full picture of the uncertainty around our estimates, which is critical for making informed decisions.

We create this distribution using MCMC sampling, which lets us estimate posterior distributions even when they are complex and can’t be computed directly (which is often the case in real-world problems). If this sounds overly complicated or even frightening, don’t worry—that’s completely normal. For many, this is the first big hurdle to getting into Bayes, but it’s also what makes it so powerful. Once you have some understanding of this process, the rest becomes much easier to digest.
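If it helps to see the machinery, here is a minimal Metropolis sampler in Python. It’s a deliberately bare-bones sketch of the MCMC idea for a toy model of my own (normal data with known spread, normal prior on the mean), not something you’d use for real work:

```python
import math
import random

random.seed(42)
y = [2.1, 1.9, 2.4, 2.2, 1.8, 2.0]  # made-up data, assumed Normal(mu, 1)

def log_posterior(mu):
    # log prior (mu ~ Normal(0, 10)) + log likelihood, constants dropped
    log_prior = -mu**2 / (2 * 10**2)
    log_lik = -sum((yi - mu) ** 2 for yi in y) / 2
    return log_prior + log_lik

samples = []
mu = 0.0  # arbitrary starting value
for step in range(20000):
    proposal = mu + random.gauss(0, 0.5)  # propose a nearby value
    # accept with probability min(1, posterior ratio); work on the log scale
    if math.log(random.random()) < log_posterior(proposal) - log_posterior(mu):
        mu = proposal
    if step >= 5000:  # discard warm-up draws
        samples.append(mu)

posterior_mean = sum(samples) / len(samples)
print(round(posterior_mean, 2))  # posterior mean of mu, near the sample mean of y
```

The collected `samples` *are* the posterior distribution for `mu`: you summarize them with means, intervals, and histograms. In practice a platform like Stan does this for you with a far better algorithm (HMC), but the accept/reject loop above is the core idea.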

Where do you get started?

1. Coding.

· Pick a language: I mostly use R, but Python and Julia are great options too. Choose the one you’re most comfortable with. Keep in mind, though, that the Julia community is still relatively small, and let me tell you, when you’re learning a new coding language the amount of help you can get matters… a lot.

· Install the tools: Download the language and find an IDE (integrated development environment) you like. The IDE is the user interface—the thing you open on your computer and look at while you code. You’ll spend a lot of time here, so try a few and pick one you like. RStudio is popular for R; Jupyter works well for Python.

· Don’t overthink tutorials: Pick an introductory course and work through it, but don’t linger there (like I did). Coding can feel abstract and boring without a real project. There are several courses to help get you started; one of my favourites is from the folks at Software Carpentry.

· Use a real dataset: Building on the last point, the best way to learn is by doing. Open up a dataset—something you’re working on, or a free one from the web—and start exploring it. Clean it up, make tables, run analyses, visualize results.

· Leverage ChatGPT: As you’re learning, tools like ChatGPT can help. Need to import data? Ask it for code. Unsure about a function? Ask for an explanation. Over time, you’ll need less assistance and gain fluency.
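To make the “clean it up, make tables” loop concrete, here’s a minimal sketch in plain Python, with a made-up mini-dataset standing in for a real file (in practice you’d load your own data and likely lean on R or a data-frame library):

```python
import statistics
from collections import defaultdict

# A tiny made-up dataset; imagine these rows came from a CSV you loaded.
rows = [
    {"group": "a", "score": 3.1},
    {"group": "a", "score": 2.8},
    {"group": "b", "score": 4.0},
    {"group": "b", "score": None},  # a missing value to clean out
]

# Clean: drop rows with missing scores
clean = [r for r in rows if r["score"] is not None]

# Summarize: a quick table of mean score per group
by_group = defaultdict(list)
for r in clean:
    by_group[r["group"]].append(r["score"])
means = {g: statistics.mean(v) for g, v in by_group.items()}
print(means)
```

The point isn’t this particular code—it’s the habit: load, clean, summarize, look, repeat. That loop is where coding fluency actually develops.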


2. Probability Theory, MCMC, and Posterior Distributions

Start with Statistical Rethinking by Richard McElreath (2nd edition). It’s hands down the best resource for beginners. The only real prerequisite is a basic undergrad stats course, though even that may not be entirely necessary. McElreath does an excellent job breaking down probability theory, MCMC sampling, posterior distributions, and commonly used models in a clear and accessible way. Plus, the book is packed with R code examples (there are also Python and Julia translations available online).

Once you grasp the basic concepts of MCMC and posterior distributions, it’s time to start building your own models. For this, I highly recommend Stan. It’s an incredibly robust platform for Bayesian inference that uses Hamiltonian Monte Carlo (HMC) and is built for flexibility and efficiency. Stan integrates seamlessly with R and Python and allows you to write custom models. While there’s a bit of a learning curve, the payoff is huge—Stan offers flexibility and customization in a way that few other platforms can match. Take your time here; it’s worth the effort, and once you’re comfortable with Stan, you’ll see how powerful Bayesian statistics can be.

What’s especially valuable about Statistical Rethinking is that it uses Stan throughout for Bayesian modelling. As you work through the book, you’ll learn to build models in Stan via McElreath’s ‘ulam’ function. While this is still a step away from writing Stan directly, it introduces Stan in a manageable and intuitive way.

Keep in mind, however, that you will eventually need to learn how to use Stan directly. The Stan website (mc-stan.org) has great resources and documentation for beginners and advanced users alike.
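To give you a feel for what writing Stan directly looks like, here is a minimal, illustrative Stan program of my own (not from the book): it estimates the mean and standard deviation of normally distributed data, with arbitrary priors chosen just for the sketch.

```stan
// A minimal Stan model (illustrative sketch): y ~ Normal(mu, sigma)
data {
  int<lower=0> N;        // number of observations
  vector[N] y;           // the data
}
parameters {
  real mu;               // the mean we want a posterior for
  real<lower=0> sigma;   // the standard deviation
}
model {
  mu ~ normal(0, 10);      // prior on the mean
  sigma ~ exponential(1);  // prior on the sd
  y ~ normal(mu, sigma);   // likelihood
}
```

You declare your data, your parameters, and your model (priors plus likelihood), and Stan’s HMC sampler produces the posterior draws for you. Every Stan model you ever write is an elaboration of this three-block skeleton.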

And don’t forget: McElreath teaches a course mirroring the content of Statistical Rethinking at the Max Planck Institute for Evolutionary Anthropology in Leipzig, Germany, and posts the lectures free on YouTube. He’s an excellent teacher, and his slides are great.


That’s it! Get comfortable with coding, dive into Statistical Rethinking, and start applying these concepts to real data. In no time you’ll be running basic Bayesian models—comparing group means, running simple regressions, and interpreting results. From there, the sky’s the limit. Bayesian statistics offers incredible flexibility and insight once you master the basics. The real payoff of analyzing data this way is that, at some point, all your models will be bespoke (custom-built), at least to some degree; this is where things get really fun and where real progress is made. Until then, this roadmap should help you get up to speed.

If you have any questions about getting started or want advice on a specific part of the process, don’t hesitate to reach out! I’d love to help you on your journey into Bayesian statistics.