A Comprehensive Overview of Bayesian Statistics for Beginners
Understanding Bayesian Statistics
Bayesian statistics can be both engaging and challenging. For instance, consider the question: What is the likelihood that this subject is difficult, given that it is also enjoyable? While this may seem trivial, it captures the essence of Bayesian statistics: using historical data or evidence to forecast future outcomes. This topic is essential in data science and machine learning curricula, yet many find it daunting due to conventional teaching methods.
Bayesian statistics is typically presented in one of two ways: either through overly technical mathematics that leaves learners confused or through an overly simplistic approach that lacks depth. Both methods can be frustrating for learners.
I, too, faced difficulties grasping the fundamentals. As a physicist rather than a Bayesian expert, I struggled initially, but once I understood the core principles, clarity emerged. In this concise tutorial, I aim to explain Bayesian inference from the ground up, so you won't be left puzzled about its workings.
What Is Bayesian Statistics?
In contrast to frequentist statistics, which draws conclusions based on the frequency of observed outcomes, Bayesian statistics takes a different approach. For example, if you flip a coin 10 times and observe 6 heads and 4 tails, you might conclude a "60% chance" of heads and a "40% chance" of tails. However, as you flip the coin more times, your belief about these probabilities can adjust. After 1,000 flips, you might find that heads occur approximately 51% of the time, leading you to believe that the coin is likely fair, suggesting a 50-50 chance.
The key distinction in frequentist statistics is that the probability of heads or tails is considered a fixed parameter—unchanging no matter how many times the coin is flipped. This approach is deterministic, asserting that whether you flip the coin 10, 100, or 1,000 times, the probability remains constant.
In Bayesian statistics, we also refine our assumptions as we gather data, but crucially, the parameter (like the probability of heads) is treated as a random variable rather than a fixed entity. This reflects the inherent uncertainty about the true likelihood of heads, regardless of the number of flips observed. Thus, instead of assigning a single value to the probability (like θ = Prob(Heads)), we treat θ as a random variable that adheres to a probability distribution. The mean of this distribution indicates the expected value of θ, while the variance illustrates your uncertainty regarding its true value. In Bayesian analysis, it is only when the number of observations approaches infinity that you can arrive at an exact value for θ; anything less results in some uncertainty, albeit minor with larger datasets.
Bayes' Theorem
The foundation of Bayesian statistics lies in Bayes' theorem. To see where it comes from, recall the product rule of probability: the joint probability of two events A and B can be written either way round, p(A, B) = p(A|B) p(B) = p(B|A) p(A).
Dividing through by p(B) gives Bayes' theorem: p(A|B) = p(B|A) p(A) / p(B). In words, the probability of A given B equals the probability of B given A, multiplied by the probability of A and divided by the probability of B.
This is intriguing, but how does it connect to our earlier coin toss example? We can express the same relationship using probability distributions: p(θ|data) = p(data|θ) p(θ) / ∫ p(data|θ) p(θ) dθ.
Here, "data" refers to a single measurement or a set of observations. The integral in the denominator is the marginal probability of the data, p(data), obtained by integrating the numerator over all possible values of θ.
Remember that a probability density function (p.d.f.) like p(θ) resembles a normalized histogram: higher peaks indicate more probable values. Unlike a histogram of raw counts, however, a p.d.f. always integrates to 1, no matter how many observations it summarizes.
We can simplify further by noting that the denominator is just a normalization constant: it does not depend on θ, so for most of the reasoning we can work with the numerator alone. The critical components to focus on are the likelihood function and the prior.
Essentially, the prior p(θ) represents our initial estimation of the distribution of θ. This could be any distribution of our choosing (e.g., a uniform distribution over a reasonable range for θ), as we aim to update our belief by calculating p(θ|data) based on the equation above. In fact, p(θ|data) becomes the new prior once we introduce new data or evidence.
The likelihood function p(data|θ) conveys the probability of obtaining specific measurements given a value of θ. It is derived from a probability distribution chosen to match the structure of the observations. For instance, in an experiment with exactly two outcomes (like a coin toss), a Bernoulli distribution (the single-trial case of the binomial) is the natural choice for p(data|θ), since it accommodates exactly two outcomes.
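As a minimal sketch (assuming plain Python), the single-toss likelihood can be written as a one-line function, with x = 1 coding heads and x = 0 coding tails:

```python
def likelihood(x, theta):
    """Bernoulli likelihood p(x | theta) for a single toss:
    theta**x * (1 - theta)**(1 - x), where x = 1 is heads, x = 0 is tails."""
    return theta**x * (1.0 - theta)**(1 - x)

# With theta = 0.6, heads is more likely than tails:
print(likelihood(1, 0.6))  # probability of heads, 0.6
print(likelihood(0, 0.6))  # probability of tails, 0.4
```

Note how the two branches of the formula switch on and off: for x = 1 only the θ factor survives, and for x = 0 only the (1 − θ) factor does.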
The posterior distribution p(θ|data) updates our understanding of θ based on the new data we have observed, and it becomes the new prior for subsequent measurements.
This may seem complex, but take your time to digest it. If you grasp the purpose of the posterior, likelihood, and prior functions, you're on the right path.
How This Works with Multiple Measurements
Returning to our coin toss scenario, let’s assume we begin with no knowledge about the coin's fairness. We denote the probability of heads as θ and treat it as a random variable. After observing one coin toss outcome, denoted as x1, we can compute the posterior distribution.
When we conduct a second toss, represented by x2, we can update our posterior distribution, now incorporating two outcomes. If we then proceed to a third outcome x3, we can continue this process.
If we toss the coin n times, the posterior distribution can be written as:
p(θ|x1, …, xn) ∝ p(θ) · ∏(k = 1 to n) p(xk|θ)
Equation X — Posterior distribution for a series of outcomes
Here, the product symbol signifies a sequence of terms from k = 1 to n. Notice that our initial prior distribution p(θ) remains unchanged, as it was our starting assumption. The more outcomes we aggregate, the more precise our posterior distribution becomes. This equation is applicable not just to coin tosses but to any Bayesian scenario involving a single unknown parameter θ.
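This sequential updating can be sketched in a few lines of Python (assuming NumPy and a simple grid of candidate θ values; the grid resolution is an arbitrary choice). Each normalized posterior becomes the prior for the next toss:

```python
import numpy as np

# Grid of candidate theta values and a uniform prior over [0, 1].
theta = np.linspace(0.0, 1.0, 1001)
dtheta = theta[1] - theta[0]
posterior = np.ones_like(theta)  # uniform prior p(theta)

def update(posterior, x):
    """One Bayesian update: multiply the current belief by the Bernoulli
    likelihood of outcome x (1 = heads, 0 = tails), then renormalize."""
    posterior = posterior * theta**x * (1.0 - theta)**(1 - x)
    return posterior / (posterior.sum() * dtheta)

# Feed outcomes in one at a time; each posterior is the next prior.
for x in [1, 0, 0, 1, 0]:
    posterior = update(posterior, x)
```

After these five hypothetical tosses (2 heads, 3 tails), the posterior peaks at θ = 0.4, the observed fraction of heads, exactly as the math predicts for a uniform prior.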
For our example, we need to choose a function for the prior p(θ) and a function for the likelihood p(x|θ). I'll select a uniform distribution for the prior over the range [0, 1], since θ is a probability and must lie between 0 and 1. For the likelihood of a single toss, I'll use a Bernoulli distribution: p(x|θ) = θ^x (1 − θ)^(1 − x).
The first term gives the probability of heads and the second the probability of tails. The variable x encodes the outcome of a toss (1 for heads, 0 for tails). Plugging this likelihood into Equation X leads us to a significant simplification.
Because multiplying powers of θ adds their exponents, the long product collapses into a compact sum over outcomes: p(θ|x1, …, xn) ∝ θ^(Σ xk) · (1 − θ)^(n − Σ xk), where Σ xk counts the heads among the n tosses. This lets us express the posterior for any number of outcomes in one streamlined formula. The final step is to normalize the posterior by dividing by the marginal probability of the data.
This integral in the denominator is best calculated numerically. So, what does this yield? If we were to implement this in Python (which I’ll cover in a future post) using a coin with a bias of θ = 0.25 (indicating a 25% chance of heads and a 75% chance of tails), we would generate the following plots:
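A minimal version of that implementation might look like the sketch below (assuming NumPy; the true bias θ = 0.25 comes from the example, while the seed, toss count, and grid resolution are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
theta_true = 0.25                    # true bias: 25% chance of heads
N = 500                              # number of simulated tosses
tosses = rng.random(N) < theta_true  # boolean array, True = heads

theta = np.linspace(0.0, 1.0, 1001)  # grid of candidate theta values
prior = np.ones_like(theta)          # uniform prior on [0, 1]

# Unnormalized posterior: theta^(#heads) * (1 - theta)^(#tails) * prior
heads = int(tosses.sum())
tails = N - heads
unnormalized = prior * theta**heads * (1.0 - theta)**tails

# Normalize numerically (the integral in the denominator).
posterior = unnormalized / (unnormalized.sum() * (theta[1] - theta[0]))

print(theta[np.argmax(posterior)])  # peak should sit near 0.25
```

Plotting `posterior` against `theta` for N = 20, 100, and 500 reproduces the qualitative behaviour described below: the curve narrows and its peak homes in on the true bias as the data accumulate.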
In these plots, the initial prior p(θ) is a uniform distribution across [0,1], given that we had no prior knowledge about the coin. This is visually represented as a horizontal black line in each graph. The blue curve illustrates the posterior distribution p(θ|data) after updating it with N coin tosses (where N = 20 in the first plot, N = 100 in the second, and N = 500 in the third).
From Figure 2, after 20 tosses, the distribution appears broad, spanning θ = 0.1 to 0.7, peaking around θ = 0.4. This suggests a tentative bias towards lower θ values, although uncertainty remains high. Increasing N to 100 in Figure 3 shows the peak of the posterior distribution moving closer to the true bias, with reduced spread across the axis, indicating decreasing uncertainty with more data. Finally, in Figure 4, at N = 500, the peak aligns with the true bias of 0.25, showcasing a marked reduction in uncertainty.
The beauty of Bayesian statistics lies in its utility for hypothesis testing. For example, if someone presents you with a coin and you’re unsure of its fairness, calculating the posterior distribution after N tosses enables you to determine not just whether the coin is biased but also the extent of that bias and your confidence in the result.
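As an illustration (assuming a hypothetical tally of 130 heads in 500 tosses, roughly consistent with the θ = 0.25 example), both the probability of a bias and a credible interval can be read directly off the grid posterior:

```python
import numpy as np

# Grid posterior for 130 heads in 500 tosses with a uniform prior.
theta = np.linspace(0.0, 1.0, 1001)
dtheta = theta[1] - theta[0]
heads, n = 130, 500
post = theta**heads * (1.0 - theta)**(n - heads)
post /= post.sum() * dtheta

# Posterior probability that the coin favours tails (theta < 0.5):
p_tails_bias = post[theta < 0.5].sum() * dtheta

# A rough 95% credible interval from the posterior quantiles:
cdf = np.cumsum(post) * dtheta
lo = theta[np.searchsorted(cdf, 0.025)]
hi = theta[np.searchsorted(cdf, 0.975)]
print(f"P(theta < 0.5) = {p_tails_bias:.4f}, 95% interval = [{lo:.3f}, {hi:.3f}]")
```

With this much data the posterior mass below θ = 0.5 is essentially 1, so you could assert with near certainty that the coin favours tails, and the interval quantifies how large that bias plausibly is.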
There is, of course, a great deal more to Bayesian statistics than what we've covered here; it's an expansive field. However, I hope this tutorial has clarified some of its core concepts for you. Thank you for reading, and I wish you a great day!