Bayesian statistics for confused data scientists

statskier · 107 days ago

I went through grad school in a very frequentist environment. We “learned” Bayesian methods but we never used them much.

In my professional life I’ve never personally worked on a problem that I felt wasn’t adequately approached with frequentist methods. I’m sure other people’s experiences are different depending on the problems you gravitate towards.

In fact, I tend to get pretty frustrated with Bayesian approaches because when I do turn to them it tends to be in situations that already quite complex and large. In basically every instance of that I’ve never been able to make the Bayesian approach work. Won’t converge or the sampler says it will take days and days to run. I can almost always just resort to some resampling method that might take a few hours but it runs and gives me sensible results.

I realize this is heavily biased by basically only attempting on super-complex problems, but it has sort of soured me on even trying anymore.

To be clear I have no issue with Bayesian methods. Clearly they work well and many people use them with great success. But I just haven’t encountered anything in several decades of statistical work that I found really required Bayesian approaches, so I’ve really lost any motivation I had to experiment with it more.

jhbadger · 107 days ago

I think Rafael Irizarry put it best over a decade ago -- while historically there was a feud between self-declared "frequentists" and "Bayesians", people doing statistics in the modern era aren't interested in playing sides, but use a combination of techniques originating in both camps: https://simplystatistics.org/posts/2014-10-13-as-an-applied-...

therobots927 · 107 days ago

That’s Bayesian propaganda

emma379 · 107 days ago

Hah, maybe. But we've been shipping ML models for five years and we stopped arguing about this around year two. Whatever helps you explain the uncertainty in your estimates to non-technical stakeholders — that's the one you use. Priors make that easier sometimes.

rgr1 · 107 days ago

The pragmatic framing sounds reasonable but I'd push back a little. "Just use both tools" risks glossing over cases where they give genuinely different answers for the same problem. The reconciliation is sometimes real, sometimes hand-wavy. What's the principled guide for when to reach for which?

jmalicki · 107 days ago

I agree... I feel like "The Elements of Statistical Learning" was possibly one of the first "postmodern" things where "well, frequentist and Bayesian are just tools in the toolbox, we now know they're not so incompatible."

After Stein's paradox it became super hard to be a pure frequentist if you didn't have your head in the sand.

cold_edge · 107 days ago

Stein's paradox is one of those things that permanently breaks your brain once you actually sit with it. Pure frequentism after that is just stubbornness.

jrumbut · 107 days ago

The author makes a comparison to Haskell, which I think might be a little misleading.

Haskell is a little more complicated to learn but also more expressive than other programming languages, this is where the comparison works.

But where it breaks down is safety. If your Haskell code runs, it's more likely to be correct because of all the type system goodness.

That's the reverse of the situation with Bayesian statistics, which is more like C++. It has all kinds of cool features, but they all come with superpowered footguns.

Frequentist statistics is more like Java. No one loves it but it allows you to get a lot of work done without having to track down one of the few people who really understand Haskell.

smokel · 107 days ago

This article made me enthusiastic to dive into Bayesian statistics (again). A quick search led me to Think Bayes [1], which also introduces the concepts using Python, and seems to have a little more depth.

[1] https://allendowney.github.io/ThinkBayes2/

oliver236 · 107 days ago

Nice writeup. Something that clicked for me reading this is how much the prior/likelihood/posterior dynamic mirrors transfer learning in deep learning. The prior is basically your pre-trained weights: broad knowledge you bring to the table before seeing any task-specific data. The likelihood is your fine-tuning step. And the Bernstein-von Mises result at the end is essentially saying "with enough fine-tuning data, your pre-training washes out."

Obviously the analogy isn't perfect (priors are explicit and interpretable, pre-trained weights are not), but I think it's a useful mental model for anyone coming from an ML background who finds Bayesian stats unintuitive. Regularization being secretly Bayesian was the other thing that made it click for me. If you've ever tuned a Ridge regression lambda, you were doing informal prior selection.

algolint · 107 days ago

The frequentist vs. Bayesian debate often becomes more about "what can I compute easily?" than "what is the correct mental model?". With tools like Stan and PyMC getting better, the "computational cost" argument is weakening, but the "intuition cost" remains high. Most people are naturally frequentists in their day-to-day reasoning, and switching to a mindset of "probability as a degree of belief" requires a significant cognitive shift that isn't always rewarded with better results in simple business or engineering contexts.

NeutralCrane · 106 days ago

The exact opposite is true. Virtually everyone’s intuition is aligned with the Bayesian model. That intuition has to be hammered out of people in their stats classes because for decades frequentist approaches were computationally more feasible, even if they don’t align with how most humans interpret probability.

lns97 · 107 days ago

Teaching Bayesian updating to practitioners, I found the intuition gap runs deeper than just confidence intervals. People also tend to conflate their prior with the likelihood - once you've collected data, it genuinely takes effort to keep them separate mentally. The research on credible interval misinterpretation matches what I've seen in workshops.

zaik · 107 days ago

I would argue the opposite is true. It takes a long time to beat the Bayesian thinking out of students when presenting them with a confidence interval: https://link.springer.com/article/10.3758/s13423-013-0572-3

fumeux_fume · 107 days ago

As a data scientist, I find applied Bayesian methods to be incredibly straightforward for most of the common problems we see like A/B testing and online measuring of parameters. I dislike that people usually first introduce Bayesian methods theoretically, which can be a lot for beginners to wrap their head around. Why not just start from the blissful elegance of updating your parameter's prior distribution with your observed data to magically get your parameter's estimate?

oliver236 · 107 days ago

can you explain what you're saying please?

hawtads · 107 days ago

I think it would be interesting if frequentist stats can come up with more generative models. Current high level generative machine learning all rely on Bayesian modeling.

DeathArrow · 107 days ago

Most ML algorithms, be it SVM, random forest or neural networks require parameter tuning. That in itself is using bayesian statistics.

jmalicki · 107 days ago

I'm not well versed enough, but what would a frequentist generative model even mean?

The entire generative concept implicitly assumes that parameters have probability distributions themselves that naturally give rise to generative models...

You could do frequentist inference on a generative model, sure, but generative modelling seems fundamentally alien to frequentist thinking?

jononor · 105 days ago

Highly recommend Statistical Rethinking for anyone looking for practical/applied/intuitive approach to Bayesian Statistics. For example the 2023 lecture series: https://youtu.be/FdnMWdICdRs?is=KycmwPL-cn8clOK5

7777777phil · 107 days ago

Most ML practitioners use L1/L2 daily without realizing they're making Bayesian prior assumptions. Gaussian prior = Ridge, Laplace prior = Lasso. Once you see it that way, "choosing a regularization strength" is really "choosing how informative your prior is."

bvan · 107 days ago

Nicely done. I have the same challenge with Bayesian stats and usually do not understand why there is such controversy. It isn’t a question of either/or, except in the minds of academics who rarely venture out into the real world, or have to balance intellectual purity with getting a job done.

In the very first example, a practitioner would consciously have to decide (i.e. make the assumption) whether the number of side on the die (n) is known and deterministic. Once that decision is made, the framework with which observations are evaluated and statistical reasoning applied will forever be conditional on that assumption.. unless it is revised. Practitioners are generally OK with that, whether it leads to ‘Bayesian’ or ‘frequentist’ analysis, and move on.

eimrine · 106 days ago

What a stupid idea to put a pesky GIF in the middle of enough complicated article. Isn't the freaking article supposed to be readable? Play your stupid meme once than go south. It even optimized to look good in a loop-form, what a stupid mad world if some math article causes a trouble like this.

lottin · 107 days ago

> In Bayesian statistics, on the other hand, the parameter is not a point but a distribution.

To be more precise, in Bayesian statistics a parameter is random variable. But what does that mean? A parameter is a characteristic of a population (as opposed to a characteristic of a sample, which is called a statistic). A quantity, such as the average cars per household right now. That's a parameter. To think of a parameter as a random variable is like regarding reality as just one realisation of an infinite number of alternate realities that could have been. The problem is we only observe our reality. All the data samples that we can ever study come from this reality. As a result, it's impossible to infer anything about the probability distribution of the parameter. The whole Bayesian approach to statistical inference is nonsensical.

lisa · 106 days ago

The article frames prior selection as straightforward, but in practice isn't that where most of the actual difficulty lives? I've seen analyses where the "weakly informative" prior choice quietly dominated the posterior. Has anyone found reliable heuristics for when prior choice stops mattering?