Is It Luck or Skill? The Fight Over How to Judge a Tea Taster
The Girl Who Knew Her Tea

Picture a student who makes a bold claim: she can taste the difference between tea poured before milk and milk poured before tea. It sounds impossible—most people can’t. You decide to test her. For each of five cups, you flip a fair coin. Heads means milk first, tails means tea first. She tastes each cup and writes down which order she thinks it was.
She gets every single one right.
Now you face a question that scientists, doctors, and even your own brain wrestle with every day: does this result prove she has a real talent, or could it just be dumb luck? And how do you decide fairly?
This is the core puzzle of statistics, the branch of math and philosophy that tries to squeeze truth out of data. And it’s not just about tea—the very same logic powers everything from clinical drug trials to your TikTok feed. So let’s sit down with this tea-tasting girl and see what the big thinkers have been fighting over for a hundred years.
The Null Hypothesis: Assuming She’s Just Guessing

Before we get impressed, we start with a skeptical idea called the null hypothesis. The null hypothesis says, “Nothing special is happening here. She’s just guessing.” In this case, if she guesses randomly, she has a 50% chance of getting each cup right. From that one number, probability theory (the math of chance) lets us work out how likely any particular run of right and wrong answers would be.
Our data is clear: five correct guesses. Under the null hypothesis, the probability of that exact outcome is ½ × ½ × ½ × ½ × ½ = 1/32, or about 3%. That’s pretty small. If you were betting, you’d be surprised.
But here’s the tricky part. If a million students all guessed blindly, about 31,250 of them would also get five in a row just by luck. So is this one student one of those lucky guessers, or does she really have the skill? The number 3% alone doesn’t answer that—it only tells us how rare the result would be if nothing but chance were at work.
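The arithmetic in this section is easy to check by brute force. The sketch below lists every possible pattern of right and wrong answers for five cups and counts how many are perfect. The variable names are made up for illustration, and the only assumption is the null hypothesis itself: every pattern is equally likely.

```python
from fractions import Fraction
from itertools import product

CUPS = 5  # number of cups in the test, as in the story

# Every possible pattern of right (True) and wrong (False) answers.
patterns = list(product([True, False], repeat=CUPS))

# Under the null hypothesis each pattern is equally likely,
# so the chance of a perfect score is (perfect patterns) / (all patterns).
all_right = [p for p in patterns if all(p)]
prob_all_right = Fraction(len(all_right), len(patterns))

# Expected number of perfect scores among a million pure guessers.
lucky_guessers = 1_000_000 * prob_all_right

print(prob_all_right)         # 1/32
print(float(prob_all_right))  # 0.03125
print(lucky_guessers)         # 31250
```

Only one of the 32 patterns is a perfect score, which is where the 3% comes from.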
The Classical Way: Reject When It’s Too Unlikely

The most common approach to this puzzle, taught in almost every science class, grew out of work by statisticians like Ronald Fisher in the 1920s. They said: before you run the experiment, pick a threshold for surprise, usually 5%. Then work out how probable it would be, under the null hypothesis, to see a result at least as striking as the one you observed. If that probability falls below the threshold, you reject the null. In the tea case nothing beats a perfect score, so the probability is just the 3% from before. That is less than 5%, so Fisher’s rule tells you to reject “she’s just guessing.”
The probability you calculate this way (3%) is called a p‑value, and the pre‑chosen boundary (5%) is the significance level. If p < 0.05, you announce that the result is “statistically significant.” Many scientists treat this as a green light that their discovery is real.
But already there’s a problem. The p‑value doesn’t tell you how likely it is that she really has the skill. It only tells you how often you’d get data like this if she didn’t. Imagine a lottery: winning the jackpot is extremely unlikely, but that doesn’t mean the winner must have cheated. Yet that’s exactly how many people misread p‑values: they treat an unlikely event under the null as proof that the null is false. To judge that, you also need to know how common real tea‑tasting talent is in the first place. If skill is super rare, even five correct guesses might still be luck.
Ignoring how common the talent is has a name: the base‑rate fallacy. Classical statistics struggles to connect its rules to what you actually should believe.
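Here is a minimal sketch of the classical recipe in Python. It treats the p‑value as the chance of doing at least this well by pure guessing and compares it with a 5% cutoff. The function name `p_value` and the constant `ALPHA` are illustrative choices, not standard library names.

```python
from math import comb

def p_value(correct, cups, chance=0.5):
    """Probability of at least `correct` right answers out of `cups`
    if every answer is a guess that succeeds with probability `chance`."""
    return sum(
        comb(cups, k) * chance**k * (1 - chance)**(cups - k)
        for k in range(correct, cups + 1)
    )

ALPHA = 0.05  # significance level, chosen before the experiment

for correct in (5, 4):
    p = p_value(correct, 5)
    verdict = "reject the null" if p < ALPHA else "do not reject"
    print(f"{correct}/5 correct: p = {p:.3f} -> {verdict}")
```

A perfect score gives p = 1/32 and clears the bar. A single slip (four out of five) gives p = 6/32, roughly 19%, and does not. Notice that nothing in the calculation mentions how common real talent is.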
The Bayesian Flip: Updating Your Beliefs

There’s another camp, named after the eighteenth‑century thinker Thomas Bayes. Bayesian statistics lets you talk directly about how probable a hypothesis is—not just whether to reject it. It does this by treating probabilities as credences: numbers that express how strongly you believe something, from 0 (impossible) to 1 (certain).
You start with a prior probability, which is your best guess before seeing the data. Then, as data rolls in, you update to a posterior probability using a rule called Bayes’ theorem. The math can look fiddly, but the idea is intuitive: new evidence shifts your confidence.
In the tea example, suppose that before the test you doubt she has the ability and give it only a 1‑in‑3 chance. Suppose also that having the ability would mean getting each cup right about three times out of four, rather than every single time. After she nails all five cups, Bayes’ formula raises your 1‑in‑3 to about 4‑in‑5. Your belief got a huge boost, but it’s still not 100%, because you factored in your initial doubt. The Bayesian approach feels natural: you’re always adjusting your opinions as you learn. It also respects a principle that many philosophers find essential: only the data you actually saw should matter.
But it forces you to pick a prior. And if your initial hunch is way off, your final answer might be too. Critics worry that priors can be too personal—like letting your gut choose the result.
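The jump from 1‑in‑3 to about 4‑in‑5 can be reproduced in a few lines of Python, but only after adding one ingredient: how often a genuinely skilled taster gets a cup right. The sketch below assumes 75%, a hypothetical figure chosen because it lands close to 4‑in‑5. The function name and its parameters are illustrative too.

```python
def posterior_skill(prior, correct, cups, skill_rate=0.75, guess_rate=0.5):
    """Bayes' theorem for two rival hypotheses: 'skilled' vs 'guessing'.

    skill_rate is an ASSUMPTION: the chance a skilled taster gets one cup right.
    """
    wrong = cups - correct
    # Likelihood of this exact run of answers under each hypothesis.
    like_skill = skill_rate**correct * (1 - skill_rate)**wrong
    like_guess = guess_rate**correct * (1 - guess_rate)**wrong
    # Total probability of the data, weighted by the prior.
    evidence = prior * like_skill + (1 - prior) * like_guess
    return prior * like_skill / evidence

print(round(posterior_skill(1/3, 5, 5), 2))  # 0.79
print(posterior_skill(1/1000, 5, 5))         # a much more skeptical prior
```

With the 1‑in‑3 prior the posterior comes out near 0.79. Start instead from a 1‑in‑1000 prior and the same five correct cups leave you below 1%, which is the critics’ worry in action: the prior you pick can dominate the answer.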
The Stopping Rule Scandal

Here’s a twist that has divided statisticians for decades. Imagine two researchers both test the same student. The first, a careful planner, decides to run exactly six trials. The second, a bit impatient, says, “I’ll stop the moment she gets one wrong, or after six trials, whichever comes first.”
As it happens, the student gets the first five right and the sixth wrong. Both researchers end up with the exact same data: five successes, one failure. But under the classical rules, the impatient researcher can reject the null hypothesis (“she’s just guessing”), while the careful planner cannot. The difference boils down to what other data each researcher could have seen. The impatient plan allows only a handful of possible outcomes (a smaller sample space, in the jargon), so fewer of them count as at least as impressive as what actually happened, and the same observed result clears the 5% hurdle.
This is called optional stopping, and it leads straight to a deep philosophical principle. The likelihood principle says that two experiments with the same actual data should give the same evidence for a hypothesis, regardless of how the experiment was planned or when you decided to stop. Bayesian statistics obeys this principle. Classical statistics often violates it. Some philosophers, like Deborah Mayo, argue that the stopping plan does matter, because it reveals how severely you tested the idea. If you peek at the data and stop when it looks good, you haven’t really put the hypothesis through a tough trial.
The debate forces you to ask: should evidence depend on what you would have done, or only on what did happen?
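The two researchers’ p‑values can be worked out exactly. The sketch below does it with fractions, assuming the usual classical convention that the p‑value adds up every outcome at least as favorable to the student as the one observed. For the impatient researcher, “more favorable” is taken to mean a longer run of correct answers before the first miss. The variable names are illustrative.

```python
from fractions import Fraction
from math import comb

HALF = Fraction(1, 2)  # chance of guessing one cup right
MAX_CUPS = 6

# Careful planner: always six cups.
# Outcomes at least as good as the data: 5 or 6 correct out of 6.
p_planner = sum(comb(MAX_CUPS, k) for k in (5, 6)) * HALF**MAX_CUPS

# Impatient researcher: stop at the first miss, or after six cups.
# Possible outcomes: first miss on cup 1, 2, ..., 6, or no miss at all.
# Outcomes at least as good as the data: miss on cup 6, or no miss.
p_miss_on_sixth = HALF**5 * HALF   # five hits, then a miss
p_no_miss = HALF**MAX_CUPS         # six hits in a row
p_impatient = p_miss_on_sixth + p_no_miss

print(p_planner, float(p_planner))      # 7/64 0.109375
print(p_impatient, float(p_impatient))  # 1/32 0.03125
```

Same six cups, same answers: about 11% for the planner, which misses the 5% cutoff, and about 3% for the impatient researcher, which clears it.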
From Teacups to TikTok: Why You Should Care

You might never blindfold a classmate and brew five cups of tea. But you rely on statistical reasoning all the time. When you hear that a new medicine “works,” that news came from a clinical trial that used null hypothesis testing or Bayesian updates. When YouTube decides which video to recommend next, an algorithm is quietly updating something like a Bayesian belief about what you’ll enjoy. The very standards we use to call something a “discovery” in psychology, economics, or astronomy are hotly contested because of the philosophical cracks you’ve just seen.
In recent years, many scientific findings have failed to hold up when other labs try to repeat them, a problem widely known as the replication crisis. Some blame the misuse of p‑values and the pressure to hit that magic 5% significance level. Others think we’d be better off if more scientists thought like Bayesians, openly stating their prior beliefs and updating them publicly. And the optional stopping wrangle has real consequences: a drug company that stops a trial early because results look promising might be selling you false hope.
So next time you face a claim that sounds too good to be true—a friend who always guesses the coin flip, a supplement that promises perfect skin—remember the tea‑tasting student. The data alone rarely hands you certainty. How you planned to look, what you already believed, and when you stopped watching all shape what you conclude. Paying attention to that, philosophers say, is how you learn to be smart about evidence, not just impressed by it.
Think about it
- Imagine a friend claims she can predict the outcome of a coin toss. You flip five times, and she gets all five right. Would you believe she has a special ability? A classical test at the 5% level would reject “just luck” here, so why might you still suspect luck, and what extra information would you want?
- Two scientists test a new vitamin. Scientist A planned to stop after 100 people and sees a promising result already. Scientist B planned to test 200, and after 100 people the data is exactly the same, but B says it’s too early to be sure. Whose judgment would you trust more, and why?
- If the same data can lead to opposite conclusions depending on when you decided to stop collecting it, does that mean data alone never “speaks for itself”? What else should go into a good decision?





