Does the Treatment Work? It Depends Who You Ask
The Doctor’s Puzzle

Dr. Dana runs a small trial of a new heart medicine. She splits 52 patients into two groups: 40 get the drug, 12 get a placebo. After a month, she counts recoveries. Overall, 20 out of 40 treated patients recover (50%) and 6 out of 12 untreated patients recover (50%). The drug appears to be useless.
But Dr. Dana is careful. She looks at men and women separately. Among men, 8 out of 13 treated recover (about 62%), while 4 out of 7 untreated recover (57%). Among women, 12 out of 27 treated recover (44%), compared to 2 out of 5 untreated (40%). For both men and women, treatment raises the chance of recovery. Yet when she puts everyone together, the benefit vanishes.
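The arithmetic is easy to check by hand, but a few lines of Python make the pattern hard to miss. This is only a sketch that re-tabulates the counts given above; the dictionary layout and the helper names `rate` and `pooled` are choices made here for illustration.

```python
from fractions import Fraction

# (recovered, total) counts taken from Dr. Dana's trial as described above.
counts = {
    ("men", "treated"): (8, 13),
    ("men", "control"): (4, 7),
    ("women", "treated"): (12, 27),
    ("women", "control"): (2, 5),
}

def rate(recovered, total):
    """Recovery rate as an exact fraction."""
    return Fraction(recovered, total)

def pooled(arm):
    """Pool men and women within one arm of the trial."""
    recovered = sum(rec for (sex, a), (rec, tot) in counts.items() if a == arm)
    total = sum(tot for (sex, a), (rec, tot) in counts.items() if a == arm)
    return rate(recovered, total)

# Within each sex, compare the treated and control recovery rates.
for sex in ("men", "women"):
    treated = rate(*counts[(sex, "treated")])
    control = rate(*counts[(sex, "control")])
    print(f"{sex}: treated {float(treated):.1%}, control {float(control):.1%}")

# Pooled over both sexes, the two arms come out identical.
print(f"pooled: treated {float(pooled('treated')):.1%}, "
      f"control {float(pooled('control')):.1%}")
```

Within each sex the treated rate is higher, yet the two pooled rates are exactly equal.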
This is not a trick. It is a real phenomenon that puzzled statisticians for over a century. The problem even has a name: Simpson’s Paradox, after the statistician Edward H. Simpson, who wrote about it in 1951—though earlier researchers like Karl Pearson and George Udny Yule had already noticed it. The paradox shows that how you group data can flip a conclusion from “the treatment works” to “it does nothing” and back again.
It’s All About the Weights

Why does the drug’s effect disappear? In Dr. Dana’s trial, the groups were not balanced. Most treated patients were women (27 out of 40), and women overall had lower recovery rates than men, regardless of treatment. Meanwhile, the control group was mostly men (7 out of 12). Because the mixtures were so different, the overall average recovery rate got pulled around.
Imagine two baskets of apples and bananas. Apples are cheaper than bananas on average. Basket A has mostly apples, Basket B has mostly bananas. The average price of the fruit in a basket depends heavily on how many of each fruit it holds, so Basket A can look cheaper even if each of its fruits costs a little more than the same fruit in Basket B. In the same way, the overall success rate in the treatment group is a weighted average of the success rates for men and women, with weights equal to the share of men and women in that group. The same is true for the control group. When the weights differ (more women in treatment, more men in control), the overall averages can end up equal, even though treatment helps each subgroup.
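Written out with the trial's own numbers, the weighted averages look like this. The notation is added here for illustration; in each product the first factor is the weight (the share of men or women in that group) and the second is that subgroup's recovery rate.

```latex
\begin{align*}
P(\text{recover} \mid \text{treated})
  &= \tfrac{13}{40}\cdot\tfrac{8}{13} + \tfrac{27}{40}\cdot\tfrac{12}{27}
   = \tfrac{20}{40} = 50\% \\
P(\text{recover} \mid \text{control})
  &= \tfrac{7}{12}\cdot\tfrac{4}{7} + \tfrac{5}{12}\cdot\tfrac{2}{5}
   = \tfrac{6}{12} = 50\%
\end{align*}
```

The treated group puts most of its weight (27/40) on women, who recover less often, while the control group puts most of its weight (7/12) on men. That difference in weights is what cancels the benefit seen within each gender.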
The Hidden Cause: Confounders and Back Doors

Why were there more women in the treatment group? In a real study, something might influence both who gets treated and who recovers. That “something” is called a confounder. Here, gender acts as a confounder: it affects whether someone ends up in the treatment group and it affects recovery chances. When a confounder links treatment and recovery in this way, the two can look related (or unrelated) for reasons that have nothing to do with the drug itself, and the raw numbers mislead.
To see the causal effect of the drug, that is, whether it actually makes people better, we need to imagine forcing everyone’s assignment so that gender can no longer influence it. Computer scientist Judea Pearl uses a special symbol, do(X), to represent this kind of forced intervention. The ordinary probability P(recovery | treatment) just tells you what happened among the people who chose (or were assigned) to take the drug. The causal quantity P(recovery | do(treatment)) tells you what would happen if you made everyone take it, cutting off gender’s influence on who gets treated. Gender can still affect recovery itself; it just no longer decides who gets the drug. When a confounder is present, these two numbers can be very different.
Pearl also developed the back‑door criterion. Think of a hidden back door: a path that runs from treatment back to gender and then forward to recovery, sneaking extra information into the correlation. To see the true causal arrow from the drug to recovery, you must block that door. One way is to look at men and women separately, then average the results using the same weights for the treated and untreated groups: each gender’s share of the whole population. So deciding to partition by gender is not just a statistical move. It is a causal judgment.
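One way to read the averaging step above is as standardization: take each gender's recovery rate and weight it by that gender's share of all 52 patients, using the same weights for both arms. The sketch below does this for Dr. Dana's counts. It assumes, as the story does, that gender is the only confounder; the names `share` and `adjusted_rate` are illustrative.

```python
from fractions import Fraction

# (recovered, total) per sex and arm, from the trial described earlier.
counts = {
    "men": {"treated": (8, 13), "control": (4, 7)},
    "women": {"treated": (12, 27), "control": (2, 5)},
}

# Share of each sex among all patients, both arms combined.
n_total = sum(tot for arms in counts.values() for _, tot in arms.values())
share = {
    sex: Fraction(sum(tot for _, tot in arms.values()), n_total)
    for sex, arms in counts.items()
}

def adjusted_rate(arm):
    """Sum over sexes of P(recover | arm, sex) * P(sex).

    Assumes gender is the only confounder (a stated assumption).
    """
    return sum(Fraction(*counts[sex][arm]) * share[sex] for sex in counts)

treated = adjusted_rate("treated")
control = adjusted_rate("control")
print(f"adjusted treated: {float(treated):.1%}")
print(f"adjusted control: {float(control):.1%}")
print(f"adjusted difference: {float(treated - control):.1%}")
```

With both arms given the same gender mix, the treated rate comes out above the control rate, matching what the separate tables for men and women already suggested.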
Why Our Brains Get Tricked

Why does Simpson’s Paradox feel so shocking? Many people assume that if a treatment works in every subgroup, it must work overall. But that rule only holds when the treated and untreated groups contain the same mix of subgroups or, in causal language, when intervening on the treatment does not change the mix of subgroups. Pearl argues that we confuse two ideas: seeing that treated people recover more (an observation) versus making people take the treatment (an intervention). If you interpret the probabilities as causal, and the treatment cannot change who is a man or a woman, a reversal is impossible. When you read them as mere observations, the reversal is mathematically fine. Our minds tend to default to causal interpretations without noticing the difference.
Another explanation comes from philosopher Branden Fitelson. He notes that when we say “if you’re a woman, treatment raises your chance of recovery,” the phrase can be read in two ways. The suppositional reading: suppose you are a woman; then treatment raises your chance. The conjunctive reading: being a female‑treatment‑receiver raises your chance compared to everyone else. In Simpson’s cases, the suppositional reading is true for each gender, but the conjunctive reading can be false—a treated woman might actually be less likely to recover than the overall group. People easily miss this difference, and that contributes to the sense of paradox.
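The two readings can be checked against Dr. Dana's own numbers. The sketch below follows the description above, comparing treated women first with untreated women and then with the whole trial; the variable names are illustrative.

```python
from fractions import Fraction

# Recovery counts from the trial: (recovered, total).
treated_women = (12, 27)
untreated_women = (2, 5)
everyone = (26, 52)  # 20 treated + 6 untreated recoveries out of 52 patients

# Suppositional reading: among women, compare treated with untreated.
suppositional = Fraction(*treated_women) > Fraction(*untreated_women)

# Conjunctive reading: compare treated women with the whole trial.
conjunctive = Fraction(*treated_women) > Fraction(*everyone)

print("suppositional reading holds:", suppositional)
print("conjunctive reading holds:", conjunctive)
```

For these counts the first comparison comes out true and the second false: a treated woman does better than an untreated woman, but worse than the trial as a whole.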
No matter which explanation you favor, one lesson is clear: Simpson’s Paradox is not a mathematical mistake but a deep lesson about how we interpret numbers and causes.
The Paradox in the Real World

Simpson’s Paradox isn’t just a puzzle in a doctor’s office. In 1973, the University of California, Berkeley noticed that a lower percentage of women were admitted to graduate school than men. Critics suspected discrimination. But researchers Peter Bickel, Eugene Hammel, and J. William O’Connell looked department by department. Most departments showed no bias against women, and some admitted women at higher rates than men. The paradox arose because women applied more often to departments with very low acceptance rates. The department was a mediator, a middle step from gender to the outcome. To judge whether a department discriminated, you have to look within each department, not only at the overall rate.
The same pattern appears in education. Between 1992 and 2002, average verbal SAT scores in the United States rose from 501 to 516. But when the data were broken down by letter grade (A+, A, A‑, B, C), scores within each grade fell. Grade inflation moved the strongest students of each grade category up into the next one. They had been the top scorers in the category they left and became the weakest in the category they joined, so every category’s average fell even while the overall average crept up. If you saw only the overall rise, you might assume students were improving; the breakdowns told a different story.
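A toy example shows how regrouping alone can produce this pattern. The scores below are invented for six hypothetical students and are not real SAT data; they only illustrate the mechanism.

```python
from statistics import mean

# Invented scores for six hypothetical students, NOT real SAT data.
year1 = {"A": [600, 580], "B": [520, 500], "C": [440, 420]}

# Year 2: every student scores 5 points higher, but laxer grading
# moves the top student of B into A and the top student of C into B.
year2 = {"A": [605, 585, 525], "B": [505, 445], "C": [425]}

def overall(groups):
    """Mean score across all students, ignoring grade."""
    return mean(score for scores in groups.values() for score in scores)

# Compare each grade's average, then the overall average, across years.
for grade in year1:
    print(grade, mean(year1[grade]), "->", round(mean(year2[grade]), 1))
print("overall", overall(year1), "->", overall(year2))
```

Every student improved, so the overall average rises, yet each grade's average falls because the grade boundaries moved.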
Even sports are affected. A baseball player’s batting average can drop over several seasons, yet when you separate pitches by difficulty—fastballs versus tricky curveballs—the player might be hitting both better than before. The catch is that pitchers started throwing far more tough pitches, so the overall average slid while the player’s true skill improved.
And in biology, Simpson’s Paradox helps explain how altruism can evolve. An altruistic animal helps others at a cost to itself. Within any single group, selfish individuals are fitter than altruists. Yet groups with many altruists can outgrow groups with few, so the altruists’ share of the total population can rise even as it falls inside every group. If you only look at fitness within each group, altruism seems doomed. But once you treat whole groups as the unit of interest, the paradox vanishes, and the door to group‑level selection opens.
What’s the Right Choice? Asking Better Questions

So what should Dr. Dana do? Now that she understands the paradox, she can see that the overall 50% was a mirage caused by uneven grouping: within each gender, treated patients recovered more often. If gender is the only confounder, the evidence favors prescribing the drug to all patients, regardless of gender, though with only 52 patients she would want a larger trial to confirm it. The lesson is not to distrust all statistics, but to ask smarter questions. Whenever you see a headline like “more students are failing” or “new treatment shows no benefit,” you should wonder: are the groups balanced? Is there a hidden confounder? Would the story change if you looked at men and women separately, at different schools, or at different levels of difficulty?
Simpson’s Paradox shows that the big picture can mislead when it buries important differences. By thinking causally—imagining what would happen if you made everyone take the treatment, or by blocking back‑door paths—you can uncover truths that raw numbers alone might hide. The next time you meet a surprising statistic, you can be like Dr. Dana: peel the data apart, hunt for confounders, and discover what is really going on.
Think about it
- A school principal says, “Students who eat breakfast score higher on tests.” Could it be that breakfast‑eaters also have other advantages, like more adult support in the morning? How would you design a fair test to see if breakfast really causes higher scores?
- If you were a doctor and a drug trial showed the treatment helped every age group but seemed useless overall, would you prescribe it? What would you need to know before making your decision?
- A sports announcer claims a tennis player’s first‑serve percentage dropped this season, yet the player’s coach says the serve actually improved. How could both statements be true? (Think about the kinds of serves she attempted.)





