Philosophy for Kids

When Science Can't Repeat Itself: A Crisis in the Lab

The Baking Soda Volcano That Refused to Repeat

Your first try was huge. The second one fizzled out early. Did you do something wrong — or was the first eruption a fluke?

You build a baking soda volcano for the science fair. The first time, it erupts like a fountain — you mark the height on a ruler. The next morning you do exactly the same steps, same vinegar, same amount of baking soda. This time the foam barely clears the rim. Did the first volcano prove your design works, or was it just an especially lucky batch? And if a friend tries your recipe and gets a different result, whose experiment tells the truth?

Scientists face this question all the time. When a study claims something exciting (a drug relieves pain, a teaching trick boosts memory), the only way to know if it is real is to repeat it. But repeating is trickier than it sounds, and when a massive team of researchers tried to redo a hundred famous psychology studies, they got a shock. Fewer than half gave the same result the second time around. This shook confidence in whole fields of science and forced everyone to ask: when a result cannot be repeated, what counts as knowledge?

A Hundred Famous Studies, Checked: The Replication Crisis

When 270 researchers recreated 100 famous experiments, most graphs looked nothing like the originals.

By 2015, psychologists had grown uneasy. A few headline-grabbing findings, like the claim that hearing Mozart makes you briefly smarter, had crumbled when other labs tried to repeat them. So an enormous team, the Open Science Collaboration, set out to directly replicate one hundred studies published in top journals in 2008. In a direct replication, you follow the original recipe as exactly as possible: the same methods, the same kind of participants, the same measurements. Their goal was simply to see whether the results held up.

The answer was sobering. Only about 39 percent of the original findings could be matched on a key statistical measure, roughly four out of ten. Effect sizes, which measure how strong a pattern is, shrank by about half in the replications. Even when the team used a broader standard, fewer than half of the studies “replicated.” (Another project, the Many Labs collaboration, had better luck with thirteen classic effects, eleven of which survived, but that was an exception.)

This became known as the replication crisis. It didn’t mean that all those original findings were fake. But it showed that a single published experiment, no matter how clever, often isn’t enough to settle a question. And it raised an uncomfortable possibility: that the scientific record was packed with results that looked real but would vanish under a second attempt.

Why Do So Many Results Vanish?

For every exciting result that gets published, many “boring” ones stay locked in a file drawer.

To understand the mess, you need to know about the p-value. Very roughly, a p-value is a number that tells you how surprising your data would be if nothing real were going on, that is, if the result were just dumb luck. The smaller the p-value, the more surprising the data. Many scientists treat a p-value below 0.05 as a green light: “significant!” So there is enormous pressure to get that magic number.
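If you like to code, here is one rough way to see what a p-value measures. The sketch below uses a shuffle test: it mixes up which eruption belongs to which recipe many times and counts how often luck alone produces a gap as big as the real one. The function name `permutation_p_value` and the eruption heights are invented for illustration, and this is only one of several ways scientists calculate p-values.

```python
import random


def permutation_p_value(group_a, group_b, n_shuffles=10_000, seed=0):
    """Estimate a p-value by shuffling the labels on the measurements.

    Returns the fraction of random reshuffles whose gap between group
    averages is at least as large as the gap actually observed.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n_a = len(group_a)

    def gap(values):
        # Absolute difference between the averages of the two groups.
        first, second = values[:n_a], values[n_a:]
        return abs(sum(first) / len(first) - sum(second) / len(second))

    pooled = list(group_a) + list(group_b)
    observed = gap(pooled)
    as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)  # pretend the recipe labels mean nothing
        if gap(pooled) >= observed - 1e-12:  # small tolerance for rounding
            as_extreme += 1
    return as_extreme / n_shuffles


# Invented eruption heights in centimeters, for illustration only.
old_recipe = [12, 15, 11, 14, 13, 12]
new_recipe = [14, 13, 16, 12, 15, 14]
print(permutation_p_value(old_recipe, new_recipe))
```

A result close to 1 means shuffled labels match the real gap almost every time, so the gap is no surprise. A result close to 0 means luck alone almost never produces a gap that big.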

That pressure creates a cascade of problems. First, journals strongly prefer to publish novel, positive, significant results — a bias against “null” findings that show no effect. As a result, many studies that failed to find anything get tossed into a file drawer and never see daylight, a habit called the file-drawer problem. If you only see the successful volcanoes, you might conclude that all volcanoes erupt spectacularly. But the quiet ones were just hidden.
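The hidden-volcano idea can be tried out in a small simulation. This sketch assumes a made-up world where the true effect is exactly zero, runs many pretend studies, and "publishes" only the ones that come out positive and significant. The names `file_drawer_demo` and `average`, the group size of 20, and the simplified test (scores with a known spread of 1) are all assumptions chosen for illustration.

```python
import math
import random


def file_drawer_demo(n_studies=2000, group_size=20, seed=2):
    """Simulate many studies of an effect that is truly zero.

    Returns (all_effects, published_effects), where a study is "published"
    only if its result is positive and crosses the usual 0.05 line.
    """
    rng = random.Random(seed)
    # Standard error of a difference between two group averages when
    # every score has standard deviation 1 (an assumption of this sketch).
    standard_error = math.sqrt(2 / group_size)
    all_effects, published_effects = [], []
    for _ in range(n_studies):
        treated = [rng.gauss(0, 1) for _ in range(group_size)]
        control = [rng.gauss(0, 1) for _ in range(group_size)]
        effect = sum(treated) / group_size - sum(control) / group_size
        all_effects.append(effect)
        if effect / standard_error > 1.96:  # "exciting" result: sent to a journal
            published_effects.append(effect)
        # every other study stays in the file drawer
    return all_effects, published_effects


def average(values):
    return sum(values) / len(values)


all_effects, published = file_drawer_demo()
print("average effect across every study:", round(average(all_effects), 3))
print("average effect in published studies:", round(average(published), 3))
print("studies published:", len(published), "of", len(all_effects))
```

In this pretend world nothing real is happening, yet a reader who sees only the published pile would find a clear effect, because only the lucky studies made it out of the drawer.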

Second, researchers have plenty of wiggle room to nudge a borderline result into significance. They might stop collecting data as soon as they hit their p-value, drop a few “outlier” points, or test many measures but report only the ones that worked. These practices are called p-hacking and cherry-picking. None of them is outright fraud, but together they pump up the number of false positives in the literature — effects that look real but are just noise.
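One of these tricks, stopping as soon as the p-value looks good, is easy to test on a computer. The sketch below invents studies in which there is no real effect at all, then compares an honest plan (test once, at the end) with a peeking plan (test after every few people and stop at the first "hit"). The function names, the peeking schedule, and the simplified test with a known spread of 1 are assumptions made for this illustration.

```python
import math
import random


def z_score(values):
    """z statistic for 'is the average different from zero?' (known sd of 1)."""
    n = len(values)
    return (sum(values) / n) * math.sqrt(n)


def false_positive_rates(n_studies=2000, max_n=100, first_look=10, step=5, seed=3):
    """Compare an honest fixed-size study with one that keeps peeking.

    Every simulated study has NO real effect, so each "significant"
    result is a false positive. Returns (fixed_rate, peeking_rate).
    """
    rng = random.Random(seed)
    # With the defaults, the peeker checks at 10, 15, ..., 100 people.
    looks = range(first_look, max_n + 1, step)
    fixed_hits = 0
    peeking_hits = 0
    for _ in range(n_studies):
        scores = [rng.gauss(0, 1) for _ in range(max_n)]
        # Honest plan: test once, after all the data are in.
        if abs(z_score(scores)) > 1.96:
            fixed_hits += 1
        # Peeking plan: claim success if any look along the way "works".
        if any(abs(z_score(scores[:n])) > 1.96 for n in looks):
            peeking_hits += 1
    return fixed_hits / n_studies, peeking_hits / n_studies


fixed_rate, peeking_rate = false_positive_rates()
print("false positives, one planned test:", fixed_rate)
print("false positives, peeking as you go:", peeking_rate)
```

The honest plan should be fooled by luck about 5 percent of the time, which is what the 0.05 rule promises. The peeking plan gets many chances to be fooled, so its false alarm rate comes out noticeably higher.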

Third, many studies are simply too small. When you test just twenty people, a real but subtle effect can easily be missed. The field calls this low statistical power: the study didn’t have enough horsepower to find what it was looking for. And when a small study does cross the significance line, it is often because luck made the effect look bigger than it really is. Combine a weak engine with the pressure to publish, and you get a recipe for shaky findings that fail when repeated with larger, more careful samples.
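Statistical power can also be estimated by simulation. This sketch assumes there really is a small effect (0.3 standard deviations, a number picked only for illustration) and counts how often pretend studies of different sizes manage to detect it. The function name `estimate_power`, the group sizes, and the simplified test with a known spread of 1 are assumptions, not anything taken from a real study.

```python
import math
import random


def estimate_power(true_effect, group_size, n_studies=2000, seed=4):
    """Fraction of simulated studies that detect a real effect.

    Scores are assumed to have standard deviation 1, so true_effect is
    measured in standard deviations (an assumption of this sketch).
    """
    rng = random.Random(seed)
    standard_error = math.sqrt(2 / group_size)
    detections = 0
    for _ in range(n_studies):
        treated = [rng.gauss(true_effect, 1) for _ in range(group_size)]
        control = [rng.gauss(0, 1) for _ in range(group_size)]
        difference = sum(treated) / group_size - sum(control) / group_size
        if abs(difference / standard_error) > 1.96:  # crosses the 0.05 line
            detections += 1
    return detections / n_studies


small_study = estimate_power(true_effect=0.3, group_size=20)
large_study = estimate_power(true_effect=0.3, group_size=200)
print("chance of spotting the effect with 20 per group:", small_study)
print("chance of spotting the effect with 200 per group:", large_study)
```

Under these assumptions the small study misses the real effect most of the time, while the large one usually finds it. That gap is what "not enough horsepower" means.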

The Circle of Doubt: If Two Experiments Disagree, Which One Is Right?

If your two thermometers disagree, which one do you trust — and how can you check without another thermometer?

When a replication gets a different result, a deeper philosophical puzzle wakes up. Suppose you read a study that claims kids who chew a certain brand of gum blow bigger bubbles. A second researcher repeats the method and finds no difference. The failure could mean the original claim was false. But it could also mean the second attempt was a poor copy — maybe the gum was stale, or the tape measure was different. How do you decide?

Philosopher Harry Collins (born 1943) called this the experimenters’ regress. To know whether an experiment was done correctly, you need to know what the right outcome should be. But to know the right outcome, you need a correctly done experiment. It’s a circle. Collins argued that when a scientific dispute runs deep, additional experiments can’t always break the tie — each side simply doubts the other’s setup. In those cases, who gets believed may depend on whose reputation or social position carries more weight, not just on the data.

Other thinkers pushed back. Allan Franklin (born 1938) pointed out that scientists have plenty of rational strategies to escape the loop. They can test their instruments independently, eliminate alternative explanations, or measure the same thing in two completely different ways. If both ways agree, the result gains credibility. Uljana Feest (21st century) examined a real debate — whether listening to Mozart actually boosts spatial thinking. When later labs failed to find the effect, the original researcher argued the new studies weren’t true replications because they measured the wrong kind of spatial ability. Feest showed that what looked like an endless argument was really a clash of hidden assumptions about what “Mozart effect” meant. Exposing those assumptions through further experiments slowly moved the debate forward.

The dispute isn’t settled among philosophers. Some think that good methods can always, in principle, decide whether a replication is faithful. Others, like Collins, think that when a controversy gets deep enough, the evidence underdetermines the choice, and nonscientific factors creep in. Either way, the crisis forced scientists to admit that “repeating” something is never a simple recipe test — it always involves judgment calls about what counts as a faithful copy.

How Science Is Learning to Check Itself

New badges for sharing data and preregistering plans encourage scientists to make their work checkable.

The good news is that the crisis triggered a wave of reforms. If journals reward novelty and significance too much, then the whole system needs an upgrade. Here are some of the most important changes.

Preregistration: Before collecting a single data point, a researcher publicly locks in their question, sample size, and analysis plan. That makes it harder to p-hack after the fact — the plan is time-stamped and visible.

Registered Reports: Journals agree to accept or reject a study based on the introduction, method, and planned analysis, before anyone sees the results. If the plan passes, the paper gets published no matter whether the findings are “significant” or “boring.” This kills the bias against null results.

Open Science Badges: Papers earn badges for sharing their raw data and materials, or for preregistering. When the journal Psychological Science introduced such badges, the rate of data sharing jumped from 3 percent to 39 percent in under two years. That helps other scientists (and curious kids) actually check the work.

Changing values: Many reformers argue that science has drifted away from an old set of ideals, namely communality (sharing results openly), universalism (judging ideas, not who said them), disinterestedness (letting evidence lead), and organized skepticism (questioning everything). Instead, the current culture often rewards secrecy, status, and self-promotion. Emptying the file drawer and making checking normal again is, at its root, an effort to bring those older norms back to life.

None of these fixes is magic. Deeply rooted habits are hard to shift, and people still need jobs and funding. But the conversation has changed, and more and more journals and universities treat replication not as a dull copycat exercise, but as the core of honest science.

From Volcanoes to Vaccines: Why This Matters for You

You can’t repeat every claim you hear. But you can ask: has anyone tried? And what did they find?

Science isn’t just something done by people in white coats. In your own life, you constantly test ideas: will a different study playlist help you focus? Does a certain friend always tell the truth? If a claim rests on a single try (one great volcano, one news report, one social media post), it’s fragile. The best evidence comes from checking again, in similar and slightly different ways, and seeing whether the pattern holds.

The replication crisis taught us that even professional scientists, with all their training, can be fooled by a lucky first try. But it also showed that when people get serious about checking — sharing data, preregistering plans, welcoming skeptical colleagues — the truth has a much better chance of surviving. The question “Does it replicate?” isn’t just a lab slogan. It’s a habit of careful thinking that you can apply to any claim that matters to you.

Think about it

  1. If a friend tells you a new method for remembering spelling words works every time, how many tries would you want to see before you trust it? What if some tries fail — would you blame your friend’s method or your own way of following it?
  2. Imagine two science teams disagree about whether a vitamin improves focus; each side says the other did the experiment wrong. Without doing more experiments, how could you decide which side to believe?
  3. In your everyday life, when is it wise to trust a single experience, and when do you need repeated checks before you’re sure? Give a real example.