How Do Biologists Know What’s True? Experiments, Evidence, and the Search for Causes
Imagine you’re in a kitchen, and you want to know whether a new cleaning spray actually kills germs. You could spray it on one counter and leave another counter alone, then test both. If the sprayed counter has fewer germs, you’d probably conclude the spray works. But what if the spray just smells like lemon and the smell scares germs away? No—that’s silly. But the worry is real: how do you know the spray itself caused the difference, and not something else you didn’t notice?
Biologists face this same puzzle every day, but their questions are much stranger. How do cells know when to divide? What causes a gene to turn on or off? How does a fertilized egg become a full-grown animal with arms, legs, and a brain? To answer such questions, biologists run experiments. But what makes an experiment a good one? And how do scientists know when they’ve actually found a cause, not just an accidental pattern?
This is where philosophy of biology meets the messy reality of the lab. Let’s look at how biologists reason about causes, why some experiments are considered “crucial” while others aren’t, and whether the whole enterprise is as rational as we’d like to think.
The Logic of a Simple Experiment
The most basic kind of experiment is what philosophers call the Method of Difference. The idea is simple: you set up two situations that are identical in every way except one. If something different happens, that one difference must be the cause.
The philosopher John Stuart Mill described this method back in 1843. The great-grandparent of all modern experiments, he said, goes like this: take two cases, one where the thing you’re studying happens, and one where it doesn’t. If you can find a single factor that’s present in the first case and absent in the second, that factor is either the cause, the effect, or an indispensable part of the cause of what you’re studying.
Here’s a real biological example. Suppose you want to know whether a newly discovered chemical kills bacteria. You take a bacterial culture, divide it into equal samples, and add the chemical dissolved in buffer to one group (the treatment) and only the buffer to the other (the control). Then you measure bacterial growth. If the treatment group has less growth, you infer the chemical is an antibiotic.
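To make the logic concrete, here is a minimal sketch of that comparison in Python. The starting counts, the kill rate, and the noise term are all invented for illustration; the point is only to show where the reasoning happens and where the hidden assumption sits.

```python
import random

random.seed(42)

def grow(n_start, hours, kill_rate=0.0):
    """Simulate bacterial growth; kill_rate is the per-hour fraction removed by the chemical."""
    n = n_start
    for _ in range(hours):
        n *= 2                            # idealized doubling each hour
        n *= (1 - kill_rate)              # effect of the chemical, if any
        n *= random.uniform(0.95, 1.05)   # small uncontrolled variation
    return n

control = grow(n_start=1000, hours=6, kill_rate=0.0)    # buffer only
treatment = grow(n_start=1000, hours=6, kill_rate=0.4)  # buffer + chemical

print(f"control:   {control:,.0f} cells")
print(f"treatment: {treatment:,.0f} cells")

# The inference: the two runs differ only in kill_rate, so a large difference
# in final counts is attributed to the chemical. The hidden assumption is that
# the noise term really is the only other difference between the two cultures,
# which is exactly the worry raised below.
```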
This sounds straightforward, but there’s a hidden problem. The method only works if the two situations really are identical in every other respect. But how can you be sure? Maybe one of your samples got a little warmer. Maybe there was a stray chemical on the glassware. Maybe the bacteria in one sample were slightly different. In real labs, scientists go to great lengths to prevent these confounders—they stir the cultures, use sterile equipment, and randomize their samples. But they can never be 100% certain.
Philosophers have noticed something interesting about the Method of Difference. It can actually be turned into a logical deduction, but only if you add some very strong assumptions: that the world is deterministic (everything happens for a reason), that you know all the relevant factors, and that your two situations are truly identical in every way except the one you’re testing. These assumptions are never fully justified. Scientists just have to act as if they’re true and hope for the best.
This has led to the slogan “no causes in, no causes out.” You can’t get causal knowledge from an experiment unless you already have some causal knowledge to begin with—at minimum, you have to know what factors might be relevant and which ones you’ve controlled for. Experimentation doesn’t start from scratch.
Beyond Simple Causes: How Parts Make Up Wholes
Not everything in biology is about one thing causing another. Sometimes scientists want to understand how parts of a system work together to produce a larger phenomenon. This is called the search for mechanisms.
Consider how a signal travels from one neuron to another in your brain. An electrical impulse arrives at the end of one neuron, causing calcium ions to flow in. The calcium then triggers tiny sacs (vesicles) to release neurotransmitter into the gap between neurons. The neurotransmitter binds to receptors on the next neuron, which can make it fire in turn. All of these events are causal connections between parts of the mechanism.
But there’s another kind of relationship here too. The calcium influx, the neurotransmitter release, the binding—these events together constitute the phenomenon of synaptic transmission. They don’t just cause it; they are what it is. This is called mechanistic constitution, and it’s a tricky philosophical idea because the parts and the whole aren’t completely separate things. You can’t have synaptic transmission without the calcium influx—the calcium influx is part of what synaptic transmission is.
How do biologists figure out which parts are genuinely part of a mechanism and which are just along for the ride? They run what philosophers call interlevel experiments. In a bottom-up experiment, you mess with a part and see if the whole phenomenon changes. Block the calcium channels, and synaptic transmission stops—that’s evidence calcium is part of the mechanism. In a top-down experiment, you mess with the whole phenomenon and see if the parts change. Make the brain more active during a memory task, and you can measure increased calcium influx.
If a part and the whole phenomenon can be shown to affect each other in this way, that’s strong evidence the part is genuinely constitutive—it’s part of what makes the phenomenon what it is. This idea, called mutual manipulability, is one of the main ways biologists discover mechanisms.
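A toy model can make the two kinds of interlevel experiment vivid. The function and the numbers below are invented stand-ins, not real neuroscience; the sketch only shows the shape of the reasoning.

```python
def synaptic_transmission(stimulus, calcium_channels_open=True):
    """Return (calcium_influx, transmitter_released) for a given stimulus (arbitrary units)."""
    calcium_influx = stimulus if calcium_channels_open else 0.0
    transmitter_released = 0.8 * calcium_influx   # release depends on calcium
    return calcium_influx, transmitter_released

# Bottom-up experiment: intervene on the part (block the calcium channels)
# and watch whether the whole phenomenon changes.
_, normal = synaptic_transmission(stimulus=10, calcium_channels_open=True)
_, blocked = synaptic_transmission(stimulus=10, calcium_channels_open=False)
print("bottom-up: release", normal, "->", blocked)    # transmission collapses

# Top-down experiment: intervene on the whole phenomenon (drive the system
# harder, as in a demanding memory task) and watch whether the part changes.
ca_low, _ = synaptic_transmission(stimulus=5)
ca_high, _ = synaptic_transmission(stimulus=20)
print("top-down: calcium", ca_low, "->", ca_high)     # calcium influx rises too

# Both interventions move the other level, so by the mutual-manipulability
# criterion the calcium influx counts as a constitutive part of the mechanism.
```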
Can a Single Experiment Settle a Debate?
Biology textbooks are full of famous experiments that supposedly decided between competing theories once and for all. But philosophers are suspicious of the idea of a “crucial experiment.”
The philosopher Pierre Duhem argued back in 1905 that crucial experiments are impossible in physics, and his argument applies to biology too. Here’s why: even if an experiment rules out one hypothesis, that doesn’t prove the remaining hypothesis is true. Maybe the true hypothesis hasn’t been thought of yet. You can eliminate all but one candidate, but you can never be sure you haven’t missed the real answer.
Let’s look at a famous case from biochemistry. In the 1950s and 60s, scientists were trying to understand how cells use the energy from food to make ATP—the molecule that powers almost everything in your body. They knew that respiration (burning food with oxygen) somehow connected to ATP production, but how?
One camp believed there was a chemical intermediate—a special molecule that carried energy from respiration to ATP-making. Another camp, led by the British biochemist Peter Mitchell, proposed a radically different idea: respiration pumps protons (hydrogen ions) across a membrane, creating a concentration gradient, and the flow of protons back across the membrane powers ATP production. This was called the chemiosmotic mechanism.
For years, the evidence was messy. Mitchell could show that respiring mitochondria expelled protons, but his critics said this could be a side effect—the real coupling might still be through a chemical intermediate. Meanwhile, other labs claimed to have found the chemical intermediate, but none of these findings held up.
The controversy dragged on for over a decade. Then in 1974, two biochemists named Ephraim Racker and Walter Stoeckenius did an experiment that many consider decisive. They created artificial membrane bubbles (called vesicles) and inserted two purified proteins into them: one was bacteriorhodopsin, a light-driven proton pump from a salt-loving bacterium; the other was ATP synthase, the mitochondrial enzyme that makes ATP. When they shone light on the vesicles, the ATP synthase started working. There were no respiratory enzymes present, so there was no way a chemical intermediate from respiration could be involved. The only connection was the proton gradient.
This experiment seemed to prove the chemiosmotic mechanism. But was it really “crucial” in Duhem’s sense? Strictly speaking, it only ruled out one specific version of the chemical intermediate hypothesis. It didn’t prove that Mitchell’s mechanism was the only possible explanation. What made the experiment special was something else: it was a beautifully clean test. The scientists had complete control over what was in their test tube. They knew exactly what was there and what wasn’t. This made it very hard for critics to find loopholes.
So maybe the value of a “crucial experiment” isn’t that it proves a theory true, but that it raises the standard of evidence so high that opponents can’t easily challenge it.
The Most Beautiful Experiment in Biology
Another famous case is the Meselson-Stahl experiment on DNA replication. After Watson and Crick discovered the double helix structure of DNA in 1953, three models were proposed for how DNA copies itself:
- Conservative: the original double helix stays intact, and a completely new copy is built alongside it.
- Semi-conservative: the two strands separate, and each serves as a template for building a new partner strand.
- Dispersive: the two strands are cut into pieces, and the pieces are mixed together with new pieces.
In 1957, Matthew Meselson and Frank Stahl designed an experiment to decide between these models. They grew bacteria in a medium containing a heavy form of nitrogen (¹⁵N), so all the DNA became heavy. Then they transferred the bacteria to a medium with normal light nitrogen (¹⁴N). After one round of replication, they extracted the DNA and spun it in an ultracentrifuge—a machine that separates molecules by density.
What they found was beautiful. After one generation, the DNA was of exactly intermediate density: a hybrid of one heavy strand and one light strand. After two generations, there were two bands, one intermediate and one light. This was exactly what the semi-conservative model predicted.
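The arithmetic behind those predictions is easy to write down. Here is a minimal sketch comparing the semi-conservative and conservative models, expressed as fractions of all DNA duplexes present at each generation (the dispersive model is left out for brevity).

```python
def semi_conservative(n):
    """Fractions (heavy, hybrid, light) of duplexes after n generations in light medium."""
    if n == 0:
        return (1.0, 0.0, 0.0)
    hybrid = 2 / 2**n    # the two original heavy strands end up in two hybrid duplexes
    return (0.0, hybrid, 1.0 - hybrid)

def conservative(n):
    """Fractions (heavy, hybrid, light) of duplexes after n generations in light medium."""
    heavy = 1 / 2**n     # the original duplex is never split
    return (heavy, 0.0, 1.0 - heavy)

for n in range(3):
    print(f"generation {n}: semi-conservative {semi_conservative(n)}  conservative {conservative(n)}")

# Generation 1 is the decisive comparison: semi-conservative replication predicts
# a single intermediate (hybrid) band, while conservative replication predicts a
# heavy band plus a light band and no intermediate band at all.
```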
But Meselson and Stahl were cautious. They noted that the hybrid band could also be explained if conservative replication produced molecules that somehow stuck together end-to-end. It took years of follow-up work to fully rule out this possibility. We now know that the experiment actually worked partly because of something the scientists didn’t know at the time: their method of handling the DNA had chopped it into fragments, which prevented the sticking-together problem from occurring.
Philosophers debate what this case tells us. One view is that the Meselson-Stahl experiment wasn’t really a crucial experiment; rather, the argument for semi-conservative replication was an inference to the best explanation. The semi-conservative model explained the data better than the alternatives because it was simpler and didn’t require extra assumptions. But inference to the best explanation is never as strong as direct proof.
Are Scientists Rational? A Social View
If experiments can’t decisively prove theories, how do scientific communities ever reach agreement? This question leads into social epistemology—the study of how groups of people can arrive at knowledge together.
Consider the history of genetics. In the early 1900s, there were several competing approaches to studying heredity. The Biometric school in England used statistics. William Bateson championed Mendel’s laws. Thomas Hunt Morgan’s lab at Columbia University used fruit flies to map genes onto chromosomes. Each approach had its own experimental systems, its own techniques, and its own theoretical commitments.
Morgan’s approach eventually won out. But was this because his theory was more strongly supported by evidence? Or was it because his methods were more productive—generating more publishable results, attracting more students, and proving more useful to other fields?
Some philosophers argue that the choice between competing experimental systems isn’t made by individual scientists weighing evidence. It’s made by the scientific community as a whole, on the basis of fruitfulness. Morgan’s fruit fly system produced a flood of new discoveries. It worked. The other approaches just didn’t produce as much.
This doesn’t mean the choice was irrational. It just means that rationality operates at the level of the community, not the individual. The community selects for experimental practices that are productive, flexible, and able to connect with other areas of research. Over time, this process tends to converge on practices that generate reliable knowledge—but it doesn’t happen through neat Popperian falsification or Bayesian updating.
When Data Lies: The Problem of Artifacts
Sometimes experimental data are wrong. They don’t represent reality at all—they’re artifacts of the experimental procedure itself.
One famous example is the mesosome. For about 20 years, from the 1960s to the 1980s, microbiologists believed that bacterial cells contained a structure called the mesosome—a folded-inward pocket of the cell membrane. They could see it clearly in electron micrographs. The evidence seemed robust.
Eventually, it turned out that mesosomes were artifacts. The chemicals used to prepare bacteria for electron microscopy (fixatives like osmium tetroxide) were actually damaging the membranes, creating the folded structures. When better preparation methods were developed, the mesosomes disappeared.
This case raises a deep question: how can scientists ever tell if their data are real or artifactual? If you only have one way of looking at something, you’re stuck. But if you have multiple independent methods that all point to the same conclusion, you can be more confident.
This is the idea of robustness. If the same result can be obtained using different techniques that rely on different physical principles and different theoretical assumptions, the result is more likely to be real. It’s the same logic that makes you trust the time when both the alarm clock and your phone agree: they’re independent devices, and it’s unlikely that both are wrong in exactly the same way.
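A rough back-of-the-envelope calculation shows why independence matters. The error rates below are invented for illustration; the multiplication is only legitimate if the two methods really do fail for unrelated reasons.

```python
p_error_method_a = 0.10   # chance that method A produces a misleading result
p_error_method_b = 0.10   # chance that method B produces a misleading result

# If the methods share no failure modes, the chance that both mislead you at
# the same time is the product of the individual error rates.
p_both_wrong = p_error_method_a * p_error_method_b

print(f"either method alone wrong: {p_error_method_a:.0%}")
print(f"both independently wrong:  {p_both_wrong:.0%}")

# The catch is the word "independently": if two methods share a failure mode
# (say, the same sample-preparation step), their errors are correlated and
# this multiplication is not justified.
```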
But robustness isn’t a magic bullet. In the mesosome case, different electron microscopy techniques initially seemed to show the same structure. It was only when researchers developed completely different ways of preparing samples that the artifact was revealed. And even then, it took years to resolve the debate.
Some philosophers argue that the real solution to the artifact problem isn’t robustness, but something simpler: ordinary causal reasoning. Scientists eventually showed that the fixative chemicals caused the mesosome structures to appear. This was just a standard causal inference, like the Method of Difference. The artifact was discovered by figuring out what was causing the misleading data.
What Should You Believe?
After all this, you might be wondering: if experiments are so messy, how can biologists know anything at all? It’s a fair question.
The answer is that biological knowledge is built slowly, from many small experiments that each provide a piece of the puzzle. No single experiment is decisive. What matters is the whole network of evidence, the convergence of different methods, and the long process of checking and rechecking.
This might sound unsatisfying. But it’s also how you know most things in your own life. You don’t have one perfect reason to believe your best friend is trustworthy. You have hundreds of small observations that add up: the times they kept their promises, the times they showed up when they said they would, the times they didn’t spread your secrets. Each observation is weak by itself. Together, they form a strong case.
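If you like numbers, a crude Bayesian sketch makes the same point. The prior and the strength of each observation are invented for illustration; what matters is that many individually weak pieces of evidence multiply into a strong case, provided they really are independent.

```python
prior_odds = 1.0          # start undecided: 1:1 odds that the friend is trustworthy
likelihood_ratio = 1.2    # each kept promise is only weak evidence on its own

odds = prior_odds
for _ in range(50):       # fifty small observations accumulated over the years
    odds *= likelihood_ratio

probability = odds / (1 + odds)
print(f"posterior probability of trustworthiness: {probability:.4f}")
# 1.2 ** 50 is roughly 9,100, so the posterior climbs to about 0.9999:
# individually weak evidence, jointly overwhelming.
```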
Biology works the same way. The knowledge that DNA is the genetic material, that mitochondria produce energy, that genes are located on chromosomes—all of this rests on thousands of experiments, each with its own limitations and uncertainties. The fact that we’re reasonably sure about these things is a testament to the power of scientific methods, even if those methods are messier than we might like.
Appendices
Key Terms
| Term | What it means in this debate |
|---|---|
| Method of Difference | A way of reasoning about causes: if two situations are identical except for one factor, and a difference results, that factor is the cause (or an indispensable part of it) |
| Confounder | An unknown or uncontrolled factor that could produce a false appearance of causation |
| Mechanism | A collection of parts and activities that together produce a phenomenon scientists want to explain |
| Mechanistic constitution | The relationship between the parts of a mechanism and the whole phenomenon they make up—not the same as causation |
| Mutual manipulability | The idea that if you can change the parts by changing the whole, and change the whole by changing the parts, the parts are genuinely constitutive |
| Crucial experiment | An experiment that supposedly settles a debate by decisively ruling out all but one hypothesis |
| Inference to the best explanation | Accepting a hypothesis because it explains the data better than alternatives, even without direct proof |
| Robustness | A result is robust if multiple independent methods all produce the same finding |
| Artifact | Data that look like real phenomena but are actually produced by the experimental method itself |
| Social epistemology | The study of how groups of people can arrive at knowledge through social interactions, not just individual reasoning |
| Fruitfulness | A measure of how productive an experimental approach is in generating new discoveries, techniques, and connections to other fields |
Key People
- John Stuart Mill (1806–1873): A British philosopher who wrote down the basic rules of causal reasoning that scientists still use today.
- Pierre Duhem (1861–1916): A French philosopher of science who argued that crucial experiments are impossible because you can never be sure you’ve considered all possible hypotheses.
- Peter Mitchell (1920–1992): A British biochemist who proposed the chemiosmotic mechanism for ATP production. His ideas were rejected for years before being proven correct.
- Matthew Meselson and Frank Stahl: Two American molecular biologists who performed the famous “most beautiful experiment in biology” in 1957–58, showing that DNA replication is semi-conservative.
Things to Think About
- The Meselson-Stahl experiment is often called “the most beautiful experiment in biology.” But the article suggests it wasn’t as decisive as people think. What makes an experiment “beautiful” if not its ability to prove something once and for all? Could there be aesthetic qualities in scientific reasoning?
- Racker and Stoeckenius’s experiment with artificial membrane vesicles seemed to settle the oxidative phosphorylation debate. But Duhem would say it only eliminated one hypothesis, not proved the other. Is there something unsatisfying about saying scientists can never prove anything, only disprove things? Or does that actually match your experience of learning?
- The article suggests that sometimes scientific communities choose between competing approaches based on “fruitfulness” rather than on direct experimental evidence. Is this rational? Can a theory be “working better” in practice even if you can’t prove it’s true?
- Think about something you believe that’s based on evidence—maybe that vaccines work, or that a friend is honest. How many independent pieces of evidence would you need before you were convinced? Would robustness help you decide, or do you rely on something else?
Where This Shows Up
- In science classes: When you do a lab experiment and compare treatment and control groups, you’re using the Method of Difference. Most high school science assumes this logic works, but philosophers point out how many hidden assumptions it requires.
- In medicine: The debate about whether a single study can “prove” a treatment works (or doesn’t) is exactly the same debate as the one about crucial experiments. Drug trials are designed to be as clean as the Racker-Stoeckenius experiment, but they never completely escape Duhem’s worries.
- In everyday reasoning: Whenever you try to figure out why something happened by comparing two situations, you’re reasoning like a scientist. “I put the plant by the window and it grew, but the one in the hall didn’t—so it must need more light.” But did you control for everything else? This is why good reasoning is hard, even in ordinary life.
- In debates about science in society: People sometimes say “science has proven X” when really the evidence is messy and contested. Understanding how experiments actually work (and don’t work) might help you be more critical about such claims—and more patient with uncertainty.