Can a Million Spreadsheets Replace a Scientist?

A Cure Hidden in a Million Files

Big data lets researchers combine many different kinds of information to hunt for hidden patterns.

A cancer researcher logs into her computer. She does not look through a microscope. Instead she opens a digital file that holds records for thousands of patients: their genes, blood tests, scans, and how they responded to treatments. She hopes that all this information, when mixed together, will whisper a secret: which drug works best for which person. This way of doing science relies on big data, enormous collections of digital information that can be sifted by algorithms.

But a quiet question nags at her. Is the answer really hiding in the pile, just waiting to be found? Or does the computer need her ideas, her guesses, her very human curiosity, to turn noise into real knowledge? Philosophers of science argue about this all the time. Let’s step into their conversation.

What Even Is “Big Data”?

The same picture can be just a snapshot or be used as scientific data — it all depends on what we do with it.

You might think big data is simply a lot of files. That is partly right. Data scientists often talk about volume (the sheer size of the information) and velocity (the breakneck speed at which it arrives). But philosophers notice something deeper: we can look at data in two very different ways.

Some thinkers see data as reliable little mirrors of reality. A scientist measures the width of a cell, writes down a number, and that number is a fixed fact. The philosopher Patrick Suppes (1922–2014) helped shape this view. He believed that once we clean up data with statistics, the messy human decisions that produced them stop mattering. Data, for him, become tidy mathematical objects that can be tested against a theory.

Sabina Leonelli, a philosopher of science working today, offers a different picture. She describes a relational view: an object becomes a datum only when we treat it as evidence for a claim. A tourist’s photo of a rare mushroom is just a vacation snapshot — until a biologist uses it to study where that fungus lives. Suddenly the same picture becomes a scientific data point. What it “says” depends on how we use it and what we already know.

These two views lead to different expectations. If data are mirrors, then more data automatically means more truth. If data are relational, then having more is useful only when we carefully ask: who collected this, why, and what did they leave out?

Can Data Think Without a Theory?

Algorithms can spot patterns, but without a guiding idea those patterns might be illusions.

In 2008, a technology writer named Chris Anderson made a bold claim: we no longer need scientific theories. Throw enough data at smart algorithms, he suggested, and the correlations will speak for themselves. The computer doesn’t need to understand the world; it just needs to find what predicts what. This dream is called data‑driven science.

Critics raised red flags. Two mathematicians, Cristian Calude and Giuseppe Longo, pointed out that truly enormous datasets always contain accidental links: coincidences that mean nothing. Other links are real but misleading. If a program hunts hard enough, it will find that ice cream sales rise and fall with the number of drownings. That’s a spurious correlation; ice cream doesn’t cause drownings, but hot weather causes both. An algorithm that never asks “why” can’t tell the difference.
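
You can watch accidental links appear for yourself. The short Python sketch below is only an illustration, not the method Calude and Longo used. It assumes the "measurements" are nothing but random numbers, so any link between two of them is pure coincidence. The names pearson and strongest_chance_link are made up for this example.

```python
import random
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: +1 or -1 means a perfect straight-line link, 0 means none."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def strongest_chance_link(n_series, n_points, seed=0):
    """Make n_series lists of random noise and return the strongest
    correlation (ignoring sign) found between any two of them."""
    rng = random.Random(seed)
    series = [[rng.gauss(0, 1) for _ in range(n_points)]
              for _ in range(n_series)]
    return max(abs(pearson(a, b)) for a, b in combinations(series, 2))


# Same random noise, but the second search looks through far more pairs.
few = strongest_chance_link(n_series=5, n_points=10)     # 10 pairs
many = strongest_chance_link(n_series=200, n_points=10)  # 19,900 pairs

print(f"Strongest link among 5 random series:   {few:.2f}")
print(f"Strongest link among 200 random series: {many:.2f}")
```

Every number here is noise, yet the bigger search turns up a pair that looks strongly linked. The program did nothing wrong. It simply looked in so many places that a coincidence was almost certain to show up, which is why a pattern found by hunting needs a reason behind it before anyone should trust it.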

There are other traps. Overfitting happens when a computer learns patterns in one batch of data so tightly that it can’t handle new examples — like memorizing the answers to a single practice test and then failing the real exam. Data scientists call another problem the curse of dimensionality: the more details you try to juggle at once, the more data you need, and the easier it is to see patterns that aren’t really there.
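
The practice-test comparison can be turned into a small experiment. The Python sketch below is a toy setup with made-up assumptions: the hidden rule is y = 2x plus some random noise, a "memorizer" model simply repeats the answer of the closest example it has seen, and a "line" model looks for one simple straight-line rule. The function names are invented for this illustration.

```python
import random


def fit_line(points):
    """Least-squares straight line through the points; returns a predictor."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in points)
             / sum((x - mean_x) ** 2 for x, _ in points))
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def memorize(points):
    """Predictor that repeats the answer of the nearest remembered example."""
    return lambda x: min(points, key=lambda p: abs(p[0] - x))[1]


def mean_squared_error(model, points):
    """Average squared gap between the model's guesses and the real answers."""
    return sum((model(x) - y) ** 2 for x, y in points) / len(points)


rng = random.Random(1)


def sample(n):
    """Draw n noisy examples from the assumed hidden rule y = 2x + noise."""
    points = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        points.append((x, 2 * x + rng.gauss(0, 1)))
    return points


practice = sample(100)   # the "practice test" both models learn from
real_exam = sample(500)  # new examples neither model has seen

memorizer = memorize(practice)
line = fit_line(practice)

memorizer_train = mean_squared_error(memorizer, practice)
memorizer_test = mean_squared_error(memorizer, real_exam)
line_train = mean_squared_error(line, practice)
line_test = mean_squared_error(line, real_exam)

print(f"Memorizer: practice error {memorizer_train:.2f}, exam error {memorizer_test:.2f}")
print(f"Line:      practice error {line_train:.2f}, exam error {line_test:.2f}")
```

The memorizer scores perfectly on the practice data because it has copied the noise along with the pattern. On new examples that copied noise becomes a burden, and the simpler line model does better. That gap between practice score and exam score is what overfitting looks like.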

Because of these snags, most philosophers and many scientists now agree that big data analysis is never theory‑free. It is theory‑informed. The software, the measurements, the very choice of which data to collect all contain hidden assumptions. Human intelligence isn’t pushed aside; it’s baked into the machine’s first step.

The Hidden Values in Every Dataset

Data can be valued for money or for truth, and those values pull research in different directions.

Data feel dry and neutral, like a phone book. But every dataset is a bundle of human decisions. Which patients get measured? Which behaviours get tracked? Which results get shared? The answers are shaped by what people value.

Sometimes scientists reach for what is easiest, not for what is best. Philosopher Ulrich Krohs calls this convenience experimentation: using a fancy DNA sequencer because it’s available, even when an older tool would ask a sharper question. Over time, the data pile fills with things that were convenient to collect — and leaves out whole communities, species, or questions that are harder to digitise.

Money plays a large part too. Powerful corporations own many of the datasets and algorithms that researchers need. They may share only the information that has low commercial value, or they may hide the details of how their recommendation engines work. This deepens the digital divide — the gap between people who can afford to use data technologies and those who cannot.

When we pretend that data are purely factual and free of values, we make a dangerous mistake. The line between a fact and a value blurs as soon as we ask who paid for the data, who formatted them, and whose interests the results serve. Recognising those hidden values is not a weakness of science; it is a form of honesty.

Why It Still Matters: Your Data, Your World

The same philosophical puzzles about data live inside the apps you touch every day.

You might not dig through medical records or write machine‑learning code. But you meet these exact puzzles every afternoon. The video a platform recommends, the route your map app suggests, the price a website shows you — they all come from algorithms trained on mountains of data. How do you know the map isn’t steering you past a sponsor’s shop? Could the playlists you see be missing the songs you’d truly love, just because they weren’t “convenient” for the model?

The questions that keep philosophers up at night are the same ones you should ask when you swipe. Where did this data come from? What got left out? Who benefits when I follow this suggestion? Thinking this way doesn’t mean you throw your phone out the window. It means you treat the digital world the way a smart scientist treats a massive database: with wonder, yes, but also with careful, curious doubt.

Think about it

If an app recommends a video based on millions of past views, does that mean it’s the best video for you — or could it be missing something important?
Should scientists trust a computer’s discovery if no human can explain why it works?
When you use a map app to find the fastest route, who benefits most: you, the company that owns the app, or the data itself?
