The Vast Majority of Raw Data From Old Scientific Studies May Now Be Missing
A new survey of 20-year-old studies shows that poor archives and inaccessible authors make 90 percent of raw data impossible to find
One of the foundations of the scientific method is the reproducibility of results. In a lab anywhere around the world, a researcher should be able to study the same subject as another scientist and reproduce the same data, or analyze the same data and notice the same patterns.
This is why the findings of a study published today in Current Biology are so concerning. When a group of researchers tried to email the authors of 516 biological studies published between 1991 and 2011 and ask for the raw data, they were dismayed to find that more than 90 percent of the oldest data (from papers written more than 20 years ago) were inaccessible. In total, even including papers published as recently as 2011, they were only able to track down the data for 23 percent.
"Everybody kind of knows that if you ask a researcher for data from old studies, they'll hem and haw, because they don't know where it is," says Timothy Vines, a zoologist at the University of British Columbia, who led the effort. "But there really hadn't ever been systematic estimates of how quickly the data held by authors actually disappears."
To make their estimate, his group chose a type of data that's been relatively consistent over time—anatomical measurements of plants and animals—and dug up between 25 and 40 papers for each odd year during the period that used this sort of data, to see if they could hunt down the raw numbers.
A surprising number of their inquiries were halted at the very first step: for 25 percent of the studies, active email addresses couldn't be found, with defunct addresses listed on the paper itself and web searches not turning up any current ones. For another 38 percent of studies, their queries led to no response. Another 7 percent of the data sets were lost or inaccessible.
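For readers who want to check how these figures fit together, here is a rough tally in Python. It is only a sketch using the percentages quoted in this article; the names `outcomes`, `unaccounted_share` and `approx_count` are made up for illustration, and the study counts are rounded approximations, since only percentages are reported here.

```python
# Shares of the 516 studies, in percent, as quoted in this article.
TOTAL_STUDIES = 516

outcomes = {
    "no working email address": 25,
    "no response": 38,
    "data lost or inaccessible": 7,
    "data obtained": 23,
}


def unaccounted_share(shares):
    """Percentage of studies not covered by any listed outcome."""
    return 100 - sum(shares.values())


def approx_count(percent, total=TOTAL_STUDIES):
    """Rounded number of studies for a share (an approximation only)."""
    return round(total * percent / 100)


if __name__ == "__main__":
    for label, pct in outcomes.items():
        print(f"{label}: {pct}% (about {approx_count(pct)} studies)")
    print(f"not broken out in the article: {unaccounted_share(outcomes)}%")
```

The three obstacles add up to 70 percent of the studies, and the data were obtained for 23 percent, which leaves a remainder that the article does not break out.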
"Some of the time, for instance, it was saved on three-and-a-half-inch floppy disks, so no one could access it, because they no longer had the proper drives," Vines says. Because the whole point of keeping data is that others can use it in future research, this sort of obsolescence essentially renders the data useless.
These might seem like mundane obstacles, but scientists are just like the rest of us—they change email addresses, they get new computers with different drives, they lose their file backups—so these trends reflect serious, systemic problems in science.
Preserving data matters, it's worth remembering, because it's impossible to predict which directions research will take in the future. Vines, for instance, has been conducting his own research on a pair of toad species native to Eastern Europe that seem to be in the process of hybridizing. In the 1980s, he says, a separate team of researchers was working on the same topic, and came across an old paper that documented the distribution of these toads in the 1930s. Knowing that their distribution had changed relatively little over the intervening decades allowed the scientists to make all sorts of calculations that wouldn't have been possible otherwise. "That original data being available, from a very small old study written in Polish, was incredibly useful to researchers that came along 70 years later," he says.
There's also the fact that so much of this research is paid for with public funding, much of it coming through grants that stipulate that resulting data be made freely available to the public. Additionally, field data is affected by the circumstances of the environment in which it's collected—thus, it's impossible to perfectly replicate later on, when conditions have changed.
What's the solution? Some journals—including Molecular Ecology, of which Vines is a managing editor—have adopted policies that require authors to submit raw data along with their papers, allowing the journal itself to archive the data in perpetuity. Although journals, like people, are susceptible to changing email addresses and technological obsolescence, these problems can be much more easily managed at the institutional scale.