Big Data Just Got Bigger as IBM’s Watson Meets the Encyclopedia of Life
An NSF grant marries one of the world’s largest online biological archives with IBM’s cognitive computing and Georgia Tech’s modeling and simulation
After 2,000 years, the ultimate encyclopedia of life is on the cusp of a new data-driven era. A grant from the National Science Foundation has been awarded to the Encyclopedia of Life (EOL), IBM and the Georgia Institute of Technology. The grant will enable massive amounts of data to be processed and cross-indexed in ways that will allow groundbreaking science to be done.
In the year 77 AD, Pliny the Elder began writing the world's first encyclopedia, Natural History. It included everything from astronomy to botany to zoology to anthropology and more. Pliny attempted to put everything he could personally gather about the natural world into a single written work. For the last 2,000 years, a long succession of scientists inspired by Pliny have pursued the same vision.
Pliny included 20,000 topics in 36 volumes but ran into the limitations of what a single person can discover, record and process within a human lifespan. He died during the eruption of Mount Vesuvius before he could finish a final edit of his magnum opus. Even in his own era, it wasn't possible for one person to read all the books, learn all the things, and explain it all to the world.
As later scientists, editors and librarians discovered in a world that adds more written knowledge with each passing year, even if you could store all of the world's books and research in one building, it is a challenge to make all of the relevant information available to researchers within the limits of their brief human lives.
EOL might be able to change that by applying state-of-the-art computational power to disparate collections of biological data. The project is a free and open digital collection of biodiversity facts, articles and multimedia, one of the largest in the world. Headquartered at the Smithsonian Institution, EOL has 357 partners and content providers, including Harvard University and the New Library of Alexandria in Egypt. It has grown from 30,000 pages when it launched in 2008 to more than 2 million, with 1.3 million pages of text, maps, video, audio and photographs, and it supports 20 languages.
“I came to Smithsonian in 2010 from the software industry,” says EOL director Bob Corrigan. “One of the discoveries I made coming here is that while IT is everywhere, it has not penetrated the museum world the same way it has penetrated the commercial world. In biology especially, the most important data has been buried in textbooks and spreadsheets.”
How can biological data in various forms be combined and mined for new insights on life on Earth? What if data on, say, biodiversity of butterflies in Africa over a decade was combined with data on farming practices and rainfall? Could anything new be learned? It takes something bigger than a human brain to do this. Something like IBM's Watson supercomputer.
“IBM is contributing effort and access to a version [of Watson] that is not publicly available,” says Jennifer Hammock, program director at EOL. “They are also going to have people working on it. IBM is doing this as an in-kind contribution.”
Watson is a supercomputer that doesn't just crunch numbers in large volumes. It uses artificial intelligence to allow users to ask questions in plain language.
“I would say from a user point of view, it means that the database is something you can walk up to and ask a question as if you would of a human,” says Hammock. “Like, can you tell me if this purple butterfly occurs in Africa?”
“Answering a simple question in any language presumes the existence of a lot of knowledge behind the scenes,” says Corrigan. “Even [the word] purple, it assumes that we know what purple is. Or a butterfly, [the computer] has to understand the difference between a butterfly and a moth. On top of this, the data sets themselves have different ways of thinking about these different terms. All of this data has been difficult to mine without a Rosetta stone of terms. And that is part of the magic of what the EOL is doing.”
One scientific question that the partnership between EOL, IBM and Georgia Tech hopes to solve is the paradox of the plankton.
According to Hammock, scientists working with computer simulations “try to model what happens in the ocean by saying that the sun shines in and the algae grows . . . it has kind of a rough approximation but they can't get [the computer model of the ecosystem] to be stable. They go for a while and then they crash. Because they are too simple. They hope that if they can show a little more diversity in their modeled biosphere, they will become more stable . . . the paradox being: how does the ocean biosphere exist? Why doesn't it crash?”
“People are sitting on data,” says Corrigan. “There's incredible reservoirs of biodiversity measurements all over the planet. I get a lot of phone calls from people who are sitting on this data and want help putting it in a broader context. It's important because we are in a race to study this planet and learn how our development is stressing our very finite resources. . . The Smithsonian can play a role in the increase of knowledge from all these sources and be a real power to diffuse it.”
A quarter of the $1 million grant will be awarded to the Smithsonian for its share of the work, but EOL includes a lot of other players. Some developers are in Egypt; an education team is based at Harvard; and the Spanish language unit is in Mexico City.
All of EOL's data will continue to be either in the public domain or licensed under Creative Commons. The research and data are intended to be publicly accessible and not hidden behind a paywall.
“It's a very old dream,” says Hammock. “One human probably can't learn it all. It's hard to put everything in one place where it can be consciously checked against itself. But now we have computers.”
Pliny would be either very pleased or very jealous.