Genomics to push big data to its limits
Big data emerged out of the challenges companies faced in storing and processing the huge volumes of data generated by their customers, internal processes, and external data requirements, among other things. Owing to its versatility, the technology has been embraced by technology companies that process large amounts of user data, by the banking industry, space agencies, pharmaceutical companies, and governments.
The robustness of big data has been challenged a great many times, and it has managed to outdo every challenge so far. But it seems the technology will soon be up against one of its toughest opponents to date: genomics.
According to a study published in PLOS Biology, genomics, a science that didn't exist 15 years ago and is only now beginning to break out, is on track to generate more data per year than any other field.
Experts are calling for genomics to be recognized as a grand-challenge problem, with solutions devised and put into place to capture, store, process, and interpret all of that genome-encoded biological information, stripped down to symbolic and, by themselves, meaningless ones and zeros.
“For a very long time, people have used the adjective ‘astronomical’ to talk about things that are really, truly huge,” says Michael Schatz, an associate professor at the Simons Center for Quantitative Biology at Cold Spring Harbor Laboratory (CSHL) and a co-author of the PLOS paper. “But in pointing out the incredible pace of growth of data-generation in the biological sciences, my colleagues and I are suggesting we may need to start calling truly immense things ‘genomical’ in the years just ahead.”
The researchers compared a range of fields, from social media on the Internet to astronomy, and found that each generates huge amounts of data, on the order of tens to hundreds of petabytes per year.
For the uninitiated, a petabyte is one quadrillion bytes: a 1 followed by 15 zeros. That is 1,000 times more than a terabyte, the amount of storage you might have on your home computer.
The researchers then found that all of these fields are on rapidly upward-sloping growth curves, and that YouTube generates the most data right now: about 100 petabytes a year.
However, the researchers say that genomics is catching up fast: at the current rate, the quantity of genomic data produced daily is doubling every seven months. They estimate that in just 10 years, by 2025, genomics will be generating anywhere between 2 and 40 exabytes per year.
One exabyte is the equivalent of 1000 petabytes, about a million times more data than you can store on your home computer.
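To get a feel for how quickly a seven-month doubling time compounds, here is a back-of-envelope sketch. The doubling interval comes from the article; the starting rate of 25 petabytes per year is a hypothetical figure chosen only for illustration.

```python
# Back-of-envelope growth arithmetic for a 7-month doubling time.
# The 7-month figure is from the article; the starting rate is hypothetical.

PETABYTE = 10**15  # bytes
EXABYTE = 10**18   # bytes, i.e. 1,000 petabytes

def doublings(years, doubling_months=7):
    """Number of doublings that fit into the given span of years."""
    return years * 12 / doubling_months

# Sustaining a 7-month doubling time for a decade means roughly
# 17 doublings, a growth factor of about 145,000x.
factor = 2 ** doublings(10)
print(f"doublings in 10 years: {doublings(10):.1f}")
print(f"growth factor: {factor:,.0f}x")

# Applied to a hypothetical 25 PB/year starting rate:
projected_eb = 25 * PETABYTE * factor / EXABYTE
print(f"projected annual output: {projected_eb:,.0f} exabytes")
```

Pure compounding at that rate quickly overshoots into thousands of exabytes, which is why even the study's much more conservative growth scenarios still land in exabyte territory by 2025.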
“As genome-sequencing technologies improve and costs drop, we are expecting an explosion of genome sequencing that will cause a huge flood of data,” said Gene Robinson, a professor of entomology and the director of the Carl R. Woese Institute for Genomic Biology at the U. of I. “The only way to handle this data deluge will be to improve the computing infrastructure for genomics.
“Astronomy, Twitter and YouTube represent three diverse domains that generate and use a huge amount of data, albeit with huge differences in computing needs. The diversity of these three forms of Big Data provides an excellent framework for comparative analyses with genomics,” he said.
Schatz and colleagues describe genomics as a “four-headed beast.” They refer to the separate problems of data acquisition, storage, distribution and analysis. Like data that flows over the Internet, biological data that is the raw material of genomics is highly distributed. That means it’s generated and consumed in many locations. Unlike Internet data, however, which is formatted according to a few standard protocols, genomic data is compiled in many different formats, a fact that threatens its broad intelligibility and utility.
This problem grows in importance as the quantity of data increases. As Schatz explains, much of the torrent of big data from biology will take the form of human genome sequences, as well as related medical information that also depends on sequencing technology. This related information takes the form of both snapshots and the equivalent of movies, and concerns, for instance, levels of gene messages, or transcripts, in specific tissue samples, as well as the identity and levels of protein in samples.
If all the human sequence data so far generated were put in a single place – about 250,000 sequences — it would require about 25 petabytes of storage space. That is a manageable problem, Schatz says. But by 2025, the team expects as many as 1 billion people to have their full genomes sequenced (mostly, people in comparatively wealthy nations). This poses an exabyte-level storage problem.
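The storage arithmetic above can be checked directly. This sketch uses only the figures quoted in the article (250,000 genomes occupying 25 petabytes, and a projected 1 billion genomes by 2025); the implied per-genome footprint is a derived average, not a number from the study itself.

```python
# Storage arithmetic implied by the article's figures:
# 250,000 genomes -> 25 petabytes today; 1 billion genomes projected by 2025.

PETABYTE = 10**15  # bytes
EXABYTE = 10**18   # bytes

genomes_today = 250_000
storage_today = 25 * PETABYTE

# Implied average footprint per sequenced genome.
per_genome_gb = storage_today / genomes_today / 10**9
print(f"~{per_genome_gb:.0f} GB per genome")  # ~100 GB per genome

# Scaling that footprint to 1 billion genomes.
projected_bytes = 1_000_000_000 * per_genome_gb * 10**9
print(f"~{projected_bytes / EXABYTE:.0f} exabytes")  # ~100 exabytes
```

At roughly 100 GB per genome, a billion genomes would occupy on the order of 100 exabytes, which is the exabyte-level storage problem the team describes.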
At some point, sequences in full may not need to be stored. In particle physics, data is read and filtered as it is generated, greatly minimizing storage requirements. But this parsing is not entirely practical for biological information, mainly because the question of which sequences can be safely thrown out is much harder to decide. Conceivably, a billion sets of individual data will need to be preserved if they are to be an aid to future physicians.
Schatz is especially interested in the problem posed by obtaining hundreds of millions, even billions of human full-length genome sequences. The problem is not really speed, which will grow rapidly and predictably, he says, but rather in figuring out how to align and represent different genomes so that they might be compared – and compared in very efficient, smart ways.
“The point of sequencing a billion genomes is not really to make a billion separate lists saying, ‘If you have these variants, you have the following risks.’ Of course, individuals will want to look at the list of DNA variants they possess. But the real power of having 1 billion human genomes comes from ways of comparing them and combining layers of analysis. Our belief is, by combining all this information, patterns will emerge – in the same way that when Mendel grew tens of thousands of pea plants, at the dawn of genetics 150 years ago, he was able to formulate laws of inheritance by looking at patterns of how specific traits were inherited.”
“Genomics is a game-changing science in so many ways,” Schatz says. “My colleagues and I are saying that it’s important to think about the future so that we are ready for it.”
“Genomics will soon pose some of the most severe computational challenges that we have ever experienced,” Robinson said. “If genomics is to realize the promise of having a transformative positive impact on medicine, agriculture, energy production and our understanding of life itself, there must be dramatic innovations in computing. Now is the time to start.”