Data overload and the pace of genomic science

by W. Gregory Feero, MD, PhD

Articles in several major journals (Nature, Science, etc.) have noted that genomics research has entered an era of data measured in petabytes.

For readers who are not used to thinking on the scale of Avogadro’s number, the prefix peta denotes a 1 followed by 15 zeros. A petabyte is a lot of data: think 250,000 four-gigabyte thumb drives. Now, are you ready to include genomics test results in your electronic health record?

Petabyte-scale storage requirements stem from low-cost, high-capacity, next-generation sequencing technologies, which are being used to study the genomes of humans and a wide variety of other organisms to spectacular effect. In research facilities, raw sequence data is commonly kept for reinterpretation, and often includes redundant sets of data for the same genome (“fold coverage”). This redundancy multiplies the storage and processing hardware needed for the already considerable output of a single sequencing run from the newest machines. Despite dramatic advances in informatics technology over the last decade, the secure storage, manipulation and interpretation of data arising from next-generation sequencing remain challenging.

Research labs are currently grappling with this issue of data overload. For example, the informatics core at Washington University’s NIH-funded sequencing center had about five petabytes of data storage capacity that was reported to be 80% to 90% full in the spring of 2010; it is currently being expanded to approximately double its size, thanks in part to a $14 million 2009 Recovery Act grant. The core is so power-intensive that it has its own electrical substation. To provide some idea of scale, according to Wikipedia, Google processes about 24 petabytes of data a day and AT&T moves about 19 petabytes of data through its networks daily in the U.S. The National Human Genome Research Institute (NHGRI) has also recently announced a major upgrade to its informatics systems to keep up with the demands of data storage and handling.

The pace of genomic science is challenging state-of-the-art research information technology systems in world-class research facilities. Meanwhile, the United States is undergoing a major federal effort to bring medical record keeping into the era of the computer. The effort, led by the Office of the National Coordinator for Health Information Technology, has been laborious and costly, and adoption rates of fully functional electronic health record systems remain low. Quite clearly, existing health informatics infrastructure across the United States is not prepared to effectively handle genomic data for even a small fraction of patients, even if the informatics demands in the clinical setting are potentially far less intense than those for research environments. Danielle Ofri, FACP, recently wrote in her New York Times blog [“The Doctor vs. the Computer”] that her medical record system couldn’t accommodate more than 1,000 characters in a notes field, and that she had to laboriously trim her comments to fit into exactly that space.

Will the clinical health informatics infrastructure in the U.S. be prepared to make optimal use of genomic data in 10, 20 or 50 years? Some might argue, not without justification, that we are still not sure how genomic information will be used in routine care for the majority of patients, and that there are many more pressing concerns in health care. I would argue that, despite the current tenuous state of health care economics, the time to consider these issues is now. Retrofitting the nation’s health informatics infrastructure in 50 years to handle genomic data is a dismal proposition. Not only would such a retrofit be costly, but it would undoubtedly slow the diffusion of new discoveries, particularly to under-resourced populations.

A variety of interesting issues warrant careful thought because the rapid pace of technology development ensures genomic data integration into health care will remain a moving target. For example, will it be cheaper to sequence the genome once and store the data over an individual’s lifespan, or to sequence the genome every time information is needed? Only a year or two ago the answer seemed obvious, but ultimately the answer boils down to a competition between costs of data storage and sequencing.

Some have argued that sequencing will become so cheap that data storage will be more costly. An obvious clinical downside of the “on-demand” sequencing approach is that it would slow the dissemination of new interpretations of genome sequence. Balancing this would be a potential decrease in the number of incidental findings that would need to be considered at any given time.
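The competition between storage and sequencing costs can be made concrete with a rough sketch. The function and parameter names below are hypothetical, and the dollar figures in the usage lines are placeholders chosen only to show the comparison flipping, not real or projected prices. The model also ignores falling prices over time, discounting, and the clinical value of being able to reinterpret stored data.

```python
def cheaper_strategy(sequencing_cost, storage_cost_per_year, years_stored, times_needed):
    """Compare two per-person strategies under fixed, hypothetical prices.

    store once: sequence one time, then pay for storage every year.
    on demand:  sequence again each time the information is needed.
    Returns (name of cheaper strategy, store-once cost, on-demand cost).
    """
    store_once = sequencing_cost + storage_cost_per_year * years_stored
    on_demand = sequencing_cost * times_needed
    name = "store once" if store_once <= on_demand else "sequence on demand"
    return name, store_once, on_demand


# Placeholder inputs (assumptions, not data): 80 years of storage, 3 clinical uses.
print(cheaper_strategy(sequencing_cost=1000, storage_cost_per_year=10,
                       years_stored=80, times_needed=3))
print(cheaper_strategy(sequencing_cost=100, storage_cost_per_year=10,
                       years_stored=80, times_needed=3))
```

With the first set of placeholder prices, storing wins; when the assumed sequencing price drops tenfold while storage stays fixed, sequencing on demand wins. That is the scenario some have argued for.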

Assume, as many do, that sequencing once and storing the data will remain the most cost-effective approach; a finished and annotated genome for clinical use might conservatively occupy a gigabyte of disk space. Multiply one gigabyte by 310 million people (the estimated 2010 U.S. population) and you have about 310 petabytes. Now consider that an individual may need to be sequenced more than once, particularly in the setting of cancer care. Will there be sufficient storage capacity for that much information in our health systems? Will the information be routinely searchable as new and potentially relevant discoveries are made? Will it be possible to electronically share structured genomic data across health care systems without resorting to PDFs or, as too often happens with records now, to printing the information on paper?
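For readers who want to check the arithmetic, here is a minimal sketch of the estimate above. It uses decimal units (one petabyte equals one million gigabytes). The function name is hypothetical, and the factor of two genomes per person in the second call is an illustrative assumption standing in for repeat sequencing, not a figure from any study.

```python
GB_PER_PB = 1_000_000  # decimal units: 1 PB = 10**15 bytes, 1 GB = 10**9 bytes


def national_storage_pb(population, gb_per_genome=1.0, genomes_per_person=1.0):
    """Total storage in petabytes for a population under the stated assumptions."""
    return population * gb_per_genome * genomes_per_person / GB_PER_PB


# One annotated genome of about 1 GB for each of 310 million people.
baseline_pb = national_storage_pb(310_000_000)
print(f"Baseline: {baseline_pb:,.0f} PB")

# Illustrative assumption: an average of two stored genomes per person.
repeat_pb = national_storage_pb(310_000_000, genomes_per_person=2)
print(f"With repeat sequencing: {repeat_pb:,.0f} PB")
```

The baseline reproduces the figure of about 310 petabytes; any repeat sequencing scales the total linearly.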

In 2011, both the Institute of Medicine and NHGRI are planning to hold conferences to explore the myriad issues related to the integration of genomic data into health informatics systems. The effect these discussions have on the trajectory of the development of health information technology systems in the U.S. could be of considerable importance to generations of physicians to come.

W. Gregory Feero, MD, PhD, a family physician with a doctorate in human genetics, is Special Advisor to the Director of the National Human Genome Research Institute (NHGRI) and faculty at the Maine-Dartmouth Family Medicine Residency Program (MDFPR).

Originally published in ACP Internist.
