The Virginia Bioinformatics Institute and the Department of Computer Science at Virginia Tech began using a network of supercomputers to locate undetected genes in a massive genome database. This and related work by other institutions has the potential to lead to exciting medical breakthroughs, including new cancer therapies and antibiotics used to combat the emergence of drug-resistant bugs.
However, as genome databases grow, so does the challenge of analyzing them. And with the advent of next-generation sequencing (NGS), this growth has been exponential. “Of the estimated 2,000 DNA sequencers worldwide, they are generating 15 petabytes of genome data every year,” explains Wu Feng, Professor of Computer Science at Virginia Tech. Many life sciences institutions simply do not have access to the computational and storage resources required to work with data sets of this size. In other words, says Feng, “We’re generating data faster than we can analyze it.”
The team had already recognized the potential of high-performance cloud computing to address the resource challenge. But now they wanted to develop software that would make it even easier for scientists to take advantage of these cloud resources, which would lead to faster genome analysis. And that’s how Feng was introduced to the potential of the Microsoft Azure HDInsight Service running on the Microsoft Azure platform.
Feng’s team was one of only 13 from across the country selected by a research program called Computing in the Cloud. Run by the National Science Foundation in partnership with Microsoft, the program was designed to accelerate access to cloud computing for research discovery, data analysis, and multidisciplinary collaboration. Based on the potential of their proposal, Feng’s team was awarded a grant that covered both the cost of using the Microsoft Azure platform and its supporting technical resources.
Feng had looked at an alternative cloud service on which to do the work but found that it did not meet the requirements of the team’s development efforts. In particular, its resource and support levels were not as robust as those of the Microsoft offering. For Feng and his team, Microsoft Azure provided an ideal combination of infrastructure and technical support “to conduct the research and development necessary to facilitate personalized genomics for the broader research community.”
Since being awarded the grant, Feng and his team have developed two software artifacts: SeqInCloud, which is based on a popular genetic variant pipeline called the Genome Analysis Toolkit (GATK), and CloudFlow, a workflow management framework that uses both client and cloud resources.
SeqInCloud (short for “sequencing in the cloud”) is based on the Broad Institute’s Genome Analysis Toolkit (GATK), a toolkit for analyzing next-generation sequencing data, with the main focus on variant discovery and genotyping.
SeqInCloud seamlessly generalizes the GATK pipeline, allowing it to run in the cloud using HDInsight and Microsoft Azure in order to maximize portability. The SeqInCloud application also features a novel design strategy for data partitioning, data transfer, and storage optimization on Microsoft Azure. The result is more efficient use of Azure cloud resources and better performance overall.
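The case study does not describe how SeqInCloud partitions its data, so the following is only a minimal sketch of one common approach: splitting a reference genome into fixed-size regions that independent workers (for example, map tasks) can process in parallel. The function names, the round-robin assignment, and the chunk size are all assumptions made for illustration and are not taken from SeqInCloud or GATK.

```python
from typing import Dict, Iterator, List, Tuple

# (contig name, start, end) with a half-open range [start, end).
Interval = Tuple[str, int, int]


def partition_genome(contig_lengths: Dict[str, int], chunk_size: int) -> Iterator[Interval]:
    """Split each contig into consecutive intervals of at most chunk_size bases.

    Hypothetical helper: illustrates region-based partitioning in general,
    not the strategy SeqInCloud actually uses.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for contig, length in contig_lengths.items():
        for start in range(0, length, chunk_size):
            yield (contig, start, min(start + chunk_size, length))


def assign_to_workers(intervals: List[Interval], n_workers: int) -> List[List[Interval]]:
    """Distribute intervals across workers in round-robin order."""
    if n_workers <= 0:
        raise ValueError("n_workers must be positive")
    buckets: List[List[Interval]] = [[] for _ in range(n_workers)]
    for index, interval in enumerate(intervals):
        buckets[index % n_workers].append(interval)
    return buckets
```

Each worker would then analyze only the reads that overlap its assigned intervals, and the per-interval results would be merged in a final reduce step.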
CloudFlow is a workflow management framework that can be installed on a researcher’s PC to simplify interactions with the Microsoft Azure HDInsight Service. As Feng explains, “It allows us to compose flexible MapReduce pipelines that simultaneously utilize both client and cloud resources for running the pipeline and automating data transfers. This is where the HDInsight resource has been particularly useful.” To run large tasks, researchers can automatically provision HDInsight clusters on demand.
The CloudFlow framework delivers unique features that are not offered by existing MapReduce-based workflow managers, including enabling the simultaneous use of client and cloud resources, automatic data-dependency handling between client and cloud resources, and the flexibility of implementing user-defined plugins for data transformations.
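To make these features more concrete, here is a minimal sketch of a pipeline whose stages are tagged as running on the client or in the cloud, with dependencies resolved automatically and a record kept of every data transfer that crosses the client/cloud boundary. It is not CloudFlow's API: the `Stage` class, the `run_pipeline` function, and the stage names are hypothetical, each stage's `run` callable stands in for a user-defined plugin, and the stages execute sequentially in one process rather than simultaneously on real client and cloud resources.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+
from typing import Any, Callable, Dict, List, Tuple


@dataclass
class Stage:
    """One pipeline step (hypothetical; not part of CloudFlow)."""
    name: str
    location: str  # "client" or "cloud"
    run: Callable[[Dict[str, Any]], Any]  # receives the outputs of its dependencies
    depends_on: List[str] = field(default_factory=list)


def run_pipeline(stages: List[Stage], transfers: List[Tuple[str, str, str]]) -> Dict[str, Any]:
    """Run stages in dependency order and log cross-location transfers."""
    by_name = {stage.name: stage for stage in stages}
    graph = {stage.name: set(stage.depends_on) for stage in stages}
    results: Dict[str, Any] = {}
    for name in TopologicalSorter(graph).static_order():
        stage = by_name[name]
        for dep in stage.depends_on:
            # A dependency produced elsewhere implies a data transfer.
            if by_name[dep].location != stage.location:
                transfers.append((dep, by_name[dep].location, stage.location))
        results[name] = stage.run({dep: results[dep] for dep in stage.depends_on})
    return results


# Example: filter reads locally, count bases "in the cloud", report locally.
reads = ["ACGT", "NNNN", "GGCC"]
stages = [
    Stage("filter_reads", "client", lambda inputs: [r for r in reads if "N" not in r]),
    Stage("count_bases", "cloud",
          lambda inputs: sum(len(r) for r in inputs["filter_reads"]), ["filter_reads"]),
    Stage("report", "client",
          lambda inputs: f"{inputs['count_bases']} bases", ["count_bases"]),
]
transfers: List[Tuple[str, str, str]] = []
results = run_pipeline(stages, transfers)
```

In a real framework, each logged transfer would correspond to an upload or download between the researcher's PC and cloud storage, and cloud stages would be submitted to a cluster rather than called directly.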
By taking advantage of the Microsoft Azure platform, Feng and his team showed that the Microsoft Azure HDInsight Service can be used to deliver cloud applications with advanced capabilities.
By making the Microsoft Azure HDInsight Service more effective and accessible for DNA sequencing researchers, the project has produced several key benefits.
Provides Significant Cost Savings
The cloud computing solution developed by Feng’s team can address growing resource issues that come with analysis of genome sequencing data. As Feng notes, “Life scientists and their institutions no longer have to find millions of dollars to establish their own supercomputing center. Rather than incur the cost of housing their own data center resources and create their own provisioning and scheduling policies, this is done for them through the Microsoft Azure ecosystem.”
Because the data persists in Azure blob stores independently of HDInsight, additional cost savings are realized: researchers pay for the compute power of the HDInsight clusters only for the duration of their actual use, without losing any data.
Supports Collaborative Analysis Anytime, Anywhere
Feng also notes the value of the Azure cloud platform as an effective collaborative tool. “The model enables the easy sharing of public data sets and helps to facilitate large-scale collaborative research.” And because the applications can be accessed from virtually anywhere, including on mobile devices, Feng sees an opportunity not far in the future when researchers will be able to engage in genome analysis outside the laboratory, “say at a hospital, which could lead to faster, prescribed treatments.”
Even in these early stages of development, the benefits of the solution are quickly being recognized by other top research institutions across the country. For example, says Feng, “The solution has already generated interest from the University of California at Berkeley and here at the Virginia Bioinformatics Institute.”
At the same time, Feng’s team continues to expand its research using the Microsoft Azure HDInsight Service. “Microsoft Azure is enabling us to keep up with the data deluge in the DNA sequencing space,” says Feng. “We’re not only analyzing data faster, but analyzing it more intelligently.”
For more information about Virginia Tech, visit the website at www.vt.edu.
To learn more about the DNA sequencing program described in this case study, please refer to: N. Mohamed, H. Lin, and W. Feng, “Accelerating Data-Intensive Genome Analysis in the Cloud,” in Proceedings of the 5th International Conference on Bioinformatics and Computational Biology (Honolulu, Hawaii, March 2013), pp. 297–304.