
Trends in Genomics Big Data, NCBI perspective, and 1,000 Genomes in the Cloud

Trends in Genomics Big Data, NCBI perspective, and 1,000 Genomes in the Cloud. -- Don Preuss NCBI/NLM/NIH. Every decade a new, lower priced computer class forms with new programming platform, network, and interface resulting in new usage and industry. - Bell’s Law of computer classes.


Presentation Transcript


  1. Trends in Genomics Big Data, NCBI perspective, and 1,000 Genomes in the Cloud -- Don Preuss, NCBI/NLM/NIH • "Every decade a new, lower priced computer class forms with new programming platform, network, and interface resulting in new usage and industry." (Bell’s Law of computer classes)

  2. Outline • Emerging trends in "Big Data", large-scale networking, and "the cloud" in the genomics community • Trends in data transfer and data compression • Cloud initiatives: 1,000 Genomes in the cloud

  3. National Center for Biotechnology Information • Created by Public Law 100-607 in 1988 as part of the National Library of Medicine at NIH to: • Create automated systems for knowledge about molecular biology, biochemistry, and genetics • Perform research into advanced methods of analyzing and interpreting molecular biology data • Enable biotechnology researchers and medical care personnel to use the systems and methods developed • The NCBI advances science and health by providing access to biomedical and genomic information • Builders and providers of GenBank, Entrez, BLAST, PubMed, dbGaP, SRA, dbSNP, PubChem, and much more • Center for basic research and training in computational biology

  4. NCBI Daily Users • Web page views: 28 million per day • Web users: 3.1 million per day • Data downloaded: 26.6 TB per day • Peak web hits: 7,000 per second
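To put the daily figures on this slide on a per-second footing, the arithmetic can be sketched as below. This is an illustration only: it assumes decimal units (1 TB = 10^12 bytes) and a perfectly even load across the day, neither of which the slide states, and the function names are made up for the example.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_gbps(terabytes_per_day: float) -> float:
    """Average rate in gigabits per second (assumes 1 TB = 1e12 bytes)."""
    bits_per_day = terabytes_per_day * 1e12 * 8
    return bits_per_day / SECONDS_PER_DAY / 1e9


def average_per_second(events_per_day: float) -> float:
    """Average events per second, assuming an even spread over the day."""
    return events_per_day / SECONDS_PER_DAY


if __name__ == "__main__":
    # Figures taken from the slide: 26.6 TB downloaded and 28 million page views per day.
    print(f"{average_gbps(26.6):.2f} Gbit/s sustained download rate")
    print(f"{average_per_second(28e6):.0f} page views per second on average")
```

Under those assumptions, 26.6 TB per day corresponds to a sustained rate of roughly 2.5 Gbit/s. Real traffic is bursty, so peak rates would be higher than this average.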

  5. Sequencers

  6. DNA Sequencing Caught in Deluge of Data • BGI, based in China, is the world’s largest genomics research institute, with 167 DNA sequencers producing the equivalent of 2,000 human genomes a day. BGI churns out so much data that it often cannot transmit its results to clients or collaborators over the Internet or other communications lines because that would take weeks. Instead, it sends computer disks containing the data via FedEx.

  7. Big Data in Scientific Discovery • Physics: Large Hadron Collider • Biology: 1000 Genomes Project (Trunnell2012)

  8. NLM I2 Traffic Stats 2009-2012

  9. Getting exponential growth under control

  10. What is the Big Data Problem in Biology? Example: Reducing the 1000 Genome Dataset • Submitted BAM: read IDs as strings; original quality and recalibrated quality scores; additional analysis tags • cSRA (lossless): read IDs as integers; 40-level read qualities using recalibrated quality scores • cSRA (lossy): 8-level qualities for all sites; uniform binning of recalibrated quality scores • Variant Call Format (VCF): genotype likelihoods for all variants • Chart, size in terabytes: Total Project Size 250 TB; Lossless cSRA 85 TB; Lossy cSRA 30 TB; Analysis Genotypes 0.1 TB
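The "uniform binning" of quality scores named on this slide can be illustrated with a minimal sketch. It assumes Phred-style integer scores in the range 0 to 40 and equal-width bins. The actual binning used by cSRA is not given in the slides, so the function names, the score range, and the bin boundaries here are all hypothetical.

```python
def bin_quality(q: int, levels: int = 8, max_q: int = 40) -> int:
    """Map a quality score to one of `levels` equal-width bins (0 .. levels-1).

    Scores outside 0..max_q are clamped first. The range and bin layout
    are assumptions for illustration, not the cSRA scheme.
    """
    q = max(0, min(q, max_q))
    return q * levels // (max_q + 1)


def bits_needed(distinct_values: int) -> int:
    """Fixed-width bits required to store one of `distinct_values` symbols."""
    return (distinct_values - 1).bit_length()


if __name__ == "__main__":
    scores = [2, 11, 25, 33, 40]  # made-up example scores
    print([bin_quality(q) for q in scores])
    print(bits_needed(41), "bits per score before binning")
    print(bits_needed(8), "bits per score after binning to 8 levels")
```

Fewer distinct values means fewer bits per score even before any entropy coding, which is one reason a lossy representation can be much smaller than a lossless one. The price is that the original scores cannot be recovered from the bin numbers.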

  11. Flicek

  12. Problem: Enable Access to Data • The 1,000 Genomes data set is very large • Many sites do not have capacity for 50-200 TB downloads • Request: can the 1,000 Genomes Project store the data in the cloud? • Reduce cost for extramural investigators and increase accessibility to data • In addition, it supports Federal Open Data • "A primary goal of Data.gov is to improve access to Federal data and expand creative use of those data beyond the walls of government…" • Latest release announced at #ICGH2011, more releases coming • Part of the National Big Data Initiative announcement
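A rough calculation shows why a 50-200 TB download is out of reach for many sites. The sketch below assumes decimal units and an ideal, fully used link. The link speeds (100 Mbit/s, 1 Gbit/s, 10 Gbit/s) are assumptions chosen for illustration and do not come from the talk.

```python
def transfer_days(terabytes: float, megabits_per_second: float,
                  efficiency: float = 1.0) -> float:
    """Days needed to move `terabytes` over a link of the given speed.

    `efficiency` is the fraction of nominal bandwidth actually achieved
    (1.0 = ideal link); real transfers are usually slower.
    """
    bits = terabytes * 1e12 * 8
    seconds = bits / (megabits_per_second * 1e6 * efficiency)
    return seconds / 86400


if __name__ == "__main__":
    for size_tb in (50, 200):                 # range quoted on the slide
        for mbps in (100, 1_000, 10_000):     # assumed link speeds
            days = transfer_days(size_tb, mbps)
            print(f"{size_tb:>4} TB at {mbps:>6} Mbit/s: {days:8.1f} days")
```

On an ideal 100 Mbit/s link, 200 TB takes about 185 days. Even at 1 Gbit/s it is more than two weeks of uninterrupted transfer.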

  13. Why is NCBI interested in cloud computing? • Quantity of Data • NCBI has petabytes of sequence data that is made available to researchers around the world • Bandwidth • NIH has substantial network capacity • Many sites, especially those on Internet2, have the network capacity to download data sets. Many more do not, which reduces their practical access to research data • Analysis Tools and Platforms • Some need simple tools: extract a portion of data (chromosome, area of interest) • Others use more complex tools: genome browsers, analysis tools for epigenomics using Elastic MapReduce • If we can bring compute to the data, we can improve access to the data • Disclaimer: References in this talk to any specific commercial products, process, service, manufacturer, company, or trademark do not constitute endorsement or recommendation by the U.S. Government, HHS, or NIH. As an agency of the U.S. Government, NIH cannot endorse or appear to endorse any specific commercial products or services.
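The "simple tools" case on this slide, extracting one chromosome region from a larger file, can be sketched for VCF text as follows. VCF records are tab-separated with the chromosome in the first column and a 1-based position in the second. The function name and the sample records are made up for the example, and the code scans every line rather than using an index.

```python
from typing import Iterable, Iterator


def extract_region(lines: Iterable[str], chrom: str,
                   start: int, end: int) -> Iterator[str]:
    """Yield VCF header lines plus records on `chrom` with start <= POS <= end."""
    for line in lines:
        if line.startswith("#"):
            yield line  # keep meta-information and the column header
            continue
        fields = line.split("\t", 2)
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        if fields[0] == chrom and start <= int(fields[1]) <= end:
            yield line


# Made-up records for illustration only.
SAMPLE_VCF = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "20\t14370\t.\tG\tA\t29\tPASS\t.",
    "20\t17330\t.\tT\tA\t3\tq10\t.",
    "21\t14370\t.\tC\tT\t50\tPASS\t.",
]

if __name__ == "__main__":
    for record in extract_region(SAMPLE_VCF, "20", 14000, 15000):
        print(record)
```

Run next to the data, a filter like this sends only the region of interest over the network instead of the whole file. Production tools typically use an index (for example, tabix) so they do not have to read every record.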

  14. 1,000 Genomes in the Clouds • The 1,000 Genomes Project files are loaded in Amazon S3 • Millions of files have been uploaded (200 TB) • AMIs have been developed to analyze and review the data • CloudBioLinux, Galaxy • This is a public data set with storage provided by AWS • NIH is funding several efforts to port genome pipelines to cloud computing environments • Research labs, such as those at Emory and UCSC, have placed versions of their software in AWS to make 1,000 Genomes data readily accessible through browser interfaces in the cloud
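Because the files sit in S3 as a public data set, a client could in principle list and fetch them over plain HTTPS. The sketch below uses only the Python standard library and the S3 REST "list objects" request (`list-type=2`). The bucket name `1000genomes` and the assumption that it permits anonymous listing are not stated in the slides and should be checked before use. The helper names are hypothetical.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET
from typing import List, Tuple, Union

# XML namespace used in S3 list responses.
S3_NS = {"s3": "http://s3.amazonaws.com/doc/2006-03-01/"}
BUCKET = "1000genomes"  # assumed bucket name; not given in the talk


def list_url(bucket: str, prefix: str = "", max_keys: int = 10) -> str:
    """Build the HTTPS URL for an S3 ListObjectsV2 request."""
    query = urllib.parse.urlencode(
        {"list-type": "2", "prefix": prefix, "max-keys": str(max_keys)}
    )
    return f"https://{bucket}.s3.amazonaws.com/?{query}"


def parse_listing(xml_data: Union[str, bytes]) -> List[Tuple[str, int]]:
    """Return (key, size in bytes) pairs from a list-objects XML response."""
    root = ET.fromstring(xml_data)
    return [
        (item.findtext("s3:Key", namespaces=S3_NS),
         int(item.findtext("s3:Size", namespaces=S3_NS)))
        for item in root.findall("s3:Contents", S3_NS)
    ]


def fetch_listing(bucket: str = BUCKET, prefix: str = "",
                  max_keys: int = 10) -> List[Tuple[str, int]]:
    """Fetch one page of object keys; needs network access and a bucket
    that allows anonymous listing (an assumption here)."""
    with urllib.request.urlopen(list_url(bucket, prefix, max_keys), timeout=30) as resp:
        return parse_listing(resp.read())
```

Calling `fetch_listing()` returns the first few keys and their sizes. Analysis code running on a cloud instance in the same region can read the objects directly instead of downloading them to a local site first.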

  15. What is Galaxy? • Galaxy is a framework for integrating computational tools. It allows nearly any tool that can be run from the command line to be wrapped in a structured, well-defined interface. • On top of these tools, Galaxy provides an accessible environment for interactive analysis that transparently tracks the details of analyses, a workflow system for convenient reuse, data management, sharing, publishing, and more. • Even more: Galaxy has made it easy for researchers to extend their compute power into cloud compute systems • Tools like Galaxy make it possible for a researcher to take advantage of much greater compute power without having to worry about the infrastructure details. http://usegalaxy.org (From ASMB tutorial)
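The idea of wrapping a command-line tool in a structured interface that also records what was run can be illustrated with a toy example. This is not Galaxy's API or tool format. `ToolWrapper` and everything in it are hypothetical names, used only to show the general pattern of declared parameters plus a recorded history.

```python
import subprocess
import sys
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToolWrapper:
    """Toy wrapper: declared parameters map to command-line flags,
    and every run is appended to a history for later review."""
    name: str
    command: List[str]            # base argv of the wrapped tool
    parameters: Dict[str, str]    # parameter name -> command-line flag
    history: List[dict] = field(default_factory=list)

    def build_argv(self, **values: object) -> List[str]:
        unknown = set(values) - set(self.parameters)
        if unknown:
            raise ValueError(f"undeclared parameters: {sorted(unknown)}")
        argv = list(self.command)
        for key, value in values.items():
            argv += [self.parameters[key], str(value)]
        return argv

    def run(self, **values: object) -> str:
        argv = self.build_argv(**values)
        result = subprocess.run(argv, capture_output=True, text=True, check=True)
        self.history.append({"tool": self.name, "argv": argv,
                             "returncode": result.returncode})
        return result.stdout


if __name__ == "__main__":
    # Wrap a trivial "tool": a Python one-liner that echoes its arguments.
    echo = ToolWrapper(
        name="echo-args",
        command=[sys.executable, "-c", "import sys; print(' '.join(sys.argv[1:]))"],
        parameters={"text": "--text"},
    )
    print(echo.run(text="hello"))
    print(echo.history)
```

Declaring parameters up front lets a front end validate input before running anything, and the history lets an analysis be reviewed or repeated later. Galaxy's own tool definitions and provenance tracking are far richer than this sketch.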

  16. Summary/Questions • Compression will help slow this big data problem • Other big data problems remain • New file formats will compress data close to sequencers • Last-mile networking is a big issue and prevents access for researchers • Cloud will enable access for many more researchers internationally and at underserved institutions • Email: donp@nih.gov
