Strategy for Physical Infrastructure Tim Hubbard ELIXIR Stakeholders Meeting 20th May 2009
Objectives • Understand requirements for physical infrastructure for ELIXIR for next 5-10 years • Core databases (mainly at EMBL-EBI) • Specialized databases • Links to core throughout Europe • Explore options for construction and recommendations on strategy • Initial design and costing of preferred options for European Data Centre (EDC)
Questions • Size of computing & storage requirement • How will it grow over 5-10 years? • Distribution of resources (core, member states) • What criteria are appropriate to identify components for communal funding? • Physical infrastructure of EDC • Upgrade existing data centre or build an extra data centre?
Challenges • Existing data scale is petabytes • Data size comparable with largest research communities • 1000 genomes / ICGC = ~1 petabyte per year
CERN Large Hadron Collider (LHC) ~10 PB/year at start ~1000 PB in ~10 years 2500 physicists collaborating http://www.cern.ch
Large Synoptic Survey Telescope (LSST) NSF, DOE, and private donors ~5-10 PB/year at start in 2012 ~100 PB by 2025 Pan-STARRS (Haleakala, Hawaii) US Air Force now: 800 TB/year soon: 4 PB/year http://www.lsst.org; http://pan-starrs.ifa.hawaii.edu/public/
Challenges • Existing data scale is petabytes • Data size comparable with largest research communities • Data security is a major issue • Tape backup no longer viable • Secure remote data replication required • Data growth has been exponential for many years • Has exceeded improvements in CPU/disk/network
Figure: Trace Archive growth, showing a doubling time of ~11 months.
Moore’s law: CPU power doubles in ~18-24 mo. Hard drive capacity doubles in ~12 mo. Network bandwidth doubles in ~20 mo.
Challenges • Existing data scale is petabytes • Data size comparable with largest research communities • Data security is a major issue • Tape backup no longer viable • Secure remote data replication required • Data growth has been exponential for many years • Has exceeded improvements in CPU/disk/network • Data growth is accelerating • Huge drop in price/jump in rate of growth in sequencing technology • Sequencing has become biological assay of choice • Imaging likely to follow
Doubling times • Disk 12 months • CPU 18-24 months • Network 20 months • Trace Archive 11 months • Illumina run output 3-6 months (this year)
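As a rough illustration of why these doubling times matter, the Python sketch below projects archive size against disk capacity at constant budget over ten years, assuming the Trace Archive rate (11-month doubling) continues and disk doubles every 12 months. The starting volume of 1 PB and the constant-budget framing are illustrative assumptions, not figures from the slides.

# Illustrative projection: archive growth vs. affordable disk at constant budget.
# Starting values are assumptions for illustration, not from the slides.

def doublings(months, doubling_time):
    """Number of doublings after `months`, given a doubling time in months."""
    return months / doubling_time

def project(years=10, archive_doubling=11, disk_doubling=12, start_pb=1.0):
    for year in range(0, years + 1, 2):
        months = 12 * year
        archive = start_pb * 2 ** doublings(months, archive_doubling)
        disk = start_pb * 2 ** doublings(months, disk_doubling)
        print(f"year {year:2d}: archive ~{archive:7.1f} PB, "
              f"disk at constant budget ~{disk:7.1f} PB, "
              f"shortfall {archive / disk:4.1f}x")

if __name__ == "__main__":
    project()

Even a one-month gap in doubling time compounds to roughly a 2x shortfall after ten years; at the Illumina rate of 3-6 months the divergence is far steeper.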
Long term value of sequence • WGAS (Whole Genome Association Studies) in Human Genetics • Success of WTCCC (Wellcome Trust Case Control Consortium) • Confounded by human population structure • Can detect causal genes, but not causal SNPs • Determining the structure of human variation • 1000 genomes project • Used with WGAS, improves ability to detect causal SNPs • Data for pilot has already saturated internet backbone • Only scratching the surface: 3 populations; 400 individuals each; 1% allele frequency • Resulting data structure will be large, frequently used, valuable • Underlying data still required to allow refinement of structure
Consequences • EBI/Genome Campus • Space/Electricity are limiting factors • More about storage than CPU • Tests using Supercomputer centres (WP13) • can be possible with extra work • cost/benefit unclear • for many applications would have to be near the data • Tests using commercial clouds
Recent trends • “Cloud” computing becomes a reality • Resources for hire • Amazon S3, EC2 • Google, Microsoft both developing competitors
Private sector datasets and computing capacity are already huge. Google, Yahoo!, Microsoft: probably ~100 PB or so. eBay, Facebook, Walmart: probably ~10 PB or so. For example: Microsoft is constructing a new $500M data center in Chicago: four new electrical substations totalling 168 MW of power, about 200 40’ truckable containers, each containing ~1000-2000 servers, an estimated 200K-400K servers total. Comparisons to Google, Microsoft, etc. aren’t entirely appropriate; the scale of their budgets is not comparable to ours (Google FY2007: $11.5B; ~$1B to computing hardware). Though they do give us early warning of coming trends (container data centers; cloud computing).
Consequences • EBI/Genome Campus • Space/Electricity are limiting factors • More about storage than CPU • Tests using Supercomputer centres (WP13) • can be possible with extra work • cost/benefit unclear • for many applications would have to be near the data • Tests using commercial clouds • May be useful for individual researchers • Not stable for archives (Google abandoned its academic data program) • Currently expensive
Physical infrastructure vision (diagram): European BioCloud data and compute infrastructure. EBI/ELIXIR aggregates and organises data; a large-scale Centre A can take a data slice if wanted; BioClouds A and B receive large-scale submissions and can accept submitted compute; the bioinformatics researcher works with test/small datasets.
• Distributed, hierarchical, redundant data archives and analysis (CERN LHC’s four tiers of data centers: 10 Tier 1 sites, 30 Tier 2 sites) • Computational infrastructure is integral to experimental design
Physical infrastructure vision (diagram, ELIXIR view): European BioCloud data and compute infrastructure. EBI/ELIXIR Node 1 holds the aggregated data; a large-scale Centre A can take a data slice if wanted; ELIXIR Nodes 2 and 3 receive large-scale submissions and can accept submitted compute; the bioinformatics researcher works with test/small datasets.
Conclusions • Need is real, biology just hasn’t been here before • Essential to upgrade data centre capacity at EBI • Implement data replication for data security • Improve data transfer for large data sets • A network of a small number of nodes for storage and compute around Europe would address data security and allow distributed compute access • Less hierarchical than physics, more data integration required • High reliability of replication needed: single command/control structure?
Recent trends • Very rapid growth in genome sequence data • Challenges of data transfers between centres (1000 genomes) • Tape backup of archives no longer sufficient, data replication required • Strong trend towards federation of separate databases • APIs, webservices, federation technologies (DAS, BioMart) • Proposed approach for International Cancer Genome Consortium (ICGC)
WP6 Activities • Round table “future hardware needs of computational biology” at “Grand Challenges in Computational Biology” in Barcelona • Initial informal survey of historical and recent growth at several European Bioinformatics Centres • EBI and WTSI both engaged in projections of future growth
Informal survey • Split out EBI/Sanger from others (EBI+Sanger numbers are ~10-fold higher) • Steady growth, showing sharp recent increases
Estimated CPU growth, given Moore’s law (doubling every 18 months)
Disk vs Compute • Bioinformatics has always been more data orientated than other IT-heavy sciences (arguably only astronomy is as data orientated) • This trend is accelerating with the arrival of next-gen sequencing and imaging • Data volumes are on a par with LHC distribution volumes
Options for a sustainable physical infrastructure • Continue to “ride” the decrease in cost per CPU and disk in each institute • Unlikely to handle the sharp increase in data, especially from next generation sequencing and imaging • Be smarter • Coordination of archives and centres: centres mounting archives “locally” avoids duplication of data • Better compression: SRF/SRA give 10-100 fold better compression than old-style “trace” information (see the sketch below) • More sophisticated pooling • Must be data aware
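The compression point can be made concrete with a toy reference-based scheme: store only the positions where a read differs from a reference, rather than the full sequence and trace. This is an illustrative sketch of the general idea only, not the actual SRF/SRA formats, which also handle quality scores and metadata.

# Toy reference-based compression: store a read as a list of (offset, base)
# differences against a reference, instead of the full sequence.
# Illustrative only; not the real SRF/SRA encoding.

def compress(read, reference, start):
    """Return offsets (relative to the read start) where the read differs."""
    return [(i, b) for i, b in enumerate(read) if reference[start + i] != b]

def decompress(diffs, reference, start, length):
    bases = list(reference[start:start + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)

reference = "ACGTACGTACGTACGTACGT"
read = "ACGTACCTACGT"            # one mismatch vs. the reference at offset 6
diffs = compress(read, reference, start=0)
assert decompress(diffs, reference, 0, len(read)) == read
print(diffs)                      # [(6, 'C')] -- far smaller than the read itself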
Compute pooling trends (timeline, 1990-2005): one machine’s resource management (Unix nice, VMS queues) → “cheap” Linux boxes in clusters, dedicated, poor queuing → LSF + Condor (becomes Sun Grid Engine) used in life science to manage institute compute → “GRID” computing developed for LHC → virtualisation → cloud computing
Virtualisation • Xen and VMware robust and sensible • Amazon offering commercial cloud services based on virtualisation (EC2) • Removes flexibility barriers for compute style • “linux” is your programming environment • Can move compute to data, not data to compute (sketched below) • Important shift in structure: previous GRIDs moved both data and compute remotely • Also useful for medical ethics issues • EBI & WTSI prototyping virtualisation • Ensembl in Amazon cloud; Xen for EGA access; WTSI is a GRID node
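A hedged sketch of “move compute to data, not data to compute”: launch an analysis virtual machine in the region that hosts the dataset instead of copying the data out. The example uses boto3 (a later AWS SDK than existed at the time of this talk); the AMI ID, region and instance type are placeholders, not the actual Ensembl cloud image.

# Sketch: start an analysis VM next to the data rather than copying
# petabytes to a local cluster. AMI ID and region are placeholders.
import boto3

DATA_REGION = "eu-west-1"          # assumed region hosting the public dataset
ANALYSIS_AMI = "ami-xxxxxxxx"      # placeholder image with the analysis stack

def launch_analysis_vm(instance_type="m5.large"):
    ec2 = boto3.resource("ec2", region_name=DATA_REGION)
    # The compute moves to the data: the VM starts in the dataset's region.
    instances = ec2.create_instances(
        ImageId=ANALYSIS_AMI,
        InstanceType=instance_type,
        MinCount=1,
        MaxCount=1,
    )
    return instances[0].id

if __name__ == "__main__":
    print("launched", launch_analysis_vm())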
Data shape • Two radically different datasets commonly used in bioinformatics • “Your” or “local” data, from experiments as part of this study • The corpus of current information in both archived and processed formats
Openness vs. Privacy • Can you have both? • Biology is getting closer to medicine
Current approach to providing access to EGA / DbGaP data • Genotypes / Sequence de-identified, but potentially re-identifiable • Summary data publicly available • Access to individual data requires registration
Current approach to providing access to EGA / DbGaP data • Genotypes / Sequence de-identified, but potentially re-identifiable • Summary data publicly available • Access to individual data requires registration • Risk: • registered individuals (2nd party) allowed download access (encrypted) • will 2nd party provide appropriate security to prevent leak to 3rd party?
Future Human Genetics Data • Now: long term case control studies • Genotype all (now: WTCCC) • Sequence all (future) • Future: study design no longer just around individual diseases • UK Biobank (just 500,000 people over 40) • UK NHS patient records (whole population)
Hard to be totally anonymous and still useful • Patient records anonymised, but associated data makes them potentially re-identifiable • Height, weight, age • Location (county, town, postcode?) • Need access to this data to carry out useful analysis • e.g. need to calculate correlations with postcodes to investigate environmental effects
Secure analysis of private data • Privacy is an issue • Public happy to contribute to health research • Public not happy to discover personal details have been lost from laptops / DVDs etc. • 3 potential solutions • “Fuzzify” data accessible for research • Social revolution (personal genetic openness) • Technical solution
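The “fuzzify” option can be illustrated with a small sketch that coarsens quasi-identifiers before release: truncating a postcode to its outward district and binning exact ages and measurements. The field names and the binning rules are hypothetical, chosen only to show the idea.

# Illustrative "fuzzification" of quasi-identifiers before data release.
# Field names and binning rules are hypothetical.

def fuzzify(record):
    """Return a coarsened copy of a patient record for research release."""
    released = {}
    # Keep only the outward part of the postcode (e.g. "CB10 1SA" -> "CB10").
    released["postcode_district"] = record["postcode"].split()[0]
    # Replace exact age with a 10-year band.
    decade = (record["age"] // 10) * 10
    released["age_band"] = f"{decade}-{decade + 9}"
    # Round height/weight to reduce re-identification risk.
    released["height_cm"] = round(record["height_cm"], -1)
    released["weight_kg"] = round(record["weight_kg"], -1)
    return released

print(fuzzify({"postcode": "CB10 1SA", "age": 47,
               "height_cm": 178.0, "weight_kg": 82.0}))
# {'postcode_district': 'CB10', 'age_band': '40-49', 'height_cm': 180.0, 'weight_kg': 80.0}

The tension noted above remains: fuzzify too aggressively (e.g. drop postcodes entirely) and analyses such as environmental correlations become impossible.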
Honest Broker • Virtual machines attached to data • Researcher can carry out any data analysis they wish (with their own code), but is guaranteed to only be able to see “privacy safe” summary results
Honest Broker (diagram): the researcher asks the secure environment to “Correlate A & B”; datasets A and B stay inside; only summary results (no raw data) are returned.
Honest Broker (diagram, continued): the researcher asks to “Correlate A & C”; datasets A, B and C stay inside the secure environment; again only summary results (no raw data) are returned.
Honest Broker (diagram): the researcher asks the secure environment to “Run X on A, B & C”; algorithm X runs inside a virtual machine with access to datasets A, B and C; only summary results (no raw data) leave the environment. Virtual machine (VM): • VM has sole access to raw data • Algorithms implement analysis within the VM • VM guarantees that only summary data can be exported • Existing examples: cloud computing (Amazon EC2); iPhone SDK (all software is developed against the SDK, with controlled access)
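A minimal sketch of the honest-broker idea, under stated assumptions: the researcher submits an analysis function, the broker runs it against the raw datasets inside the secure environment, and only scalar summary values are allowed out. The release policy (numeric aggregates only, minimum group size) is illustrative; a real system would enforce isolation with virtual machines as described in the diagram above.

# Minimal honest-broker sketch: researcher code runs against raw data inside
# the secure environment, but only summary statistics are released.
from statistics import correlation   # Python 3.10+

PRIVATE_DATA = {
    "A": [1.2, 2.4, 3.1, 4.8, 5.0, 6.3],   # stand-ins for sensitive datasets
    "B": [0.9, 2.0, 3.5, 4.1, 5.6, 6.0],
}
MIN_RECORDS = 5   # refuse to summarise very small groups

def run_analysis(dataset_names, analysis):
    """Run researcher-supplied `analysis` on raw data; release only a scalar."""
    data = [PRIVATE_DATA[name] for name in dataset_names]
    if any(len(d) < MIN_RECORDS for d in data):
        raise PermissionError("group too small to release a summary")
    result = analysis(*data)
    if not isinstance(result, (int, float)):
        raise PermissionError("only scalar summaries may leave the environment")
    return result

# Researcher's request, "Correlate A & B": they never see the raw values.
print(run_analysis(["A", "B"], correlation))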