eScience Data Management It’s not just size that matters, it’s what you can do with it Bill Howe, PhD, eScience Institute
me from eScience Rollout, 11/5/08 Bill Howe, eScience Institute
My Background • BS Industrial and Systems Engineering, GA Tech 1999 • Big 3 Consulting with Deloitte 99-00 • Residual guilt from call centers of consultants burning $50k/day • Independent Consulting 00-01 • Microsoft, Siebel, Schlumberger, Verizon • PhD, Computer Science, Portland State University, 2006 (via OGI) • Dissertation: “GridFields: Model-Driven Data Manipulation in the Physical Sciences”, Advisor: David Maier • Postdoc and Data Architect 06-08 • NSF Science and Technology Center for Coastal Margin Observation and Prediction (CMOP)
All Science is becoming eScience
Empirical science → Analytical science → Computational science → X-informatics
Old model: “Query the world” (data acquisition coupled to a specific hypothesis)
New model: “Download the world” (data acquired en masse, independent of any hypothesis)
But: acquisition now outpaces analysis
• Astronomy: high-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS)
• Medicine: ubiquitous digital records, MRI, ultrasound
• Oceanography: high-resolution models, cheap sensors, satellites
• Biology: automated PCR, high-throughput sequencing
“Increase Data Collection Exponentially in Less Time, with FlowCAM”
The Long Tail
“The future is already here. It’s just not very evenly distributed.” -- William Gibson
• Researchers with growing data management challenges but limited resources for cyberinfrastructure
• No dedicated IT staff
• Overreliance on simple tools (e.g., spreadsheets)
The long tail is getting fatter: notebooks become spreadsheets (MB), spreadsheets become databases (GB), databases become clusters (TB), clusters become clouds (PB)
[Figure: data inventory vs. ordinal position, from big-science projects (LSST ~100PB, PanSTARRS ~40PB, CERN ~15PB/year, SDSS ~100TB, CARMEN ~50TB) down the long tail to ocean modelers, seismologists, microbiologists, and spreadsheet users]
Heterogeneity also drives costs
[Figure: # of bytes vs. # of data types -- LSST (~100PB; images, objects), PanSTARRS (~40PB; images, objects, trajectories), CERN (~15PB/year; particle interactions), SDSS (~100TB; images, objects), OOI (~50TB/year; simulation results, satellite, gliders, AUVs, vessels, more), biologists (~10TB; sequences, alignments, annotations, BLAST hits, metadata, phylogenetic trees)]
Facets of Data Management
Complexity-hiding interfaces: access methods, query languages, web services, visualization and workflow, storage management, data integration, knowledge extraction, crawlers, data mining, distributed programming models, provenance
The DB maxim: push computation to the data
Example: Relational Databases
At IBM Research in San Jose in the late 1960s and early 1970s, Codd worked out a formal basis for tabular data representation, organization, and access [Codd 70]. The early systems were buggy and slow (and sometimes reviled), but programmers only had to write 5% of the code they previously did! Now: a $10B market and the de facto standard for data management. SQL is “intergalactic dataspeak.”
Key ideas: physical data independence, logical data independence
Medium-Scale Data Management Toolbox
• Relational databases: the “hammer” of data management
• Scientific workflow systems [Howe, Freire, Silva, et al. 2008]
• Science “mashups” [Howe, Green-Fishback, Maier 2009]
• “Dataspace” systems [Howe, Maier, Rayner, Rucker 2008]
Large-Scale Data Management Toolbox
• Amazon S3: RDBMS-like features in the cloud (note: cost effectiveness unclear for large datasets)
• MapReduce: parallel programming using functional programming abstractions (Google)
• Howe, Freire, Silva: 2009 NSF CluE award
• Connolly, Gardner: 2009 NSF CluE award
• Dryad: parallel programming via relational algebra, plus type safety, monitoring, and debugging (Michael Isard, Microsoft Research)
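The MapReduce model named above can be sketched in plain Python. This is a toy, single-process illustration of the map/shuffle/reduce contract, not Google's distributed implementation:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records, mapper):
    # Apply the user-supplied mapper to each input record,
    # emitting (key, value) pairs.
    return [pair for rec in records for pair in mapper(rec)]

def reduce_phase(pairs, reducer):
    # Group pairs by key (the "shuffle"), then fold each group
    # with the user-supplied reducer.
    pairs.sort(key=itemgetter(0))
    return {k: reducer(k, [v for _, v in group])
            for k, group in groupby(pairs, key=itemgetter(0))}

# The classic word-count example: the programmer writes only
# these two small functions; the framework handles the rest.
def mapper(line):
    return [(word, 1) for word in line.split()]

def reducer(word, counts):
    return sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(map_phase(lines, mapper), reducer)
# counts["the"] == 3, counts["fox"] == 2
```

The appeal for scientists is the same as SQL's: express the per-record and per-group logic, and let the system handle partitioning, scheduling, and fault tolerance across a cluster.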
Current Activities
• Consulting: Armbrust Lab (next slide)
• Research: MapReduce for oceanographic simulations (+ visualization and workflow)
Consulting: Armbrust Lab
Initial goal: corral and inventory all relevant data
• SOLiD sequencer: potentially 0.5 TB/day, flat files
• Metadata: small relational DB + Rails/Django web app
• Data products: visualizations, intermediate results
• Ad hoc scripts and programs (key idea: these are data, too)
Next goal: amplify programmer effort
• Change is constant: no “one size fits all” solution; ad hoc development is the norm
• Strategy: teach biologists to “fish” (David Schruth’s R course)
• Strategy: develop an infrastructure that enables and encourages reuse -- scientific workflow systems
Scientific Workflow Systems • Value proposition: More time on science, less time on code • How: By providing language features emphasizing sharing, reuse, reproducibility, rapid prototyping, efficiency • Provenance • Automatic task-parallelism • Visual programming • Caching • Domain-specific toolkits • Many examples from eScience and DB communities: • Trident (MSR), Taverna (Manchester), Kepler (UCSD), VisTrails (Utah), more
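Two of the language features listed above, caching and provenance, can be sketched in a few lines of Python. The names (`task`, `PROVENANCE`, the file path) are invented for illustration; real systems such as VisTrails or Kepler record far richer metadata:

```python
import functools

PROVENANCE = []  # ordered log of (task name, args) actually executed

def task(fn):
    """Mark a function as a workflow task: cache results by argument
    and log each real execution, so reruns skip work already done."""
    cache = {}
    @functools.wraps(fn)
    def wrapper(*args):
        if args not in cache:
            PROVENANCE.append((fn.__name__, args))
            cache[args] = fn(*args)
        return cache[args]
    return wrapper

@task
def load(path):
    # Stand-in for reading a data file from disk.
    return [1.0, 2.0, 3.0]

@task
def stats(values):
    # Stand-in analysis step over the loaded values.
    return (min(values), max(values), sum(values) / len(values))

s1 = stats(tuple(load("cast.dat")))
s2 = stats(tuple(load("cast.dat")))  # cache hit: nothing re-executed
# PROVENANCE holds exactly one 'load' and one 'stats' entry
```

Rapid prototyping follows from the same machinery: change one downstream task and only that task reruns, while the provenance log records exactly what produced each result.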
Photo: The Trident Scientific Workflow Workbench for Oceanography, developed by Microsoft Research, demonstrated at Microsoft’s TechFest 2008. http://www.microsoft.com/mscorp/tc/trident.mspx
screenshot: VisTrails, Claudio Silva, Juliana Freire, et al., University of Utah
Bill Howe @ CMOP computes salt flux using GridFields. Erik Anderson @ Utah adds vector streamlines and adjusts opacity. Peter Lawson adds discussion of the scientific interpretation. Bill Howe @ CMOP adds an isosurface of salinity. source: VisTrails (Silva, Freire, Anderson) and GridFields (Howe)
Strategy at Armbrust Lab
• Develop a benchmark suite of workflow exemplars and use them to evaluate workflow offerings
• “Let a hundred flowers blossom” -- deploy multiple solutions in practice to assess user uptake
• “Pay as you go” -- evolve a toolkit rather than attempt a comprehensive, monolithic data management juggernaut
Informed by two of Jim Gray’s Laws of Data Engineering: start with “20 queries”; go from “working to working”
NSF Award: Cluster Exploratory (CluE)
• Partnership between NSF, IBM, and Google
• Data-intensive computing: an “I/O farm” -- massive queries, not massive simulations; “in ferro” experiments
• Goal: “cloud-enable” GridFields and VisTrails -- 10+-year climatologies at interactive speeds, which requires turning over up to 25 TB in under 5 seconds
• Provenance, reproducibility, visualization: VisTrails -- connect a rich desktop experience to a cloud query engine
• Co-PIs from the University of Utah: Claudio Silva and Juliana Freire
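A back-of-the-envelope check on the "25 TB in under 5 seconds" target shows why a cloud-scale I/O farm (or aggressive precomputation) is required. The ~100 MB/s per-disk figure below is an assumption, typical of commodity drives of that era, not a number from the award:

```python
TB = 10**12
data_bytes = 25 * TB
seconds = 5

required_bw = data_bytes / seconds   # bytes/sec the scan must sustain
disk_bw = 100 * 10**6                # assumed ~100 MB/s per commodity disk
disks_needed = required_bw / disk_bw

print(required_bw / TB, "TB/s aggregate scan rate")
print(round(disks_needed), "disks scanning in parallel")
```

Five terabytes per second of aggregate bandwidth is far beyond any single machine, which is exactly the regime where pushing the query out to thousands of disks (the MapReduce/CluE model) becomes the only practical option.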
Amdahl’s Laws
Gene Amdahl (1965): laws for a balanced system
• Parallelism: max speedup is (S+P)/S, where S is the serial time and P the parallelizable time
• One bit of I/O per second per instruction per second (BW)
• One byte of memory per instruction per second (MEM)
• One I/O per 50,000 instructions (IO)
Modern multi-core systems move farther away from Amdahl’s laws (Bell, Gray, and Szalay 2006): for a Blue Gene, BW = 0.001 and MEM = 0.12; for the JHU cluster, BW = 0.5 and MEM = 1.04.
source: Alex Szalay, keynote, eScience 2008
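Plugging numbers into the parallelism law (written here in terms of the serial *fraction* of the work) shows how quickly the serial part dominates, a quick sketch:

```python
def amdahl_speedup(serial_frac, n):
    # Speedup on n processors when serial_frac of the
    # work is inherently serial.
    return 1.0 / (serial_frac + (1.0 - serial_frac) / n)

def max_speedup(serial_frac):
    # Limit as the processor count grows without bound.
    return 1.0 / serial_frac

# Even 5% serial work caps speedup at 20x, no matter how many cores:
s_1024 = amdahl_speedup(0.05, 1024)   # ~19.6
s_max = max_speedup(0.05)             # 20.0
```

This is the argument behind a balanced system: once the parallel part is fast, the serial part, often I/O, is what you must engineer around.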
Climatology
[Figure/animation: Average Surface Salinity by Month (psu), Columbia River Plume, 1999-2006; map of the Washington/Oregon coast and the Columbia River; February and May panels shown]
[Figure: panels numbered 1-31 showing surface salinity (psu)]
Epilogue
We’re here to help!
SIG Wiki: https://sig.washington.edu/itsigs/SIG_eScience
eScience Blog: http://escience.washington.edu/blog/
eScience website: http://www.washington.edu/uwtech/escience.html
eScience requirements are fractal
“The future is already here. It’s just not very evenly distributed.” -- William Gibson
eScience: High-Performance Computing, Data Management, Online Collaboration Tools, CS Research, Consulting
It’s what you can do with it
• Relational database: SQL, plus UDTs and UDFs as needed
• FASTA databases: alignments, rarefaction curves, phylogenetic trees, filtering
• MapReduce: roll your own
• Dryad: relational algebra available; you can still roll your own if needed
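The "SQL plus UDFs" point is concrete in most engines: you register your own function and call it from a declarative query, so domain logic runs inside the engine rather than in a post-processing script. A sketch with Python's built-in sqlite3 (the table, columns, and `gc_content` function are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seq (id TEXT, gc_count INTEGER, length INTEGER)")
conn.executemany("INSERT INTO seq VALUES (?, ?, ?)",
                 [("s1", 30, 100), ("s2", 55, 100), ("s3", 10, 50)])

# A user-defined function: domain logic callable from SQL.
def gc_content(gc, length):
    return gc / length

conn.create_function("gc_content", 2, gc_content)

rows = conn.execute(
    "SELECT id FROM seq WHERE gc_content(gc_count, length) > 0.4 ORDER BY id"
).fetchall()
# rows == [('s2',)]
```

The same pattern scales up: production engines let UDFs and UDTs run next to the data, which is the "push computation to the data" maxim in practice.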
A data deluge in all fields
Empirical science → Analytical science → Computational science → X-informatics
Acquisition eventually outpaces analysis
• Astronomy: SDSS, now LSST and PanSTARRS
• Biology: PCR, SOLiD sequencing
• Oceanography: high-resolution models, cheap sensors
• Marine microbiology: flow cytometry
“Increase Data Collection Exponentially in Less Time, with FlowCAM”
eScience Research: High-Performance Computing, Data Management, Online Collaboration, Consulting, Community Building, Technology Transfer
Query Languages
• Organize and encapsulate access methods
• Raise the level of abstraction beyond general-purpose programming languages
• Identify and exploit opportunities for algebraic optimization
What is algebraic optimization? Consider the expression x/z + y/z. Since x/z + y/z = (x + y)/z, a system can rewrite the former into the latter, which is cheaper: it involves only one division.
• Tables -- SQL
• XML -- XQuery, XPath
• RDF -- SPARQL
• Streams -- StreamSQL, CQL
• Meshes (e.g., finite element simulations) -- GridFields
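The same idea applied to relational data: an optimizer can push a selection below a join, shrinking the inputs before the expensive operation. A toy sketch in Python (the tables and values are invented for illustration):

```python
def join(left, right, key):
    # Naive nested-loop equijoin on a shared key.
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]

casts = [{"cast_id": i, "station": i % 3} for i in range(1000)]
samples = [{"cast_id": i, "salinity": 30.0 + i % 5} for i in range(1000)]

# Unoptimized plan: join everything (1,000,000 comparisons), then filter.
slow = [row for row in join(casts, samples, "cast_id")
        if row["station"] == 0]

# Algebraically equivalent plan: filter first, then join the smaller input.
fast = join([c for c in casts if c["station"] == 0], samples, "cast_id")

assert slow == fast  # same answer, roughly a third of the join work
```

A declarative query language is what makes this possible: because the user states *what* is wanted rather than *how* to compute it, the engine is free to pick the cheaper equivalent plan.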
Example: Relational Databases (In Codd we Trust…)
At IBM Research in San Jose in the late 1960s and early 1970s, Codd worked out a formal basis for working with tabular data1. The early relational systems were buggy and slow (and sometimes reviled), but programmers only had to write 5% of the code they previously did!
The Database Game: do the same thing as Codd, but with new data types: XML (trees), RDF (graphs), streams, DNA sequences, images, arrays, simulation results, etc.
1 E. F. Codd, “A Relational Model of Data for Large Shared Data Banks”, Communications of the ACM 13(6), pp. 377-387, 1970
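The "write 5% of the code" point can be made concrete with Python's built-in sqlite3: the declarative query below replaces the file parsing, looping, grouping, and sorting a programmer would once have written by hand (the table, columns, and values are invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (station TEXT, depth REAL, salinity REAL)")
conn.executemany(
    "INSERT INTO obs VALUES (?, ?, ?)",
    [("A", 1.0, 31.0), ("A", 5.0, 33.0), ("B", 1.0, 30.0), ("B", 5.0, 33.0)],
)

# Declarative: say *what* you want, not how to scan, group, or sort.
rows = conn.execute(
    "SELECT station, AVG(salinity) FROM obs "
    "GROUP BY station ORDER BY station"
).fetchall()
# rows == [('A', 32.0), ('B', 31.5)]

# Physical data independence: adding an index changes performance,
# not the query text or its answer.
conn.execute("CREATE INDEX idx_station ON obs (station)")
```

That last line is the payoff of Codd's separation of logical and physical concerns: storage can be reorganized underneath a query without touching application code.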
Gray’s Laws of Data Engineering
Jim Gray: scientific computing is revolving around data
• Need a scale-out solution for analysis
• Take the analysis to the data!
• Start with “20 queries”
• Go from “working to working”
DISSC: Data-Intensive Scalable Scientific Computing
slide source: Alex Szalay, keynote, eScience 2008
Data Management