1 / 33

eScience Data Management

eScience Data Management. It’s not just size that matters, it’s what you can do with it. Bill Howe, Phd eScience Institute. me. from eScience Rollout, 11/5/08. My Background. BS Industrial and Systems Engineering, GA Tech 1999 Big 3 Consulting with Deloitte 99-00

matt
Download Presentation

eScience Data Management

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. eScience Data Management It’s not just size that matters, it’s what you can do with it Bill Howe, Phd eScience Institute

  2. me from eScience Rollout, 11/5/08 Bill Howe, eScience Institute

  3. My Background • BS Industrial and Systems Engineering, GA Tech 1999 • Big 3 Consulting with Deloitte 99-00 • Residual guilt from call centers of consultants burning $50k/day • Independent Consulting 00-01 • Microsoft, Siebel, Schlumberger, Verizon • Phd, Computer Science, Portland State University, 2006 (via OGI) • Dissertation: “GridFields: Model-Driven Data Manipulation in the Physical Sciences”, Advisor: David Maier • Postdoc and Data Architect 06-08 • NSF Science and Technology Center for Coastal Margin Observation and Prediction (CMOP) Bill Howe, eScience Institute

  4. All Science is becoming eScience Empirical X  Analytical X  Computational X  X-informatics Old model: “Query the world” (Data acquisition coupled to a specific hypothesis) New model: “Download the world” (Data acquired en masse, independent of hypotheses) But: Acquisition now outpaces analysis • Astronomy: High-resolution, high-frequency sky surveys (SDSS, LSST, PanSTARRS) • Medicine: ubiquitous digital records, MRI, ultrasound • Oceanography: high-resolution models, cheap sensors, satellites • Biology: automated PCR, high-throughput sequencing “Increase Data Collection Exponentially in Less Time, with FlowCAM” Bill Howe, eScience Institute

  5. “The future is already here. It’s just not very evenly distributed.”-- William Gibson The Long Tail • Researchers with growing data management challenges but limited resources for cyberinfrastructure • No dedicated IT staff • Overreliance on simple tools (e.g., spreadsheets) LSST (~100PB) CERN (~15PB/year) PanSTARRS (~40PB) The long tail is getting fatter: notebooks become spreadsheets (MB), spreadsheets become databases (GB), databases become clusters (TB) clusters become clouds (PB) data inventory SDSS (~100TB) Ocean Modelers CARMEN (~50TB) Seis-mologists <Spreadsheet users> Microbiologists ordinal position Bill Howe, eScience Institute

  6. Heterogeneity also drives costs LSST (~100PB; images, objects) CERN (~15PB/year, particle interactions) PanSTARRS (~40PB; images, objects, trajectories) # of bytes SDSS (~100TB; images, objects) OOI (~50TB/year; sim. results, satellite, gliders, AUVs, vessels, more) Biologists (~10TB, sequences, alignments, annotations, BLAST hits, metadata, phylogenetic trees) # of data types Bill Howe, eScience Institute

  7. complexity-hiding interfaces Access Methods Query Languages Web Services Visualization; Workflow Storage Management Data Integration Knowledge Extraction, Crawlers Data Mining, Distributed Programming Models, Provenance Facets of Data Management The DB maxim: push computation to the data Bill Howe, eScience Institute

  8. Example: Relational Databases At IBM Almaden in 60s and 70s, Codd worked out a formal basis for tabular data representation, organization, and access [Codd 70]. The early systems were buggy and slow (and sometimes reviled), but programmers only had to write 5% of the code the previously did! Now: $10B market, de facto standard for data management. SQL is “intergalactic dataspeak” physical data independence logical data independence Bill Howe, eScience Institute

  9. Medium-Scale Data Management Toolbox Relational Databases The “hammer” of data management Scientific Workflow Systems [Howe, Freire, Silva, et al. 2008] Science “Mashups” [Howe, Green-Fishback, Maier, 2009] “Dataspace” systems [Howe, Maier, Rayner, Rucker 2008] Bill Howe, eScience Institute

  10. Large-Scale Data Management Toolbox Amazon S3 RDBMS-like features in the cloud Note: cost effectiveness unclear for large datasets MapReduce Parallel programming using functional programming abstractions (Google) Howe, Freire, Silva: 2009 NSF CluE Award Connolly, Gardner: 2009 NSF CluE Award Dryad Parallel programming via relational algebra plus type safety, monitoring, debugging (Michael Isard, Microsoft Research) Bill Howe, eScience Institute

  11. Current Activities • Consulting: Armbrust Lab (next slide) • Research: MapReduce for Oceanographic SImulations (+ Visualization and Workflow) Bill Howe, eScience Institute

  12. Consulting: Armbrust Lab • Initial Goal: Corral and inventory all relevant data • SOLiD sequencer: potentially 0.5 TB / day, flat files • Metadata: small relational DB + Rails/Django web app • Data Products: visualizations, intermediate results • Ad hoc scripts and programs • Initial Goal: Amplify programmer effort • Change is constant: No “one size fits all” solution; ad hoc development is the norm • Strategy: Teach biologists to “fish” (David Schruth’s R course) • Strategy: Develop an infrastructure that enables and encourages reuse -- scientific workflow systems key idea: these are data too Bill Howe, eScience Institute

  13. Scientific Workflow Systems • Value proposition: More time on science, less time on code • How: By providing language features emphasizing sharing, reuse, reproducibility, rapid prototyping, efficiency • Provenance • Automatic task-parallelism • Visual programming • Caching • Domain-specific toolkits • Many examples from eScience and DB communities: • Trident (MSR), Taverna (Manchester), Kepler (UCSD), VisTrails (Utah), more Bill Howe, eScience Institute

  14. Photo: The Trident Scientific Workflow Workbench for Oceanography, developed by Microsoft Research, demonstrated at Microsoft’s TechFest 2008. http://www.microsoft.com/mscorp/tc/trident.mspx Bill Howe, eScience Institute

  15. Bill Howe, eScience Institute screenshot: VisTrails, Claudio Silva, Juliana Freire, et al., University of Utah

  16. Bill Howe, eScience Institute screenshot: VisTrails, Claudio Silva, Juliana Freire, et al., University of Utah

  17. Bill Howe @ CMOP computes salt flux using GridFields Erik Anderson @ Utah adds vector streamlines and adjusts opacity Peter Lawson adds discussion of the scientific interpretation Bill Howe @ CMOP adds an isosurface of salinity source: VisTrails (Silva, Freire, Anderson) and GridFields (Howe) Bill Howe, eScience Institute

  18. Strategy at Armbrust Lab • Develop a benchmark suite of workflow exemplars and use them to evaluate workflow offerings • “Let a hundred flowers blossom” -- deploy multiple solutions in practice to assess user uptake • “Pay as you go” -- evolve a toolkit rather than attempt a comprehensive, monolithic data management juggernaut. Informed by two of Jim Gray’s Laws of Data Engineering: • Start with “20 queries” • Go from “working to working” Bill Howe, eScience Institute

  19. NSF Award: Cluster Exploratory (CluE) • Partnership between NSF, IBM, Google • Data-intensive computing: “I/O farm” • massive queries, not massive simulations • “in ferro” experiments • To “Cloud-Enable” GridFields and VisTrails • Goal: 10+-year climatologies at interactive speeds • Requires turning over up to 25TB < 5s • Provenance, reproducibility, visualization: VisTrails • Connect rich desktop experience to cloud query engine • Co-PIs from University of Utah • Claudio Silva and Juliana Freire Bill Howe, eScience Institute

  20. Ahmdahl’s Laws Gene Amdahl (1965): Laws for a balanced system • Parallelism: max speedup is S/(S+P) • One bit of IO/sec per instruction/sec (BW) • One byte of memory per one instruction/sec (MEM) • One IO per 50,000 instructions (IO) Modern multi-core systems move farther away from Amdahl’s Laws (Bell, Gray and Szalay 2006) For a Blue Gene the BW=0.001, MEM=0.12. For the JHU cluster BW=0.5, MEM=1.04 source: Alex Szalay, keynote, eScience 2008 Bill Howe, eScience Institute

  21. Climatology May Feb Washington Columbia River Oregon Average Surface Salinity by Month Columbia River Plume 1999-2006 psu animation Bill Howe, eScience Institute

  22. 1 3 4 5 6 7 2 8 9 15 10 11 12 13 14 16 17 23 18 (b) 19 20 21 22 psu 31 24 25 26 27 28 29 30 Bill Howe, eScience Institute

  23. Epilogue We’re here to help! SIG Wiki: https://sig.washington.edu/itsigs/SIG_eScience eScience Blog: http://escience.washington.edu/blog/ eScience wesbite: http://www.washington.edu/uwtech/escience.html Bill Howe, eScience Institute

  24. Bill Howe, eScience Institute

  25. eScience requirements are Fractal William Gibson -- “The future is already here. It’s just not very evenly distributed.” Bill Howe, eScience Institute

  26. eScience High-Performance Computing Data Management Online Collaboration Tools CS Research Consulting Bill Howe, eScience Institute

  27. It’s what you can do with it • Relational database • SQL, plus UDTs and UDFs as needed • FASTA databases • Alignments, rarefaction curves, phylogenetic trees, filtering • MapReduce: • Roll your own • Dryad • Relational algebra available; you can still roll our own if needed Bill Howe, eScience Institute

  28. A data deluge in all fields  X-informatics Empirical X  Analytical X  Computational X Acquisition eventually outpaces analysis • Astronomy: SDSS, now LSST; PanSTARRS • Biology: PCR, SOLiD sequencing • Oceanography: high-resolution models, cheap sensors • Marine Microbiology: FlowCytometer Bill Howe, eScience Institute “Increase Data Collection Exponentially in Less Time, with FlowCAM”

  29. High-Performance Computing Data Management Online Collaboration Consulting Community Building Technology Transfer eScience Research

  30. Query Languages • Organize and encapsulate access methods • Raise the level of abstraction beyond GPLs • Identify and exploit opportunities for algebraic optimization • What is algebraic optimization? Consider the expression x/z + y/z x/z + y/z = (x + y)/z, but the latter is less expensive since it involves only one division operation • Tables -- SQL • XML -- XQuery, XPath • RDF -- SPARQL • Streams -- StreamSQL, CQL • Meshes (e.g., Finite Element Sims) -- GridFields Bill Howe, eScience Institute

  31. Example: Relational Databases (In Codd we Trust…) At IBM Almaden in 60s and 70s, Codd worked out a formal basis for working with tabular data1. The early relational systems were buggy and slow (and sometimes reviled), but programmers only had to write 5% of the code the previously did! The Database Game: do the same thing as Codd, but with new data types: XML (trees), RDF (graphs), streams, DNA sequences, images, arrays, simulation results, etc. 1 E. F. Codd, “A Relational Model of Data for Large Shared Data Banks”, Communications of the ACM 13(6), pp 377-387, 1970 Bill Howe, eScience Institute

  32. Gray’s Laws of Data Engineering Jim Gray: Scientific computing is revolving around data Need scale-out solution for analysis Take the analysis to the data! Start with “20 queries” Go from “working to working” DISSC: Data Intensive Scalable Scientific Computing slide source: Alex Szalay, keynote, eScience 2008 Bill Howe, eScience Institute

  33. Data Management Bill Howe, eScience Institute

More Related