1 / 33

SciDB An Open Source Data Base Project by Michael Stonebraker (and others)

SciDB An Open Source Data Base Project by Michael Stonebraker (and others). Outline. Why science folks are unhappy with RDBMS How we plan to fix that The details. Why SciDB?. “Big science” very unhappy with RDBMS Astronomy HEP Fusion Bio Remote sensing. Why?.

geter
Download Presentation

SciDB An Open Source Data Base Project by Michael Stonebraker (and others)

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. SciDBAn Open Source Data Base Project byMichael Stonebraker(and others)

  2. Outline • Why science folks are unhappy with RDBMS • How we plan to fix that • The details

  3. Why SciDB? • “Big science” very unhappy with RDBMS • Astronomy • HEP • Fusion • Bio • Remote sensing

  4. Why? • Experience of Sequoia 2000 (mid 1990s) • Tried to use Postgres for science databases • Failed badly…… • Main science data type is an array – horribly inefficient to simulate arrays on top of tables • Required features absent (provenance, uncertainty, version control) • SQL operations wrong (regrid – not join)

  5. Why SciDB? • Net result • Mentality of “roll your own from the ground up” for every new science project • Realization by the science community that this is long-term suicide • Community wants to get behind something better • Great commonality of needs among domains

  6. A Little Context • XLDB-1 • Genesis of the need • Asilomar conference (March 2008) • Small conference to generate requirements

  7. A Little Context • March 2008 – September 2008 • Initial design completed • Fund raising • Recruiting of initial team • Detailed use cases specified

  8. Our Partnership • Science and high-end commercial folks • Who will put up some resources • And review design • DBMS brain trust • Who will design the system, oversee its construction, and perform needed research • Non-profit company • Which will manage the open source project • And support the resulting system • May need long term funding help

  9. Partners – Science (We are recruiting more….) • LSST astronomy project • DBMS work co-ordinated by SLAC • Pacific Northwest National Laboratory (PNNL) • Various bio projects • Lawrence Livermore National Laboratory • Fusion projects • UCSB • Remote sensing

  10. Partners -- DBMS • Mike Stonebraker (MIT) • Dave DeWitt (Wisconsin -> Microsoft) • Jignesh Patel (Wisconsin) • Jennifer Widom (Stanford) • Dave Maier (Portland State) • Stan Zdonik (Brown) • Sam Madden (MIT) • Ugur Cetintemal (Brown) • Magda Balazinska (Washington) • Mike Carey (UCI)

  11. Partners -- Other • E-Bay • Vertica • Microsoft • LSST • SLAC • Will hit up NSF and DOE

  12. The SciDB Data Model • Nothing (e.g. Hadoop, Pig, Hive, …)? • Most of you have schemas • Hadoop is not a good starting point • Slow • No HA

  13. The SciDB Data Model • Tables? • Makes a few of you happy • Used by Sloan Sky Survey • But • PanStarrs (Alex Szalay) wants arrays and scalability

  14. The SciDB Data Model • Arrays? • Superset of tables (tables with a primary key are a 1-D array) • Makes HEP, remote sensing, astronomy, oceanography folks happy • But • Not biology and chemistry (who wants networks and sequences)

  15. The SciDB Data Model • Multidimensional grids • Superset of arrays (non-uniform cells) • Makes solid modeling folks happy • But • Complex and slow

  16. SciDB Data Model • Nested multidimensional arrays • Array values are a tuple of values and arrays Sightings (sid, details) [x, y, z, t] Objects (type, [sid]) [id]

  17. Basic Arrays • Positive integer dimensions, no gaps • Bounded or unbounded

  18. Enhanced Arrays • “Shape” function • Supports irregular boundary

  19. Enhanced Arrays • Co-ordinate systems • User defined functions that map integers to something else • E.g. mercator • Use dimension notation to access, e.g. • A[17,36] or • A{468.2, 917.6}

  20. SciDB Query Language • “Parse-tree” representation of array operations • With a “binding” to: • MatLab • C++ • Python • IDL • There may be more…. • User extendable operations (Postgres-style)

  21. Operations • Standard relational ones (filter, join) • Plus whatever you want (regrid, interpolate, fourier transform, eigenvalues, …) • Plus add your own (Postgres-style) • We need science input here!!!

  22. Environment and Storage • Extendable grid (cloud) of Linux machines • With built-in high availability and failover • And built in disaster recovery

  23. In Situ Processing • Operate on data with loading it • Supported by a SciDB self-describing file format • And some number of adaptors, e.g. HDF-5, NetCDF • Or write your own

  24. Storage Model • Arrays are “chunked” in storage • Chunk size can vary • Chunks are partitioned across the grid • Go for scalability to petabytes

  25. Other Features Which Science Guys Want (These could be in RDBMS, but Aren’t) • Uncertainty • Data has error bars • Which must be carried along in the computation (interval arithmetic) • Will look at more sophisticated error models later

  26. Other Features • Provenance (lineage) • What calibration generated the data • What was the “cooking” algorithm • In general – repeatability of data derivation • Supported by a command log • with query facilities (interesting research problem) • And redo

  27. Other Features • Time travel • Don’t fix errors by overwrite • I.e. keep all of the data • Supported by an extra array dimension (history) • Spatial support • Named versions • Recalibration usually handled this way • Supported by allocating an array for the new version and “diffing” against its parent

  28. Other Features • (Optionally) integration of the real time data capture system • “cooking” inside DBMS • Makes provenance capture easier • Sometimes important

  29. Time Line • Q4/08 • start company, begin research activities • Late 2009 • Demoware available • Late 2010 • V1 ships

  30. Project Organization (Build-it for real) • CEO (Andy Palmer -- Vertica) • Project management (Bobbi Heath -- Vertica) • CTO (Stonebraker)

  31. Project Organization (Design and Research) • Overall co-ordination (Stonebraker, DeWitt) • Storage and execution (Madden, Cetintemal) • Query layer and semantics (Zdonik, Maier) • Provenance (Widom, Patel) • Resource management (Balazinska) • Language bindings (Carey)

  32. SciDB Has a Good Chance at Success • Community realizes shared infrastructure is good • “Lighthouse” customers • Strong team • Computation goes inside the DBMS • Easier to share • And reuse

  33. How Can You Help? • Get involved!!!!

More Related