330 likes | 772 Views
SciDB An Open Source Data Base Project by Michael Stonebraker (and others). Outline. Why science folks are unhappy with RDBMS How we plan to fix that The details. Why SciDB?. “Big science” very unhappy with RDBMS Astronomy HEP Fusion Bio Remote sensing. Why?.
E N D
SciDBAn Open Source Data Base Project byMichael Stonebraker(and others)
Outline • Why science folks are unhappy with RDBMS • How we plan to fix that • The details
Why SciDB? • “Big science” very unhappy with RDBMS • Astronomy • HEP • Fusion • Bio • Remote sensing
Why? • Experience of Sequoia 2000 (mid 1990s) • Tried to use Postgres for science databases • Failed badly…… • Main science data type is an array – horribly inefficient to simulate arrays on top of tables • Required features absent (provenance, uncertainty, version control) • SQL operations wrong (regrid – not join)
Why SciDB? • Net result • Mentality of “roll your own from the ground up” for every new science project • Realization by the science community that this is long-term suicide • Community wants to get behind something better • Great commonality of needs among domains
A Little Context • XLDB-1 • Genesis of the need • Asilomar conference (March 2008) • Small conference to generate requirements
A Little Context • March 2008 – September 2008 • Initial design completed • Fund raising • Recruiting of initial team • Detailed use cases specified
Our Partnership • Science and high-end commercial folks • Who will put up some resources • And review design • DBMS brain trust • Who will design the system, oversee its construction, and perform needed research • Non-profit company • Which will manage the open source project • And support the resulting system • May need long term funding help
Partners – Science (We are recruiting more….) • LSST astronomy project • DBMS work co-ordinated by SLAC • Pacific Northwest National Laboratory (PNNL) • Various bio projects • Lawrence Livermore National Laboratory • Fusion projects • UCSB • Remote sensing
Partners -- DBMS • Mike Stonebraker (MIT) • Dave DeWitt (Wisconsin -> Microsoft) • Jignesh Patel (Wisconsin) • Jennifer Widom (Stanford) • Dave Maier (Portland State) • Stan Zdonik (Brown) • Sam Madden (MIT) • Ugur Cetintemal (Brown) • Magda Balazinska (Washington) • Mike Carey (UCI)
Partners -- Other • E-Bay • Vertica • Microsoft • LSST • SLAC • Will hit up NSF and DOE
The SciDB Data Model • Nothing (e.g. Hadoop, Pig, Hive, …)? • Most of you have schemas • Hadoop is not a good starting point • Slow • No HA
The SciDB Data Model • Tables? • Makes a few of you happy • Used by Sloan Sky Survey • But • PanStarrs (Alex Szalay) wants arrays and scalability
The SciDB Data Model • Arrays? • Superset of tables (tables with a primary key are a 1-D array) • Makes HEP, remote sensing, astronomy, oceanography folks happy • But • Not biology and chemistry (who wants networks and sequences)
The SciDB Data Model • Multidimensional grids • Superset of arrays (non-uniform cells) • Makes solid modeling folks happy • But • Complex and slow
SciDB Data Model • Nested multidimensional arrays • Array values are a tuple of values and arrays Sightings (sid, details) [x, y, z, t] Objects (type, [sid]) [id]
Basic Arrays • Positive integer dimensions, no gaps • Bounded or unbounded
Enhanced Arrays • “Shape” function • Supports irregular boundary
Enhanced Arrays • Co-ordinate systems • User defined functions that map integers to something else • E.g. mercator • Use dimension notation to access, e.g. • A[17,36] or • A{468.2, 917.6}
SciDB Query Language • “Parse-tree” representation of array operations • With a “binding” to: • MatLab • C++ • Python • IDL • There may be more…. • User extendable operations (Postgres-style)
Operations • Standard relational ones (filter, join) • Plus whatever you want (regrid, interpolate, fourier transform, eigenvalues, …) • Plus add your own (Postgres-style) • We need science input here!!!
Environment and Storage • Extendable grid (cloud) of Linux machines • With built-in high availability and failover • And built in disaster recovery
In Situ Processing • Operate on data with loading it • Supported by a SciDB self-describing file format • And some number of adaptors, e.g. HDF-5, NetCDF • Or write your own
Storage Model • Arrays are “chunked” in storage • Chunk size can vary • Chunks are partitioned across the grid • Go for scalability to petabytes
Other Features Which Science Guys Want (These could be in RDBMS, but Aren’t) • Uncertainty • Data has error bars • Which must be carried along in the computation (interval arithmetic) • Will look at more sophisticated error models later
Other Features • Provenance (lineage) • What calibration generated the data • What was the “cooking” algorithm • In general – repeatability of data derivation • Supported by a command log • with query facilities (interesting research problem) • And redo
Other Features • Time travel • Don’t fix errors by overwrite • I.e. keep all of the data • Supported by an extra array dimension (history) • Spatial support • Named versions • Recalibration usually handled this way • Supported by allocating an array for the new version and “diffing” against its parent
Other Features • (Optionally) integration of the real time data capture system • “cooking” inside DBMS • Makes provenance capture easier • Sometimes important
Time Line • Q4/08 • start company, begin research activities • Late 2009 • Demoware available • Late 2010 • V1 ships
Project Organization (Build-it for real) • CEO (Andy Palmer -- Vertica) • Project management (Bobbi Heath -- Vertica) • CTO (Stonebraker)
Project Organization (Design and Research) • Overall co-ordination (Stonebraker, DeWitt) • Storage and execution (Madden, Cetintemal) • Query layer and semantics (Zdonik, Maier) • Provenance (Widom, Patel) • Resource management (Balazinska) • Language bindings (Carey)
SciDB Has a Good Chance at Success • Community realizes shared infrastructure is good • “Lighthouse” customers • Strong team • Computation goes inside the DBMS • Easier to share • And reuse
How Can You Help? • Get involved!!!!