SciDB: Open Source Data Base Project by Michael Stonebraker and Team

SciDB: An Open Source Data Base Project by Michael Stonebraker (and others)

Outline • Why science folks are unhappy with RDBMS • How we plan to fix that • The details

Why SciDB? • “Big science” very unhappy with RDBMS • Astronomy • HEP • Fusion • Bio • Remote sensing

Why? • Experience of Sequoia 2000 (mid 1990s) • Tried to use Postgres for science databases • Failed badly…… • Main science data type is an array – horribly inefficient to simulate arrays on top of tables • Required features absent (provenance, uncertainty, version control) • SQL operations wrong (regrid – not join)

Why SciDB? • Net result • Mentality of “roll your own from the ground up” for every new science project • Realization by the science community that this is long-term suicide • Community wants to get behind something better • Great commonality of needs among domains

A Little Context • XLDB-1 • Genesis of the need • Asilomar conference (March 2008) • Small conference to generate requirements

A Little Context • March 2008 – September 2008 • Initial design completed • Fund raising • Recruiting of initial team • Detailed use cases specified

Our Partnership • Science and high-end commercial folks • Who will put up some resources • And review design • DBMS brain trust • Who will design the system, oversee its construction, and perform needed research • Non-profit company • Which will manage the open source project • And support the resulting system • May need long term funding help

Partners – Science (We are recruiting more….) • LSST astronomy project • DBMS work co-ordinated by SLAC • Pacific Northwest National Laboratory (PNNL) • Various bio projects • Lawrence Livermore National Laboratory • Fusion projects • UCSB • Remote sensing

Partners -- DBMS • Mike Stonebraker (MIT) • Dave DeWitt (Wisconsin -> Microsoft) • Jignesh Patel (Wisconsin) • Jennifer Widom (Stanford) • Dave Maier (Portland State) • Stan Zdonik (Brown) • Sam Madden (MIT) • Ugur Cetintemel (Brown) • Magda Balazinska (Washington) • Mike Carey (UCI)

Partners -- Other • eBay • Vertica • Microsoft • LSST • SLAC • Will hit up NSF and DOE

The SciDB Data Model • Nothing (e.g. Hadoop, Pig, Hive, …)? • Most of you have schemas • Hadoop is not a good starting point • Slow • No HA

The SciDB Data Model • Tables? • Makes a few of you happy • Used by Sloan Sky Survey • But • Pan-STARRS (Alex Szalay) wants arrays and scalability

The SciDB Data Model • Arrays? • Superset of tables (tables with a primary key are a 1-D array) • Makes HEP, remote sensing, astronomy, oceanography folks happy • But • Not biology and chemistry (who want networks and sequences)

The SciDB Data Model • Multidimensional grids • Superset of arrays (non-uniform cells) • Makes solid modeling folks happy • But • Complex and slow

SciDB Data Model • Nested multidimensional arrays • Array values are a tuple of values and arrays Sightings (sid, details) [x, y, z, t] Objects (type, [sid]) [id]
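The two declarations on this slide can be sketched in plain Python. This is only an illustration, not SciDB syntax: the class names, the dictionary-of-cells representation, and all sample values are assumptions made for the example. Each cell holds a tuple of values, and an Objects cell also holds a nested 1-D array of sighting ids.

```python
from dataclasses import dataclass, field


@dataclass
class Sighting:
    sid: int
    details: str


@dataclass
class TrackedObject:
    type: str
    sids: list = field(default_factory=list)  # nested 1-D array of sighting ids


# Sightings (sid, details) [x, y, z, t]: cells addressed by four integer dimensions.
# The coordinates and values below are made up for illustration.
sightings = {
    (1, 2, 1, 10): Sighting(sid=100, details="bright"),
    (1, 3, 1, 11): Sighting(sid=101, details="faint"),
    (7, 7, 2, 12): Sighting(sid=102, details="noise"),
}

# Objects (type, [sid]) [id]: a 1-D array whose cells contain a nested array.
objects = {
    1: TrackedObject(type="star", sids=[100, 101]),
}


def sightings_of(object_id):
    """Return the [x, y, z, t] coordinates of every sighting of one object."""
    wanted = set(objects[object_id].sids)
    return sorted(coords for coords, s in sightings.items() if s.sid in wanted)
```

Following the nested array from Objects back into Sightings is then a lookup rather than a join over a separate link table.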

Basic Arrays • Positive integer dimensions, no gaps • Bounded or unbounded

Enhanced Arrays • “Shape” function • Supports irregular boundary
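One way to picture a "shape" function is as a predicate that says which index tuples lie inside the array. The sketch below assumes that reading; the `ShapedArray` class and the triangular boundary are hypothetical and are not taken from the SciDB design.

```python
class ShapedArray:
    """Array whose valid cells are decided by a user-supplied shape function."""

    def __init__(self, shape_fn):
        self.shape_fn = shape_fn  # returns True when the index is inside the array
        self.cells = {}

    def _check(self, idx):
        if not self.shape_fn(*idx):
            raise IndexError(f"{idx} is outside the array boundary")

    def __setitem__(self, idx, value):
        self._check(idx)
        self.cells[idx] = value

    def __getitem__(self, idx):
        self._check(idx)
        return self.cells.get(idx)  # None for a valid but empty cell


def lower_triangle(i, j):
    """Irregular boundary: row i has cells 1..i (dimensions start at 1)."""
    return i >= 1 and 1 <= j <= i


tri = ShapedArray(lower_triangle)
tri[3, 2] = 5.0
```

Here `tri[3, 2]` is a legal cell while `tri[2, 3]` raises `IndexError`, so the boundary is enforced on both reads and writes.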

Enhanced Arrays • Co-ordinate systems • User-defined functions that map integers to something else • E.g. Mercator • Use dimension notation to access, e.g. • A[17,36] or • A{468.2, 917.6}
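A minimal sketch of the idea, assuming a simple linear co-ordinate system rather than a real map projection. Python has no `A{...}` syntax, so a hypothetical `at()` method stands in for the brace notation; the origin and step constants are invented so that cell [17, 36] sits at (468.2, 917.6).

```python
X0, Y0, STEP = 466.5, 914.0, 0.1  # hypothetical origin and grid spacing


def from_index(i, j):
    """User-defined function: integer index -> co-ordinate."""
    return (X0 + STEP * i, Y0 + STEP * j)


def to_index(x, y):
    """Inverse mapping: co-ordinate -> nearest integer index."""
    return (round((x - X0) / STEP), round((y - Y0) / STEP))


class CoordArray:
    def __init__(self, to_index_fn):
        self.to_index_fn = to_index_fn
        self.cells = {}

    def __setitem__(self, idx, value):
        self.cells[idx] = value

    def __getitem__(self, idx):  # integer access, like A[17,36]
        return self.cells[idx]

    def at(self, *coords):  # co-ordinate access, standing in for A{468.2, 917.6}
        return self.cells[self.to_index_fn(*coords)]


A = CoordArray(to_index)
A[17, 36] = "cell value"
```

Both access paths resolve to the same stored cell; only the addressing differs.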

SciDB Query Language • “Parse-tree” representation of array operations • With a “binding” to: • MATLAB • C++ • Python • IDL • There may be more…. • User-extendable operations (Postgres-style)
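To make the "parse-tree" idea concrete, here is a rough sketch of what a Python binding could look like: the client builds a tree of operator nodes and a separate evaluator walks it. The node names (`Scan`, `Filter`, `Apply`) and the `evaluate` function are assumptions for illustration, not the SciDB interface.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Scan:
    name: str  # name of a stored array


@dataclass(frozen=True)
class Filter:
    child: object
    predicate: Callable  # keeps cells whose value satisfies the predicate


@dataclass(frozen=True)
class Apply:
    child: object
    fn: Callable  # a user-supplied operation applied to each cell value


def evaluate(node, catalog):
    """Walk the parse tree bottom-up; arrays are dicts of index -> value."""
    if isinstance(node, Scan):
        return dict(catalog[node.name])
    if isinstance(node, Filter):
        cells = evaluate(node.child, catalog)
        return {k: v for k, v in cells.items() if node.predicate(v)}
    if isinstance(node, Apply):
        cells = evaluate(node.child, catalog)
        return {k: node.fn(v) for k, v in cells.items()}
    raise TypeError(f"unknown node type: {type(node).__name__}")


catalog = {"A": {(1,): 1, (2,): 5, (3,): 9}}  # made-up sample array
tree = Apply(Filter(Scan("A"), lambda v: v > 2), lambda v: v * 10)
result = evaluate(tree, catalog)
```

Because the tree is plain data, the same representation could be produced from any host language and extended with new node types.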

Operations • Standard relational ones (filter, join) • Plus whatever you want (regrid, interpolate, fourier transform, eigenvalues, …) • Plus add your own (Postgres-style) • We need science input here!!!
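As an example of an array operation with no natural SQL equivalent, the sketch below regrids a 2-D array to a coarser resolution. It assumes "regrid" means averaging non-overlapping blocks; the function name and signature are hypothetical, and real regridding may use other aggregates or interpolation.

```python
def regrid(grid, factor):
    """Downsample a 2-D list of numbers by averaging factor x factor blocks."""
    rows, cols = len(grid), len(grid[0])
    if rows % factor or cols % factor:
        raise ValueError("grid dimensions must be multiples of factor")
    out = []
    for r in range(0, rows, factor):
        out_row = []
        for c in range(0, cols, factor):
            # Collect every cell in the block whose top-left corner is (r, c).
            block = [grid[r + i][c + j]
                     for i in range(factor)
                     for j in range(factor)]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


fine = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
coarse = regrid(fine, 2)
```

Expressing the same thing over a table of (x, y, value) rows would need a group-by on computed block ids, which is the kind of simulation the earlier slides call inefficient.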

Environment and Storage • Extendable grid (cloud) of Linux machines • With built-in high availability and failover • And built-in disaster recovery

In Situ Processing • Operate on data without loading it • Supported by a SciDB self-describing file format • And some number of adaptors, e.g. HDF-5, NetCDF • Or write your own

Storage Model • Arrays are “chunked” in storage • Chunk size can vary • Chunks are partitioned across the grid • Go for scalability to petabytes
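A simplified sketch of chunking and partitioning, under two assumptions that the slide does not make: chunks have a fixed shape (the slide says chunk size can vary), and chunks are assigned to nodes by a simple modulo rule. Both function names are hypothetical.

```python
def chunk_of(cell, chunk_shape):
    """Map a cell index (dimensions start at 1) to the id of its chunk."""
    return tuple((c - 1) // s for c, s in zip(cell, chunk_shape))


def node_of(chunk, n_nodes):
    """Assign a chunk to one of n_nodes machines (illustrative placement rule)."""
    return sum(chunk) % n_nodes


CHUNK_SHAPE = (100, 100)  # hypothetical 100 x 100 cell chunks
N_NODES = 4               # hypothetical grid of four machines

cell = (101, 250)
chunk = chunk_of(cell, CHUNK_SHAPE)
node = node_of(chunk, N_NODES)
```

Neighboring cells land in the same chunk and therefore on the same machine, while adding machines only changes the placement rule and not the chunk layout.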

Other Features Which Science Guys Want (These could be in RDBMS, but Aren’t) • Uncertainty • Data has error bars • Which must be carried along in the computation (interval arithmetic) • Will look at more sophisticated error models later
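Carrying error bars through a computation with interval arithmetic can be sketched as follows. The `Interval` class is an illustration, not a SciDB type, and it ignores floating-point outward rounding, which a careful implementation would need.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    lo: float
    hi: float

    def __add__(self, other):
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def __sub__(self, other):
        # The smallest difference pairs our low end with the other's high end.
        return Interval(self.lo - other.hi, self.hi - other.lo)

    def __mul__(self, other):
        # Signs may differ, so take the extremes of all four endpoint products.
        products = [self.lo * other.lo, self.lo * other.hi,
                    self.hi * other.lo, self.hi * other.hi]
        return Interval(min(products), max(products))


a = Interval(1.0, 2.0)   # a measurement known to lie between 1 and 2
b = Interval(3.0, 5.0)
total = a + b
difference = a - b
product = Interval(-1.0, 2.0) * Interval(3.0, 4.0)
```

Each result is guaranteed to contain the true value whenever the inputs do, which is what lets the error bars travel with the data.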

Other Features • Provenance (lineage) • What calibration generated the data • What was the “cooking” algorithm • In general – repeatability of data derivation • Supported by a command log • with query facilities (interesting research problem) • And redo
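The command-log idea can be sketched in a few lines. This is a toy model under stated assumptions: each derived array is written once, operations are plain Python functions, and the `CommandLog` class and its methods are hypothetical names. The slide itself flags querying such a log as a research problem.

```python
class CommandLog:
    """Records every derivation so lineage can be queried and replayed."""

    def __init__(self):
        self.entries = []  # (output, op_name, fn, inputs) in execution order

    def run(self, store, output, op_name, fn, inputs):
        store[output] = fn(*(store[name] for name in inputs))
        self.entries.append((output, op_name, fn, tuple(inputs)))

    def lineage(self, name):
        """Return the chain of logged steps that produced `name`, oldest first."""
        steps = []
        for output, op_name, _fn, inputs in self.entries:
            if output == name:
                for parent in inputs:
                    steps.extend(self.lineage(parent))
                steps.append((op_name, inputs, output))
        return steps

    def redo(self, store):
        """Replay the log in order, e.g. after the raw input has been replaced."""
        for output, _op_name, fn, inputs in self.entries:
            store[output] = fn(*(store[name] for name in inputs))


store = {"raw": [1, 2, 3]}  # made-up input data
log = CommandLog()
log.run(store, "calibrated", "calibrate", lambda xs: [x * 2 for x in xs], ["raw"])
log.run(store, "cooked", "cook", lambda xs: sum(xs), ["calibrated"])
```

Calling `log.lineage("cooked")` answers "what calibration and cooking produced this data", and `log.redo(store)` re-derives everything from the current raw input.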

Other Features • Time travel • Don’t fix errors by overwrite • I.e. keep all of the data • Supported by an extra array dimension (history) • Spatial support • Named versions • Recalibration usually handled this way • Supported by allocating an array for the new version and “diffing” against its parent
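Named versions stored as a diff against a parent might look like the sketch below. It assumes a version keeps only the cells that differ and that reads fall through to the parent; the `Version` class and the sample values are illustrative.

```python
class Version:
    """A named array version that stores only its differences from a parent."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.delta = {}  # index -> value, only for cells changed in this version

    def __setitem__(self, idx, value):
        self.delta[idx] = value  # never overwrites the parent's data

    def __getitem__(self, idx):
        node = self
        while node is not None:
            if idx in node.delta:
                return node.delta[idx]
            node = node.parent  # fall through to the older version
        raise KeyError(idx)


base = Version("original")
base[1, 1] = 10.0
base[1, 2] = 20.0

recal = Version("recalibrated", parent=base)
recal[1, 2] = 21.5  # only the corrected cell is stored in the new version
```

The original stays readable after recalibration, and the new version costs storage only for the cells that changed.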

Other Features • (Optionally) integration of the real time data capture system • “cooking” inside DBMS • Makes provenance capture easier • Sometimes important

Time Line • Q4/08 • start company, begin research activities • Late 2009 • Demoware available • Late 2010 • V1 ships

Project Organization (Build-it for real) • CEO (Andy Palmer -- Vertica) • Project management (Bobbi Heath -- Vertica) • CTO (Stonebraker)

Project Organization (Design and Research) • Overall co-ordination (Stonebraker, DeWitt) • Storage and execution (Madden, Cetintemel) • Query layer and semantics (Zdonik, Maier) • Provenance (Widom, Patel) • Resource management (Balazinska) • Language bindings (Carey)

SciDB Has a Good Chance at Success • Community realizes shared infrastructure is good • “Lighthouse” customers • Strong team • Computation goes inside the DBMS • Easier to share • And reuse

How Can You Help? • Get involved!!!!
