330 likes | 447 Views
Database technology for seismology. Martin Kersten Jennie Zhang February 2011. Trends in database technology One size does not fit all … … all exhibit similar problems Database technology needs in science Webservices [out of scope] Workflow management [out of scope] Meta-data catalogs
E N D
Database technology for seismology Martin Kersten Jennie Zhang February 2011
Trends in database technology • One size does not fit all … • … all exhibit similar problems • Database technology needs in science • Webservices [out of scope] • Workflow management [out of scope] • Meta-data catalogs • Data vaults and UDFs • Scale-up and –out [out of scope] • Array query languages
Trends in database technology • One size does not fit all … • … all exhibit similar problems • MySQL, PostgreSQL, Oracle, DB2, Microsoft SQLserver,… • Have not been designed for datawarehouses • Have not been designed for science • Have not been designed for signal processing
Row versus column stores • Traditional database systems were designed for online transaction processing • How to move money from A to B • Modern database systems are designed for business intelligence • Who should I trust to lend money ?
Row versus column stores • The application domains dictate the storage and processing schemes • Row stores storage • All records are organized as linear sequence of multiple fields • An insurance company record contains around 150 fields • A stellar object contains around 500 fields • A factory assemblage line may contain 10.000 fields
Row versus column stores • The application domains dictate the storage and processing schemes • Row stores processing • Retrieve all fields of a particular record • Non-discriminative • Update the field of a record • Finally get a raise • Fast search using indices. • 15% overhead on storage saves a lot
Storing Relations in MonetDB Virtual OID: seqbase=1000 (increment=1)
Hash tables, T-trees, R-trees, ... BAT Data Structure is an array C - ARRAY BAT: binary association table BAT heap: - consecutive memory block (array) - memory-mapped file Tail Heap: - best-effort duplicate elimination for strings (~ dictionary encoding) Tail Head
Why database systems may fail? • Meta-data model • XML or NO-SQL or FLAT files • Synchronise on the datacatalog • Computational paradigm • Database systems use the relational model • Scientists use an array model (e.g. Matlab) • Storage cost • Database systems index data • Scientists use structured files (e.f.FITS, NETCDF,MSEED) • Experience • Database query formulation is “hard” for scientists • Data manipulation is “hard” for database developers
Meta-data, the astro case • Astronomy explores space using a variety of instruments, with highly different characteristics • Astronomers in the 90’s were convinced that everyone should learn to program C++ • Astronomers needed a shared catalog to correlate observations • 1997-2003 J. Gray + A. Szalay bridged the gap
230 million object images • 1 million spectra • 4TB catalog data • 9TB images A project to make a map of a large part of the Universe SkyServer provides public access to SDSS for astronomers, students, and wide public
SkyServer Schema Vertical fragment of 100+ popular columns 446 columns >600 million rows Materialized join of Photo and Spectra
LOFAR example • Every second a 2K x 2K image of the sky • Number crunching to extract ca 1000 light sources • Sent to database for spatial matching and checking for transients. • Database growth ca 50TB/yr • Single MonetDB/SQL instance on 8 core 16 G machine.
SciLens project • Explore the other sciences to • Characterise the challenges for database researchers • Create real-world show cases (e.g. Skyserver) • Derive the database technology challenge from astronomy, seismology, remote sensing,... • Develop novel query language techniques • Provide an experimentation facility for cooperative projects
Loading Mseed data into a DBMS • ORFEUS has > 3.3M compact Mseed files • Loading (meta-) data into a database structure is time and space consumptive • Option 1: use INSERT statements • Option 2: use COPY into with CSV files • Option 3: use binary attachment
Estimated Loading time INSERT INTO mseed VALUES(….) 28 years if not careful COPY INTO mseedFROM‘repro.csv’ USING ‘\t’ \n’ 340 days + conversion COPY INTO mseed FROM (‘repro.mseed’) 10 days with some care (200ms)
Data Vaults • The database system and science repositories should act as a symbiotic organism • Data is loaded dynamically upon demand at light speed • All mseed record headers can be easily handled in a modern database system • (~ 100 M records)
The holy grail An Array-DBMS
Use case • Rietbrock: Chili earthquake … 2TB of wave fronts … filter by sta/lta … remove false positives … window-based 3 min cuts … heuristic tests … interactive response required … • How can a database system help? • Scanning 2TB on modern pc takes >3 hours
Use case, a SciQL dream • Rietbrock: Chili earthquake create array mseed ( tick timestamp dimension[1988:*], data decimal(8,6), station string ); -- > 3 10^11 events, 300 billion events
Use case, a SciQL dream • Rietbrock: … filter by sta/lta --- average by window of 5 seconds select A.tick, avg(A.data) from mseed A group by A[tick:tick + 5 seconds]
Use case, a SciQL dream • Rietbrock: … filter by sta/lta select A.tick from mseed A, mseed B where A.tick = B.tick and avg(A.data) /avg(B.data) > delta group by A[tick:tick + 5 seconds], B[tick:tick + 15 seconds]
Use case, a SciQL dream • Rietbrock: … filter by sta/lta create view candidates( station string, tick timestamp, ratio float ) as select A.station, A.tick, avg(A.data) /avg(B.data)as ratio from mseed A, mseed B where A.tick = B.tick and avg(A.data) /avg(B.data) > delta group by A[tick:tick + 5 seconds], B[tick:tick + 15 seconds]
Use case, a SciQL dream • Rietbrock: … remove false positives -- remove isolated errors by direct environment -- using wave propagation statics create table neighbors( head string, tail string, delay timestamp, weight float)
Use case, a SciQL dream • Rietbrock: … remove false positives select A.tick, B.tickfrom candidates A, candidates B, neighbors N where A.station = N.head andB.station = N.tail and B.tick = A.tick + N.delay and B.ratio* N.weight < A.ratio;
Use case, a SciQL dream • Rietbrock: … remove false positives delete from candidates select A.tickfrom candidates A, candidates B, neighbors N where A.station = N.head andB.station = N.tail and B.tick = A.tick + N.delay and B.ratio* N.weight < A.ratio;
Use case, a SciQL dream • Rietbrock: … window-based 3 min cuts … heuristic tests select B.station, myfunction(B.data)from candidates A, mseed B where A.tick = B.tickgroup by distinct B[tick:tick + 3 minutes]; -- using a User Defined Function written in C.
Use case • Rietbrock: … interactive response required … The query over 2TB of seismic data will be handled before he finishes his coffee.
An Array-DBMS An Array DBMS • An array DBMS is yet to be provided to the community • Major implementation impediments, • Integration with existing SQL stack • Integration of the proper MATH libraries
Conclusions • Catalogs Relational database systems are effective for building community-based meta-data catalogs (Skyserver). • Streaming SQL-based processing can perform well in streaming applications (LOFAR) • Data vaults Shared responsibility of science repositories is feasible [Development] • Arrays Relational and array-based declarative query processing [Research & development]