Database technology for seismology

Database technology for seismology Martin Kersten Jennie Zhang February 2011

Trends in database technology • One size does not fit all … • … all exhibit similar problems • Database technology needs in science • Webservices [out of scope] • Workflow management [out of scope] • Meta-data catalogs • Data vaults and UDFs • Scale-up and –out [out of scope] • Array query languages

Trends in database technology • One size does not fit all … • … all exhibit similar problems • MySQL, PostgreSQL, Oracle, DB2, Microsoft SQLserver,… • Have not been designed for datawarehouses • Have not been designed for science • Have not been designed for signal processing

Row versus column stores • Traditional database systems were designed for online transaction processing • How to move money from A to B • Modern database systems are designed for business intelligence • Who should I trust to lend money ?

Row versus column stores • The application domains dictate the storage and processing schemes • Row stores storage • All records are organized as linear sequence of multiple fields • An insurance company record contains around 150 fields • A stellar object contains around 500 fields • A factory assemblage line may contain 10.000 fields

Row versus column stores • The application domains dictate the storage and processing schemes • Row stores processing • Retrieve all fields of a particular record • Non-discriminative • Update the field of a record • Finally get a raise • Fast search using indices. • 15% overhead on storage saves a lot

Storing Relations in MonetDB Virtual OID: seqbase=1000 (increment=1)

Hash tables, T-trees, R-trees, ... BAT Data Structure is an array C - ARRAY BAT: binary association table BAT heap: - consecutive memory block (array)‏ - memory-mapped file Tail Heap: - best-effort duplicate elimination for strings (~ dictionary encoding) Tail Head

Why database systems may fail? • Meta-data model • XML or NO-SQL or FLAT files • Synchronise on the datacatalog • Computational paradigm • Database systems use the relational model • Scientists use an array model (e.g. Matlab) • Storage cost • Database systems index data • Scientists use structured files (e.f.FITS, NETCDF,MSEED) • Experience • Database query formulation is “hard” for scientists • Data manipulation is “hard” for database developers

Meta-data, the astro case • Astronomy explores space using a variety of instruments, with highly different characteristics • Astronomers in the 90’s were convinced that everyone should learn to program C++ • Astronomers needed a shared catalog to correlate observations • 1997-2003 J. Gray + A. Szalay bridged the gap

230 million object images • 1 million spectra • 4TB catalog data • 9TB images A project to make a map of a large part of the Universe SkyServer provides public access to SDSS for astronomers, students, and wide public

SkyServer Schema Vertical fragment of 100+ popular columns 446 columns >600 million rows Materialized join of Photo and Spectra

LOFAR example • Every second a 2K x 2K image of the sky • Number crunching to extract ca 1000 light sources • Sent to database for spatial matching and checking for transients. • Database growth ca 50TB/yr • Single MonetDB/SQL instance on 8 core 16 G machine.

SciLens project • Explore the other sciences to • Characterise the challenges for database researchers • Create real-world show cases (e.g. Skyserver) • Derive the database technology challenge from astronomy, seismology, remote sensing,... • Develop novel query language techniques • Provide an experimentation facility for cooperative projects

The SciLens platform

Loading Mseed data into a DBMS • ORFEUS has > 3.3M compact Mseed files • Loading (meta-) data into a database structure is time and space consumptive • Option 1: use INSERT statements • Option 2: use COPY into with CSV files • Option 3: use binary attachment

Estimated Loading time INSERT INTO mseed VALUES(….) 28 years if not careful COPY INTO mseedFROM‘repro.csv’ USING ‘\t’ \n’ 340 days + conversion COPY INTO mseed FROM (‘repro.mseed’) 10 days with some care (200ms)

Data Vaults • The database system and science repositories should act as a symbiotic organism • Data is loaded dynamically upon demand at light speed • All mseed record headers can be easily handled in a modern database system • (~ 100 M records)

DON’T MOVE THE DATA

The holy grail An Array-DBMS

Science DBMS landscape

Use case • Rietbrock: Chili earthquake … 2TB of wave fronts … filter by sta/lta … remove false positives … window-based 3 min cuts … heuristic tests … interactive response required … • How can a database system help? • Scanning 2TB on modern pc takes >3 hours

Use case, a SciQL dream • Rietbrock: Chili earthquake create array mseed ( tick timestamp dimension[1988:*], data decimal(8,6), station string ); -- > 3 10^11 events, 300 billion events

Use case, a SciQL dream • Rietbrock: … filter by sta/lta --- average by window of 5 seconds select A.tick, avg(A.data) from mseed A group by A[tick:tick + 5 seconds]

Use case, a SciQL dream • Rietbrock: … filter by sta/lta select A.tick from mseed A, mseed B where A.tick = B.tick and avg(A.data) /avg(B.data) > delta group by A[tick:tick + 5 seconds], B[tick:tick + 15 seconds]

Use case, a SciQL dream • Rietbrock: … filter by sta/lta create view candidates( station string, tick timestamp, ratio float ) as select A.station, A.tick, avg(A.data) /avg(B.data)as ratio from mseed A, mseed B where A.tick = B.tick and avg(A.data) /avg(B.data) > delta group by A[tick:tick + 5 seconds], B[tick:tick + 15 seconds]

Use case, a SciQL dream • Rietbrock: … remove false positives -- remove isolated errors by direct environment -- using wave propagation statics create table neighbors( head string, tail string, delay timestamp, weight float)

Use case, a SciQL dream • Rietbrock: … remove false positives select A.tick, B.tickfrom candidates A, candidates B, neighbors N where A.station = N.head andB.station = N.tail and B.tick = A.tick + N.delay and B.ratio* N.weight < A.ratio;

Use case, a SciQL dream • Rietbrock: … remove false positives delete from candidates select A.tickfrom candidates A, candidates B, neighbors N where A.station = N.head andB.station = N.tail and B.tick = A.tick + N.delay and B.ratio* N.weight < A.ratio;

Use case, a SciQL dream • Rietbrock: … window-based 3 min cuts … heuristic tests select B.station, myfunction(B.data)from candidates A, mseed B where A.tick = B.tickgroup by distinct B[tick:tick + 3 minutes]; -- using a User Defined Function written in C.

Use case • Rietbrock: … interactive response required … The query over 2TB of seismic data will be handled before he finishes his coffee.

An Array-DBMS An Array DBMS • An array DBMS is yet to be provided to the community • Major implementation impediments, • Integration with existing SQL stack • Integration of the proper MATH libraries

Conclusions • Catalogs Relational database systems are effective for building community-based meta-data catalogs (Skyserver). • Streaming SQL-based processing can perform well in streaming applications (LOFAR) • Data vaults Shared responsibility of science repositories is feasible [Development] • Arrays Relational and array-based declarative query processing [Research & development]

Database technology for seismology

Database technology for seismology

Presentation Transcript

SEISMOLOGY

Coronal seismology

Seismology

Classical Seismology

Basic Seismology

Basis for Distributed Database Technology

Seismology

Database technology

DataBase Technology by

Database Technology for the Semantic Web

Database Technology

Database Technology

Database Technology

CoRoT seismology

database technology for monitoring human development

Database Technology

EARTHQUAKE SEISMOLOGY

Technology Email Database

Reflection Seismology

Database Technology for the Semantic Web

Technology Email Database