Providing Data Access and Data Related Monitoring Information for Data Integration on the Grid

Providing Data Access and Data RelatedMonitoring Information for Data Integration on the Grid Alexander Wöhrer and Peter Brezany Institute of Scientific Computing University of Vienna {woehrer|brezany}@par.univie.ac.at

Contents • context of SemDIG • starting scenario • information needed for query optimization and adaptive query processing (AQP) • continuous data statistics with D³G • overall strategie for metadata about data sources • future work and conclusions

Context of this work • SemDIG: Semantic Data Integration on the Grid • 2 years project • focus on: • Query Optimization • e.g. early exclusion of data sources • which source to take? • Adaptive Query Processing • e.g. changes on available data source indexes • Pilot applications: • ecological (via AustrianGrid) • GridMiner project

MAIN_2 MAIN_2 MAIN_2 MAIN_1 MAIN_1 MAIN_1 Starting scenario I • ecological application • need to query measurement data from water, air and soil • various replicas definied AIR SOIL WATER REP_1 REP_2 REP_2 REP_2 REP_1 REP_1

Starting scenario II • Questions for DAI: • which sources can provide data to answer a query with various conditions? • take main source or replica? • Data distribution and volume (important for query optimisation)? • „Normal“ answers: • all main sources • take main source if available • normal distribution of the values

MAIN_2 MAIN_2 MAIN_2 MAIN_1 MAIN_1 MAIN_1 Starting scenario III • An example query plan could look like this: J Host 3 Host 1 Host 2 J U U U

Needed information for further DAI optimisations • Data access related: • Available indexes • provided by OGSA-DAI on request • Connection time • indicator for current database workload • Data related: • available histograms • exact data statistics (for columns often used in conditions!) General idea: provide more information for better initial query plans and support AQP

Envisioned Solution • independent from the actual data access technology • Supporting/using SOA features • e.g. subscribe to index changes HOST Data Access related Connection Time Indexes Data Source Monitoring Web Service RDBMS Data related Histograms Data statistics

Histograms • important for cost based optimiser • available from system tables of a DBMS http://www.dba-oracle.com/art_builder_histo.htm

Exact Data Statistics • expensive to query each time when needed • Idea: • gather once • include the effect of the delta (increment) for various database operations (insert, delete, update) • Advantage: • Low running costs • use to refute data sources from a query plan early

RDBMS side Data statstics update Triggers monitor init create Tables Stored procedure D³G RDBMS-side architecture • Maintainance: • row trigger after delete/insert/update to update the following values of a table: • mean, standard deviation (numerical) • missing and total frequency • statement trigger to keep min/max for columns up-to-date All Triggers are dynamically (according to the table structure) generated after initializing the data statistics

D³G RDBMS-side performance • Setup: • table with 11 columns (9 numerical) • Oracle 10g on a AMD 1 GHz, 768 MB RAM • init just once per table • RT independent of the table size • no updates to min/max => ST returns immediately Performance of RDBMS side functionality in msec

Target DAI scenario I • The following information is available: • Water • REP_1 has an index on a column used • MAIN_2 exposes 1 < WATER_ID < 5000 • Soil • MAIN_2 has a very bad connection time • Air • MAIN_1 exposes 1 < AIR_ID < 100.000 Let the query be: select * from water, soil, air where .... WATER_ID > 10000 and AIR_ID > 150000

MAIN_2 MAIN_2 MAIN_2 MAIN_1 MAIN_1 MAIN_1 Target DAI scenario:Starting query plan J Host 3 Host 1 Host 2 J U U U

REP_2 MAIN_2 MAIN_1 REP_1 Target DAI scenario II J Host 3 • refute data sources early • Histograms and information about row numbers could be used to change operator distribution Host 1 Host 2 J U

Conclusions • Efficient DAI needs more metadata about a data source • Data related • histograms • data statistics • Data access related • indexes • connection time Additionally: info about main source + info about replicas = more knowledge about one source (combine it) • D³G promising first results • Query optimisation as well as AQP could profit • QO: better initial query plans • AQP: react to index changes, more information used during adaption • More information on this and future work http://www.par.univie.ac.at/project/semdig

References • Jim Gray, “Distributed Computing Economics” ,TR, 2003 • Alexander Wöhrer, Lenka Novakova, Peter Brezany and A Min Tjoa, „D3G: Novel Approaches to Data Statistics, Understanding and Preprocessing on the Grid“, Accepted for IEEE AINA, Vienna, 2006 • SemDIG, http://www.par.univie.ac.at/project/semdig • PMML, http://www.dmg.org

Providing Data Access and Data Related Monitoring Information for Data Integration on the Grid