170 likes | 346 Views
Providing Data Access and Data Related Monitoring Information for Data Integration on the Grid. Alexander Wöhrer and Peter Brezany Institute of Scientific Computing University of Vienna {woehrer|brezany}@par.univie.ac.at. Contents. context of SemDIG starting scenario
E N D
Providing Data Access and Data RelatedMonitoring Information for Data Integration on the Grid Alexander Wöhrer and Peter Brezany Institute of Scientific Computing University of Vienna {woehrer|brezany}@par.univie.ac.at
Contents • context of SemDIG • starting scenario • information needed for query optimization and adaptive query processing (AQP) • continuous data statistics with D³G • overall strategie for metadata about data sources • future work and conclusions
Context of this work • SemDIG: Semantic Data Integration on the Grid • 2 years project • focus on: • Query Optimization • e.g. early exclusion of data sources • which source to take? • Adaptive Query Processing • e.g. changes on available data source indexes • Pilot applications: • ecological (via AustrianGrid) • GridMiner project
MAIN_2 MAIN_2 MAIN_2 MAIN_1 MAIN_1 MAIN_1 Starting scenario I • ecological application • need to query measurement data from water, air and soil • various replicas definied AIR SOIL WATER REP_1 REP_2 REP_2 REP_2 REP_1 REP_1
Starting scenario II • Questions for DAI: • which sources can provide data to answer a query with various conditions? • take main source or replica? • Data distribution and volume (important for query optimisation)? • „Normal“ answers: • all main sources • take main source if available • normal distribution of the values
MAIN_2 MAIN_2 MAIN_2 MAIN_1 MAIN_1 MAIN_1 Starting scenario III • An example query plan could look like this: J Host 3 Host 1 Host 2 J U U U
Needed information for further DAI optimisations • Data access related: • Available indexes • provided by OGSA-DAI on request • Connection time • indicator for current database workload • Data related: • available histograms • exact data statistics (for columns often used in conditions!) General idea: provide more information for better initial query plans and support AQP
Envisioned Solution • independent from the actual data access technology • Supporting/using SOA features • e.g. subscribe to index changes HOST Data Access related Connection Time Indexes Data Source Monitoring Web Service RDBMS Data related Histograms Data statistics
Histograms • important for cost based optimiser • available from system tables of a DBMS http://www.dba-oracle.com/art_builder_histo.htm
Exact Data Statistics • expensive to query each time when needed • Idea: • gather once • include the effect of the delta (increment) for various database operations (insert, delete, update) • Advantage: • Low running costs • use to refute data sources from a query plan early
RDBMS side Data statstics update Triggers monitor init create Tables Stored procedure D³G RDBMS-side architecture • Maintainance: • row trigger after delete/insert/update to update the following values of a table: • mean, standard deviation (numerical) • missing and total frequency • statement trigger to keep min/max for columns up-to-date All Triggers are dynamically (according to the table structure) generated after initializing the data statistics
D³G RDBMS-side performance • Setup: • table with 11 columns (9 numerical) • Oracle 10g on a AMD 1 GHz, 768 MB RAM • init just once per table • RT independent of the table size • no updates to min/max => ST returns immediately Performance of RDBMS side functionality in msec
Target DAI scenario I • The following information is available: • Water • REP_1 has an index on a column used • MAIN_2 exposes 1 < WATER_ID < 5000 • Soil • MAIN_2 has a very bad connection time • Air • MAIN_1 exposes 1 < AIR_ID < 100.000 Let the query be: select * from water, soil, air where .... WATER_ID > 10000 and AIR_ID > 150000
MAIN_2 MAIN_2 MAIN_2 MAIN_1 MAIN_1 MAIN_1 Target DAI scenario:Starting query plan J Host 3 Host 1 Host 2 J U U U
REP_2 MAIN_2 MAIN_1 REP_1 Target DAI scenario II J Host 3 • refute data sources early • Histograms and information about row numbers could be used to change operator distribution Host 1 Host 2 J U
Conclusions • Efficient DAI needs more metadata about a data source • Data related • histograms • data statistics • Data access related • indexes • connection time Additionally: info about main source + info about replicas = more knowledge about one source (combine it) • D³G promising first results • Query optimisation as well as AQP could profit • QO: better initial query plans • AQP: react to index changes, more information used during adaption • More information on this and future work http://www.par.univie.ac.at/project/semdig
References • Jim Gray, “Distributed Computing Economics” ,TR, 2003 • Alexander Wöhrer, Lenka Novakova, Peter Brezany and A Min Tjoa, „D3G: Novel Approaches to Data Statistics, Understanding and Preprocessing on the Grid“, Accepted for IEEE AINA, Vienna, 2006 • SemDIG, http://www.par.univie.ac.at/project/semdig • PMML, http://www.dmg.org