430 likes | 544 Views
Mediator Cost Models for Heterogeneous Data Sources. Hubert Naacke. Outline. Context : DISCO a mediation system. Context : DISCO a mediation system Problem and Objective cost evaluation of query plan, use diverse cost info Proposition extensible cost model, declarative language
E N D
Mediator Cost Models for Heterogeneous Data Sources Hubert Naacke
Outline Context : DISCO a mediation system • Context : DISCO a mediation system • Problem and Objective cost evaluation of query plan, use diverse cost info • Proposition extensible cost model, declarative language • Detailed solution cost formulas hierarchical integration, cost evaluation • Validation experimentation on Web data sources • Related solutions comparison with: calibration, historical cost, query sampling • Conclusion
Mediation Systems • goals • intelligent integration of information • exploit existing sources, reuse data and power • systems • IRO-DB Univ. Versailles • Hermes Univ. Maryland • Garlic IBM • Tsimmis Stanford • Cords Univ. Waterloo • DISCO Bull + INRIADistributed Information Search Component
Administrator Mediator Mediator Wrapper 1 Wrapper n Data Sources Registration phase User Application Query Mediator Wrapper 1 Wrapper n Data Sources Mediator-Wrapper Architecture Interface Wrapper 1 Wrapper n Data Sources Query processing phase Result
Registration Query processing Catalog Catalog Execution Engine Query Decomposition Optimization Query Recomposition DISCO Architecture Application Administrator Query Tuples Mediator Mediator Schema Capabilities Cost model Wrapper 1 Wrapper n Data Sources Relational Algebra Tuples Wrapper 1 Wrapper n Data Sources
Outline • Context : DISCO a mediation system • Problem cost evaluation of query plan in the mediator • Proposition extensible cost model, declarative language • Detailed solution cost formulas integration, cost evaluation • Validation experimentation on Web data sources • Related solutions comparison with: calibration, historical cost, query sampling • Conclusion Problem cost evaluation of a query plan
Cost Based Optimization Problem • plan generation: 1 query ð n plans (sub-plans) • logical: operator reordering, permutation, distribution • physical: 1 operator ð n algorithms • cost estimation of plans (sub-plans) compare plans ð 1 candidate for execution • objective minimize response time, resource, memory
Diversity of cost information Problem • sources autonomy • no cost info: black box • statistics only • statistics + cost formulas • instability of execution environment • network communication (e.g. daily contention) • workload (e.g. low load at night) • Objective: take into account all available cost info
Outline • Context : DISCO a mediation system • Problem cost evaluation of query plan • Proposition extensible cost model, hierarchical cost formulas • Detailed solution cost formulas hierarchical integration, cost evaluation • Validation experimentation on Web data sources • Related solutions comparison with: calibration, historical cost, query sampling • Conclusion Proposition extensible cost model
expressiveness: sufficient for heterogeneous sources extensible framework Proposition Proposed solution • declarative language • describe wrapper cost model • based on rules • hierarchical classification • integrates generic/specialized cost info • cost evaluation algorithm
Assumptions • more cost info ð better plan • cost info comes from • mediator: generic cost model • wrapper: specialized cost model • cost model used by mediator • input for query plan evaluation • no cost evaluation in wrapper
Requirements Proposed solution • cost model specified by wrapper implementor • statistics • formulas • description may be incomplete • missing statistics • missing formulas • integration: wrapper ðmediator • transfer at registration time • merge wrapper cost models into mediator
Cost Communication Language Proposed solution • exporting statistics • collections stat: total size, card, tuple size • attribute stat: index, min, max, distinct values • exporting formulas • math. formulas: selectivity, statistic of intermediate collections • cost model = set of cost rules • interface for transfer • extension of wrapper interface • textual form parsed by mediator
Cost Rule Definition Proposed solution • head is a plan (or sub-plan): e.g., scan(Publication) • body: 1 formula per cost vector component • cost vector = [Total time, Total size, Cardinality] • depends on statistics • genericity: head contains variables • unify a set of plans • e.g., scan(Collection)
Cost Integration Detailed solution • hierarchical rule classification • criteria: based on rule head variables • any wrapper = generic scope generic rule for fully autonomous sources • any collection = wrapper scope e.g., source latency • any predicate= collection scope e.g., access method specificity • ...
Count = ... TotalTime = ... ... Cost rules hierarchy Detailed solution any wrapper Generic-scope rules select (Collection, Predicate) TotalSize = ... wrapper 2: wrapper 1: Wrapper-scope rules select (Collection, Predicate) select (Collection, Predicate) TotalTime = ... TotalSize = ... Collection- scope rules select(author, Pred) select(publication, Predicate) TotalSize = ... TotalTime = ... Predicate-scope rules select(pub, year < S) select(pub, title = T) TotalTime = ... TotalTime = ... Query specific rules
2 1 2 Cost Evaluation Algorithm Detailed solution • traversing the plan tree • top-down = attach node « cost formulas • most specific formulas of the hierarchy • may attach many formulas • bottom-up = cost computation • depends on sub-nodes cost • sorting costs • many dimensions (response time, size) • compare: cost(plan1) < cost(plan2) plan
select(scan(Person), age) select(X, P) scan(X) scan scan scan Person Pub Pub Example Detailed solution "get all persons younger than 25 who publish papers" join(select(Person, age < 25), Pub, person.name = pub.author) scan(X) scan(X) select age < 25 select age < 25 scan Person
Validation experimentation on Web sources Outline • Context : DISCO a mediation system • Problem cost evaluation of query plan • Proposition extensible cost model, declarative language • Detailed solution cost formulas hierarchical integration, cost evaluation • Validation experimentation on Web data sources • Related solutions comparison with: calibration, historical cost, query sampling • Conclusion
Validation Validation • objectives • efficiency of cost model : generic vs. specialized • limited efficiency of generic cost model • maximal efficiency of specialized cost model • cost language power: • can describe heterogeneous execution models • experimentation • data: real sources on the Web, TPC-D data • queries: selection, projection, inter-site join
Validation method Validation • evaluate the cost of plans • P5 : best cost plan • execute the plans • P3 : best response time plan • compare response time P3 / P5 • according to the cost model • generic or specialized • according to the query and data • varying selectivity, cardinality of collections
Experimental system: Sources • Validation • bibliographic data sources on the Web • ACM : 80MB card: 30000 • DBLP: 60 MB card: 90000 • data from TPC-D benchmark • stored in Oracle, access via JDBC • size: 1GB (scaling factor=1) • collections card: 150K - 1.5M
Wrappers for bibliographical sources • Validation • Wrapper for ACM source • schema: Publication(author, title, conf, year) • capability: select on author(optionally conf, year) • cost info: only statistics, no formula • Wrapper for DBLP source • schema: Publication(author, title, conf, year) • capability: select on author,select on conf and year (=) • cost info: select(Pub, pred(conf=x2, year=x3)) :- totalTime = S0 + card(Pub) * sel(pred) * S1
Experimental system : Mediator • Validation • schema • Publication : ACM Pub or DBLP Pub (replication) • generic cost model • cost of select: • select(Pub, Pred) : totalTime = S0 + card(Pub) * S1 • cost of select on indexed attribute : • select(Pub, Pred) : totalTime = S0 + card(Pub) * sel(pred) * S1
Cost model for select operation Optimizer choice ? s s s = submit to wrapper • Validation select * fromPublication where author = x1 and conf = x2 and year < x3 plan P2: union xmin<xi<x3 plan P1: select author = x1 Select author = x1 conf = x2 year < x3 generic cost model P1 materialized select conf = x2 year = xi specialized cost model P1 P2 ACM DBLP
Performance results • Validation Execution of P1 vs. P2 Efficiency loss for low selectivity
Performance results • Validation Execution of P1 vs. P2 Efficiency loss for high selectivity
Cost model for join operation a.author=b.author s s s s • Validation Inter-site join select * from ACM a, DBLP b where a.author = b.author and a.conf=SIGMOD plan P1: plan P2: hash-join a.author=b.author dep-join DBLP select conf = SIGMOD select conf = SIGMOD select author = X ACM ACM DBLP
Performance results • Validation Response time and cost of joins Efficiency of specialized cost model
o.custkey=c.custkey s s s s s Specific projection within TPC-D • Validation Order(orderkey, custkey, comment...), Customer(custkey, name) select * from Order o, Customer c where o.custkey = c.custkey plan P2: tmp.orderkey=c.orderkey plan P1: tmp.custkey=c.custkey select orderkey=X dep-join materialized Order Order select custkey=X select custkey=X proj orderkey, custkey Customer Customer Order
Performance results • Validation Response time: plan 2 / plan1 Generic cost model efficiency
Related work comparison with other approaches Outline • Context : DISCO a mediation system • Problem cost evaluation of query plan • Proposition extensible cost model, declarative language • Detailed solution cost formulas hierarchical integration, cost evaluation • Validation experimentation on Web data sources • Related solutions comparison with: calibration, historical cost, query sampling • Conclusion
Known Solutions (1) • IRO-DB [Gardarin et al. in CoopIS'97] • hypothesis: generic cost model is unique • calibration of parameters • Hermes [Subrahmanian et al. in SIGMOD'96] • hypothesis: black box, stability over time • historical costs: cost[initial delay, total time, cardinality] = table[query]
Known Solutions (2) • Cords [Zhu, Larson in Distributed & Parallel DB, 98] • hypothesis: similar query have similar cost • query sampling + regression cost model • Garlic [Roth, Ozcan, Haas in VLDB ’99] • multimedia context, cost of image retrieval • cost [cold resp. time, hot resp. time, cardinality] • override wrapper cost functions
Comparison with related work • Conclusion • generalizes calibration and sampling approach • integrates Hermes historical cost • bottom layer of cost rules hierarchy • useful for query caching • Garlic confirmation • cost models do matter: need for wrapper input • extensible : cost of method call • trade-off expressiveness/abstraction
Conclusion • Conclusion • Disco’s cost model is efficient • easy specialization of cost model for access method • yields 100% efficiency for typical web queries • improves logical and physical optimization • Extensible • may specify constraints imposed by source • periodic variation, resource limitation • Flexible • fine granularity of cost formulas
Directions for Future Work • Conclusion • update cost model at runtime • polling / notification • cost rules for query caching • optimize cost evaluation algorithm • stop condition : response time < Tmax or result size < Smax • extensions • EC : cost = price, new parameters • cost of path expression, full-text search
References • Disco’s cost model definition • Hubert Naacke, Georges Gardarin, and Anthony Tomasic. Leveraging Mediator Cost Models with Heterogeneous Data Sources (Extended version). ICDE 1998. • Early version in BDA 1997. • Disco’s cost model validation • Hubert Naacke, Anthony Tomasic, and Patrick Valduriez. Validating Mediator Cost Models with Disco. to appear, NISJ 1999. • Disco implementation • Anthony Tomasic, Rémy Amouroux, Philippe Bonnet, Olga Kapitskaia, Hubert Naacke, and Louiqa Raschid. The Distributed Information Search Component (Disco) and the World Wide Web. In ACM SIGMOD 1997, Research Prototype Demonstration. • MIRO-WEB Esprit Project, Spanish Hospital Application.
T • Validation
DBLP Wrapper : Materialization • Validation • materialized view • V(conf, year) : select * fromPublication where conf = C and year = Y materialized view
Cost model for projection select * from Publication pub, Person pers where pub.author = pers.name plan 1: s s s s s materialized Person select author=X Publication • Validation Person(name, picture), Publication(author, ...) plan 2: select name=X Person project name select author=X Person Publication
T • Validation