Adaptive Query Processing for Data Aggregation: Mining, Using and Maintaining Source Statistics. M.S. Thesis Defense by Jianchun Fan. Committee Members: Dr. Subbarao Kambhampati (chair), Dr. Huan Liu, Dr. Yi Chen. April 13, 2006
Introduction • Data Aggregation: Vertical Integration • Mediator: R (A1, A2, A3, A4, A5, A6) • S1: R1 (A1, A2, _, _, A5, A6) • S2: R2 (A1, _, A3, A4, A5, A6) • S3: R3 (A1, A2, A3, A4, A5, _)
Introduction • Query Processing in Data Aggregation • Sending every query to all sources? • Increases the workload on sources • Consumes a lot of network resources • Keeps users waiting • Primary processing task: selecting the most relevant sources with respect to different user objectives, such as completeness and quality of the answers and response time • Need several types of source statistics to guide source selection • Usually not directly available
Introduction • Challenges • Automatically gather various types of source statistics to optimize individual goals • Many answers (high coverage) • Good answers (high density) • Answered quickly (short latency) • Combine different statistics to support multi-objective query processing • Maintain statistics dynamically
System Overview • Test beds: • BibFinder: online bibliography mediator system, integrating DBLP, IEEE Xplore, CSB, Network Bibliography, ACM Digital Library, etc. • Synthetic test bed: 30 synthetic data sources (based on the Yahoo! Auto database) with different coverage, density and latency characteristics.
Outline • Introduction & Overview • Coverage/Overlap Statistics • Learning Density Statistics • Learning Latency Statistics • Multi-Objective Query Processing • Other Contribution • Conclusion
Coverage/Overlap Statistics • Coverage: how many answers a source provides for a given query • Overlap: how many common answers a set of sources share for a given query • Based on Nie & Kambhampati [ICDE 2004]
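The two definitions above can be sketched with plain sets. This is a minimal illustration, not the learning method of Nie & Kambhampati: the answer sets are hypothetical, and normalizing by the union of all known answers is an assumption of this sketch.

```python
# Hypothetical answer sets: each source maps to the tuple ids it returns for one query.
def coverage(source_answers, all_answers):
    """Fraction of all known answers that a single source returns."""
    if not all_answers:
        return 0.0
    return len(source_answers & all_answers) / len(all_answers)


def overlap(answer_sets, all_answers):
    """Fraction of all known answers that every source in the set returns."""
    if not answer_sets or not all_answers:
        return 0.0
    common = set.intersection(*answer_sets)
    return len(common & all_answers) / len(all_answers)


answers = {"S1": {1, 2, 3, 4}, "S2": {3, 4, 5}, "S3": {4, 5, 6}}
universe = set().union(*answers.values())  # assumption: union of answers = all answers

cov_s1 = coverage(answers["S1"], universe)                         # 4 of 6 answers
ovl_s1_s2 = overlap([answers["S1"], answers["S2"]], universe)      # 2 of 6 answers
```

A source with high coverage but high overlap with an already selected source adds few new answers, which is why both statistics are needed for source selection.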
Density Statistics • Coverage measures “vertical completeness” of the answer set • “Horizontal completeness” is important too – quality of the individual answers • Density statistics measure the horizontal completeness of the individual answer tuples
Defining Density • Density of a source w.r.t. a given query: average of the density of all answers • Example query: Select A1, A2, A3, A4 (projection attribute set) From S Where A1 > v1 (selection predicates) • Density = (1 + 0.5 + 0.5 + 0.75) / 4 = 0.6875 • Learning density for every possible source/query combination? – too costly • The number of possible queries is exponential in the number of attributes
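A small sketch of the definition above: the density of one answer tuple is the fraction of projected attributes that are not missing, and the density of the source is the average over its answers. The rows and the use of `None` for a missing value are assumptions made for illustration.

```python
def tuple_density(row, projection):
    """Fraction of projected attributes with a non-missing value in one answer tuple."""
    present = sum(1 for attr in projection if row.get(attr) is not None)
    return present / len(projection)


def source_density(rows, projection):
    """Average tuple density over all answers returned by the source."""
    if not rows:
        return 0.0
    return sum(tuple_density(row, projection) for row in rows) / len(rows)


projection = ["A1", "A2", "A3", "A4"]
# Hypothetical answers; None marks a missing value.
rows = [
    {"A1": 5, "A2": "x", "A3": "y", "A4": "z"},     # density 1.0
    {"A1": 6, "A2": None, "A3": "y", "A4": None},   # density 0.5
    {"A1": 7, "A2": "x", "A3": None, "A4": None},   # density 0.5
    {"A1": 8, "A2": "x", "A3": "y", "A4": None},    # density 0.75
]

density = source_density(rows, projection)  # (1 + 0.5 + 0.5 + 0.75) / 4
```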
Learning Density Statistics • A more realistic solution: classify the queries and learn density statistics only w.r.t. the classes • Assumption: if a tuple t represents a real-world entity E, then whether or not t has a missing value on attribute A is independent of E’s actual value of A. • Example query: Select A1, A2, A3, A4 (projection attribute set) From S Where A1 > v1 (selection predicates)
Learning Density Statistics • Query class for density statistics: projection attribute set • For queries whose projection attribute set is (A1, A2, …, Am), there are 2^m different types of answers (density patterns) • Example: for (A1, A2) there are 2^2 = 4 density patterns, where ~A marks a missing value: dp1 = (A1, A2), dp2 = (A1, ~A2), dp3 = (~A1, A2), dp4 = (~A1, ~A2) • Density([A1, A2] | S) = P(dp1 | S) * 1.0 + P(dp2 | S) * 0.5 + P(dp3 | S) * 0.5 + P(dp4 | S) * 0.0
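The two-attribute example appears to be an instance of a general weighted sum over density patterns. The form below is a reading of that example rather than a formula stated on the slide; the symbol $\mathrm{present}(dp)$, the set of attributes that are not missing in pattern $dp$, is introduced here for illustration.

```latex
\[
\mathrm{Density}\bigl([A_1,\dots,A_m] \mid S\bigr)
  \;=\; \sum_{dp \,\in\, \{\text{the } 2^m \text{ density patterns}\}}
        P(dp \mid S)\,\cdot\,\frac{\lvert \mathrm{present}(dp) \rvert}{m}
\]
```

For $m = 2$ the four weights are $1.0$, $0.5$, $0.5$ and $0.0$, which matches the example for $(A_1, A_2)$.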
Learning Density Statistics • R(A1, A2, …, An) • 2^n possible projection attribute sets: (A1), (A1, A2), (A1, A3), …, (A1, A2, …, Am), … • For each projection attribute set of size m, 2^m possible density patterns: (A1, A2, …, Am), (~A1, A2, …, Am), (~A1, ~A2, …, Am), …, (~A1, ~A2, …, ~Am) • For each data source S, the mediator needs to estimate joint probabilities!
Learning Density Statistics • Independence Assumption: the probability of tuple t having a missing value on attribute A1 is independent of whether or not t has a missing value on attribute A2. • For queries whose projection attribute set is (A1, A2, …, Am), only need to assess m probability values for each source! • Joint distribution: P(A1, ~A2 | S) = P(A1 | S) * (1 - P(A2 | S)) • Learned from a sample of the data source
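A minimal sketch of how the independence assumption could be used: estimate one presence probability per attribute from a sample, then build any density-pattern probability as a product. The sample rows, function names, and the use of `None` for missing values are assumptions of this sketch, not details from the thesis.

```python
from itertools import product


def attribute_presence(sample, attributes):
    """Estimate P(A | S): fraction of sampled tuples with a non-missing value for A."""
    if not sample:
        raise ValueError("sample must not be empty")
    n = len(sample)
    return {a: sum(1 for row in sample if row.get(a) is not None) / n
            for a in attributes}


def pattern_probability(pattern, p_present):
    """P(dp | S) under independence; pattern maps attribute -> True if present."""
    prob = 1.0
    for attr, present in pattern.items():
        prob *= p_present[attr] if present else 1.0 - p_present[attr]
    return prob


def density(projection, p_present):
    """Sum over all 2^m patterns of P(dp | S) times the fraction of present attributes."""
    m = len(projection)
    total = 0.0
    for flags in product([True, False], repeat=m):
        pattern = dict(zip(projection, flags))
        total += pattern_probability(pattern, p_present) * sum(flags) / m
    return total


# Hypothetical sample of source S; None marks a missing value.
sample = [
    {"A1": 1, "A2": "a"},
    {"A1": 2, "A2": None},
    {"A1": None, "A2": "b"},
    {"A1": 4, "A2": "c"},
]
p = attribute_presence(sample, ["A1", "A2"])           # P(A1 | S) = P(A2 | S) = 0.75
p_dp2 = pattern_probability({"A1": True, "A2": False}, p)  # 0.75 * (1 - 0.75)
d = density(["A1", "A2"], p)
```

By linearity of expectation the sum over patterns equals the mean of the per-attribute presence probabilities, so the explicit enumeration is shown only to mirror the slide.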
Latency Statistics • Existing work: source-specific measurement of response time • Variations on time, day of the week, quantity of data, etc. • However, latency is often query specific • For example, some attributes are indexed • How to classify queries to learn latency? • Binding pattern [figure: example queries with the same and with different binding patterns]
Using Latency Statistics • Learning is straightforward: average over a group of training queries for each binding pattern • Effectiveness of binding-pattern-based latency statistics
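The averaging step can be sketched as follows. The attribute names, the measured times, and the "b"/"f" string encoding of a binding pattern are hypothetical; a real mediator would keep one such table per source.

```python
from collections import defaultdict


def binding_pattern(query, attributes):
    """Encode a query as 'b' (bound) or 'f' (free) per attribute, in schema order."""
    return "".join("b" if query.get(a) is not None else "f" for a in attributes)


def learn_latency(training, attributes):
    """Average the observed response time of training queries per binding pattern."""
    totals = defaultdict(lambda: [0.0, 0])  # pattern -> [sum of seconds, count]
    for query, seconds in training:
        key = binding_pattern(query, attributes)
        totals[key][0] += seconds
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}


attributes = ["title", "author", "year"]
# Hypothetical training queries against one source, with measured seconds.
training = [
    ({"title": "data integration"}, 2.0),
    ({"title": "query processing"}, 4.0),
    ({"author": "Smith"}, 0.5),
    ({"author": "Jones", "year": 2004}, 1.5),
]

latency = learn_latency(training, attributes)
estimate = latency[binding_pattern({"title": "source statistics"}, attributes)]
```

A new query is then assigned the learned average of its own binding pattern, as in the last line.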
Multi-Objective Query Processing • Users may not be easy to please… • “give me some good answers fast” • “I need many good answers” • … • These goals are often conflicting! • A decoupled optimization strategy won’t work • Example: • S1(coverage = 0.60, density = 0.10) • S2(coverage = 0.55, density = 0.15) • S3(coverage = 0.50, density = 0.50)
Multi-Objective Query Processing • The mediator needs to select sources that are good in many dimensions • “Overall optimality” • Query selection plans can be viewed as 3-dimensional vectors • Option 1: Pareto optimal set • Option 2: aggregating multi-dimensional vectors into scalar utility values
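Option 1 can be illustrated with a small Pareto filter. This is a generic sketch, not the thesis implementation: the plans are hypothetical (coverage, density) pairs where larger is better, S4 is invented so that one plan is dominated, and a latency dimension would have to be negated to fit the same convention.

```python
def dominates(a, b):
    """a dominates b if it is at least as good everywhere and strictly better somewhere."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_set(plans):
    """Names of the plans that no other plan dominates, in sorted order."""
    return sorted(
        name for name, vec in plans.items()
        if not any(dominates(other, vec)
                   for other_name, other in plans.items() if other_name != name)
    )


# (coverage, density) per plan; S4 is a hypothetical addition.
plans = {
    "S1": (0.60, 0.10),
    "S2": (0.55, 0.15),
    "S3": (0.50, 0.50),
    "S4": (0.45, 0.40),
}

front = pareto_set(plans)  # S4 is dominated by S3; the other three are incomparable
```

Three of the four plans survive the filter, which hints at why a Pareto set alone may leave too many candidates and a scalar utility (Option 2) is attractive.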
Multi-Objective Query Processing • Discount model • Weighted sum model • 2D coverage
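The slide names the two utility models without giving their formulas, so the functional forms below are assumptions chosen only to show the general idea: a weighted sum trades a quality score against speed, and a discount model shrinks quality as latency grows. The parameters `weight`, `gamma`, `max_latency` and the plan values are all hypothetical.

```python
def weighted_sum_utility(quality, latency, weight, max_latency):
    """ASSUMED form: weight * quality + (1 - weight) * speed, speed = 1 - latency / max."""
    speed = 1.0 - min(latency, max_latency) / max_latency
    return weight * quality + (1.0 - weight) * speed


def discount_utility(quality, latency, gamma):
    """ASSUMED form: quality discounted geometrically by latency, gamma in (0, 1]."""
    return quality * gamma ** latency


def best_plan(plans, score):
    """Name of the plan with the highest utility under the given scoring function."""
    return max(plans, key=lambda name: score(*plans[name]))


# Hypothetical plans: (quality score in [0, 1], latency in seconds).
plans = {"fast_sparse": (0.2, 1.0), "slow_rich": (0.8, 8.0)}

quality_first = best_plan(plans, lambda q, t: weighted_sum_utility(q, t, 0.9, 10.0))
speed_first = best_plan(plans, lambda q, t: weighted_sum_utility(q, t, 0.1, 10.0))
impatient = best_plan(plans, lambda q, t: discount_utility(q, t, 0.5))
patient = best_plan(plans, lambda q, t: discount_utility(q, t, 1.0))
```

Changing a single parameter flips the selected plan, which is the kind of user-specific trade-off a scalar utility is meant to expose.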
Other Contribution • Incremental Statistics Maintenance (In Thesis)
Other Contribution • A snapshot of public web services (not in Thesis) [SIGMOD Record, Mar. 2005] • Implications and lessons learned: • Most publicly available web services support simple data sensing and conversion, and can be viewed as distributed data sources • Discovery/retrieval of public web services is not beyond what the commercial search engines do. • Composition: • Very few services available – little correlation among them • Most composition problems can be solved with existing data integration techniques
Other Contribution • Query Processing over Incomplete Autonomous Databases [with Hemal Khatri] • Retrieving uncertain answers where constrained attributes are missing • Learning Approximate Functional Dependencies and Classifiers to reformulate the original user queries • Original query: select * from cars where model = “civic” • Approximate functional dependency: (Make, Body Style) → Model • Q1: select * from cars where make = “Honda” and BodyStyle = “sedan” • Q2: select * from cars where make = “Honda” and BodyStyle = “coupe”
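The reformulation step can be sketched as below. A majority vote over a sample stands in for the learned classifier, which is an assumption of this sketch; the sample rows, column names, and the `rewrite_queries` helper are hypothetical, and the SQL strings are built only for display.

```python
from collections import Counter, defaultdict


def rewrite_queries(sample, determining, target, value, table="cars"):
    """Rewrite `target = value` into queries on the determining attributes.

    Stand-in classifier: a combination of determining values predicts the
    target value it co-occurs with most often in the sample.
    """
    by_combo = defaultdict(Counter)
    for row in sample:
        combo = tuple(row.get(a) for a in determining)
        if row.get(target) is None or None in combo:
            continue  # skip incomplete sample rows
        by_combo[combo][row[target]] += 1

    queries = []
    for combo, counts in sorted(by_combo.items()):
        predicted, _ = counts.most_common(1)[0]
        if predicted == value:
            where = " and ".join(f"{a} = '{v}'" for a, v in zip(determining, combo))
            queries.append(f"select * from {table} where {where}")
    return queries


# Hypothetical sample used to learn (make, body_style) -> model.
sample = [
    {"make": "Honda", "body_style": "sedan", "model": "civic"},
    {"make": "Honda", "body_style": "sedan", "model": "civic"},
    {"make": "Honda", "body_style": "sedan", "model": "accord"},
    {"make": "Honda", "body_style": "coupe", "model": "civic"},
    {"make": "Toyota", "body_style": "sedan", "model": "camry"},
]

rewritten = rewrite_queries(sample, ["make", "body_style"], "model", "civic")
```

The rewritten queries constrain only make and body style, so they can reach tuples whose model value is missing in the source.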
Conclusion • A comprehensive framework • Automatically learns several types of source statistics • Uses statistics to support various query processing goals • Optimizes in individual dimensions (coverage, density & latency) • Joint optimization over multiple objectives • Adaptive to different users’ own preferences • Dynamically maintains source statistics