Validating an Access Cost Model for Wide Area Applications
Louiqa Raschid, University of Maryland
CoopIS 2001
Co-authors: V. Zadorozhny, T. Zhan and L. Bright
Scalable Wide-Area Applications
Problems
• Wide area environment is dynamic (noisy)
• Wide variability in latency (end-to-end delay)
• Network and server workloads are unknown
• Time and Day dependencies impact latency
• Dynamic environment - constantly monitored
Research Objective: Use query feedback to monitor and learn behavior, and to predict access cost distributions that may be Time and Day dependent
L. Raschid, University of Maryland, CoopIS01
Talk Outline
• Architecture for Wide Area Applications
• WebPT: Tool to predict access costs
• WebPT based Access Cost Catalog
• Grouping of WebSources based on observable WebSource characteristics
• Hypotheses to test WebPT based Catalog: High Prediction Accuracy versus Low Prediction Accuracy
• Validation based on experimental case study
Architecture for WebPT based Catalog
Predicting Response Times for Accessing WebSources
Problem: Difficulty in determining evaluation costs
• Physical implementation details unknown
• Load on network and WebSource unknown
Objective:
• Use query feedback to learn access costs
• Exploit Time of day, Day of week, etc. to predict costs
• Identify easily observable WebSource characteristics
• Determine prediction accuracy for WebSources based on WebSource characteristics
Metrics in WebPT Access Cost Model
• WebSource and Network Costs
  • Query Processing at WebSource
  • Downloading data from WebSource (extraction cost)
• Wrapper Statistics
  • Number of Pages Accessed
  • Cardinality of Result
  • Statistics may be dependent on value of query binding
• WebPT - a tool for learning using query feedback and predicting access cost based on parameters such as Day, Time, Qty of data, Cardinality, etc.
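The metrics listed on this slide can be pictured as one record per submitted query. The sketch below is only an illustration of such a record: the class name, every field name, and the sample values are hypothetical and are not taken from WebPT.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class QueryFeedback:
    """One observation of a query sent to a WebSource (hypothetical field names)."""
    source: str
    submitted_at: datetime
    response_time_s: float   # observed end-to-end delay, in seconds
    bytes_downloaded: int    # quantity of data extracted
    pages_accessed: int      # wrapper statistic
    cardinality: int         # wrapper statistic: size of the result

    def dimensions(self):
        """Return (day of week, hour of day, quantity, cardinality)."""
        return (self.submitted_at.weekday(), self.submitted_at.hour,
                self.bytes_downloaded, self.cardinality)


# Illustrative values only; not measurements from the case study.
obs = QueryFeedback("GNIS", datetime(2000, 3, 6, 14, 30), 2.5, 40960, 3, 120)
print(obs.dimensions())
```

Keeping the raw timestamp and deriving Day and Time on demand leaves the choice of time granularity to the predictor.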
WebPT Learning
WebPT based Prediction
• WebPT is configured for some hierarchy of dimensions: Quantity, Day, Time, Cardinality
• WebPT Learning algorithm
  • Cell splitting
  • Smoothing
  • Estimate response time and confidence
  • Similar to CART (regression versus heuristics)
  • Cell merging
• Heuristics used in calibration of each cell
  • Dimension - min / max / scale
  • Allowed deviation
  • Confidence window
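The slide does not give the details of cell splitting, smoothing, or merging, so the following is not the WebPT algorithm. It is a simplified stand-in that shows the general idea of learning per-cell estimates over a hierarchy of dimensions: fixed Day and Time buckets, with prediction falling back to a coarser cell when the fine cell has too few observations. The class names and the parameters `hours_per_bucket` and `min_samples` are assumptions made for illustration.

```python
import math
from collections import defaultdict


class RunningStats:
    """Welford running mean and variance for one cell."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def std(self):
        return math.sqrt(self.m2 / (self.n - 1)) if self.n > 1 else 0.0


class CellPredictor:
    """Hypothetical cell-based predictor; a stand-in, not WebPT itself."""

    def __init__(self, hours_per_bucket=6, min_samples=3):
        self.hours_per_bucket = hours_per_bucket
        self.min_samples = min_samples
        self.cells = defaultdict(RunningStats)

    def _keys(self, day, hour):
        bucket = hour // self.hours_per_bucket
        # Finest cell first, then coarser cells, ending with the global cell.
        return [(day, bucket), (day, None), (None, None)]

    def learn(self, day, hour, response_time):
        """Record one observed response time in every enclosing cell."""
        for key in self._keys(day, hour):
            self.cells[key].add(response_time)

    def predict(self, day, hour):
        """Return (estimate, spread) from the finest cell with enough samples."""
        for key in self._keys(day, hour):
            stats = self.cells.get(key)
            if stats is not None and stats.n >= self.min_samples:
                return stats.mean, stats.std()
        return None  # nothing learned yet


predictor = CellPredictor()
for t in (2.0, 4.0, 6.0):
    predictor.learn(day=0, hour=9, response_time=t)
print(predictor.predict(day=0, hour=9))   # uses the (day, time bucket) cell
print(predictor.predict(day=3, hour=9))   # falls back to the global cell
```

The spread returned with each estimate plays the role of a confidence measure here; a tool that splits and merges cells adaptively would replace the fixed buckets.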
Prediction Accuracy of WebPT based Cost Model is strongly correlated with the following:
• Observable WebSource Characteristics
  • Significance of Time and Day in predicting workload at the server and on the network
  • Variance (noise) in accessing server
• Quality of available statistics - cardinality
  • Random bindings - large variance in cardinality
  • Fixed bindings - better estimation of cardinality
Case Study: Data Gathering and Experiment
• 6 data sources in the public domain
• Data gathered for several weeks in 1999 and 2000
• Queries submitted to WebSources periodically
• Recorded TTF and TTL
• Query bindings affected result cardinality
  • Random bindings - more than 50 bindings
  • Fixed bindings - 2 bindings each for [S,M,L]
• Mediator queries - simple scan to complex 5-way join over data in 5 WebSources (not reported)
Observable WebSource Characteristics
Grouping of WebSources based on Characteristics
• G1: T and D significant; Noise can vary
• G2: Noise High
• G3: T, D not significant; Noise Low - EMPTY
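The group definitions above overlap: a source whose Time and Day are significant and whose noise is high belongs to both G1 and G2. A minimal sketch of that assignment follows. The function name and the two boolean flags are hypothetical, and how each flag is decided for a real WebSource is not specified here.

```python
def assign_groups(time_day_significant: bool, noise_high: bool) -> set:
    """Map two observed characteristics to the groups G1, G2, G3."""
    groups = set()
    if time_day_significant:
        groups.add("G1")   # T and D significant; noise can vary
    if noise_high:
        groups.add("G2")   # noise high
    if not time_day_significant and not noise_high:
        groups.add("G3")   # T, D not significant; noise low
    return groups


print(assign_groups(True, True))    # a source in both G1 and G2
print(assign_groups(False, False))  # a source that would fall in G3
```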
Hypotheses to test WebPT based Access Cost Catalog
• H1: High prediction accuracy for the following WebSources
  • T, D are significant and Low Noise
  • Sources are in G1 but not in G2
• H2: Catalog will improve prediction accuracy for the following WebSources
  • T, D are significant, independent of noise
  • Group G1
• H3: Statistics may be dependent on value of query binding
  • Prediction accuracy improves with learning on fixed bindings
  • Sources in both groups
Prediction Accuracy for WebSources WebPT(Lo) - Random bindings
WebSource Characteristics and Correlation With Prediction Accuracy
Groupings of WebSources and Correlation with Prediction Accuracy
• G1: T and D significant
• G2: Noise High
• GNIS: High Pred Accuracy; in G1 AND G2
• FAA, FishBase: Low Pred Accuracy while in G1; Noisy
Quantile Plots of Relative Error of Prediction for ACM, Aircraft
Quantile Plot of Relative Error of Prediction for FAA, GNIS
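The data behind a quantile plot of relative error can be derived from pairs of predicted and observed response times. The sketch below makes two assumptions that the slides do not state: relative error is taken as |predicted - actual| / actual, and each sorted error is paired with the plotting position (i + 0.5) / n. The function names and the sample numbers are illustrative only.

```python
def relative_errors(predicted, actual):
    """Relative error per query, assuming |predicted - actual| / actual."""
    return [abs(p - a) / a for p, a in zip(predicted, actual)]


def quantile_points(errors):
    """Pair each sorted error with its plotting position (i + 0.5) / n."""
    ordered = sorted(errors)
    n = len(ordered)
    return [((i + 0.5) / n, e) for i, e in enumerate(ordered)]


# Illustrative numbers only; not measurements from the case study.
predicted = [2.0, 5.0, 9.0, 4.0]
actual = [4.0, 5.0, 6.0, 8.0]
for fraction, error in quantile_points(relative_errors(predicted, actual)):
    print(f"{fraction:.3f} {error:.2f}")
```

Plotting the fraction on one axis and the error on the other shows what share of predictions fall under a given relative error.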
Summary + Impact
• Unique Case Study: WebPT based Access Cost Catalog and Cost distributions
• Grouping of WebSources based on observable WebSource characteristics
• High Prediction Accuracy for some sources in G1 (T, D significant) with low noise
• High Prediction Accuracy for some sources in G1 and in G2 (High Noise)
• Similar results for Mediator cost model and complex N-way joins over multiple WebSources