WIC: A General-Purpose Algorithm for Monitoring Web Information Sources • Sandeep Pandey (speaker), Kedar Dhamdhere, Christopher Olston • Carnegie Mellon University
Dynamic Information on the Web • Bulletin boards • Online auctions • News • Weather • Roadway conditions, Sports scores, etc…
Continuous Query Systems • Process information from dynamic Web sources automatically • e.g., CONQUER [Liu et al. WWW 1999], Niagara [Naughton et al. SIGMOD 2000], WebCQ [Liu et al. CIKM 2000]
Past Research on CQ Systems • Focus on language design, query processing • Assume “push” model of information access • Information shows up at doorstep • Web sources are “pull” oriented • Must explicitly download Web pages, check for changes, submit changes to CQ engine
Converting Pull -> Push • [Diagram: auction sites and sports sites are accessed by pull; WIC sits between these sources and the CQ engine, which receives changes by push]
Converting Pull -> Push • Topic has received little attention • So far only heuristics with no formal guarantees • Periodic polling of sources • Not scalable • CAM [Pandey et al. WWW’03], Gal et al. [JACM 2001]: • Take into account predicted change behavior • Create monitoring schedule in advance
A good first step, but … • No formal guarantees • Suits narrow range of applications
Example Application Scenarios • [2×2 grid: change behavior (append-only vs. complete overwrite) × timeliness (not critical vs. critical)]
Outline • Introduction • Problem statement • WIC: Web Information Collector • Formal results: • WIC is a 2-approximation • Experimental results: • Timeliness-completeness tradeoff
Databases @Carnegie Mellon Model of Pull-Oriented Sources • Proposed by Wolf et al. [WWW 2002] • Set of Web pages of interest P1 … Pn • Importance weight associated with each page • Time is divided into discrete time instants • Change: An update posted on a Web page • Known probability πij that page Pi will change at time Tj • We do not address the problem of estimating change probabilities
Our Model • [Figure: change probabilities for pages P1, P2, P3 plotted over time]
Modeling the Change Characteristics • [2×2 grid: change behavior (append-only vs. complete overwrite) × timeliness (not critical vs. critical)]
Modeling the Change Characteristics • Life: the probability that a change to page Pi at time Tj remains available at a later time Tk • Case 1: changes overwrite old info • Case 2: append-only • Also: sliding window, others …
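The two cases above can be sketched in code. This is a minimal illustration, not the authors' definition: it assumes changes at different time instants are independent, that `change_probs[t]` holds the page's change probability at instant `t`, and the function names are hypothetical.

```python
from math import prod


def life_overwrite(change_probs, j, k):
    """Probability that a change posted at instant j is still visible at instant k.

    Assumption: every later change overwrites the page, and changes at
    different instants are independent, so the change survives only if no
    change occurs at instants j+1 .. k.
    """
    if k < j:
        raise ValueError("k must not precede j")
    return prod(1.0 - change_probs[t] for t in range(j + 1, k + 1))


def life_append_only(change_probs, j, k):
    """Append-only pages never erase old information, so a change always survives."""
    if k < j:
        raise ValueError("k must not precede j")
    return 1.0
```

For example, with change probabilities `[0.5, 0.5, 0.2]`, a change posted at instant 0 on an overwrite page survives to instant 2 with probability (1 - 0.5) × (1 - 0.2) = 0.4, while on an append-only page it survives with probability 1.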
Web Monitoring Requirements • [2×2 grid: change behavior (append-only vs. complete overwrite) × timeliness (not critical vs. critical)]
Conflicting Requirements • Completeness: maximize number of changes captured • Timeliness: minimize delay in capturing changes • Limited resources • Up to C pages can be monitored per time instant • When resources are not plentiful, the two objectives can be at odds with each other
Timeliness-Completeness Tradeoff • Resource constraint: C=1 • [Figure: change probabilities over time for P1 (append-only) and P2 (overwrite)]
Only Timeliness • Objective: Changes must be captured with zero delay • [Figure: change probabilities over time for P1 (append-only) and P2 (overwrite)]
Only Completeness • Objective: Maximize the number of changes captured • [Figure: change probabilities over time for P1 (append-only) and P2 (overwrite)]
Controlling the Tradeoff • Urgency: Importance of information captured as a function of delay in capturing • [Figure: example urgency functions]
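A few urgency functions can be sketched as follows. The shapes are assumptions chosen to match the extremes discussed in this deck (zero-delay capture versus pure completeness); the exponential form `exp(-r * delay)` is one plausible reading of an exponentially decaying urgency parameterized by r, not necessarily the exact form used by the authors, and all function names are hypothetical.

```python
import math


def urgency_zero_delay(delay):
    """Steepest curve: a change only counts if captured with zero delay."""
    return 1.0 if delay == 0 else 0.0


def urgency_flat(delay):
    """Flattest curve: delay does not matter, only completeness does."""
    return 1.0


def urgency_exponential(delay, r):
    """Assumed exponential decay; larger r gives a steeper curve (timeliness matters more)."""
    if delay < 0 or r < 0:
        raise ValueError("delay and r must be non-negative")
    return math.exp(-r * delay)
```

Under this assumed form, `r = 0` reduces to the flat curve, and increasing `r` approaches the zero-delay curve.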
Web Monitoring Requirements • [2×2 grid: change behavior (append-only vs. complete overwrite) × timeliness (not critical: gradual urgency curve; critical: steep urgency curve)]
Web Monitoring Objective • Maximize Utility • Utility = Expected number of changes captured, weighted by delay according to urgency function • Each monitoring action takes unit amount of resource • Resource constraint: amount of resource per time unit is constrained
Our Solution • Web Information Collector (WIC) • 2-approximation for all scenarios • Total utility accrued at least half that accrued by optimal monitoring schedule • Finds optimal solution in the following special case: • Timeliness is critical, changes overwrite
Web Information Collector (WIC) • Online, greedy strategy • At each time instant, download page(s) with highest utility • Utility combines: • Probability that a change has occurred • Probability that change has not been erased • Delay in capturing change (weighted according to urgency function)
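The greedy strategy above can be sketched as follows. This is an illustrative simplification under stated assumptions, not the authors' implementation: the utility of downloading a page at instant k is taken to be the page weight times the sum, over every instant j since its last download, of (change probability at j) × (probability the change has not been erased by k) × (urgency of delay k - j). All function and parameter names are hypothetical, and utilities are recomputed from scratch rather than maintained incrementally.

```python
import heapq
import math


def life_overwrite(probs, j, k):
    # Assumed overwrite model: a change at j survives to k only if no change
    # occurs at instants j+1 .. k (independence between instants is assumed).
    return math.prod(1.0 - probs[t] for t in range(j + 1, k + 1))


def urgency_zero_delay(delay):
    # Steep urgency: only changes captured with zero delay count.
    return 1.0 if delay == 0 else 0.0


def page_utility(probs, weight, last, k, life, urgency):
    # Expected weighted value of the changes posted since the last download
    # (instants last+1 .. k) that would be captured by downloading at instant k.
    return weight * sum(
        probs[j] * life(probs, j, k) * urgency(k - j)
        for j in range(last + 1, k + 1)
    )


def greedy_schedule(change_probs, weights, capacity, life, urgency):
    """At each instant, download the `capacity` pages with the highest utility."""
    n = len(change_probs)
    horizon = len(change_probs[0])
    last = [-1] * n  # instant of the most recent download of each page
    schedule, total = [], 0.0
    for k in range(horizon):
        scored = [
            (page_utility(change_probs[i], weights[i], last[i], k, life, urgency), i)
            for i in range(n)
        ]
        chosen = heapq.nlargest(capacity, scored)
        for utility, i in chosen:
            last[i] = k
            total += utility
        schedule.append(sorted(i for _, i in chosen))
    return schedule, total


# Two overwrite pages, one download per instant, zero-delay urgency.
schedule, total = greedy_schedule(
    [[0.9, 0.1], [0.2, 0.8]], [1.0, 1.0], 1, life_overwrite, urgency_zero_delay
)
```

In this small example the sketch downloads page 0 at the first instant and page 1 at the second. Because only the current instant is inspected, the change-probability forecasts for an instant are not needed before that instant arrives.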
WIC continued • Running time: • O(# pages) per time instant under most settings of life and urgency • WIC is an online algorithm • Forecasting can be done at last minute
Proof of 2-Approximation • See our paper
Experiments • [2×2 grid: change behavior (append-only vs. complete overwrite) × timeliness (not critical vs. critical)] • Data: 7550 auction pages • Exponentially decaying urgency function parameterized by r
Experimental Results in Paper • Sensitivity to error in prediction • Not unduly sensitive • Comparison against prior approach (CAM) • Up to 80% improvement • Handles more applications • Timeliness-Completeness tradeoff
Timeliness-Completeness tradeoff • [Figure: plot annotated "favor timeliness" and "favor completeness"]
Summary • Pull -> push • Can’t have it all - Choose a combination of timeliness and completeness • Our solution: WIC - Handles many applications - Formal guarantee: 2-approximation - Online algorithm
Urgency Parameter Controls Timeliness-Completeness Tradeoff • Best curve to use depends on application • App 1: Agent to monitor and bid in online auctions on behalf of many customers • Use steep curve (timeliness is critical) • App 2: Program to maintain database of large number of online resumes • Use gradual curve (timeliness less critical)
Experiments • Determine exact change occurrence times • Add noise to simulate prediction inaccuracy: - False positives - False negatives - Gaussian spreading