Monitoring the dynamic Web to respond to Continuous Queries • Sandeep Pandey, Krithi Ramamritham, Soumen Chakrabarti • IIT Bombay • www.cse.iitb.ac.in/laiir/
Motivation • Web pages change rapidly: • 40% of commercial pages and • 23% of all pages change per day (Sethuraman et al.) • Current search engine users • Need to repeat queries (how often?) and • Diff results with recent versions • Or poll frequently updated collections (e.g., Google News)
Continuous Queries (CQ) • Users register long-lived queries of interest • Pages of interest may be added, modified, and deleted • System continually updates responses • Example applications • Commuter updates: traffic and weather conditions • Alerts on cricket scores, stock portfolios
Discrete vs. continuous queries • Discrete queries • Query lives for an “instant”, one-shot answer • Optimize corpus freshness at all times • Objective penalizes delay from update to refresh • Usually handled by bulk crawls with diverse periods • Continuous queries • Queries have positive lifetime, many updates over time • Updates must track changes closely • Objective penalizes number or importance of missed updates • Dynamic monitoring with more restrictive network resources
Talk outline • Introduction and motivation • Previous approaches • Our contributions • Continuous Adaptive Monitoring (CAM) • How to allocate limited polling resources among pages • How to schedule poll instants • Experiments • Conclusion
Related work • CONQUER and WebCQ (Liu, Pu and Tang) • Query language and architecture for CQ • Do not address monitoring for freshness optimization • NIAGARA (DeWitt and Naughton) • Query evaluation and optimization techniques • Database query optimization setting • ChangeDetector (Boyapati et al.) • Fixed-priority polling for given set of pages • Freshness for discrete queries • Poisson updates (Cho and Garcia-Molina) • Quasi-deterministic and other distributions (Sethuraman, Wolf, Squillante, Yu)
Our contributions • New statistical recency objective for CQs • New monitoring framework to fit statistical models of page change behavior • Recency optimization problem constrained by network resources • Two-phase solution to optimization tailored to CQ search systems • Resource allocation (knapsack) • Poll scheduling (flow-shop)
Continuous Adaptive Monitoring • Planning horizon or “epoch” • Time proceeds in discrete steps {j} over epoch • At each time step j, each page i has probability ρi,j of an update • Can capture predictable bursts, periodicity • Σj ρi,j = λi, the expected #updates to page i (“change rate”) • Decision variables yi,j • Is page i polled at time step j?
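The model on this slide can be illustrated with a minimal Python sketch. The probability values, the 0/1 poll matrix, and the function names `change_rate` and `polls_used` are assumptions made for illustration only; they are not taken from the talk.

```python
# rho[i][j]: probability that page i is updated at time step j of the epoch.
# The numbers below are made up for illustration.
rho = [
    [0.1, 0.1, 0.1, 0.1],  # a page that changes steadily
    [0.0, 0.9, 0.0, 0.1],  # a page with a predictable burst at step 1
]

# y[i][j] = 1 if page i is polled at time step j, else 0 (decision variables).
y = [
    [0, 0, 0, 1],
    [0, 1, 0, 0],
]


def change_rate(rho_i):
    """Expected number of updates to one page over the epoch: sum over j of rho[i][j]."""
    return sum(rho_i)


def polls_used(y):
    """Total number of polls a plan spends over the epoch."""
    return sum(sum(row) for row in y)


rates = [change_rate(r) for r in rho]
```

Here `rates` holds one change rate per page, and `polls_used(y)` is the quantity that a polling budget would later constrain.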
Profit, relevance and importance • Each registered query q has a profit • Relevance riq of page i w.r.t. query q • We use cosine in TFIDF space as in IR • Other measures (e.g. PageRank) may be integrated • Page i has “importance” Wi, a function of • Currently resident queries and their “profits” • Relevance of page i to each resident query
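The exact importance formula is not preserved on this slide. The sketch below assumes one plausible form, a profit-weighted sum of relevances over the resident queries, with relevance computed as cosine similarity between sparse term-weight vectors. The function names and the weighted-sum form are assumptions, not the authors' stated definition.

```python
import math


def cosine(u, v):
    """Cosine similarity of two sparse term-weight vectors given as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def importance(page_vec, resident_queries):
    """ASSUMED form: W_i = sum over queries q of profit_q * relevance(i, q).

    resident_queries is a list of (profit, query_vector) pairs.
    """
    return sum(profit * cosine(page_vec, q_vec) for profit, q_vec in resident_queries)
```

For example, a page whose vector matches one query with profit 3 and is orthogonal to another with profit 5 would get importance 3 under this assumed form.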
Returned Information Ratio • Update information reported for page i is Ri • Goal is to maximize importance-weighted updates reported, Σi Wi Ri, subject to polling resource constraint • Returned info ratio (RIR) = (importance-weighted updates captured by system) / (total importance-weighted expected updates)
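The ratio can be sketched in Python as follows. The slide's formula for the update information reported per page did not survive, so this sketch assumes a simplified additive form in which a poll at step j is credited with the update probability at that step. That simplification, and the function name `returned_information_ratio`, are assumptions for illustration.

```python
def returned_information_ratio(W, rho, y):
    """RIR = importance-weighted updates captured / importance-weighted expected updates.

    W[i]      : importance of page i
    rho[i][j] : probability that page i is updated at time step j
    y[i][j]   : 1 if page i is polled at step j, else 0
    ASSUMPTION: updates reported for page i = sum_j rho[i][j] * y[i][j].
    """
    captured = sum(
        W[i] * sum(p * poll for p, poll in zip(rho[i], y[i]))
        for i in range(len(W))
    )
    expected = sum(W[i] * sum(rho[i]) for i in range(len(W)))
    return captured / expected if expected else 0.0
```

Under this assumption, polling every page at every step gives a ratio of 1, and any tighter budget gives a value between 0 and 1.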
CAM system overview • Time proceeds in epochs • At the end of every epoch we re-evaluate • Relevance • Update probabilities • For the next epoch • We select instants at which to poll each page (resource allocation) • Schedule these instants subject to resource constraint • (Diagram components: determining relevant pages, parameter tracking, monitoring, resource allocation, scheduling)
Resource allocation • Existing policies • Uniform: Resources (#polls) distributed uniformly among all pages irrespective of their change frequency • Proportional: #polls allocated to a page is proportional to the frequency with which it changes • For discrete queries, uniform better than proportional for any inter-update distribution • CAM: solve a knapsack problem • Better than uniform and proportional • Proportional better than uniform • Evidence that CQ objective ≠ discrete objective
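The three policies can be contrasted with a rough Python sketch. It assumes every poll costs one unit and that the gain of polling page i at step j is additive and equal to importance times update probability; under those assumptions the knapsack reduces to picking the highest-value (page, step) pairs. The actual CAM formulation may differ, and all function names here are hypothetical.

```python
def polls_uniform(n_pages, budget):
    """Uniform policy: spread the budget evenly, earlier pages get any remainder."""
    base, extra = divmod(budget, n_pages)
    return [base + (1 if i < extra else 0) for i in range(n_pages)]


def polls_proportional(change_rates, budget):
    """Proportional policy: polls in proportion to change rate (rounded down)."""
    total = sum(change_rates)
    return [int(budget * r / total) for r in change_rates]


def allocate_greedy(W, rho, budget):
    """Pick the `budget` (page, step) pairs with the largest W[i] * rho[i][j].

    ASSUMPTION: unit poll cost and additive gains, so greedy selection solves
    this simplified knapsack. Ties are broken by page index, then step index.
    """
    candidates = [
        (W[i] * rho[i][j], i, j)
        for i in range(len(W))
        for j in range(len(rho[i]))
    ]
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
    return sorted((i, j) for _, i, j in candidates[:budget])
```

The design point this illustrates is that uniform and proportional decide only how many polls a page gets, while a value-driven allocation also decides at which time steps to place them.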
Scheduling • Suppose our crawler can fetch M pages concurrently, and • An epoch is T time steps long • Then we can fetch a total of C = MT pages during an epoch • Ensured by resource allocation phase • But at each instant we cannot schedule more than M fetches • Want small planned-to-actual poll delays • May fail to schedule all poll jobs in an epoch • (Diagram: resource allocation passes tentative yij values to scheduling)
A flow-shop problem • M “machines” available at any time • Each yij which is equal to 1 is a “job” • Job k is “released” at time step rk (= j) • “Processing time” = crawl time = tk • “Completion time” of job k is Ck • Want to minimize “total flow” • NP-hard problem • We use earliest deadline heuristic
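A simplified version of this scheduling step can be sketched in Python. The sketch assumes every fetch takes exactly one time step and treats a job's release time as its deadline, so the heuristic becomes "serve the longest-waiting released job first". These simplifications and the function names are assumptions; the talk does not give the heuristic's details.

```python
import heapq


def schedule_polls(release_times, M, T):
    """Assign each poll job a start step, at most M jobs per step, within T steps.

    ASSUMPTIONS: unit crawl time, release time used as the deadline.
    Returns {job_index: start_step}; jobs that do not fit are left out.
    """
    by_release = sorted(range(len(release_times)), key=lambda k: release_times[k])
    pending = []  # min-heap of (release_time, job_index)
    start = {}
    nxt = 0
    for t in range(T):
        # Make every job released at or before step t available.
        while nxt < len(by_release) and release_times[by_release[nxt]] <= t:
            k = by_release[nxt]
            heapq.heappush(pending, (release_times[k], k))
            nxt += 1
        # Run up to M of the earliest-released pending jobs at this step.
        for _ in range(min(M, len(pending))):
            _, k = heapq.heappop(pending)
            start[k] = t
    return start


def total_delay(release_times, start):
    """Sum of planned-to-actual delays (start step minus release step)."""
    return sum(start[k] - release_times[k] for k in start)
```

With unit crawl times, total flow (completion minus release, summed over jobs) differs from `total_delay` only by the number of scheduled jobs, so minimizing one minimizes the other for a fixed set of scheduled jobs.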
Experiments • Synthetic data • Change frequency distribution: a few pages change very often (Zipfian) • Update probability distribution: a few ρi,j’s are large, most are small (Zipfian again) • Page importance distribution: also Zipfian (Wolman, 1999) • Real data • Eight cricket score sites • High update rate
CAM > Proportional > Uniform • Uniform update and importance distrib. • Plot RIR against ratio of resources to expected changes • RIR for CAM is >3 times better • Proportional is better than uniform in the CQ setting • Intuition from “minimum total stale duration” does not apply to CQ
Resource allocation • Sort pages by increasing change rate • Place in ten equally populated bins (10=fastest) • Uniform spends same resource for each bin • Proportional wastes fewer resources on slow-changing bins, but is not aggressive enough • CAM invests more aggressively in fast-changing bins, achieving the greatest RIR
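The binning used for this comparison can be sketched in Python. The function name `polls_per_bin` and the rank-based way of forming equally populated bins are assumptions about how such a plot might be produced, not a description of the authors' code.

```python
def polls_per_bin(change_rates, polls, n_bins=10):
    """Total polls spent per bin, with pages sorted by increasing change rate.

    Bin 0 holds the slowest-changing pages and the last bin the fastest.
    Bins are (near-)equally populated by rank.
    """
    order = sorted(range(len(change_rates)), key=lambda i: change_rates[i])
    totals = [0] * n_bins
    for rank, i in enumerate(order):
        b = rank * n_bins // len(order)
        totals[b] += polls[i]
    return totals
```

Running this once per policy on the same pages would give the per-bin resource profile that the slide compares across uniform, proportional, and CAM.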
Skew-handling and adaptation • Fixed monitoring/change ratio • Vary skew in update probability distribution • CAM’s gains increase with skew • CAM improves over initial epochs • Change distribution estimates stabilize within a few epochs
Experiments on real pages • Eight sites with dynamic cricket match information • In fact, Zipfian updates • Adversarial setup: monitor/change < 1 • CAM close to best possible • For M/C=2, CAM reports updates on 80% of the information that changed
Conclusion • Continuous queries are inherently different from discrete queries • Approach used in CAM • Identify relevant pages • Track the pages as they change • Characterize page change behavior • Decide when to monitor the pages in future • CAM performs better than naïve approaches such as uniform and proportional allocation
References • J. Cho, H. Garcia-Molina. Synchronizing a database to improve freshness. ACM SIGMOD, 2000. • J. Cho, H. Garcia-Molina. Estimating frequency of change. Technical Report, 2000. • J. Sethuraman, J. L. Wolf, M. S. Squillante, P. S. Yu. Optimal crawling strategies for Web search engines. World Wide Web, 2002.