User-Centric Web Crawling*
Christopher Olston, CMU & Yahoo! Research**
* Joint work with Sandeep Pandey
** Work done at Carnegie Mellon
Distributed Sources of Dynamic Information • Support integrated querying • Maintain historical archive • Sensors • Web sites [Diagram: sources A, B, and C feed a central monitoring node operating under resource constraints]
Workload-driven Approach • Goal: meet usage needs, while adhering to resource constraints • Tactic: pay attention to workload (workload = usage + data dynamics) • Thesis work: cooperative sources [VLDB'00, SIGMOD'01, SIGMOD'02, SIGMOD'03a, SIGMOD'03b] • Current focus: autonomous sources • Data archival from Web sources [VLDB'04] • Supporting Web search [WWW'05] (this talk)
Outline • Introduction: monitoring distributed sources • User-centric web crawling • Model + approach • Empirical results • Related & future work
Web Crawling to Support Search
Q: Given a full repository, when to refresh each page?
[Diagram: a crawler, operating under a resource constraint, refreshes a repository from web sites A, B, and C; the search engine indexes the repository and serves search queries from users]
Approach • Faced with an optimization problem • Others: maximize freshness, age, or similar; Boolean model of document change • Our approach: user-centric optimization objective; rich notion of document change, attuned to the user-centric objective
Web Search User Interface • User enters keywords • Search engine returns a ranked list of result documents • User visits a subset of the results
Objective: Maximize Repository Quality, from the Search Perspective • Suppose a user issues search query q: Quality_q = Σ_documents d (likelihood of viewing d) × (relevance of d to q) • Given a workload W of user queries: Average quality = (1/K) × Σ_queries q ∈ W (freq_q × Quality_q)
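A minimal sketch of this objective, assuming hypothetical helper names and reading K as the total query frequency in W (the slide leaves K implicit):

```python
def query_quality(q, docs, view_prob, relevance):
    """Quality_q = sum over documents d of P(view d) * relevance(d, q)."""
    return sum(view_prob(d, q) * relevance(d, q) for d in docs)

def average_quality(workload, docs, view_prob, relevance):
    """Average quality = (1/K) * sum over queries q of freq_q * Quality_q.

    `workload` is a list of (query, frequency) pairs; K is taken here to be
    the total query frequency, which is an assumption for illustration.
    """
    k = sum(freq for _, freq in workload)
    weighted = sum(freq * query_quality(q, docs, view_prob, relevance)
                   for q, freq in workload)
    return weighted / k if k else 0.0
```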
Viewing Likelihood • Depends primarily on rank in the list [Joachims KDD'02] • From AltaVista data [Lempel et al. WWW'03]: ViewProbability(r) ∝ r^(–1.5) [Plot: probability of viewing vs. rank]
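The power-law fit might be coded as below; normalizing over a fixed number of ranks is an assumption for illustration, since the slide only reports the shape ViewProbability(r) ∝ r^(–1.5):

```python
def view_probability(rank, max_rank=100, exponent=1.5):
    """Probability of viewing a result at `rank`, proportional to rank^-1.5.

    Normalizing over the top `max_rank` positions is an assumption; the
    slide only gives the power-law shape of the AltaVista data.
    """
    if rank < 1 or rank > max_rank:
        return 0.0
    norm = sum(r ** -exponent for r in range(1, max_rank + 1))
    return (rank ** -exponent) / norm
```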
Relevance Scoring Function • Search engines' internal notion of how well a document matches a query • Each document/query pair → numerical score in [0,1] • Combination of many factors, e.g.: vector-space similarity (e.g., the TF.IDF cosine metric), link-based factors (e.g., PageRank), anchortext of referring pages
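For concreteness, one of the listed factors, vector-space similarity via a TF.IDF cosine, might look roughly like this generic textbook version (not the scoring function actually used in the talk):

```python
import math
from collections import Counter

def tfidf_cosine(query_terms, doc_terms, doc_freq, num_docs):
    """Cosine similarity between TF.IDF vectors of a query and a document.

    doc_freq maps each term to the number of documents containing it.
    """
    def weight(tf, term):
        idf = math.log((num_docs + 1) / (doc_freq.get(term, 0) + 1))
        return tf * idf

    q_tf, d_tf = Counter(query_terms), Counter(doc_terms)
    terms = set(q_tf) | set(d_tf)
    q_vec = {t: weight(q_tf[t], t) for t in terms}
    d_vec = {t: weight(d_tf[t], t) for t in terms}
    dot = sum(q_vec[t] * d_vec[t] for t in terms)
    q_norm = math.sqrt(sum(v * v for v in q_vec.values()))
    d_norm = math.sqrt(sum(v * v for v in d_vec.values()))
    return dot / (q_norm * d_norm) if q_norm and d_norm else 0.0
```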
(Caveat) • Using the scoring function for absolute relevance (normally it is only used for relative ranking) • Need to ensure the scoring function has meaning on an absolute scale • Probabilistic IR models, PageRank: okay • Unclear whether TF.IDF does (still debated, I believe) • Bottom line: a stricter interpretability requirement than "good relative ordering"
Measuring Quality
Avg. Quality = Σ_q (freq_q × Σ_d (likelihood of viewing d) × (relevance of d to q))
• freq_q: estimated from query logs
• likelihood of viewing d: ViewProb(Rank(d, q)), where the rank comes from the scoring function over the (possibly stale) repository copy and the view probabilities come from usage logs
• relevance of d to q: scoring function over the "live" copy of d
Lessons from the Quality Metric
Avg. Quality = Σ_q (freq_q × Σ_d ViewProb(Rank(d, q)) × Relevance(d, q))
• ViewProb(r) is monotonically nonincreasing • Quality is maximized when the ranking function orders documents in descending order of true relevance • Out-of-date repository: scrambles the ranking → lowers quality • Let ΔQ_D = loss in quality due to inaccurate information about D; alternatively, the improvement in quality if we (re)download D
ΔQ_D: Improvement in Quality
[Diagram: re-downloading D replaces the stale repository copy of D with the fresh Web copy; Repository Quality += ΔQ_D]
Formula for Quality Gain (ΔQ_D)
Re-download document D at time t.
• Quality beforehand: Q(t–) = Σ_q (freq_q × Σ_d ViewProb(Rank_{t–}(d, q)) × Relevance(d, q))
• Quality after re-download: Q(t) = Σ_q (freq_q × Σ_d ViewProb(Rank_t(d, q)) × Relevance(d, q))
• Quality gain: ΔQ_D(t) = Q(t) – Q(t–) = Σ_q (freq_q × Σ_d ΔVP × Relevance(d, q)), where ΔVP = ViewProb(Rank_t(d, q)) – ViewProb(Rank_{t–}(d, q))
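Read as code, the gain accumulates, per query, the change in view probability of every affected document weighted by its relevance; the input format below (pre-computed ranks before and after) is assumed purely for illustration:

```python
def quality_gain(affected_queries, view_probability, relevance):
    """Delta-Q_D(t) = sum over q of freq_q * sum over d of
    (ViewProb(rank at t) - ViewProb(rank at t-)) * Relevance(d, q).

    `affected_queries` maps each query q to a pair
    (freq_q, [(d, rank_before, rank_after), ...]).
    """
    gain = 0.0
    for q, (freq, shifts) in affected_queries.items():
        for d, rank_before, rank_after in shifts:
            delta_vp = view_probability(rank_after) - view_probability(rank_before)
            gain += freq * delta_vp * relevance(d, q)
    return gain
```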
Download Prioritization
Idea: given ΔQ_D for each document, prioritize (re)downloading accordingly.
Three difficulties: 1. ΔQ_D depends on the order of downloading 2. Given both the "live" and repository copies of D, measuring ΔQ_D is computationally expensive 3. The live copy is usually unavailable
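A minimal sketch of the prioritization itself, assuming forecast gains are already available as a dict (the crawler's actual scheduler and resource model are not specified on this slide):

```python
import heapq

def pick_downloads(forecast_gain, budget):
    """Choose the `budget` documents with the largest forecast Delta-Q_D.

    forecast_gain: dict mapping document id -> forecast quality gain.
    """
    return heapq.nlargest(budget, forecast_gain, key=forecast_gain.get)
```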
Difficulty 1: Order of Downloading Matters • ΔQ_D depends on the relative rank positions of D; hence, ΔQ_D depends on the order of downloading • To reduce implementation complexity, avoid tracking inter-document ordering dependencies • Assume ΔQ_D is independent of the downloading of other documents:
ΔQ_D(t) = Σ_q (freq_q × Σ_d (ΔVP × Relevance(d, q))), where ΔVP = ViewProb(Rank_t(d, q)) – ViewProb(Rank_{t–}(d, q))
Difficulty 3: Live Copy Unavailable • Take measurements upon re-downloading D (the live copy is available at that time) • Use forecasting techniques to project forward [Timeline: past re-downloads at t1 and t2 yield measured ΔQ_D(t1), ΔQ_D(t2); forecast ΔQ_D(t_now)]
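The slide does not name a particular forecasting model; as one hedged stand-in, an exponentially weighted average of the past measured gains:

```python
def forecast_gain(past_gains, alpha=0.5):
    """Forecast the next Delta-Q_D for a document from past measurements.

    `past_gains` lists the Delta-Q_D values measured at previous
    re-downloads, oldest first. The exponentially weighted average used
    here is an illustrative assumption, not the talk's stated technique.
    """
    if not past_gains:
        return 0.0
    estimate = past_gains[0]
    for g in past_gains[1:]:
        estimate = alpha * g + (1 - alpha) * estimate
    return estimate
```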
Ability to Forecast ΔQ_D • Data: 15 web sites sampled from OpenDirectory topics • Queries: AltaVista query log • Docs downloaded once per week, in random order [Plot: avg. weekly ΔQ_D (log scale), first 24 weeks vs. second 24 weeks, with Top 50% / Top 80% / Top 90% bands marked]
Strategy So Far • Measure the shift in quality (ΔQ_D) each time we re-download document D • Forecast future ΔQ_D • Treat each D independently • Prioritize re-downloading by ΔQ_D • Remaining difficulty: given both the "live" and repository copies of D, measuring ΔQ_D is computationally expensive
Difficulty 2: Metric Expensive to Compute
Example: • The "live" copy of D becomes less relevant to query q than before • Now D is ranked too high • Some users visit D in lieu of Y, which is more relevant • Result: less-than-ideal quality
Results for q:  Actual: 1. X, 2. D, 3. Y, 4. Z   Ideal: 1. X, 2. Y, 3. Z, 4. D
• Upon re-downloading D, measuring the quality gain requires knowing the relevance of Y and Z
One problem: measurements of other documents are required. Solution: estimate! • Use approximate relevance→rank mapping functions, fit in advance for each query
(DETAIL) Estimation Procedure • Focus on query q (later we'll see how to sum across all affected queries) • Let F_q(rel) be the relevance→rank mapping for q; we use a piecewise linear function in log-log space • Let r1 = D's old rank (r1 = F_q(Rel(D_old, q))) and r2 = D's new rank • Use an integral approximation of the summation:
ΔQ_{D,q} = Σ_d (ΔVP(d, q) × Rel(d, q)) = ΔVP(D, q) × Rel(D, q) + Σ_{d≠D} (ΔVP(d, q) × Rel(d, q))
  where the second term ≈ Σ_{r=r1+1…r2} (VP(r–1) – VP(r)) × F_q^(–1)(r)
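A sketch of the slide's summation, assuming the fitted mapping F_q (relevance to rank) and its inverse (rank to relevance) are available as callables; it covers the case r2 ≥ r1 shown on the slide:

```python
def rank_shift_sum(rel_new, rel_old, rank_of_rel, rel_of_rank, view_probability):
    """Evaluate  sum_{r = r1+1 .. r2} (VP(r-1) - VP(r)) * F_q^{-1}(r).

    rank_of_rel plays the role of F_q (relevance -> rank) and rel_of_rank
    its inverse (rank -> relevance); both are assumed to have been fitted
    in advance, e.g. piecewise linear in log-log space as the slide says.
    """
    r1 = rank_of_rel(rel_old)   # D's old rank
    r2 = rank_of_rel(rel_new)   # D's new rank (slide's case: r2 >= r1)
    return sum((view_probability(r - 1) - view_probability(r)) * rel_of_rank(r)
               for r in range(r1 + 1, r2 + 1))
```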
(DETAIL) Where we stand…
Context: ΔQ_D = Σ_q (freq_q × ΔQ_{D,q})
ΔQ_{D,q} = ΔVP(D, q) × Rel(D, q) + Σ_{d≠D} (ΔVP(d, q) × Rel(d, q))
  where ΔVP(D, q) ≈ VP(F_q(Rel(D, q))) – VP(F_q(Rel(D_old, q))) and the second term ≈ f(Rel(D, q), Rel(D_old, q))
⇒ ΔQ_{D,q} ≈ g(Rel(D, q), Rel(D_old, q))
Difficulty 2, continued
Additional problem: we must measure the effect of the shift in rank across all queries. Solution: couple measurements with index updating operations.
Sketch: • Basic index unit: posting • Each time a posting is inserted/deleted/updated, compute the old & new relevance contributions of the term/document pair* • Transform using the estimation procedure, and accumulate across the postings touched to get ΔQ_D
* assumes the scoring function treats term/document pairs independently
(DETAIL) Background: Text Indexes
Basic index unit: posting • One posting for each term/document pair • Contains the information needed by the scoring function (number of occurrences, font size, etc.) [Diagram: dictionary of terms, each entry pointing to its postings list]
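A posting could be modeled along these lines; the field names are illustrative and not Lucene's actual layout:

```python
from dataclasses import dataclass

@dataclass
class Posting:
    """One term/document pair, carrying what the scoring function needs."""
    term: str
    doc_id: int
    term_frequency: int       # number of occurrences of the term in the doc
    max_font_size: int = 0    # example of an additional scoring feature
```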
(DETAIL) Pre-Processing: Approximate the Workload • Break multi-term queries into a set of single-term queries • Now, term = query • The index has one posting for each query/document pair [Diagram: the same dictionary/postings structure, with each dictionary term now doubling as a query]
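A sketch of that pre-processing step, collapsing a query log into per-term frequencies (the log format and tokenization below are assumptions):

```python
from collections import Counter

def approximate_workload(query_log):
    """Collapse a log of (query_string, frequency) pairs into per-term
    frequencies, treating each term as a single-term query."""
    term_freq = Counter()
    for query, freq in query_log:
        for term in query.lower().split():
            term_freq[term] += freq
    return term_freq
```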
(DETAIL) Taking Measurements During Index Maintenance • While updating the index: initialize a bank of ΔQ_D accumulators, one per document (actually materialized on demand using a hash table) • Each time a posting is inserted/deleted/updated: compute the new & old relevance contributions for the query/document pair, Rel(D, q) and Rel(D_old, q); compute ΔQ_{D,q} using the estimation procedure and add it to the accumulator: ΔQ_D += freq_q × g(Rel(D, q), Rel(D_old, q))
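Putting the pieces together, the accumulator bank might look like the following sketch, with a plain dict playing the role of the on-demand hash table and g() standing for the estimation procedure, here assumed to be already bound to each query's fitted mappings:

```python
from collections import defaultdict

class QualityGainAccumulator:
    """Accumulates Delta-Q_D per document while postings are updated."""

    def __init__(self, query_freq, estimate_gain):
        self.query_freq = query_freq        # freq_q for each single-term query
        self.estimate_gain = estimate_gain  # g(Rel(D,q), Rel(D_old,q))
        self.delta_q = defaultdict(float)   # materialized on demand

    def on_posting_update(self, doc_id, term, rel_new, rel_old):
        """Call whenever a posting for (term, doc) is inserted/deleted/updated."""
        freq = self.query_freq.get(term, 0)
        if freq:
            self.delta_q[doc_id] += freq * self.estimate_gain(rel_new, rel_old)

    def gain(self, doc_id):
        """Accumulated Delta-Q_D measured for this document so far."""
        return self.delta_q[doc_id]
```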
Measurement Overhead • Implemented in Lucene • Caveat: does not handle factors that do not depend on a single term/document pair, e.g., term proximity and anchortext inclusion
Summary of Approach • User-centric metric of search repository quality • (Re)downloading a document improves quality • Prioritize downloading by expected quality gain • Metric adaptations to enable a feasible and efficient implementation
Next: Empirical Results • Introduction: monitoring distributed sources • User-centric web crawling • Model + approach • Empirical results • Related & future work
Overall Effectiveness • Staleness = fraction of out-of-date documents* [Cho et al. 2000] • Embarrassment = probability that a user visits an irrelevant result* [Wolf et al. 2002] • * Used "shingling" to filter out "trivial" changes • Scoring function: PageRank (similar results for TF.IDF) [Chart: quality (fraction of ideal) vs. resource requirement for the Min. Staleness, Min. Embarrassment, and User-Centric policies]
Reasons for Improvement • Does not rely on the size of a text change to estimate importance • Example (boston.com): tagged as important by the staleness- and embarrassment-based techniques, although it did not match many queries in the workload
Reasons for Improvement • Accounts for "false negatives" • Does not always ignore frequently-updated pages • Example (washingtonpost.com): user-centric crawling repeatedly re-downloads this page
Related Work (1/2) • General-purpose web crawling ([Cho, Garcia-Molina, SIGMOD'00], [Edwards et al., WWW'01]): maximize average freshness or age; balance new downloads vs. re-downloading old documents • Focused/topic-specific crawling ([Chakrabarti, many others]): select a subset of documents that match user interests • Our work: given a set of documents, decide when to (re)download each one
Most Closely Related Work • [Wolf et al., WWW'02]: maximize weighted average freshness, where a document's weight = probability of "embarrassment" if it is not fresh • User-Centric Crawling: measure the interplay between the update and query workloads; when document X is updated, which queries are affected by the update, and by how much? • The metric penalizes false negatives: a document ranked #1000 for a popular query that should be ranked #2 causes small embarrassment but a big loss in quality
Future Work: Detecting Change-Rate Changes • Current techniques schedule monitoring to exploit existing change-rate estimates (e.g., ΔQ_D); there is no provision to explore change-rates explicitly • Bad case: estimated change-rate = 0, so we never monitor the source and won't notice a future increase in its change-rate • Explore/exploit tradeoff: ongoing work on a Bandit Problem formulation
Summary • Approach: user-centric metric of search engine quality; schedule downloading to maximize quality • Empirical results: high quality with few downloads; good at picking the "right" documents to re-download