User-Centric Web Crawling
Sandeep Pandey & Christopher Olston
Carnegie Mellon University
Web Crawling
• One important application (our focus): search
• Topic-specific search engines + general-purpose ones
[Diagram: a crawler fetches pages from the WWW into a repository; the repository is indexed; the user issues search queries against the index]
Out-of-date Repository
• The Web is always changing [Arasu et al., TOIT'01]:
  • 23% of Web pages change daily
  • 40% of commercial Web pages change daily
• Many problems arise from an out-of-date repository
• Hurts both precision and recall
Web Crawling: Optimization Problem
• Not enough resources to (re)download every web document every day/hour
• Must pick and choose → an optimization problem
• Prior work: objective function = avg. freshness or age
• Our goal: focus directly on the impact on users
[Diagram: a crawler fetches pages from the WWW into a repository; the repository is indexed; the user issues search queries against the index]
Web Search User Interface
• User enters keywords
• Search engine returns a ranked list of result documents
• User visits a subset of the results
Objective: Maximize Repository Quality (as perceived by users)
• Suppose a user issues search query q:
  Quality_q = Σ_D (likelihood of viewing D) × (relevance of D to q)
• Given a workload W of K user queries:
  Average quality = (1/K) × Σ_{q ∈ W} (freq_q × Quality_q)
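The two formulas above can be sketched directly in code. This is a toy illustration with made-up viewing probabilities, relevances, and query frequencies; the slide's K is taken here to be the total query count (the sum of the freq_q values), which is an assumption.

```python
def query_quality(view_probs, relevances):
    """Quality_q: sum over result documents of P(view D) * relevance(D, q)."""
    return sum(p * r for p, r in zip(view_probs, relevances))

def average_quality(workload):
    """workload: list of (freq_q, view_probs, relevances) per query.
    Frequency-weighted mean of per-query quality over the workload."""
    total_freq = sum(freq for freq, _, _ in workload)
    return sum(freq * query_quality(v, r) for freq, v, r in workload) / total_freq

# Toy workload: two queries over a three-document result list.
workload = [
    (10, [0.8, 0.3, 0.1], [0.9, 0.5, 0.2]),  # frequent query
    (2,  [0.8, 0.3, 0.1], [0.4, 0.4, 0.1]),  # rare query
]
print(average_quality(workload))
```

The frequent query dominates the average, which is exactly the effect the metric is designed to have: freshness matters most for documents surfaced by popular queries.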
Viewing Likelihood
• Depends primarily on rank in the result list [Joachims, KDD'02]
• From AltaVista data [Lempel et al., WWW'03]:
  ViewProbability(r) ∝ r^(-1.5)
[Plot: probability of viewing vs. rank, decaying steeply over ranks 0–150]
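The power-law fit above is easy to turn into a concrete viewing model. Normalizing over a finite result list (the top max_rank positions) is our assumption for illustration; the slide only gives the r^(-1.5) shape.

```python
def view_probability(rank, max_rank=150):
    """Probability of viewing the result at `rank` (1-based), using the
    r^(-1.5) decay from the slide, normalized over the top max_rank slots."""
    total = sum(r ** -1.5 for r in range(1, max_rank + 1))
    return rank ** -1.5 / total
```

Note how fast the decay is: rank 4 is viewed 1/8 as often as rank 1 (4^1.5 = 8), so quality losses at the top of the ranking dominate everything below.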
Relevance Scoring Function
• A search engine's internal notion of how well a document matches a query
• Assigns each (document, query) pair a numerical score in [0, 1]
• Combination of many factors, including:
  • Vector-space similarity (e.g., TF.IDF cosine metric)
  • Link-based factors (e.g., PageRank)
  • Anchortext of referring pages
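Of the factors listed, the TF.IDF cosine metric is simple enough to sketch standalone. A real engine would combine it with link-based and anchortext signals; the tokenization and IDF formula below are standard textbook choices, not the paper's exact scoring function.

```python
import math
from collections import Counter

def tfidf_vector(tokens, idf):
    """Sparse TF.IDF vector: raw term frequency times inverse document frequency."""
    tf = Counter(tokens)
    return {t: tf[t] * idf.get(t, 0.0) for t in tf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors (dicts term -> weight)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["cancer", "seminar"], ["cancer", "symptoms", "cancer"], ["stocks"]]
df = Counter(t for d in docs for t in set(d))           # document frequencies
idf = {t: math.log(len(docs) / df[t]) for t in df}

query = ["cancer", "symptoms"]
scores = [cosine(tfidf_vector(d, idf), tfidf_vector(query, idf)) for d in docs]
```

Here the second document scores highest (it contains both query terms), the first scores lower, and the off-topic third document scores zero.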
(Caveat)
• We use the scoring function as a measure of absolute relevance
• Normally it is used only for relative ranking
• Need to craft the scoring function carefully
Measuring Quality
Avg. quality = Σ_q (freq_q × Σ_D (ViewProb(Rank(D, q)) × (relevance of D to q)))
• freq_q: estimated from query logs
• ViewProb(Rank(D, q)): rank computed by the scoring function over the (possibly stale) repository; viewing likelihood calibrated from usage logs
• Relevance of D to q: scoring function over the "live" copy of D
Lessons from the Quality Metric
Avg. quality = Σ_q (freq_q × Σ_D (ViewProb(Rank(D, q)) × (relevance of D to q)))
• ViewProb(r) is monotonically nonincreasing
• Quality is maximized when the ranking function orders documents in descending order of relevance
• An out-of-date repository scrambles the ranking → lowers quality
• Let ΔQD = loss in quality due to inaccurate information about D
  • Equivalently, the improvement in quality if we (re)download D
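A brute-force version of ΔQD makes the definition concrete for a single query: re-rank everything under the stale scores and under the fresh scores, and take the quality difference. The function names, the use of fresh scores as the "true" relevance, and the toy data are all illustrative; the paper's contribution is precisely avoiding this full re-ranking.

```python
def quality(ranked_relevances, view_prob):
    """Quality of one ranking: sum of P(view at rank r) * relevance."""
    return sum(view_prob(r + 1) * rel for r, rel in enumerate(ranked_relevances))

def delta_q(stale_scores, fresh_scores, true_relevance, view_prob):
    """Quality gained by ranking with fresh rather than stale scores.
    All three dicts map doc_id -> value for a single query."""
    def q(scores):
        ranking = sorted(scores, key=scores.get, reverse=True)
        return quality([true_relevance[d] for d in ranking], view_prob)
    return q(fresh_scores) - q(stale_scores)

vp = lambda r: r ** -1.5                      # viewing model from earlier slide
stale = {"a": 0.9, "b": 0.20}                 # repository thinks b is irrelevant
fresh = {"a": 0.9, "b": 0.95}                 # live copy of b is now highly relevant
print(delta_q(stale, fresh, fresh, vp))       # positive: refreshing b helps
```

The positive value reflects the slide's point: a stale score for b scrambles the ranking, and redownloading b recovers that quality.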
ΔQD: Improvement in Quality
[Diagram: redownloading replaces the stale repository copy of D with the fresh Web copy; repository quality += ΔQD]
Download Prioritization
• Idea: given ΔQD for each document, prioritize (re)downloading accordingly
• Q: How to measure ΔQD? Two difficulties:
  • The live copy is unavailable
  • Even given both the "live" and repository copies of D, measuring ΔQD may require computing the ranks of all documents for all queries
• Approach: (1) estimate ΔQD for past versions, (2) forecast the current ΔQD
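The prioritization step itself is straightforward once forecasted ΔQD values exist. The slide only says "prioritize accordingly"; a max-heap scheduler under a fixed download budget, as below, is one simple realization of that idea, not the paper's stated implementation.

```python
import heapq

def schedule_downloads(forecast_dq, budget):
    """forecast_dq: dict url -> forecasted quality gain ΔQD.
    Returns the `budget` urls with the largest forecasted gain, best first."""
    heap = [(-dq, url) for url, dq in forecast_dq.items()]  # negate for max-heap
    heapq.heapify(heap)
    return [heapq.heappop(heap)[1] for _ in range(min(budget, len(heap)))]

print(schedule_downloads({"a": 0.1, "b": 0.7, "c": 0.3}, 2))  # → ['b', 'c']
```

In a real crawler the queue would be rebuilt (or incrementally updated) as new ΔQD estimates arrive from the indexing pipeline.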
Overhead of Estimating ΔQD
• Estimate ΔQD while updating the inverted index
Forecasting Future ΔQD
• Data: 48 weekly snapshots of 15 web sites sampled from OpenDirectory topics
• Queries: AltaVista query log
[Plot: avg. weekly ΔQD during the first vs. second 24 weeks, for the top 50%, 80%, and 90% of documents]
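The plot compares average weekly ΔQD across the two 24-week halves, i.e., it asks whether past ΔQD predicts future ΔQD. Any simple smoother over past estimates can serve as such a forecast; the exponentially weighted average below is our assumed variant, not the paper's specific forecasting rule.

```python
def forecast_dq(history, alpha=0.3):
    """Forecast the next ΔQD as an exponentially weighted moving average
    of past weekly estimates (most recent weeks weighted most heavily)."""
    estimate = history[0]
    for x in history[1:]:
        estimate = alpha * x + (1 - alpha) * estimate
    return estimate
```

A document whose recent weekly ΔQD has risen gets a higher forecast, and hence (via the prioritization step) a higher redownload priority.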
Summary
• Estimate ΔQD at index time
• Forecast future ΔQD
• Prioritize downloading according to forecasted ΔQD
Overall Effectiveness
• Staleness = fraction of out-of-date documents* [Cho et al., 2000]
• Embarrassment = probability that a user visits an irrelevant result* [Wolf et al., 2002]
  * Used "shingling" to filter out "trivial" changes
• Scoring function: PageRank (similar results for TF.IDF)
[Chart: quality (fraction of ideal) vs. resource requirement for the Min. Staleness, Min. Embarrassment, and User-Centric policies]
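The footnote's "shingling" filter can be sketched as follows: treat a change as trivial if the old and new text share most of their k-word shingles, measured by Jaccard similarity. The values of k and the threshold are assumptions for illustration; the slide does not give them.

```python
def shingles(text, k=3):
    """Set of k-word shingles (overlapping word k-grams) of the text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def is_trivial_change(old_text, new_text, k=3, threshold=0.9):
    """A change is 'trivial' if shingle-set Jaccard similarity stays high."""
    a, b = shingles(old_text, k), shingles(new_text, k)
    union = a | b
    jaccard = len(a & b) / len(union) if union else 1.0
    return jaccard >= threshold
```

Filtering trivial changes this way keeps the staleness and embarrassment baselines from counting, say, a rotated timestamp as a meaningful update.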
Reasons for Improvement
• Does not rely on the size of a text change to estimate its importance
• Example (boston.com): a change tagged as important by the shingling measure, although it did not match many queries in the workload
Reasons for Improvement
• Accounts for "false negatives"
• Does not always ignore frequently-updated pages
• Example (washingtonpost.com): user-centric crawling repeatedly re-downloads this page
Related Work (1/2)
• General-purpose Web crawling:
  • Min. Staleness [Cho, Garcia-Molina, SIGMOD'00]: maximize average freshness or age for a fixed set of docs.
  • Min. Embarrassment [Wolf et al., WWW'02]: maximize weighted avg. freshness for a fixed set of docs.; document weights determined by probability of "embarrassment"
  • [Edwards et al., WWW'01]: maximize average freshness for a growing set of docs.; how to balance new downloads vs. redownloading old docs.
Related Work (2/2)
• Focused/topic-specific crawling [Chakrabarti, many others]: select the subset of pages that match user interests
• Our work: given a set of pages, decide when to (re)download each based on predicted content shifts + user interests
Summary
• Crawling: an optimization problem
• Objective: maximize quality as perceived by users
• Approach:
  • Measure ΔQD using the query workload and usage logs
  • Prioritize downloading based on forecasted ΔQD
• Reasons for improvement:
  • Accounts for false positives and negatives
  • Does not rely on the size of a text change to estimate importance
  • Does not always ignore frequently-updated pages
THE END
• Paper available at: www.cs.cmu.edu/~olston
Most Closely Related Work
• [Wolf et al., WWW'02]: maximize weighted avg. freshness for a fixed set of docs.; document weights determined by probability of "embarrassment"
• User-Centric Crawling asks: which queries are affected by a change, and by how much?
  • Change A: significantly alters relevance to several common queries
  • Change B: only affects relevance to infrequent queries, and not by much
• Our metric penalizes false negatives:
  • A doc. ranked #1000 for a popular query that should be ranked #2 causes small embarrassment but a big loss in quality
Inverted Index
Example document — Doc1: "Seminar: Cancer Symptoms"

Word      | Posting list: DocID (freq)
----------|-----------------------------
Cancer    | Doc7 (2), Doc1 (1), Doc9 (1)
Seminar   | Doc5 (1), Doc6 (1), Doc1 (1)
Symptoms  | Doc1 (1), Doc4 (3), Doc8 (2)
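A minimal index matching this structure: each word maps to a posting list of (DocID, term frequency) pairs. The punctuation-stripping tokenizer is a simplification for illustration.

```python
from collections import Counter, defaultdict

def build_index(docs):
    """docs: dict doc_id -> text. Returns word -> list of (doc_id, freq)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        words = text.lower().replace(":", " ").split()
        for word, freq in sorted(Counter(words).items()):
            index[word].append((doc_id, freq))
    return index

index = build_index({"Doc1": "Seminar: Cancer Symptoms"})
```

With the slide's Doc1 alone, each of the three words gets a one-entry posting list: Doc1 with frequency 1.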
Updating the Inverted Index
• Stale Doc1: "Seminar: Cancer Symptoms" → Live Doc1: "Cancer management: how to detect breast cancer"
• Posting list for "Cancer": Doc1's entry updated from Doc1 (1) to Doc1 (2); list becomes Doc7 (2), Doc1 (2), Doc9 (1)
Measuring ΔQD While Updating the Index
• Compute the previous and new scores of the downloaded document while updating its postings
• Maintain an approximate mapping between score and rank for each query term (20 bytes per mapping in our experiments)
• Compute the previous and new ranks (approximately) using the computed scores and the score-to-rank mapping
• Measure ΔQD from the previous and new ranks (by applying an approximate function derived in the paper)
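The score-to-rank mapping in the second bullet can be sketched as a small table of sampled (score, rank) points per query term, with lookups falling back to the nearest sampled score. The sampling and lookup scheme here is our assumption; the slide fixes only the rough 20-byte space budget per mapping.

```python
import bisect

class ScoreToRank:
    """Tiny per-term mapping from relevance score to approximate rank."""

    def __init__(self, sampled_points):
        """sampled_points: (score, rank) pairs with scores ascending."""
        self.scores = [s for s, _ in sampled_points]
        self.ranks = [r for _, r in sampled_points]

    def rank(self, score):
        """Approximate rank: take the sampled point at or just below `score`
        (clamped to the lowest sample for out-of-range low scores)."""
        i = bisect.bisect_right(self.scores, score) - 1
        return self.ranks[max(i, 0)]

# Hypothetical mapping for one query term: higher score -> better (smaller) rank.
m = ScoreToRank([(0.1, 1000), (0.5, 100), (0.9, 5)])
```

Looking up the document's previous and new scores in this table yields the approximate previous and new ranks, from which ΔQD is computed as in the earlier slides.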
Out-of-date Repository
[Diagram: the fresh Web copy of D vs. the stale repository copy of D]