Estimating the ImpressionRank of Web Pages
Ziv Bar-Yossef (Google and Technion)
Maxim Gurevich (Technion)

Impressions and ImpressionRank
• Impression of page/site x on a keyword w:
  • A user sends w to a search engine
  • The search engine returns x as one of the results
  • The user sees the result x
• ImpressionRank of x:
  • # of impressions of x
  • Within a certain time frame
• Measure of page/site visibility in a search engine
• Each result has an impression on the keyword “www 2009”:
  • www.2009.org
  • www2009.org/calls.html
  • www.loginconference.com
  • ...

Popular Keyword Extraction
• The Popular Keyword Extraction problem:
  • Input: web page x, int k
  • Output: k keywords on which x has the most impressions among all keywords
• Example: x = www.johnmccain.com
  • sarah palin
  • john mccain
  • cindy mccain

Motivation
• Popularity rating of pages and sites
• Site analytics
  • Enable site owners to determine their visibility in different search engines
  • Combine with traffic data to derive click-through rates
  • Compare to other sites
• Keyword suggestions for online advertising
• Social analysis
• Search engine evaluation
• Finding similar pages

Internal Measurements of ImpressionRank and Popular Keyword Extraction
• Search engines can compute both ImpressionRank and popular keywords based on their query logs
• Query logs are not publicly released due to privacy concerns
• Caveats:
  • Only search engines can do this
  • Non-transparent

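As a rough illustration of the internal approach, the sketch below computes ImpressionRank and the top-k popular keywords from a toy query log. The log format (a list of (keyword, returned URLs) records), the function names, and the data are assumptions made for this example; the slides do not describe a real log schema. It also treats every returned result as seen by the user, which simplifies the third condition of an impression.

```python
from collections import Counter

def impressions_by_keyword(query_log, target_url):
    """Count, per keyword, how many logged queries returned target_url."""
    counts = Counter()
    for keyword, results in query_log:
        if target_url in results:
            counts[keyword] += 1
    return counts

def impression_rank(query_log, target_url):
    """Total number of impressions of target_url in the log's time frame."""
    return sum(impressions_by_keyword(query_log, target_url).values())

def popular_keywords(query_log, target_url, k):
    """The k keywords on which target_url has the most impressions."""
    counts = impressions_by_keyword(query_log, target_url)
    return [keyword for keyword, _ in counts.most_common(k)]

# Made-up log entries: (keyword, URLs returned for that query).
toy_log = [
    ("john mccain", ["www.johnmccain.com", "example.net"]),
    ("john mccain", ["www.johnmccain.com"]),
    ("sarah palin", ["www.johnmccain.com", "example.org"]),
    ("weather", ["example.org"]),
]

print(impression_rank(toy_log, "www.johnmccain.com"))
print(popular_keywords(toy_log, "www.johnmccain.com", 2))
```

An external estimator has no access to such a log, which is why the rest of the talk counts requests to the search engine and the suggestion server instead.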
External Measurements of ImpressionRank and Popular Keyword Extraction
• Main cost measure: # of requests to the search engine and to the suggestion server
[Diagram: Target page URL → ImpressionRank estimator / Popular keyword extractor → ImpressionRank / Popular Keywords]

Our Contributions
• Reduce ImpressionRank Estimation to Popular Keyword Extraction
• First external algorithm for popular keyword extraction
  • Accurate
  • Uses relatively few search engine requests
• Applies to:
  • Single web pages (www.cnn.com)
  • Web sites (www.cnn.com/*)
  • Domains (*.cnn.com/*)

Related Work
• Keyword extraction [Frank et al 99, Turney 00, …]
• Keyword suggestions (for online advertising) [Yih et al 06, Fuxman et al 08]
• Query by Document [Yang et al 09]
• Commercial traffic reporting [GoogleTrends, comScore, Nielsen, Compete]

Roadmap
• The naïve popular keyword extraction algorithm
• The improved popular keyword extraction algorithm
• Best-First Search
• Experimental results

Popular Keyword Extraction: The Naïve Algorithm
[Diagram: Target Page → Term Extractor → Term Pool (… weather, mp3, tag, song …) → Candidate keyword generator → Candidate keyword TRIE (mp3, mp3 tag, …) → Candidate Verifier (queries the Search Engine and the Suggestion Server) → Popular Keywords]
• Verification procedure for keyword w:
  • Submit w to the search engine and the suggestion server
  • Verify that w returns the target page
  • Verify that the popularity of w > 0 [BG08]
• Recall problem:
  • Target page may have impressions on keywords that do not occur in its text
• Efficiency problem:
  • 10^3 terms → 10^9 3-term candidates

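The naïve procedure can be sketched as exhaustive enumeration followed by verification. This is a minimal illustration, not the authors' implementation: `naive_extract`, the two callback parameters, and the toy data are hypothetical stand-ins for the search engine and the suggestion server, and candidates are formed as ordered term sequences with repetition, which is one assumption that yields exactly n^3 three-term candidates from n terms.

```python
from itertools import product

def naive_extract(term_pool, max_terms, returns_target, popularity):
    """Enumerate every candidate of up to max_terms terms and verify each one."""
    popular, verified = [], 0
    for n in range(1, max_terms + 1):
        for terms in product(term_pool, repeat=n):
            w = " ".join(terms)
            verified += 1  # each verification costs search and suggestion requests
            if returns_target(w) and popularity(w) > 0:
                popular.append(w)
    return popular, verified

# Toy stand-ins for the search engine and the suggestion server (made-up data).
toy_popularity = {"mp3": 8, "mp3 tag": 5}
popular, verified = naive_extract(
    ["weather", "mp3", "tag", "song"], 2,
    returns_target=lambda w: w in toy_popularity,
    popularity=lambda w: toy_popularity.get(w, 0),
)
print(popular, verified)

# The efficiency problem from the slide: 10^3 terms give 10^9 3-term candidates.
print(1000 ** 3)
```

Even with four terms and two-term candidates the verifier is called 20 times to find two keywords, which is why the improved algorithm avoids enumerating the full candidate space.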
Popular Keyword Extraction: The Improved Algorithm
[Diagram: Target Page, Similar Pages, Anchor Text → Term Extractor → Term Pool → Candidate keyword generator → Candidate keyword TRIE → Best-First Search → Candidate Verifier (queries the Search Engine and the Suggestion Server) → Popular Keywords]

Best-First Search
[Diagram: Candidate keyword TRIE with scored nodes (e.g. weather, mp3, mp3 tag, mp3 song), explored by Best-First Search, which sends candidates to the Candidate Verifier; the verifier queries the Search Engine and the Suggestion Server]
• Goals:
  • Prune as many candidates as possible
  • Verify the most promising candidates first
• Start with single-term candidates
• Score candidates
• While the search engine request budget is not exceeded:
  • w = top scoring candidate
  • Send w to the verifier
  • Decide whether to prune w
  • If w is not pruned:
    • Expand w: generate and score the children of w

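The loop above maps naturally onto a priority queue. The following is a minimal sketch under stated assumptions: `best_first_search` and its callback parameters (`score`, `verify`, `should_prune`) are hypothetical names, each verified candidate is charged as one request, and children are formed by appending one term from the pool. The toy scores and the pruning rule are made up for the example.

```python
import heapq

def best_first_search(term_pool, score, verify, should_prune, budget):
    """Verify candidates in decreasing score order until the budget runs out."""
    # heapq is a min-heap, so scores are negated to pop the highest score first.
    frontier = [(-score(term), term) for term in term_pool]
    heapq.heapify(frontier)
    popular, requests = [], 0
    while frontier and requests < budget:
        _, w = heapq.heappop(frontier)
        requests += 1
        if verify(w):
            popular.append(w)
        if should_prune(w):
            continue  # w and all its descendants are never expanded
        for term in term_pool:
            child = f"{w} {term}"
            heapq.heappush(frontier, (-score(child), child))
    return popular

# Made-up scores; every keyword listed here is treated as truly popular.
toy_scores = {"mp3": 8, "mp3 tag": 5, "mp3 song": 3}
found = best_first_search(
    ["weather", "mp3", "tag", "song"],
    score=lambda w: toy_scores.get(w, 0),
    verify=lambda w: w in toy_scores,
    should_prune=lambda w: not any(k.startswith(w) for k in toy_scores),
    budget=5,
)
print(found)
```

With a budget of five verifications the search reaches all three toy keywords, whereas exhaustive enumeration of the same pool would need far more requests.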
Pruning
• Pruning decision for keyword w:
  • Submit query inurl:<target url> w
    • If no results, prune w and all its descendants
  • Retrieve suggestions for w
    • If no results, prune w and all its descendants
• Pruning eliminates the vast majority of candidates
  • A single search/suggestion request may eliminate thousands of candidates

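The two pruning checks can be sketched as a single predicate. The function names and the `search` and `suggest` callbacks are hypothetical stand-ins for the real services, and the toy index and suggestion list are invented; only the `inurl:<target url> w` query form comes from the slide. `subtree_size` assumes every candidate has one child per term in the pool.

```python
def should_prune(w, target_url, search, suggest):
    """Return True if w and all its descendants can be discarded."""
    if not search(f"inurl:{target_url} {w}"):
        return True  # the target page is not returned for w
    if not suggest(w):
        return True  # the suggestion server offers no completion of w
    return False

def subtree_size(num_terms, remaining_depth):
    """Descendants removed by one prune, with num_terms children per node."""
    return sum(num_terms ** depth for depth in range(1, remaining_depth + 1))

# Toy stand-ins for the search engine and the suggestion server.
TARGET = "www.example.com"
toy_index = {"mp3", "weather"}
toy_suggestions = ["mp3 tag", "mp3 song"]

def toy_search(query):
    keyword = query.split(" ", 1)[1]  # drop the leading inurl: operator
    return [TARGET] if keyword in toy_index else []

def toy_suggest(prefix):
    return [s for s in toy_suggestions if s.startswith(prefix)]

print(should_prune("mp3", TARGET, toy_search, toy_suggest))
print(should_prune("weather", TARGET, toy_search, toy_suggest))
print(subtree_size(1000, 2))
```

Under these assumptions, pruning a single-term candidate with a 1000-term pool and two further levels removes over a million candidates for the price of one or two requests.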
Scoring
• The Best-First search algorithm considers only the top scoring candidates given the budget
• Want to predict:
  • Whether the search engine returns the target page on w
  • Whether w is a popular keyword
• score(w) = tf(w)^α · idf(w)^β · popularity_score(w)^γ
  • tf(w) · idf(w): predicts whether the search engine returns the target page on w
  • popularity_score(w): predicts the popularity of w
  • α, β, and γ: relative weights of the scoring components

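A small sketch of how the three components might be combined. It assumes the relative weights act as exponents on tf, idf, and popularity_score, which is one plausible reading of the slide's formula; the parameter names `alpha`, `beta`, and `gamma` and the default value 1.0 are illustrative choices, not values from the talk.

```python
def score(tf, idf, popularity_score, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted product of the scoring components (weights are an assumption)."""
    return (tf ** alpha) * (idf ** beta) * (popularity_score ** gamma)

# Made-up component values for three candidates: (tf, idf, popularity_score).
candidates = {
    "mp3 tag": (2.0, 3.0, 4.0),
    "mp3 song": (1.0, 2.0, 5.0),
    "weather": (4.0, 1.0, 1.0),
}
ranked = sorted(candidates, key=lambda w: score(*candidates[w]), reverse=True)
print(ranked)
```

Setting a weight to 0 removes that component from the product, so the weights control how much each prediction influences the order in which candidates are verified.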
How to Compute Candidate Scores
• Every time the algorithm expands a keyword, it needs to compute scores for all its children
  • There could be thousands of such children
• TF Score
  • Straightforward. No search requests needed.
• IDF Score
  • Approximated based on an offline corpus. No search requests needed.
• Popularity Score
  • [BarYossefGurevich 08]: Algorithm for estimating keyword popularity using the query suggestion service
    • Too costly: may use dozens of suggestion requests per estimate
  • We present a new algorithm that estimates popularity for all the children in bulk
    • Uses hundreds of suggestion requests to estimate the popularity of all the children
    • Estimates are less accurate

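The TF and IDF components need no requests because both come from local data. The sketch below shows one conventional way to compute them for single terms: raw counts for TF and log(N / df) over an offline corpus for IDF. The function names, the whitespace tokenization, and this particular IDF formula are assumptions; the slides do not say which variants were used or how the scores of multi-term keywords are combined.

```python
import math
from collections import Counter

def term_frequencies(page_text):
    """Raw term counts in the target page's text."""
    return Counter(page_text.lower().split())

def inverse_document_frequencies(corpus):
    """log(N / df) for every term that appears in the offline corpus."""
    num_docs = len(corpus)
    doc_freq = Counter()
    for doc in corpus:
        doc_freq.update(set(doc.lower().split()))  # count each doc once per term
    return {term: math.log(num_docs / df) for term, df in doc_freq.items()}

# Made-up page text and offline corpus.
tf = term_frequencies("mp3 tag editor mp3")
idf = inverse_document_frequencies(["mp3 song", "weather today", "mp3 tag"])
print(tf["mp3"], idf["mp3"], idf["weather"])
```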
Cheap Popularity Estimation
• Input: a keyword w
• Goal: Estimate popularity of all w’s children
• Bucket children according to their first character
• Estimate relative popularity of each bucket
• Estimate the relative popularity within each bucket
• Example: w = “mp3”; children: “mp3 song”, “mp3 tag”, “mp3 table”, …
[Diagram: children of “mp3” are grouped into buckets by first character (s: “mp3 song”; t: “mp3 tag”, “mp3 table”); the BG08 Popularity Estimator produces an estimate of popularity_score(prefix) for prefixes such as “mp3 s” and “mp3 t”]

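The bucketing idea can be sketched as follows. This is an illustration under assumptions, not the algorithm from the talk: `prefix_popularity` stands in for a BG08-style estimator applied to the prefix "w c", `within_bucket_weight` is a hypothetical local weighting used to split a bucket's estimate among its members (the slides do not say how the within-bucket step is done), and all numbers are made up. Children are assumed to have the form "w term".

```python
from collections import defaultdict

def bulk_popularity(w, children, prefix_popularity, within_bucket_weight):
    """Estimate the popularity of all children of w, one estimator call per bucket."""
    buckets = defaultdict(list)
    for child in children:
        first_char = child[len(w) + 1]  # first character after "w "
        buckets[first_char].append(child)
    estimates = {}
    for first_char, members in buckets.items():
        bucket_pop = prefix_popularity(f"{w} {first_char}")
        total = sum(within_bucket_weight(c) for c in members)  # assumed > 0
        for c in members:
            estimates[c] = bucket_pop * within_bucket_weight(c) / total
    return estimates

# Made-up numbers for the "mp3" example.
toy_prefix_pop = {"mp3 s": 5, "mp3 t": 6}
toy_weights = {"mp3 song": 1, "mp3 tag": 2, "mp3 table": 1}
estimator_calls = []

def toy_prefix_popularity(prefix):
    estimator_calls.append(prefix)
    return toy_prefix_pop[prefix]

estimates = bulk_popularity(
    "mp3", ["mp3 song", "mp3 tag", "mp3 table"],
    toy_prefix_popularity, toy_weights.get,
)
print(estimates, len(estimator_calls))
```

Here the costly estimator runs once per bucket (two calls) rather than once per child (three calls); with thousands of children and a few dozen possible first characters, the saving is what makes bulk scoring affordable.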
Popular Keyword Extraction Algorithm: Quality Analysis
• Precision: 100%
  • All extracted keywords return the target page
• Recall: do we miss some popular keywords?
  • More difficult to measure: no ground truth to compare to
  • Estimate lower bound on the recall
    • Google: recall > 90%
    • Yahoo!: recall = 70%–80%

Resource Usage
• ~10,000 suggestion server requests per page
• ~1,000 search engine requests per page
• 85% (Google), 75% (Yahoo!) after 25% of resources spent

ImpressionRank of News Sites (March 2009)
[Chart: ImpressionRank of news sites. Popular keyword labels in the chart: weather cnn bristol palin news weather cnn video obama stimulus package new york times barack obama amazon movies barack obama]

Conclusions
• First external algorithms for:
  • ImpressionRank estimation
  • Popular keyword extraction
• Future work:
  • Improve efficiency
  • Improve recall