Shuffling a Stacked Deck: The Case for Partially Randomized Ranking of Search Engine Results. Sandeep Pandey (1), Sourashis Roy (2), Christopher Olston (1), Junghoo Cho (2), Soumen Chakrabarti (3). (1) Carnegie Mellon, (2) UCLA, (3) IIT Bombay.
Popularity as a Surrogate for Quality • Search engines want to measure the “quality” of pages • Quality is hard to define and measure • Various “popularity” measures are used in ranking • e.g., in-links, PageRank, user traffic
Relationship Between Popularity and Quality • Popularity : depends on the number of users who “like” a page • relies on both awareness and quality of the page • Popularity correlated with quality • when awareness is large
Problem • Popularity/quality correlation weak for young pages • Even if of high quality, may not (yet) be popular due to lack of user awareness • Plus, process of gaining popularity inhibited by “entrenchment effect”
Entrenchment Effect • Search engines show entrenched (already-popular) pages at the top • Users discover pages via search engines; tend to focus on top results [Figure: ranked result list; user attention falls on the entrenched pages at the top]
Outline • Problem introduction • Evidence of entrenchment effect • Key idea: Mitigate entrenchment by introducing randomness into ranking • Model of ranking and popularity evolution • Evaluation • Summary
Evidence of Entrenchment • “Do search engines suppress controversy?” - Susan L. Gerhart • “More news, less diversity” - New York Times • “Googlearchy” • “Distinction of retrievability and visibility” • “The politics of search engines” - IEEE Computer • “The political economy of linking on the Web” - ACM Conf. on Hypertext & Hypermedia • “Are search engines biased?” - Chris Sherman • “Bias on the Web” - Comm. of the ACM
Quantification of Entrenchment Effect • Impact of Search Engines on Page Popularity • Real Web study by Cho et al. [WWW’04] • Pages downloaded every week from 154 sites • Partitioned into 10 groups based on initial link popularity • After 7 months: • 70% of new links went to the top 20% of pages • PageRank decreased for the bottom 50% of pages
Alternative Approaches to Counteract the Entrenchment Effect • Weight links to young pages more • [Baeza-Yates et al., SPIRE ’02] • Proposed an age-based variant of PageRank • Extrapolate quality based on increase in popularity • [Cho et al., SIGMOD ’05] • Proposed an estimate of quality based on the derivative of popularity
Our Approach: Randomized Rank Promotion • Select random (young) pages to promote to good rank positions • Rank position to promote to is chosen at random [Figure: ranked list before and after promotion of a low-ranked page]
Our Approach: Randomized Rank Promotion • Consequence: Users visit promoted pages; improves quality estimate • Compared with previous approaches: • Does not rely on temporal measurements (+) • Sub-optimal (-)
Exploration/Exploitation Tradeoff • Exploration/Exploitation tradeoff • exploit known high-quality pages by assigning good rank positions • explore quality of new pages by promoting them in rank • Existing search engines only exploit (to our knowledge)
Possible Objectives for Rank Promotion • Fairness • Give each page an equal chance to become popular • Incentive for search engines to be fair? • Quality (our choice) • Maximize quality of search results seen by users (in aggregate) • Quality of page p: extent to which users “like” p • Q(p) ∈ [0,1]
Model of the Web • Web = collection of multiple disjoint topic-specific communities (e.g., “Linux”, “Squash”, etc.) • A community is made up of a set of pages, interested users and related queries
Model of the Web • Users visit pages only by issuing queries to search engine • Mixed surfing & searching considered in the paper • Query answer = ordered list containing all pages in the corresponding community • A single ranked list associated with each community • Since queries within a community are very similar
Model of the Web • Consequence: each community evolves independently of the other communities [Figure: separate ranked lists for the community on Squash and the community on Linux]
Quality-Per-Click Metric (QPC) • V(p,t): number of visits to page p at time t • QPC: average quality of pages viewed by users, amortized over time
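The QPC definition above can be illustrated with a small sketch. It assumes QPC over a finite observation window is the visit-weighted average of Q(p), i.e. the sum of V(p,t)·Q(p) over all pages and time steps divided by the total number of visits. The function name and the data layout are hypothetical, chosen only for this example.

```python
def quality_per_click(visits, quality):
    """Visit-weighted average quality over a finite window.

    visits:  list with one dict per time step, mapping page -> V(p, t)
    quality: dict mapping page -> Q(p) in [0, 1]
    (Both layouts are assumptions made for this sketch.)
    """
    total_visits = 0
    total_quality = 0.0
    for step in visits:
        for page, count in step.items():
            total_visits += count
            total_quality += count * quality[page]
    if total_visits == 0:
        raise ValueError("no visits recorded")
    return total_quality / total_visits


# Two time steps, two pages: a good page "a" and a poor page "b".
example_visits = [{"a": 3, "b": 1}, {"a": 1, "b": 3}]
example_quality = {"a": 0.9, "b": 0.1}
example_qpc = quality_per_click(example_visits, example_quality)
```

In this toy window both pages receive four visits each, so the result is the plain average of the two quality values.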
Desiderata for Randomized Rank Promotion • Want ability to: • Control exploration/exploitation tradeoff • “Select” certain pages as candidates for promotion • “Protect” certain pages from demotion [Figure: ranked list before and after promotion of a low-ranked page]
Randomized Rank Promotion Scheme • Promotion pool Wm: random ordering, giving list Lm • Remainder W-Wm: ordered by popularity, giving list Ld [Figure: page set W split into the promotion pool and the remainder]
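Randomized Rank Promotion Scheme • Merge the promotion list Lm and the remainder list Ld into one result list • The first k-1 positions are taken from Ld • Each later position is filled from Lm with probability r, or from Ld with probability 1-r • Example: k = 3, r = 0.5 [Figure: merged result list for the example]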
Parameters • Promotion pool (Wm) • Uniform rank promotion: give an equal chance to each page • Selective rank promotion: exclusively target zero-awareness pages • Start rank (k) • rank to start randomization from • Degree of randomization (r) • controls the tradeoff between exploration and exploitation
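The scheme and its three parameters can be sketched in a few lines. This is a minimal illustration, assuming the merge works position by position: the top k-1 slots come from the popularity-ordered remainder, and every later slot is drawn from the shuffled promotion list with probability r. The function name, argument names and the `rng` parameter are hypothetical.

```python
import random


def randomized_rank_promotion(pages_by_popularity, promotion_pool, k, r, rng=None):
    """Merge a shuffled promotion list (Lm) into a popularity-ordered list (Ld).

    pages_by_popularity: all pages, most popular first
    promotion_pool:      pages eligible for promotion (Wm)
    k:                   start rank; positions 1..k-1 are never randomized
    r:                   degree of randomization in [0, 1]
    """
    if k < 1 or not 0.0 <= r <= 1.0:
        raise ValueError("need k >= 1 and 0 <= r <= 1")
    if rng is None:
        rng = random.Random()
    pool = set(promotion_pool)
    lm = [p for p in pages_by_popularity if p in pool]       # promotion list
    rng.shuffle(lm)                                           # random ordering
    ld = [p for p in pages_by_popularity if p not in pool]    # remainder, by popularity
    result = ld[:k - 1]                                       # protected top positions
    rest = ld[k - 1:]
    i = j = 0
    while i < len(lm) and j < len(rest):
        if rng.random() < r:          # explore: take the next promoted page
            result.append(lm[i])
            i += 1
        else:                         # exploit: take the next popular page
            result.append(rest[j])
            j += 1
    result.extend(lm[i:])             # one list is exhausted; append what is left
    result.extend(rest[j:])
    return result


pages = ["a", "b", "c", "d", "e", "f"]            # most popular first
ranking = randomized_rank_promotion(pages, {"e", "f"}, k=3, r=0.5,
                                    rng=random.Random(0))
```

With r = 0 the output is the deterministic popularity ranking of the remainder followed by the pool, and with k = 1 and r = 1 every promoted page is placed ahead of the popular ones.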
Tuning the Parameters • Objective: maximize quality-per-click (QPC) • Entrenchment in a community depends on many factors • Number of pages and users • Page lifetimes • Visits per user • Two ways to tune • set parameters per community • one parameter setting for all communities
Popularity Evolution Cycle • Popularity P(p,t) determines rank R(p,t) • Rank determines visit rate V(p,t) • Visit rate determines awareness A(p,t) • Awareness feeds back into popularity
DETAIL Popularity to Rank Relationship • Rank of a page under the randomized rank promotion scheme • determined by a combination of popularity and randomness • Deterministic popularity-based ranking is a special case • i.e., r=0 • Unknown function FPR: rank as a function of the popularity of page p under a given randomized scheme: R(p,t) = FPR(P(p,t))
DETAIL Viewing Likelihood • Depends primarily on rank in list [Joachims KDD’02] • From AltaVista data [Lempel et al. WWW’03]: view probability FRV(r) ∝ r^(-1.5), where r here is the rank position [Figure: probability of viewing vs. rank]
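A small sketch of the power-law viewing model: it computes per-rank view probabilities proportional to rank^(-1.5). Normalizing the weights so they sum to 1 (each visit lands on exactly one result) is an assumption of this example, as is the function name.

```python
def view_probabilities(num_ranks, exponent=1.5):
    """Probability that a visit goes to each rank 1..num_ranks.

    Weights follow rank ** -exponent; normalization to a probability
    distribution is an assumption made for this sketch.
    """
    weights = [rank ** -exponent for rank in range(1, num_ranks + 1)]
    total = sum(weights)
    return [w / total for w in weights]


probs = view_probabilities(150)
share_of_top_ten = sum(probs[:10])   # attention captured by the first ten results
```

The steep decay is what drives entrenchment: a page far down the list receives only a tiny share of visits, regardless of its quality.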
DETAIL Visit to Awareness Relationship • Awareness A(p,t): fraction of users who have visited page p at least once by time t
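One way to relate visits to awareness, shown here as a hedged sketch rather than the formula from the talk: assume every visit is made by a user drawn uniformly at random and independently from the community. A given user then remains unaware after v visits with probability (1 - 1/n)^v, where n is the number of users. The function name is hypothetical.

```python
def expected_awareness(total_visits, num_users):
    """Expected fraction of users who have visited the page at least once.

    Assumes each visit comes from an independently, uniformly chosen user
    (an assumption of this sketch).
    """
    if num_users <= 0 or total_visits < 0:
        raise ValueError("need num_users > 0 and total_visits >= 0")
    return 1.0 - (1.0 - 1.0 / num_users) ** total_visits


awareness_after_500 = expected_awareness(500, 1000)
```

Under this assumption awareness grows quickly at first and then saturates, because later visits increasingly come from users who already know the page.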
DETAIL Awareness to Popularity Relationship • Quality Q(p) :extent to which users like page p (contribute towards its popularity) • Popularity P(p,t) :
Popularity Evolution Cycle • Rank R(p,t) = FPR(P(p,t)) • Visit rate V(p,t) = FRV(R(p,t)) • Awareness A(p,t) = FVA(V(p,t)) • Popularity P(p,t) = FAP(A(p,t))
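The cycle can be made concrete with a toy simulation. This is a sketch under several assumptions that are not taken from the slides: popularity is modeled as awareness times quality, ranking is purely by popularity (the r = 0 special case), visits per step are split across ranks in proportion to rank^(-1.5), and each visit comes from a uniformly random user. All names are hypothetical.

```python
def simulate_cycle(qualities, num_users, visits_per_step, steps):
    """Iterate popularity -> rank -> visits -> awareness for one community.

    Returns the awareness of every page after each step (expected values).
    """
    n = len(qualities)
    awareness = [0.0] * n
    weights = [(pos + 1) ** -1.5 for pos in range(n)]   # view weight per rank
    total_weight = sum(weights)
    history = []
    for _ in range(steps):
        # Awareness -> popularity (assumed form: P = A * Q)
        popularity = [a * q for a, q in zip(awareness, qualities)]
        # Popularity -> rank: most popular first, ties broken by page index
        order = sorted(range(n), key=lambda p: (-popularity[p], p))
        for pos, page in enumerate(order):
            # Rank -> visits: expected visits to this rank in one step
            visits = visits_per_step * weights[pos] / total_weight
            # Visits -> awareness: unaware users shrink with every visit
            still_unaware = (1.0 - awareness[page]) * (1.0 - 1.0 / num_users) ** visits
            awareness[page] = 1.0 - still_unaware
        history.append(list(awareness))
    return history


history = simulate_cycle([0.2, 0.9, 0.5], num_users=100,
                         visits_per_step=50, steps=20)
final_awareness = history[-1]
```

Inspecting `history` for different quality vectors shows how the initial ordering influences which pages gain awareness first.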
Deriving Popularity Evolution Curve • Next step: derive a formula for the popularity evolution curve • Derive it using the awareness distribution of pages [Figure: popularity P(p,t) vs. time t]
Deriving Popularity Evolution Curve • Assumptions • Number of pages is constant • Pages are created and retired according to a Poisson process with a rate parameter • Quality distribution of pages is stationary • In the steady state, both the popularity and the awareness distributions of the pages are stationary
DETAIL Popularity Evolution Curve and Awareness Distribution • Awareness distribution: fraction of pages of quality q whose awareness is i / (#users) • Popularity evolution curve E(x,q): time duration for which a page of quality q has popularity value x • Next: derive the popularity evolution curve using the awareness distribution
DETAIL Popularity Evolution Curve and Awareness Distribution • Interpret the awareness distribution as the probability that a page of quality q has awareness ai at any point in time • We know that: [equation omitted] • Hence: [equation omitted]
DETAIL Deriving Awareness Distribution • Awareness distribution: fraction of pages of quality q whose awareness is i / (#users) • Steady-state analysis yields the distribution [equation omitted] • But remember that we do not know FPR yet: R(p,t) = FPR(P(p,t))
DETAIL Deriving Awareness Distribution • Good news: since rank is a combination of popularity and randomness, we can derive FPR given the awareness distribution • Start with an initial form of FPR; iterate until convergence
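The "start with an initial form and iterate until convergence" step is a fixed-point iteration. The sketch below shows only the generic loop; the real update (recomputing FPR from the awareness distribution and vice versa) is not reproduced here, so a toy contraction stands in for it. The function name, tolerance and iteration cap are assumptions.

```python
def iterate_to_fixed_point(update, initial, tol=1e-9, max_iters=1000):
    """Apply `update` repeatedly until successive estimates differ by < tol.

    update:  maps a list of floats to a new list of the same length
    initial: starting estimate (for example, an initial guess of FPR values)
    """
    current = list(initial)
    for _ in range(max_iters):
        nxt = update(current)
        # Largest change in any component between two successive estimates
        change = max(abs(a - b) for a, b in zip(nxt, current))
        current = nxt
        if change < tol:
            return current
    raise RuntimeError("no convergence within max_iters")


# Toy stand-in for the real update: x -> 0.5 * x + 1 has the fixed point x = 2.
toy_result = iterate_to_fixed_point(lambda xs: [0.5 * x + 1.0 for x in xs],
                                    [0.0, 10.0])
```

Convergence of such a loop is not guaranteed in general; the iteration cap makes a non-converging update fail loudly instead of looping forever.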
Summary of Where We Stand • Formalized the popularity evolution cycle • Relationship between popularity evolution and awareness distribution • Derived the awareness distribution • Next step: tune parameters • Recall, goal is to obtain scheme that: • achieves high QPC (quality per click) • is robust across a wide range of community types
Tuning the Promotion Scheme • Parameters: k, r and Wm • Objective: maximize QPC • Influential factors: • Number of pages and users • Page lifetimes • Visits per user
Default Community Setting • Number of pages = 10,000 * • Number of users = 1000 • Visits per user = 1000 visits per day • Page lifetimes = 1.5 years [Ntoulas et al., WWW’04] * How Much Information? SIMS, Berkeley, 2003
Tuning: Wm parameter • Compared: no promotion, uniform promotion, selective promotion • Setting: k=1 and r=0.2
Tuning: k and r • Optimal r lies in (0,1) • Optimal r increases with increasing k • Based on simulation (reason: the analysis is only accurate for small values of r)
Tuning: k and r Deciding k & r : • k >= 2 for “feeling lucky” • Minimize amount of “junk” perceived • Maximize QPC
Final Parameter Settings • Promotion pool (Wm ): zero-awareness pages • Start rank (k): 1 or 2 • Randomization (r) : 0.1
Influence of Visit Rate 1000 visits/day per user
Summary • Entrenchment effect hurts search result quality • Solution: Randomized rank promotion • Model of Web evolution and QPC metric • Used to tune & evaluate randomized rank promotion • Initial results • Significantly increases QPC • Robust across wide range of Web communities • More study required
THE END • Paper available at : www.cs.cmu.edu/~spandey