Recommendation in Advertising and Social Networks Deepayan Chakrabarti (deepay@yahoo-inc.com)
This presentation • Content Match [KDD 2007]: • How can we estimate the click-through rate (CTR) of an ad on a page? CTR for ad j on page i; ~10^9 pages × ~10^6 ads
This presentation • Estimating CTR for Content Match [KDD ‘07] • Theoretical underpinnings [COLT ‘10 best student paper] • Represent relationships as a graph • Recommendation = Link Prediction • Many useful heuristics exist • Why do these heuristics work? Goal: Suggest friends
Estimating CTR for Content Match • Contextual Advertising • Show an ad on a webpage (“impression”) • Revenue is generated if a user clicks • Problem: Estimate the click-through rate (CTR) of an ad on a page. CTR for ad j on page i; ~10^9 pages × ~10^6 ads
Estimating CTR for Content Match • Why not use the MLE (clicks over impressions, c/N)? • Few (page, ad) pairs have N > 0 • Very few have c > 0 as well • The MLE does not differentiate between 0/10 and 0/100 • We have additional information: hierarchies
Estimating CTR for Content Match • Use an existing, well-understood hierarchy • Categorize ads and webpages to leaves of the hierarchy • CTR estimates of siblings are correlated • The hierarchy allows us to aggregate data • Coarser resolutions provide reliable estimates for rare events, which then influence estimation at finer resolutions
Estimating CTR for Content Match • Region = (page node, ad node) • Region Hierarchy: a cross-product of the page hierarchy and the ad hierarchy [Figure: a level-i region pairing a node of the page hierarchy with a node of the ad hierarchy]
Estimating CTR for Content Match • Our Approach • Data Transformation • Model • Model Fitting
Data Transformation • Problem: the variance of the raw CTR c/N depends on the unknown mean, and all zero-click regions look alike • Solution: Freeman-Tukey transform, e.g. yr = √(cr/Nr) + √((cr+1)/Nr) • Differentiates regions with 0 clicks (through Nr) • Variance stabilization: Var(yr) ≈ 1/Nr, roughly independent of the mean CTR
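As a concrete illustration, a minimal sketch of the transform (the exact variant used in the paper may differ slightly; the form below is a standard Freeman-Tukey-style transform for rates):

```python
import math

def freeman_tukey(clicks, impressions):
    """Freeman-Tukey-style transform of a raw CTR.

    y = sqrt(c/N) + sqrt((c+1)/N)
    Its variance is approximately 1/N regardless of the true CTR,
    and regions with c = 0 are no longer collapsed to a single value.
    """
    c, n = clicks, impressions
    return math.sqrt(c / n) + math.sqrt((c + 1) / n)

# 0 clicks out of 10 vs. 0 clicks out of 100:
# the MLE is 0 for both, but the transformed values differ.
print(freeman_tukey(0, 10))   # ≈ 0.316
print(freeman_tukey(0, 100))  # ≈ 0.100
```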
Model • Goal: Smoothing across siblings in hierarchy [Huang+Cressie/2000] • Each region has a latent state Sr • yr is independent of the hierarchy given Sr • Sr is drawn from its parent Spa(r) [Figure: levels i and i+1 of the tree; latent states Sr at the nodes, observables yr attached below them]
Model [Figure: a parent region pa(r) with latent state Spa(r), state variance Wpa(r), coefficients βpa(r), features upa(r), and observation ypa(r); a child region r with Sr, observation variance Vr, coefficients βr, features ur, and observation yr]
Model • However, learning Wr, Vr and βr for each region is clearly infeasible • Assumptions: • All regions at the same level ℓ share the same W(ℓ) and β(ℓ) • Vr = V/Nr for some constant V, since the Freeman-Tukey transform stabilizes the observation variance to ≈ 1/Nr
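The assumptions above can be collected into a small generative sampler. This is a hypothetical sketch: the function name, the toy hierarchy, and the exact placement of the regression term β·ur in the observation equation are assumptions, not the paper's precise specification.

```python
import random

def sample_tree(tree, u, beta, W, V, N, root=0, seed=0):
    """Sample from a tree-structured Gaussian model (sketch):
        S_root ~ N(0, W)
        S_r    ~ N(S_pa(r), W)                 latent state, drawn from parent
        y_r    ~ N(beta * u_r + S_r, V / N_r)  observed transformed CTR
    `tree` maps each node to a list of its children."""
    rng = random.Random(seed)
    S, y = {}, {}

    def visit(r, parent_mean):
        S[r] = rng.gauss(parent_mean, W ** 0.5)
        y[r] = rng.gauss(beta * u[r] + S[r], (V / N[r]) ** 0.5)
        for c in tree.get(r, []):
            visit(c, S[r])

    visit(root, 0.0)
    return S, y

tree = {0: [1, 2], 1: [3, 4]}                  # toy region hierarchy
u = {r: 1.0 for r in range(5)}                 # one covariate per region
N = {0: 1000, 1: 500, 2: 500, 3: 10, 4: 490}   # impression counts
S, y = sample_tree(tree, u, beta=0.5, W=0.1, V=1.0, N=N)
```

Note how rare regions (small Nr) get noisier observations, which is exactly what the hierarchy-based smoothing has to compensate for.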
Model • Implications: W determines the degree of smoothing • W → ∞: • Sr varies greatly from Spa(r) • Each region learns its own Sr • No smoothing • W → 0: • All Sr are identical • A regression model on features ur is learnt • Maximum smoothing
Model • Implications: W determines the degree of smoothing • Var(Sr) increases from root to leaf • Better estimates at coarser resolutions
Model • Implications: W determines the degree of smoothing • Var(Sr) increases from root to leaf • Correlations among siblings at level ℓ depend only on the level of their least common ancestor: the deeper the least common ancestor, the higher the correlation
Estimating CTR for Content Match • Our Approach • Data Transformation (Freeman-Tukey) • Model (Tree-structured Markov Chain) • Model Fitting
Model Fitting • Fitting using a Kalman filtering algorithm • Filtering: Recursively aggregate data from leaves to root • Smoothing: Propagate information from root to leaves • Complexity: linear in the number of regions, for both time and space
Model Fitting • Fitting using a Kalman filtering algorithm • Filtering: Recursively aggregate data from leaves to root • Smoothing: Propagate information from root to leaves • The Kalman filter requires knowledge of β, V, and W • EM wrapped around the Kalman filter
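The leaves-to-root / root-to-leaves idea can be sketched as exact Gaussian message passing on a tree with scalar states. This is a simplified stand-in for the paper's Kalman recursions (no covariates, no per-level parameters, no EM); all names and the toy example are hypothetical.

```python
def fuse(msgs):
    """Precision-weighted fusion of Gaussian messages [(mean, var), ...]."""
    prec = sum(1.0 / v for _, v in msgs)
    mean = sum(m / v for m, v in msgs) / prec
    return mean, 1.0 / prec

def tree_smooth(tree, y, V, W, root, prior=(0.0, 10.0)):
    """Two-pass inference for: S_root ~ N(prior), S_r ~ N(S_pa(r), W),
    y_r ~ N(S_r, V[r]). Unobserved regions are simply absent from `y`.
    Returns the posterior (mean, var) of S_r for every node."""
    up = {}                                   # upward message about S_pa(r)

    def upward(r):                            # filtering: leaves -> root
        msgs = [(y[r], V[r])] if r in y else []
        for c in tree.get(r, []):
            upward(c)
            if c in up:
                msgs.append(up[c])
        if msgs:
            m, p = fuse(msgs)
            up[r] = (m, p + W)                # add transition variance
    upward(root)

    post = {}

    def downward(r, down):                    # smoothing: root -> leaves
        local = [down] + ([(y[r], V[r])] if r in y else [])
        child_msgs = {c: up[c] for c in tree.get(r, []) if c in up}
        post[r] = fuse(local + list(child_msgs.values()))
        for c in tree.get(r, []):
            # message to child c excludes c's own upward contribution
            m, p = fuse(local + [msg for k, msg in child_msgs.items() if k != c])
            downward(c, (m, p + W))
    downward(root, prior)
    return post

# Toy example: a root with two leaf regions; only leaf 1 is observed.
post = tree_smooth({0: [1, 2]}, {1: 1.0}, {1: 0.1}, W=1.0, root=0)
# Leaf 2 has no data, yet borrows strength from its sibling via the root.
```

Each node is visited a constant number of times, matching the slide's claim of time and space linear in the number of regions.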
Experiments • 503M impressions • 7-level hierarchy, of which the top 3 levels were used • Zero clicks in • 76% of regions at level 2 • 95% of regions at level 3 • Full dataset DFULL, and a 2/3 sample DSAMPLE
Experiments • Estimate CTRs for all regions R in level 3 with zero clicks in DSAMPLE • Some of these regions (call them R>0) get clicks in DFULL • A good model should predict higher CTRs for R>0 than for the other regions in R
Experiments • We compared 4 models • TS: our tree-structured model • LM (level-mean): each level smoothed independently • NS (no smoothing): CTR proportional to 1/Nr • Random: assuming |R>0| is given, randomly guess which regions of R belong to R>0
Experiments [Figure: experimental comparison of TS against Random, LM, and NS]
Experiments • MLE = 0 everywhere, since 0 clicks were observed • What about estimated CTR? [Figure: estimated CTR vs. impressions; under No Smoothing (NS) estimates stay close to the MLE for large N, while Our Model (TS) inherits variability from coarser resolutions]
Estimating CTR for Content Match • We presented a method to estimate • rates of extremely rare events • at multiple resolutions • under severe sparsity constraints • Key points: • Tree-structured generative model • Extremely fast parameter fitting
Theoretical underpinnings • Estimating CTR for Content Match [KDD ‘07] • Theoretical underpinnings of link prediction [COLT ‘10 best student paper]
Link Prediction • Which pair of nodes {i, j} should be connected? Goal: Recommend a movie [Figure: bipartite graph of users (Alice, Bob, Charlie) and movies]
Link Prediction • Which pair of nodes {i, j} should be connected? Goal: Suggest friends
Link Prediction Heuristics • Predict link between nodes • Connected by the shortest path • With the most common neighbors (length-2 paths) • More weight to low-degree common nbrs (Adamic/Adar) [Figure: a prolific common friend, Alice with 1000 followers, is weaker evidence for a Bob-Charlie link than a less prolific common friend with 3 followers]
Link Prediction Heuristics • Predict link between nodes • Connected by the shortest path • With the most common neighbors (length-2 paths) • More weight to low-degree common nbrs (Adamic/Adar) • With more short paths (e.g. length-3 paths) • exponentially decaying weights to longer paths (Katz measure) • …
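For concreteness, the two most common heuristics above can be computed directly from adjacency sets. The toy graph below is made up for illustration:

```python
import math

def common_neighbors(adj, i, j):
    """Number of length-2 paths between i and j."""
    return len(adj[i] & adj[j])

def adamic_adar(adj, i, j):
    """Adamic/Adar: common neighbors weighted by 1/log(degree),
    so low-degree (less prolific) common neighbors count more.
    Assumes every common neighbor has degree >= 2."""
    return sum(1.0 / math.log(len(adj[z])) for z in adj[i] & adj[j])

# Toy undirected graph as adjacency sets (hypothetical example).
adj = {
    "alice": {"hub", "carol"},
    "bob":   {"hub", "carol"},
    "hub":   {"alice", "bob", "c1", "c2", "c3", "c4", "c5", "c6"},
    "carol": {"alice", "bob"},
    "c1": {"hub"}, "c2": {"hub"}, "c3": {"hub"},
    "c4": {"hub"}, "c5": {"hub"}, "c6": {"hub"},
}
# alice and bob share two common neighbors, but the low-degree one
# (carol, degree 2) contributes more evidence than the hub (degree 8).
```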
Previous Empirical Studies* • Observed ranking of link prediction accuracy, especially if the graph is sparse: Random < Shortest Path < Common Neighbors < Adamic/Adar < Ensemble of short paths • How do we justify these observations? *Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007
Link Prediction – Generative Model • Model: • Nodes are uniformly distributed points in a unit-volume latent space • This space has a distance metric • Points close to each other are likely to be connected in the graph • Logistic distance function (Raftery+/2002)
Link Prediction – Generative Model α determines the steepness 1 ½ radius r Model: Nodes are uniformly distributed points in a latent space This space has a distance metric Points close to each other are likely to be connected in the graph Higher probability of linking • Link prediction ≈ find nearest neighbor who is not currently linked to the node. • Equivalent to inferring distances in the latent space
Common Neighbors • Pr2(i, j) = Pr(common neighbor | dij): a product of two logistic probabilities, integrated over a volume determined by dij • As α → ∞, the logistic becomes a step function: much easier to analyze!
Common Neighbors • Everyone has the same radius r, in a unit-volume universe • η = number of common neighbors of i and j • V(r) = volume of a radius-r ball in D dimensions • The number of common neighbors gives a bound on the distance dij
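The "bound on distance" argument can be sketched as follows (a simplified version; the paper's statement carries the precise constants):

```latex
% Each of the N - 2 other nodes is a common neighbor of i and j
% independently with probability A(r, r, d_{ij}), the volume of the
% intersection of two radius-r balls whose centers are d_{ij} apart:
\eta \sim \mathrm{Bin}\!\left(N-2,\; A(r, r, d_{ij})\right)
% By a Hoeffding bound, with probability at least 1 - \delta:
\left|\frac{\eta}{N-2} - A(r, r, d_{ij})\right| \le \epsilon,
\qquad \epsilon = \sqrt{\tfrac{1}{2(N-2)}\log\tfrac{2}{\delta}}
% Since A(r, r, d) is monotonically decreasing in d, the empirical
% count \eta can be inverted to bracket the unknown distance d_{ij}.
```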
Common Neighbors • OPT = node closest to i • MAX = node with max common neighbors with i • Theorem: w.h.p. dOPT ≤ dMAX ≤ dOPT + 2[ε/V(1)]^(1/D) • Link prediction by common neighbors is asymptotically optimal
Common Neighbors: Distinct Radii • Node k has radius rk • i→k if dik ≤ rk (directed graph) • rk captures the popularity of node k • Two types of common neighbor k of i and j: • Type 1: k→i and k→j, governed by the intersection volume A(ri, rj, dij) • Type 2: i→k and j→k, governed by A(rk, rk, dij)
Type 2 common neighbors • Example graph: • N1 nodes of radius r1 and N2 nodes of radius r2, with r1 << r2 • η1 ~ Bin[N1, A(r1, r1, dij)], η2 ~ Bin[N2, A(r2, r2, dij)] • Pick d* to maximize Pr[η1, η2 | dij]; at the maximizer, w(r1)E[η1|d*] + w(r2)E[η2|d*] = w(r1)η1 + w(r2)η2 • A weighted count of common neighbors, inversely related to d*
Type 2 common neighbors • The weight w(r) behaves differently across radii: • Small r: presence of a common neighbor is very informative • r close to the max radius: absence is very informative, and w(r) ≈ 1/r • Real-world graphs generally fall in this range, justifying Adamic/Adar's downweighting of prolific common neighbors
ℓ-hop Paths • Common neighbors = 2-hop paths • For longer paths: • Bounds are weaker • For ℓ' ≥ ℓ we need ηℓ' >> ηℓ to obtain similar bounds • This justifies the exponentially decaying weight given to longer paths by the Katz measure
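A truncated version of the Katz measure is easy to sketch. This is a pure-Python illustration with hypothetical names; real implementations would use sparse matrices:

```python
def katz_scores(adj_matrix, beta=0.05, max_len=6):
    """Truncated Katz measure: sum over path lengths l = 1..max_len of
    beta^l * (A^l)_ij, so longer paths get exponentially smaller weight.
    `adj_matrix` is a small dense 0/1 matrix as a list of lists."""
    n = len(adj_matrix)

    def matmul(X, Y):
        return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]

    score = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in adj_matrix]   # A^1
    weight = beta
    for _ in range(max_len):
        for i in range(n):
            for j in range(n):
                score[i][j] += weight * power[i][j]
        power = matmul(power, adj_matrix)    # A^(l+1)
        weight *= beta                       # beta^(l+1)
    return score

# Path graph 0-1-2-3: no edge (0, 2), but short paths give the pair
# (0, 2) a higher Katz score than the more distant pair (0, 3).
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
scores = katz_scores(A)
```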
Summary • Three key ingredients: • Closer points are likelier to be linked (Small World Model: Watts & Strogatz 1998, Kleinberg 2001) • The triangle inequality holds (necessary to extend to ℓ-hop paths) • Points are spread uniformly at random (otherwise properties will depend on location as well as distance)
Summary • For large dense graphs, common neighbors are enough • Differentiating between different degrees is important • In sparse graphs, paths of length 3 or more help in prediction: the number of paths matters, not the length [Figure: accuracy ranking — Random, Shortest Path, Common Neighbors, Adamic/Adar, Ensemble of short paths] *Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007
Conclusions • Discussed two problems • Estimating CTR for Content Match • Combat sparsity by hierarchical smoothing • Theoretical underpinnings • Latent space model • Link prediction ≈ finding nearest neighbors in this space
Other Work • Computational Advertising • Combining IR with click feedback [WWW ‘08] • Multi-armed bandits using hierarchies [SDM ‘07, ICML ‘07] • “Mortal” multi-armed bandits [NIPS ‘08] • Traffic Shaping [EC ‘12] • Web Search • Finding Quicklinks [WWW ‘09] • Titles for Quicklinks [KDD ‘08] • Incorporating tweets into search results [ICWSM ‘11] • Website clustering [WWW ‘10] • Webpage segmentation [WWW ‘08] • Template detection [WWW ‘07] • Finding hidden query aspects [KDD ’09] • Graph Mining • Epidemic thresholds [SRDS ‘03, Infocom ‘07] • Non-parametric prediction in dynamic graphs • Graph sampling [ICML ‘11] • Graph generation models [SDM ‘04, PKDD ‘05, JMLR ‘10] • Community detection [KDD ‘04, PKDD ‘04]
Advertising Setting • Sponsored Search (text ads) • Display • Content Match: match ads to the content of the page
Common Neighbors: Distinct Radii • Node k has radius rk • i→k if dik ≤ rk (directed graph) • rk captures the popularity of node k • “Weighted” common neighbors: • Predict the (i, j) pairs with the highest Σ w(r) η(r), where η(r) = number of common neighbors of radius r and w(r) = weight for nodes of radius r