1 / 52

Recommendation in Advertising and Social Networks

Recommendation in Advertising and Social Networks. Deepayan Chakrabarti (deepay@yahoo-inc.com). This presentation. Content Match [KDD 2007] : How can we estimate the click-through rate (CTR) of an ad on a page?. CTR for ad j on page i. ~10 9 pages. ~10 6 ads. This presentation.

rex
Download Presentation

Recommendation in Advertising and Social Networks

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Recommendation in Advertising and Social Networks DeepayanChakrabarti(deepay@yahoo-inc.com)

  2. This presentation • Content Match [KDD 2007]: • How can we estimate the click-through rate (CTR) of an ad on a page? CTR for ad j on page i ~109 pages ~106 ads

  3. This presentation • Estimating CTR for Content Match [KDD ‘07] • Theoretical underpinnings[COLT ‘10 best student paper] • Represent relationships as a graph • Recommendation = Link Prediction • Many useful heuristics exist • Why do these heuristics work? Goal: Suggest friends

  4. Estimating CTR for Content Match • Contextual Advertising • Show an ad on a webpage (“impression”) • Revenue is generated if a user clicks • Problem: Estimate the click-through rate (CTR) of an ad on a page CTR for ad j on page i ~109 pages ~106 ads

  5. Estimating CTR for Content Match • Why not use the MLE? • Few (page, ad) pairs have N>0 • Very few have c>0 as well • MLE does not differentiate between 0/10 and 0/100 • We have additional information: hierarchies

  6. Estimating CTR for Content Match • Use an existing, well-understood hierarchy • Categorize ads and webpages to leaves of the hierarchy • CTR estimates of siblings are correlated • The hierarchy allows us to aggregate data • Coarser resolutions • provide reliable estimates for rare events • which then influences estimation at finer resolutions

  7. Estimating CTR for Content Match • Region= (page node, ad node) • Region Hierarchy • A cross-product of the page hierarchy and the ad hierarchy Level i Region Page hierarchy Ad hierarchy

  8. Estimating CTR for Content Match Level 0 • Region= (page node, ad node) • Region Hierarchy • A cross-product of the page hierarchy and the ad hierarchy Level i Page hierarchy Ad hierarchy

  9. Estimating CTR for Content Match • Our Approach • Data Transformation • Model • Model Fitting

  10. Data Transformation • Problem: • Solution: Freeman-Tukey transform • Differentiates regions with 0 clicks • Variance stabilization:

  11. Model • Goal: Smoothing across siblings in hierarchy[Huang+Cressie/2000] Level i Each region has a latent state Sr yr is independent of the hierarchy given Sr Sr is drawn from its parent Spa(r) Sparent latent S3 S1 S4 Level i+1 S2 y1 y2 y4 observable 11

  12. Model wpa(r) Spa(r) variance wr Vpa(r) βpa(r) ypa(r) upa(r) Sr variance Vr coeff. βr ur yr 12

  13. Model • However, learning Wr, Vr and βrfor each region is clearly infeasible • Assumptions: • All regions at the same level ℓ sharethe same W(ℓ) and β(ℓ) • Vr = V/Nr for some constant V, since wr Spa(r) Sr Vr βr yr ur

  14. Model • Implications: • determines degree of smoothing • : • Sr varies greatly from Spa(r) • Each region learns its own Sr • No smoothing • : • All Sr are identical • A regression model on features ur is learnt • Maximum Smoothing wr Spa(r) Sr Vr βr yr ur

  15. Model • Implications: • determines degree of smoothing • Var(Sr) increases from root to leaf • Better estimates at coarser resolutions wr Spa(r) Sr Vr βr yr ur

  16. Model • Implications: • determines degree of smoothing • Var(Sr) increases from root to leaf • Correlations among siblings atlevel ℓ: • Depends only on level of least commonancestor wr Spa(r) Sr Vr βr ) yr ur ) > Corr( , Corr( ,

  17. Estimating CTR for Content Match • Our Approach • Data Transformation (Freeman-Tukey) • Model (Tree-structured Markov Chain) • Model Fitting

  18. Model Fitting • Fitting using a Kalman filtering algorithm • Filtering: Recursively aggregate data from leaves to root • Smoothing: Propagate information from root to leaves • Complexity: linear in the number of regions, for both time and space filtering smoothing

  19. Model Fitting • Fitting using a Kalman filtering algorithm • Filtering: Recursively aggregate data from leaves to root • Smoothing: Propagates information from root to leaves • Kalman filter requires knowledge of β, V, and W • EM wrapped around the Kalman filter filtering smoothing

  20. Experiments • 503M impressions • 7-level hierarchy of which the top 3 levels were used • Zero clicks in • 76% regions in level 2 • 95% regions in level 3 • Full dataset DFULL, and a 2/3 sample DSAMPLE

  21. Experiments • Estimate CTRs for all regions R in level 3 with zero clicks in DSAMPLE • Some of these regions R>0 get clicks in DFULL • A good model should predict higher CTRs for R>0 as against the other regions in R

  22. Experiments • We compared 4 models • TS: our tree-structured model • LM (level-mean): each level smoothed independently • NS (no smoothing): CTR proportional to 1/Nr • Random: Assuming |R>0| is given, randomly predict the membership of R>0 out of R

  23. Experiments TS Random LM, NS

  24. Experiments • MLE=0 everywhere, since 0 clicks were observed • What about estimated CTR? Variability from coarser resolutions Close to MLE for large N Estimated CTR Estimated CTR Impressions Impressions No Smoothing (NS) Our Model (TS)

  25. Estimating CTR for Content Match • We presented a method to estimate • rates of extremely rare events • at multiple resolutions • under severe sparsity constraints • Key points: • Tree-structured generative model • Extremely fast parameter fitting

  26. Theoretical underpinnings • Estimating CTR for Content Match [KDD ‘07] • Theoretical underpinnings of link prediction [COLT ‘10 best student paper]

  27. Link Prediction • Which pair of nodes {i,j} shouldbe connected? Alice Bob Charlie Goal: Recommend a movie

  28. Link Prediction • Which pair of nodes {i,j} shouldbe connected? Goal: Suggest friends

  29. Link Prediction Heuristics • Predict link between nodes • Connected by the shortest path • With the most common neighbors (length 2 paths) • More weight to low-degree common nbrs (Adamic/Adar) Prolific common friends Less evidence Alice Less prolific Much more evidence 1000 followers Bob Charlie 3followers

  30. Link Prediction Heuristics • Predict link between nodes • Connected by the shortest path • With the most common neighbors (length 2 paths) • More weight to low-degree common nbrs (Adamic/Adar) • With more short paths (e.g. length 3 paths ) • exponentially decaying weights to longer paths (Katz measure) • …

  31. Previous Empirical Studies* Especially if the graph is sparse How do we justify these observations? Link prediction accuracy* Random Shortest Path Common Neighbors Adamic/Adar Ensemble of short paths *Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007

  32. Link Prediction – Generative Model Unit volume universe Model: • Nodes are uniformly distributed points in a latent space • This space has a distance metric • Points close to each other are likely to be connected in the graph • Logistic distance function (Raftery+/2002)

  33. Link Prediction – Generative Model α determines the steepness 1 ½ radius r Model: Nodes are uniformly distributed points in a latent space This space has a distance metric Points close to each other are likely to be connected in the graph Higher probability of linking • Link prediction ≈ find nearest neighbor who is not currently linked to the node. • Equivalent to inferring distances in the latent space

  34. Previous Empirical Studies* Especially if the graph is sparse Link prediction accuracy* Random Shortest Path Common Neighbors Adamic/Adar Ensemble of short paths *Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007

  35. Common Neighbors • Pr2(i,j) = Pr(common neighbor|dij) j i Product of two logistic probabilities, integrated over a volume determined by dij Asα∞Logistic  Step function Much easier to analyze!

  36. Common Neighbors Everyone has same radius r Unit volume universe j i η=Number of common neighbors V(r)=volume of radius r in D dims # common nbrs gives a bound on distance

  37. Common Neighbors • OPT = node closest to i • MAX = node with max common neighbors with i • Theorem: w.h.p dOPT ≤ dMAX≤ dOPT + 2[ε/V(1)]1/D Link prediction by common neighbors is asymptotically optimal

  38. Common Neighbors: Distinct Radii j k • Node k has radius rk . • ik if dik ≤ rk (Directed graph) • rk captures popularity of node k i rk i j m k j k i Type 2: i k  j Type 1: i k  j rj rk rk ri A(ri, rj,dij) A(rk , rk,dij)

  39. Type 2 common neighbors • Example graph: • N1 nodes of radius r1 and N2 nodes of radius r2 • r1 << r2 k i j η2 ~ Bin[N2 , A(r2, r2, dij)] η1 ~ Bin[N1 , A(r1, r1, dij)] Pick d* to maximize Pr[η1 , η2 | dij]  w(r1)E[η1|d*] + w(r2) E[η2|d*] = w(r1)η1 + w(r2) η2 Weighted common neighbors Inversely related to d*

  40. Type 2 common neighbors j k i rk Adamic/Adar Presence of common neighbor is very informative Absence is very informative 1/r r is close to max radius Real world graphs generally fall in this range

  41. Previous Empirical Studies* Especially if the graph is sparse Link prediction accuracy* Random Shortest Path Common Neighbors Adamic/Adar Ensemble of short paths *Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007

  42. ℓ-hop Paths • Common neighbors = 2 hop paths • For longer paths: • Bounds are weaker • For ℓ’ ≥ℓwe need ηℓ’ >> ηℓto obtain similar bounds •  justifies the exponentially decaying weight given to longer paths by the Katz measure

  43. Summary • Three key ingredients • Closer points are likelier to be linked. Small World Model- Watts, Strogatz, 1998, Kleinberg 2001 • Triangle inequality holds necessary to extend to ℓ-hop paths • Points are spread uniformly at random  Otherwise properties will depend on location as well as distance

  44. Summary In sparse graphs, length 3 or more paths help in prediction. Differentiating between different degrees is important For large dense graphs, common neighbors are enough Link prediction accuracy* The number of paths matters, not the length Random Shortest Path Common Neighbors Adamic/Adar Ensemble of short paths *Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007

  45. Conclusions • Discussed two problems • Estimating CTR for Content Match • Combat sparsity by hierarchical smoothing • Theoretical underpinnings • Latent space model • Link prediction ≈ finding nearest neighbors in this space

  46. Other Work • Computational Advertising • Combining IR with click feedback [WWW ‘08] • Multi-armed bandits using hierarchies [SDM ‘07, ICML ‘07] • “Mortal” multi-armed bandits [NIPS ‘08] • Traffic Shaping [EC ‘12] • Web Search • Finding Quicklinks[WWW ‘09] • Titles for Quicklinks[KDD ‘08] • Incorporating tweets into search results [ICWSM ‘11] • Website clustering [WWW ‘10] • Webpage segmentation [WWW ‘08] • Template detection [WWW ‘07] • Finding hidden query aspects [KDD ’09] • Graph Mining • Epidemic thresholds [SRDS ‘03, Infocom ‘07] • Non-parametric prediction in dynamic graphs • Graph sampling [ICML ‘11] • Graph generation models [SDM ‘04, PKDD ‘05, JMLR ‘10] • Community detection [KDD ‘04, PKDD ‘04]

  47. Advertising Setting Sponsored Search Display Content Match Content match ad

  48. Advertising Setting Sponsored Search Display Content Match Text ads Pick ads Match ads to the content

  49. Common Neighbors: Distinct Radii j k • Node k has radius rk . • ik if dik ≤ rk (Directed graph) • rk captures popularity of node k • “Weighted” common neighbors: • Predict (i,j) pairs with highest Σ w(r)η(r) i rk m # common neighbors of radius r Weight for nodes of radius r

  50. Common Neighbors: Distinct Radii j k • Node k has radius rk . • ik if dik ≤ rk (Directed graph) • rk captures popularity of node k • “Weighted” common neighbors: • Predict (i,j) pairs with highest Σ w(r)η(r) i m rk # common neighbors of radius r Weight for nodes of radius r

More Related