Recommendation in Advertising and Social Networks Deepayan Chakrabarti (deepay@yahoo-inc.com)
This presentation • Content Match [KDD 2007]: • How can we estimate the click-through rate (CTR) of an ad on a page? CTR for ad j on page i; ~10^9 pages × ~10^6 ads
This presentation • Estimating CTR for Content Match [KDD ‘07] • Theoretical underpinnings [COLT ‘10 best student paper] • Represent relationships as a graph • Recommendation = Link Prediction • Many useful heuristics exist • Why do these heuristics work? Goal: Suggest friends
Estimating CTR for Content Match • Contextual Advertising • Show an ad on a webpage (“impression”) • Revenue is generated if a user clicks • Problem: Estimate the click-through rate (CTR) of an ad on a page. CTR for ad j on page i; ~10^9 pages × ~10^6 ads
Estimating CTR for Content Match • Why not use the MLE (clicks over impressions, c/N)? • Few (page, ad) pairs have N > 0 • Very few have c > 0 as well • The MLE does not differentiate between 0/10 and 0/100 • We have additional information: hierarchies
Estimating CTR for Content Match • Use an existing, well-understood hierarchy • Categorize ads and webpages to leaves of the hierarchy • CTR estimates of siblings are correlated • The hierarchy allows us to aggregate data • Coarser resolutions provide reliable estimates for rare events, which then influence estimation at finer resolutions
Estimating CTR for Content Match • Region = (page node, ad node) • Region Hierarchy: a cross-product of the page hierarchy and the ad hierarchy [Figure: a level-i region pairing a node of the page hierarchy with a node of the ad hierarchy]
Estimating CTR for Content Match • Our Approach • Data Transformation • Model • Model Fitting
Data Transformation • Problem: the variance of the raw CTR c/N depends on the unknown mean, and all zero-click regions look alike • Solution: Freeman-Tukey transform, e.g. yr = √(cr/Nr) + √((cr+1)/Nr) • Differentiates regions with 0 clicks (through Nr) • Variance stabilization: Var(yr) ≈ 1/Nr, roughly independent of the mean CTR
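As a concrete illustration, a minimal sketch of the transform (the exact variant used in the paper may differ slightly; the form below is a standard Freeman-Tukey-style transform for rates):

```python
import math

def freeman_tukey(clicks, impressions):
    """Freeman-Tukey-style transform of a raw CTR.

    y = sqrt(c/N) + sqrt((c+1)/N)
    Its variance is approximately 1/N regardless of the true CTR,
    and regions with c = 0 are no longer collapsed to a single value.
    """
    c, n = clicks, impressions
    return math.sqrt(c / n) + math.sqrt((c + 1) / n)

# 0 clicks out of 10 vs. 0 clicks out of 100:
# the MLE is 0 for both, but the transformed values differ.
print(freeman_tukey(0, 10))   # ≈ 0.316
print(freeman_tukey(0, 100))  # ≈ 0.100
```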
Model • Goal: Smoothing across siblings in hierarchy [Huang+Cressie/2000] • Each region has a latent state Sr • yr is independent of the hierarchy given Sr • Sr is drawn from its parent Spa(r) [Figure: levels i and i+1 of the tree; latent states Sr at the nodes, observables yr attached below them]
Model [Figure: a parent region pa(r) with latent state Spa(r), state variance Wpa(r), coefficients βpa(r), features upa(r), and observation ypa(r); a child region r with Sr, observation variance Vr, coefficients βr, features ur, and observation yr]
Model • However, learning Wr, Vr and βr for each region is clearly infeasible • Assumptions: • All regions at the same level ℓ share the same W(ℓ) and β(ℓ) • Vr = V/Nr for some constant V, since the Freeman-Tukey transform stabilizes the observation variance to ≈ 1/Nr
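The assumptions above can be collected into a small generative sampler. This is a hypothetical sketch: the function name, the toy hierarchy, and the exact placement of the regression term β·ur in the observation equation are assumptions, not the paper's precise specification.

```python
import random

def sample_tree(tree, u, beta, W, V, N, root=0, seed=0):
    """Sample from a tree-structured Gaussian model (sketch):
        S_root ~ N(0, W)
        S_r    ~ N(S_pa(r), W)                 latent state, drawn from parent
        y_r    ~ N(beta * u_r + S_r, V / N_r)  observed transformed CTR
    `tree` maps each node to a list of its children."""
    rng = random.Random(seed)
    S, y = {}, {}

    def visit(r, parent_mean):
        S[r] = rng.gauss(parent_mean, W ** 0.5)
        y[r] = rng.gauss(beta * u[r] + S[r], (V / N[r]) ** 0.5)
        for c in tree.get(r, []):
            visit(c, S[r])

    visit(root, 0.0)
    return S, y

tree = {0: [1, 2], 1: [3, 4]}                  # toy region hierarchy
u = {r: 1.0 for r in range(5)}                 # one covariate per region
N = {0: 1000, 1: 500, 2: 500, 3: 10, 4: 490}   # impression counts
S, y = sample_tree(tree, u, beta=0.5, W=0.1, V=1.0, N=N)
```

Note how rare regions (small Nr) get noisier observations, which is exactly what the hierarchy-based smoothing has to compensate for.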
Model • Implications: W determines the degree of smoothing • W → ∞: • Sr varies greatly from Spa(r) • Each region learns its own Sr • No smoothing • W → 0: • All Sr are identical • A regression model on features ur is learnt • Maximum smoothing
Model • Implications: W determines the degree of smoothing • Var(Sr) increases from root to leaf • Better estimates at coarser resolutions
Model • Implications: W determines the degree of smoothing • Var(Sr) increases from root to leaf • Correlations among siblings at level ℓ depend only on the level of their least common ancestor: the deeper the least common ancestor, the higher the correlation
Estimating CTR for Content Match • Our Approach • Data Transformation (Freeman-Tukey) • Model (Tree-structured Markov Chain) • Model Fitting
Model Fitting • Fitting using a Kalman filtering algorithm • Filtering: Recursively aggregate data from leaves to root • Smoothing: Propagate information from root to leaves • Complexity: linear in the number of regions, for both time and space
Model Fitting • Fitting using a Kalman filtering algorithm • Filtering: Recursively aggregate data from leaves to root • Smoothing: Propagate information from root to leaves • The Kalman filter requires knowledge of β, V, and W • EM wrapped around the Kalman filter
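The leaves-to-root / root-to-leaves idea can be sketched as exact Gaussian message passing on a tree with scalar states. This is a simplified stand-in for the paper's Kalman recursions (no covariates, no per-level parameters, no EM); all names and the toy example are hypothetical.

```python
def fuse(msgs):
    """Precision-weighted fusion of Gaussian messages [(mean, var), ...]."""
    prec = sum(1.0 / v for _, v in msgs)
    mean = sum(m / v for m, v in msgs) / prec
    return mean, 1.0 / prec

def tree_smooth(tree, y, V, W, root, prior=(0.0, 10.0)):
    """Two-pass inference for: S_root ~ N(prior), S_r ~ N(S_pa(r), W),
    y_r ~ N(S_r, V[r]). Unobserved regions are simply absent from `y`.
    Returns the posterior (mean, var) of S_r for every node."""
    up = {}                                   # upward message about S_pa(r)

    def upward(r):                            # filtering: leaves -> root
        msgs = [(y[r], V[r])] if r in y else []
        for c in tree.get(r, []):
            upward(c)
            if c in up:
                msgs.append(up[c])
        if msgs:
            m, p = fuse(msgs)
            up[r] = (m, p + W)                # add transition variance
    upward(root)

    post = {}

    def downward(r, down):                    # smoothing: root -> leaves
        local = [down] + ([(y[r], V[r])] if r in y else [])
        child_msgs = {c: up[c] for c in tree.get(r, []) if c in up}
        post[r] = fuse(local + list(child_msgs.values()))
        for c in tree.get(r, []):
            # message to child c excludes c's own upward contribution
            m, p = fuse(local + [msg for k, msg in child_msgs.items() if k != c])
            downward(c, (m, p + W))
    downward(root, prior)
    return post

# Toy example: a root with two leaf regions; only leaf 1 is observed.
post = tree_smooth({0: [1, 2]}, {1: 1.0}, {1: 0.1}, W=1.0, root=0)
# Leaf 2 has no data, yet borrows strength from its sibling via the root.
```

Each node is visited a constant number of times, matching the slide's claim of time and space linear in the number of regions.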
Experiments • 503M impressions • 7-level hierarchy, of which the top 3 levels were used • Zero clicks in • 76% of regions at level 2 • 95% of regions at level 3 • Full dataset DFULL, and a 2/3 sample DSAMPLE
Experiments • Estimate CTRs for all regions R in level 3 with zero clicks in DSAMPLE • Some of these regions (call them R>0) get clicks in DFULL • A good model should predict higher CTRs for R>0 than for the other regions in R
Experiments • We compared 4 models • TS: our tree-structured model • LM (level-mean): each level smoothed independently • NS (no smoothing): CTR proportional to 1/Nr • Random: assuming |R>0| is given, randomly guess which regions of R belong to R>0
Experiments [Figure: experimental comparison of TS against Random, LM, and NS]
Experiments • MLE = 0 everywhere, since 0 clicks were observed • What about estimated CTR? [Figure: estimated CTR vs. impressions; under No Smoothing (NS) estimates stay close to the MLE for large N, while Our Model (TS) inherits variability from coarser resolutions]
Estimating CTR for Content Match • We presented a method to estimate • rates of extremely rare events • at multiple resolutions • under severe sparsity constraints • Key points: • Tree-structured generative model • Extremely fast parameter fitting
Theoretical underpinnings • Estimating CTR for Content Match [KDD ‘07] • Theoretical underpinnings of link prediction [COLT ‘10 best student paper]
Link Prediction • Which pair of nodes {i, j} should be connected? Goal: Recommend a movie [Figure: bipartite graph of users (Alice, Bob, Charlie) and movies]
Link Prediction • Which pair of nodes {i, j} should be connected? Goal: Suggest friends
Link Prediction Heuristics • Predict link between nodes • Connected by the shortest path • With the most common neighbors (length-2 paths) • More weight to low-degree common nbrs (Adamic/Adar) [Figure: a prolific common friend, Alice with 1000 followers, is weaker evidence for a Bob-Charlie link than a less prolific common friend with 3 followers]
Link Prediction Heuristics • Predict link between nodes • Connected by the shortest path • With the most common neighbors (length-2 paths) • More weight to low-degree common nbrs (Adamic/Adar) • With more short paths (e.g. length-3 paths) • exponentially decaying weights to longer paths (Katz measure) • …
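For concreteness, the two most common heuristics above can be computed directly from adjacency sets. The toy graph below is made up for illustration:

```python
import math

def common_neighbors(adj, i, j):
    """Number of length-2 paths between i and j."""
    return len(adj[i] & adj[j])

def adamic_adar(adj, i, j):
    """Adamic/Adar: common neighbors weighted by 1/log(degree),
    so low-degree (less prolific) common neighbors count more.
    Assumes every common neighbor has degree >= 2."""
    return sum(1.0 / math.log(len(adj[z])) for z in adj[i] & adj[j])

# Toy undirected graph as adjacency sets (hypothetical example).
adj = {
    "alice": {"hub", "carol"},
    "bob":   {"hub", "carol"},
    "hub":   {"alice", "bob", "c1", "c2", "c3", "c4", "c5", "c6"},
    "carol": {"alice", "bob"},
    "c1": {"hub"}, "c2": {"hub"}, "c3": {"hub"},
    "c4": {"hub"}, "c5": {"hub"}, "c6": {"hub"},
}
# alice and bob share two common neighbors, but the low-degree one
# (carol, degree 2) contributes more evidence than the hub (degree 8).
```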
Previous Empirical Studies* • Observed ranking of link prediction accuracy, especially if the graph is sparse: Random < Shortest Path < Common Neighbors < Adamic/Adar < Ensemble of short paths • How do we justify these observations? *Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007
Link Prediction – Generative Model • Model: • Nodes are uniformly distributed points in a unit-volume latent space • This space has a distance metric • Points close to each other are likely to be connected in the graph • Logistic distance function (Raftery+/2002)
Link Prediction – Generative Model α determines the steepness 1 ½ radius r Model: Nodes are uniformly distributed points in a latent space This space has a distance metric Points close to each other are likely to be connected in the graph Higher probability of linking • Link prediction ≈ find nearest neighbor who is not currently linked to the node. • Equivalent to inferring distances in the latent space
Common Neighbors • Pr2(i, j) = Pr(common neighbor | dij): a product of two logistic probabilities, integrated over a volume determined by dij • As α → ∞, the logistic becomes a step function: much easier to analyze!
Common Neighbors • Everyone has the same radius r, in a unit-volume universe • η = number of common neighbors of i and j • V(r) = volume of a radius-r ball in D dimensions • The number of common neighbors gives a bound on the distance dij
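The "bound on distance" argument can be sketched as follows (a simplified version; the paper's statement carries the precise constants):

```latex
% Each of the N - 2 other nodes is a common neighbor of i and j
% independently with probability A(r, r, d_{ij}), the volume of the
% intersection of two radius-r balls whose centers are d_{ij} apart:
\eta \sim \mathrm{Bin}\!\left(N-2,\; A(r, r, d_{ij})\right)
% By a Hoeffding bound, with probability at least 1 - \delta:
\left|\frac{\eta}{N-2} - A(r, r, d_{ij})\right| \le \epsilon,
\qquad \epsilon = \sqrt{\tfrac{1}{2(N-2)}\log\tfrac{2}{\delta}}
% Since A(r, r, d) is monotonically decreasing in d, the empirical
% count \eta can be inverted to bracket the unknown distance d_{ij}.
```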
Common Neighbors • OPT = node closest to i • MAX = node with max common neighbors with i • Theorem: w.h.p. dOPT ≤ dMAX ≤ dOPT + 2[ε/V(1)]^(1/D) • Link prediction by common neighbors is asymptotically optimal
Common Neighbors: Distinct Radii • Node k has radius rk • i→k if dik ≤ rk (directed graph) • rk captures the popularity of node k • Two types of common neighbor k of i and j: • Type 1: k→i and k→j, governed by the intersection volume A(ri, rj, dij) • Type 2: i→k and j→k, governed by A(rk, rk, dij)
Type 2 common neighbors • Example graph: • N1 nodes of radius r1 and N2 nodes of radius r2, with r1 << r2 • η1 ~ Bin[N1, A(r1, r1, dij)], η2 ~ Bin[N2, A(r2, r2, dij)] • Pick d* to maximize Pr[η1, η2 | dij]; at the maximizer, w(r1)E[η1|d*] + w(r2)E[η2|d*] = w(r1)η1 + w(r2)η2 • A weighted count of common neighbors, inversely related to d*
Type 2 common neighbors • The weight w(r) behaves differently across radii: • Small r: presence of a common neighbor is very informative • r close to the max radius: absence is very informative, and w(r) ≈ 1/r • Real-world graphs generally fall in this range, justifying Adamic/Adar's downweighting of prolific common neighbors
ℓ-hop Paths • Common neighbors = 2-hop paths • For longer paths: • Bounds are weaker • For ℓ' ≥ ℓ we need ηℓ' >> ηℓ to obtain similar bounds • This justifies the exponentially decaying weight given to longer paths by the Katz measure
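A truncated version of the Katz measure is easy to sketch. This is a pure-Python illustration with hypothetical names; real implementations would use sparse matrices:

```python
def katz_scores(adj_matrix, beta=0.05, max_len=6):
    """Truncated Katz measure: sum over path lengths l = 1..max_len of
    beta^l * (A^l)_ij, so longer paths get exponentially smaller weight.
    `adj_matrix` is a small dense 0/1 matrix as a list of lists."""
    n = len(adj_matrix)

    def matmul(X, Y):
        return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
                for i in range(n)]

    score = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in adj_matrix]   # A^1
    weight = beta
    for _ in range(max_len):
        for i in range(n):
            for j in range(n):
                score[i][j] += weight * power[i][j]
        power = matmul(power, adj_matrix)    # A^(l+1)
        weight *= beta                       # beta^(l+1)
    return score

# Path graph 0-1-2-3: no edge (0, 2), but short paths give the pair
# (0, 2) a higher Katz score than the more distant pair (0, 3).
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
scores = katz_scores(A)
```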
Summary • Three key ingredients: • Closer points are likelier to be linked (Small World Model: Watts & Strogatz 1998, Kleinberg 2001) • The triangle inequality holds (necessary to extend to ℓ-hop paths) • Points are spread uniformly at random (otherwise properties will depend on location as well as distance)
Summary • For large dense graphs, common neighbors are enough • Differentiating between different degrees is important • In sparse graphs, paths of length 3 or more help in prediction: the number of paths matters, not the length [Figure: accuracy ranking — Random, Shortest Path, Common Neighbors, Adamic/Adar, Ensemble of short paths] *Liben-Nowell & Kleinberg, 2003; Brand, 2005; Sarkar & Moore, 2007
Conclusions • Discussed two problems • Estimating CTR for Content Match • Combat sparsity by hierarchical smoothing • Theoretical underpinnings • Latent space model • Link prediction ≈ finding nearest neighbors in this space
Other Work • Computational Advertising • Combining IR with click feedback [WWW ‘08] • Multi-armed bandits using hierarchies [SDM ‘07, ICML ‘07] • “Mortal” multi-armed bandits [NIPS ‘08] • Traffic Shaping [EC ‘12] • Web Search • Finding Quicklinks [WWW ‘09] • Titles for Quicklinks [KDD ‘08] • Incorporating tweets into search results [ICWSM ‘11] • Website clustering [WWW ‘10] • Webpage segmentation [WWW ‘08] • Template detection [WWW ‘07] • Finding hidden query aspects [KDD ’09] • Graph Mining • Epidemic thresholds [SRDS ‘03, Infocom ‘07] • Non-parametric prediction in dynamic graphs • Graph sampling [ICML ‘11] • Graph generation models [SDM ‘04, PKDD ‘05, JMLR ‘10] • Community detection [KDD ‘04, PKDD ‘04]
Advertising Setting • Sponsored Search (text ads) • Display • Content Match: match ads to the content of the page
Common Neighbors: Distinct Radii • Node k has radius rk • i→k if dik ≤ rk (directed graph) • rk captures the popularity of node k • “Weighted” common neighbors: • Predict the (i, j) pairs with the highest Σ w(r) η(r), where η(r) = number of common neighbors of radius r and w(r) = weight for nodes of radius r