Contextual Advertising by Combining Relevance with Click Feedback Deepak Agarwal Joint work with Deepayan Chakrabarti & Vanja Josifovski Yahoo! Research WWW'08, Beijing, China, 24th April 2008
Outline • Motivating Application, Challenges • Contextual Advertising • Semantic versus Predictive models • Pros, Cons • Our Approach: Blend Semantic with Predictive • Model Description • Logistic Regression, Feature Selection • Model structure amenable to fast scoring at run time • Experimental Results • Ongoing work
Outline 1 Motivating Application, Background and Challenges
Motivating Application • Problem: Match ads to queries • Sponsored Search: • The query is a short piece of text input by the user • User intent better expressed; less noisy • Contextual Advertising: • The query is a webpage • Generally long, noisy, user intent less clear • Harder matching problem
Challenges • Serve ads that maximize revenue (CTR) • Serve the most relevant ads in a given context • User feedback arrives as clicks in different contexts • Automation is a must for profitability • Billions of opportunities; millions of ads • High volume, low marginal cost → lucrative business • Automation through algorithms/models • Accuracy: massive data; scalable procedures • Structure of models: scoring ads under strict latency requirements (~few ms)
Classical Approach: Semantic • Serve shoe ads on shoe pages • Models: information retrieval • Get relevant docs (ads) for a query (webpage) • Simple vector space model • q = (t1,w1; …; tn,wn); a = (s1,v1; …; sm,vm) • cos(q,a) = Σ_{t ∈ q ∩ a} w_t v_t / (|q| |a|) • Weights w, v: tf-idf • Frequency: rewarded in the doc, penalized in the corpus • Higher score → more relevance
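A minimal sketch of this vector-space match in Python, assuming sparse tf-idf dictionaries for the page (the "query") and the ad; the names and toy values are illustrative, not from the talk:

```python
import math

def cosine(q, a):
    """Cosine similarity between two sparse tf-idf vectors.

    q, a: dicts mapping term -> tf-idf weight.
    """
    shared = q.keys() & a.keys()
    dot = sum(q[t] * a[t] for t in shared)
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    if norm_q == 0 or norm_a == 0:
        return 0.0
    return dot / (norm_q * norm_a)

# Example: a "shoe" page scores higher against a shoe ad than a watch ad.
page = {"shoe": 0.8, "running": 0.5, "marathon": 0.3}
shoe_ad = {"shoe": 0.9, "discount": 0.4}
watch_ad = {"rolex": 0.9, "discount": 0.4}
print(cosine(page, shoe_ad), cosine(page, watch_ad))  # positive, then 0.0
```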
Semantic: Pros & Cons • Pros • Training: simple, scalable • Vocabulary (stop-words, stemming); corpus • Low-latency serving: evaluates millions of candidate ads in a few ms • Clever algorithms (Broder et al.) • Cons • Does not always capture context • Clicks: active user feedback, potentially better; can we use it?
Predictive Approach: Clicks • New, challenging research area • Learn from historic clicks on ads • Clicks indicate overall relevance • Rank ads by CTR = P(click | ad, context) • Estimating CTR is a difficult statistical problem • High dimensionality, sparseness (too many combinations) • (Page, Ad) → (page features, ad features) • Bias-variance tradeoff when selecting features • Coarse features are stable but less precise; fine features have high variance
Statistical Challenges (contd.) • Retrospective data are biased • If I never showed ads with the word "Rolex" on pages with the word "Golf", how will I learn this match? • What is irrelevant? Labeling negatives is hard • Some users never click on ads no matter what • Good models may be complex • Scalability while training (grid computing helps) • Serving: not all models are index friendly • Quick evaluation at serve time improves the system
When Semantic meets Predictive • Semantic provides domain knowledge • Feature selection driven by semantic knowledge • Predictive "enhances" semantic • "Correction" terms adjust the semantic score to match click feedback • Fall back on semantic when the click signal is weak • Model is scalable to train (grid computing) • Fast to evaluate at run time • Faster → more candidates evaluated at serve time • Accuracy versus coverage
Outline 2 Modeling Approach
Predictive Regression model • Region-specific splitting for page and ad • Page "regions": title, headers, boldface text, metadata, etc. • Ad "regions": title, body, etc. • Features: words, phrases, classes in different regions • Word matches in the title are more important than in the body • Illustration here: word features, title regions • Extension to multiple regions and multiple feature types is routine • Experiments to appear in a future version
Logistic Regression: Word features • Model clicks/non-clicks with logistic regression • Training & test data: events with clicks only • y_ij ~ Ber(p_ij), where p_ij is the CTR of ad j on page i • logit(p_ij) = μ + Σ_w φ_w M_{p,w} + Σ_w ψ_w M_{a,w} + Σ_w δ_w I_{p,a,w} • φ_w: main effect for a page word (overall popularity) • ψ_w: main effect for an ad word (overall popularity) • δ_w: interaction effect (words shared by page and ad) • Gaussian priors on the model parameters penalize sparse features
Feature weights "correct" relevance • M_{p,w} = tf_{p,w} · 1(w ∈ p) • M_{a,w} = tf_{a,w} · 1(w ∈ a) • I_{p,a,w} = tf_{p,w} · tf_{a,w} · 1(w ∈ p) · 1(w ∈ a) • So the IR-based term-frequency measures are taken into account
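Putting the two slides together, here is a small sketch of how this model scores a (page, ad) pair at prediction time, using the notation above (μ, φ, ψ, δ); the function and variable names are illustrative, not from the talk:

```python
import math

def predicted_ctr(page_tf, ad_tf, mu, phi, psi, delta):
    """logit(p_ij) = mu + sum_w phi_w*M_pw + sum_w psi_w*M_aw + sum_w delta_w*I_paw.

    page_tf, ad_tf: dicts word -> term frequency (tf).
    phi, psi, delta: dicts word -> learned coefficient; words absent from a
    dict contribute nothing (they were not selected as features).
    """
    z = mu
    for w, tf_p in page_tf.items():
        z += phi.get(w, 0.0) * tf_p                      # page main effect M_pw
    for w, tf_a in ad_tf.items():
        z += psi.get(w, 0.0) * tf_a                      # ad main effect M_aw
    for w in page_tf.keys() & ad_tf.keys():
        z += delta.get(w, 0.0) * page_tf[w] * ad_tf[w]   # interaction I_paw
    return 1.0 / (1.0 + math.exp(-z))                    # predicted CTR
```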
How to select words? • Nearly 110k distinct words in our training data, after stop-word removal and stemming • Learning parameters for every word would be expensive and would overfit • We use simple feature selection strategies: select the top-k words
Word Selection: data based • Define an interaction measure for each word • Higher values for words with higher-than-expected CTR when they occur on both the page and the ad • Remove words served or clicked only a few times, for robustness
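One plausible instantiation of such a measure, assuming per-word view/click counts over events where the word appears on both the page and the ad; the exact formula and thresholds in the paper may differ:

```python
def interaction_measure(stats, global_ctr, min_views=1000, min_clicks=5):
    """stats: dict word -> (views, clicks), counted only over events where
    the word occurs on BOTH the page and the ad. Returns word -> CTR lift."""
    scores = {}
    for w, (views, clicks) in stats.items():
        if views < min_views or clicks < min_clicks:
            continue  # drop rarely served or rarely clicked words
        scores[w] = (clicks / views) / global_ctr  # >1: higher-than-expected CTR
    return scores

# e.g. keep the top 1000: sorted(scores, key=scores.get, reverse=True)[:1000]
```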
Word selection (contd.) • Word selection: relevance based • Average tf-idf score of each word, computed over pages and over ads • Higher values imply higher relevance • Variant 1: rank by the geometric mean of the page-side and ad-side tf-idf scores • Variant 2: rank by page tf-idf and by ad tf-idf separately; take the union
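A sketch of both relevance-based variants, assuming precomputed average tf-idf scores per word on the page side and the ad side (names are illustrative):

```python
import math

def relevance_words(page_tfidf, ad_tfidf, k=1000):
    """page_tfidf, ad_tfidf: dicts word -> average tf-idf score."""
    common = page_tfidf.keys() & ad_tfidf.keys()
    # Variant 1: geometric mean of page-side and ad-side scores.
    gm = {w: math.sqrt(page_tfidf[w] * ad_tfidf[w]) for w in common}
    by_gm = sorted(gm, key=gm.get, reverse=True)[:k]
    # Variant 2: top-k on each side separately, then take the union.
    top_page = sorted(page_tfidf, key=page_tfidf.get, reverse=True)[:k]
    top_ad = sorted(ad_tfidf, key=ad_tfidf.get, reverse=True)[:k]
    by_union = set(top_page) | set(top_ad)
    return by_gm, by_union
```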
Best Word Selection scheme • Two methods: data based and relevance based • We picked the top 1000 words by each measure • Data-based methods give better results • [Figure: precision-recall curves comparing the two selection schemes]
Semantic similarity score • Word features have low coverage; fall back to semantic similarity • How to map the cosine score onto the logit scale? • Create score bins, 100 points per bin • Plot mean cosine score vs. logit(CTR) per bin • The relationship is quadratic • [Figure: logit(p_ij) vs. binned cosine score, showing a quadratic trend]
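A sketch of that binning diagnostic, assuming NumPy arrays of per-event cosine scores and 0/1 click indicators; the bin size of 100 follows the slide, everything else is illustrative:

```python
import numpy as np

def logit_vs_cosine_bins(cos, clicks, bin_size=100):
    """Sort events by cosine score, bin them, and return per-bin
    (mean cosine, logit of empirical CTR) pairs for inspection."""
    order = np.argsort(cos)
    xs, ys = [], []
    for i in range(0, len(order) - bin_size + 1, bin_size):
        idx = order[i:i + bin_size]
        ctr = clicks[idx].mean()
        if 0 < ctr < 1:  # logit is undefined at 0 or 1
            xs.append(cos[idx].mean())
            ys.append(np.log(ctr / (1 - ctr)))
    return np.array(xs), np.array(ys)
```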
Incorporating similarity • The quadratic relationship is used in two ways • 1. Add cosine and cosine² as features • 2. Add the fitted quadratic as an offset: prior log-odds • Both give similar results
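A minimal illustration of the two variants; the coefficients a, b, c and the division of labor between them are assumptions about how the quadratic enters the model:

```python
import numpy as np

# Variant 1: cosine and cosine^2 enter as ordinary features, with
# coefficients estimated jointly with the word features.
def similarity_features(cos):
    return np.column_stack([cos, cos ** 2])

# Variant 2: a previously fitted quadratic a + b*cos + c*cos^2 is frozen
# and added as an offset (prior log-odds), so
# logit(p) = offset + word-feature terms.
def similarity_offset(cos, a, b, c):
    return a + b * cos + c * cos ** 2
```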
Scalable Training • Fast implementation • Training: Hadoop implementation of logistic regression • Pipeline: random data splits → per-split iterative Newton-Raphson fits → per-split mean and variance estimates → combine the estimates into the learned model parameters
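A toy sketch of the combine step, assuming each split's fit returns coefficient means and variances and that combination is by inverse-variance weighting; the talk does not spell out the exact rule, so treat this as an assumption:

```python
import numpy as np

def combine(split_means, split_vars):
    """Inverse-variance-weighted combination of per-split estimates.

    split_means, split_vars: lists of 1-D arrays, one pair per data split,
    e.g. produced by independent Newton-Raphson fits on random splits.
    """
    means = np.stack(split_means)            # shape (n_splits, n_params)
    precisions = 1.0 / np.stack(split_vars)
    combined = (means * precisions).sum(axis=0) / precisions.sum(axis=0)
    combined_var = 1.0 / precisions.sum(axis=0)
    return combined, combined_var
```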
Outline 3 Fast Evaluation at Serve Time
Efficient Score Evaluation • Problem: for each page visit, select the top-n ads by the scoring formula • Why it is hard: only a few ms; too many ads to evaluate • Rich IR literature on this problem • Efficient solutions for vector space models through "posting lists" • <term, sorted list of doc IDs containing the term> • The interaction terms in our regression model were motivated by this • Document-at-a-time (DAAT) strategy • Posting lists: sorted doc IDs for each query term • Evaluates each doc containing at least one query term, one at a time • Stops early once it is clear a doc cannot make the top n • The system is sparse, with few correlations; efficiency comes through approximations
Efficient evaluation through a two-stage procedure (Broder et al.) • Maintain a heap of the current top-n results; θ = minimum score in the heap • For each term t, precompute an upper bound U_t on its score contribution • Approximate first-stage check: x1·U1 + x2·U2 + x3·U3 + x4·U4 > θ • Example: with the current doc matching terms 1, 2, 3, it is fully evaluated only if U1 + U2 + U3 > θ; docs matching only terms 1 and 2 with U1 + U2 ≤ θ are skipped • The WAND iterator traverses posting lists very efficiently by skipping unnecessary docs • Efficiency depends on the tightness of the per-term upper bounds
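A minimal WAND-style sketch in the spirit of Broder et al.; this is illustrative, not the production implementation from the talk:

```python
import heapq

def wand_top_n(postings, bounds, score_fn, n):
    """postings: dict term -> sorted list of doc IDs containing the term.
    bounds:   dict term -> upper bound U_t on the term's score contribution.
    score_fn: full (expensive) scorer, called only on surviving candidates.
    """
    cursors = {t: 0 for t in postings}          # current position per list
    heap = []                                   # min-heap of (score, doc_id)
    theta = float("-inf")                       # min score in current top-n
    while True:
        live = sorted((postings[t][cursors[t]], t) for t in postings
                      if cursors[t] < len(postings[t]))
        if not live:
            break
        # Pivot: first doc whose prefix of upper bounds can beat theta.
        acc, pivot = 0.0, None
        for doc, t in live:
            acc += bounds[t]
            if acc > theta:
                pivot = doc
                break
        if pivot is None:
            break                               # nothing left can beat theta
        if live[0][0] == pivot:
            s = score_fn(pivot)                 # full evaluation of the pivot
            if len(heap) < n:
                heapq.heappush(heap, (s, pivot))
            elif s > heap[0][0]:
                heapq.heapreplace(heap, (s, pivot))
            if len(heap) == n:
                theta = heap[0][0]
            target = pivot + 1                  # move everyone past the pivot
        else:
            target = pivot                      # docs before the pivot lose
        for doc, t in live:
            lst, i = postings[t], cursors[t]
            while i < len(lst) and lst[i] < target:
                i += 1                          # skip unnecessary docs
            cursors[t] = i
    return sorted(heap, reverse=True)
```

Skipping every doc before the pivot is safe because such docs can only occur in the prefix terms, whose summed upper bounds are at most θ.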
Efficiency of the procedure • Efficiency comes from document skipping • Must be able to compute upper bounds quickly • The match scoring formula should not use arbitrary features • (e.g., "word X on the page AND word Y in the ad") • Such pairwise ("cross-product") checks can get costly • Large posting lists; too many evaluations • Upper bounds are crucial to performance • Too large → false positives (wasted full evaluations); too small → false negatives (missed ads) • We use upper bounds recommended in the literature • A more efficient implementation is the subject of future research
System Architecture: scoring at serve time • Fast implementation of serving: building the posting lists offline • The main effect for ads determines the (static) ordering of ads within each posting list • The interaction effects modify the (static) idf table of words • The main effect for pages plays no role in ad serving (the page is given, so it shifts all candidate scores equally)
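A sketch of how the learned coefficients might be folded into such a static index; the concrete index layout is not given in the talk, so the structure, names, and the additive idf adjustment below are assumptions:

```python
def build_index(ads, psi, delta, idf):
    """ads:   dict ad_id -> dict word -> tf.
    psi:   learned ad main-effect coefficients (word -> weight).
    delta: learned interaction coefficients (word -> weight).
    idf:   original idf table (word -> value)."""
    # Static per-ad score from the ad main effects: fixes the ordering
    # of ads inside each posting list.
    static_score = {
        ad_id: sum(psi.get(w, 0.0) * tf for w, tf in words.items())
        for ad_id, words in ads.items()
    }
    postings = {}
    for ad_id, words in ads.items():
        for w in words:
            postings.setdefault(w, []).append(ad_id)
    for w in postings:
        postings[w].sort(key=lambda a: -static_score[a])
    # Interaction effects adjust the per-word weight table; the page main
    # effect is dropped because it is constant for a given page.
    adjusted_idf = {w: idf.get(w, 0.0) + delta.get(w, 0.0) for w in postings}
    return postings, static_score, adjusted_idf
```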
Outline 4 Experiments and Results, Summary and Ongoing Work
Experiments • [Figure: precision-recall curves, highlighting the low-recall region] • 25% lift in precision at 10% recall • Precision-recall computed over several splits • Results are statistically significant
Experiments • Increasing the number of words from 1000 to 3400 led to only marginal improvement • Diminishing returns: the system already performs close to its limit without needing more training • Changing the training time period changes the word list, so we update our posting lists periodically
Summary • Matching ads to pages is a challenging problem • We provide an approach that blends semantic similarity with predictive models in a scalable fashion • Our approach is index friendly • Experimental results on a large-scale system show significant improvement • Because we fall back to the semantic score when the click signal is weak, we can only improve on pure relevance models
Ongoing Work • Changes in the training data change the word set • Working on more robust word feature selection • Clustering words • Efficient indexing strategies through better upper-bound estimates for WAND • Expanding feature sets to include neighborhoods of words in posting lists • Balancing accuracy against WAND efficiency • Isotonic regression on the cosine similarity