Please interrupt me at any point!
Online Advertising: Open lecture at Warsaw University, February 25/26, 2011
Ingmar Weber, Yahoo! Research Barcelona, ingmar@yahoo-inc.com
Disclaimers & Acknowledgments • This talk presents the opinions of the author. It does not necessarily reflect the views of Yahoo! Inc. or any other entity. • Algorithms, techniques, features, etc. mentioned here might or might not be in use by Yahoo! or any other company. • Many of the slides in this lecture are based on tables/graphs from the referenced papers. Please see the actual papers for more details.
Review from last lecture • Lots of money • Ads essentially pay for the WWW • Mostly sponsored search and display ads • Sponsored search: sold using variants of the GSP (generalized second-price) auction • Display ads: sold in guaranteed-delivery (GD) contracts or on the spot market • Many computational challenges • Finding relevant ads, predicting CTRs, new/tail content and queries, detecting fraud, …
Plan for today and tomorrow • So far • Mostly introductory, “text book material” • Now • Mostly recent research papers • Crash course in machine learning, information retrieval, economics, … Hopefully more “think-along” (not sing-along) and not “shut-up-and-listen”
But first … • Third-party cookies: www.bluekai.com (many others …)
Efficient Online Ad Serving in a Display Advertising Exchange Kevin Lang, Joaquin Delgado, Dongming Jiang, et al. WSDM’11
Not so simple landscape for display advertising
Advertisers: “Buy shoes at nike.com”, “Visit asics.com today”, “Rolex is great.”
Publishers: a running blog, “The legend of Cliff Young”, celebrity gossip
Users: 32m (likes running), 50f (loves watches), 16m (likes sports)
Basic problem: Given a (user, publisher) pair, find a good ad(vertiser)
Ad networks and Exchanges • Ad networks • Bring together supply (publishers) and demand (advertisers) • Have bilateral agreements via revenue sharing to increase market fluidity • Exchanges • Do the actual real-time allocation • Implement the bilateral agreements
Example: a middle-aged, middle-income New Yorker visits the web site of Cigar Magazine (P1). D (the demand side, i.e. which advertiser gets the impression) is only known at the end. User constraints: no alcohol ads to minors. Supply constraints: a conservative network doesn’t want left-leaning publishers. Demand constraints: premium blogs don’t want spammy ads.
Depth-first search enumeration (Algorithm A) • Worst case running time? • Typical running time?
Algorithm B • US pruning: worst case running time? Sum vs. product? Optimizations? • D pruning: upper bound. Why? (a hedged sketch of DFS with demand-side pruning follows below)
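To make these two slides concrete, here is a minimal Python sketch, under assumptions: the toy graph and the helper names (`edge_allowed`, `reaches_sink`) are illustrative, not the paper’s actual algorithm. It enumerates advertiser paths depth-first and prunes branches from which no advertiser (demand sink) is reachable.

```python
# Minimal sketch (toy graph and helper names are assumptions, not the
# paper's code): DFS enumeration of (publisher -> ... -> advertiser)
# paths through ad-network agreements, with demand-side pruning.
from functools import lru_cache

GRAPH = {                       # node -> downstream partners
    "P1": ["N1", "N2"],         # publisher P1 deals with networks N1, N2
    "N1": ["N3", "A1"],
    "N2": ["A2"],
    "N3": ["A3"],
    "A1": [], "A2": [], "A3": [],
}
ADVERTISERS = {"A1", "A2", "A3"}    # sinks of the graph

@lru_cache(maxsize=None)
def reaches_sink(node):
    """D pruning helper: can any advertiser be reached from here?"""
    return node in ADVERTISERS or any(reaches_sink(n) for n in GRAPH[node])

def edge_allowed(src, dst, user):
    """Placeholder for user/supply/demand constraint checks
    (e.g. no alcohol ads to minors)."""
    return True

def enumerate_ads(start, user):
    """DFS (Algorithm A) plus pruning of dead branches."""
    results, stack = [], [(start, [start])]
    while stack:
        node, path = stack.pop()
        if node in ADVERTISERS:
            results.append(path)
            continue
        for nxt in GRAPH[node]:
            if edge_allowed(node, nxt, user) and reaches_sink(nxt):
                stack.append((nxt, path + [nxt]))
    return results

print(enumerate_ads("P1", user={"age": 35}))
# [['P1', 'N2', 'A2'], ['P1', 'N1', 'A1'], ['P1', 'N1', 'N3', 'A3']]
```

The worst case remains exponential in the number of paths; pruning pays off when most branches cannot reach an eligible advertiser.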
Reusable Precomputation • Cannot fully enforce D: it depends on the reachable sink … • … which in turn depends on U • What if there are space limitations? How would you prioritize?
Competing for Users’ Attention: On the Interplay between Organic and Sponsored Search Results Cristian Danescu-Niculescu-Mizil, Andrei Broder, et al. WWW’10 What would you investigate? What would you suspect?
Things to look at • General bias for near-identical things • Are ads preferred (being further “North”)? • Or are organic results preferred? • Interplay between ad CTR and result CTR • Better search results, fewer ad clicks? • Mutually reinforcing? • Dependence on type • Navigational query vs. informational query • Responsive ad vs. incidental ad
Data • One month of traffic for a subset of Y! search servers • Only North ads, served at least 50 times • For each query q_i: the most clicked ad A_i* and the most clicked organic result O_i* • 63,789 (q_i, O_i*, A_i*) triples • Bias?
(Non-)Commercial bias? • Look at A* and O* with identical domain • Probably similar quality … • … but the (North) ad is shown higher on the page • What do you think? • In 52% of cases ctr_O > ctr_A
Correlation [Figure: avg. ctr_A plotted against ctr_O; for a given (range of) ctr_O, all ads are bucketed]
Navigational vs. non-navigational [Figure: avg. ctr_A vs. ctr_O, split by query type] • Navigational: antagonistic effect • Non-navigational: (mild) reinforcement
Dependence on similarity • Bag-of-words overlap of title terms, e.g. overlap(“Free Radio”, “Pandora Radio – Listen to Free Internet Radio, Find New Music”) = 2/9: the titles share 2 terms (“free”, “radio”) out of 9 distinct terms in their union (Jaccard similarity)
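As a concrete illustration, a minimal Python version of this overlap computation (tokenization details are an assumption, not taken from the paper):

```python
import re

def title_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lower-cased bag-of-words title terms."""
    tok = lambda s: set(re.findall(r"[a-z]+", s.lower()))
    ta, tb = tok(a), tok(b)
    return len(ta & tb) / len(ta | tb)

print(title_overlap(
    "Free Radio",
    "Pandora Radio - Listen to Free Internet Radio, Find New Music",
))  # 2 shared terms / 9 distinct terms = 0.222...
```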
Dependence on similarity [Figure: avg. ctr_A as a function of ad/result similarity]
A simple model • Want to model the ad click-through rate as a function of the ad/result overlap • Also need: … • Explains the basic (quadratic) shape of overlap vs. ad click-through rate
Improving Ad Relevance in Sponsored Search Dustin Hillard, Stefan Schroedl, Eren Manavoglu, et al. WSDM’10
Ad relevance Ad attractiveness • Relevance • How related is the ad to the search query • q=“cocacola”, ad=“Buy Coke Online” • Attractiveness • Essentially click-through rate • q=“cocacola”, ad=“Coca Cola Company Job” • q=*, ad=“Lose weight fast and easy” Hope: decoupling leads to better (cold-start) CTR predictions
Basic setup • Get relevance from editorial judgments: perfect, excellent, good, fair, bad; treat non-bad as relevant • Machine learning approach: compare the query to the ad (title, description, display URL) • Features: word overlap (uni- and bigram), character overlap (uni- and bigram), cosine similarity, ordered bigram overlap, query length • Data: 7k unique queries (stratified sample), 80k judged query–ad pairs
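A hedged sketch of two of the listed features (the paper’s exact feature definitions may differ):

```python
# Toy versions of word-overlap and cosine-similarity features between
# a query and an ad text; definitions here are illustrative assumptions.
import math
from collections import Counter

def tokens(s):
    return s.lower().split()

def word_overlap(q, t):
    """Fraction of query words that also appear in the ad text."""
    qs, ts = set(tokens(q)), set(tokens(t))
    return len(qs & ts) / max(len(qs), 1)

def cosine(q, t):
    """Cosine similarity between term-frequency vectors."""
    cq, ct = Counter(tokens(q)), Counter(tokens(t))
    dot = sum(cq[w] * ct[w] for w in cq)
    norm = math.sqrt(sum(v * v for v in cq.values())) * \
           math.sqrt(sum(v * v for v in ct.values()))
    return dot / norm if norm else 0.0

query, ad_title = "coca cola", "Buy Coca Cola Online"
print(word_overlap(query, ad_title), cosine(query, ad_title))  # 1.0 0.707...
```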
Basic results – text only
Precision = (said ‘yes’ and was ‘yes’) / (said ‘yes’)
Recall = (said ‘yes’ and was ‘yes’) / (was ‘yes’)
Accuracy = (said the right thing) / (said something)
F1-score = 2 / (1/P + 1/R), the harmonic mean (≤ the arithmetic mean)
What other features?
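A tiny worked example of these metrics on made-up counts:

```python
# Illustrative counts, not from the paper:
tp, fp, fn = 80, 20, 40        # said-yes-was-yes, said-yes-was-no, said-no-was-yes

precision = tp / (tp + fp)     # 0.80
recall = tp / (tp + fn)        # 0.666...
f1 = 2 / (1 / precision + 1 / recall)
print(precision, recall, f1)   # harmonic mean 0.727... < arithmetic mean 0.733...
```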
Incorporating user clicks • Can use historic CTRs • Assumes (ad,query) pair has been seen • Useless for new ads • Also evaluate in blanked-out setting
Translation Model • In search, translation models are common; here the “document” D = the ad, and a good translation = an ad click • Typical model, e.g. P(q | a) = ∏_i ∑_j t(q_i | a_j), where q_i is a query term and a_j an ad term • t(q_i | a_j) is estimated by maximum likelihood from historic click data • Any problem with this?
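To make the maximum-likelihood step concrete, a toy sketch of estimating translation probabilities from clicked (query, ad) pairs; the data and the IBM-Model-1-style counting are illustrative assumptions, not the paper’s implementation. Note how unseen pairs get probability zero, previewing the problem the next slide digresses on:

```python
# Toy MLE translation probabilities from historic clicked pairs
# (illustrative data, not the paper's model or training procedure).
from collections import defaultdict

clicks = [                                # (query terms, ad terms) of clicked pairs
    (["cheap", "shoes"], ["nike", "shoes", "sale"]),
    (["running", "shoes"], ["asics", "shoes"]),
]

pair_count = defaultdict(float)
ad_term_count = defaultdict(float)
for q_terms, a_terms in clicks:
    for qt in q_terms:
        for at in a_terms:                # co-occurrence counting
            pair_count[(qt, at)] += 1.0
            ad_term_count[at] += 1.0

def t(q_term, a_term):
    """MLE estimate t(q_term | a_term); zero for unseen pairs (the problem!)."""
    c = ad_term_count[a_term]
    return pair_count[(q_term, a_term)] / c if c else 0.0

print(t("cheap", "shoes"))   # 0.25
print(t("cheap", "boots"))   # 0.0 -- unseen pair gets zero probability
```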
Digression on MLE • Maximum likelihood estimator: pick the parameter value under which the observed data is most likely • Example: draw a single number from a hat with numbers {1, …, n}. You observe 7. Maximum likelihood estimator? (n̂ = 7) • Underestimates size (cf. estimating the # of species) • Underestimates the probability of unknown/impossible events • Unbiased estimator? (For one uniform draw X, E[X] = (n+1)/2, so 2X − 1 is unbiased: here 13.)
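A quick simulation of the hat example, confirming that the MLE (here just the single draw itself) underestimates n while 2X − 1 is unbiased:

```python
# Simulate single draws from {1, ..., n} and compare the two estimators.
import random

n, trials = 100, 200_000
mle_sum = unbiased_sum = 0
for _ in range(trials):
    x = random.randint(1, n)       # one draw from the hat
    mle_sum += x                   # MLE estimate of n is x itself
    unbiased_sum += 2 * x - 1      # unbiased: E[2X - 1] = n

print(mle_sum / trials)       # ~ (n + 1) / 2 = 50.5, far below n
print(unbiased_sum / trials)  # ~ 100
```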
Remove position bias • Train one model as described before, but with smoothing • Train a second model using expected clicks (given the ads’ positions) • Take the ratio of the models for actual and expected clicks • Add these as additional features for the learner (a sketch of the actual-vs-expected-clicks idea follows)
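A minimal sketch of the actual-vs-expected-clicks idea (often called COEC, “clicks over expected clicks”); the position CTR priors and the impression log are made up for illustration:

```python
# Assumed global CTR per ad position (illustrative numbers):
POSITION_CTR = {1: 0.10, 2: 0.05, 3: 0.02}

# Impressions of one ad: (position, clicked?)
impressions = [(1, True), (1, False), (2, False), (3, True), (2, False)]

actual = sum(clicked for _, clicked in impressions)
expected = sum(POSITION_CTR[pos] for pos, _ in impressions)

print(actual / expected)  # > 1: the ad is clicked more than its positions predict
```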
Filtering low quality ads • Use the relevance score to remove irrelevant ads: don’t show ads below a relevance threshold • Showing fewer ads gave more clicks per search!
Estimating Advertisability of Tail Queries for Sponsored Search Sandeep Pandey, Kunal Punera, Marcus Fontoura, et al. SIGIR’10
Two important questions • Query advertisability • When to show ads at all • How many ads to show • Ad relevance and clickability • Which ads to show • Which ads to show where Focus on first problem. Predict: will there be an ad click? Difficult for tail queries!
Word-based Model • Query q has words {w_i}. Model q’s click propensity c(q) by combining the per-word propensities c(w_i). Good/bad? • Variant w/o a bias towards long queries • Maximum likelihood attempt to learn these: maximize ∏_q c(q)^s(q) (1 − c(q))^n(q), where s(q) = # instances of q with an ad click and n(q) = # instances of q without an ad click
Word-based Model • The full maximum-likelihood fit is hard … then give up: restrict to the case where each q has only one word, so c(w) can be read off directly
Linear regression model • Different model: words contribute linearly • Add regularization to avoid overfitting the underdetermined problem • Problem? (a minimal ridge-regression sketch follows)
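A minimal ridge-regression sketch under assumed data (made-up queries and click propensities, not the paper’s setup):

```python
# Ridge regression from bag-of-words query features to click propensity.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import Ridge

queries = ["cheap shoes", "running shoes", "free music", "shoes online"]
y = [0.30, 0.25, 0.05, 0.28]          # made-up click propensities

vec = CountVectorizer()
X = vec.fit_transform(queries)        # each word is one feature
model = Ridge(alpha=1.0).fit(X, y)    # L2 regularization tames the
                                      # underdetermined word weights

print(model.predict(vec.transform(["cheap music"])))
```

Each coefficient is a word’s linear contribution to the query’s click propensity; regularization shrinks weights for rare words, but a purely word-level model still misses topical context, which motivates the next slide.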
Digression [figures: illustrations of SVMs and of neural networks] Taken from: http://www.dtreg.com/svm.htm and http://www.teco.edu/~albrecht/neuro/html/node10.html
Topical clustering • Latent Dirichlet Allocation • Implicitly uses co-occurrence patterns • Incorporate the topic distributions as features in the regression model (see the sketch below)
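A hedged sketch of the idea on toy data (the paper’s pipeline and parameters are not shown here): fit LDA, then append the per-query topic mixtures to the word features.

```python
# Word features + LDA topic-distribution features for the regression.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.linear_model import Ridge
import numpy as np

queries = ["cheap shoes", "running shoes", "free music", "shoes online"]
y = [0.30, 0.25, 0.05, 0.28]                     # made-up click propensities

X_words = CountVectorizer().fit_transform(queries)
topics = LatentDirichletAllocation(n_components=2, random_state=0) \
    .fit_transform(X_words)                      # per-query topic mixture

X = np.hstack([X_words.toarray(), topics])       # words + topic features
model = Ridge(alpha=1.0).fit(X, y)
```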
Evaluation • Why not use the observed c(q) directly? The “ground truth” is not trustworthy for tail queries • Sort queries by predicted c(q) • Should have included the optimal ordering!
Learning Website Hierarchies for Keyword Enrichment in Contextual Advertising Pavan Kumar GM, Krishna Leela, Mehul Parsana, Sachin Garg. WSDM’11
The problem(s) • Keywords extracted for contextual advertising are not always perfect • Many pages are not indexed – no keywords available. Still have to serve ads • Want a system that for a given URL (indexed or not) outputs good keywords • Key observation: use in-site similarity between pages and content
Preliminaries • Map URLs u to key-value pairs (sketch below) • Represent webpage p as a vector of keywords, with tf, df, and the section where each keyword was found • Goals: use u to introduce new keywords and/or update existing weights; for unindexed pages, get keywords via other pages from the same site • Latency constraint!
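A minimal sketch of the URL-to-key-value mapping (the key scheme shown is an assumption, not the paper’s):

```python
# Map a URL to key-value pairs: host, positional path segments,
# and query parameters (illustrative scheme).
from urllib.parse import urlparse, parse_qsl

def url_to_kv(url):
    parts = urlparse(url)
    kv = {"host": parts.netloc}
    for i, seg in enumerate(p for p in parts.path.split("/") if p):
        kv[f"path{i}"] = seg                 # positional path segments as keys
    kv.update(parse_qsl(parts.query))        # query parameters as-is
    return kv

print(url_to_kv("http://example.com/sports/running/article?id=42"))
# {'host': 'example.com', 'path0': 'sports', 'path1': 'running',
#  'path2': 'article', 'id': '42'}
```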
What they do • Conceptually: train a decision tree with keys K as attribute labels, values V as attribute values, and pages P as class labels • Too many classes (sparseness, efficiency) • What they actually do: use clusters of web pages as the class labels (toy sketch below)
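A toy sketch of the cluster-label variant (features, clusters, and the scikit-learn stand-in are illustrative assumptions, not the paper’s system):

```python
# Decision tree over URL key-value features, predicting page clusters.
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

urls_as_kv = [                       # url_to_kv-style features
    {"path0": "sports", "path1": "running"},
    {"path0": "sports", "path1": "tennis"},
    {"path0": "gossip"},
]
cluster_labels = [0, 0, 1]           # labels are page clusters, not pages

vec = DictVectorizer(sparse=False)   # one-hot encodes string features
X = vec.fit_transform(urls_as_kv)
tree = DecisionTreeClassifier().fit(X, cluster_labels)

# An unindexed URL gets a cluster, whose member pages supply keywords.
new_page = vec.transform([{"path0": "sports", "path1": "marathon"}])
print(tree.predict(new_page))        # e.g. cluster 0 -> sports keywords
```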