This paper presents a novel method for determining and displaying relevant ads on a search engine results page in real-time, improving ad relevance and user engagement. By expanding queries and utilizing a scoring system, the paper enhances the efficiency of ad placements. The research addresses the challenge of tail queries and proposes a unique architecture for query feature extraction and ad feature weighting. Results indicate significant improvements in ad relevance.
Online Expansion of Rare Queries for Sponsored Search Defended by Mykell Miller
Summary: The Short Version This paper describes and evaluates a method of determining which ads to display on a search engine result page. Users input varied queries, so it is beneficial to show ads pertaining not only to the query itself, but to related queries as well. However, previous methods of finding these related queries and matching them to ads take a long time, and therefore must be done offline. This paper describes a method that allows some of the work to be done on the fly without too much overhead.
Why it’s good: The Short Version • Useful • Ads fund search engines • If ads were more relevant, Jared might actually click on them • The method shows statistically significant improvement in making ads more relevant, at a low overhead • Interesting • Interestingness is subjective, but this is MY defense • Well-written • Well-organized • I could actually understand the math because they very clearly told me what all the variables meant • They defined all the relevant terms and summarized all the references so I didn’t have to read 32 other papers. • Time Travel • This paper is only three weeks old • A paper that was published in April cited it
What this paper is about • Broad matching: an ad is displayed when its bid phrase is similar to, but not exactly the same as, the query the user entered.
What this paper is about • Sponsored Search • A.K.A. Paid search advertising • On Search Engine Result Pages • All major web search engines do this • Context Match • A.K.A. Contextual Advertising • On other websites • What we looked at last Wednesday
More on Sponsored Search • The authors assume a pay-per-click model • Google, Yahoo, and Microsoft all use this model • Bid Phrases • The query that, when issued, will result in showing this ad • Bidding system • An advertiser bids whatever amount it wants to associate its ad with a bid phrase • If an advertiser bids more, its ad gets a higher ranking. • Example: • High Bidders bids $1,000,000,000,000,000,000,000 for the bid phrase “Dummy Query” • Low Bidders bids $1 for the bid phrase “Dummy Query” • When I search for “Dummy Query” I see High Bidders’ ad first, then Low Bidders’ ad.
Why Do This Paper? • 30-40% of search engine result pages have no ads on them because Google, Yahoo, etc. cannot tell which bid phrases are similar to the query • Previous work has developed systems that are far too inefficient to use in real life
My Own Experiment • Query: Banana Bread • Query: Nut-Free Banana Bread • Query: Vegan Banana Bread
Why do tail queries have so few ads? • They are often harder to interpret than more common (head and torso) queries • There are rarely exact matches for bid phrases • There is little historical click data • Search engines don’t like posting irrelevant ads
What does this paper accomplish? • Online query expansion for tail queries • New way to index query expansions for fast computation of query similarity • A way to go from pre-expanded queries to expanding related queries on the fly • A ranking and scoring method
Query Feature Extraction • Unigrams • Process them via • Stemming • Taking words like “Extraction” and “Extracting” and reducing them to the stem “Extract” • Stop words • Dropping common words that carry little meaning (e.g., “the”, “and”) • Phrases • Multi-word phrases come from a dictionary of ~10 million phrases gathered from query logs and web pages • Semantic Classes • The authors developed a hierarchical taxonomy of 6000 semantic classes • Annotate each query with the 5 most likely semantic classes
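To make the unigram/phrase step concrete, here is a minimal sketch of what feature extraction might look like. The stemmer, stop-word list, and phrase dictionary below are tiny hypothetical stand-ins for the paper's actual resources (a real stemmer, a curated stop list, and the ~10-million-entry phrase dictionary), and the semantic-class annotation step is omitted.

```python
import re

# Hypothetical stand-ins for the paper's resources: a real system would use a
# full stemmer, a curated stop-word list, and the ~10M-entry phrase dictionary.
STOP_WORDS = {"the", "a", "an", "of", "for", "and"}
PHRASE_DICTIONARY = {"banana bread", "new york"}

def crude_stem(word):
    # Very rough stemming: strip a few common suffixes ("extraction" -> "extract").
    for suffix in ("ion", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def extract_features(query):
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    # Unigrams: drop stop words, then stem what remains.
    unigrams = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    # Phrases: keep adjacent word pairs that appear in the phrase dictionary.
    phrases = [
        " ".join(pair)
        for pair in zip(tokens, tokens[1:])
        if " ".join(pair) in PHRASE_DICTIONARY
    ]
    return {"unigrams": unigrams, "phrases": phrases}

print(extract_features("Vegan Banana Bread"))
# -> {'unigrams': ['vegan', 'banana', 'bread'], 'phrases': ['banana bread']}
```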
Related Query Retrieval • Now we have a pseudo-query made up of features. • Compare this pseudo-query against the inverted index and pull out related pseudo-queries • A retrieval system selects candidate queries that share features, then scores similarity using a dot product of the feature vectors
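A hedged sketch of the retrieval idea: represent each pseudo-query as a sparse feature-to-weight map, build an inverted index from feature to the queries containing it, and score candidates with a dot product. The query ids, features, and weights here are illustrative, not the paper's actual data structures.

```python
from collections import defaultdict

# Toy pre-expanded pseudo-queries: feature -> weight maps (illustrative values).
PSEUDO_QUERIES = {
    "banana bread": {"banana": 0.7, "bread": 0.5, "banana bread": 0.9},
    "zucchini bread": {"zucchini": 0.8, "bread": 0.5},
}

# Inverted index: feature -> list of query ids containing that feature.
INDEX = defaultdict(list)
for qid, features in PSEUDO_QUERIES.items():
    for feature in features:
        INDEX[feature].append(qid)

def retrieve_related(query_features, top_k=2):
    """Score every candidate sharing at least one feature, via a dot product."""
    scores = defaultdict(float)
    for feature, weight in query_features.items():
        for qid in INDEX.get(feature, []):
            scores[qid] += weight * PSEUDO_QUERIES[qid][feature]
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# e.g. the rare query "vegan banana bread" as a sparse feature vector
print(retrieve_related({"vegan": 0.6, "banana": 0.7, "bread": 0.5, "banana bread": 0.9}))
```

Walking the inverted index means only queries that share at least one feature are ever scored, which is what keeps the dot products cheap.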
Query Expansion • Q* is the set of features describing the original query and its related queries • The weight of a given feature in Q* is a linear combination of its weight in the original query and in the related queries • This expansion is efficient because you only look at the features of the related queries
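The slide does not spell out the exact combination rule, so the sketch below only illustrates the idea: a feature's weight in Q* is a linear mix of its weight in the original query and its similarity-weighted weight in the retrieved related queries. The mixing parameter `alpha` and the normalization are assumptions, not values from the paper.

```python
def expand_query(original, related, alpha=0.7):
    """
    original: feature -> weight map for the original query.
    related:  list of (similarity, feature -> weight map) for related queries.
    Returns Q*: a feature -> weight map over the union of features.
    alpha is an assumed mixing parameter, not taken from the paper.
    """
    expanded = {f: alpha * w for f, w in original.items()}
    total_sim = sum(sim for sim, _ in related) or 1.0
    for sim, features in related:
        for f, w in features.items():
            # Related-query contribution, normalized by total similarity.
            expanded[f] = expanded.get(f, 0.0) + (1 - alpha) * (sim / total_sim) * w
    return expanded
```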
Ad Feature Weighting • Extract the same features from the bid phrases of ad groups as from queries (unigrams, phrases, semantic classes) • Since the query-side weighting scheme would unfairly benefit short ad groups, the ad side uses the BM25 weighting scheme instead.
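For reference, a minimal sketch of standard BM25-style term weighting applied to ad-group features; k1 and b are the usual BM25 free parameters, and the corpus statistics passed in are placeholders rather than numbers from the paper.

```python
import math

def bm25_weight(term_freq, doc_len, avg_doc_len, num_docs, doc_freq, k1=1.2, b=0.75):
    """
    Standard BM25 term weight:
      term_freq   - occurrences of the feature in this ad group
      doc_len     - number of features in this ad group
      avg_doc_len - average ad-group length over the corpus
      num_docs    - total number of ad groups
      doc_freq    - number of ad groups containing the feature
    """
    idf = math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
    tf_norm = (term_freq * (k1 + 1)) / (
        term_freq + k1 * (1 - b + b * doc_len / avg_doc_len)
    )
    return idf * tf_norm
```

The (1 - b + b · doc_len / avg_doc_len) factor is BM25's length normalization, which is what compensates for ad groups of different sizes.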
Title Match Boosting • Increases the score of ads whose titles match the original query very well
Scoring Function • The end result of all this • A weighted sum of dot products between query features and ad features, plus the title match boost
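Putting the pieces together, here is a hedged sketch of what such a score might look like: one dot product per feature type (unigrams, phrases, semantic classes), combined with per-type weights, plus the title-match boost. The default weights and the boost argument are illustrative assumptions, not the paper's tuned values.

```python
def dot(a, b):
    # Dot product of two sparse feature -> weight maps.
    return sum(w * b.get(f, 0.0) for f, w in a.items())

def score_ad(expanded_query, ad_features, title_boost, type_weights=None):
    """
    expanded_query / ad_features: dicts mapping feature type ("unigram",
    "phrase", "class") to sparse feature -> weight maps.
    title_boost: extra score when the ad title matches the original query well.
    type_weights: per-feature-type weights; the defaults are placeholders.
    """
    type_weights = type_weights or {"unigram": 1.0, "phrase": 1.0, "class": 1.0}
    similarity = sum(
        type_weights[t] * dot(expanded_query[t], ad_features[t])
        for t in type_weights
    )
    return similarity + title_boost
```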
Test Set • Test set: 400 random rare queries from Yahoo • 121 were in the lookup table, 279 were not • Eliminated the 10% of rare queries that were foreign • Human editors judged the top 3 ads • 3556 judgments • The system was built from Yahoo’s full ad inventory and 100 million queries from U.S. Yahoo query logs
Metrics • Discounted Cumulative Gain (DCG) • “a measure of effectiveness of a Web search engine algorithm or related applications, often used in information retrieval. Using a graded relevance scale of documents in a search engine result set, DCG measures the usefulness, or gain, of a document based on its position in the result list. The gain is accumulated cumulatively from the top of the result list to the bottom with the gain of each result discounted at lower ranks.” –Wikipedia • DCG is a number; higher numbers are better • Precision-Recall Curves • Precision: Fraction of results returned that are relevant • Recall: Fraction of relevant results that are returned • A way to visualize it; higher is better
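For concreteness, a small sketch of how these metrics are computed; the log-discount form of DCG shown here is the common variant, and the example relevance grades are made up.

```python
import math

def dcg(relevances):
    """DCG: the gain of each result, discounted by log2 of its rank."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))

def precision_recall(returned_relevant, returned_total, relevant_total):
    precision = returned_relevant / returned_total
    recall = returned_relevant / relevant_total
    return precision, recall

# Made-up graded judgments for the top 3 ads on one result page.
print(round(dcg([2, 0, 1]), 3))   # 2/log2(2) + 0/log2(3) + 1/log2(4) = 2.5
print(precision_recall(2, 3, 4))  # ~ (0.667, 0.5) if 2 of the 3 returned ads are relevant
```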
Ad Matching Algorithms Tested • Baseline • The original, unexpanded version of the query vector • Offline Expansion • Expands the original query by pre-processing offline only • Online Expansion • Expands the original query by processing online only • Online + Offline Expansion • Expands the original query using both offline and online expansion algorithms
Test Results: Queries not found in lookup table • Tested the baseline vs online expansion • The online expansion gave statistically significant improvements
Test Results: Queries found in lookup table • Tested all 4 algorithms • Best: offline expansion • Second best: online + offline expansion • Difference between the two was not statistically significant
Test results: full set • Tested all four algorithms • Best: the online + offline hybrid expansion • Online expansion alone also offers a statistically significant improvement over the baseline
Efficiency • The table lookup takes only 1 ms • Least efficient when a query is not in the lookup table • When a query is not in the lookup table, there is a 50% overhead • This is bad • But given the small proportion of queries not in the lookup table, the estimated average is 12.5% overhead • This is good
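As a rough back-of-the-envelope check (the miss rate below is inferred from the two numbers on this slide, not stated directly): if a lookup-table miss costs about 50% extra latency and roughly a quarter of rare queries miss the table, the average overhead works out to about 0.50 × 0.25 ≈ 0.125, i.e. the 12.5% figure above.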