This paper presents a novel method for determining and displaying relevant ads on a search engine results page in real-time, improving ad relevance and user engagement. By expanding queries and utilizing a scoring system, the paper enhances the efficiency of ad placements. The research addresses the challenge of tail queries and proposes a unique architecture for query feature extraction and ad feature weighting. Results indicate significant improvements in ad relevance.
Online Expansion of Rare Queries for Sponsored Search Defended by Mykell Miller
Summary: The Short Version This paper describes and evaluates a method of determining which ads to display on a search engine result page. Users input varied queries, so it is beneficial to show ads pertaining not only to the query itself, but to related queries as well. However, previous methods of finding these related queries and matching them to ads take a long time, and therefore must be done offline. This paper describes a method that allows some of the work to be done on the fly without too much overhead.
Why it’s good: The Short Version • Useful • Ads fund search engines • If ads were more relevant, Jared might actually click on them • The method shows statistically significant improvement in making ads more relevant, at a low overhead • Interesting • Interestingness is subjective, but this is MY defense • Well-written • Well-organized • I could actually understand the math because they very clearly told me what all the variables meant • They defined all the relevant terms and summarized all the references so I didn’t have to read 32 other papers. • Time Travel • This paper is only three weeks old • A paper that was published in April cited it
What this paper is about • Broad matching: an ad is displayed when its bid phrase is similar to, but not exactly the same as, the query the user entered.
What this paper is about • Sponsored Search • A.K.A. Paid search advertising • On Search Engine Result Pages • All major web search engines do this • Context Match • A.K.A. Contextual Advertising • On other websites • What we looked at last Wednesday
More on Sponsored Search • The authors assume a pay-per-click model • Google, Yahoo, and Microsoft all use this model • Bid Phrases • The query that, when issued, will result in showing this ad • Bidding system • An advertiser bids whatever amount it wants to associate its ad with a bid phrase • If an advertiser bids more, its ad gets a higher ranking. • Example: • High Bidders bids $1,000,000,000,000,000,000,000 for the bid phrase “Dummy Query” • Low Bidders bids $1 for the bid phrase “Dummy Query” • When I search for “Dummy Query” I see High Bidders’ ad first, then Low Bidders’ ad.
Why Do This Paper? • 30-40% of search engine result pages have no ads on them because Google, Yahoo, etc. cannot tell which bid phrases are similar to the query • Previous work has developed systems that are far too inefficient to use in real life
My Own Experiment • Query: Banana Bread • Query: Nut-Free Banana Bread • Query: Vegan Banana Bread
Why do tail queries have so few ads? • They are often harder to interpret than more common (head and torso) queries • There are rarely exact matches for bid phrases • There is little historical click data • Search engines don’t like posting irrelevant ads
What does this paper accomplish? • Online query expansion for tail queries • New way to index query expansions for fast computation of query similarity • A way to go from pre-expanded queries to expanding related queries on the fly • A ranking and scoring method
Query Feature Extraction • Unigrams • Process them via • Stemming • Taking words like “Extraction” and “Extracting” and reducing them to the stem “Extract” • Stop words • Dropping common words that carry little meaning (e.g., “the”, “and”) • Phrases • Multi-word phrases come from a dictionary of ~10 million phrases gathered from query logs and web pages • Semantic Classes • The authors developed a hierarchical taxonomy of 6000 semantic classes • Annotate each query with the 5 most likely semantic classes
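To make the unigram/phrase step concrete, here is a minimal sketch of what feature extraction might look like. The stemmer, stop-word list, and phrase dictionary below are tiny hypothetical stand-ins for the paper's actual resources (a real stemmer, a curated stop list, and the ~10-million-entry phrase dictionary), and the semantic-class annotation step is omitted.

```python
import re

# Hypothetical stand-ins for the paper's resources: a real system would use a
# full stemmer, a curated stop-word list, and the ~10M-entry phrase dictionary.
STOP_WORDS = {"the", "a", "an", "of", "for", "and"}
PHRASE_DICTIONARY = {"banana bread", "new york"}

def crude_stem(word):
    # Very rough stemming: strip a few common suffixes ("extraction" -> "extract").
    for suffix in ("ion", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def extract_features(query):
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    # Unigrams: drop stop words, then stem what remains.
    unigrams = [crude_stem(t) for t in tokens if t not in STOP_WORDS]
    # Phrases: keep adjacent word pairs that appear in the phrase dictionary.
    phrases = [
        " ".join(pair)
        for pair in zip(tokens, tokens[1:])
        if " ".join(pair) in PHRASE_DICTIONARY
    ]
    return {"unigrams": unigrams, "phrases": phrases}

print(extract_features("Vegan Banana Bread"))
# -> {'unigrams': ['vegan', 'banana', 'bread'], 'phrases': ['banana bread']}
```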
Related Query Retrieval • Now we have a pseudo-query made up of features. • Compare this pseudo-query against the inverted index and pull out related pseudo-queries • A retrieval system selects candidate queries that share features, then scores similarity using a dot product of the feature vectors
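A hedged sketch of the retrieval idea: represent each pseudo-query as a sparse feature-to-weight map, build an inverted index from feature to the queries containing it, and score candidates with a dot product. The query ids, features, and weights here are illustrative, not the paper's actual data structures.

```python
from collections import defaultdict

# Toy pre-expanded pseudo-queries: feature -> weight maps (illustrative values).
PSEUDO_QUERIES = {
    "banana bread": {"banana": 0.7, "bread": 0.5, "banana bread": 0.9},
    "zucchini bread": {"zucchini": 0.8, "bread": 0.5},
}

# Inverted index: feature -> list of query ids containing that feature.
INDEX = defaultdict(list)
for qid, features in PSEUDO_QUERIES.items():
    for feature in features:
        INDEX[feature].append(qid)

def retrieve_related(query_features, top_k=2):
    """Score every candidate sharing at least one feature, via a dot product."""
    scores = defaultdict(float)
    for feature, weight in query_features.items():
        for qid in INDEX.get(feature, []):
            scores[qid] += weight * PSEUDO_QUERIES[qid][feature]
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# e.g. the rare query "vegan banana bread" as a sparse feature vector
print(retrieve_related({"vegan": 0.6, "banana": 0.7, "bread": 0.5, "banana bread": 0.9}))
```

Walking the inverted index means only queries that share at least one feature are ever scored, which is what keeps the dot products cheap.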
Query Expansion • Q* is the set of features describing the original query and its related queries • The weight of a given feature in Q* is a linear combination of its weight in the original query and in the related queries • This expansion is efficient because you only look at the features of the related queries
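The slide does not spell out the exact combination rule, so the sketch below only illustrates the idea: a feature's weight in Q* is a linear mix of its weight in the original query and its similarity-weighted weight in the retrieved related queries. The mixing parameter `alpha` and the normalization are assumptions, not values from the paper.

```python
def expand_query(original, related, alpha=0.7):
    """
    original: feature -> weight map for the original query.
    related:  list of (similarity, feature -> weight map) for related queries.
    Returns Q*: a feature -> weight map over the union of features.
    alpha is an assumed mixing parameter, not taken from the paper.
    """
    expanded = {f: alpha * w for f, w in original.items()}
    total_sim = sum(sim for sim, _ in related) or 1.0
    for sim, features in related:
        for f, w in features.items():
            # Related-query contribution, normalized by total similarity.
            expanded[f] = expanded.get(f, 0.0) + (1 - alpha) * (sim / total_sim) * w
    return expanded
```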
Ad Feature Weighting • Extract the same features from the bid phrases of ad groups as from queries (unigrams, phrases, semantic classes) • Since the query-side weighting scheme would unfairly benefit short ad groups, the ad side uses the BM25 weighting scheme instead.
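For reference, a minimal sketch of standard BM25-style term weighting applied to ad-group features; k1 and b are the usual BM25 free parameters, and the corpus statistics passed in are placeholders rather than numbers from the paper.

```python
import math

def bm25_weight(term_freq, doc_len, avg_doc_len, num_docs, doc_freq, k1=1.2, b=0.75):
    """
    Standard BM25 term weight:
      term_freq   - occurrences of the feature in this ad group
      doc_len     - number of features in this ad group
      avg_doc_len - average ad-group length over the corpus
      num_docs    - total number of ad groups
      doc_freq    - number of ad groups containing the feature
    """
    idf = math.log((num_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
    tf_norm = (term_freq * (k1 + 1)) / (
        term_freq + k1 * (1 - b + b * doc_len / avg_doc_len)
    )
    return idf * tf_norm
```

The (1 - b + b · doc_len / avg_doc_len) factor is BM25's length normalization, which is what compensates for ad groups of different sizes.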
Title Match Boosting • Increases the score of ads whose titles match the original query very well
Scoring Function • The end result of all this • A weighted sum of dot products between query features and ad features, plus the title match boost
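Putting the pieces together, here is a hedged sketch of what such a score might look like: one dot product per feature type (unigrams, phrases, semantic classes), combined with per-type weights, plus the title-match boost. The default weights and the boost argument are illustrative assumptions, not the paper's tuned values.

```python
def dot(a, b):
    # Dot product of two sparse feature -> weight maps.
    return sum(w * b.get(f, 0.0) for f, w in a.items())

def score_ad(expanded_query, ad_features, title_boost, type_weights=None):
    """
    expanded_query / ad_features: dicts mapping feature type ("unigram",
    "phrase", "class") to sparse feature -> weight maps.
    title_boost: extra score when the ad title matches the original query well.
    type_weights: per-feature-type weights; the defaults are placeholders.
    """
    type_weights = type_weights or {"unigram": 1.0, "phrase": 1.0, "class": 1.0}
    similarity = sum(
        type_weights[t] * dot(expanded_query[t], ad_features[t])
        for t in type_weights
    )
    return similarity + title_boost
```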
Test Set • Test set: 400 random rare queries from Yahoo • 121 were in the lookup table, 279 were not • Eliminated the 10% of rare queries that were foreign • Human editors judged the top 3 ads • 3556 judgments • The system was built from Yahoo’s full ad inventory and 100 million queries from U.S. Yahoo query logs
Metrics • Discounted Cumulative Gain (DCG) • “a measure of effectiveness of a Web search engine algorithm or related applications, often used in information retrieval. Using a graded relevance scale of documents in a search engine result set, DCG measures the usefulness, or gain, of a document based on its position in the result list. The gain is accumulated cumulatively from the top of the result list to the bottom with the gain of each result discounted at lower ranks.” –Wikipedia • DCG is a number; higher numbers are better • Precision-Recall Curves • Precision: Fraction of results returned that are relevant • Recall: Fraction of relevant results that are returned • A way to visualize it; higher is better
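For concreteness, a small sketch of how these metrics are computed; the log-discount form of DCG shown here is the common variant, and the example relevance grades are made up.

```python
import math

def dcg(relevances):
    """DCG: the gain of each result, discounted by log2 of its rank."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))

def precision_recall(returned_relevant, returned_total, relevant_total):
    precision = returned_relevant / returned_total
    recall = returned_relevant / relevant_total
    return precision, recall

# Made-up graded judgments for the top 3 ads on one result page.
print(round(dcg([2, 0, 1]), 3))   # 2/log2(2) + 0/log2(3) + 1/log2(4) = 2.5
print(precision_recall(2, 3, 4))  # ~ (0.667, 0.5) if 2 of the 3 returned ads are relevant
```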
Ad Matching Algorithms Tested • Baseline • The original, unexpanded version of the query vector • Offline Expansion • Expands the original query by pre-processing offline only • Online Expansion • Expands the original query by processing online only • Online + Offline Expansion • Expands the original query using both offline and online expansion algorithms
Test Results: Queries not found in lookup table • Tested the baseline vs online expansion • The online expansion gave statistically significant improvements
Test Results: Queries found in lookup table • Tested all 4 algorithms • Best: offline expansion • Second best: online + offline expansion • Difference between the two was not statistically significant
Test results: full set • Tested all four algorithms • Best: the online + offline hybrid expansion • Online expansion alone also offers a statistically significant improvement over the baseline
Efficiency • The table lookup takes only 1 ms • Least efficient when a query is not in the lookup table • When a query is not in the lookup table, there is a 50% overhead • This is bad • But given the small proportion of queries not in the lookup table, the estimated average is 12.5% overhead • This is good
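As a rough back-of-the-envelope check (the miss rate below is inferred from the two numbers on this slide, not stated directly): if a lookup-table miss costs about 50% extra latency and roughly a quarter of rare queries miss the table, the average overhead works out to about 0.50 × 0.25 ≈ 0.125, i.e. the 12.5% figure above.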