Query Understanding for Relevance Measurement. Nick Craswell, Microsoft
Thought Experiment: "Freshness"
• Say our query model can detect "fresh intent"
• For space shuttle, promote fresh results…
• …if "fresh intent" is detected
• We have an "aggressiveness" parameter (slide graphic: settings yielding less Wikipedia, equal Wikipedia, or more Wikipedia)
• The old judgments carry a systematic bias towards established pages; collect new judgments under updated guidelines and we can claim victory. But would real users agree? How can we know this?
• Understanding the user needs underlying a query can be very difficult, even for a human relevance judge.
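To make the thought experiment concrete, here is a minimal sketch (not from the talk) of a hypothetical fresh-intent boost controlled by an aggressiveness parameter. All names, scores, and thresholds are illustrative assumptions.

```python
# Minimal sketch of the "freshness" thought experiment (hypothetical names and values).
# A query model emits a fresh-intent score; an aggressiveness parameter controls how
# strongly recently updated pages are promoted over established ones.

def rerank_for_freshness(results, fresh_intent_score, aggressiveness):
    """results: list of dicts with a base relevance 'score' and a 'freshness' signal
    in [0, 1]. A freshness boost is applied only when fresh intent is detected."""
    if fresh_intent_score < 0.5:          # no fresh intent detected: leave ranking alone
        return sorted(results, key=lambda r: r["score"], reverse=True)
    boost = aggressiveness * fresh_intent_score
    return sorted(results,
                  key=lambda r: r["score"] + boost * r["freshness"],
                  reverse=True)

# Example: for "space shuttle", a news page may outrank the established Wikipedia
# article once the boost is aggressive enough.
results = [
    {"url": "wikipedia.org/wiki/Space_Shuttle", "score": 0.9, "freshness": 0.1},
    {"url": "news.example.com/shuttle-launch",  "score": 0.7, "freshness": 0.9},
]
print([r["url"] for r in rerank_for_freshness(results, fresh_intent_score=0.8,
                                              aggressiveness=0.5)])
```

The point of the slide is exactly that tuning this kind of aggressiveness knob cannot be validated by judgments that are biased towards established pages.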
Goal: Fidelity of Ground Truth
• Correctly capture the preferences of the user (or users) who typed query Q
• Secondary goals: cheap, reusable, flexible, sensitive experiments, available in both academia and industry
Why Should IR People Care?
• It affects our results:
• Judging guidelines change the outcome
• Informational vs navigational changes the outcome
• Measuring diversity changes the outcome
• Enterprise track: judge expertise changes the outcome
• Future: the query model knows more than the judge can
• "This is wrong but we don't care" seems like a weak argument
• Real queries: lots of people say it's important
• IR people should care about helping users
Why Industry Cares
• Be careful what you wish for (or optimize for)
• Rich and powerful models
• Trading off: more/less Wikipedia, IMDB vs RottenTomatoes, nav vs shopping vs reviews, diverse results, local results, personalized results, fresh results (not to mention multimedia etc.)
• You get your ground truth
• We have clicks
Problem: the highest-fidelity experiments, with real users, are some of the least sensitive, i.e. they lack statistical power.
Today's Talk: Two Lines of Work
• Test collection: right relevance criterion, real queries, and sufficient judge expertise
• Alternative: real users, e.g. clicks/interleaving
Traditional Cranfield/TREC
• Detailed query topics
• TREC: What is the economic impact of recycling tires?
• Cranfield: what similarity laws must be obeyed when constructing aeroelastic models of heated high speed aircraft
• Binary judgments, with a low bar
TREC-8: No big MAP gains from links
• In retrospect: informational task, weird queries, judging ASCII, binary judging (low bar?)
Hawking, Voorhees, Craswell and Bailey. Overview of the TREC-8 Web Track. TREC, 1999.
In fact, under this methodology, we beat the search engines!
Hawking, Craswell, Thistlewaite and Harman. Results and Challenges in Web Search Evaluation. WWW8, 1999.
"Nav Intent" Changes the Outcome
• 2000: complaints, internship, Broder taxonomy
Nick Craswell, David Hawking, Stephen Robertson. Effective Site Finding Using Link Anchor Information. SIGIR 2001.
Summary: Adjusting TREC to the Web
• Real queries, where possible
• Reward different types of results
• Homepage, topic distillation, HRel, Rel, NotRel
• Smaller reward for being marginally on-topic
• Hopefully even penalize spam pages
• Coming up: diversity and real judges (TREC Enterprise)
David Hawking and Nick Craswell. The Very Large Collection and Web Tracks. In TREC: Experiment and Evaluation in Information Retrieval, edited by Ellen M. Voorhees and Donna K. Harman. September 2005.
Web Track Diversity Experiments
• Modeling the fact that:
• Different users want different things (ambiguous)
• Individual users want a variety of things (faceted)

<topic number="20" type="ambiguous">
  <query>defender</query>
  <description>I'm looking for the homepage of Windows Defender, an anti-spyware program.</description>
  <subtopic number="1" type="nav">I'm looking for the homepage of Windows Defender, an anti-spyware program.</subtopic>
  <subtopic number="2" type="inf">Find information on the Land Rover Defender sport-utility vehicle.</subtopic>
  <subtopic number="3" type="nav">I want to go to the homepage for Defender Marine Supplies.</subtopic>
  <subtopic number="4" type="inf">I'm looking for information on Defender, an arcade game by Williams. Is it possible to play it online?</subtopic>
  <subtopic number="5" type="inf">I'd like to find user reports about Windows Defender, particularly problems with the software.</subtopic>
  <subtopic number="6" type="nav">Take me to the homepage for the Chicago Defender newspaper.</subtopic>
</topic>

<topic number="47" type="faceted">
  <query>indexed annuity</query>
  <description>I'm looking for information about indexed annuities.</description>
  <subtopic number="1" type="inf">What is an indexed annuity? What are their advantages and disadvantages? What kinds... are there?</subtopic>
  <subtopic number="2" type="inf">Where can I buy an indexed annuity? What investment companies offer them?</subtopic>
  <subtopic number="3" type="inf">Find ratings of indexed annuities.</subtopic>
</topic>

Clarke, Craswell, Soboroff. Overview of the TREC-2009 Web Track.
Clarke, Craswell, Soboroff, Cormack. Overview of the TREC-2010 Web Track.
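To make the topic format concrete, here is a minimal sketch (not from the talk) that parses a topic in the XML layout shown above using only the Python standard library. The element and attribute names come from the snippet; the shortened topic text is illustrative.

```python
# Minimal sketch: parse a TREC Web Track diversity topic in the format shown above
# into its query and subtopics, using only the standard library.
import xml.etree.ElementTree as ET

topic_xml = """
<topic number="47" type="faceted">
  <query>indexed annuity</query>
  <description>I'm looking for information about indexed annuities.</description>
  <subtopic number="1" type="inf">What is an indexed annuity?</subtopic>
  <subtopic number="2" type="inf">Where can I buy an indexed annuity?</subtopic>
  <subtopic number="3" type="inf">Find ratings of indexed annuities.</subtopic>
</topic>
"""

topic = ET.fromstring(topic_xml)
query = topic.find("query").text
subtopics = [
    {"number": int(s.get("number")),
     "type": s.get("type"),                 # "nav" or "inf"
     "text": " ".join(s.text.split())}      # collapse whitespace
    for s in topic.findall("subtopic")
]
print(query)            # indexed annuity
print(len(subtopics))   # 3 facets of the same information need
```

Diversity metrics over a run can then be computed per subtopic, rewarding result lists that cover several of the listed intents.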
Diversity and Fidelity
• It would be great if our query model could understand the dominant intent
• e.g. ako = Army Knowledge Online
• It would be even better if our query models could understand the diverse intents and concepts
• To have any chance of evaluating this, our ground truth has to match some real user or population of real users
Subtopic Development
• Hints from user data
• Click data from 2009
• Ian used unigrams for topic development (rough sketch below), e.g. for the query house:
• house, plans, luxury
• house, md, tv, show, hugh, laurie
• Alternative: query suggestions
Filip Radlinski, Martin Szummer, and Nick Craswell. Inferring Query Intent from Reformulations and Clicks. WWW 2010 poster.
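A rough sketch of how unigram-based subtopic hints might be mined from related queries in a click log. This is an illustration under assumptions, not the exact procedure used for the track; the query list is toy data.

```python
# Rough sketch (an assumption about the method): count the words that co-occur with
# a head query term across related queries from a click log, and surface the most
# frequent ones as candidate intent hints.
from collections import Counter

def subtopic_unigrams(head_term, related_queries, top_k=5):
    counts = Counter()
    for q in related_queries:
        tokens = q.lower().split()
        if head_term in tokens:
            counts.update(t for t in tokens if t != head_term)
    return [w for w, _ in counts.most_common(top_k)]

# Toy click-log queries around the ambiguous query "house".
related = [
    "house plans", "luxury house plans", "house md", "house tv show",
    "house md hugh laurie", "house floor plans",
]
print(subtopic_unigrams("house", related))
# e.g. ['plans', 'md', 'luxury', 'tv', 'show'] -> hints at the house-plans and
# House M.D. senses seen on the slide.
```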
Crowdsourcing Work with Emine Yilmaz
mayo clinic jacksonvillefl
• address of mayo clinic
• I would be looking for information like working hours and location of the clinic.
• I would be looking to learn about Mayo Clinic in Jacksonville, Florida, including medical services available and how to request an appointment.
• I would be trying to find out the address of the Mayo Clinic.
• I would look for information about the doctors at mayo clinic jacksonvillefl, the services available, testimonials, its location and timings.
• I'd be looking for a branch of the Mayo Clinic in the town of Jacksonville, Fl.
• The Mayo Clinic located in Jacksonville, Florida. (Nice Hit.)
Intents: 1) address, 2) hours, 3) location, 4) doctors, 5) services, 6) testimonials
kenmore gas water heater
• A water heater manufactured by Kenmore, which needs fuel to work. (Nice Task.)
• Checking the price of various brands of water heaters.
• I would be looking for information about 'Kenmore' brand of gas water heaters.
• I would be trying to find out if the Kenmore brand of gas/water heater was any good.
• I'd be looking for information about a kenmore gas water heater from sears to see whether I wanted to buy it to replace my current gas water heater.
• to know the price of kenmore gas water heater
Intents: 1) information regarding purchase, 2) prices, 3) comparison to other brands (is it "any good", "various brands", etc.)
fickle creek farm
• An pollution free farm in the Piedmont of North Carolina, which provides fruits, vegetables, meat, eggs, dairy etc. (Good)
• I would be looking for information about fickle creek farm, where it is located and about its produce.
• I would be looking to visit the farm as a tourist.
• I'd be looking for a farm called 'fickle creek.'
Intents: 1) What is it, 2) Can I visit it, 3) What does it produce
travel agent salaries
• I would be looking for the average salary for a travel agent, particularly in my country.
• I would be trying to find out the average salary that travel agents make.
• I would be trying to understand the range of salaries, travel agents are paid.
• I'd be looking for the range of salaried that travel agents are paid.
• If i wrote the query "travel agent salaries" to a search engine i would be looking for a travel agent job.
• Making notes for the negotiation on salary during job interview.
• searching for the salary of travel agent
• The average salary of a Travel Agent working in Tourism industry. (Good Hit.)
Intents: 1) average + range
Summary of Diversity Work
• Measuring diversity changed outcomes
• TREC participants: new models
• Future models may identify likely intents
• e.g. ako = Army Knowledge Online
• To correctly measure such a model, we need realistic intents based on e.g. clicks or crowdsourcing
Judge Expertise
• Does the judge understand the query topic?
• If it is technical, we might be in trouble
• 'trie in perl': Wikipedia. That's great! :-)
• Shall we crowdsource: what similarity laws must be obeyed when constructing aeroelastic models of heated high speed aircraft?
• TREC enterprise track: Gold-Silver-Bronze judges
• Gold: task expert and topic originator
• Silver: task expert who did not create the topic
• Bronze: not a task expert and did not create the topic
Judge Expertise Changes the Outcome
Peter Bailey, Nick Craswell, Ian Soboroff, Paul Thomas, Arjen P. de Vries and Emine Yilmaz. Relevance Assessment: Are Judges Exchangeable and Does it Matter. SIGIR 2008.
Summary: Judge Expertise
• Judging of expert queries by non-experts may be a bad idea
• There are "real queries" that are unjudgeable
• … so don't crowdsource the TREC Legal Track?
Lab or Click-based A/B Experiments
• Show systems A and B to users
• Lab study: Latin square design, tasks
• Log study: divide up your userbase (see the bucketing sketch below), measure engagement
• Problem 1: sensitivity
• A and B may not differ by much
• Logs: you may need many users over a long time period
• Problem 2: what is success?
• Get the same answer sooner? Get the same answer with more confidence? Get a better answer? Find a better question? Minimise frustration? Be educated? Be entertained? Trust the engine more? Like it more?
• Despite these problems, the gold standard
• Classic problem: stop and call it neutral
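For the log-study bullet, here is a minimal sketch of deterministic user bucketing, a common way to "divide up your userbase" so each user consistently sees the same system for the whole experiment. The function and experiment names are illustrative, not from the talk.

```python
# Minimal sketch of log-based A/B bucketing (illustrative names): hash a user id
# with a per-experiment salt so assignment is stable across sessions and days.
import hashlib

def assign_bucket(user_id, experiment="rankerA_vs_rankerB", treatment_share=0.5):
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    fraction = int(digest[:8], 16) / 0xFFFFFFFF   # roughly uniform in [0, 1]
    return "B" if fraction < treatment_share else "A"

print(assign_bucket("user-12345"))   # same answer every time for this user
```

Engagement metrics (clicks, sessions, abandonment) are then compared between the two populations, which is exactly where the sensitivity problem above bites.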
Interleaving
• 2002: Thorsten Joachims [1,2] at Cornell. Presented at Google.
• 2008 CIKM paper [3]: interleaving can detect large differences reliably; traditional A/B tests were far less sensitive. Experiments for [3,5] were done on the arXiv.org e-print archive.
• SIGIR 2010 papers:
• Bing paper [4] finds interleaving agrees with NDCG and is more sensitive
• Yahoo paper [5] makes interleaving even more sensitive
• WSDM 2011 paper [6]: Firefox toolbar, personalization, agrees with labels
• Interleaving is in use at search engines, Cornell, Cambridge
[1] T. Joachims. Evaluating Retrieval Performance Using Clickthrough Data. SIGIR Workshop on Mathematical/Formal Methods in Information Retrieval, 2002.
[2] T. Joachims. Optimizing Search Engines Using Clickthrough Data. KDD 2002.
[3] F. Radlinski, M. Kurup, T. Joachims. How Does Clickthrough Data Reflect Retrieval Quality? CIKM 2008.
[4] F. Radlinski and N. Craswell. Comparing the Sensitivity of Information Retrieval Metrics. SIGIR 2010.
[5] Y. Yue, Y. Gao, O. Chapelle, Y. Zhang, and T. Joachims. Learning More Powerful Test Statistics for Click-Based Retrieval Evaluation. SIGIR 2010.
[6] N. Matthijs, F. Radlinski. Personalizing Web Search Using Long Term Browsing History. WSDM 2011.
Interleaving
• Team draft interleaving algorithm (a runnable sketch follows below)
• Input: two rankings (the team captains)
• Repeat:
• Coin toss: who picks next
• Winner picks their best remaining* player
• Loser picks their best remaining* player
• Output: one ranking (2 teams of 5)
• The team with the most clicks wins
• Notes: over many repetitions, each team has a fair chance due to the coin toss. The lower you rank a player, the less likely they are to end up on your team.
• *Remaining: not picked and not a near-dupe/redirect of a picked result
• (Slide graphic: an example interleaved ranking receives three clicks, all on one team's results, so that team, shown in green, wins.)
F. Radlinski, M. Kurup, T. Joachims. How Does Clickthrough Data Reflect Retrieval Quality? CIKM 2008.
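A minimal sketch of the team draft procedure described on this slide, plus per-impression click credit. The near-dupe/redirect filtering from the footnote is omitted for brevity, and the document ids are illustrative.

```python
# Sketch of team draft interleaving: each round a coin toss decides which ranker
# drafts first, each "captain" picks its highest remaining result, and every picked
# result remembers its team. Clicks are then credited to the contributing team.
import random

def team_draft_interleave(ranking_a, ranking_b, length=10, rng=random):
    interleaved, teams = [], {}              # teams maps doc -> "A" or "B"
    while len(interleaved) < length:
        first_is_a = rng.random() < 0.5      # coin toss: who drafts first this round
        progress = False
        for team in (("A", "B") if first_is_a else ("B", "A")):
            if len(interleaved) >= length:
                break
            source = ranking_a if team == "A" else ranking_b
            # best remaining result: not already picked (near-dupe filtering omitted)
            pick = next((d for d in source if d not in teams), None)
            if pick is not None:
                interleaved.append(pick)
                teams[pick] = team
                progress = True
        if not progress:                     # both rankings exhausted
            break
    return interleaved, teams

def credit_clicks(clicked_docs, teams):
    """Return 'A', 'B', or 'tie' for one impression of the interleaved list."""
    a = sum(1 for d in clicked_docs if teams.get(d) == "A")
    b = sum(1 for d in clicked_docs if teams.get(d) == "B")
    return "A" if a > b else "B" if b > a else "tie"

ranking_a = ["d1", "d2", "d3", "d4", "d5"]
ranking_b = ["d9", "d2", "d8", "d7", "d6"]
mixed, teams = team_draft_interleave(ranking_a, ranking_b, length=6)
print(mixed)                                 # e.g. ['d1', 'd9', 'd2', 'd8', 'd3', 'd7']
print(credit_clicks(["d9", "d8"], teams))    # both clicked docs came from B -> 'B'
```

Aggregating the per-impression winners over many impressions gives the overall preference between the two rankers.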
Interleaving vs NDCG
• Interleaving is highly sensitive: 1 query, 10 impressions (slide chart)
F. Radlinski and N. Craswell. Comparing the Sensitivity of Information Retrieval Metrics. SIGIR 2010.
Interleaving vs NDCG: Further Notes
• Interleaving: real user, expertise, context
• Interleaving: no guidelines
• NDCG: the judge sees the landing page and can consider nuances of quality
• Interleaving: the preference is based on the result "caption"
• Content farms optimize captions to attract clicks
• NDCG: reusable test collection
• Interleaving: less overfitting
Interleaving vs A/B
• Why is interleaving more sensitive to ranking changes than A/B tests?
• Say results are worth $$. For a certain query:
• Ranker A offers $11, $10 or $2; the user clicks
• Ranker B offers $250, $250 or $95; the user also clicks
• Users of A may not know what they're missing
• The difference in behavior is small, hard to discern
• But interleave A and B: strong preference for B
• A direct, within-user preference, so highly sensitive (toy simulation below)
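A toy simulation of the $$ example above, with purely illustrative values and a deliberately simplistic click model (the user clicks whatever looks best on the page shown). The same behaviour produces identical A/B click counts but a near-unanimous interleaving preference for B.

```python
# Toy simulation of the $$ example (illustrative numbers, not real data).
import random

VALUES_A = {"a1": 11, "a2": 10, "a3": 2}      # what ranker A offers, in $
VALUES_B = {"b1": 250, "b2": 250, "b3": 95}   # what ranker B offers, in $
ALL_VALUES = {**VALUES_A, **VALUES_B}

def click_best(shown):
    """Simplistic user: clicks the single best-looking result on the page shown."""
    return max(shown, key=ALL_VALUES.get)

# A/B test: each population sees only its own ranker, so every impression yields
# exactly one click for that ranker -> identical click counts, no detectable gap.
ab_clicks = {"A": sum(1 for _ in range(1000) if click_best(list(VALUES_A))),
             "B": sum(1 for _ in range(1000) if click_best(list(VALUES_B)))}

# Interleaving: mix the two result sets and credit the click to the team that
# supplied it (random.shuffle stands in for team-draft mixing here).
wins = {"A": 0, "B": 0}
for _ in range(1000):
    mixed = list(VALUES_A) + list(VALUES_B)
    random.shuffle(mixed)
    wins["A" if click_best(mixed) in VALUES_A else "B"] += 1

print(ab_clicks)   # {'A': 1000, 'B': 1000} -> A/B sees no difference
print(wins)        # {'A': 0, 'B': 1000}    -> clear preference for B
```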
A Lot More to Consider…
Peter Bailey, Nick Craswell, Ryen W. White, Liwei Chen, Ashwin Satyanarayana, and S. M. M. Tahaghoghi. Evaluating Search Systems Using Result Page Context. IIiX 2010.
Conclusion
• Test collections: reusable; fidelity is important
• A/B tests: best fidelity, less power
• Less sensitive than interleaving and NDCG
• Fidelity: real user, longitudinal, beyond ranking ("market share")
• Interleaving:
• Most sensitive; fidelity from a real user in context
• May corroborate your test collection results
• But pairwise, and the number of experiments is limited by the number of users