
Query Understanding for Relevance Measurement


Presentation Transcript


  1. Query Understanding for Relevance Measurement. Nick Craswell, Microsoft

  2. Thought Experiment: “Freshness” • Say our query model can detect “fresh intent” • For space shuttle, promote fresher results (< Wikipedia) …if “fresh intent” is detected • We have an “aggressiveness” parameter (> Wikipedia, = Wikipedia, < Wikipedia) • Systematic bias towards established pages • New judgments, updated guidelines. Claim victory! • Real users: how can we know this? Understanding the user needs underlying a query can be very difficult, even for a human relevance judge.
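
A minimal sketch of the thought experiment above; the scoring form, field names and parameter are illustrative assumptions, not the query model discussed in the talk:

    def rerank_with_freshness(results, fresh_intent, aggressiveness=0.5):
        """Hypothetical sketch: if the query model detects 'fresh intent'
        (fresh_intent in [0, 1]), boost recent pages; 'aggressiveness'
        controls how strongly freshness can override the base score."""
        def score(doc):
            return doc["base_score"] + aggressiveness * fresh_intent * doc["recency"]
        return sorted(results, key=score, reverse=True)

    # e.g. for "space shuttle" with fresh_intent=0.9 and aggressiveness=1.0,
    # a recent news page can be promoted above an established Wikipedia page;
    # with aggressiveness=0 the original ranking is unchanged.

Whether such promotion actually helps is exactly what the slide argues is hard to verify when the judgments themselves are biased towards established pages.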

  3. Goal: Fidelity of Ground Truth • Correctly capture the preferences of user/users who typed query Q • Secondary goals: Cheap, reusable, flexible, sensitive experiments. Available in academia+industry

  4. Why Should IR People Care? • Affects our results • Judging guidelines change outcome • Informational vs navigational changes outcome • Measuring diversity changes the outcome • Enterprise track: Judge expertise changes outcome • Future: Query model knows more than judge can • “This is wrong but we don’t care” seems like a weak argument • Real queries: Lots of people say it’s important • IR people should care about helping users

  5. Why Industry Cares • Be careful what you wish for (or optimize for) • Rich and powerful models • Trading off: More/less Wikipedia, IMDB vs RottenTomatoes, nav vs shopping vs reviews, diverse results, local results, personalized results, fresh results (not to mention multimedia etc.) • You get your ground truth • We have clicks • Problem: Highest-fidelity experiments, with real users, are some of the least sensitive, i.e. lack statistical power.

  6. Today’s Talk: Two Lines of Work • Test collection: right relevance criterion, real queries, and sufficient judge expertise • Alternative: Real users e.g. clicks/interleaving

  7. Traditional Cranfield/TREC • Detailed query topics • TREC: What is the economic impact of recycling tires? • Cranfield: what similarity laws must be obeyed when constructing aeroelastic models of heated high speed aircraft • Binary judgments, with a low bar

  8. TREC-8: No big MAP gains from links. In retrospect: informational task, weird queries, judging ASCII, binary judging (low bar?). Hawking, Voorhees, Craswell and Bailey. Overview of the TREC-8 Web Track. TREC, 1999.

  9. In fact, under this methodology, we beat the search engines! Hawking, Craswell, Thistlewaite and Harman. Results and Challenges in Web Search Evaluation. WWW8, 1999.

  10. 2000: Complaints → internship. Broder taxonomy. “Nav Intent” Changes the Outcome. Nick Craswell, David Hawking, Stephen Robertson. Effective Site Finding Using Link Anchor Information. SIGIR 2001.

  11. Summary: Adjusting TREC to the Web • Real queries, where possible • Reward different types of results • Homepage, topic distillation, HRel, Rel, NotRel • Smaller reward for being marginally on-topic • Hopefully even penalize spam pages • Coming up: Diversity and real judges (TrecEnt). David Hawking and Nick Craswell. The Very Large Collection and Web Tracks. In TREC: Experiment and Evaluation in Information Retrieval. Edited by Ellen M. Voorhees and Donna K. Harman. September 2005.

  12. Web Track Diversity Experiments • Modeling the fact that: • Different users want different things (ambiguous) • Individual users want a variety of things (faceted). Clarke, Craswell, Soboroff. Overview of the TREC-2009 Web Track. Clarke, Craswell, Soboroff, Cormack. Overview of the TREC-2010 Web Track.

  Example topics:

  <topic number="20" type="ambiguous">
    <query>defender</query>
    <description>I’m looking for the homepage of Windows Defender, an anti-spyware program.</description>
    <subtopic number="1" type="nav"> I’m looking for the homepage of Windows Defender, an anti-spyware program. </subtopic>
    <subtopic number="2" type="inf"> Find information on the Land Rover Defender sport-utility vehicle. </subtopic>
    <subtopic number="3" type="nav"> I want to go to the homepage for Defender Marine Supplies. </subtopic>
    <subtopic number="4" type="inf"> I’m looking for information on Defender, an arcade game by Williams. Is it possible to play it online? </subtopic>
    <subtopic number="5" type="inf"> I’d like to find user reports about Windows Defender, particularly problems with the software. </subtopic>
    <subtopic number="6" type="nav"> Take me to the homepage for the Chicago Defender newspaper. </subtopic>
  </topic>

  <topic number="47" type="faceted">
    <query>indexed annuity</query>
    <description>I’m looking for information about indexed annuities.</description>
    <subtopic number="1" type="inf"> What is an indexed annuity? What are their advantages and disadvantages? What kinds... are there? </subtopic>
    <subtopic number="2" type="inf"> Where can I buy an indexed annuity? What investment companies offer them? </subtopic>
    <subtopic number="3" type="inf"> Find ratings of indexed annuities. </subtopic>
  </topic>
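
The topic format above is plain XML, so the subtopics can be pulled out directly; a small Python sketch (assuming one <topic> element per string, as shown above):

    import xml.etree.ElementTree as ET

    def parse_topic(topic_xml):
        """Parse one TREC Web Track diversity topic into its query and subtopics."""
        topic = ET.fromstring(topic_xml)
        return {
            "number": topic.get("number"),
            "type": topic.get("type"),            # "ambiguous" or "faceted"
            "query": topic.findtext("query"),
            "subtopics": [
                {"number": s.get("number"), "type": s.get("type"),
                 "text": (s.text or "").strip()}
                for s in topic.findall("subtopic")
            ],
        }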

  13. Diversity Changes the Outcome

  14. Diversity and Fidelity • Would be great if our query model can understand the dominant intent • e.g. ako → Army Knowledge Online • Would be even better if our query models can understand the diverse intents+concepts • To have any chance of evaluating this, our ground truth has to match some real user or population of real users

  15. Subtopic Development • Hints from user data • Click data from 2009 • Ian used unigrams for topic development • House, plans, luxury • House, md, tv, show, hugh, laurie • Alternative: Query suggestion. Filip Radlinski, Martin Szummer, and Nick Craswell. Inferring Query Intent from Reformulations and Clicks. WWW 2010 poster.
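
A rough sketch of the unigram idea mentioned above (the click-log method in the cited work differs; the function and data here are purely illustrative):

    from collections import Counter

    def subtopic_hint_unigrams(seed_query, related_queries, top_n=10):
        """Count the extra unigrams that related queries (e.g. co-clicked or
        reformulated queries from a log) add to the seed query, as hints for
        subtopic development."""
        seed_terms = set(seed_query.lower().split())
        counts = Counter(
            term
            for query in related_queries
            for term in query.lower().split()
            if term not in seed_terms
        )
        return counts.most_common(top_n)

    # subtopic_hint_unigrams("house", ["house plans", "luxury house plans",
    #     "house md", "house tv show hugh laurie"])
    # -> [('plans', 2), ('luxury', 1), ('md', 1), ('tv', 1), ...]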

  16. Crowdsourcing Work with Emine Yilmaz

  17. mayo clinic jacksonvillefl • address of mayo clinic • I would be looking for information like working hours and location of the clinic. • I would be looking to learn about Mayo Clinic in Jacksonville, Florida, including medical services available and how to request an appointment. • I would be trying to find out the address of the Mayo Clinic. • I would look for information about the doctors at mayo clinic jacksonvillefl, the services available, testimonials, its location and timings. • I'd be looking for a branch of the Mayo Clinic in the town of Jacksonville, Fl. • The Mayo Clinic located in Jacksonville, Florida. (“Nice Hit.”) • Intents: 1) address, 2) hours, 3) location, 4) doctors, 5) services, 6) testimonials

  18. kenmore gas water heater • A water heater manufactured by Kenmore, which needs fuel to work. (“Nice Task.”) • Checking the price of various brands of water heaters. • I would be looking for information about 'Kenmore' brand of gas water heaters. • I would be trying to find out if the Kenmore brand of gas/water heater was any good. • I'd be looking for information about a kenmore gas water heater from sears to see whether I wanted to buy it to replace my current gas water heater. • to know the price of kenmore gas water heater • Intents: 1) information regarding purchase, 2) prices, 3) comparison to other brands (is it "any good", "various brands", etc.)

  19. fickle creek farm • An pollution free farm in the Piedmont of North Carolina, which provides fruits, vegetables, meat, eggs, dairy etc. (“Good”) • I would be looking for information about fickle creek farm, where it is located and about its produce. • I would be looking to visit the farm as a tourist. • I'd be looking for a farm called 'fickle creek.' • Intents: 1) What is it, 2) Can I visit it, 3) What does it produce

  20. travel agent salaries • I would be looking for the average salary for a travel agent, particularly in my country. • I would be trying to find out the average salary that travel agents make. • I would be trying to understand the range of salaries, travel agents are paid. • I'd be looking for the range of salaried that travel agents are paid. • If i wrote the query "travel agent salaries" to a search engine i would be looking for a travel agent job. • Making notes for the negotiation on salary during job interview. • searching for the salary of travel agent • The average salary of a Travel Agent working in Tourism industry. (“Good Hit.”) • Intents: 1) average+range

  21. Summary of Diversity Work • Measuring diversity → changed outcomes • TREC participants: New models • Future models may identify likely intents • e.g. ako → Army Knowledge Online • To correctly measure that model, we need realistic intents based on e.g. clicks or crowdsourcing

  22. Judge Expertise • Does the judge understand the query-topic? • If it is technical, we might be in trouble • ‘trie in perl’ → Wikipedia. That’s great! :-) • Shall we crowdsource: what similarity laws must be obeyed when constructing aeroelastic models of heated high speed aircraft? • TREC enterprise track: Gold-Silver-Bronze • Gold: Task expert and topic originator • Silver: Task expert who did not create the topic • Bronze: Not a task expert and did not create the topic

  23. Judge Expertise Changes the Outcome Peter Bailey, Nick Craswell, Ian Soboroff, Paul Thomas, Arjen P. de Vries and Emine Yilmaz. Relevance Assessment: Are Judges Exchangeable and Does it Matter. SIGIR 2008

  24. Summary Judge Expertise • Judging of expert queries by non-experts may be a bad idea • There are “real queries” that are unjudgeable • … don’t crowdsource the TREC Legal Track?

  25. Setup of Experiments Changes the Outcome

  26. Lab or Click-based A/B Experiments • Show systems A and B to users • Lab study: Latin square, tasks • Log study: Divide up your userbase, engagement • Problem 1: Sensitivity • A and B may not differ by much • Logs: You may need many users for a long time period • Problem 2: What is success? • Get same answer sooner? Get same answer with more confidence? Get better answer? Find a better question? Minimise frustration? Be educated? Be entertained? Trust the engine more? Like it more? • Despite problems, the gold standard Classic problem: Stop and call it neutral
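
To make “you may need many users for a long time period” concrete, here is a standard two-proportion sample-size approximation (not from the talk; the numbers are illustrative):

    from statistics import NormalDist

    def users_per_arm(p_a, p_b, alpha=0.05, power=0.8):
        """Approximate users needed in each arm of an A/B test to detect a
        difference between two click-through rates p_a and p_b."""
        z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
        z_beta = NormalDist().inv_cdf(power)
        variance = p_a * (1 - p_a) + p_b * (1 - p_b)
        return (z_alpha + z_beta) ** 2 * variance / (p_a - p_b) ** 2

    # A 1% relative lift on a 0.30 click rate (0.300 vs 0.303) needs roughly
    # 370,000 users per arm at alpha=0.05 and 80% power.
    print(round(users_per_arm(0.300, 0.303)))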

  27. Interleaving • 2002: Thorsten Joachims [1,2] at Cornell. Presented at Google • 2008 CIKM paper: [3] interleaving can detect large differences reliably. Traditional A/B tests were far less sensitive. Experiments for [3,5] were done on the arXiv.org e-print archive • SIGIR 2010 papers: • Bing paper [4] finds interleaving agrees with NDCG and is more sensitive • Yahoo paper [5] makes interleaving even more sensitive • WSDM 2011 paper: [6] Firefox toolbar, personalization, agree with labels • Interleaving is in use at search engines, Cornell, Cambridge. [1] T. Joachims. Evaluating Retrieval Performance Using Clickthrough Data. SIGIR Workshop on Mathematical/Formal Methods in Information Retrieval, 2002. [2] T. Joachims. Optimizing Search Engines Using Clickthrough Data. KDD 2002. [3] F. Radlinski, M. Kurup, T. Joachims. How Does Clickthrough Data Reflect Retrieval Quality? CIKM 2008. [4] F. Radlinski and N. Craswell. Comparing the Sensitivity of Information Retrieval Metrics. SIGIR 2010. [5] Y. Yue, Y. Gao, O. Chapelle, Y. Zhang, and T. Joachims. Learning More Powerful Test Statistics for Click-Based Retrieval Evaluation. SIGIR 2010. [6] N. Matthijs, F. Radlinski. Personalizing Web Search Using Long Term Browsing History. WSDM 2011.

  28. Interleaving • Team-draft interleaving algorithm • Input: Two rankings (team captains) • Repeat: • Coin toss: Who picks next • Winner picks their best remaining* player • Loser picks their best remaining* player • Output: One ranking (2 teams of 5) • Team with most clicks wins • Notes: Over many repetitions, each team has a fair chance due to the coin toss. The lower you rank a player, the less likely they are on your team. (Figure: an example interleaved ranking with three clicks; the Green team wins.) * Remaining: not picked and not a near-dupe/redirect of a picked result. F. Radlinski, M. Kurup, T. Joachims. How Does Clickthrough Data Reflect Retrieval Quality? CIKM 2008.
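
A compact Python sketch of the team-draft procedure described above (the near-duplicate/redirect handling from the footnote is omitted; function and variable names are illustrative):

    import random

    def team_draft_interleave(ranking_a, ranking_b, k=10):
        """Merge two rankings into one k-result list, remembering which
        'team' (ranker) contributed each result."""
        interleaved, team_a, team_b = [], set(), set()
        while len(interleaved) < k and set(ranking_a + ranking_b) - set(interleaved):
            # The team with fewer picks goes next; ties are broken by a coin
            # toss, so over many impressions each team gets a fair chance.
            a_picks = len(team_a) < len(team_b) or (
                len(team_a) == len(team_b) and random.random() < 0.5)
            ranking, team = (ranking_a, team_a) if a_picks else (ranking_b, team_b)
            remaining = [d for d in ranking if d not in interleaved]
            if not remaining:  # this team's list is exhausted; the other picks
                ranking, team = (ranking_b, team_b) if a_picks else (ranking_a, team_a)
                remaining = [d for d in ranking if d not in interleaved]
            interleaved.append(remaining[0])
            team.add(remaining[0])
        return interleaved, team_a, team_b

    def score_impression(clicked, team_a, team_b):
        """Credit the impression: the team with more clicked results wins."""
        a, b = len(set(clicked) & team_a), len(set(clicked) & team_b)
        return 'A' if a > b else 'B' if b > a else 'tie'

Aggregating score_impression over many impressions gives the overall preference between the two rankers.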

  29. Interleaving vs NDCG • Interleaving is highly sensitive: 1 query → 10 impressions. F. Radlinski and N. Craswell. Comparing the Sensitivity of Information Retrieval Metrics. SIGIR 2010.
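
For reference, the NDCG being compared against here is the standard graded-gain measure; a minimal sketch (not code from the cited paper):

    from math import log2

    def ndcg_at_k(gains, ideal_gains=None, k=10):
        """gains: graded relevance judgments of the ranked results, in rank
        order. ideal_gains: all judged gains for the query (defaults to
        'gains', which is a simplification of the usual judgment pool)."""
        def dcg(g):
            return sum((2 ** rel - 1) / log2(i + 2) for i, rel in enumerate(g[:k]))
        ideal = dcg(sorted(ideal_gains if ideal_gains is not None else gains, reverse=True))
        return dcg(gains) / ideal if ideal > 0 else 0.0

    # ndcg_at_k([3, 0, 2, 1])  ->  about 0.95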

  30. Interleaving vs NDCG: Further Notes • Interleaving: Real user, expertise, context • Interleaving: No guidelines • NDCG: Judge sees “landing page”, can consider nuances of quality • Interleaving preference is based on “caption” • Content farms optimize captions to attract clicks • NDCG: Reusable test collection • Interleaving: Less overfitting

  31. Interleaving vs A/B • Why is interleaving more sensitive to ranking changes than A/B tests? • Say results are worth $$. For a certain query: Ranker A offers $11, $10 or $2 → user clicks. Ranker B offers $250, $250 or $95 → user also clicks. • Users of A may not know what they’re missing • Difference in behavior is small, hard to discern • But interleave A & B → strong preference for B • Direct preference, within user, so highly sensitive
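
A toy illustration of the argument above, under the assumed (and oversimplified) user model that the user clicks the single most valuable-looking result on the page:

    def best_click(results):
        """Toy user model: click the one result that looks most valuable."""
        return max(results, key=lambda r: r[1])

    ranker_a = [("a1", 11), ("a2", 10), ("a3", 2)]
    ranker_b = [("b1", 250), ("b2", 250), ("b3", 95)]

    # A/B: each ranker is shown to separate users; both produce one click per
    # impression, so a click-count metric sees almost no difference.
    click_a, click_b = best_click(ranker_a), best_click(ranker_b)

    # Interleaved: A's and B's results compete on the same page, and the
    # click lands on a B result, a direct within-user preference for B.
    interleaved = ranker_a[:2] + ranker_b[:2]
    winner = "B" if best_click(interleaved)[0].startswith("b") else "A"
    print(winner)  # -> B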

  32. A Lot More to Consider… Peter Bailey, Nick Craswell, Ryen W. White, Liwei Chen, Ashwin Satyanarayana, and S. M. M. Tahaghoghi. Evaluating search systems using result page context. IIiX 2010.

  33. Conclusion • Test collections: Reusable, fidelity is important • A/B tests: Best fidelity, less power • Less sensitive than interleaving and NDCG • Fidelity: Real user, longitudinal, beyond ranking • “Market share” • Interleaving • Most sensitive, fidelity from real user in context • May corroborate your test collection results • But pairwise, and #experiments limited by #users
