Risk-sensitive Information Retrieval. Kevyn Collins-Thompson Associate Professor, University of Michigan. FIRE Invited talk, Friday Dec. 6, 2013. We tend to remember that 1 failure, rather than the previous 200 successes. Current retrieval algorithms work well on average across queries….
Risk-sensitive Information Retrieval Kevyn Collins-Thompson Associate Professor, University of Michigan FIRE Invited talk, Friday Dec. 6, 2013
We tend to remember that 1 failure, rather than the previous 200 successes
Current retrieval algorithms work well on average across queries… [Figure: histogram of queries hurt and queries helped for query expansion, the current state-of-the-art method; legend: Model > Baseline, Model ≤ Baseline; Mean Average Precision gain: +30%]
…but are high risk = significant expectation of failure due to high variance across individual queries. Failure = your algorithm makes results worse than if it had not been applied. This is one of the reasons that even state-of-the-art algorithms are impractical for many real-world scenarios. [Figure: histogram of queries hurt and queries helped for query expansion, the current state-of-the-art method; legend: Model > Baseline, Model ≤ Baseline]
We want more robust IR algorithms having as objective: 1. Maximize average effectiveness 2. Minimize risk of significant failures [Figure: histograms of queries hurt and queries helped for query expansion, comparing the current state-of-the-art method (average gain: +30%) with the robust version (average gain: +30%)]
Defining risk and reward in IR • Reward = Effectiveness measure • NDCG, ERR, MAP, … • Define failure for a single query • Typically relative to a baseline • e.g. 25% loss in MAP • e.g. Query hurt (ΔMAP < 0) • Risk = aggregate failure across queries • e.g. P(> 25% MAP loss) • e.g. Average NDCG loss > 10% • e.g. # of queries hurt
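The risk definitions on this slide can be sketched in a few lines. This is a minimal illustration, not code from the talk: the function names and the per-query scores are hypothetical, and queries with a zero baseline score are assumed never to count as relative-loss failures.

```python
from typing import Dict


def queries_hurt(baseline: Dict[str, float], model: Dict[str, float]) -> int:
    """Number of queries whose score is lower than the baseline (delta < 0)."""
    return sum(1 for q in baseline if model[q] - baseline[q] < 0)


def prob_relative_loss(baseline: Dict[str, float], model: Dict[str, float],
                       threshold: float = 0.25) -> float:
    """Fraction of queries losing more than `threshold` relative to the baseline."""
    failures = 0
    for q, b in baseline.items():
        # Assumption: a relative loss is undefined when the baseline score is 0.
        if b > 0 and (b - model[q]) / b > threshold:
            failures += 1
    return failures / len(baseline)


def mean_gain(baseline: Dict[str, float], model: Dict[str, float]) -> float:
    """Average per-query gain over the baseline (the usual 'reward' view)."""
    return sum(model[q] - baseline[q] for q in baseline) / len(baseline)


# Hypothetical per-query average precision values.
baseline_ap = {"q1": 0.40, "q2": 0.50, "q3": 0.20, "q4": 0.30}
model_ap = {"q1": 0.60, "q2": 0.30, "q3": 0.25, "q4": 0.29}

print(queries_hurt(baseline_ap, model_ap))
print(prob_relative_loss(baseline_ap, model_ap))
print(mean_gain(baseline_ap, model_ap))
```

In this made-up data the average gain is positive even though half of the queries are hurt, which is the situation the talk warns about.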
Some examples of risky IR operations • Query rewriting and expansion • Spelling correction, common word variants, synonyms and related words, acronym normalization, … • Baseline: the unmodified query • Personalized search • Trying to disambiguate queries, given unknown user intent • Personalized, groupized and contextual re-ranking • Baseline: the original, non-adjusted ranking. Or: ranking from previous version of ranking algorithm. • Resource allocation • Choice of index tiering, collection selection
Example: Gain/loss distribution of topic-based personalization across queries [Sontag et al. WSDM 2012] Relative to Bing production ranking DDR 2012 Seattle
Another example: Gain/loss distribution of location-based personalization across queries [Bennett et al., SIGIR 2011] P(Loss > 20%) = 8% when ranking is affected DDR 2012 Seattle
The three key points of this talk • Many key IR operations are risky to apply. • This risk can often be reduced by better algorithm design. • Evaluation should include risk analysis. • Look at the nature of gain and loss distribution • Not just averages.
This risk-reward tradeoff occurs again and again in search… but is often ignored • A search engine provider must choose between two personalization algorithms: • Algorithm A has expected NDCG gain = +2.5 points • But P(Loss > 20%) = 60% • Algorithm B has NDCG gain = +2.1 points • But P(Loss > 20%) = 10% • Which one will be deployed?
Algorithm deployment typically driven by focus on average NDCG/ERR/MAP/… gain • Little/no consideration of downside risk. • Benefits of reducing risk: • User perception: failures are memorable • Desire to avoid churn – predictability, stability • Increased statistical power of experiments • Goal: Understand, optimize, and control risk-reward tradeoffs in search algorithms
Motivating questions • How can effectiveness and robustness be jointly optimized for key IR tasks? • What tradeoffs are possible? • What are effective definitions of “risk” for different IR tasks? • When and how can search engines effectively “hedge” their bets for uncertain choices? • How can we improve our valuation models for more complex needs, multiple queries or sessions?
Scenario 1: Query expansion [Collins-Thompson, NIPS 2008; CIKM 2009]
Example: Ignoring aspect balance increases algorithm risk. Hypothetical query: ‘merit pay law for teachers’. Expansion term weights: court 0.026, appeals 0.018, federal 0.012, employees 0.010, case 0.010, education 0.009, school 0.008, union 0.007, seniority 0.007, salary 0.006. The legal aspect is modeled… BUT the education & pay aspects are thrown away.
A better approach is to optimize selection of terms as a set. Hypothetical query: ‘merit pay law for teachers’. Expansion term weights: court 0.026, appeals 0.018, federal 0.012, employees 0.010, case 0.010, education 0.009, school 0.008, union 0.007, seniority 0.007, salary 0.006. More balanced query model. Empirical evidence: Udupa, Bhole and Bhattacharya. ICTIR 2009
Using financial optimization based on portfolio theory to mitigate risk in query expansion [Collins-Thompson, NIPS 2008] • Reward: • Baseline provides initial weight vector c • Prefer words with higher c_i values: R(x) = c^T x • Risk: • Model uncertainty in c using a covariance matrix Σ • Model uncertainty in Σ using regularized Σ_γ = Σ + γD • Diagonal: captures individual term variance (term centrality) • Off-diagonal: term covariance (co-occurrence/term association) • Combined objective: • Markowitz-type model
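The combined objective itself did not survive in this transcript, so the sketch below only assumes the standard mean-variance (Markowitz) form: reward minus a risk-aversion multiple of the quadratic risk term. The trade-off parameter `kappa`, the choice of `D` as a diagonal matrix, and all numeric values are assumptions for illustration; the constraints on word sets used in the talk are not modeled here.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def reward(c: Vector, x: Vector) -> float:
    """R(x) = c^T x: prefer terms with higher baseline weights c_i."""
    return sum(ci * xi for ci, xi in zip(c, x))


def regularized_cov(sigma: Matrix, d: Matrix, gamma: float) -> Matrix:
    """Sigma_gamma = Sigma + gamma * D."""
    return [[s + gamma * dd for s, dd in zip(s_row, d_row)]
            for s_row, d_row in zip(sigma, d)]


def risk(sigma_gamma: Matrix, x: Vector) -> float:
    """Quadratic risk x^T Sigma_gamma x."""
    n = len(x)
    return sum(x[i] * sigma_gamma[i][j] * x[j]
               for i in range(n) for j in range(n))


def objective(c: Vector, sigma: Matrix, d: Matrix, x: Vector,
              gamma: float, kappa: float) -> float:
    """Assumed Markowitz-type score: reward - (kappa / 2) * risk."""
    return reward(c, x) - 0.5 * kappa * risk(regularized_cov(sigma, d, gamma), x)


# Hypothetical two-term example.
c = [1.0, 2.0]
x = [1.0, 0.5]
sigma = [[1.0, 0.0], [0.0, 1.0]]
d = [[1.0, 0.0], [0.0, 1.0]]
print(objective(c, sigma, d, x, gamma=1.0, kappa=2.0))
```

A solver would search over x to maximize this score; the function above only evaluates one candidate term weighting.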
Black-box approach works with any expansion algorithm via post-process optimizer [Collins-Thompson, NIPS 2008] [Diagram: Query → top-ranked documents (or other source of term associations) → baseline expansion algorithm → convex optimizer, with constraints on word sets → robust query model] • Word graph (Σ): • Individual term risk (diagonal) • Conditional term risk (off-diagonal) We don’t assume the baseline is good or reliable!
Controlling the risk of using query expansion terms (REXP algorithm) • Aspect balance • Term centrality • Aspect coverage
Example solution output. Query: parkinson’s disease
Baseline expansion: parkinson 0.996, disease 0.848, syndrome 0.495, disorders 0.492, parkinsons 0.491, patient 0.483, brain 0.360, patients 0.313, treatment 0.289, diseases 0.153, alzheimers 0.114 ...and 90 more...
Post-processed robust version: parkinson 0.9900, disease 0.9900, syndrome 0.2077, parkinsons 0.1350, patients 0.0918, brain 0.0256 (all other terms zero)
Evaluating Risk-Reward Tradeoffs: Introducing Risk-Reward Curves • Given a baseline Mb, can we improve average effectiveness over Mb without hurting too many queries? • Robust algorithm: higher effectiveness for any given level of risk [Figure: risk-reward curves for a risk-averse model and a gain-only model; x-axis: Risk (Probability of Failure); y-axis: Average Effectiveness (over baseline)]
Risk-reward curves as a function of algorithm risk-aversion parameter
Risk-reward curves: Algorithm A dominates algorithm B with consistently superior tradeoff. Curves UP and to the LEFT are better. [Figure: risk-reward curves for Algorithm A and Algorithm B]
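Dominance between two risk-reward curves can be checked pointwise. The sketch below assumes each curve is a list of (risk, reward) points, one per risk-aversion setting, and uses a simple hypothetical definition: A dominates B if every point on B is matched by some point on A with no more risk and no less reward. The curve values are made up.

```python
from typing import List, Tuple

Curve = List[Tuple[float, float]]  # (risk, reward) points


def dominates(curve_a: Curve, curve_b: Curve) -> bool:
    """True if every point of B has a point of A that is up and to the left of it."""
    return all(
        any(risk_a <= risk_b and reward_a >= reward_b
            for risk_a, reward_a in curve_a)
        for risk_b, reward_b in curve_b
    )


# Hypothetical curves: risk = probability of failure, reward = average gain.
algorithm_a = [(0.10, 0.10), (0.20, 0.20), (0.30, 0.25)]
algorithm_b = [(0.15, 0.08), (0.25, 0.15), (0.35, 0.22)]

print(dominates(algorithm_a, algorithm_b))
print(dominates(algorithm_b, algorithm_a))
```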
Risk-aversion parameter in query expansion: weight given to original vs expansion query
Robust version significantly reduces the worst expansion failures
Aspect constraints are well-calibrated to actual expansion benefit • About 15% of queries have infeasible programs (constraints can’t be satisfied) • Infeasible → No expansion
Scenario 2: Risk-sensitive objectives in learning to rank [Wang, Bennett, Collins-Thompson SIGIR 2012]
What Causes Risk in Ranking? Significant differences exist between queries • Click entropies, clarity, length • Transactional, informational, navigational Many ways to rank / re-rank • What features to use? • What learning algorithm to use? • How much personalization? “Risk”: One intuitive definition: probability that this is the wrong technique for a particular query (i.e. hurts performance relative to baseline)
Framing the Learning Problem [Diagram labels: training data, baseline model, model class, objective, learning, ranking model, query, documents, top-K ranked retrieval] CHALLENGES: • Ranking model? Low-risk and effective (relative to baseline) • Optimization objective? Captures risk & reward • How to learn? Optimally balance risk & reward
A Combined Risk-Reward Optimization Objective • Reward: average positive gain (over all queries) • Risk: average negative gain (over all queries) • Objective: T(α) = Reward – (1+α) Risk [Figure: histogram of # queries by gain of the new model relative to the baseline, split into queries hurt and queries helped]
A General Family of Risk-Sensitive Objectives • Can substitute in any effectiveness measure • Objective: T(α) = Reward – (1+α) Risk • Gives a family of tradeoff objectives that captures a spectrum of risk/reward tradeoffs • Some special cases: • α = 0: standard average performance optimization (high reward, high risk) • α = very large (low risk, low reward) • Robustness of model increases with larger α • Optimal value of α can be chosen based on application
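The tradeoff objective T(α) = Reward – (1+α) Risk can be computed directly from per-query gains over the baseline. This is a minimal sketch with hypothetical names and data, following the definitions on the previous slide: reward is the average positive gain and risk is the average negative gain, both averaged over all queries.

```python
from typing import List


def risk_reward_tradeoff(deltas: List[float], alpha: float) -> float:
    """T(alpha) = Reward - (1 + alpha) * Risk over per-query gains `deltas`."""
    n = len(deltas)
    reward = sum(max(d, 0.0) for d in deltas) / n   # average positive gain
    risk = sum(max(-d, 0.0) for d in deltas) / n    # average magnitude of loss
    return reward - (1.0 + alpha) * risk


# Hypothetical per-query gains (model score minus baseline score).
deltas = [0.20, -0.10, 0.05, -0.05]

for alpha in (0, 1, 5, 10):
    print(alpha, risk_reward_tradeoff(deltas, alpha))
```

With α = 0 the value equals the plain average gain; larger α penalizes the losses more heavily, so a model that looks good on average can score negatively.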
Integrating Risk-Sensitive Objective into LambdaMART • Extension of LambdaMART (MART + LambdaRank) • Each tree models gradient of tradeoff wrt doc scores [Figure: sum of regression trees; for documents i and j sorted by scores, the gradient combines the derivative of cross-entropy with the change in tradeoff due to swapping i and j; annotations: heavily promote, heavily penalize, queries hurt, queries helped]
Experiment Setup • Task: Personalization • Dataset: Location (Bennett et al., 2011) • Selective per-query strategy: Min location entropy • Low location entropy predicts likely local intent • Baseline: Re-ranking model learned on all personalization features
Risk-sensitive re-ranking for location personalization (α = 0, no risk-aversion)
Risk-sensitive re-ranking for location personalization (α = 1, mild risk-aversion)
Risk-sensitive re-ranking for location personalization (α = 5, medium risk-aversion)
Risk-sensitive re-ranking for location personalization (α = 10, highly risk-averse) P(Loss > 20%) → 0 while maintaining significant gains DDR 2012 Seattle
TREC Web Track 2013: Promoting research on risk-sensitive retrieval • New collection: • ClueWeb12 • New task: • Risk-sensitive retrieval • New topics: • Single + multi-faceted topics
Participating groups • TREC 2013: 15 groups, 61 runs (TREC 2012: 12 groups, 48 runs) • Groups: TU Delft (CWI); TU Delft (wistud); Univ. Montreal; OmarTech, Beijing; Chinese Acad. Sciences; MSR/CMU; RMIT; Technion; Univ. Delaware (Fang); Univ. Delaware (udel); Jiangsu Univ.; Univ. Glasgow; Univ. Twente; Univ. Waterloo; Univ. Weimar • Automatic runs: 53 • Manual runs: 8 • Category A runs: 52 • Category B runs: 9
Topic development • Multi-faceted vs single-faceted topics • Faceted type and structure were not revealed until after run submission • Initial topic release provided the query only • 201: raspberry pi • 202: uss carl vinson • 203: reviews of les miserable • 204: rules of golf • 205: average charitable donation
Example multi-faceted topics showing informational, navigational subtopics
<topic number="235" type="faceted">
  <query>ham radio</query>
  <description>How do you get a ham radio license?</description>
  <subtopic number="1" type="inf">How do you get a ham radio license?</subtopic>
  <subtopic number="2" type="nav">What are the ham radio license classes?</subtopic>
  <subtopic number="3" type="inf">How do you build a ham radio station?</subtopic>
  <subtopic number="4" type="inf">Find information on ham radio antennas.</subtopic>
  <subtopic number="5" type="nav">What are the ham radio call signs?</subtopic>
  <subtopic number="6" type="nav">Find the web site of Ham Radio Outlet.</subtopic>
</topic>
Example single-facet topics
<topic number="227" type="single">
  <query>i will survive lyrics</query>
  <description>Find the lyrics to the song "I Will Survive".</description>
</topic>
<topic number="229" type="single">
  <query>beef stroganoff recipe</query>
  <description>Find complete (not partial) recipes for beef stroganoff.</description>
</topic>
Track instructions • Via github, participants were provided: • Baseline runs (ClueWeb09 and ClueWeb12) • Risk-sensitive versions of standard evaluation tools • Compute risk-sensitive versions of ERR-IA, NDCG, etc. • gdeval, ndeval: new alpha parameter • Ad-hoc task • Submit up to 3 runs, each with top 10k results, etc. • Risk-sensitive task • Submit up to 3 runs: alpha = 1,5,10 • Could perform new retrieval, not just re-ranking • Participants asked to self-identify alpha-level for each run
Baseline run for risk evaluation • Goals: • Good ad-hoc effectiveness (ERR and NDCG) • Standard, easily reproducible algorithm • Approach: • Selected based on ClueWeb09 performance • RM3 pseudo-relevance feedback from Indri retrieval engine. • For each query: • 10 feedback documents, 20 feedback terms • Linear interpolation weight of 0.60 with original query. • Waterloo spam classifier filtered out all documents with percentile-score < 70.
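The linear interpolation step of the baseline can be illustrated as follows. This is a generic sketch, not Indri's implementation: the function names and term weights are hypothetical, and it assumes the 0.60 weight is applied to the original query model with the remaining 0.40 going to the feedback model built from the top 20 feedback terms.

```python
from typing import Dict


def normalize(weights: Dict[str, float]) -> Dict[str, float]:
    """Scale term weights so they sum to 1 (empty result if all weights are 0)."""
    total = sum(weights.values())
    if total <= 0:
        return {}
    return {term: w / total for term, w in weights.items()}


def interpolate_query_model(original: Dict[str, float],
                            feedback: Dict[str, float],
                            original_weight: float = 0.60,
                            max_terms: int = 20) -> Dict[str, float]:
    """Mix the original query model with the top feedback terms."""
    # Keep only the highest-weighted feedback terms.
    top = dict(sorted(feedback.items(), key=lambda kv: kv[1], reverse=True)[:max_terms])
    orig_n, fb_n = normalize(original), normalize(top)
    terms = set(orig_n) | set(fb_n)
    return {t: original_weight * orig_n.get(t, 0.0)
               + (1.0 - original_weight) * fb_n.get(t, 0.0)
            for t in terms}


# Hypothetical query and feedback term weights.
original_query = {"ham": 0.5, "radio": 0.5}
feedback_terms = {"radio": 0.5, "antenna": 0.3, "license": 0.2}
expanded = interpolate_query_model(original_query, feedback_terms)
print(sorted(expanded.items()))
```

The same interpolation weight is what the earlier query expansion slides treat as a risk-aversion parameter: more weight on the original query means less exposure to bad expansion terms.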
Ad-hoc run performance (ERR@10) by topic [Figure: two panels, Topics 201-225 and Topics 226-250; baseline in red]