Putting Query Representation and Understanding in Context: A Decision-Theoretic Framework for Optimal Interactive Retrieval through Dynamic User Modeling

ChengXiang Zhai
Department of Computer Science, University of Illinois at Urbana-Champaign
Including joint work with Xuehua Shen and Bin Tan

SIGIR 2010 Workshop on Query Representation and Understanding, July 23, 2010, Geneva, Switzerland
What is a query?
• Query = a sequence of keywords? (e.g., typing "iPhone battery" into a search box)
• Really, a query is a sequence of keywords that describe the information need of a particular user at a particular time for finishing a particular task
• Rich context!
A query must be put in context
• Example: searching for "Jaguar": car? animal? Mac OS?
• What queries did the user type in before this query?
• What documents were just viewed by this user?
• What documents were skipped by this user?
• What other users looked for similar information?
• ...
Context helps query understanding. Suppose we know:
1. The previous query was "racing cars" vs. "Apple OS"
2. "car" occurs far more frequently than "Apple" in pages browsed by the user in the last 20 days
3. The user just viewed an "Apple OS" document
[Diagram: each cue shifts the likely interpretation of the query "Jaguar" among car, animal, and software.]
Questions
• How can we model a query in a context-sensitive way?
  • Generalize query representation to a user model
• How can we model the dynamics of user information needs?
  • Dynamic updating of user models
• How can we put query representation into a retrieval framework to improve search?
  • A framework for optimal interactive retrieval
Rest of the talk: the UCAIR Project
1. A decision-theoretic framework
2. Statistical language models for implicit feedback (personalized search without extra user effort)
3. Open challenges
UCAIR Project
• UCAIR = User-Centered Adaptive IR
  • user modeling ("user-centered")
  • search-context modeling ("adaptive")
  • interactive retrieval
• Implemented as a personalized search agent that
  • sits on the client side (owned by the user)
  • integrates information around a user (1 user vs. N sources, as opposed to 1 source vs. N users)
  • can collaborate with other users' agents
  • goes beyond search toward task support
Main Idea: Putting the User in the Center!
A search agent can know about a particular user very well.
[Diagram: a client-side personalized search agent sits between the user and multiple Web search engines; a query such as "java" is interpreted using the user's query history, viewed Web pages, email, and desktop files.]
1. A Decision-Theoretic Framework for Optimal Interactive Retrieval
IR as Sequential Decision Making
The user (with an information need) and the system (with a model of that information need) take turns:
• A1: user enters a query → system decides which documents to present and how to present them → Ri: results (i = 1, 2, 3, ...)
• User decides which documents to view
• A2: user views a document → system decides which part of the document to show and how → R': document content
• A3: user clicks the "Back" button → system decides whether to show more
Retrieval Decisions
• History: H = {(A_i, R_i)}, i = 1, ..., t-1
• User U issues actions A_1, A_2, ..., A_{t-1}, A_t; the system has responded with R_1, R_2, ..., R_{t-1}. What should R_t be?
• Given U, C (the document collection), A_t, and H, choose the best R_t from r(A_t), the set of all possible responses to A_t
• Example: A_t = enter the query "Jaguar" → r(A_t) = all possible rankings of C; the best R_t is the best ranking for the query
• Example: A_t = click the "Next" button → r(A_t) = all possible rankings of unseen docs; the best R_t is the best ranking of the unseen docs
A Risk Minimization Framework
• Observed: user U, interaction history H, current user action A_t, document collection C
• All possible responses: r(A_t) = {r_1, ..., r_n}
• Inferred user model: M = (S, θ_U, ...), where S = seen documents and θ_U = information need
• Loss function: L(r_i, A_t, M)
• The optimal response r* (minimum loss) minimizes the Bayes risk:

r* = arg min_{r ∈ r(A_t)} ∫ L(r, A_t, M) P(M | U, H, A_t, C) dM
A Simplified Two-Step Decision-Making Procedure
• Approximate the Bayes risk by the loss at the mode of the posterior distribution
• Two-step procedure (see the sketch below):
  • Step 1: Compute an updated user model M* = arg max_M P(M | U, H, A_t, C) based on the currently available information
  • Step 2: Given M*, choose the response R_t = arg min_{r ∈ r(A_t)} L(r, A_t, M*) that minimizes the loss function
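A minimal, illustrative sketch of this two-step procedure (all names are hypothetical placeholders, not part of the UCAIR implementation; the posterior, loss, and response-enumeration functions are supplied by the caller):

```python
# Two-step approximation of risk minimization:
# Step 1 picks the posterior mode M*; Step 2 minimizes the loss given M*.

def two_step_response(user, history, action, collection,
                      candidate_models, posterior, loss, responses):
    """posterior(m, ...) ~ P(M | U, H, A_t, C); loss(r, A_t, M) ~ L(r, A_t, M)."""
    # Step 1: point estimate of the user model (mode of the posterior).
    m_star = max(candidate_models,
                 key=lambda m: posterior(m, user, history, action, collection))
    # Step 2: choose the response with minimum loss under M*.
    return min(responses(action), key=lambda r: loss(r, action, m_star))
```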
Optimal Interactive Retrieval
[Diagram: user U interacts with the IR system over collection C in a loop:
A_1 → infer M*_1 from P(M_1 | U, H, A_1, C) → return R_1 minimizing L(r, A_1, M*_1) →
A_2 → infer M*_2 from P(M_2 | U, H, A_2, C) → return R_2 minimizing L(r, A_2, M*_2) → A_3 → ...]
Refinement of Risk Minimization
• r(A_t): decision space (A_t dependent)
  • r(A_t) = all possible subsets of C (document selection)
  • r(A_t) = all possible rankings of docs in C
  • r(A_t) = all possible rankings of unseen docs
  • r(A_t) = all possible subsets of C + summarization strategies
• M: user model
  • Essential component: θ_U = user information need
  • S = seen documents
  • n = "topic is new to the user"
• L(R_t, A_t, M): loss function
  • Generally measures the utility of R_t for a user modeled as M
  • Often encodes retrieval criteria (e.g., using θ_U to select a ranking of docs)
• P(M | U, H, A_t, C): user model inference
  • Often involves estimating a unigram language model θ_U
Case 1: Context-Insensitive IR
• A_t = "enter a query Q"
• r(A_t) = all possible rankings of docs in C
• M = θ_U, a unigram language model (word distribution)
• p(M | U, H, A_t, C) = p(θ_U | Q)
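In this context-insensitive case the framework reduces to standard language-model retrieval: rank documents by the likelihood of the query under each document's smoothed unigram model. A minimal sketch, assuming Dirichlet smoothing (the μ value and the add-one floor for unseen words are illustrative choices, not prescribed by the slides):

```python
import math
from collections import Counter

def query_likelihood(query_terms, doc_terms, coll_tf, coll_len, mu=2000.0):
    """Score log p(Q | theta_D) under a Dirichlet-smoothed document LM."""
    doc_tf = Counter(doc_terms)
    score = 0.0
    for w in query_terms:
        # Collection LM with a crude add-one floor so unseen words get p > 0.
        p_coll = (coll_tf.get(w, 0) + 1) / (coll_len + 1)
        p_w = (doc_tf[w] + mu * p_coll) / (len(doc_terms) + mu)
        score += math.log(p_w)
    return score

# Sorting the docs in C by this score realizes "the best ranking of C".
```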
Case 2: Implicit Feedback
• A_t = "enter a query Q"
• r(A_t) = all possible rankings of docs in C
• M = θ_U, a unigram language model (word distribution)
• H = {previous queries} + {viewed snippets}
• p(M | U, H, A_t, C) = p(θ_U | Q, H)
Case 3: General Implicit Feedback
• A_t = "enter a query Q", or click the "Back" or "Next" button
• r(A_t) = all possible rankings of unseen docs in C
• M = (θ_U, S), where S = seen documents
• H = {previous queries} + {viewed snippets}
• p(M | U, H, A_t, C) = p(θ_U | Q, H)
Case 4: User-Specific Result Summary
• A_t = "enter a query Q"
• r(A_t) = {(D, π)}, where D ⊆ C, |D| = k, and π ∈ {"snippet", "overview"}
• M = (θ_U, n), n ∈ {0, 1}: "topic is new to the user"
• p(M | U, H, A_t, C) = p(θ_U, n | Q, H), with posterior mode M* = (θ*, n*)
• Decision: choose the k docs most relevant to θ*; if the topic is new (n* = 1), give an overview summary, otherwise a regular snippet summary
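A minimal sketch of the resulting decision rule (function and variable names are hypothetical illustrations of the slide's logic):

```python
def present_results(ranked_docs, n_star, k=10):
    """Select the k most relevant docs and a summary mode based on n*."""
    top_k = ranked_docs[:k]                          # k most relevant docs
    mode = "overview" if n_star == 1 else "snippet"  # new topic -> overview
    return top_k, mode
```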
2. Statistical Language Models for Implicit Feedback (Personalized Search without Extra User Effort)
Risk Minimization for Implicit Feedback
• A_t = "enter a query Q"
• r(A_t) = all possible rankings of docs in C
• M = θ_U, a unigram language model (word distribution)
• H = {previous queries} + {viewed snippets}
• p(M | U, H, A_t, C) = p(θ_U | Q, H) → need to estimate a context-sensitive LM
Estimate a Context-Sensitive LM
User model = query history + clickthrough history:
• Q1 (e.g., "Apple software"), with clickthrough C1 = {C1,1, C1,2, C1,3, ...}, e.g., "Apple - Mac OS X: The Apple Mac OS X product page. Describes features in the current version of Mac OS X, ..."
• Q2, with clickthrough C2 = {C2,1, C2,2, C2,3, ...}
• ...
• Current query Qk (e.g., "Jaguar")
Short-term vs. long-term implicit feedback
• Short-term implicit feedback
  • context = current retrieval session
  • past queries in the context are closely related to the current query
  • clickthroughs reflect the user's current interests
• Long-term implicit feedback
  • context = all search interaction history
  • not all past queries/clickthroughs are related to the current query
"Bayesian interpolation" for short-term implicit feedback
• Average the past queries Q1, ..., Qk-1 into a query-history model and the clickthroughs C1, ..., Ck-1 into a clickthrough-history model
• Use them as a Dirichlet prior when estimating the model for the current query Qk:

p(w | θ_k) = ( c(w, Qk) + μ p(w | θ_HQ) + ν p(w | θ_HC) ) / ( |Qk| + μ + ν )

• Intuition: trust the current query Qk more if it's longer
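A sketch of this estimator, following the BayesInt model of Shen et al. (SIGIR 2005); the helper names and token-list input format are illustrative assumptions, and the defaults match the μ = 0.0, ν = 5.0 setting reported on the results slide below:

```python
from collections import Counter

def bayes_int(current_query, past_queries, clicked_snippets, mu=0.0, nu=5.0):
    """p(w|theta_k) = (c(w,Qk) + mu*p(w|theta_HQ) + nu*p(w|theta_HC))
                      / (|Qk| + mu + nu)
    Inputs are token lists; past_queries/clicked_snippets are lists of them."""
    def avg_model(texts):
        # Average the maximum-likelihood unigram models of the given texts.
        models = []
        for t in texts:
            counts = Counter(t)
            total = sum(counts.values())
            models.append({w: n / total for w, n in counts.items()})
        if not models:
            return {}
        vocab = set().union(*(m.keys() for m in models))
        return {w: sum(m.get(w, 0.0) for m in models) / len(models)
                for w in vocab}

    theta_hq = avg_model(past_queries)      # query-history model
    theta_hc = avg_model(clicked_snippets)  # clickthrough-history model
    cq = Counter(current_query)
    vocab = set(cq) | set(theta_hq) | set(theta_hc)
    denom = len(current_query) + mu + nu
    # The counts c(w, Qk) grow with query length while mu and nu stay fixed,
    # so a longer current query dominates the history: "trust Qk more".
    return {w: (cq[w] + mu * theta_hq.get(w, 0.0) + nu * theta_hc.get(w, 0.0))
               / denom
            for w in vocab}
```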
Overall Effect of Search Context
• Short-term context helps the system improve retrieval accuracy
• BayesInt better than FixInt; BatchUp better than OnlineUp
Using Clickthrough Data Only: BayesInt (μ = 0.0, ν = 5.0)
Clickthrough is the major contributor.

Performance on unseen docs:

  Query    MAP      pr@20
  Q3       0.0331   0.125
  Q3+HC    0.0661   0.178
  Improve  +99.7%   +42.4%
  Q4       0.0442   0.165
  Q4+HC    0.0739   0.188
  Improve  +67.2%   +13.9%

Snippets for non-relevant docs are still useful:

  Query    MAP      pr@20
  Q3       0.0421   0.1483
  Q3+HC    0.0521   0.1820
  Improve  +23.8%   +23.0%
  Q4       0.0536   0.1930
  Q4+HC    0.0620   0.1850
  Improve  +15.7%   -4.1%
Mixture model with dynamic weighting for long-term implicit feedback
• Past sessions S1, ..., St-1 (each with its query qi, result docs Di, and clickthroughs Ci) yield session models θ_S1, ..., θ_St-1
• The history model θ_H mixes the session models with weights λ1, ..., λt-1; it is combined with the current query model θ_q using weights λ_q and 1-λ_q:

p(w | θ_q,H) = λ_q p(w | θ_q) + (1 - λ_q) Σ_i λ_i p(w | θ_Si)

• Select the weights {λ} to maximize P(Dt | θ_q,H), using the EM algorithm
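A minimal sketch of the EM weight estimation (assuming the component models are fixed unigram LMs given as word → probability dicts, and non-empty inputs; the epsilon floor for unseen words is an illustrative choice):

```python
def em_mixture_weights(doc_terms, component_models, iters=20, eps=1e-12):
    """Estimate mixture weights over component LMs (the query model theta_q
    plus the session models theta_S1..theta_St-1) to maximize the likelihood
    of the observed terms Dt, via standard EM for a fixed-component mixture."""
    k = len(component_models)
    lam = [1.0 / k] * k  # uniform initialization
    for _ in range(iters):
        expected = [0.0] * k
        for w in doc_terms:
            # E-step: posterior probability that component i generated w.
            joint = [lam[i] * component_models[i].get(w, eps) for i in range(k)]
            z = sum(joint)
            for i in range(k):
                expected[i] += joint[i] / z
        # M-step: weights proportional to expected counts.
        total = sum(expected)
        lam = [c / total for c in expected]
    return lam
```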
Results: Different Individual Search Models
• recurring ≫ fresh
• combination ≈ clickthrough > docs > query, contextless
Results: Different Weighting Schemes for the Overall History Model
• hybrid ≈ EM > cosine > equal > contextless
3. Open Challenges
• What is a query?
• How to collect as much context information as possible without infringing user privacy?
• How to store and organize the collected context information?
• How to accurately interpret/exploit context information?
• How to formally represent the evolving information need of a user?
• How to optimize search results for an entire session?
• What's the right architecture (client-side, server-side, or a client-server combination)?
References
• Framework
  • Xuehua Shen, Bin Tan, and ChengXiang Zhai. Implicit user modeling for personalized search. In Proceedings of CIKM 2005, pp. 824-831.
  • ChengXiang Zhai and John Lafferty. A risk minimization framework for information retrieval. Information Processing and Management, 42(1):31-55, Jan. 2006.
• Short-term implicit feedback
  • Xuehua Shen, Bin Tan, and ChengXiang Zhai. Context-sensitive information retrieval using implicit feedback. In Proceedings of SIGIR 2005, pp. 43-50.
• Long-term implicit feedback
  • Bin Tan, Xuehua Shen, and ChengXiang Zhai. Mining long-term search history to improve search accuracy. In Proceedings of KDD 2006, pp. 718-723.
Thank You!