Thoughts (and Research) on Query Intent • Bruce Croft • Center for Intelligent Information Retrieval, UMass Amherst
Overview • Query Representation and Understanding Workshop at SIGIR 2010 • Research projects in the CIIR
Observations • “Query intent” has become a popular phrase at conferences and at companies • Research with query logs = acceptance of paper • Few standards in these papers about test collections, metrics, even tasks • Query processing has been part of IR for a long time • e.g., stemming, expansion, relevance feedback • Most retrieval models say little about queries • So, what’s going on and what’s interesting?
Terminology • Query intent (or search intent) is the same thing as an information need • The notion of an information need or problem underlying a query has been discussed in the IR literature for many years; it was generally agreed that query intent is another way of referring to the same idea • Query representation involves modeling the intent or need • Query understanding refers to the process of identifying the underlying intent or need based on a particular representation • Intent classes, intent dimensions, and query classes • terms used to talk about the many different types of information needs and problems
Terminology • Query rewriting, query transformation, query refinement, query alteration, and query reformulation • names given to the process of changing the original query to better represent the underlying intent (and consequently improve ranking) • Query expansion, substitution, reduction, segmentation • some of the techniques or steps used in the query transformation process • Query • most research assumes the query is the string entered by the user; transformation can produce many different representations of the query. The difference between explicit and implicit queries is important
Research Questions • How to develop a unified and general framework for query understanding? • How to formally define a query representation? • How to develop new system architectures for query understanding? • How to combine query understanding with other components in information retrieval systems? • How to conduct evaluations of query understanding? • How to make effective use of both human knowledge and machine learning in query understanding?
Possible Research Tasks • Long query relevance • Query reduction • Similar query finding • Query classification • Named entity recognition in queries • Context-aware search • Intent-aware search
Methodology • Must agree on tasks, evaluation metrics, and text collections • TREC-style vs. “black-box” evaluations • Crowdsourcing for annotations • Resources such as query collections, document collections, query logs, etc. differ widely in their availability in academic and industry settings
Resources • Document collections – TREC ClueWeb collection preferred • Query collections – need collections of different query types (e.g. long, location, product…) validated by industry • Query logs – critical resource for some approaches, not available in academia. Alternatives include MSN/AOL logs, KDD queries, anchor text logs, logs from other applications (Wikipedia), logs from some restricted environment (e.g. academic library) • N-grams, etc. – corpus and query language statistics from web collections
CIIR Projects • Modeling structure in queries • Modeling distributions of queries • Modeling diversity in queries • Transforming long queries • Generating queries from documents • Generating query logs from anchor text • Finding similar queries
The Challenge of Query Representation • User inputs a string of characters • Query structure is never explicitly observed and is difficult to infer • Short and ambiguous search queries • Idiosyncratic grammar • No capitalization or punctuation • Examples: “new york times square”, “do grover cleveland have kids”, “talking to heaven movie”
Structural Query Representation • A query Q has a hierarchical representation • The query is a set of structures Q = {S_1, …, S_n} • Each structure is a set of concepts S_i = {κ_1, κ_2, …} • The hierarchical representation allows us to • Model arbitrary term dependencies as concepts • Group concepts by structures • Assign weights to concepts/structures
Structures for the query “members rock group nirvana” • Terms: [members] [rock] [group] [nirvana] • Bigrams: [members rock] [rock group] [group nirvana] • Chunks: [members] [rock group] [nirvana] • Key Concepts: [members] [nirvana] • Dependence: [members nirvana] [rock group]
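To make these structures concrete, here is a minimal Python sketch (not the CIIR implementation) that derives the term and bigram structures directly from the query string; the chunk, key-concept, and dependence structures require external evidence (a chunker, a key-concept classifier, a parser) and are hard-coded to match the slide.

```python
def terms(query):
    """Unigram concepts: one concept per query term."""
    return [[t] for t in query.split()]

def bigrams(query):
    """Bigram concepts: each pair of adjacent query terms."""
    toks = query.split()
    return [[a, b] for a, b in zip(toks, toks[1:])]

query = "members rock group nirvana"
structures = {
    "Terms": terms(query),
    "Bigrams": bigrams(query),
    # These three need external evidence; hard-coded for illustration.
    "Chunks": [["members"], ["rock", "group"], ["nirvana"]],
    "Key Concepts": [["members"], ["nirvana"]],
    "Dependence": [["members", "nirvana"], ["rock", "group"]],
}

for name, concepts in structures.items():
    print(name, "->", concepts)
```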
Encoding Query Structure in a Hypergraph • [Figure: a hypergraph in which concept nodes are linked to the document node and grouped into structures 1 through n]
Weighted Sequential Dependence Model (WSD) [Bendersky, Metzler, and Croft, 2009] • Allow the parameters of the sequential dependence model to depend on the concept • Assume the parameters take a simple parametric form, λ(κ) = Σ_j w_j g_j(κ), where the w_j are free parameters and the g_j are concept importance features • This maintains reasonable model complexity
Defining Concept Importance in WSD • Features g define the concept importance • Depend on the concept (term/bigram) • Independent of a specific document/document corpus • Combine several sources for more accurate weighting • Endogenous Features – collection dependent features • Exogenous Features – collection independent features
WSD Ranking Function • Score document D by: score(Q, D) = Σ_{κ ∈ terms(Q)} λ_T(κ) f_T(κ, D) + Σ_{κ ∈ bigrams(Q)} λ_O(κ) f_O(κ, D) + Σ_{κ ∈ bigrams(Q)} λ_U(κ) f_U(κ, D) • where f_T, f_O, f_U are the term, ordered-window, and unordered-window matching functions and each λ(κ) = Σ_j w_j g_j(κ)
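As an illustration only, the following toy Python sketch mirrors the formula above; the matching functions f_T/f_O/f_U and the importance features g are fabricated stand-ins (a real implementation would use Dirichlet-smoothed language-model scores and the endogenous/exogenous features from the previous slide), and one bigram weight is shared by the ordered and unordered features for brevity.

```python
import math

def concept_weight(concept, w, g):
    # lambda(concept) = sum_j w_j * g_j(concept)
    return sum(w_j * g_j(concept) for w_j, g_j in zip(w, g))

def wsd_score(query_terms, doc, w_t, w_b, g_feats, f_T, f_O, f_U):
    score = 0.0
    for t in query_terms:                        # unigram concepts
        score += concept_weight((t,), w_t, g_feats) * f_T(t, doc)
    for b in zip(query_terms, query_terms[1:]):  # bigram concepts
        lam = concept_weight(b, w_b, g_feats)    # shared by f_O and f_U here
        score += lam * (f_O(b, doc) + f_U(b, doc))
    return score

# Toy usage with fabricated matching functions and two features.
doc = "the members of the rock group nirvana".split()
f_T = lambda t, d: math.log(1 + d.count(t))
f_O = lambda b, d: math.log(1 + sum((x, y) == b for x, y in zip(d, d[1:])))
f_U = f_O  # a real f_U counts co-occurrence in an unordered window
g_feats = [lambda c: 1.0,            # constant feature
           lambda c: 1.0 / len(c)]   # crude length-based feature
print(wsd_score("members rock group nirvana".split(), doc,
                [0.8, 0.1], [0.1, 0.05], g_feats, f_T, f_O, f_U))
```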
Example: query “civil war battle reenactments” • Concept weights may vary even if concept DF is similar • Good segments do not necessarily predict important concepts
TREC Description (Long) Queries • [Results chart: relative improvements of +6.3%, +1.6%, and +24.1% on TREC description (long) queries]
Query Representation • Distribution of Terms (DOT): words + phrases, original or new • e.g., Relevance Model [Lavrenko and Croft, SIGIR 2001], Sequential Dependence Model [Metzler and Croft, SIGIR 2005], Latent Concept Expansion [Metzler and Croft, SIGIR 2007], Uncertainty in PRF [Collins-Thompson and Callan, SIGIR 2007] • DOT does not consider how these terms fit into actual queries, thus missing the dependencies between them • Single Reformulated Query (SRQ): a single reformulation operation • e.g., Query Segmentation [Bergsma and Wang, EMNLP-CoNLL 2007; Tan and Peng, WWW 2008], Query Substitution [Jones et al., WWW 2006; Wang and Zhai, CIKM 2008] • SRQ does not consider combining with other operations, thus missing information about alternative reformulations • Distribution of Queries (DOQ): each query is the output of applying single or multiple reformulation operations
Example • Original TREC query: “oil industry history” • Distribution of Terms (DOT) • Relevance Model: {0.44 “industry”, 0.28 “oil”, 0.08 “petroleum”, 0.08 “gas”, 0.08 “county”, 0.04 “history”, …} • Sequential Dependence Model [Metzler and Croft, SIGIR 2005]: {0.28 “oil”, 0.28 “industry”, 0.28 “history”, 0.08 “oil industry”, 0.08 “industry history”, …} • Single Reformulated Query (SRQ) • Query Substitution: “petroleum industry history” • Query Segmentation: “(oil industry)(history)” • Distribution of Queries (DOQ): {0.28 “(oil industry)(history)”, 0.24 “(petroleum industry)(history)”, 0.20 “(oil and gas industry)(history)”, 0.18 “(oil)(industrialized)(history)”, …}
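A minimal sketch of how a distribution of queries could drive retrieval, assuming some per-query scorer: the document score is the probability-weighted sum of its scores under each reformulated query. The scorer below is a trivial term-overlap placeholder.

```python
# The DOQ from the slide, mapping reformulated queries to probabilities.
doq = {
    "(oil industry)(history)":         0.28,
    "(petroleum industry)(history)":   0.24,
    "(oil and gas industry)(history)": 0.20,
    "(oil)(industrialized)(history)":  0.18,
}

def overlap_score(query, doc_text):
    """Placeholder scorer: fraction of query terms found in the document."""
    terms = query.replace("(", " ").replace(")", " ").split()
    return sum(t in doc_text.split() for t in terms) / len(terms)

def doq_score(doc_text, doq, score_one_query=overlap_score):
    # Mix per-query scores, weighted by the probability of each query.
    return sum(p * score_one_query(q, doc_text) for q, p in doq.items())

print(doq_score("a history of the petroleum and oil industry", doq))
```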
Application I: Reducing Long Queries [Xue, Huston, and Croft, CIKM 2010] • A novel CRF-based model learns a distribution of subset queries and directly optimizes retrieval performance • Retrieval uses either the top-1 subset query or the top-K subset queries (a sketch follows) • In the result tables, q and d indicate significantly better than QL and DM, respectively
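A rough sketch of the subset-query idea; the paper's CRF model is replaced here by an arbitrary quality predictor, so this only illustrates the candidate space and the top-1/top-K selection.

```python
from itertools import combinations

def subset_queries(terms, min_len=1):
    """Enumerate all order-preserving subset queries of the original terms."""
    for k in range(min_len, len(terms) + 1):
        for combo in combinations(terms, k):
            yield " ".join(combo)

def top_k_subsets(query, predict_quality, k=3):
    """Rank candidate subsets by a quality predictor (a CRF in the paper)."""
    cands = list(subset_queries(query.split()))
    cands.sort(key=predict_quality, reverse=True)
    return cands[:k]

# Toy predictor: prefer short subsets that keep content-bearing words.
content = {"incidents", "human", "smuggling"}
quality = lambda q: sum(t in content for t in q.split()) - 0.1 * len(q.split())
print(top_k_subsets("identify incidents of human smuggling", quality))
```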
Query Substitution • The context of a word is the unigram preceding it • Given Q = q_1, …, q_{i-1}, q_i, q_{i+1}, …, q_n and a candidate substitution s for q_i: • Context distribution P(c | w): the probability that term c appears in w's context • Translation model: based on the KL divergence between the context distributions of w and s • Substitution model: how well the new term s fits the context of the current query
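An illustrative Python sketch of these components, using the slide's definition of context (the unigram preceding a word); the counting-based estimation and the way the two models are combined are assumptions for the example, not the exact published model.

```python
from collections import Counter
import math

def context_dist(word, log_queries):
    """P(c | word): distribution over unigrams preceding `word` in a query log."""
    counts = Counter()
    for q in log_queries:
        toks = q.split()
        counts.update(prev for prev, cur in zip(toks, toks[1:]) if cur == word)
    total = sum(counts.values())
    return {c: n / total for c, n in counts.items()} if total else {}

def kl_divergence(p, q, eps=1e-9):
    """Smoothed KL(p || q): small when the two context distributions match."""
    return sum(pv * math.log(pv / (q.get(c, 0.0) + eps)) for c, pv in p.items())

def substitution_score(w, s, query, log_queries):
    pw, ps = context_dist(w, log_queries), context_dist(s, log_queries)
    translation = math.exp(-kl_divergence(pw, ps))  # similar contexts => good candidate
    toks = query.split()
    i = toks.index(w)
    prev = toks[i - 1] if i > 0 else None
    fit = ps.get(prev, 0.0) if prev else 1.0        # does s fit this query's context?
    return translation * fit

log = ["cheap flight to boston", "cheap airfare to boston",
       "find cheap airfare", "find cheap flight"]
print(substitution_score("airfare", "flight", "cheap airfare", log))  # close to 1.0
```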
Query Expansion and Stemming • Probabilities are estimated from a corpus or a query log • Using text passages is nearly the same as pseudo-relevance feedback • Query expansion is similar to substitution: we add the new term and keep the original term • substitution: “cheap airfare” → “cheap flight” • expansion: “cheap airfare” → “cheap airfare flight” • Stemming: new terms are restricted to terms sharing a Porter-stemmed root • “drive direction” → “drive driving direction” (a sketch follows)
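A minimal sketch of stemming-as-expansion, assuming NLTK's Porter stemmer (any Porter implementation would do): group the vocabulary by stem, then append same-stem variants after each original query term.

```python
from collections import defaultdict
from nltk.stem import PorterStemmer  # assumption: nltk is installed

stemmer = PorterStemmer()

def stem_classes(vocabulary):
    """Group vocabulary terms that share a Porter-stemmed root."""
    classes = defaultdict(set)
    for term in vocabulary:
        classes[stemmer.stem(term)].add(term)
    return classes

def expand_with_stems(query, vocabulary):
    """Keep each original term and append its same-stem variants."""
    classes = stem_classes(vocabulary)
    out = []
    for t in query.split():
        out.append(t)
        out.extend(sorted(classes[stemmer.stem(t)] - {t}))
    return " ".join(out)

vocab = {"drive", "drives", "driving", "direction", "directions"}
print(expand_with_stems("drive direction", vocab))
# -> drive drives driving direction directions
```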
The Anchor Log • Extract <anchor, url> pairs from the Gov2 collection to create the anchor log [Dang and Croft, 2009] • The anchor log is very noisy • “click here”, “print version”, … do not represent the linked page • Anchor text gives comparable performance to the MSN log for substitution, expansion, and stemming (a toy extractor is sketched below)
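A toy anchor-pair extractor using Python's standard-library html.parser; the noise-phrase list and the filtering rule are illustrative, not the pipeline from the paper.

```python
from html.parser import HTMLParser

NOISE = {"click here", "here", "print version", "home", "back", "next"}

class AnchorExtractor(HTMLParser):
    """Collect (anchor text, url) pairs, dropping common boilerplate anchors."""
    def __init__(self):
        super().__init__()
        self.pairs, self._href, self._text = [], None, []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href, self._text = dict(attrs).get("href"), []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href:
            text = " ".join("".join(self._text).split()).lower()
            if text and text not in NOISE:
                self.pairs.append((text, self._href))
            self._href = None

p = AnchorExtractor()
p.feed('<a href="http://nirvana.example/">Nirvana band history</a> '
       '<a href="http://x.example/">click here</a>')
print(p.pairs)  # [('nirvana band history', 'http://nirvana.example/')]
```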
Learning to Rank Reformulations [Dang, Bendersky, and Croft, 2010]
Using Query Distributions • Reformulating Short Queries [Xue et al., CIKM 2010] • Passage information is used to generate candidate queries and estimate probabilities • [Results table on Gov2: o, w, m, and a represent different methods of generating candidate queries; q, d, and r indicate significantly better than QL, SDM, and RM, respectively]
Conclusions • Studying query intent is not new, but more data is leading to many new insights • Not just a web search issue, but more obvious in web search • Lots of interesting research to do, but the field needs more coherence in research goals and testbeds