Towards Information Retrieval with More Inferential Power Jian-Yun Nie Department of Computer Science University of Montreal nie@iro.umontreal.ca
Background • IR Goal: • Retrieve relevant information from a large collection of documents to satisfy user’s information need • Traditional relevance: • Query Q and document D in a given corpus: Score (Q,D) • User-independent • Knowledge independent • Independent of all contextual factors • Expected relevance: • Also depends on users (U) and contexts (C): Score (Q,D,U,C) • Reasoning with contextual information • Several approaches in IR can be viewed as simple inference • We have to consider more complex inference
Overview • Introduction: Current Approaches to IR • Inference using term relations • A General Model to Integrate Contextual Factors • Constructing and Using Domain Models • Conclusion and Future Work
Traditional Methods in IR • Each query term t matches a list of documents t: {…, D, …} • Final answer list = combination of the lists of all query terms • e.g. Vector space model: Score(Q,D) = Σ_t w(t,Q)·w(t,D); Language model: P(Q|D) = Π_{qi∈Q} P(qi|D) • 2 implicit assumptions: • Information need is only specified by the query terms • Query terms are independent
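A minimal sketch of the query-likelihood scoring alluded to here (an illustrative implementation, not the author's code; the Jelinek-Mercer interpolation and the weight `lam` are assumptions added to avoid zero probabilities):

```python
from collections import Counter
import math

def query_likelihood(query_terms, doc_terms, coll_counts, coll_len, lam=0.5):
    """Score a document by log P(Q|D) = sum_i log P(q_i|D), assuming
    independent query terms; Jelinek-Mercer smoothing avoids log(0)."""
    tf = Counter(doc_terms)
    dlen = len(doc_terms) or 1
    score = 0.0
    for q in query_terms:
        p_ml = tf[q] / dlen                        # maximum-likelihood P(q|D)
        p_coll = coll_counts.get(q, 0) / coll_len  # collection model P(q|C)
        score += math.log(lam * p_ml + (1 - lam) * p_coll + 1e-12)
    return score
```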
Reality • A term is only one of the possible expressions of a meaning • Synonyms, related terms • A query is only a partial specification of the user's information need • Many words can be omitted in the query: e.g. “Java hotel” may mean hotel booking on Java island, … • How to make the query more complete?
Dealing with relations between terms • Previous methods try to enhance the query: • Query expansion (add some related terms) • Thesauri: Wordnet, Hownet • Statistical co-occurrence: 2 terms that often co-occur in the same context • Pseudo-relevance feedback: top-ranked documents retrieved with the original query • User profile, background, preferences… (a set of background terms) • Used to re-rank the documents • Equivalent to query expansion
Question • Are these related to inference? • How to perform inference in general in IR? • LM as a tool for implementing logical IR
Overview • Introduction: Current Approaches to IR • Inference using term relations • A General Model to Integrate Contextual Factors • Constructing and Using Domain Models • Conclusion and Future Work
What is logical IR? • Key: inference – infer the query from the document • D: Tsunami • Q: natural disaster • D → Q ?
Using knowledge to make inference in IR • K ⊢ D → Q • K: general knowledge • No knowledge • Thesauri • Co-occurrence • … • K: user knowledge • Characterizes the knowledge of a particular user
Simple inference – the core of logical IR • Logical deduction: (A → B) ∧ (B → C) ⊨ (A → C) • In IR: (D → Q′) ∧ (Q′ → Q) ⊨ (D → Q)   [doc. matching, then inference on the query] • (D → D′) ∧ (D′ → Q) ⊨ (D → Q)   [inference on the doc., then doc. matching]
Is language modeling a reasonable framework? 1. Basic generative model: • P(Q|D) ~ P(D → Q) • Current smoothing: • e.g. D = Tsunami; P_ML(natural disaster|D) = 0 is changed to P(natural disaster|D) > 0 • Not inference: • P(computer|D) > 0 just as much as P(natural disaster|D) > 0
Effect of smoothing? • Doc: Tsunami, ocean, Asia, … • Smoothing ≠ inference • Redistribution uniform / according to the collection (also to unrelated terms) • [Figure: smoothed probabilities over Tsunami, ocean, Asia, computer, nat. disaster, …]
Expected effect • Using Tsunami → natural disaster • Knowledge-based smoothing • [Figure: mass redistributed to nat. disaster rather than to unrelated terms such as computer]
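A toy contrast between the two smoothing strategies (a sketch; the relation format and the weight `alpha` are assumptions, not from the slides):

```python
def smooth_with_collection(p_ml, p_coll, lam=0.5):
    """Standard smoothing: mix with the collection model; probability mass
    leaks to every term, related or not (e.g. 'computer')."""
    return {t: lam * p_ml.get(t, 0.0) + (1 - lam) * p for t, p in p_coll.items()}

def smooth_with_knowledge(p_ml, relations, alpha=0.3):
    """Knowledge-based smoothing: move mass only along known relations,
    e.g. Tsunami -> natural disaster."""
    p = dict(p_ml)
    for (src, dst), strength in relations.items():
        p[dst] = p.get(dst, 0.0) + alpha * strength * p_ml.get(src, 0.0)
    z = sum(p.values())
    return {t: v / z for t, v in p.items()}

# 'natural disaster' gains mass; 'computer' stays at zero.
p = smooth_with_knowledge({"tsunami": 0.5, "ocean": 0.3, "asia": 0.2},
                          {("tsunami", "natural disaster"): 1.0})
```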
Inference: Translation model (Berger & Lafferty 99) • Traditional LM: P(Q|D) = Π_i P(q_i|D) • Inference: P(Q|D) = Π_i Σ_w P(q_i|w) P(w|D)
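A sketch of the translation-model score (illustrative; the dictionary-based representation of the translation probabilities P(q|w) is an assumption):

```python
import math

def translation_score(query_terms, p_w_given_d, p_q_given_w):
    """log P(Q|D) = sum_i log sum_w P(q_i|w) P(w|D): a query term can be
    'generated' by a different but related document term."""
    score = 0.0
    for q in query_terms:
        s = sum(p_q_given_w.get((q, w), 0.0) * p_wd
                for w, p_wd in p_w_given_d.items())
        score += math.log(s + 1e-12)
    return score
```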
Using more types of knowledge for document expansion (Cao et al. 05) • Different ways to satisfy a query (term) • Directly through the unigram model • Indirectly (by inference) through Wordnet relations • Indirectly through co-occurrence relations • … • D → t_i if D →_UG t_i or D →_WN t_i or D →_CO t_i
Inference using different types of knowledge (Cao et al. 05) • [Figure: a query term q_i is generated from the document words w_1, w_2, …, w_n through three parallel models – the WN model P_WN(q_i|w), the CO model P_CO(q_i|w) and the UG model – mixed with weights λ_1, λ_2, λ_3]
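A sketch of the three-way mixture (illustrative; the default weights are placeholders, not the paper's tuned values):

```python
from collections import Counter

def mixture_prob(q, doc_terms, p_wn, p_co, lambdas=(0.6, 0.2, 0.2)):
    """P(q|D) = l1*P_UG(q|D) + l2*P_WN(q|D) + l3*P_CO(q|D), where the
    indirect models are P_X(q|D) = sum_w P_X(q|w) P_ML(w|D)."""
    tf = Counter(doc_terms)
    n = len(doc_terms) or 1
    p_ug = tf[q] / n
    p_wn_d = sum(p_wn.get((q, w), 0.0) * c / n for w, c in tf.items())
    p_co_d = sum(p_co.get((q, w), 0.0) * c / n for w, c in tf.items())
    l1, l2, l3 = lambdas
    return l1 * p_ug + l2 * p_wn_d + l3 * p_co_d
```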
Experiments (Cao et al. 05) • [Results table omitted] • Integrating more types of relations is useful
Query expansion in LM • KL-div: Score(Q,D) = −KL(θ_Q ‖ θ_D) ∝ Σ_w P(w|θ_Q) log P(w|θ_D)   (query model θ_Q, smoothed doc. model θ_D) • With no query expansion, equivalent to the generative model
Expanding the query model • P(w|θ′_Q) = λ P(w|θ_Q) + (1−λ) P_R(w|θ_Q)   (classical LM + relation model), with P_R(w|θ_Q) = Σ_t P(w|t) P(t|θ_Q)
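A sketch of KL-divergence ranking with an expanded query model (an illustrative implementation; the nested-dict representation of the relations P(w|t) and the weight `lam` are assumptions):

```python
import math

def kl_score(query_model, doc_model):
    """Rank by -KL(theta_Q || theta_D); dropping the query entropy term
    leaves sum_w P(w|theta_Q) log P(w|theta_D)."""
    return sum(pq * math.log(doc_model.get(w, 1e-12))
               for w, pq in query_model.items())

def expand_query_model(query_model, relations, lam=0.7):
    """theta'_Q = lam * theta_Q + (1 - lam) * relation model, where
    P_R(w|theta_Q) = sum_t P(w|t) P(t|theta_Q)."""
    expanded = {w: lam * p for w, p in query_model.items()}
    for t, pt in query_model.items():
        for w, p_wt in relations.get(t, {}).items():
            expanded[w] = expanded.get(w, 0.0) + (1 - lam) * p_wt * pt
    return expanded
```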
Using co-occurrence information • Using an external knowledge base (e.g. Wordnet) • Pseudo-rel. feedback • Other term relationships • …
Using co-occurrence relations • Use the term co-occurrence relationship • Terms that often co-occur in the same window are related • Window size: 10 words • Unigram relationship (w_j → w_i) • Query expansion
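A sketch of extracting unigram relations from windowed co-occurrence counts (illustrative; the normalization into a conditional distribution is an assumption):

```python
from collections import defaultdict

def cooccurrence_relations(tokens, window=10):
    """Estimate unigram relations P(w_i|w_j) from co-occurrence counts in a
    sliding window of 10 words, as in the slides."""
    pair_counts = defaultdict(float)
    for i, wj in enumerate(tokens):
        for wi in tokens[i + 1:i + window]:
            if wi != wj:
                pair_counts[(wj, wi)] += 1
                pair_counts[(wi, wj)] += 1
    totals = defaultdict(float)
    for (wj, _), c in pair_counts.items():
        totals[wj] += c
    return {(wj, wi): c / totals[wj] for (wj, wi), c in pair_counts.items()}
```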
Problem with co-occurrence relations • Ambiguity • A term relationship links two single words, e.g. “Java → programming” • No information to determine the appropriate context: e.g. “Java travel” would be expanded by “programming” • Solution: add some context information into the term relations
Overview • Introduction: Current Approaches to IR • Inference using term relations • Extracting context-dependent term relations • A General Model to Integrate Contextual Factors • Constructing and Using Domain Models • Conclusion and Future Work
General Idea (Bai et al. 06) • Use (t1, t2, t3, …) → t instead of t1 → t • e.g. “(Java, computer, language) → programming” • Problem with an arbitrary number of terms in the condition: • Complexity with many words in the condition part • Difficult to obtain reliable relations • Our solution: • Limit the condition part to 2 words, e.g. “(Java, computer) → programming”, “(Java, travel) → island” • One word specifies the context for the other
Hypotheses • Hypothesis 1: most words can be disambiguated with one useful context word • e.g. “Java + computer”, “Java + travel”, “Java + taste” • Hypothesis 2: users often choose useful related words to form their queries • A word in the query provides useful information to disambiguate another word • Possible queries: e.g. “windows version”, “doors and windows” • Rare case: users do not express their need clearly, e.g. “windows installation”?
Context-dependent co-occurrences (Bai et al. 06) • (w_i, w_j) → w_k • New relation model: P_R(w|θ_Q) = Σ_{(w_i,w_j)∈Q} P(w|w_i,w_j) P(w_i,w_j|θ_Q)
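A sketch of estimating the biterm relations (illustrative; the window representation and the counting scheme are assumptions, not the paper's exact estimator):

```python
from collections import defaultdict
from itertools import combinations

def biterm_relations(windows):
    """Estimate P(w_k | w_i, w_j): the condition part is limited to 2 words,
    one word serving as context for the other."""
    counts = defaultdict(float)
    totals = defaultdict(float)
    for window in windows:
        terms = set(window)
        for wi, wj in combinations(sorted(terms), 2):
            for wk in terms - {wi, wj}:
                counts[(wi, wj, wk)] += 1
                totals[(wi, wj)] += 1
    return {k: c / totals[k[:2]] for k, c in counts.items()}

# ('java', 'travel') -> 'island' vs. ('computer', 'java') -> 'programming'
rels = biterm_relations([["java", "travel", "island"],
                         ["java", "computer", "programming"]])
```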
Experimental Results (Average Precision) • [Results table omitted] • * and ** indicate that the difference is statistically significant by t-test (*: p-value < 0.05; **: p-value < 0.01)
Experimental Analysis (example)
• Query #55: “Insider trading”
• Unigram relationships, P(*|insider) or P(*|trading): stock:0.014177, market:0.0113156, US:0.0112784, year:0.010224, exchang:0.0101797, trade:0.00922486, report:0.00825644, price:0.00764028, dollar:0.00714267, 1:0.00691906, govern:0.00669295, state:0.00659957, futur:0.00619518, million:0.00614666, dai:0.00605674, offici:0.00597034, peopl:0.0059315, york:0.00579298, issu:0.00571347, nation:0.00563911
• Bi-term relationships, P(*|insider, trading): secur:0.0161779, charg:0.0158751, stock:0.0137123, scandal:0.0128471, boeski:0.0125011, inform:0.011982, street:0.0113332, wall:0.0112034, case:0.0106411, year:0.00908383, million:0.00869452, investig:0.00826196, exchang:0.00804568, govern:0.00778614, sec:0.00778614, drexel:0.00756986, fraud:0.00718055, law:0.00631543, ivan:0.00609914, profit:0.00566658
• ⇒ Expansion terms determined by BQE (bi-term query expansion) are more relevant than those from UQE (unigram query expansion)
Logical point of view of the extensions • [Figure: inference chain D → t_j → … → t_i]
Overview • Introduction: Current Approaches to IR • Inference using term relations • A General Model to Integrate Contextual Factors • Constructing and Using Domain Models • Conclusion and Future Work
LM for context-dependent IR? • Context (X) = background knowledge of the user, the domain of interest, … • Document model smoothed by the context model: X ⊢ D → Q ⇒ ⊢ (X+D) → Q • Similar to doc. expansion approaches • Query smoothed by the context model: X ⊢ D → Q ⇒ ⊢ D → (Q+X) • Similar to (Lau et al. 04) and query expansion approaches • Utilizations of context: • Domain knowledge (e.g. java → programming only in computer science) • Specification of the area of interest (e.g. science): background terms • Characteristics of the collection
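One way to write the two options as interpolations, in standard LM notation (a sketch; the slides leave the formulas implicit, and the weights β, γ are assumptions):

```latex
% Document expansion: smooth the document model with the context model X
P(w \mid \theta_D') = \beta\, P(w \mid \theta_D) + (1-\beta)\, P(w \mid \theta_X)

% Query expansion: smooth the query model with the context model X
P(w \mid \theta_Q') = \gamma\, P(w \mid \theta_Q) + (1-\gamma)\, P(w \mid \theta_X)
```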
Contexts and Utilization (1) • General term relations (knowledge) • Traditional term relations are context-independent: e.g. “Java → programming”, Prob(programming|Java) • Context-dependent term relations: add some context words into the relations, e.g. “{Java, computer} → programming” (“programming” is only used to expand a query containing both “Java” and “computer”) • “{Java, computer}” identifies a better context than “Java” alone to determine expansion terms
Contexts and Utilization (2) • Topic domains of the query (domain background) • Consider the topic domain as specifying a set of background terms frequently used in the domain • However, these terms are often omitted in the queries • e.g. in the Computer Science domain, the term “computer” is often implied by queries but usually omitted • e.g. Computer Science domain: any query → “computer”, …
Example: “bus services in Java” • 99 of the top 100 results concern the “Java language”; only one is related to “transportation” (but irrelevant to the query) • Reason: the retrieval context is not considered – the user is preparing a trip
Example: “bus services in Java + transportation, hotel, flight” • 12 of the top 20 results are related to “transportation” • Reason: the additional terms specify the appropriate context and make the query less ambiguous
Contexts and Utilization (3) • Query-specific collection characteristics (feedback model) • What terms are useful to retrieve relevant documents in a particular corpus? • ~ What other topics are often described together with the query topic in the corpus? e.g. in a corpus, “terrorism” is often described together with “9-11, air hijacking, world trade center, …” • Expand the query with related terms • Feedback model: captures the query-related collection context
Enhanced Query Model • Basic idea for query expansion: combine the original query model with the expansion model • Generalized model: 3 expansion models from 3 contextual factors: P(w|θ′_Q) = Σ_{i∈X} α_i P(w|θ_i) • θ_0: original query model • θ_K: knowledge model • θ_Dom: domain model • θ_FB: feedback model • where X = {0, K, Dom, FB} is the set of all component models and α_i is the mixture weight
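A sketch of the four-way combination (illustrative; the component names and example weights below are hypothetical, not tuned values from the talk):

```python
def enhanced_query_model(components, weights):
    """theta'_Q(w) = sum_{i in X} alpha_i * theta_i(w),
    X = {0 (original), K, Dom, FB}; the weights should sum to 1."""
    combined = {}
    for name, model in components.items():
        for w, p in model.items():
            combined[w] = combined.get(w, 0.0) + weights[name] * p
    return combined

# Hypothetical usage with the four component models:
# theta_prime = enhanced_query_model(
#     {"orig": theta_0, "knowledge": theta_K, "domain": theta_Dom, "feedback": theta_FB},
#     {"orig": 0.4, "knowledge": 0.2, "domain": 0.2, "feedback": 0.2})
```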
Illustration: Expanded Query Model • A term t can be derived from the query model by several inference paths • Once a path is selected, the corresponding LM is used to generate the term t
Overview • Introduction: Current Approaches to IR • Inference using term relations • A General Model to Integrate Contextual Factors • Constructing and Using Domain Models • Conclusion and Future Work
Creating Domain Models • Assumption: each topic domain contains a set of example (in-domain) documents • Extract domain-specific terms from them • Use the EM algorithm to extract only the specific terms • Assume each in-domain document is generated from a mixture of the domain model and the collection model: P(w|D) = λ P(w|θ′_Dom) + (1−λ) P(w|C), with λ = 0.5 • The domain model is extracted by EM so as to maximize the likelihood P(Dom|θ′_Dom) of the in-domain documents
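A sketch of this EM estimation (illustrative; the initialization and the 12 iterations follow the slides, the rest is an assumed implementation):

```python
from collections import Counter

def em_domain_model(in_domain_docs, collection_model, lam=0.5, iters=12):
    """EM for the domain model: each in-domain word is generated by
    lam * P(w|theta_Dom) + (1 - lam) * P(w|C); EM pushes domain-specific
    terms into theta_Dom and lets the collection model absorb common words."""
    counts = Counter(w for doc in in_domain_docs for w in doc)
    total = sum(counts.values())
    theta = {w: c / total for w, c in counts.items()}  # init: ML estimate
    for _ in range(iters):
        expected = {}
        for w, c in counts.items():
            p_dom = lam * theta.get(w, 0.0)
            p_all = p_dom + (1 - lam) * collection_model.get(w, 1e-9)
            expected[w] = c * p_dom / p_all            # E-step: domain share
        z = sum(expected.values())
        theta = {w: v / z for w, v in expected.items()}  # M-step: renormalize
    return theta
```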
Effect of EM Process • [Table omitted: term probabilities in the domain “Environment” before/after EM (12 iterations)] • ⇒ Extracts domain-specific terms while filtering out common words
How to Gather In-domain Documents • Existing directories: ODP, Yahoo! directory • We assume that the user defines his own domains and assigns a domain to each of his queries (during the training phase) • Gather the relevant documents of the queries (by the user's relevance judgments) (C1) • Simply collect the top-ranked documents (without the user's relevance judgments) (C2) • (This strategy is used in order to test on TREC data)
How to Determine the Domain of a New Query • 2 strategies to assign a domain to the query: • Manually (U1) • Automatically (U2) • Automatic query classification by LM: • Similar to text classification, but a query is much shorter than a text document • Select the domain with the lowest KL-divergence score for the query: Dom* = argmin_Dom KL(θ_Q ‖ θ_Dom) • This is an extension of Naïve Bayes classification [Peng et al. 2003]
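A sketch of this KL-based query classification (illustrative; the collection-model smoothing of the domain models is an assumption to keep the divergence finite):

```python
import math

def classify_query(query_model, domain_models, collection_model, lam=0.5):
    """Assign the domain whose smoothed model has the lowest KL divergence
    from the query model: Dom* = argmin_Dom KL(theta_Q || theta_Dom)."""
    best, best_kl = None, float("inf")
    for dom, theta in domain_models.items():
        kl = sum(pw * math.log(pw / (lam * theta.get(w, 0.0)
                                     + (1 - lam) * collection_model.get(w, 1e-9)))
                 for w, pw in query_model.items() if pw > 0)
        if kl < best_kl:
            best, best_kl = dom, kl
    return best
```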
Overview • Introduction: Current Approaches to IR • Inference using term relations • A General Model to Integrate Contextual Factors • Constructing and Using Domain Models • Experiments • Conclusion and Future Work
Experimental Setting • Text collection statistics: TREC [table omitted] • Training collection: used to determine the parameter values (mixture weights)
An Example: Query with Manually Assigned Domain
<top>
<head> Tipster Topic Description
<num> Number: 055
<dom> Domain: International Economics
<title> Topic: Insider Trading   (only the title is used as the query)
<desc> Description: Document discusses an insider-trading case.
…
[Figure: distribution of the queries among 13 domains in TREC]
Baseline Methods • Document model: Jelinek-Mercer smoothing, P(w|D) = λ P_ML(w|D) + (1−λ) P(w|C)
Constructing and Using Domain Models • 2 strategies to create domain models (the current test query is excluded from domain model construction): • With the relevant documents of in-domain queries (C1) • The user judges which documents are relevant to the domain • Similar to the manual construction of directories • With the top-100 documents retrieved by in-domain queries (C2) • The user specifies a domain for queries without judging relevant documents • The system gathers in-domain documents from the user's search history • Once the domain models are constructed, 2 strategies to use them: • A domain can be assigned to a new query manually by the user (U1) • The domain is determined automatically by the system using query classification (U2)
Creating Domain Models • C1 (constructed with relevant documents) vs. C2 (constructed with top-100 documents): [results table omitted]