Statistical Models for Information Retrieval and Text Mining. ChengXiang Zhai (翟成祥) Department of Computer Science Graduate School of Library & Information Science Institute for Genomic Biology, Statistics University of Illinois, Urbana-Champaign
Statistical Models for Information Retrieval and Text Mining ChengXiang Zhai (翟成祥) Department of Computer Science Graduate School of Library & Information Science Institute for Genomic Biology, Statistics University of Illinois, Urbana-Champaign http://www-faculty.cs.uiuc.edu/~czhai, czhai@cs.uiuc.edu
Course Overview: Scope of the course [Figure labels: Information Retrieval, Multimedia Data, Text Data, Computer Vision, Natural Language Processing, Machine Learning, Statistics]
Goal of the Course • Overview of techniques for information retrieval (IR) • Detailed explanation of a few statistical models for IR and text mining • Probabilistic retrieval models (for search) • Probabilistic topic models (for text mining) • Potential benefit for you: • Some ideas working well for text retrieval may also work for computer vision • Techniques for computer vision may be applicable to IR • IR and text mining raise new challenges as well as opportunities for machine learning
Course Plan • Lecture 1: Overview of information retrieval • Lecture 2: Statistical language models for IR: Part 1 • Lecture 3: Statistical language models for IR: Part 2 • Lecture 4: Formal retrieval frameworks • Lecture 5: Probabilistic topic models for text mining
Lecture 1: Overview of IR • Basic Concepts in Text Retrieval (TR) • Evaluation of TR • Common Components of a TR system • Overview of Retrieval Models
What is Text Retrieval (TR)? • There exists a collection of text documents • User gives a query to express the information need • A retrieval system returns relevant documents to users • Known as “search technology” in industry
History of TR on One Slide • Birth of TR • 1945: V. Bush’s article “As We May Think” • 1957: H. P. Luhn’s idea of word counting and matching • Indexing & Evaluation Methodology (1960’s) • Smart system (G. Salton’s group) • Cranfield test collection (C. Cleverdon’s group) • Indexing: automatic can be as good as manual (controlled vocabulary) • TR Models (1970’s & 1980’s) … • Large-scale Evaluation & Applications (1990’s-Present) • TREC (D. Harman & E. Voorhees, NIST) • Web search, PubMed, … • Boundaries with related areas are disappearing
Short vs. Long Term Info Need • Short-term information need (Ad hoc retrieval) • “Temporary need”, e.g., info about used cars • Information source is relatively static • User “pulls” information • Application example: library search, Web search • Long-term information need (Filtering) • “Stable need”, e.g., new data mining algorithms • Information source is dynamic • System “pushes” information to user • Applications: news filter
Importance of Ad hoc Retrieval • Directly manages any existing large collection of information • There are many many “ad hoc” information needs • A long-term information need can be satisfied through frequent ad hoc retrieval • Basic techniques of ad hoc retrieval can be used for filtering and other “non-retrieval” tasks, such as automatic summarization.
Formal Formulation of TR • Vocabulary $V = \{w_1, w_2, \dots, w_N\}$ of language • Query $q = q_1, \dots, q_m$, where $q_i \in V$ • Document $d_i = d_{i1}, \dots, d_{im_i}$, where $d_{ij} \in V$ • Collection $C = \{d_1, \dots, d_k\}$ • Set of relevant documents $R(q) \subseteq C$ • Generally unknown and user-dependent • Query is a “hint” on which doc is in $R(q)$ • Task = compute $R'(q)$, an “approximate $R(q)$”
Computing R'(q) • Strategy 1: Document selection • $R'(q) = \{d \in C \mid f(d,q) = 1\}$, where $f(d,q) \in \{0,1\}$ is an indicator function or classifier • System must decide if a doc is relevant or not (“absolute relevance”) • Strategy 2: Document ranking • $R'(q) = \{d \in C \mid f(d,q) > \theta\}$, where $f(d,q)$ is a relevance measure function and $\theta$ is a cutoff • System must decide if one doc is more likely to be relevant than another (“relative relevance”)
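The two strategies can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the score table `scores`, the helper names `select` and `rank`, and the thresholds are hypothetical, and the scores stand in for whatever f(d,q) a real system would compute.

```python
def select(scores, classify):
    """Strategy 1: keep exactly the documents a binary classifier accepts."""
    return {doc for doc, score in scores.items() if classify(score) == 1}


def rank(scores, theta):
    """Strategy 2: sort by decreasing score, then keep documents above the cutoff."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc for doc, score in ranked if score > theta]


# Hypothetical relevance scores f(d, q) for one query.
scores = {"d1": 0.98, "d2": 0.95, "d3": 0.83, "d4": 0.80, "d5": 0.76,
          "d6": 0.56, "d7": 0.34, "d8": 0.21, "d9": 0.21}

selected = select(scores, lambda s: 1 if s >= 0.9 else 0)  # unordered set
ranked = rank(scores, theta=0.8)                           # ordered list
print(selected, ranked)
```

With ranking, changing `theta` only moves the cutoff along an ordering that stays fixed, which is why the user can effectively choose the boundary by deciding where to stop reading.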
Document Selection vs. Ranking [Figure: the true R(q) compared with two approximations R’(q). Doc selection, f(d,q)=?: each document is assigned 1 or 0. Doc ranking, f(d,q)=?: documents are listed by score with their true relevance: 0.98 d1 +, 0.95 d2 +, 0.83 d3 -, 0.80 d4 +, 0.76 d5 -, 0.56 d6 -, 0.34 d7 -, 0.21 d8 +, 0.21 d9 -]
Problems of Doc Selection • The classifier is unlikely to be accurate • “Over-constrained” query (terms are too specific): no relevant documents found • “Under-constrained” query (terms are too general): over-delivery • It is extremely hard to find the right position between these two extremes • Even if it is accurate, not all relevant documents are equally relevant • Relevance is a matter of degree!
Ranking is often preferred • Relevance is a matter of degree • A user can stop browsing anywhere, so the boundary is controlled by the user • High recall users would view more items • High precision users would view only a few • Theoretical justification: Probability Ranking Principle [Robertson 77]
Probability Ranking Principle [Robertson 77] • As stated by Cooper: “If a reference retrieval system’s response to each request is a ranking of the documents in the collections in order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data has been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data.” • Robertson provides two formal justifications • Assumptions: independent relevance and sequential browsing (these do not necessarily hold in reality)
According to the PRP, all we need is “a relevance measure function f” which satisfies: for all $q, d_1, d_2$, $f(q,d_1) > f(q,d_2)$ iff $p(\text{Rel} \mid q,d_1) > p(\text{Rel} \mid q,d_2)$. Most IR research has focused on finding a good function f.
Evaluation Criteria • Effectiveness/Accuracy • Precision, Recall • Efficiency • Space and time complexity • Usability • How useful for real user tasks?
Methodology: Cranfield Tradition • Laboratory testing of system components • Precision, Recall • Comparative testing • Test collections • Set of documents • Set of questions • Relevance judgments
The Contingency Table

                     Action: Retrieved      Action: Not Retrieved
Doc: Relevant        Relevant Retrieved     Relevant Rejected
Doc: Not relevant    Irrelevant Retrieved   Irrelevant Rejected
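Precision and recall follow directly from the cells of this table: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that are retrieved. A minimal sketch, where the function name `precision_recall` and the document ids are hypothetical:

```python
def precision_recall(retrieved, relevant):
    """Compute set-based precision and recall for one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    relevant_retrieved = len(retrieved & relevant)  # the "Relevant Retrieved" cell
    precision = relevant_retrieved / len(retrieved) if retrieved else 0.0
    recall = relevant_retrieved / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 documents truly relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d9"])
print(p, r)
```

Returning 0.0 for an empty retrieved or relevant set is a convention chosen here to avoid division by zero, not something the slides specify.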
How to measure a ranking? • Compute the precision at every recall point • Plot a precision-recall (PR) curve • Which is better? [Figure: two plots of precision (y-axis) against recall (x-axis), each showing the points of a different ranking]
Summarize a Ranking: MAP • Given that n docs are retrieved • Compute the precision (at rank) where each (new) relevant document is retrieved => p(1),…,p(k), if we have k rel. docs • E.g., if the first rel. doc is at the 2nd rank, then p(1)=1/2. • If a relevant document never gets retrieved, we assume the precision corresponding to that rel. doc to be zero • Compute the average over all the relevant documents • Average precision = (p(1)+…+p(k))/k • This gives us (non-interpolated) average precision, which captures both precision and recall and is sensitive to the rank of each relevant document • Mean Average Precision (MAP) • MAP = arithmetic mean of average precision over a set of topics • gMAP = geometric mean of average precision over a set of topics (more affected by difficult topics)
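The definitions above translate into a short Python sketch. The function names are hypothetical, the ranking is assumed to contain no duplicate documents, and the small constant `eps` in the geometric mean is an assumption added here so that a topic with zero average precision does not cause log(0).

```python
import math


def average_precision(ranking, relevant):
    """Non-interpolated average precision for one topic."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank  # precision at the rank of this relevant doc
    # Relevant docs that were never retrieved contribute zero.
    return total / len(relevant)


def mean_average_precision(ap_values):
    """MAP: arithmetic mean of per-topic average precision."""
    return sum(ap_values) / len(ap_values)


def geometric_map(ap_values, eps=1e-5):
    """gMAP: geometric mean of per-topic average precision (eps is an assumption)."""
    logs = [math.log(ap + eps) for ap in ap_values]
    return math.exp(sum(logs) / len(logs))


# Six returned docs, four relevant docs, one of which is never retrieved.
ap = average_precision(["D1", "D2", "D3", "D4", "D5", "D6"],
                       ["D1", "D2", "D5", "D9"])
print(ap)  # (1/1 + 2/2 + 3/5 + 0) / 4
```

Because the geometric mean is pulled down by small values, a single poorly served topic lowers gMAP much more than it lowers MAP.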
Summarize a Ranking: NDCG • What if relevance judgments are on a scale of [1,r], r>2? • Cumulative Gain (CG) at rank n • Let the ratings of the n documents be $r_1, r_2, \dots, r_n$ (in ranked order) • $CG = r_1 + r_2 + \dots + r_n$ • Discounted Cumulative Gain (DCG) at rank n • $DCG = r_1 + r_2/\log_2 2 + r_3/\log_2 3 + \dots + r_n/\log_2 n$ • We may use any base for the logarithm, e.g., base=b • For rank positions above b in the list (rank < b), do not discount • Normalized Discounted Cumulative Gain (NDCG) at rank n • Normalize DCG at rank n by the DCG value at rank n of the ideal ranking • The ideal ranking would first return the documents with the highest relevance level, then the next highest relevance level, etc. • NDCG is now quite popular in evaluating Web search
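A minimal sketch of DCG and NDCG following the formula on this slide. The function names and the example ratings are hypothetical, and `judged_ratings` is assumed to hold the ratings of all judged documents for the query, so that the ideal ranking can be built by sorting them.

```python
import math


def dcg(ratings, base=2):
    """Discounted cumulative gain of ratings given in ranked order."""
    total = 0.0
    for rank, rating in enumerate(ratings, start=1):
        # Ranks up to the log base are not discounted (log_b(b) = 1 anyway).
        discount = math.log(rank, base) if rank > base else 1.0
        total += rating / discount
    return total


def ndcg(ratings, judged_ratings, n, base=2):
    """DCG at rank n divided by the DCG at rank n of the ideal ranking."""
    ideal = sorted(judged_ratings, reverse=True)[:n]
    ideal_dcg = dcg(ideal, base)
    return dcg(ratings[:n], base) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical ratings on a 1..3 scale, in the order the system returned them.
system_ratings = [3, 2, 3, 1]
print(dcg(system_ratings), ndcg(system_ratings, system_ratings, n=4))
```

Normalizing by the ideal ranking puts every query on a 0 to 1 scale, which makes scores comparable across queries with different numbers of highly rated documents.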
When There’s only 1 Relevant Document • Scenarios: • known-item search • navigational queries • Search Length = Rank of the answer: • measures a user’s effort • Mean Reciprocal Rank (MRR): • Reciprocal Rank: 1/Rank-of-the-answer • Take an average over all the queries
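Reciprocal rank and its mean over queries can be sketched as follows. The function names and the toy runs are hypothetical, and scoring a query whose answer is never retrieved as 0 is an assumption made here.

```python
def reciprocal_rank(ranking, answer):
    """1 / rank of the single relevant document, or 0 if it is missing."""
    for rank, doc in enumerate(ranking, start=1):
        if doc == answer:
            return 1.0 / rank
    return 0.0  # assumption: an answer that is never retrieved scores 0


def mean_reciprocal_rank(runs):
    """Average the reciprocal rank over (ranking, answer) pairs, one per query."""
    return sum(reciprocal_rank(ranking, answer) for ranking, answer in runs) / len(runs)


# Hypothetical results for three known-item queries.
runs = [(["a", "b", "c"], "a"),   # answer at rank 1
        (["a", "b", "c"], "b"),   # answer at rank 2
        (["a", "b", "c"], "z")]   # answer not retrieved
print(mean_reciprocal_rank(runs))
```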
Precision-Recall Curve [Figure: precision-recall curve annotated with the measures below] • Recall: out of 4728 rel docs, we’ve got 3212, so Recall = 3212/4728 • Precision@10docs: about 5.5 docs in the top 10 docs are relevant • Breakeven Point (prec = recall) • Mean Avg. Precision (MAP) • Example: the system returns 6 docs (D1 +, D2 +, D3 -, D4 -, D5 +, D6 -) and the total # rel docs = 4, so Average Prec = (1/1 + 2/2 + 3/5 + 0)/4 = 0.65
What Query Averaging Hides Slide from Doug Oard’s presentation, originally from Ellen Voorhees’ presentation
The Pooling Strategy • When the test collection is very large, it’s impossible to completely judge all the documents • TREC’s strategy: pooling • Appropriate for relative comparison of different systems • Given N systems, take top-K from the result of each, combine them to form a “pool” • Users judge all the documents in the pool; unjudged documents are assumed to be non-relevant • Advantage: less human effort • Potential problem: • bias due to incomplete judgments (okay for relative comparison) • Favor a system contributing to the pool, but when reused, a new system’s performance may be under-estimated • Reuse the data set with caution!
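The pooling procedure can be sketched as below. The system names, document ids, and judgment table `qrels` are hypothetical; the sketch only shows how the pool is formed and how unjudged documents are treated.

```python
def build_pool(runs, k):
    """Union of the top-k documents from each system's ranking."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])
    return pool


def is_relevant(doc, qrels):
    """Look up a judgment; documents outside the pool count as non-relevant."""
    return qrels.get(doc, False)


# Hypothetical rankings from two systems for one topic.
runs = {"sysA": ["d1", "d2", "d3"], "sysB": ["d2", "d4", "d5"]}
pool = build_pool(runs, k=2)

# Hypothetical human judgments, made only for documents in the pool.
qrels = {"d1": True, "d2": False, "d4": True}
print(sorted(pool), is_relevant("d5", qrels))
```

A new system that ranks `d5` highly would be penalized even if `d5` were in fact relevant, which is the reuse bias the slide warns about.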
User Studies • Limitations of Cranfield evaluation strategy: • How do we evaluate a technique for improving the interface of a search engine? • How do we evaluate the overall utility of a system? • User studies are needed • General user study procedure: • Experimental systems are developed • Subjects are recruited as users • Variation can be in the system or the users • Users use the system and user behavior is logged • User information is collected (before: background, after: experience with the system) • Clickthrough-based real-time user studies: • Assume clicked documents to be relevant • Mix results from multiple methods and compare their clickthroughs
Typical TR System Architecture [Figure: system architecture diagram. Components: Tokenizer, Indexer, Index, Doc Rep (Index), Query Rep, Scorer, Feedback, User. Data flows: docs, query, results, judgments]
Text Representation/Indexing • Making it easier to match a query with a document • Query and document should be represented using the same units/terms • Controlled vocabulary vs. full text indexing • Full-text indexing is more practically useful and has proven to be as effective as manual indexing with controlled vocabulary
What is a good indexing term? • Specific (phrases) or general (single word)? • Luhn found that words with middle frequency are most useful • Not too specific (low utility, but still useful!) • Not too general (lack of discrimination, stop words) • Stop word removal is common, but rare words are kept • All words or a (controlled) subset? When term weighting is used, it is a matter of weighting, not selecting, indexing terms
Tokenization • Word segmentation is needed for some languages • Is it really needed? • Normalize lexical units: Words with similar meanings should be mapped to the same indexing term • Stemming: Mapping all inflectional forms of words to the same root form, e.g. • computer -> compute • computation -> compute • computing -> compute (but king->k?) • Are we losing finer-granularity discrimination? • Stop word removal • What is a stop word? What about a query like “to be or not to be”?
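The two questions raised on this slide can be reproduced with a deliberately naive sketch. The suffix list and stop word list are hypothetical toy choices, not a real stemmer, and this toy version produces the stem "comput" rather than "compute".

```python
# Hypothetical, deliberately tiny resources for illustration only.
STOP_WORDS = {"to", "be", "or", "not", "the", "a", "of"}
SUFFIXES = ("ation", "ing", "er")


def naive_stem(word):
    """Strip the first matching suffix; real stemmers apply many more rules."""
    for suffix in SUFFIXES:
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def tokenize(text, remove_stop_words=True):
    """Lowercase, split on whitespace, optionally drop stop words, then stem."""
    tokens = text.lower().split()
    if remove_stop_words:
        tokens = [t for t in tokens if t not in STOP_WORDS]
    return [naive_stem(t) for t in tokens]


print(tokenize("computer computation computing"))  # all map to one stem
print(naive_stem("king"))                           # over-stemming
print(tokenize("to be or not to be"))               # nothing is left
```

The last two lines show why both steps lose information: blind suffix stripping conflates unrelated words, and stop word removal can erase an entire query.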
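Relevance Feedback [Figure: the User’s Query goes to the Retrieval Engine, which searches the Document collection and returns Results (d1 3.5, d2 2.4, …, dk 0.5, ...); the User provides Judgments (d1 +, d2 -, d3 +, …, dk -, ...); Feedback uses the judgments to produce an Updated query]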
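Pseudo/Blind/Automatic Feedback [Figure: the Query goes to the Retrieval Engine, which searches the Document collection and returns Results (d1 3.5, d2 2.4, …, dk 0.5, ...); without involving the user, the top 10 results are assumed relevant, giving Judgments (d1 +, d2 +, d3 +, …, dk -, ...); Feedback uses the judgments to produce an Updated query]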
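Implicit Feedback [Figure: the User’s Query goes to the Retrieval Engine, which searches the Document collection and returns Results (d1 3.5, d2 2.4, …, dk 0.5, ...); Judgments (d1 +, d2 -, d3 +, …, dk -, ...) are inferred from User Activities, e.g. clickthroughs; Feedback uses the judgments to produce an Updated query]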
Important Points to Remember • PRP provides a justification for ranking, which is generally preferred to document selection • How to compute the major evaluation measures (precision, recall, precision-recall curve, MAP, gMAP, breakeven precision, NDCG, MRR) • What is pooling • What is tokenization (word segmentation, stemming, stop word removal) • What are relevance feedback, pseudo relevance feedback, and implicit feedback
Overview of Retrieval Models: three ways of modeling Relevance
• Similarity, $\Delta(\text{Rep}(q), \text{Rep}(d))$ (different rep & similarity) • Vector space model (Salton et al., 75) • Prob. distr. model (Wong & Yao, 89) • …
• Probability of Relevance, $P(r=1 \mid q,d)$, $r \in \{0,1\}$ • Regression Model (Fox 83) • Learn to Rank (Joachims 02) (Burges et al. 05) • Generative Model • Doc generation: Classical prob. Model (Robertson & Sparck Jones, 76) • Query generation: LM approach (Ponte & Croft, 98) (Lafferty & Zhai, 01a)
• Probabilistic inference, $P(d \to q)$ or $P(q \to d)$ (different inference system) • Prob. concept space model (Wong & Yao, 95) • Inference network model (Turtle & Croft, 91)
Relevance = Similarity • Assumptions • Query and document are represented similarly • A query can be regarded as a “document” • $\text{Relevance}(d,q) \propto \text{similarity}(d,q)$ • $R(q) = \{d \in C \mid f(d,q) > \theta\}$, $f(q,d) = \Delta(\text{Rep}(q), \text{Rep}(d))$ • Key issues • How to represent query/document? • How to define the similarity measure $\Delta$?
Vector Space Model • Represent a doc/query by a term vector • Term: basic concept, e.g., word or phrase • Each term defines one dimension • N terms define a high-dimensional space • Element of vector corresponds to term weight • E.g., d=(x1,…,xN), xi is “importance” of term i • Measure relevance by the distance between the query vector and document vector in the vector space
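As a minimal sketch of this idea, the example below scores documents against a query with cosine similarity. Cosine is only one possible choice and is an assumption here, since the model itself does not fix the similarity or distance measure; the three-term vocabulary and the weights are also hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical 3-term space; each position holds the weight of one term.
docs = {"d1": [2.0, 1.0, 0.0],
        "d2": [0.0, 0.0, 3.0]}
query = [1.0, 0.0, 0.0]

# Rank documents by decreasing similarity to the query vector.
ranked = sorted(docs, key=lambda d: cosine(query, docs[d]), reverse=True)
print(ranked)
```

Cosine ignores vector length, so a long document is not favored merely for containing more term occurrences; a plain dot product would behave differently.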
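VS Model: illustration [Figure: a three-dimensional term space with axes Starbucks, Java, and Microsoft; documents D1 to D11 and the Query are plotted as points, with question marks next to some of them]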
What the VS model doesn’t say • How to define/select the “basic concept” • Concepts are assumed to be orthogonal • How to assign weights • Weight in query indicates importance of term • Weight in doc indicates how well the term characterizes the doc • How to define the similarity/distance measure
What’s a good “basic concept”? • Orthogonal • Linearly independent basis vectors • “Non-overlapping” in meaning • No ambiguity • Weights can be assigned automatically and hopefully accurately • Many possibilities: Words, stemmed words, phrases, “latent concept”, …
How to Assign Weights? • Very very important! • Why weighting • Query side: Not all terms are equally important • Doc side: Some terms carry more information about contents • How? • Two basic heuristics • TF (Term Frequency) = Within-doc-frequency • IDF (Inverse Document Frequency) • TF normalization
TF Weighting • Idea: A term is more important if it occurs more frequently in a document • Some formulas: Let f(t,d) be the frequency count of term t in doc d • Raw TF: TF(t,d) = f(t,d) • Log TF: TF(t,d)=log f(t,d) • Maximum frequency normalization: TF(t,d) = 0.5 +0.5*f(t,d)/MaxFreq(d) • “Okapi/BM25 TF”: TF(t,d) = k f(t,d)/(f(t,d)+k(1-b+b*doclen/avgdoclen)) • Normalization of TF is very important!
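The four formulas on this slide can be written out directly. The function names are hypothetical, the default values k=1.2 and b=0.75 are commonly used settings assumed here rather than values from the slides, and returning 0 from the log variant when the count is 0 is also an assumption.

```python
import math


def raw_tf(f):
    """Raw TF: the frequency count itself."""
    return float(f)


def log_tf(f):
    """Log TF as written on the slide; note that a single occurrence gives 0."""
    return math.log(f) if f > 0 else 0.0  # assumption: absent terms score 0


def max_norm_tf(f, max_freq):
    """Maximum frequency normalization: 0.5 + 0.5 * f / MaxFreq(d)."""
    return 0.5 + 0.5 * f / max_freq


def bm25_tf(f, doclen, avgdoclen, k=1.2, b=0.75):
    """Okapi/BM25 TF: k*f / (f + k*(1 - b + b*doclen/avgdoclen))."""
    return k * f / (f + k * (1 - b + b * doclen / avgdoclen))


# The BM25 form saturates: repeated occurrences add less and less.
for f in (1, 2, 10, 1000):
    print(f, round(bm25_tf(f, doclen=100, avgdoclen=100), 3))
```

For a document of average length the BM25 value approaches k as the count grows, so the weight is bounded, unlike raw TF.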
TF Normalization • Why? • Document length variation • “Repeated occurrences” are less informative than the “first occurrence” • Two views of document length • A doc is long because it uses more words • A doc is long because it has more contents • Generally penalize long doc, but avoid over-penalizing (pivoted normalization)
TF Normalization: How? [Figure: candidate curves of normalized TF (y-axis) against raw TF (x-axis)] • Which curve is more reasonable? • Should normalized TF be upper-bounded? • Normalization interacts with the similarity measure
Regularized/“Pivoted” Length Normalization [Figure: normalized TF (y-axis) against raw TF (x-axis)] • “Pivoted normalization”: using avg. doc length to regularize normalization • Normalizer: 1-b+b*doclen/avgdoclen (b varies from 0 to 1) • What would happen if doclen is {>, <, =} avgdoclen? • Advantage: stabilize parameter setting
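The question about doclen versus avgdoclen can be checked numerically with a small sketch. The function name and the value b=0.75 are assumptions used only for illustration.

```python
def pivoted_normalizer(doclen, avgdoclen, b=0.75):
    """Length normalizer 1 - b + b * doclen / avgdoclen, with b in [0, 1]."""
    return 1 - b + b * doclen / avgdoclen


avgdoclen = 100
for doclen in (50, 100, 200):
    # Below 1 for short docs, exactly 1 at the average, above 1 for long docs.
    print(doclen, pivoted_normalizer(doclen, avgdoclen))

# b = 0 switches length normalization off; b = 1 normalizes fully by length.
print(pivoted_normalizer(200, avgdoclen, b=0.0),
      pivoted_normalizer(200, avgdoclen, b=1.0))
```

Since the normalizer typically appears in the denominator of a TF formula, a value above 1 penalizes a longer-than-average document and a value below 1 rewards a shorter one, with the average-length document acting as the pivot.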