
Dragon Star Program (龙星计划) Course: Information Retrieval. Overview of Text Retrieval: Part 2


Presentation Transcript


  1. Dragon Star Program (龙星计划) Course: Information Retrieval. Overview of Text Retrieval: Part 2. ChengXiang Zhai (翟成祥), Department of Computer Science, Graduate School of Library & Information Science, Institute for Genomic Biology, and Statistics, University of Illinois, Urbana-Champaign. http://www-faculty.cs.uiuc.edu/~czhai, czhai@cs.uiuc.edu

  2. Outline • Other retrieval models • Implementation of a TR System • Applications of TR techniques

  3. P-norm (Extended Boolean)(Salton et al. 83) • Motivation: how to rank documents with a Boolean query? • Intuitions • Docs satisfying the query constraint should get the highest ranks • Partial satisfaction of query constraint can be used to rank other docs • Question: How to capture “partial satisfaction”?

  4. P-norm: Basic Ideas • Normalized term weights for doc rep ([0,1]) • Define similarity between a Boolean query and a doc vector. [Figure: two unit squares, one for Q = T1 OR T2 and one for Q = T1 AND T2, each with a document plotted at (x, y); OR similarity grows with the distance from the worst point (0,0), while AND similarity grows with closeness to the ideal point (1,1).]

  5. P-norm: Formulas. For a document with normalized weights (x, y) on terms T1 and T2:

sim(T1 OR^p T2, D) = ( (x^p + y^p) / 2 )^(1/p)
sim(T1 AND^p T2, D) = 1 − ( ((1 − x)^p + (1 − y)^p) / 2 )^(1/p)

Since the similarity value is normalized to [0,1], these two formulas can be applied recursively. As p ranges from 1 to ∞, the model interpolates between the vector-space model (p = 1) and Boolean/fuzzy logic (p → ∞).
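To make the operators concrete, here is a minimal Python sketch of the two P-norm similarity functions; the weights 0.8 and 0.1 are hypothetical.

```python
def pnorm_or(x, y, p):
    # Q = T1 OR T2: similarity grows with distance from the worst point (0,0)
    return ((x**p + y**p) / 2) ** (1 / p)

def pnorm_and(x, y, p):
    # Q = T1 AND T2: similarity grows with closeness to the ideal point (1,1)
    return 1 - (((1 - x)**p + (1 - y)**p) / 2) ** (1 / p)

# p = 1 behaves like the vector-space model; large p approaches strict Boolean logic
for p in (1, 2, 10, 100):
    print(p, round(pnorm_or(0.8, 0.1, p), 3), round(pnorm_and(0.8, 0.1, p), 3))
```

As p grows, the OR score approaches max(x, y) and the AND score approaches min(x, y), recovering fuzzy-logic behavior.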

  6. P-norm: Summary • A general (and elegant) similarity function for Boolean query and a regular document vector • Connecting Boolean model and vector space model with models in between • Allowing different “confidence” on Boolean operators (different p for different operators) • A model worth more exploration (how to learn optimal p values from feedback?)

  7. Probabilistic Retrieval Models

  8. Overview of Retrieval Models. Three ways of formalizing relevance:
• Similarity of representations, (Rep(q), Rep(d)): vector space model (Salton et al., 75); prob. distr. model (Wong & Yao, 89)
• Probability of relevance, P(r=1|q,d), r ∈ {0,1}:
– Regression model (Fox 83)
– Generative models: doc generation → classical prob. model (Robertson & Sparck Jones, 76); query generation → LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
– Learning to rank (Joachims 02; Burges et al. 05)
• Probabilistic inference, P(d→q) or P(q→d): inference network model (Turtle & Croft, 91); prob. concept space model (Wong & Yao, 95)

  9. Formally… The basic question: what is the probability that THIS document is relevant to THIS query? Three random variables: query Q, document D, relevance R ∈ {0,1}. Given a particular query q and a particular document d, what is p(R=1|Q=q,D=d)?

  10. Probability of Relevance • Three random variables • Query Q • Document D • Relevance R ∈ {0,1} • Goal: rank D based on P(R=1|Q,D) • Evaluate P(R=1|Q,D) • Actually, we only need to compare P(R=1|Q,D1) with P(R=1|Q,D2), i.e., rank documents • Several different ways to refine P(R=1|Q,D)

  11. Refining P(R=1|Q,D), Method 1: conditional models • Basic idea: relevance depends on how well a query matches a document • Define features on Q × D, e.g., # matched terms, highest IDF of a matched term, doc length, … • P(R=1|Q,D) = g(f1(Q,D), f2(Q,D), …, fn(Q,D), θ) • Use training data (known relevance judgments) to estimate the parameters θ • Apply the model to rank new documents • Special case: logistic regression

  12. Logistic Regression (Cooper 92, Gey 94). Model the log-odds of relevance as a linear function of six features X1, …, X6:

logit(p) = log( p / (1 − p) ) = β0 + β1 X1 + … + β6 X6

Inverting the logit gives the logistic (sigmoid) function

P(R=1|Q,D) = 1 / ( 1 + exp(−(β0 + β1 X1 + … + β6 X6)) )

which maps any real-valued score into [0,1], approaching 1.0 asymptotically.

  13. Features/Attributes • Average absolute query frequency • Query length • Average absolute document frequency • Document length • Average inverse document frequency • Inverse document frequency • Number of terms in common between the query and the document (logged)
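A minimal sketch of how a fitted model scores a document; the feature values and coefficients below are invented for illustration, not the ones from Cooper 92 or Gey 94.

```python
import math

def p_relevant(x, w, b):
    # P(R=1|Q,D) = logistic(b + sum_i w_i * x_i)
    s = b + sum(wi * xi for wi, xi in zip(w, x))
    return 1 / (1 + math.exp(-s))

# Hypothetical values for the six features X1..X6 and coefficients
# that would be fit by maximum likelihood on relevance judgments.
x = [0.4, 2.0, 0.3, 5.3, 3.1, 2.7]
w = [0.1, -0.2, 0.05, -0.01, 0.3, 0.5]
print(p_relevant(x, w, b=-1.0))  # a probability in (0, 1)
```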

  14. Logistic Regression: Pros & Cons • Advantages • Absolute probability of relevance available • May re-use all the past relevance judgments • Problems • Performance is very sensitive to the selection of features • Not much guidance on feature selection • In practice, performance tends to be average

  15. Refining P(R=1|Q,D), Method 2: generative models • Basic idea • Define P(Q,D|R) • Compute the odds O(R=1|Q,D) using Bayes’ rule • Special cases • Document “generation”: P(Q,D|R)=P(D|Q,R)P(Q|R) • Query “generation”: P(Q,D|R)=P(Q|D,R)P(D|R) • In document generation, the factor P(Q|R) does not depend on D and is ignored for ranking D

  16. Document Generation. Contrast a model of relevant docs for Q with a model of non-relevant docs for Q:

O(R=1|Q,D) ∝ P(D|Q,R=1) / P(D|Q,R=0)

Assume independent attributes A1…Ak (why?). Let D=d1…dk, where di ∈ {0,1} is the value of attribute Ai (similarly Q=q1…qk). Then

P(D|Q,R=1) / P(D|Q,R=0) = Π_{i=1..k} P(Ai=di|Q,R=1) / P(Ai=di|Q,R=0)

Assuming non-query terms are equally likely to appear in relevant and non-relevant docs, their factors cancel, and the product runs only over query terms.

  17. Robertson-Sparck Jones Model (Robertson & Sparck Jones 76) (RSJ model). Two parameters for each term Ai: pi = P(Ai=1|Q,R=1), the probability that term Ai occurs in a relevant doc, and qi = P(Ai=1|Q,R=0), the probability that it occurs in a non-relevant doc. Ranking is by the sum of weights over terms present in both query and document:

Σ_{i: di=1, Ai in Q} log [ pi (1 − qi) / ( qi (1 − pi) ) ]

How to estimate the parameters? Suppose we have relevance judgments: with R relevant docs, of which ri contain Ai, and N docs in total, of which ni contain Ai,

pi ≈ (ri + 0.5) / (R + 1), qi ≈ (ni − ri + 0.5) / (N − R + 1)

The “+0.5” and “+1” can be justified by Bayesian estimation.

  18. RSJ Model: No Relevance Info (Croft & Harper 79). How to estimate the parameters when we have no relevance judgments? Assume pi to be a constant (its contribution is then the same for every term) and estimate qi by assuming all documents to be non-relevant, qi ≈ ni/N. The term weight then becomes IDF-like:

wi ≈ log ( (N − ni + 0.5) / (ni + 0.5) )

N: # documents in the collection; ni: # documents in which term Ai occurs.
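A sketch of the resulting scoring rule under these assumptions (binary matching, IDF-like weights); the document frequencies below are hypothetical.

```python
import math

def rsj_weight(n_i, N):
    # IDF-like weight when p_i is constant: log((N - n_i + 0.5) / (n_i + 0.5))
    return math.log((N - n_i + 0.5) / (n_i + 0.5))

def rsj_score(query_terms, doc_terms, df, N):
    # Binary matching: sum weights of query terms that occur in the document
    return sum(rsj_weight(df[t], N) for t in query_terms
               if t in doc_terms and t in df)

df = {"text": 300, "mining": 50}  # hypothetical document frequencies
print(rsj_score(["text", "mining"], {"text", "mining", "paper"}, df, N=1000))
```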

  19. RSJ Model: Summary • The most important classic prob. IR model • Uses only term presence/absence, thus also referred to as the Binary Independence Model • Essentially Naïve Bayes for doc ranking • Most natural for relevance/pseudo feedback • Without relevance judgments, the model parameters must be estimated in an ad hoc way • Performance isn’t as good as a tuned VS model

  20. Improving RSJ: Adding TF. Basic doc generation model: let D=d1…dk, where di is the frequency count of term Ai. Model each count with a 2-Poisson mixture (an “elite” topic-word Poisson mixed with a background Poisson):

P(Ai = tf | Q, R) = α · Poisson(tf; λ1) + (1 − α) · Poisson(tf; λ2), where Poisson(tf; λ) = e^(−λ) λ^tf / tf!

Many more parameters to estimate! (How many exactly?)

  21. BM25/Okapi Approximation (Robertson et al. 94) • Idea: approximate p(R=1|Q,D) with a simpler function that shares similar properties • Observations: • log O(R=1|Q,D) is a sum of term weights Wi • Wi = 0 if TFi = 0 • Wi increases monotonically with TFi • Wi has an asymptotic limit • A simple function with these properties is

Wi = ( TFi (k1 + 1) / (k1 + TFi) ) × (RSJ weight)

which is 0 at TFi = 0, grows monotonically, and saturates as TFi grows.

  22. Adding Doc. Length & Query TF • Incorporating doc length • Motivation: the 2-Poisson model assumes equal document length • Implementation: “carefully” penalize long docs • Incorporating query TF • Motivation: appears not to be well-justified • Implementation: a similar TF transformation applied to query term frequency • The final formula is called BM25, achieving top TREC performance

  23. The BM25 Formula. Combining the RSJ weight with the “Okapi TF/BM25 TF” transformation on both the document and query sides:

score(Q,D) = Σ_{w ∈ Q∩D} log( (N − nw + 0.5) / (nw + 0.5) ) × ( (k1 + 1) TFw ) / ( k1 ((1 − b) + b · dl/avgdl) + TFw ) × ( (k3 + 1) QTFw ) / ( k3 + QTFw )

where dl is the document length, avgdl the average document length, and k1, b, k3 are parameters (typically k1 ≈ 1.2, b ≈ 0.75).
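A compact Python sketch of this formula; the parameter defaults follow common practice, and all collection statistics below are hypothetical.

```python
import math

def bm25(query, tf, dl, avgdl, df, N, k1=1.2, b=0.75, k3=8.0):
    # Sum over query terms: RSJ/IDF weight x BM25 TF x query-TF component
    score = 0.0
    for t in set(query):
        if t not in tf or t not in df:
            continue
        idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5))
        tf_part = tf[t] * (k1 + 1) / (k1 * ((1 - b) + b * dl / avgdl) + tf[t])
        qtf = query.count(t)
        qtf_part = qtf * (k3 + 1) / (k3 + qtf)
        score += idf * tf_part * qtf_part
    return score

tf = {"text": 3, "mining": 2}       # term counts in one document (hypothetical)
df = {"text": 300, "mining": 50}    # document frequencies (hypothetical)
print(bm25(["text", "mining"], tf, dl=100, avgdl=120, df=df, N=1000))
```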

  24. Extensions of “Doc Generation” Models • Capture term dependence (Rijsbergen & Harper 78) • Alternative ways to incorporate TF (Croft 83, Kalt96) • Feature/term selection for feedback (Okapi’s TREC reports) • Other Possibilities (machine learning … )

  25. Query Generation. With query “generation”, the odds of relevance factor into a query likelihood p(q|d) and a document prior:

O(R=1|Q,D) ∝ P(Q|D,R=1) P(D|R=1) / P(D|R=0)

Assuming a uniform document prior, we can rank documents by the query likelihood p(q|d) alone. Now, the question is how to compute p(q|d). Generally this involves two steps: (1) estimate a language model based on D; (2) compute the query likelihood according to the estimated model. This leads to the so-called “Language Modeling Approach”…

  26. What is a Statistical LM? • A probability distribution over word sequences • p(“Today is Wednesday”) ≈ 0.001 • p(“Today Wednesday is”) ≈ 0.0000000000001 • p(“The eigenvalue is positive”) ≈ 0.00001 • Context-dependent! • Can also be regarded as a probabilistic mechanism for “generating” text, thus also called a “generative” model

  27. The Simplest Language Model (Unigram Model) • Generate a piece of text by generating each word INDEPENDENTLY • Thus, p(w1 w2 ... wn)=p(w1)p(w2)…p(wn) • Parameters: {p(wi)}, with p(w1)+…+p(wN)=1 (N is the vocabulary size) • Essentially a multinomial distribution over words • A piece of text can be regarded as a sample drawn according to this word distribution
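To make the “sampling” view concrete, the toy sketch below draws ten words independently from a made-up unigram distribution; the vocabulary and probabilities are invented.

```python
import random

# Toy unigram model: probabilities over the vocabulary sum to 1
model = {"the": 0.6, "text": 0.2, "mining": 0.1,
         "food": 0.07, "clustering": 0.02, "association": 0.01}

words, probs = zip(*model.items())
# Each word is drawn INDEPENDENTLY, so word order carries no information
print(" ".join(random.choices(words, weights=probs, k=10)))
```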

  28. Text Generation with Unigram LM. A (unigram) language model θ, given by p(w|θ), generates a document by sampling words from it:
• Topic 1 (text mining) model: text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001, … → tends to generate a text mining paper
• Topic 2 (health) model: food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, … → tends to generate a food nutrition paper

  29. Estimation of Unigram LM. The reverse problem: given a document (a “text mining paper”, total # words = 100) with counts text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1, …, estimate the (unigram) language model θ with p(w|θ) = ? The maximum likelihood estimates are the relative frequencies: text 10/100, mining 5/100, association 3/100, database 3/100, …, query 1/100, …
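A one-function sketch of this maximum likelihood estimation step; the toy word list below stands in for the 100-word paper on the slide.

```python
from collections import Counter

def mle_unigram(doc_words):
    # p(w | theta) = c(w, d) / |d|  (maximum likelihood estimate)
    counts = Counter(doc_words)
    return {w: c / len(doc_words) for w, c in counts.items()}

doc = ["text"] * 10 + ["mining"] * 5 + ["association"] * 3 + ["database"] * 3
print(mle_unigram(doc)["text"])  # 10/21 here; 10/100 for the full document
```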

  30. Language Models for Retrieval (Ponte & Croft 98). Query = “data mining algorithms”. Which document language model would most likely have generated this query: the text mining paper’s model (text ?, mining ?, association ?, clustering ?, …, food ?, …) or the food nutrition paper’s model (food ?, nutrition ?, healthy ?, diet ?, …)?

  31. Ranking Docs by Query Likelihood. Estimate a language model for each doc, then rank the docs by the likelihood of the query q under each doc’s LM: d1 → p(q|d1), d2 → p(q|d2), …, dN → p(q|dN).

  32. Retrieval as Language Model Estimation • Document ranking based on query likelihood: with query q = q1 q2 … qm,

log p(q|d) = Σ_i log p(qi|d)

where p(·|d) is the document language model • Retrieval problem → estimation of p(qi|d) • Smoothing is an important issue, and distinguishes different approaches

  33. How to Estimate p(w|d)? • Simplest solution: Maximum Likelihood Estimator • P(w|d) = relative frequency of word w in d • What if a word doesn’t appear in the text? P(w|d)=0 • In general, what probability should we give a word that has not been observed? • If we want to assign non-zero probabilities to such words, we’ll have to discount the probabilities of observed words • This is what “smoothing” is about …

  34. Language Model Smoothing (Illustration). [Figure: P(w) plotted over words w ordered by frequency; the smoothed LM lies below the maximum likelihood estimate on words seen in the document and gives non-zero probability to unseen words.]

  35. How to Smooth? • All smoothing methods try to • discount the probability of words seen in a document • re-allocate the extra counts so that unseen words will have a non-zero count • A simple method (additive smoothing): add a constant δ to the count of each word:

p(w|d) = ( c(w,d) + δ ) / ( |d| + δ|V| )

where c(w,d) is the count of w in d, |d| is the length of d (total counts), and |V| is the vocabulary size; δ = 1 gives “add one” (Laplace) smoothing • Problems?
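A short sketch of additive smoothing over a fixed vocabulary (toy data); delta = 1 gives Laplace “add one” smoothing.

```python
from collections import Counter

def additive_smoothing(doc_words, vocab, delta=1.0):
    # p(w|d) = (c(w,d) + delta) / (|d| + delta * |V|)
    counts = Counter(doc_words)
    denom = len(doc_words) + delta * len(vocab)
    return {w: (counts[w] + delta) / denom for w in vocab}

vocab = {"text", "mining", "food", "nutrition"}
p = additive_smoothing(["text", "text", "mining"], vocab)
print(p["food"], sum(p.values()))  # unseen word gets non-zero mass; sums to 1
```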

  36. A General Smoothing Scheme • All smoothing methods try to • discount the probability of words seen in a doc • re-allocate the extra probability so that unseen words will have a non-zero probability • Most use a reference model (collection language model) p(w|C) to discriminate unseen words:

p(w|d) = p_seen(w|d) if w is seen in d, and α_d p(w|C) otherwise

where p_seen(w|d) is the discounted ML estimate and α_d is a coefficient making the probabilities sum to one.

  37. Smoothing & TF-IDF Weighting • Plugging the general smoothing scheme into the query likelihood retrieval formula, with query length m, we obtain

log p(q|d) = Σ_{w ∈ q, w ∈ d} log [ p_seen(w|d) / (α_d p(w|C)) ] + m log α_d + Σ_{w ∈ q} log p(w|C)

• The first sum acts as TF weighting with an IDF-like factor 1/p(w|C); m log α_d provides doc length normalization (a long doc is expected to have a smaller α_d); the last sum does not depend on d and can be ignored for ranking • Smoothing with p(w|C) thus yields TF-IDF weighting + length normalization

  38. Derivation of the Query Likelihood Retrieval Formula. Using the general smoothing scheme, with discounted ML estimate p_seen(w|d) and reference language model p(w|C):

log p(q|d) = Σ_{w ∈ q} c(w,q) log p(w|d)
= Σ_{w ∈ q, c(w,d)>0} c(w,q) log p_seen(w|d) + Σ_{w ∈ q, c(w,d)=0} c(w,q) log [ α_d p(w|C) ]

Key rewriting step: add and subtract the smoothed term for the matched words,

= Σ_{w ∈ q, c(w,d)>0} c(w,q) log [ p_seen(w|d) / (α_d p(w|C)) ] + |q| log α_d + Σ_{w ∈ q} c(w,q) log p(w|C)

Similar rewritings are very common when using LMs for IR…
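A sketch of scoring with the rewritten formula, assuming each query word occurs once; the p_seen, alpha_d, and p(w|C) values below are hypothetical, and in practice the final document-independent sum would be dropped for ranking.

```python
import math

def log_query_likelihood(query, p_seen, alpha_d, p_coll):
    # First sum: only query words that occur in d (the TF-IDF-like part)
    matched = sum(math.log(p_seen[w] / (alpha_d * p_coll[w]))
                  for w in query if w in p_seen)
    # |q| * log(alpha_d) acts like length normalization; the last sum
    # does not depend on d and can be ignored when ranking documents
    return matched + len(query) * math.log(alpha_d) + \
           sum(math.log(p_coll[w]) for w in query)

p_seen = {"text": 0.09, "mining": 0.05}   # discounted ML estimates for d
p_coll = {"text": 0.01, "mining": 0.002, "algorithms": 0.001}
print(log_query_likelihood(["text", "mining", "algorithms"], p_seen, 0.3, p_coll))
```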

  39. More Smoothing Methods • Method 1 (Absolute discounting): subtract a constant δ from the count of each seen word:

p(w|d) = ( max(c(w,d) − δ, 0) + δ |d|_u p(w|REF) ) / |d|

where |d|_u is the # of unique words in d, |d| is the document length, and δ is the parameter • Method 2 (Linear interpolation, Jelinek-Mercer): “shrink” the ML estimate uniformly toward p(w|REF):

p(w|d) = (1 − λ) c(w,d)/|d| + λ p(w|REF)

with parameter λ ∈ [0,1].

  40. More Smoothing Methods (cont.) • Method 3 (Dirichlet prior/Bayesian): assume μ pseudo counts distributed according to p(w|REF):

p(w|d) = ( c(w,d) + μ p(w|REF) ) / ( |d| + μ )

with parameter μ > 0 • Method 4 (Good-Turing): assume the total # of unseen events to be n1 (the # of singletons), and adjust the counts of seen events in the same way: a word occurring c times gets the adjusted count c* = (c+1) n_{c+1} / n_c.
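Sketches of Jelinek-Mercer (Method 2) and Dirichlet prior (Method 3) smoothing side by side; the reference model p(w|REF) and the parameter values are made up.

```python
from collections import Counter

def jelinek_mercer(doc_words, p_ref, lam=0.5):
    # p(w|d) = (1 - lam) * c(w,d)/|d| + lam * p(w|REF)
    counts, dl = Counter(doc_words), len(doc_words)
    return {w: (1 - lam) * counts[w] / dl + lam * p_ref[w] for w in p_ref}

def dirichlet(doc_words, p_ref, mu=2000):
    # p(w|d) = (c(w,d) + mu * p(w|REF)) / (|d| + mu)
    counts, dl = Counter(doc_words), len(doc_words)
    return {w: (counts[w] + mu * p_ref[w]) / (dl + mu) for w in p_ref}

p_ref = {"text": 0.01, "mining": 0.002, "food": 0.005}  # collection LM
doc = ["text", "text", "mining"]
print(dirichlet(doc, p_ref)["food"], jelinek_mercer(doc, p_ref)["food"])
```

Note the design difference: Dirichlet smoothing is document-length sensitive (long docs are smoothed less), while Jelinek-Mercer interpolates with a fixed weight.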

  41. Dirichlet Prior Smoothing • ML estimator: θ_ML = argmax_θ p(d|θ) • Bayesian estimator: • first consider the posterior p(θ|d) = p(d|θ)p(θ)/p(d) • then take the mean or mode of the posterior distribution • p(d|θ): sampling distribution (of the data) • p(θ) = p(θ1, …, θN): our prior on the model parameters • Conjugate prior = the prior can be interpreted as “extra”/“pseudo” data • The Dirichlet distribution is the conjugate prior for the multinomial sampling distribution, with hyperparameters acting as “extra”/“pseudo” word counts: αi = μ p(wi|REF)

  42. Dirichlet Prior Smoothing (cont.) Posterior distribution of the parameters:

p(θ|d) = Dir( θ | c(w1,d) + α1, …, c(wN,d) + αN )

The predictive distribution is the same as the posterior mean:

p(wi|d) = ( c(wi,d) + αi ) / ( |d| + Σ_j αj ) = ( c(wi,d) + μ p(wi|REF) ) / ( |d| + μ )

which is exactly Dirichlet prior smoothing.

  43. Advantages of Language Models • Solid statistical foundation • Parameters can be optimized automatically using statistical estimation methods • Can easily model many different retrieval tasks • To be covered more later

  44. What You Should Know • Global relationship among different probabilistic models • How logistic regression works • How the Robertson-Sparck Jones model works • The BM25 formula • All document-generation models have trouble when no relevance judgments are available • How the language modeling approach (query likelihood scoring) works • How Dirichlet prior smoothing works • 3 state-of-the-art retrieval models: pivoted normalization, Okapi/BM25, and query likelihood (with Dirichlet prior smoothing)

  45. Implementation of an IR System

  46. IR System Architecture. [Diagram: docs are INDEXED into a Doc Rep; the user’s query passes through the INTERFACE to a Query Rep; SEARCHING ranks documents by matching the two representations and returns results; the user’s relevance judgments drive QUERY MODIFICATION (feedback), which revises the query.]

  47. Indexing • Indexing = Convert documents to data structures that enable fast search • Inverted index is the dominating indexing method (used by all search engines) • Other indices (e.g., document index) may be needed for feedback

  48. Inverted Index • Fast access to all docs containing a given term (along with freq and pos information) • For each term, we get a list of tuples (docID, freq, pos). • Given a query, we can fetch the lists for all query terms and work on the involved documents. • Boolean query: set operation • Natural language query: term weight summing • More efficient than scanning docs (why?)

  49. Inverted Index Example. Doc 1: “This is a sample document with one sample sentence”. Doc 2: “This is another sample document”. Dictionary entries point to postings, e.g. (docID, freq):
this → (1,1), (2,1)
sample → (1,2), (2,1)
document → (1,1), (2,1)
another → (2,1)
…

  50. Data Structures for Inverted Index • Dictionary: modest size • Needs fast random access • Preferably kept in memory • Hash table, B-tree, trie, … • Postings: huge • Sequential access is expected • Can stay on disk • May contain docID, term freq., term pos., etc. • Compression is desirable
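A minimal sketch of building such an index, with postings of (docID, freq, positions), from the two example documents of slide 49; a real system would add compression and keep postings on disk.

```python
from collections import defaultdict

def build_index(docs):
    # term -> postings list of (docID, freq, [positions])
    index = defaultdict(list)
    for doc_id, text in docs.items():
        positions = defaultdict(list)
        for pos, term in enumerate(text.lower().split()):
            positions[term].append(pos)
        for term, plist in sorted(positions.items()):
            index[term].append((doc_id, len(plist), plist))
    return index

docs = {1: "This is a sample document with one sample sentence",
        2: "This is another sample document"}
index = build_index(docs)
print(index["sample"])  # [(1, 2, [3, 7]), (2, 1, [3])]
# Query evaluation fetches the postings for each query term and merges them,
# e.g. intersecting docIDs for a Boolean AND, or summing term weights.
```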
