
Modeling

This chapter introduces the ranking algorithms used in information retrieval systems and explores the taxonomy of IR models, including Boolean, Vector, and Probabilistic Retrieval. It also covers the estimation of term relevance in these models.

lholmes

Presentation Transcript


  1. Modeling • Modern Information Retrieval, by R. Baeza-Yates and B. Ribeiro-Neto, Addison-Wesley, 1999 (Chapter 2)

  2. Introduction • Ranking algorithms • The central problem regarding IR systems is the issue of predicting which documents are relevant and which are not. • Taxonomy of IR Models • Boolean: set theoretic • Vector: algebraic • Probabilistic

  3. Retrieval • Ad hoc • the documents in the collection remain relatively static while new queries are submitted to the system • Filtering (Routing) • the queries remain relatively static while new documents come into the system • construction of user profile

  4. Basic Concepts • In the classic models • each document is described by a set of representative keywords called index terms • index terms are mainly nouns • distinct index terms have varying relevance • index term weights are usually assumed to be mutually independent

  5. Boolean Model • Binary decision criterion • Data retrieval model • A query is a Boolean expression which can be represented as a disjunction of conjunctive vectors • Advantage • clean formalism, simplicity • Disadvantage • exact matching may lead to retrieval of too few or too many documents
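
The exact-match behaviour described on this slide can be illustrated with a minimal sketch. The toy collection, the term names ka, kb, kc, and the query q = ka AND (kb OR NOT kc) are assumptions chosen for illustration; the query is written as a disjunction of conjunctive vectors over (ka, kb, kc), and a document is retrieved only if its binary vector equals one of them.

```python
# Hypothetical toy collection: each document is the set of index terms it contains.
DOCS = {
    "d1": {"ka", "kb"},
    "d2": {"ka", "kc"},
    "d3": {"kb", "kc"},
    "d4": {"ka", "kb", "kc"},
}
VOCAB = ("ka", "kb", "kc")

# q = ka AND (kb OR NOT kc), expanded into disjunctive normal form.
# Each tuple is one conjunctive vector over (ka, kb, kc).
Q_DNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}


def binary_vector(terms):
    """Map a document's term set to its binary vector over VOCAB."""
    return tuple(int(k in terms) for k in VOCAB)


def boolean_retrieve(docs, q_dnf):
    """Binary decision: a document matches iff its vector is one of the conjuncts."""
    return sorted(name for name, terms in docs.items()
                  if binary_vector(terms) in q_dnf)


print(boolean_retrieve(DOCS, Q_DNF))
```

There is no ranking here: d2 and d3 are rejected outright even though each shares terms with the query, which is the "too few or too many" problem noted above.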

  6. Vector Model (1/4) • Index terms are assigned non-binary weights • Term weights are used to compute the degree of similarity between documents and the user query • The retrieved documents are then sorted in decreasing order of this degree of similarity • Definition: for the vector model, the weight wi,j is associated with term ki and document dj

  7. Vector Model (2/4) • Degree of similarity
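
The formula for this slide did not survive in the transcript. The vector model's degree of similarity is usually taken to be the cosine of the angle between the document and query vectors; the following is that standard form, written with the slide's $w_{i,j}$ and with $w_{i,q}$ (query-term weight) and $t$ (number of index terms) as assumed notation.

```latex
\[
sim(d_j, q)
  = \frac{\vec{d_j} \cdot \vec{q}}{|\vec{d_j}| \, |\vec{q}|}
  = \frac{\sum_{i=1}^{t} w_{i,j} \, w_{i,q}}
         {\sqrt{\sum_{i=1}^{t} w_{i,j}^{2}} \; \sqrt{\sum_{i=1}^{t} w_{i,q}^{2}}}
\]
```

With non-negative weights the value lies between 0 and 1, so a document can match the query partially.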

  8. Vector Model (3/4) • Salton • IR vs. clustering • intra-cluster similarity: tf factor (term frequency) • inter-cluster dissimilarity: idf factor (inverse document frequency) • Definition • normalized frequency • inverse document frequency • term-weighting schemes • query-term weights
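
The definitions named on this slide lost their formulas in the transcript. The sketch below assumes the commonly used forms: normalized frequency f = freq / max freq within the document, idf = log(N / n_i), document weight w = f × idf, and the query weight (0.5 + 0.5 × f) × idf. Function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """n_i: the number of documents in which each index term appears."""
    df = Counter()
    for terms in docs.values():
        df.update(set(terms))
    return df


def document_weights(docs):
    """w_ij = (freq_ij / max_l freq_lj) * log(N / n_i), assuming non-empty documents."""
    n_docs = len(docs)
    df = document_frequencies(docs)
    weights = {}
    for name, terms in docs.items():
        freq = Counter(terms)
        max_freq = max(freq.values())
        weights[name] = {k: (f / max_freq) * math.log(n_docs / df[k])
                         for k, f in freq.items()}
    return weights


def query_weights(query_terms, docs):
    """w_iq = (0.5 + 0.5 * freq_iq / max_l freq_lq) * log(N / n_i).

    Query terms that occur in no document are skipped (an assumption of this sketch).
    """
    n_docs = len(docs)
    df = document_frequencies(docs)
    freq = Counter(k for k in query_terms if df[k] > 0)
    if not freq:
        return {}
    max_freq = max(freq.values())
    return {k: (0.5 + 0.5 * f / max_freq) * math.log(n_docs / df[k])
            for k, f in freq.items()}


toy_docs = {"d1": ["a", "a", "b"], "d2": ["b", "c"]}
print(document_weights(toy_docs))
print(query_weights(["a", "b"], toy_docs))
```

A term that appears in every document (here "b") gets idf = log(1) = 0, so it contributes nothing to discriminating between documents.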

  9. Vector Model (4/4) • Advantages • its term-weighting scheme improves retrieval performance • its partial matching strategy allows retrieval of documents that approximate the query conditions • its cosine ranking formula sorts the documents according to their degree of similarity to the query • Disadvantage • The assumption of mutual independence between index terms
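
The partial matching and ranking advantages listed above can be seen in a small sketch. It assumes term weights are already available as sparse dictionaries (term to weight); the function names and the example vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse weight vectors (dicts term -> weight)."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank(doc_vectors, query_vector):
    """Return names of documents with non-zero similarity, most similar first."""
    scored = [(cosine(vec, query_vector), name)
              for name, vec in doc_vectors.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for score, name in scored if score > 0.0]


doc_vectors = {
    "d1": {"a": 1.0},            # matches only one query term
    "d2": {"a": 1.0, "b": 1.0},  # matches both query terms
    "d3": {"c": 1.0},            # matches none
}
print(rank(doc_vectors, {"a": 1.0, "b": 1.0}))
```

Unlike the Boolean model, d1 is still retrieved although it satisfies the query only partially; it is simply ranked below d2.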

  10. Probabilistic Model (1/7) • Introduced by Robertson and Sparck Jones, 1976 • Also called the binary independence retrieval (BIR) model • Idea: given a user query q and the ideal answer set of relevant documents, the problem is to specify the properties of this set • i.e., the probabilistic model tries to estimate the probability that the user will find the document dj relevant, and ranks documents by the ratio P(dj relevant to q)/P(dj nonrelevant to q)

  11. Probabilistic Model (2/7) • Definition • All index term weights are binary, i.e., $w_{i,j} \in \{0,1\}$ • Let R be the set of documents known to be relevant to query q • Let $\bar{R}$ be the complement of R • Let $P(R \mid d_j)$ be the probability that the document dj is relevant to the query q • Let $P(\bar{R} \mid d_j)$ be the probability that the document dj is nonrelevant to query q

  12. Probabilistic Model (3/7) • The similarity sim(dj,q) of the document dj to the query q is defined as the ratio $sim(d_j, q) = P(R \mid d_j) / P(\bar{R} \mid d_j)$ • Using Bayes’ rule, $sim(d_j, q) = \frac{P(d_j \mid R)\, P(R)}{P(d_j \mid \bar{R})\, P(\bar{R})}$ • P(R) stands for the probability that a document randomly selected from the entire collection is relevant • $P(d_j \mid R)$ stands for the probability of randomly selecting the document dj from the set R of relevant documents

  13. Probabilistic Model (4/7) • Assuming independence of index terms and given q=(d1, d2, …, dt),

  14. Probabilistic Model (5/7) • Pr(ki |R) stands for the probability that the index term ki is present in a document randomly selected from the set R • $\Pr(\bar{k}_i \mid R)$ stands for the probability that the index term ki is not present in a document randomly selected from the set R • Let Pr(ki |R) = pi • di is either 0 or 1 • 0: di is absent from q • 1: di is present in q

  15. Probabilistic Model (6/7)
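
The content of this slide was not preserved in the transcript. As a hedged sketch of the step it most likely covers, the standard binary independence derivation is given below. It assumes that $d_i = 1$ when term $k_i$ is present in the document (as on the following slide), that $p_i = \Pr(k_i \mid R)$, and that $q_i = \Pr(k_i \mid \bar{R})$; the last definition is an assumption, since the transcript defines only $p_i$ explicitly.

```latex
\begin{align*}
\frac{P(d_j \mid R)}{P(d_j \mid \bar{R})}
  &= \prod_{i=1}^{t}
     \frac{p_i^{\,d_i} (1 - p_i)^{1 - d_i}}{q_i^{\,d_i} (1 - q_i)^{1 - d_i}} \\
\log \frac{P(d_j \mid R)}{P(d_j \mid \bar{R})}
  &= \sum_{i=1}^{t} d_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)}
   + \sum_{i=1}^{t} \log \frac{1 - p_i}{1 - q_i}
\end{align*}
```

The second sum does not depend on the document, so it can be dropped for ranking purposes; what remains is a sum of per-term weights over the terms present in the document.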

  16. Probabilistic Model (7/7) • The retrieval value of each ki present in a document (i.e., di=1) is the term relevance weight • Initial estimates: pj = 0.5, qj = dfj / N
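
The weight formula itself is missing from the transcript. The sketch below assumes the standard term relevance weight log(p(1 - q) / (q(1 - p))) and the estimates given on this slide (p = 0.5, q = df / N), under which the weight reduces to log((N - df) / df). Function names and the example numbers are hypothetical, and the natural logarithm is an arbitrary choice that does not affect the ranking.

```python
import math


def term_relevance_weight(p, q):
    """Assumed standard form: log of the odds ratio p(1 - q) / (q(1 - p)).

    p: probability the term occurs in a relevant document (0 < p < 1)
    q: probability the term occurs in a nonrelevant document (0 < q < 1)
    """
    return math.log((p * (1.0 - q)) / (q * (1.0 - p)))


def initial_weight(df, n_docs):
    """Weight under the slide's estimates p = 0.5 and q = df / N."""
    return term_relevance_weight(0.5, df / n_docs)


def score(doc_terms, query_terms, df, n_docs):
    """Sum the weights of query terms present in the document.

    Terms with df == 0 or df == N are skipped because the weight is undefined there.
    """
    shared = set(doc_terms) & set(query_terms)
    return sum(initial_weight(df[k], n_docs)
               for k in shared if 0 < df.get(k, 0) < n_docs)


# Hypothetical statistics: N = 100 documents.
df = {"retrieval": 10, "model": 50, "the": 100}
print(initial_weight(10, 100))
print(score({"retrieval", "model", "the"}, {"retrieval", "the"}, df, 100))
```

A rare term (df = 10) gets a positive weight, a term in half of the documents gets weight 0, and a term in more than half would get a negative weight under these initial estimates.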

  17. Estimation of Term Relevance
