Information Retrieval Modeling
CS 652 Information Extraction and Integration
Introduction
• IR systems usually adopt index terms to index documents & process queries
• Index term:
  • a keyword or group of selected/related words
  • any word (in general)
• Problems of IR based on index terms:
  • oversimplification & loss of semantics
  • badly formed queries
• Stemming might be used:
  • connect: connecting, connection, connections
• An inverted file is built for the chosen index terms (see the sketch below)
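A minimal sketch of how an inverted file might be built over stemmed index terms. The crude suffix-stripping stemmer and the toy document collection are illustrative assumptions, not part of the slides; a real system would use a stemmer such as Porter's algorithm.

```python
from collections import defaultdict

def stem(word):
    # Crude suffix stripping for illustration only.
    for suffix in ("ions", "ion", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def build_inverted_file(docs):
    # Map each index term to the set of document ids that contain it.
    inverted = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            inverted[stem(word)].add(doc_id)
    return inverted

docs = {1: "connecting networks", 2: "network connections", 3: "graph theory"}
index = build_inverted_file(docs)
print(index["connect"])   # documents containing some variant of "connect" -> {1, 2}
```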
Introduction
(Figure: documents are represented by index terms; a user's information need is expressed as a query; retrieval matches the query against the docs' index terms and ranks the results)
Introduction
• Matching at the index term level is quite imprecise
• Consequence: no surprise that users frequently end up unsatisfied
• Since most users have no training in query formulation, the problem is even worse
• Result: frequent dissatisfaction of Web users
• The issue of deciding relevance is critical for IR systems: ranking
Introduction
• A ranking is an ordering of the retrieved documents that (hopefully) reflects their relevance to the user query
• A ranking is based on fundamental premises regarding the notion of relevance, such as:
  • common sets of index terms
  • sharing of weighted terms
  • likelihood of relevance
• Each set of premises leads to a distinct IR model
IR Models
(Figure: taxonomy of IR models organized by user task)
• Retrieval (ad hoc, filtering):
  • Classic models: Boolean, Vector (Space), Probabilistic
  • Set theoretic: Fuzzy, Extended Boolean
  • Algebraic: Generalized Vector (Space), Latent Semantic Indexing, Neural Networks
  • Probabilistic: Inference Network, Belief Network
  • Structured models: Non-Overlapping Lists, Proximal Nodes
• Browsing: Flat, Structure Guided, Hypertext
IR Models
• The IR model, the logical view of the docs, and the retrieval task are distinct aspects of the system
Retrieval: Ad Hoc vs. Filtering
• Ad hoc retrieval: static set of documents & dynamic queries
(Figure: a collection of "fixed size" is queried by a stream of different queries Q1–Q5)
Retrieval: Ad Hoc vs. Filtering
• Filtering: dynamic set of documents & static queries
(Figure: an incoming stream of documents is matched against stored user profiles; each user receives only the documents filtered for his or her profile)
Classic IR Models – Basic Concepts
• Each document is described by a set of representative keywords or index terms
• Index terms are document words (i.e., nouns) which have meaning by themselves and capture the main themes of a document
• However, search engines assume that all words are index terms (full text representation)
• User profile construction process (see the sketch below):
  1. user provides keywords
  2. relevance feedback cycle
  3. system adjusts the user profile description
  4. repeat step (2) until the profile stabilizes
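A minimal sketch of the profile-construction loop above. The slides do not prescribe an update rule; the Rocchio-style adjustment, the parameter values, and the convergence threshold are illustrative assumptions.

```python
import numpy as np

def update_profile(profile, relevant, nonrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    # Rocchio-style adjustment (an assumption; any feedback rule could be used):
    # move the profile toward relevant docs and away from non-relevant ones.
    new = alpha * profile
    if len(relevant):
        new += beta * np.mean(relevant, axis=0)
    if len(nonrelevant):
        new -= gamma * np.mean(nonrelevant, axis=0)
    return np.clip(new, 0.0, None)   # keep term weights non-negative

def build_profile(initial_keyword_vector, feedback_rounds, eps=1e-3):
    # feedback_rounds: iterable of (relevant_doc_vectors, nonrelevant_doc_vectors) pairs
    profile = initial_keyword_vector.astype(float)          # step 1: user-provided keywords
    for relevant, nonrelevant in feedback_rounds:            # step 2: relevance feedback cycle
        new = update_profile(profile, relevant, nonrelevant) # step 3: system adjusts profile
        if np.linalg.norm(new - profile) < eps:              # step 4: stop when stabilized
            return new
        profile = new
    return profile
```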
Classic IR Models – Basic Concepts
• Not all terms are equally useful for representing the document contents: less frequent terms allow identifying a narrower set of documents
• The importance of an index term is represented by a weight associated with it
• Let
  • ki be an index term
  • dj be a document
  • wij be a weight associated with the pair (ki, dj), which quantifies the importance of ki for describing the contents of dj
Classic IR Models – Basic Concepts
• ki is a generic index term
• dj is a document
• t is the total number of index terms in an IR system
• K = (k1, k2, …, kt) is the set of all index terms
• wij ≥ 0 is a weight associated with the pair (ki, dj)
• wij = 0 indicates that ki does not belong to dj
• vec(dj) = (w1j, w2j, …, wtj) is the weighted vector associated with the document dj
• gi(vec(dj)) = wij is a function which returns the weight associated with the pair (ki, dj)
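A minimal sketch of these definitions as plain Python structures; the tiny vocabulary and weights are assumptions for illustration only.

```python
# Vocabulary K of t index terms (an illustrative assumption).
K = ("information", "retrieval", "model")
t = len(K)

# vec(dj): one weight per index term; wij = 0 means ki does not belong to dj.
d1 = (0.5, 0.8, 0.0)
d2 = (0.0, 0.3, 0.9)

def g(i, doc_vector):
    # gi(vec(dj)) = wij: return the weight of term ki in document dj.
    return doc_vector[i]

print(g(1, d1))   # weight of "retrieval" in d1 -> 0.8
```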
The Boolean Model
• Simple model based on set theory & Boolean algebra
• Queries specified as Boolean expressions
  • precise semantics
  • neat formalism
  • q = ka ∧ (kb ∨ ¬kc)
• Terms are either present or absent, i.e., the index term weights are binary: wij ∈ {0,1}
• Consider
  • q = ka ∧ (kb ∨ ¬kc)
  • vec(qdnf) = (1,1,1) ∨ (1,1,0) ∨ (1,0,0)
  • vec(qcc) = (1,1,0) is a conjunctive component
The Boolean Model
(Figure: Venn diagram of the term sets Ka, Kb, Kc with the conjunctive components (1,1,1), (1,1,0), (1,0,0) highlighted)
• q = ka ∧ (kb ∨ ¬kc)
• sim(q, dj) = 1 if ∃ vec(qcc) | (vec(qcc) ∈ vec(qdnf)) ∧ (∀ki, gi(vec(dj)) = gi(vec(qcc)))
               0 otherwise
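A minimal sketch of Boolean retrieval for the example query q = ka ∧ (kb ∨ ¬kc); the tiny document collection is an assumption for illustration.

```python
# Each document is the set of index terms it contains (binary weights).
docs = {
    "d1": {"ka", "kb"},          # matches conjunctive component (1,1,0)
    "d2": {"ka"},                # matches (1,0,0)
    "d3": {"ka", "kb", "kc"},    # matches (1,1,1)
    "d4": {"kb", "kc"},          # does not satisfy the query
}

def matches(terms):
    # q = ka AND (kb OR NOT kc)
    return ("ka" in terms) and (("kb" in terms) or ("kc" not in terms))

retrieved = [d for d, terms in docs.items() if matches(terms)]
print(retrieved)   # ['d1', 'd2', 'd3'] -- a yes/no decision, no ranking
```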
Drawbacks of the Boolean Model
• Retrieval is based on a binary decision criterion with no notion of partial matching
• No ranking of the documents is provided (absence of a grading scale)
• The information need has to be translated into a Boolean expression, which most users find awkward
• The Boolean queries formulated by users are most often too simplistic
• Using the exact matching approach, the model frequently returns either too few or too many documents in response to a user query
The Vector (Space) Model (VSM)
• Use of binary weights is too limiting
• Non-binary weights allow for partial matching
• These term weights are used to compute a degree of similarity between a query and a document
• A ranked set of documents provides better matching, i.e., partial matching vs. exact matching
• The answer set of the VSM is a lot more precise than the answer set retrieved by the Boolean model
The Vector (Space) Model
• Define:
  • wij > 0 whenever ki ∈ dj
  • wiq ≥ 0 is the weight associated with the pair (ki, q)
  • vec(dj) = (w1j, w2j, ..., wtj), the document vector of dj
  • vec(q) = (w1q, w2q, ..., wtq), the query vector of q
• With each term ki is associated a unitary vector vec(i)
• The unitary vectors vec(i) and vec(j) are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents)
• The t unitary vectors vec(i) form an orthonormal basis for a t-dimensional space
• In this space, queries and documents are represented as weighted vectors
The Vector (Space) Model
(Figure: document vector dj and query vector q in term space, separated by angle θ)
• sim(q, dj) = cos(θ) = (vec(dj) • vec(q)) / (|dj| × |q|)
             = (Σi=1..t wij × wiq) / (√(Σi=1..t wij²) × √(Σi=1..t wiq²))
  where • is the inner product operator and |q| is the length of q
• Since wij ≥ 0 and wiq ≥ 0, we have 0 ≤ sim(q, dj) ≤ 1
• A document is retrieved even if it matches the query terms only partially
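A minimal sketch of the cosine similarity above over plain Python lists; the example vectors are assumptions for illustration.

```python
import math

def cosine_sim(doc_vec, query_vec):
    # sim(q, dj) = (dj . q) / (|dj| |q|); returns 0 if either vector is all zeros.
    dot = sum(w_d * w_q for w_d, w_q in zip(doc_vec, query_vec))
    norm_d = math.sqrt(sum(w * w for w in doc_vec))
    norm_q = math.sqrt(sum(w * w for w in query_vec))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

d1 = [0.5, 0.8, 0.0]   # weights of the t index terms in document d1
q  = [0.0, 1.0, 0.4]   # weights of the same terms in the query
print(cosine_sim(d1, q))   # a partial match still yields a non-zero similarity
```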
The Vector (Space) Model
• sim(q, dj) = (Σi=1..t wij × wiq) / (|dj| × |q|)
• How do we compute the weights wij and wiq?
• A good weight must take into account two effects:
  • quantification of intra-document contents (similarity)
    • tf factor, the term frequency within a document
  • quantification of inter-document separation (dissimilarity)
    • idf factor, the inverse document frequency
• wij = tf(i, j) × idf(i)
The Vector (Space) Model
• Let
  • N be the total number of documents in the collection
  • ni be the number of documents which contain ki
  • freq(i, j) be the raw frequency of ki within dj
• A normalized tf factor is given by
  • f(i, j) = freq(i, j) / maxl freq(l, j), where the maximum is computed over all terms which occur within the document dj
• The inverse document frequency (idf) factor is
  • idf(i) = log(N / ni)
  • the log is used to make the values of tf and idf comparable; it can also be interpreted as the amount of information associated with term ki
The Vector (Space) Model
• The best term-weighting schemes use weights which are given by
  • wij = f(i, j) × log(N / ni)
  • this strategy is called a tf-idf weighting scheme
• For the query term weights, a suggestion is
  • wiq = (0.5 + [0.5 × freq(i, q) / maxl freq(l, q)]) × log(N / ni)
• The vector model with tf-idf weights is a good ranking strategy for general collections
• The VSM is usually as good as the known ranking alternatives; it is also simple and fast to compute
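A minimal sketch of the tf-idf weighting above (normalized tf times idf) for both documents and queries; the toy frequency tables and collection statistics are assumptions.

```python
import math

def tf_idf_weights(freq, N, n):
    # freq: raw term frequencies freq(i, j) of one document
    # N: total number of documents; n[i]: number of documents containing term ki
    max_freq = max(freq.values())
    return {i: (f / max_freq) * math.log(N / n[i]) for i, f in freq.items()}

def query_weights(freq, N, n):
    # wiq = (0.5 + 0.5 * freq(i, q) / max_l freq(l, q)) * log(N / ni)
    max_freq = max(freq.values())
    return {i: (0.5 + 0.5 * f / max_freq) * math.log(N / n[i]) for i, f in freq.items()}

N = 1000                                  # collection size (assumed)
n = {"retrieval": 50, "model": 400}       # document frequencies ni (assumed)
doc_freq = {"retrieval": 3, "model": 1}   # raw frequencies within one document
print(tf_idf_weights(doc_freq, N, n))
print(query_weights({"retrieval": 1}, N, n))
```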
The Vector (Space) Model
• Advantages:
  • term weighting improves the quality of the answer set
  • partial matching allows retrieval of documents that approximate the query conditions
  • the cosine ranking formula sorts documents according to their degree of similarity to the query
• A popular IR model because of its simplicity & speed
• Disadvantages:
  • assumes mutual independence of index terms (??); not clear that this is bad though
The Vector (Space) Model
(Figures: Examples I–III show documents d1–d7 plotted in the space spanned by the index terms k1, k2, k3)
Probabilistic Model
• Objective: to capture the IR problem using a probabilistic framework
• Given a user query, there is an ideal answer set
• Querying is a specification of the properties of this ideal answer set (clustering)
• But what are these properties?
  • Guess at the beginning what they could be (i.e., guess an initial description of the ideal answer set)
  • Improve by iteration
Probabilistic Model
• An initial set of documents is retrieved somehow
• The user inspects these docs looking for the relevant ones (in truth, only the top 10–20 need to be inspected)
• The IR system uses this information to refine the description of the ideal answer set
• By repeating this process, it is expected that the description of the ideal answer set will improve
• Keep in mind that the description of the ideal answer set has to be guessed at the very beginning
• The description of the ideal answer set is modeled in probabilistic terms
Probabilistic Ranking Principle
• Given a user query q and a document d in a collection, the probabilistic model tries to estimate the probability that the user will find the document d interesting (i.e., relevant). The model assumes that this probability of relevance depends only on the query and the document representations. The ideal answer set R should maximize the probability of relevance; documents in R are predicted to be relevant to q.
• Problems:
  • How do we compute the probabilities of relevance?
  • What is the sample space?
The Ranking
• The probabilistic ranking is computed as:
  • sim(q, dj) = P(dj relevant to q) / P(dj non-relevant to q)
  • this is the odds of the document dj being relevant to q
  • taking the odds minimizes the probability of an erroneous judgement
• Definitions:
  • wij, wiq ∈ {0,1}, i.e., all index term weights are binary
  • P(R | vec(dj)): probability that doc dj is relevant to q, where R is the (known) set of relevant docs
  • P(R̄ | vec(dj)): probability that doc dj is not relevant to q, where R̄ is the complement of R
The Ranking
• sim(dj, q) = P(R | vec(dj)) / P(R̄ | vec(dj))
             = [P(vec(dj) | R) × P(R)] / [P(vec(dj) | R̄) × P(R̄)]   (Bayes' rule)
             ~ P(vec(dj) | R) / P(vec(dj) | R̄)
  where
  • P(vec(dj) | R): probability of randomly selecting the document dj from the set R of relevant documents
  • P(R): probability that a randomly selected doc from the collection is relevant
  • the definitions of P(vec(dj) | R̄) and P(R̄) are analogous
The Ranking
• sim(dj, q) ~ P(vec(dj) | R) / P(vec(dj) | R̄)
             ~ [Π{gi(vec(dj))=1} P(ki | R) × Π{gi(vec(dj))=0} P(k̄i | R)] /
               [Π{gi(vec(dj))=1} P(ki | R̄) × Π{gi(vec(dj))=0} P(k̄i | R̄)]
  where
  • P(ki | R): probability that the index term ki is present in a document randomly selected from the set R of relevant documents
  • P(k̄i | R): probability that ki is not present in a document randomly selected from R
The Ranking
• sim(dj, q) ~ log { [Π P(ki | R) × Π P(k̄i | R)] / [Π P(ki | R̄) × Π P(k̄i | R̄)] }
             ~ K × [ Σ log (P(ki | R) / P(k̄i | R)) + Σ log (P(k̄i | R̄) / P(ki | R̄)) ]
             ~ Σi=1..t wiq × wij × [ log (P(ki | R) / (1 − P(ki | R))) + log ((1 − P(ki | R̄)) / P(ki | R̄)) ]
  where K is a constant that does not depend on the document dj
The Initial Ranking
• sim(dj, q) ~ Σ wiq × wij × [ log (P(ki | R) / (1 − P(ki | R))) + log ((1 − P(ki | R̄)) / P(ki | R̄)) ]
• How do we get the probabilities P(ki | R) and P(ki | R̄)?
• Estimates based on assumptions:
  • P(ki | R) = 0.5
  • P(ki | R̄) = ni / N
  where ni is the number of docs that contain ki and N is the number of docs in the collection
• Use this initial guess to retrieve an initial ranking
• Improve upon this initial ranking
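A minimal sketch of the initial probabilistic ranking using the estimates P(ki | R) = 0.5 and P(ki | R̄) = ni / N; the collection statistics and sample documents below are assumptions.

```python
import math

def initial_prob_ranking(query_terms, docs, n, N):
    # docs: doc id -> set of index terms present (binary weights)
    # n[k]: number of docs in the whole collection containing term k; N: collection size
    scores = {}
    for doc_id, terms in docs.items():
        score = 0.0
        for k in query_terms & terms:    # only terms with wij = wiq = 1 contribute
            p_rel = 0.5                  # initial guess: P(ki | R) = 0.5
            p_nonrel = n[k] / N          # initial guess: P(ki | R-bar) = ni / N
            score += math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda s: s[1], reverse=True)

# Assumed toy statistics: 100 docs overall, three of them shown here.
N = 100
n = {"ir": 10, "model": 40}
docs = {"d1": {"ir", "model"}, "d2": {"ir"}, "d3": {"model", "database"}}
print(initial_prob_ranking({"ir", "model"}, docs, n, N))   # d1 ranked above d2 and d3
```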
Improving the Initial Ranking
• sim(dj, q) ~ Σ wiq × wij × [ log (P(ki | R) / (1 − P(ki | R))) + log ((1 − P(ki | R̄)) / P(ki | R̄)) ]
• Let
  • V : set of docs initially retrieved & ranked
  • Vi : subset of V containing ki
• Re-evaluate the estimates:
  • P(ki | R) = Vi / V   (approximate P(ki | R) by the distribution of ki among the docs in V)
  • P(ki | R̄) = (ni − Vi) / (N − V)   (assume all non-retrieved docs are irrelevant)
• Repeat recursively
Improving the Initial Ranking
• sim(dj, q) ~ Σ wiq × wij × [ log (P(ki | R) / (1 − P(ki | R))) + log ((1 − P(ki | R̄)) / P(ki | R̄)) ]
• To avoid problems with V = 1 & Vi = 0:
  • P(ki | R) = (Vi + 0.5) / (V + 1)
  • P(ki | R̄) = (ni − Vi + 0.5) / (N − V + 1)
• Or, alternatively:
  • P(ki | R) = (Vi + ni/N) / (V + 1)
  • P(ki | R̄) = (ni − Vi + ni/N) / (N − V + 1)
  (ni / N is an adjustment factor)
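A minimal sketch of one refinement step using the smoothed estimates above. Taking V to be the top-ranked documents from the previous pass is an assumption about how the feedback set is chosen; the toy statistics are carried over from the previous sketch.

```python
def refined_estimates(query_terms, top_docs, n, N):
    # top_docs: doc id -> set of index terms for the docs treated as relevant (the set V)
    # Returns smoothed estimates of P(ki | R) and P(ki | R-bar) for each query term.
    V = len(top_docs)
    p_rel, p_nonrel = {}, {}
    for k in query_terms:
        Vi = sum(1 for terms in top_docs.values() if k in terms)
        p_rel[k] = (Vi + 0.5) / (V + 1)                 # P(ki | R)
        p_nonrel[k] = (n[k] - Vi + 0.5) / (N - V + 1)   # P(ki | R-bar)
    return p_rel, p_nonrel

# Re-score the collection with the refined estimates, then repeat until the ranking stabilizes.
top_docs = {"d1": {"ir", "model"}, "d2": {"ir"}}        # V: top of the initial ranking
p_rel, p_nonrel = refined_estimates({"ir", "model"}, top_docs,
                                    n={"ir": 10, "model": 40}, N=100)
print(p_rel, p_nonrel)
```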
Pluses and Minuses
• Advantages:
  • documents are ranked in decreasing order of their probability of relevance
• Disadvantages:
  • need to guess the initial estimates for P(ki | R)
  • the method does not take into account tf and idf factors
  • assumes mutual independence of index terms (??); not clear that this is bad though
Brief Comparison of Classic Models
• The Boolean model does not provide for partial matches and is considered to be the weakest classic model
• Salton and Buckley did a series of experiments indicating that, in general, the vector (space) model outperforms the probabilistic model on general collections
• This also seems to be the view of the research community