Information Retrieval CSE 8337 Spring 2007 Modeling Queries and Documents Material for these slides obtained from: Modern Information Retrieval by Ricardo Baeza-Yates and Berthier Ribeiro-Neto http://www.sims.berkeley.edu/~hearst/irbook/ Introduction to Modern Information Retrieval by Gerard Salton and Michael J. McGill, McGraw-Hill, 1983.
Modeling TOC • Introduction • Classic IR Models • Boolean Model • Vector Model • Probabilistic Model • Set Theoretic Models • Fuzzy Set Model • Extended Boolean Model • Generalized Vector Model • Latent Semantic Indexing • Neural Network Model • Alternative Probabilistic Models • Inference Network • Belief Network
Introduction • IR systems usually adopt index terms to process queries • Index term: • a keyword or group of selected words • any word (more general) • Stemming might be used: • connect: connecting, connection, connections • An inverted file is built for the chosen index terms
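As a rough illustration of these two ideas, here is a minimal Python sketch; the toy stem() function is an assumption standing in for a real stemmer such as Porter's, and the sample documents are made up:

```python
# Minimal sketch: crude stemming plus an inverted file.
from collections import defaultdict

def stem(word):
    # toy suffix stripping; a real system would use e.g. the Porter stemmer
    for suffix in ("ions", "ion", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def build_inverted_file(docs):
    # docs: dict mapping doc id -> text; returns index term -> set of doc ids
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[stem(word)].add(doc_id)
    return index

docs = {1: "connecting networks", 2: "network connections", 3: "graph theory"}
print(build_inverted_file(docs)["connect"])   # {1, 2}
```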
Introduction [Figure: documents are represented by index terms; the user's information need is expressed as a query; matching the query against the index terms produces a ranking of documents]
Introduction • Matching at index term level is quite imprecise • No surprise that users are frequently unsatisfied • Since most users have no training in query formation, the problem is even worse • Frequent dissatisfaction of Web users • The issue of deciding relevance is critical for IR systems: ranking
IR Models [Figure: taxonomy of IR models organized by user task. Retrieval (ad hoc, filtering): classic models (Boolean, vector, probabilistic); set theoretic (fuzzy, extended Boolean); algebraic (generalized vector, latent semantic indexing, neural networks); probabilistic (inference network, belief network); structured models (non-overlapping lists, proximal nodes). Browsing: flat, structure guided, hypertext]
Classic IR Models - Basic Concepts • Each document is represented by a set of representative keywords or index terms • An index term is a document word useful for remembering the document's main themes • Usually, index terms are nouns because nouns have meaning by themselves • However, search engines assume that all words are index terms (full text representation)
Classic IR Models - Basic Concepts • The importance of the index terms is represented by weights associated with them • ki - an index term • dj - a document • wij - a weight associated with (ki,dj) • The weight wij quantifies the importance of the index term for describing the document contents
Classic IR Models - Basic Concepts • t is the total number of index terms • K = {k1, k2, …, kt} is the set of all index terms • wij >= 0 is a weight associated with (ki,dj) • wij = 0 indicates that the term does not belong to the doc • dj = (w1j, w2j, …, wtj) is a weighted vector associated with the document dj • gi(dj) = wij is a function which returns the weight associated with the pair (ki,dj)
The Boolean Model • Simple model based on set theory • Queries specified as Boolean expressions • precise semantics and neat formalism • Terms are either present or absent. Thus, wij ∈ {0,1} • Consider • q = ka ∧ (kb ∨ ¬kc) • qdnf = (1,1,1) ∨ (1,1,0) ∨ (1,0,0) • qcc = (1,1,0) is a conjunctive component
The Boolean Model [Venn diagram over Ka, Kb, Kc marking the conjunctive components (1,1,1), (1,1,0), and (1,0,0)] • q = ka ∧ (kb ∨ ¬kc) • sim(q,dj) = 1 if ∃ qcc | (qcc ∈ qdnf) ∧ (∀ki, gi(dj) = gi(qcc)); sim(q,dj) = 0 otherwise
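A minimal sketch of this matching rule; the three-term vocabulary (ka, kb, kc) follows the example above, and the sample documents are made up:

```python
# Sketch: Boolean retrieval of q = ka AND (kb OR NOT kc) via its DNF.
Q_DNF = {(1, 1, 1), (1, 1, 0), (1, 0, 0)}   # conjunctive components of q

def g(doc_terms, vocabulary=("ka", "kb", "kc")):
    # binary weight vector: gi(dj) = 1 iff index term ki occurs in dj
    return tuple(int(t in doc_terms) for t in vocabulary)

def sim(doc_terms):
    # 1 if the document's binary vector equals some conjunctive component
    return 1 if g(doc_terms) in Q_DNF else 0

print(sim({"ka", "kb"}))   # 1 -> matches component (1,1,0)
print(sim({"kb", "kc"}))   # 0 -> ka is absent, so no component matches
```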
Drawbacks of the Boolean Model • Retrieval based on binary decision criteria with no notion of partial matching • No ranking of the documents is provided • Information need has to be translated into a Boolean expression • The Boolean queries formulated by the users are most often too simplistic • As a consequence, the Boolean model frequently returns either too few or too many documents in response to a user query
The Vector Model • Use of binary weights is too limiting • Non-binary weights provide consideration for partial matches • These term weights are used to compute a degree of similarity between a query and each document • Ranked set of documents provides for better matching
The Vector Model • wij > 0 whenever ki appears in dj • wiq >= 0 is associated with the pair (ki,q) • dj = (w1j, w2j, ..., wtj) • q = (w1q, w2q, ..., wtq) • With each term ki is associated a unit vector • The unit vectors of any two distinct terms are assumed to be orthonormal (i.e., index terms are assumed to occur independently within the documents) • The t unit vectors form an orthonormal basis for a t-dimensional space in which queries and documents are represented as weighted vectors
The Vector Model [Figure: document vector dj and query vector q separated by angle θ] • sim(q,dj) = cos(θ) = (dj · q) / (|dj| · |q|) = Σi (wij · wiq) / (|dj| · |q|) • Since wij >= 0 and wiq >= 0, 0 <= sim(q,dj) <= 1 • A document is retrieved even if it matches the query terms only partially
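A small sketch of the cosine computation over explicit weight vectors; the weights below are made-up numbers for illustration:

```python
import math

def cosine_sim(d, q):
    # d, q: weight vectors (w1j, ..., wtj) and (w1q, ..., wtq) over t terms
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

# the document shares only the first term with the query, yet still scores > 0
print(round(cosine_sim([1.0, 0.5, 0.0], [0.8, 0.0, 0.3]), 3))   # ~0.837
```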
Weights wij and wiq? • One approach is to examine the frequency of the occurrence of a word in a document: • Absolute frequency: • tf factor, the term frequency within a document • freqi,j - raw frequency of ki within dj • Both very high-frequency and very low-frequency terms may not actually be significant • Relative frequency: tf divided by the number of words in the document • Normalized frequency: fi,j = freqi,j / (maxl freql,j)
Inverse Document Frequency • The importance of a term may depend more on how well it can distinguish between documents • Quantification of inter-document separation • Dissimilarity, not similarity • idf factor, the inverse document frequency
IDF • Let N be the total number of docs in the collection • Let ni be the number of docs which contain ki • The idf factor is computed as • idfi = log (N/ni) • the log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term ki. • IDF Example (logs base 10): • N=1000, n1=100, n2=500, n3=800 • idf1 = 3 - 2 = 1 • idf2 = 3 - 2.7 = 0.3 • idf3 = 3 - 2.9 = 0.1
The Vector Model • The best term-weighting schemes take both into account. • wij = fi,j * log(N/ni) • This strategy is called a tf-idf weighting scheme
The Vector Model • For the query term weights, a suggestion is • wiq = (0.5 + 0.5 · freqi,q / maxl freql,q) · log(N/ni) • The vector model with tf-idf weights is a good ranking strategy with general collections • The vector model is usually as good as any known ranking alternatives • It is also simple and fast to compute
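A sketch of both weighting formulas, reusing N and the ni values from the idf example above; base-10 logs match that example, and the raw frequencies are made up:

```python
import math

def doc_weights(freq, N, n):
    # wij = fij * log(N/ni), with fij = freqij / max_l freqlj
    max_f = max(freq)
    return [(f / max_f) * math.log10(N / ni) for f, ni in zip(freq, n)]

def query_weights(freq, N, n):
    # wiq = (0.5 + 0.5 * freqiq / max_l freqlq) * log(N/ni)
    max_f = max(freq)
    return [(0.5 + 0.5 * f / max_f) * math.log10(N / ni)
            for f, ni in zip(freq, n)]

N, n = 1000, [100, 500, 800]                       # idf example values
print([round(w, 2) for w in doc_weights([3, 1, 2], N, n)])    # [1.0, 0.1, 0.06]
print([round(w, 2) for w in query_weights([1, 0, 1], N, n)])  # [1.0, 0.15, 0.1]
```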
The Vector Model • Advantages: • term-weighting improves quality of the answer set • partial matching allows retrieval of docs that approximate the query conditions • cosine ranking formula sorts documents according to degree of similarity to the query • Disadvantages: • assumes independence of index terms, though it is not clear that this actually hurts retrieval quality
The Vector Model: Examples I-III [Figures: three worked examples placing seven documents d1-d7 in the space of the index terms k1, k2, k3, with the corresponding term-weight tables; figures omitted]
Probabilistic Model • Objective: to capture the IR problem using a probabilistic framework • Given a user query, there is an ideal answer set • Querying as specification of the properties of this ideal answer set (clustering) • But, what are these properties? • Guess at the beginning what they could be (i.e., guess initial description of ideal answer set) • Improve by iteration
Probabilistic Model • An initial set of documents is retrieved somehow • User inspects these docs looking for the relevant ones (in truth, only the top 10-20 need to be inspected) • IR system uses this information to refine the description of the ideal answer set • By repeating this process, it is expected that the description of the ideal answer set will improve • Keep in mind that at the very beginning the description of the ideal answer set has to be guessed • Description of the ideal answer set is modeled in probabilistic terms
Probabilistic Ranking Principle • Given a user query q and a document dj, the probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant). Ideal answer set is referred to as R and should maximize the probability of relevance. Documents in the set R are predicted to be relevant. • But, • how to compute probabilities? • what is the sample space?
The Ranking • Probabilistic ranking computed as: • sim(q,dj) = P(dj relevant to q) / P(dj non-relevant to q) • This is the odds of the document dj being relevant • Taking the odds minimizes the probability of an erroneous judgement • Definition: • wij ∈ {0,1} • P(R | dj): probability that the given doc is relevant • P(¬R | dj): probability that the doc is not relevant
The Ranking • sim(dj,q) = P(R | dj) / P(¬R | dj) = [P(dj | R) · P(R)] / [P(dj | ¬R) · P(¬R)] ~ P(dj | R) / P(dj | ¬R) • (by Bayes' rule; P(R) and P(¬R) are the same for every document, so they can be dropped for ranking purposes) • P(dj | R): probability of randomly selecting the document dj from the set R of relevant documents
The Ranking • Assuming independence of index terms: • sim(dj,q) ~ P(dj | R) / P(dj | ¬R) ~ [ Π(gi(dj)=1) P(ki | R) · Π(gi(dj)=0) P(¬ki | R) ] / [ Π(gi(dj)=1) P(ki | ¬R) · Π(gi(dj)=0) P(¬ki | ¬R) ] • P(ki | R): probability that the index term ki is present in a document randomly selected from the set R of relevant documents • P(¬ki | R): probability that ki is not present in such a document
The Ranking • Taking the log and using P(¬ki | R) = 1 - P(ki | R) and P(¬ki | ¬R) = 1 - P(ki | ¬R): • sim(dj,q) ~ K + Σ(gi(dj)=1) [ log( P(ki | R) / (1 - P(ki | R)) ) + log( (1 - P(ki | ¬R)) / P(ki | ¬R) ) ] • where K is a constant that is the same for all documents, given a fixed query
The Initial Ranking • sim(dj,q) ~ Σi wiq · wij · ( log( P(ki | R) / (1 - P(ki | R)) ) + log( (1 - P(ki | ¬R)) / P(ki | ¬R) ) ) • How to obtain the probabilities P(ki | R) and P(ki | ¬R)? • Initial estimates based on simplifying assumptions: • P(ki | R) = 0.5 • P(ki | ¬R) = ni / N • Use this initial guess to retrieve an initial ranking • Improve upon this initial ranking
Improving the Initial Ranking • Let • V: set of docs initially retrieved • Vi: subset of docs retrieved that contain ki • Re-evaluate the estimates: • P(ki | R) = Vi / V • P(ki | ¬R) = (ni - Vi) / (N - V) • Repeat recursively
Improving the Initial Ranking • To avoid problems with V = 1 and Vi = 0, add a small adjustment factor: • P(ki | R) = (Vi + 0.5) / (V + 1) • P(ki | ¬R) = (ni - Vi + 0.5) / (N - V + 1) • Alternatively, use ni/N as the adjustment factor: • P(ki | R) = (Vi + ni/N) / (V + 1) • P(ki | ¬R) = (ni - Vi + ni/N) / (N - V + 1)
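A sketch of the per-term weight under the initial estimates and the smoothed re-estimates above; natural logs and the example counts are assumptions for illustration:

```python
import math

def term_weight(p_rel, p_nonrel):
    # log( P(ki|R) / (1-P(ki|R)) ) + log( (1-P(ki|notR)) / P(ki|notR) )
    return (math.log(p_rel / (1 - p_rel))
            + math.log((1 - p_nonrel) / p_nonrel))

def initial_estimates(N, n_i):
    return 0.5, n_i / N                  # P(ki|R) = 0.5, P(ki|notR) = ni/N

def reestimate(V, V_i, N, n_i):
    # smoothed re-estimates from the initially retrieved set V
    p_rel = (V_i + 0.5) / (V + 1)
    p_nonrel = (n_i - V_i + 0.5) / (N - V + 1)
    return p_rel, p_nonrel

N, n_i = 1000, 100                       # collection size, docs containing ki
print(round(term_weight(*initial_estimates(N, n_i)), 2))   # ~2.2 (first pass)
print(round(term_weight(*reestimate(20, 8, N, n_i)), 2))   # ~1.88 (refined)
```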
Pluses and Minuses • Advantages: • Docs ranked in decreasing order of probability of relevance • Disadvantages: • need to guess initial estimates for P(ki | R) and P(ki | ¬R) • method does not take into account tf and idf factors
Brief Comparison of Classic Models • Boolean model does not provide for partial matches and is considered to be the weakest classic model • Salton and Buckley did a series of experiments that indicate that, in general, the vector model outperforms the probabilistic model with general collections • This seems also to be the view of the research community
Set Theoretic Models • The Boolean model imposes a binary criterion for deciding relevance • The question of how to extend the Boolean model to accommodate partial matching and a ranking has attracted considerable attention in the past • We discuss now two set theoretic models for this: • Fuzzy Set Model • Extended Boolean Model
Fuzzy Set Model • Matching of a document to the query terms is approximate or vague • This vagueness of document/query matching can be modeled using a fuzzy framework, as follows: • with each term is associated a fuzzy set • each doc has a degree of membership in this fuzzy set • Here, we discuss the model proposed by Ogawa, Morita, and Kobayashi (1991)
Fuzzy Set Theory • A fuzzy subset A of U is characterized by a membership function μA: U → [0,1] which associates with each element u of U a number μA(u) in the interval [0,1] • Definition • Let A and B be two fuzzy subsets of U. Also, let ¬A be the complement of A. Then, • μ¬A(u) = 1 - μA(u) • μA∪B(u) = max(μA(u), μB(u)) • μA∩B(u) = min(μA(u), μB(u))
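A direct transcription of these three operations; representing fuzzy subsets as dicts from element to membership degree is an illustrative choice:

```python
def f_not(A):
    # complement: membership 1 - mu_A(u)
    return {u: 1 - m for u, m in A.items()}

def f_union(A, B):
    # union: max of the memberships
    return {u: max(A.get(u, 0), B.get(u, 0)) for u in A.keys() | B.keys()}

def f_intersect(A, B):
    # intersection: min of the memberships
    return {u: min(A.get(u, 0), B.get(u, 0)) for u in A.keys() | B.keys()}

A = {"d1": 0.8, "d2": 0.3}
B = {"d1": 0.5, "d2": 0.9}
print(f_union(A, B))             # d1 -> 0.8, d2 -> 0.9
print(f_intersect(A, f_not(A)))  # d1 -> 0.2, d2 -> 0.3: A and not-A overlap
```

Note that, unlike classical sets, A ∩ ¬A is not empty here; this is exactly the relaxation that lets fuzzy IR produce graded, rankable memberships.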
Fuzzy Information Retrieval • Fuzzy sets are modeled based on a thesaurus • This thesaurus is built as follows: • Let c be a term-term correlation matrix • Let ci,l be a normalized correlation factor for (ki,kl): • ci,l = ni,l / (ni + nl - ni,l) • ni: number of docs which contain ki • nl: number of docs which contain kl • ni,l: number of docs which contain both ki and kl • We now have the notion of proximity among index terms.
Fuzzy Information Retrieval • The correlation factors ci,l can be used to define fuzzy set membership for a document dj as follows: • μi,j = 1 - Π(kl ∈ dj) (1 - ci,l) • μi,j: membership of doc dj in the fuzzy subset associated with ki • The above expression computes an algebraic sum over all terms in the doc dj, implemented as the complement of a negated algebraic product
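A sketch of the correlation factor and the membership computation; the term names and correlation values below are hypothetical:

```python
def correlation(n_i, n_l, n_il):
    # c_il = n_il / (n_i + n_l - n_il)
    return n_il / (n_i + n_l - n_il)

def membership(c_row, doc_terms):
    # mu_ij = 1 - product over terms kl in dj of (1 - c_il)
    prod = 1.0
    for k_l in doc_terms:
        prod *= 1 - c_row.get(k_l, 0.0)
    return 1 - prod

print(round(correlation(100, 80, 60), 2))     # 0.5

# hypothetical correlations of the term "database" with other terms
c_database = {"sql": 0.7, "index": 0.4, "poetry": 0.0}
print(round(membership(c_database, ["sql", "index"]), 2))   # 0.82
print(round(membership(c_database, ["poetry"]), 2))         # 0.0
```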
Fuzzy Information Retrieval • A doc dj belongs to the fuzzy set associated with ki if its own terms are related to ki • If doc dj contains a term kl which is closely related to ki, we have • ci,l ~ 1 • hence μi,j ~ 1 • and the index term ki is a good fuzzy index for the doc
Fuzzy IR: An Example [Venn diagram over Ka, Kb, Kc marking the conjunctive components cc1, cc2, cc3] • q = ka ∧ (kb ∨ ¬kc) • qdnf = (1,1,1) + (1,1,0) + (1,0,0) = cc1 + cc2 + cc3 • μq,j = μcc1+cc2+cc3,j = 1 - (1 - μa,j μb,j μc,j) · (1 - μa,j μb,j (1 - μc,j)) · (1 - μa,j (1 - μb,j) (1 - μc,j))
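A sketch that evaluates this membership expression for the example query; the three membership degrees passed in are made-up values:

```python
# mu of dj in q = ka AND (kb OR NOT kc), via the three conjunctive components
def mu_q(mu_a, mu_b, mu_c):
    cc1 = mu_a * mu_b * mu_c               # component (1,1,1)
    cc2 = mu_a * mu_b * (1 - mu_c)         # component (1,1,0)
    cc3 = mu_a * (1 - mu_b) * (1 - mu_c)   # component (1,0,0)
    # algebraic sum: complement of the product of complements
    return 1 - (1 - cc1) * (1 - cc2) * (1 - cc3)

print(round(mu_q(0.9, 0.8, 0.1), 3))   # 0.726: strong ka, kb; weak kc
print(round(mu_q(0.0, 0.8, 0.1), 3))   # 0.0: ka absent, every component dies
```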
Fuzzy Information Retrieval • Fuzzy IR models have been discussed mainly in the literature associated with fuzzy theory • Experiments with standard test collections are not available • Difficult to compare at this time
Extended Boolean Model • Boolean model is simple and elegant. • But, no provision for a ranking • As with the fuzzy model, a ranking can be obtained by relaxing the condition on set membership • Extend the Boolean model with the notions of partial matching and term weighting • Combine characteristics of the Vector model with properties of Boolean algebra
The Idea • The Extended Boolean Model (introduced by Salton, Fox, and Wu, 1983) is based on a critique of a basic assumption in Boolean algebra • Let • q = kx ∧ ky • wx,j = (fx,j · idfx) / maxi idfi be the weight associated with [kx,dj] • For brevity, let wx,j = x and wy,j = y
The Idea: qand = kx ∧ ky; wxj = x and wyj = y [Figure: document dj plotted as the point (x, y) in the kx-ky plane; for an AND query, (1,1) is the most desirable spot] • sim(qand,dj) = 1 - sqrt( ((1 - x)² + (1 - y)²) / 2 )
The Idea: qor = kx ∨ ky; wxj = x and wyj = y [Figure: document dj plotted as the point (x, y) in the kx-ky plane; for an OR query, (0,0) is the spot to avoid] • sim(qor,dj) = sqrt( (x² + y²) / 2 )
Generalizing the Idea • We can extend the previous model to consider Euclidean distances in a t-dimensional space • This can be done using p-norms, which extend the notion of distance to include p-distances, where 1 <= p <= ∞ is a newly introduced parameter
Generalizing the Idea • A generalized disjunctive query is given by • qor = k1 ∨p k2 ∨p ... ∨p km • A generalized conjunctive query is given by • qand = k1 ∧p k2 ∧p ... ∧p km • sim(qor,dj) = ( (x1^p + x2^p + ... + xm^p) / m )^(1/p) • sim(qand,dj) = 1 - ( ((1 - x1)^p + (1 - x2)^p + ... + (1 - xm)^p) / m )^(1/p)
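A sketch of the two p-norm similarities; the example weights are arbitrary, and the loop shows how the behavior changes with p (p = 2 recovers the Euclidean formulas above, while large p approaches strict Boolean AND/OR):

```python
def sim_or(x, p):
    # ((x1^p + ... + xm^p) / m)^(1/p)
    m = len(x)
    return (sum(xi ** p for xi in x) / m) ** (1 / p)

def sim_and(x, p):
    # 1 - (((1-x1)^p + ... + (1-xm)^p) / m)^(1/p)
    m = len(x)
    return 1 - (sum((1 - xi) ** p for xi in x) / m) ** (1 / p)

weights = [0.9, 0.4]          # x = wx,j and y = wy,j for a two-term query
for p in (1, 2, 10):
    print(p, round(sim_or(weights, p), 3), round(sim_and(weights, p), 3))
# p = 1 treats AND and OR alike (both 0.65); as p grows,
# sim_or moves toward max(x, y) and sim_and toward min(x, y)
```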