Intelligent Information Retrieval Xiaoyong Du Ji-Rong Wen
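Teaching Style • Fundamentals plus special-topic lectures • The special-topic lectures are given by Microsoft researchers and are very dense in information • Assessment: • (1) Choose one of the special topics and write a survey-style report covering: What problem is studied? What is the theoretical foundation of the area? Where do the technical difficulties lie? What approaches and methods are currently used to address the problem? What experimental methods and evaluation frameworks are used to study these problems? State your own views. • (2) Hand the printed report to the teaching assistant in the last class; it will be distributed to the relevant instructors for grading. • (3) Ongoing assessment, based mainly on participation in discussions.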
Reading Materials • R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, ACM Press, 1999 • E. M. Voorhees, D. K. Harman, TREC: Experiment and Evaluation in Information Retrieval, The MIT Press, 2005 • K. Sparck Jones, P. Willett, Readings in Information Retrieval, Morgan Kaufmann, 1997 • Proceedings of SIGIR and SIGMOD
Course Content • For the course schedule, please see the course website
Course Website • http://iir.ruc.edu.cn/ • duyong@ruc.edu.cn • Information Building (信息楼), Room 0459
Classical Formal Information Retrieval Models Prof. Xiaoyong DU School of Information Renmin University of China
IR Model • Representation: how to represent documents and queries • Bag of words • Sequence of words • Links between documents • Semantic network • Similarity/relevance evaluation: sim(dj,q) = ?
Outline • Boolean Model(BM) • Vector Space Model(VSM) • Probabilistic Model(PM)
Classic IR Models - Basic Concepts • Bag-of-words model • Each document is represented by a set of representative keywords, or index terms • The importance of each index term is represented by a weight associated with it • Let • ki : an index term • dj : a document • t : the total number of index terms • K = {k1, k2, …, kt} : the set of all index terms
Classic IR Models - Basic Concepts • wij >= 0 : a weight associated with (ki,dj) The weight wij quantifies the importance of the index term for describing the document contents • wij = 0 indicates that term does not belong to doc • vec(dj) = (w1j, w2j, …, wtj) : a weighted vector associated with the document dj • gi(vec(dj)) = wij : a function which returns the weight of term ki in document dj
Classic IR Models - Basic Concepts • A ranking is an ordering of the documents retrieved that (hopefully) reflects the relevance of the documents to the user query • A ranking is based on fundamental premisses regarding the notion of relevance, such as: • common sets of index terms • sharing of weighted terms • likelihood of relevance • Each set of premisses leads to a distinct IR model
The Boolean Model • Simple model based on set theory • Queries are specified as Boolean expressions • precise semantics • neat formalism • e.g. $q = k_a \wedge (k_b \vee \neg k_c)$ • Terms are either present or absent, so $w_{ij} \in \{0,1\}$ • Consider • $q = k_a \wedge (k_b \vee \neg k_c)$ • $\vec{q}_{dnf} = (1,1,1) \vee (1,1,0) \vee (1,0,0)$ • $\vec{q}_{cc} = (1,1,0)$ is a conjunctive component
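The Boolean Model • $q = k_a \wedge (k_b \vee \neg k_c)$ • [Figure: Venn diagram of the sets $K_a$, $K_b$, $K_c$ with the regions (1,1,1), (1,1,0) and (1,0,0) marked] • $sim(q,d_j) = 1$ if $\exists\, \vec{q}_{cc} \mid (\vec{q}_{cc} \in \vec{q}_{dnf}) \wedge (\forall k_i,\ g_i(\vec{d}_j) = g_i(\vec{q}_{cc}))$ • $sim(q,d_j) = 0$ otherwise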
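The matching rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, and each document is reduced to its three bits for the query terms (ka, kb, kc).

```python
from itertools import product

def query(ka, kb, kc):
    """Hypothetical example query: ka AND (kb OR NOT kc)."""
    return bool(ka and (kb or not kc))

# Disjunctive normal form: every (ka, kb, kc) assignment that satisfies the query.
Q_DNF = {bits for bits in product((0, 1), repeat=3) if query(*bits)}

def boolean_sim(doc_bits, q_dnf):
    """Return 1 if the document's bits equal some conjunctive component, else 0."""
    return 1 if tuple(doc_bits) in q_dnf else 0

print(sorted(Q_DNF))
print(boolean_sim((1, 1, 0), Q_DNF), boolean_sim((1, 0, 1), Q_DNF))
```

Enumerating all assignments is exponential in the number of query terms; it is used here only because the example query has three terms.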
Drawbacks of the Boolean Model • Exact matching only: a document either satisfies the query or it does not • No ranking of the retrieved documents • Awkward: the information need has to be translated into a Boolean expression • Too simple: the Boolean queries formulated by users are most often too simplistic • Unsatisfactory results: the Boolean model frequently returns either too few or too many documents in response to a user query
Outline • Boolean Model(BM) • Vector Space Model(VSM) • Probabilistic Model(PM)
The Vector Model • Non-binary weights provide consideration for partial matches • These term weights are used to compute a degree of similarity between a query and each document • Ranked set of documents provides for better matching
The Vector Model • Define: • $w_{ij} > 0$ whenever $k_i \in d_j$ • $w_{iq} \ge 0$ associated with the pair $(k_i, q)$ • $\vec{d}_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$, $\vec{q} = (w_{1q}, w_{2q}, \ldots, w_{tq})$ • Index terms are assumed to occur independently within the documents, which means the term vectors are taken to be orthonormal • The t terms form an orthonormal basis for a t-dimensional space • In this space, queries and documents are represented as weighted vectors
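The Vector Model • [Figure: the vectors $\vec{d}_j$ and $\vec{q}$ plotted against the axes i and j, separated by the angle $\theta$] • $sim(q,d_j) = \cos(\theta) = \frac{\vec{d}_j \cdot \vec{q}}{|\vec{d}_j|\,|\vec{q}|} = \frac{\sum_{i=1}^{t} w_{ij}\, w_{iq}}{|\vec{d}_j|\,|\vec{q}|}$ • Since $w_{ij} \ge 0$ and $w_{iq} \ge 0$, $0 \le sim(q,d_j) \le 1$ • A document is retrieved even if it matches the query terms only partially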
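A minimal sketch of the cosine formula, assuming documents and queries are given as equal-length lists of non-negative weights. The function name and the zero-vector convention (return 0.0) are assumptions made for this example.

```python
import math

def cosine_sim(d, q):
    """Cosine of the angle between weight vectors d and q."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0  # assumed convention for an all-zero vector
    return dot / (norm_d * norm_q)

print(cosine_sim([1.0, 1.0], [1.0, 0.0]))  # partial match: strictly between 0 and 1
```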
The Vector Model • $sim(q,d_j) = \frac{\sum_{i} w_{ij}\, w_{iq}}{|\vec{d}_j|\,|\vec{q}|}$ • The key question is how to compute the weights $w_{ij}$ and $w_{iq}$ • A good weight must take into account two effects: • quantification of intra-document contents (similarity) • tf factor, the term frequency within a document • quantification of inter-document separation (dissimilarity) • idf factor, the inverse document frequency • TF*IDF formula: $w_{ij} = tf(i,j) \times idf(i)$
The Vector Model • Let • N be the total number of docs in the collection • $n_i$ be the number of docs which contain $k_i$ • $freq(i,j)$ be the raw frequency of $k_i$ within $d_j$ • A normalized tf factor is given by • $tf(i,j) = \frac{freq(i,j)}{\max_{l} freq(l,j)}$ • where the maximum is taken over all terms $k_l \in d_j$ • The idf factor is computed as • $idf(i) = \log(N/n_i)$ • The log is used to make the values of tf and idf comparable. It can also be interpreted as the amount of information associated with the term $k_i$.
The Vector Model • tf-idf weighting scheme • $w_{ij} = tf(i,j) \times \log(N/n_i)$ • The best term-weighting schemes use weights of this form • For the query term weights, • $w_{iq} = \left(0.5 + \frac{0.5\, freq(i,q)}{\max_{l} freq(l,q)}\right) \times \log(N/n_i)$ • or the weights are specified by the user • The vector model with tf-idf weights is a good ranking strategy for general collections • The vector model is usually as good as the known ranking alternatives. It is also simple and fast to compute.
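The weighting formulas can be tried on a toy collection. The sketch below is illustrative only: the function names and the three-document collection `DOCS` are made up, the natural logarithm is an assumption (the slides do not fix the base), and a term that occurs in no document is given idf 0 by convention.

```python
import math
from collections import Counter

def idf(term, docs):
    """log(N / n_i); returns 0.0 for a term that occurs in no document (assumed convention)."""
    n_i = sum(1 for d in docs if term in d)
    return math.log(len(docs) / n_i) if n_i else 0.0

def doc_weights(doc, docs):
    """w_ij = (freq / max freq in the document) * idf."""
    freq = Counter(doc)
    max_freq = max(freq.values())
    return {t: (f / max_freq) * idf(t, docs) for t, f in freq.items()}

def query_weights(query, docs):
    """w_iq = (0.5 + 0.5 * freq / max freq in the query) * idf."""
    freq = Counter(query)
    max_freq = max(freq.values())
    return {t: (0.5 + 0.5 * f / max_freq) * idf(t, docs) for t, f in freq.items()}

# Hypothetical toy collection: each document is a list of index terms.
DOCS = [["ir", "model", "model"], ["ir", "vector"], ["ir", "boolean", "model"]]
print(doc_weights(DOCS[0], DOCS))
print(query_weights(["vector", "model"], DOCS))
```

The term "ir" occurs in every document, so its idf, and therefore its weight, is zero: it cannot separate one document from another.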
The Vector Model • Advantages: • term-weighting improves quality of the answer set • partial matching allows retrieval of docs that approximate the query conditions • cosine ranking formula sorts documents according to degree of similarity to the query • Disadvantages: • assumes independence of index terms
Outline • Boolean Model(BM) • Vector Space Model(VSM) • Probabilistic Model(PM)
Probabilistic Model • Objective: to capture the IR problem using a probabilistic framework • Given a user query, there is an ideal answer set • Querying as specification of the properties of this ideal answer set (clustering) • But, what are these properties? • Guess at the beginning what they could be (i.e., guess initial description of ideal answer set) • Improve by iteration
Probabilistic Model • Basic ideas: • An initial set of documents is retrieved somehow • The user inspects these docs looking for the relevant ones (in practice, only the top 10-20 need to be inspected) • The IR system uses this information to refine the description of the ideal answer set • By repeating this process, it is expected that the description of the ideal answer set will improve • The description of the ideal answer set is modeled in probabilistic terms
Probabilistic Ranking Principle • The probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant). • The model assumes that this probability of relevance depends on the query and the document representations only. • Let R be the Ideal answer set. • But, • how to compute probabilities? • what is the sample space?
The Ranking • Probabilistic ranking computed as: • $sim(q,d_j) = \frac{P(d_j \text{ relevant to } q)}{P(d_j \text{ non-relevant to } q)}$ • Definitions: • $w_{ij} \in \{0,1\}$ • $P(R \mid \vec{d}_j)$ : probability that the given doc is relevant • $P(\bar{R} \mid \vec{d}_j)$ : probability that the doc is not relevant, where $\bar{R}$ is the complement of R
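The Ranking • $sim(d_j,q) = \frac{P(R \mid \vec{d}_j)}{P(\bar{R} \mid \vec{d}_j)} = \frac{P(\vec{d}_j \mid R)\, P(R)}{P(\vec{d}_j \mid \bar{R})\, P(\bar{R})} \sim \frac{P(\vec{d}_j \mid R)}{P(\vec{d}_j \mid \bar{R})}$ • The second step is Bayes' rule; $P(R)$ and $P(\bar{R})$ are the same for every document, so they can be dropped for ranking purposes • $P(\vec{d}_j \mid R)$ : probability of randomly selecting the document $d_j$ from the set R of relevant documents ($\bar{R}$ denotes the set of non-relevant documents)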
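The Ranking • $sim(d_j,q) \sim \frac{P(\vec{d}_j \mid R)}{P(\vec{d}_j \mid \bar{R})} \sim \frac{\left[\prod_{g_i(\vec{d}_j)=1} P(k_i \mid R)\right] \left[\prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i \mid R)\right]}{\left[\prod_{g_i(\vec{d}_j)=1} P(k_i \mid \bar{R})\right] \left[\prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i \mid \bar{R})\right]}$ • $P(k_i \mid R)$ : probability that the index term $k_i$ is present in a document randomly selected from the set R of relevant documents • $P(\bar{k}_i \mid R)$ : probability that $k_i$ is absent from such a document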
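The Ranking • $sim(d_j,q) \sim \log \frac{\left[\prod_{g_i(\vec{d}_j)=1} P(k_i \mid R)\right] \left[\prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i \mid R)\right]}{\left[\prod_{g_i(\vec{d}_j)=1} P(k_i \mid \bar{R})\right] \left[\prod_{g_i(\vec{d}_j)=0} P(\bar{k}_i \mid \bar{R})\right]}$ • $\sim \sum_{g_i(\vec{d}_j)=1} \left[ \log \frac{P(k_i \mid R)}{P(\bar{k}_i \mid R)} + \log \frac{P(\bar{k}_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right]$, after dropping the factors that are constant for all documents • $\sim \sum_{i=1}^{t} w_{iq}\, w_{ij} \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)$ • where $P(\bar{k}_i \mid R) = 1 - P(k_i \mid R)$ and $P(\bar{k}_i \mid \bar{R}) = 1 - P(k_i \mid \bar{R})$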
The Initial Ranking • $sim(d_j,q) \sim \sum_{i=1}^{t} w_{iq}\, w_{ij} \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)$ • How to compute the probabilities $P(k_i \mid R)$ and $P(k_i \mid \bar{R})$? • Estimates based on assumptions: • $P(k_i \mid R) = 0.5$ • $P(k_i \mid \bar{R}) = n_i / N$, where $n_i$ is the number of docs that contain $k_i$ and N is the total number of docs • Use this initial guess to retrieve an initial ranking • Improve upon this initial ranking
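With the initial estimates, the first logarithm vanishes (log of 0.5/0.5) and each matching query term contributes log((N - ni)/ni). A hedged sketch follows; the function names are hypothetical, weights are binary, and skipping terms that occur in no document or in every document is an assumption made here to avoid taking the log of zero or dividing by zero.

```python
import math

def term_weight(p_rel, p_nonrel):
    """log(p/(1-p)) + log((1-q)/q) for p = P(ki|R), q = P(ki|not R)."""
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)

def initial_sim(doc_terms, query_terms, n, N):
    """Initial ranking score with P(ki|R) = 0.5 and P(ki|not R) = n_i / N.

    n maps each term to the number of docs containing it; N is the collection size.
    """
    score = 0.0
    for k in query_terms:
        # Binary weights: only terms present in both query and document count.
        if k in doc_terms and 0 < n[k] < N:
            score += term_weight(0.5, n[k] / N)
    return score

n = {"a": 1, "b": 5}  # hypothetical document frequencies
print(initial_sim({"a", "b"}, ["a", "b"], n, 10))
```

In this toy setting the rare term "a" dominates the score, while "b", which appears in half of the collection, contributes nothing.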
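Improving the Initial Ranking • $sim(d_j,q) \sim \sum_{i=1}^{t} w_{iq}\, w_{ij} \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)$ • Let • V : set of docs initially retrieved • $V_i$ : subset of the retrieved docs that contain $k_i$ • Reevaluate the estimates (V and $V_i$ below stand for the sizes of these sets): • $P(k_i \mid R) = \frac{V_i}{V}$ • $P(k_i \mid \bar{R}) = \frac{n_i - V_i}{N - V}$ • Repeat recursively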
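Improving the Initial Ranking • $sim(d_j,q) \sim \sum_{i=1}^{t} w_{iq}\, w_{ij} \left( \log \frac{P(k_i \mid R)}{1 - P(k_i \mid R)} + \log \frac{1 - P(k_i \mid \bar{R})}{P(k_i \mid \bar{R})} \right)$ • To avoid problems with small values such as V = 1 and $V_i$ = 0: • $P(k_i \mid R) = \frac{V_i + 0.5}{V + 1}$ • $P(k_i \mid \bar{R}) = \frac{n_i - V_i + 0.5}{N - V + 1}$ • Also, • $P(k_i \mid R) = \frac{V_i + n_i/N}{V + 1}$ • $P(k_i \mid \bar{R}) = \frac{n_i - V_i + n_i/N}{N - V + 1}$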
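One feedback step with the 0.5-smoothed estimates might look like the sketch below. The function names and the counts in the example are hypothetical; V and V_i are passed as set sizes.

```python
import math

def smoothed_estimates(V_i, V, n_i, N):
    """Re-estimate P(ki|R) and P(ki|not R) from the retrieved set, with 0.5 smoothing."""
    p_rel = (V_i + 0.5) / (V + 1)
    p_nonrel = (n_i - V_i + 0.5) / (N - V + 1)
    return p_rel, p_nonrel

def term_weight(p_rel, p_nonrel):
    """Contribution of one matching term to sim(dj, q)."""
    return math.log(p_rel / (1 - p_rel)) + math.log((1 - p_nonrel) / p_nonrel)

# Hypothetical counts: 10 docs retrieved, 8 of them contain ki;
# ki occurs in 20 of the 100 docs in the collection.
p_rel, p_nonrel = smoothed_estimates(V_i=8, V=10, n_i=20, N=100)
print(p_rel, p_nonrel, term_weight(p_rel, p_nonrel))
```

The smoothing keeps both estimates strictly between 0 and 1 even when V = 1 and V_i = 0, so the logarithms stay finite.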
Discussion • Advantages: • Docs ranked in decreasing order of probability of relevance • Disadvantages: • need to guess initial estimates for P(ki | R) • method does not take into account tf and idf factors
Brief Comparison of Classic Models • Boolean model does not provide for partial matches and is considered to be the weakest classic model • Salton and Buckley did a series of experiments that indicate that, in general, the vector model outperforms the probabilistic model with general collections
Extension of Classical IR Models • Set Theoretic Models • Fuzzy Set Model • Extended Boolean Model • Algebraic Models • Generalized Vector Model • Latent Semantic Indexing
Set Theoretic Models • The question of how to extend the Boolean model to accommodate partial matching and a ranking has attracted considerable attention in the past • Two set theoretic models for this: • Fuzzy Set Model • Extended Boolean Model
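Fuzzy Set Model • In the Boolean model, queries and documents are both represented by sets of index terms. Are all of these index terms equally strongly related to the document? Intuitively, the answer is no. • The membership function of a fuzzy set expresses the degree of this relationship well: • each term is represented by a fuzzy set • each document has a degree of membership in the fuzzy set of that term • There are many different ways to construct the membership function. • The method presented here is that of Ogawa, Morita, and Kobayashi (1991)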
Fuzzy Set Theory • Definition • A fuzzy subset A of U is characterized by a membership function $\mu(A,u) : U \to [0,1]$ • Definition • Let A and B be two fuzzy subsets of U. Also, let ¬A be the complement of A. Then, • $\mu(\neg A,u) = 1 - \mu(A,u)$ • $\mu(A \cup B,u) = \max(\mu(A,u), \mu(B,u))$ • $\mu(A \cap B,u) = \min(\mu(A,u), \mu(B,u))$
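The three operations are easy to try out. In this sketch a fuzzy subset is a dict mapping elements to membership degrees; the function names are hypothetical, elements missing from a dict are assumed to have degree 0, and the complement is taken only over the elements listed.

```python
def fuzzy_not(a):
    """Complement: 1 - membership, over the elements listed in a."""
    return {u: 1.0 - m for u, m in a.items()}

def fuzzy_union(a, b):
    """Union: element-wise maximum of the membership degrees."""
    return {u: max(a.get(u, 0.0), b.get(u, 0.0)) for u in a.keys() | b.keys()}

def fuzzy_intersection(a, b):
    """Intersection: element-wise minimum of the membership degrees."""
    return {u: min(a.get(u, 0.0), b.get(u, 0.0)) for u in a.keys() | b.keys()}

A = {"d1": 0.75, "d2": 0.25}  # hypothetical membership degrees
B = {"d1": 0.5, "d3": 1.0}
print(fuzzy_not(A), fuzzy_union(A, B), fuzzy_intersection(A, B))
```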
Fuzzy Information Retrieval • Fuzzy sets are modeled based on a thesaurus • This thesaurus is built as follows: • Let $\vec{c}$ be a term-term correlation matrix • Let $c(i,l)$ be a normalized correlation factor for $(k_i, k_l)$: $c(i,l) = \frac{n(i,l)}{n_i + n_l - n(i,l)}$ • $n_i$ : number of docs which contain $k_i$ • $n_l$ : number of docs which contain $k_l$ • $n(i,l)$ : number of docs which contain both $k_i$ and $k_l$ • We now have the notion of proximity among index terms.
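As a rough illustration, the correlation factor can be computed by counting over a collection in which each document is a set of terms. The function name and the toy collection are hypothetical, and returning 0.0 when neither term occurs is an assumed convention.

```python
def correlation(docs, ki, kl):
    """c(i,l) = n(i,l) / (n_i + n_l - n(i,l)), counting documents."""
    n_i = sum(1 for d in docs if ki in d)
    n_l = sum(1 for d in docs if kl in d)
    n_il = sum(1 for d in docs if ki in d and kl in d)
    denom = n_i + n_l - n_il
    return n_il / denom if denom else 0.0

# Hypothetical toy collection: each document is a set of index terms.
DOCS = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "b", "c"}]
print(correlation(DOCS, "a", "b"), correlation(DOCS, "a", "c"))
```

The denominator counts the documents that contain at least one of the two terms, so c(i,l) is the fraction of those documents that contain both.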
Fuzzy Information Retrieval • The correlation factor $c(i,l)$ can be used to define the fuzzy set membership of a document $d_j$ with respect to a term $k_i$ as follows: $\mu(i,j) = 1 - \prod_{k_l \in d_j} (1 - c(i,l))$ • $\mu(i,j)$ : membership of doc $d_j$ in the fuzzy subset associated with $k_i$ • The above expression computes an algebraic sum over all terms in the doc $d_j$, written as the complement of a product of complements • A doc $d_j$ belongs to the fuzzy set for $k_i$ if its own terms are associated with $k_i$
Fuzzy Information Retrieval • $\mu(i,j) = 1 - \prod_{k_l \in d_j} (1 - c(i,l))$ • $\mu(i,j)$ : membership of doc $d_j$ in the fuzzy subset associated with $k_i$ • If doc $d_j$ contains a term $k_l$ which is closely related to $k_i$, we have • $c(i,l) \sim 1$ • $\mu(i,j) \sim 1$ • index $k_i$ is a good fuzzy index for the doc
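The membership formula can be sketched as follows, taking the correlation factor as a function argument. The names `membership`, `c`, and `TABLE`, and the correlation values, are hypothetical; `math.prod` requires Python 3.8 or later.

```python
import math

def membership(c, ki, doc_terms):
    """mu(i,j) = 1 - product over terms kl in the doc of (1 - c(ki, kl))."""
    return 1.0 - math.prod(1.0 - c(ki, kl) for kl in doc_terms)

# Hypothetical correlation table; a term is fully correlated with itself.
TABLE = {("x", "y"): 0.5, ("x", "z"): 0.5}

def c(ki, kl):
    if ki == kl:
        return 1.0
    return TABLE.get((ki, kl), TABLE.get((kl, ki), 0.0))

print(membership(c, "x", {"y", "z"}))  # related terms only
print(membership(c, "x", {"x", "w"}))  # the doc contains x itself
```

A single strongly related term drives the product toward 0 and the membership toward 1, which matches the remark that one closely related term is enough.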
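Fuzzy IR: An Example • [Figure: Venn diagram of $K_a$, $K_b$, $K_c$ with the regions cc1, cc2 and cc3 marked] • $q = k_a \wedge (k_b \vee \neg k_c)$ • $\vec{q}_{dnf} = (1,1,1) + (1,1,0) + (1,0,0) = \vec{cc}_1 + \vec{cc}_2 + \vec{cc}_3$ • $\mu(q,j) = \mu(cc_1 + cc_2 + cc_3, j) = 1 - \big(1 - \mu(a,j)\,\mu(b,j)\,\mu(c,j)\big) \times \big(1 - \mu(a,j)\,\mu(b,j)\,(1 - \mu(c,j))\big) \times \big(1 - \mu(a,j)\,(1 - \mu(b,j))\,(1 - \mu(c,j))\big)$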
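A small numeric check of the example, assuming the three memberships of the document in the fuzzy sets of ka, kb and kc are already known. The function name is hypothetical.

```python
def fuzzy_query_membership(mu_a, mu_b, mu_c):
    """Membership of a doc in the fuzzy set of q = ka AND (kb OR NOT kc)."""
    cc1 = mu_a * mu_b * mu_c              # component (1,1,1)
    cc2 = mu_a * mu_b * (1 - mu_c)        # component (1,1,0)
    cc3 = mu_a * (1 - mu_b) * (1 - mu_c)  # component (1,0,0)
    # Algebraic sum of the three conjunctive components.
    return 1 - (1 - cc1) * (1 - cc2) * (1 - cc3)

print(fuzzy_query_membership(0.5, 0.5, 0.5))
```

With crisp memberships (0 or 1) the value reduces to the Boolean result: 1 when the document matches one of the three components and 0 otherwise.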
Fuzzy Information Retrieval • Fuzzy IR models have been discussed mainly in the literature associated with fuzzy theory • Experiments with standard test collections are not available (!!)
Extended Boolean Model • How can documents be given a ranking on top of the Boolean model? • How to extend the model? • Interpret conjunctions and disjunctions in terms of Euclidean distances • Combine characteristics of the Vector model with properties of Boolean algebra
The Idea • The extended Boolean model (introduced by Salton, Fox, and Wu, 1983) is based on a critique of a basic assumption in Boolean algebra • Let • $q = k_x \wedge k_y$ • $w_{xj} = tf(x,j) \times \frac{idf(x)}{\max_i idf(i)}$ be the weight associated with $[k_x, d_j]$ • Further, write $w_{xj} = x$ and $w_{yj} = y$
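The Idea: • $q_{and} = k_x \wedge k_y$; $w_{xj} = x$ and $w_{yj} = y$ • $sim(q_{and},d_j) = 1 - \sqrt{\frac{(1-x)^2 + (1-y)^2}{2}}$ • [Figure: AND case. Documents $d_j = (x, y)$ and $d_{j+1}$ plotted in the unit square with axes $k_x$ and $k_y$ and corners (0,0) and (1,1)] • The similarity decreases with the distance of the document from the point (1,1)
The Idea: • $q_{or} = k_x \vee k_y$; $w_{xj} = x$ and $w_{yj} = y$ • $sim(q_{or},d_j) = \sqrt{\frac{x^2 + y^2}{2}}$ • [Figure: OR case. Documents $d_j = (x, y)$ and $d_{j+1}$ plotted in the unit square with axes $k_x$ and $k_y$ and corners (0,0) and (1,1)] • The similarity increases with the distance of the document from the point (0,0)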
Generalizing the Idea • We can extend the previous model to consider Euclidean distances in a t-dimensional space • This can be done using p-norms, which extend the notion of distance to include p-distances, where $1 \le p \le \infty$ is a new parameter • A generalized disjunctive query with m terms is given by • $q_{or} = k_1 \vee^p k_2 \vee^p \ldots \vee^p k_m$ • A generalized conjunctive query with m terms is given by • $q_{and} = k_1 \wedge^p k_2 \wedge^p \ldots \wedge^p k_m$
Generalizing the Idea • $sim(q_{or},d_j) = \left( \frac{x_1^p + x_2^p + \ldots + x_m^p}{m} \right)^{1/p}$ • $sim(q_{and},d_j) = 1 - \left( \frac{(1-x_1)^p + (1-x_2)^p + \ldots + (1-x_m)^p}{m} \right)^{1/p}$
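The two p-norm formulas translate directly into code. This sketch assumes a finite p of at least 1 and a non-empty list x of term weights in [0,1]; the limiting case of infinite p is not handled, and the function names are hypothetical.

```python
def sim_or(x, p):
    """p-norm similarity for a generalized disjunctive query."""
    m = len(x)
    return (sum(xi ** p for xi in x) / m) ** (1.0 / p)

def sim_and(x, p):
    """p-norm similarity for a generalized conjunctive query."""
    m = len(x)
    return 1.0 - (sum((1.0 - xi) ** p for xi in x) / m) ** (1.0 / p)

weights = [1.0, 0.0]  # hypothetical doc: first query term fully present, second absent
for p in (1, 2, 10):
    print(p, sim_and(weights, p), sim_or(weights, p))
```

For p = 1 both formulas reduce to the average of the weights, so AND and OR are indistinguishable; as p grows the two scores move apart and the behaviour approaches that of the strict Boolean operators.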