Intelligent Information Retrieval Prof. Du Xiaoyong, Renmin University of China; Prof. Wen Ji-Rong, Microsoft Research Asia
Course History • On September 27, 2007, we and Microsoft Research Asia jointly held an "Academic Workshop on Web Data Management", at which Dr. Wen Ji-Rong was also appointed as an adjunct researcher; • In the spring semester of 2008, the course Intelligent Information Retrieval was offered for the first time in cooperation with Microsoft; nine researchers from MSRA gave eleven lectures in total; • Reference: "Microsoft's Forward-Looking Courses as Seen from Intelligent Information Retrieval", Computer Education (《计算机教育》)
Teaching Style • IR fundamentals + topic lectures • The topic lectures are given by Microsoft researchers and are highly information-dense • Assessment: • (1) Choose one topic and write a survey-style report covering: What is the research problem? What are the theoretical foundations of the area? Where are the technical difficulties? What are the main current approaches and methods? What are the experimental methodologies and evaluation systems used to study these problems? Give your own views. • (2) Hand in a printed copy of the report to the TA in the last class; reports will be distributed to the relevant instructors for review. • (3) Day-to-day assessment, mainly participation in discussions.
Course Content • Fundamentals • Basic models • Basic implementation techniques • Evaluation methods • Core techniques and systems • Ranking • Information Extraction • Log Mining • System Implementation • Applications • Image search • Multi-language search • Knowledge Management • ……
Reading Materials • R. Baeza-Yates, B. Ribeiro-Neto, Modern Information Retrieval, ACM Press, 1999 • E. M. Voorhees, D. K. Harman, TREC: Experiment and Evaluation in Information Retrieval, The MIT Press, 2005 • K. S. Jones, P. Willett, Readings in Information Retrieval, Morgan Kaufmann, 1997 • Proceedings of SIGIR, SIGMOD, WWW
Course Schedule • http://iir.ruc.edu.cn/ • Contact: • Du Xiaoyong duyong@ruc.edu.cn (Information Building 0459) • Wen Ji-Rong JRWEN@microsoft.com
Introduction to IR Prof. Xiaoyong Du
What is Information Retrieval? • Definition by comparison with related areas
What is IR? • Definition by examples: • Library system, like CALIS • Search engine, like Google, Baidu
What is IR? • Definition by content • IR = <D, Q, R(qi,dj)>, where • D: the document collection • Q: the set of user queries • R(qi,dj): the relevance/similarity degree between query qi and document dj
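To make the triple concrete, here is a minimal Python sketch; the toy documents, whitespace tokenization, and the overlap-based R are illustrative assumptions, not part of the course definition:

```python
# A minimal sketch of the IR triple <D, Q, R(qi, dj)>.
# Documents, query, and the overlap-based R are illustrative assumptions.

D = ["information retrieval course",
     "database systems",
     "web search engine"]
Q = ["web retrieval"]

def R(qi: str, dj: str) -> float:
    """Toy relevance: fraction of query terms occurring in the document."""
    q_terms = qi.split()
    return sum(t in dj.split() for t in q_terms) / len(q_terms)

# Rank D by relevance to the first query.
print(sorted(D, key=lambda d: R(Q[0], d), reverse=True))
```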
IR System Architecture • (architecture diagram) Components: User Interface, Ranker, Data Miner, Query Log, Feedback, Unstructured Data, Index, Web Crawler, Indexer, Extractor • The diagram contrasts the Classical IR pipeline with Web IR
Content • IR Model • System architecture • Evaluation and benchmark • Key techniques • Media-related operators • Indexing • IE • Classification and clustering • Link analysis • Relevance evaluation • ……
Related Area • Natural Language Processing • Large-scale distributed computing • Database • Data mining • Information Science • Artificial Intelligence • ……
IR Model • Representation: how to represent a document/query • Bag-of-words • Sequence-of-words • Links between documents • Semantic network • Similarity/relevance evaluation: sim(dj,q) = ?
Two Families of Models • Retrieval models based on text content • Boolean model • Vector space model • Probabilistic model • Statistical language model • Other retrieval models not based on content • Collaboration-based models • Link-analysis-based models • Association-based models
Classical IR Models - Basic Concepts • Bag-of-words model • Each document is represented by a set of representative keywords or index terms • The importance of an index term is represented by a weight associated with it • Let • ki: an index term • dj: a document • t: the total number of index terms • K = {k1, k2, …, kt}: the set of all index terms
Classic IR Models - Basic Concepts • wij >= 0: a weight associated with the pair (ki,dj); the weight wij quantifies the importance of the index term ki for describing the content of document dj • wij = 0 indicates that the term does not occur in the doc • vec(dj) = (w1j, w2j, …, wtj): the weighted vector associated with document dj • gi(vec(dj)) = wij: a function which returns the weight of term ki in document dj
Classical IR Models - Basic Concepts • A ranking is an ordering of the documents retrieved that (hopefully) reflects the relevance of the documents to the user query • A ranking is based on fundamental premises regarding the notion of relevance, such as: • common sets of index terms • sharing of weighted terms • likelihood of relevance • Each set of premises leads to a distinct IR model
The Boolean Model • A simple model based on set theory • Queries specified as Boolean expressions • precise semantics • neat formalism • q = ka ∧ (kb ∨ ¬kc) • Terms are either present or absent, thus wij ∈ {0,1} • Consider • q = ka ∧ (kb ∨ ¬kc) • vec(qdnf) = (1,1,1) ∨ (1,1,0) ∨ (1,0,0) (disjunctive normal form over (ka,kb,kc)) • vec(qcc) = (1,1,0) is a conjunctive component
Outline • Boolean Model(BM) • Vector Space Model(VSM) • Probabilistic Model(PM) • Language Model(LM)
The Boolean Model • (Venn diagram over Ka, Kb, Kc marking the conjunctive components (1,1,1), (1,1,0), and (1,0,0)) • q = ka ∧ (kb ∨ ¬kc) • sim(q,dj) = 1 if ∃ vec(qcc) such that vec(qcc) ∈ vec(qdnf) and ∀ki: gi(vec(dj)) = gi(vec(qcc)); 0 otherwise
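A minimal sketch of Boolean matching for this query; the toy documents are illustrative assumptions:

```python
# Boolean retrieval for q = ka AND (kb OR NOT kc).
# Documents are modeled as sets of terms; the corpus is illustrative.

docs = {
    "d1": {"ka", "kb"},          # corresponds to (1,1,0)
    "d2": {"ka", "kb", "kc"},    # corresponds to (1,1,1)
    "d3": {"ka"},                # corresponds to (1,0,0)
    "d4": {"kb", "kc"},          # no ka -> no match
}

def sim(doc_terms: set) -> int:
    """1 if the document satisfies q = ka AND (kb OR NOT kc), else 0."""
    return int("ka" in doc_terms and
               ("kb" in doc_terms or "kc" not in doc_terms))

for name, terms in docs.items():
    print(name, sim(terms))   # d1:1 d2:1 d3:1 d4:0
```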
Drawbacks of the Boolean Model • Exact matching only: no ranking • Awkward: the information need has to be translated into a Boolean expression • Too simple: the Boolean queries formulated by users are most often too simplistic • Unsatisfactory results: the Boolean model frequently returns either too few or too many documents in response to a user query
Outline • Boolean Model(BM) • Vector Space Model(VSM) • Probabilistic Model(PM) • Language Model (LM)
The Vector Model • Non-binary weights provide consideration for partial matches • These term weights are used to compute a degree of similarity between a query and each document • Ranked set of documents provides for better matching
The Vector Model • Define: • wij > 0 whenever ki ∈ dj • wiq >= 0, associated with the pair (ki,q) • vec(dj) = (w1j, w2j, ..., wtj); vec(q) = (w1q, w2q, ..., wtq) • Index terms are assumed to occur independently within the documents; that is, the vector space is orthonormal • The t terms form an orthonormal basis for a t-dimensional space • In this space, queries and documents are represented as weighted vectors
The Vector Model • (diagram: document vector dj and query vector q with angle θ between them) • sim(q,dj) = cos(θ) = (vec(dj) • vec(q)) / (|dj| · |q|) = Σi (wij · wiq) / (|dj| · |q|) • Since wij >= 0 and wiq >= 0, we have 0 <= sim(q,dj) <= 1 • A document is retrieved even if it matches the query terms only partially
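A minimal sketch of the cosine formula; the dict-based vectors and the example weights are assumptions for illustration:

```python
# Cosine similarity between weighted term vectors, as defined above.
# Vectors are dicts mapping term -> weight; values are illustrative.
import math

def cosine(d: dict, q: dict) -> float:
    """cos(theta) = sum_i(w_ij * w_iq) / (|d| * |q|)."""
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

dj = {"information": 0.8, "retrieval": 0.6}
q  = {"retrieval": 1.0, "model": 0.5}
print(cosine(dj, q))  # partial match still yields a nonzero score
```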
The Vector Model • sim(q,dj) = Σi (wij · wiq) / (|dj| · |q|) • The key question is how to compute the weights wij and wiq • A good weight must take into account two effects: • quantification of intra-document content (similarity): the tf factor, the term frequency within a document • quantification of inter-document separation (dissimilarity): the idf factor, the inverse document frequency • TF·IDF formula: wij = tf(i,j) · idf(i)
The Vector Model • Let • N be the total number of docs in the collection • ni be the number of docs which contain ki • freq(i,j) be the raw frequency of ki within dj • A normalized tf factor is given by • tf(i,j) = freq(i,j) / maxl freq(l,j), where the maximum is taken over all terms kl ∈ dj • The idf factor is computed as • idf(i) = log(N / ni) • the log is used to make the values of tf and idf comparable; it can also be interpreted as the amount of information associated with the term ki
The Vector Model • tf-idf weighting scheme: • wij = tf(i,j) · log(N / ni) • One of the best term-weighting schemes for the query term weights is • wiq = (0.5 + 0.5 · freq(i,q) / maxl freq(l,q)) · log(N / ni) • or the weights are specified by the user • The vector model with tf-idf weights is a good ranking strategy for general collections • The vector model is usually as good as the known ranking alternatives; it is also simple and fast to compute
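The tf-idf scheme above can be sketched as follows; the toy corpus and whitespace tokenization are assumptions:

```python
# tf-idf weighting as defined above:
#   tf(i,j) = freq(i,j) / max_l freq(l,j)
#   idf(i)  = log(N / n_i)
#   w_ij    = tf(i,j) * idf(i)
import math
from collections import Counter

docs = [
    "information retrieval models",
    "vector space model for retrieval",
    "probabilistic retrieval model",
]
N = len(docs)
tokenized = [d.split() for d in docs]
n = Counter(t for doc in tokenized for t in set(doc))  # n_i: docs containing k_i

def weights(doc_tokens: list) -> dict:
    freq = Counter(doc_tokens)
    max_freq = max(freq.values())
    return {t: (f / max_freq) * math.log(N / n[t]) for t, f in freq.items()}

for doc in tokenized:
    print(weights(doc))  # "retrieval" occurs in every doc, so its idf is 0
```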
The Vector Model • Advantages: • term-weighting improves quality of the answer set • partial matching allows retrieval of docs that approximate the query conditions • cosine ranking formula sorts documents according to degree of similarity to the query • Disadvantages: • assumes independence of index terms
Outline • Boolean Model(BM) • Vector Space Model(VSM) • Probabilistic Model(PM) • Language Model (LM)
Probabilistic Model • Objective: to capture the IR problem in a probabilistic framework • Given a user query, there is an ideal answer set • Querying is the specification of the properties of this ideal answer set (compare clustering) • But what are these properties? • Guess at the beginning what they could be (i.e., guess an initial description of the ideal answer set) • Improve by iteration
Probabilistic Model • Basic ideas: • An initial set of documents is retrieved somehow • The user inspects these docs looking for the relevant ones (in practice, only the top 10-20 need to be inspected) • The IR system uses this information to refine the description of the ideal answer set • By repeating this process, it is expected that the description of the ideal answer set will improve • The description of the ideal answer set is modeled in probabilistic terms
Probabilistic Ranking Principle • The probabilistic model tries to estimate the probability that the user will find the document dj interesting (i.e., relevant). • The model assumes that this probability of relevance depends on the query and the document representations only. • Let R be the Ideal answer set. • But, • how to compute probabilities? • what is the sample space?
The Ranking • Probabilistic ranking is computed as: • sim(q,dj) = P(dj relevant to q) / P(dj non-relevant to q) • Definitions: • wij ∈ {0,1} (binary index term weights) • P(R | vec(dj)): probability that the given doc is relevant • P(¬R | vec(dj)): probability that the given doc is not relevant
The Ranking • sim(dj,q) = P(R | vec(dj)) / P(¬R | vec(dj)) = [ P(vec(dj) | R) · P(R) ] / [ P(vec(dj) | ¬R) · P(¬R) ] ~ P(vec(dj) | R) / P(vec(dj) | ¬R) (since P(R) and P(¬R) are the same for all documents) • P(vec(dj) | R): probability of randomly selecting the document dj from the set R of relevant documents
The Ranking • sim(dj,q) ~ P(vec(dj) | R) / P(vec(dj) | ¬R) ~ [ ∏(ki: gi(dj)=1) P(ki|R) · ∏(ki: gi(dj)=0) P(¬ki|R) ] / [ ∏(ki: gi(dj)=1) P(ki|¬R) · ∏(ki: gi(dj)=0) P(¬ki|¬R) ] (assuming term independence) • P(ki|R): probability that the index term ki is present in a document randomly selected from the set R of relevant documents • P(¬ki|R): probability that ki is not present in such a document
The Ranking • sim(dj,q) ~ log { [ ∏ P(ki|R) · ∏ P(¬ki|R) ] / [ ∏ P(ki|¬R) · ∏ P(¬ki|¬R) ] } • Dropping a constant K that is the same for all documents: • sim(dj,q) ~ Σi wiq · wij · ( log[ P(ki|R) / (1 − P(ki|R)) ] + log[ (1 − P(ki|¬R)) / P(ki|¬R) ] ) • where P(¬ki|R) = 1 − P(ki|R) and P(¬ki|¬R) = 1 − P(ki|¬R)
The Initial Ranking • sim(dj,q) ~ Σi wiq · wij · ( log[ P(ki|R) / (1 − P(ki|R)) ] + log[ (1 − P(ki|¬R)) / P(ki|¬R) ] ) • How do we compute the probabilities P(ki|R) and P(ki|¬R)? • Initial estimates based on assumptions: • P(ki|R) = 0.5 • P(ki|¬R) = ni / N, where ni is the number of docs that contain ki • Use this initial guess to retrieve an initial ranking • Improve upon this initial ranking
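A minimal sketch of this initial ranking under the stated assumptions; the toy corpus, query, and binary weights wij = wiq = 1 for shared terms are illustrative:

```python
# Initial probabilistic ranking with P(ki|R) = 0.5 and P(ki|~R) = n_i / N.
import math
from collections import Counter

docs = [
    {"information", "retrieval", "models"},
    {"vector", "space", "model"},
    {"probabilistic", "retrieval", "model"},
]
N = len(docs)
n = Counter(t for d in docs for t in d)   # n_i: number of docs containing ki

def initial_score(query: set, doc: set) -> float:
    score = 0.0
    for ki in query & doc:                # w_iq = w_ij = 1 for shared terms
        p_r = 0.5                         # P(ki | R), initial assumption
        p_nr = n[ki] / N                  # P(ki | ~R), initial assumption
        score += math.log(p_r / (1 - p_r)) + math.log((1 - p_nr) / p_nr)
    return score

query = {"probabilistic", "retrieval"}
for i, d in enumerate(docs):
    print(i, initial_score(query, d))
```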
Improving the Initial Ranking • sim(dj,q) ~ Σi wiq · wij · ( log[ P(ki|R) / (1 − P(ki|R)) ] + log[ (1 − P(ki|¬R)) / P(ki|¬R) ] ) • Let • V: the number of docs initially retrieved (and assumed relevant) • Vi: the number of those docs that contain ki • Re-evaluate the estimates: • P(ki|R) = Vi / V • P(ki|¬R) = (ni − Vi) / (N − V) • Repeat recursively
Improving the Initial Ranking • To avoid problems when V = 1 or Vi = 0, add a small adjustment factor: • P(ki|R) = (Vi + 0.5) / (V + 1) • P(ki|¬R) = (ni − Vi + 0.5) / (N − V + 1) • Alternatively, use ni/N as the adjustment: • P(ki|R) = (Vi + ni/N) / (V + 1) • P(ki|¬R) = (ni − Vi + ni/N) / (N − V + 1)
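One feedback iteration with the smoothed estimates might look like this; the corpus and query repeat the previous sketch, and the choice of V as the single top-ranked document is an illustrative assumption:

```python
# One relevance-feedback iteration with the smoothed estimates above.
import math
from collections import Counter

docs = [
    {"information", "retrieval", "models"},
    {"vector", "space", "model"},
    {"probabilistic", "retrieval", "model"},
]
N = len(docs)
n = Counter(t for d in docs for t in d)
query = {"probabilistic", "retrieval"}

V_docs = [docs[2]]            # docs initially retrieved, assumed relevant
V = len(V_docs)

def feedback_score(doc: set) -> float:
    score = 0.0
    for ki in query & doc:
        Vi = sum(ki in d for d in V_docs)
        p_r = (Vi + 0.5) / (V + 1)               # P(ki | R)
        p_nr = (n[ki] - Vi + 0.5) / (N - V + 1)  # P(ki | ~R)
        score += math.log(p_r / (1 - p_r)) + math.log((1 - p_nr) / p_nr)
    return score

for i, d in enumerate(docs):
    print(i, feedback_score(d))  # re-ranked scores after one iteration
```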
Discussion • Advantages: • Docs ranked in decreasing order of probability of relevance • Disadvantages: • need to guess initial estimates for P(ki | R) • method does not take into account tf and idf factors
Outline • Boolean Model(BM) • Vector Space Model(VSM) • Probabilistic Model(PM) • Language Model (LM)
Document Representation • Bag-of-words • Bag-of-facts • Bag-of-sentences • Bag-of-nets
Document = Bag of Words? • Document = bag of sentences; sentence = word sequence • p(南京市长)·p(江大桥|南京市长) << p(南京市)·p(长江大桥|南京市) (the classic segmentation ambiguity of 南京市长江大桥: "Nanjing mayor Jiang Daqiao" vs. "Nanjing city, Yangtze River Bridge") • p(中国人民大学) >> p(中国大学人民) (word order matters: the correct name "Renmin University of China" vs. a scrambled word order)
What is a LM? • A "language" is a probability distribution over sequences from its alphabet; the distribution reflects how likely any symbol sequence is to be a sentence (or any other linguistic unit) of that language. This probability distribution is called a language model. • Given a language, we can estimate the probability of any "sentence" (symbol string) occurring in it. • For example, for English we would expect: p1(a quick brown dog) > p2(dog brown a quick) > p3(brown dog 棕熊) > p4(棕熊) • If p1 = p2 (word order is ignored), the model is called a first-order (unigram) language model; otherwise it is a higher-order language model
Basic Notation • M: the language we are trying to model; it can be thought of as a source • s: an observation (a string of tokens) • P(s|M): the probability of observing s in M, i.e., the probability of getting s when sampling randomly from M
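A minimal sketch contrasting a first-order (unigram) and a higher-order (bigram) estimate of P(s|M), echoing the p1 = p2 remark above; the toy corpus and the add-one-style smoothing are assumptions:

```python
# P(s|M) under a unigram and a bigram model. A unigram M assigns
# "a quick brown dog" and "dog brown a quick" the same probability;
# a bigram M does not. Corpus and smoothing are illustrative.
from collections import Counter

corpus = "a quick brown dog chased a quick brown fox".split()
unigram = Counter(corpus)
bigram = Counter(zip(corpus, corpus[1:]))
total = len(corpus)

def p_unigram(s: str) -> float:
    p = 1.0
    for w in s.split():
        p *= unigram[w] / total
    return p

def p_bigram(s: str) -> float:
    """Chain rule over bigrams with add-one-style smoothing for unseen pairs."""
    words = s.split()
    p = unigram[words[0]] / total
    for prev, w in zip(words, words[1:]):
        p *= (bigram[(prev, w)] + 1) / (unigram[prev] + len(unigram))
    return p

print(p_unigram("a quick brown dog"), p_unigram("dog brown a quick"))  # equal
print(p_bigram("a quick brown dog"), p_bigram("dog brown a quick"))    # differ
```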