Search Engine Technology
Yan Hongfei (闫宏飞), yhf@net.pku.edu.cn
Network Laboratory, Department of Computer Science, Peking University
December 24, 2004 @ CERNET2004
Outline
• How search engines work
• Information retrieval research and institutions
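Search Engines (Web Search Engines)
• Definition: a system that lets users submit queries, retrieves a list of web pages relevant to each query, and returns them in ranked order.
• Index creation methods
  • Manual indexing
  • Automatic indexing
• System architecture
  • Centralized architecture
  • Distributed architecture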
[Figure: a spectrum between two service extremes (Browsing Services, Search Engine Services) and between two semantics extremes (Web Pages, Bag of Words); the intermediate points are marked "???" on the slide.]
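The Three-Stage Workflow of a Search Engine
• Crawling
  • Batch crawling, incremental crawling; crawl targets, crawl strategies
• Preprocessing
  • Keyword extraction; duplicate page elimination; link analysis; indexing
• Serving
  • Query modes and matching; result ranking; document summaries
[Figure: Crawl → Organize → Serve]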
Architecture of the Distributed Web Crawling System
[Figure: a scheduling module, several coordinator processes (one per node), and several crawler processes.]
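The figure does not say how URLs are divided among the coordinator nodes. The following is a minimal sketch, assuming URLs are partitioned by a hash of the host name so that all pages of one site go to the same node. That policy is an assumption, not something the slides confirm, and `assign_node` and `partition` are hypothetical names.

```python
import hashlib
from urllib.parse import urlsplit


def assign_node(url: str, num_nodes: int) -> int:
    """Pick a coordinator node for a URL by hashing its host name (assumed policy)."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).digest()
    # Use the first four bytes of the digest as an unsigned integer.
    return int.from_bytes(digest[:4], "big") % num_nodes


def partition(urls, num_nodes):
    """Group URLs into one queue per coordinator node."""
    queues = {node: [] for node in range(num_nodes)}
    for url in urls:
        queues[assign_node(url, num_nodes)].append(url)
    return queues


if __name__ == "__main__":
    seeds = [
        "http://www.pku.edu.cn/",
        "http://www.pku.edu.cn/about/",
        "http://net.pku.edu.cn/",
    ]
    for node, queue in partition(seeds, 3).items():
        print(node, queue)
```

Hashing the host rather than the full URL keeps each site on one node, which makes per-site politeness limits easy to enforce locally.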
Tianwang (天网) Storage Format
version: 1.0 // version number
url: http://www.pku.edu.cn/ // URL
origin: http://www.somewhere.cn/ // original URL
date: Tue, 15 Apr 2003 08:13:06 GMT // time of harvest
ip: 162.105.129.12 // IP address
unzip-length: 30233 // if this field is present, the data part is compressed
length: 18133 // data length
// a blank line
XXXXXXXX // the data part starts here
XXXXXXXX
….
XXXXXXXX // end of data
// a trailing newline
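A record in this format can be read with a small parser. This is a sketch under stated assumptions: the `//` comments on the slide are annotations rather than part of the file, `length` counts bytes of the stored data part, and compressed data uses zlib (the slide does not name the compression algorithm). `read_record` is a hypothetical helper, not part of any Tianwang tool.

```python
import io
import zlib


def read_record(stream):
    """Read one record from a binary stream; return (headers, data) or None at EOF."""
    headers = {}
    while True:
        line = stream.readline()
        if not line:
            return None  # end of file before a complete header
        line = line.rstrip(b"\r\n")
        if not line:
            if headers:
                break  # the blank line ends the header block
            continue  # skip the newline that follows the previous record
        key, _, value = line.partition(b":")
        headers[key.strip().decode("ascii")] = value.strip().decode("utf-8")
    data = stream.read(int(headers["length"]))
    if "unzip-length" in headers:
        # Assumption: zlib; the slide only says the data is compressed.
        data = zlib.decompress(data)
    return headers, data


if __name__ == "__main__":
    body = b"<html>hello</html>"
    raw = (b"version: 1.0\nurl: http://www.pku.edu.cn/\nlength: %d\n\n" % len(body)) + body + b"\n"
    headers, data = read_record(io.BytesIO(raw))
    print(headers["url"], len(data))
```

Reading exactly `length` bytes, instead of scanning for a terminator, means the data part may contain blank lines or arbitrary binary content.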
File Organizations (Indexes)
• Choices for accessing data during query evaluation
  • Scan the entire collection
    • Typical in early (batch) retrieval systems
    • Computational and I/O costs are O(characters in collection)
    • Practical for only "small" text collections
    • Large memory systems make scanning feasible
  • Use indexes for direct access
    • Evaluation time O(query term occurrences in collection)
    • Practical for "large" collections
    • Many opportunities for optimization
  • Hybrids: Use small index, then scan a subset of the collection
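The three access choices can be contrasted on a toy collection. This is a minimal sketch; the documents and the function names (`scan`, `build_index`, `lookup`, `hybrid`) are made up for illustration.

```python
from collections import defaultdict

# Toy collection (illustrative data, not from the slides).
docs = {
    1: "web search engines index pages",
    2: "early systems scan the collection",
    3: "an index gives direct access",
}


def scan(docs, term):
    """Scan every document: cost grows with the size of the collection."""
    return sorted(d for d, text in docs.items() if term in text.split())


def build_index(docs):
    """Build a term -> set of document ids mapping once, ahead of query time."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.split():
            index[term].add(doc_id)
    return index


def lookup(index, term):
    """Direct access: cost grows with the number of postings for the term."""
    return sorted(index.get(term, ()))


def hybrid(index, docs, term, phrase):
    """Use the index to narrow the candidates, then scan only that subset."""
    return [d for d in lookup(index, term) if phrase in docs[d]]


if __name__ == "__main__":
    index = build_index(docs)
    print(scan(docs, "index"))                            # [1, 3]
    print(lookup(index, "index"))                         # [1, 3]
    print(hybrid(index, docs, "index", "direct access"))  # [3]
```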
Indexes
• What should the index contain?
• Database systems index primary and secondary keys
  • This is the hybrid approach
  • Index provides fast access to a subset of database records
  • Scan subset to find solution set
• IR Problem:
  • Cannot predict keys that people will use in queries
  • Every word in a document is a potential search term
• IR Solution: Index by all keys (words), i.e. full-text indexes
Index Contents
• The contents depend upon the retrieval model
• Feature presence/absence
  • Boolean
  • Statistical (tf, df, ctf, doclen, maxtf)
  • Often about 10% the size of the raw data, compressed
• Positional
  • Feature location within document
  • Granularities include word, sentence, paragraph, etc.
  • Coarse granularities are less precise, but take less space
  • Word-level granularity about 20-30% the size of the raw data, compressed
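The statistical features can be computed in one pass over the collection. This sketch reads the abbreviations in their usual sense (tf: term frequency within a document, df: number of documents containing the term, ctf: total occurrences in the collection, doclen: document length in tokens, maxtf: largest tf within a document). The slide does not spell these out, so treat the readings and the `index_statistics` name as assumptions.

```python
from collections import Counter


def index_statistics(docs):
    """Return (tf, df, ctf, doclen, maxtf) for a {doc_id: text} collection."""
    # tf[doc_id][term]: occurrences of term in that document.
    tf = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}
    df = Counter()   # df[term]: number of documents containing term
    ctf = Counter()  # ctf[term]: occurrences of term in the whole collection
    for counts in tf.values():
        df.update(counts.keys())
        ctf.update(counts)
    doclen = {doc_id: sum(counts.values()) for doc_id, counts in tf.items()}
    maxtf = {doc_id: max(counts.values(), default=0) for doc_id, counts in tf.items()}
    return tf, df, ctf, doclen, maxtf


if __name__ == "__main__":
    tf, df, ctf, doclen, maxtf = index_statistics({1: "a b a", 2: "b c"})
    print(tf[1]["a"], df["b"], ctf["a"], doclen[1], maxtf[1])  # 2 2 2 3 2
```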
Indexes: Implementation
• Common implementations of indexes
  • Bitmaps
  • Signature files
  • Inverted files
• Common index components
  • Dictionary (lexicon)
  • Postings
    • document ids
    • word positions
[Figure caption: No positional data indexed]
Inverted Search Algorithm
• Find query elements (terms) in the lexicon
• Retrieve postings for each lexicon entry
• Manipulate postings according to the retrieval model
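The three steps can be sketched for a Boolean AND query over a document-level inverted file. The collection is an assumption: the lines of the "Pease porridge" rhyme often used in textbook examples, chosen to match the "porridge" and "pot" query terms in this deck, with punctuation removed. The function names are hypothetical.

```python
# Assumed toy collection: one rhyme line per document.
docs = {
    1: "pease porridge hot pease porridge cold",
    2: "pease porridge in the pot",
    3: "nine days old",
    4: "some like it hot some like it cold",
    5: "some like it in the pot",
    6: "nine days old",
}


def build_index(docs):
    """Lexicon -> sorted list of document ids (no positional data)."""
    index = {}
    for doc_id in sorted(docs):
        for term in set(docs[doc_id].lower().split()):
            index.setdefault(term, []).append(doc_id)
    return index


def intersect(a, b):
    """Merge two sorted postings lists, keeping ids present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def boolean_and(index, terms):
    postings = []
    for term in terms:
        if term not in index:      # step 1: find the term in the lexicon
            return []
        postings.append(index[term])  # step 2: retrieve its postings
    if not postings:
        return []
    postings.sort(key=len)         # start from the shortest list
    result = postings[0]
    for other in postings[1:]:     # step 3: combine per the Boolean model
        result = intersect(result, other)
    return result


if __name__ == "__main__":
    index = build_index(docs)
    print(boolean_and(index, ["porridge", "pot"]))  # [2]
```

Starting from the shortest postings list keeps the intermediate result small, which is one of the simpler optimizations an index makes possible.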
Word-Level Inverted File
[Figure: lexicon and postings of a word-level inverted file, with the answers to the queries below.]
Query:
1. porridge & pot (BOOL)
2. "porridge pot" (BOOL)
3. porridge pot (VSM)
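The three query types can be run against a word-level (positional) index. This is a sketch under assumptions: the collection is the "Pease porridge" rhyme with one line per document (the slide's own figure is not preserved), and the vector space ranking uses a plain tf × idf weight with document-length normalization, which is one of several common weighting schemes and not necessarily the one the slide had in mind.

```python
import math
from collections import defaultdict

# Assumed toy collection: one rhyme line per document.
docs = {
    1: "pease porridge hot pease porridge cold",
    2: "pease porridge in the pot",
    3: "nine days old",
    4: "some like it hot some like it cold",
    5: "some like it in the pot",
    6: "nine days old",
}


def build_positional_index(docs):
    """term -> {doc_id: [word positions]}"""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index


def boolean_and(index, terms):
    """Documents that contain every term."""
    doc_sets = [set(index[t]) if t in index else set() for t in terms]
    return sorted(set.intersection(*doc_sets)) if doc_sets else []


def phrase_query(index, phrase):
    """Documents where the terms occur at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return []
    hits = []
    for doc_id, starts in index[terms[0]].items():
        for start in starts:
            if all(start + k in index[t].get(doc_id, ()) for k, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return sorted(hits)


def vsm_rank(index, num_docs, query):
    """Rank documents by a length-normalized tf*idf score (assumed weighting)."""
    idf = {t: math.log(num_docs / len(p)) for t, p in index.items()}
    norms = defaultdict(float)
    for t, postings in index.items():
        for doc_id, positions in postings.items():
            norms[doc_id] += (len(positions) * idf[t]) ** 2
    scores = defaultdict(float)
    for t in query.lower().split():
        for doc_id, positions in index.get(t, {}).items():
            scores[doc_id] += len(positions) * idf[t] * idf[t]
    ranked = [(d, s / math.sqrt(norms[d])) for d, s in scores.items() if norms[d] > 0]
    return sorted(ranked, key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    index = build_positional_index(docs)
    print(boolean_and(index, ["porridge", "pot"]))  # [2]
    print(phrase_query(index, "porridge pot"))      # []
    print([d for d, _ in vsm_rank(index, len(docs), "porridge pot")])  # [2, 1, 5]
```

On this collection the Boolean query matches only document 2, the phrase query matches nothing because "porridge" and "pot" are never adjacent, and the ranked query still returns documents that contain just one of the two terms.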
Outline
• How search engines work
• Information retrieval research and institutions
A Brief History of Modern Information Retrieval
• In 1945, Vannevar Bush published "As We May Think" in The Atlantic Monthly.
• In the 1960s:
  • The SMART system was developed by Gerard Salton and his students.
  • The Cranfield evaluations were carried out by Cyril Cleverdon.
• The 1970s and 1980s saw many developments built on the advances of the 1960s.
• In 1992, the Text Retrieval Conference (TREC) was established.
• From 1996, the algorithms developed in IR were employed for searching the Web.
Information Retrieval Research and Institutions
• CIIR, University of Massachusetts
• LTI, Carnegie Mellon University
• The Stanford University DB Group
• Microsoft Research Asia
• TREC
• Tianwang Group, Network Laboratory, Peking University
Introduction to Lemur
• http://www-2.cs.cmu.edu/~lemur/
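Lemur Toolkit
• Goal: a research system to facilitate language modeling (LM) and IR research
  • ad hoc, distributed retrieval, cross-language IR, summarization, filtering, and classification
• Features:
  • Supports indexing of large-scale document collections
  • Builds simple language models
  • Implements retrieval systems based on language models and several other retrieval models
• Implementation:
  • C and C++
  • Unix / Windows
  • Current version: 3.1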
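To make "retrieval based on a simple language model" concrete, here is a generic query-likelihood sketch with a unigram model per document and Jelinek-Mercer smoothing against the collection model. It is written from scratch for illustration and does not use or mirror Lemur's API. The function name, the smoothing choice, and the value of `lam` are assumptions.

```python
import math
from collections import Counter


def query_likelihood(docs, query, lam=0.5):
    """Rank documents by log P(query | document), with 0 < lam <= 1.

    P(term | doc) = (1 - lam) * tf / doclen + lam * ctf / collection_length
    """
    doc_tf = {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}
    coll_tf = Counter()
    for counts in doc_tf.values():
        coll_tf.update(counts)
    coll_len = sum(coll_tf.values())

    scores = {}
    for doc_id, tf in doc_tf.items():
        doc_len = sum(tf.values())
        log_p = 0.0
        for term in query.lower().split():
            p_coll = coll_tf[term] / coll_len
            if p_coll == 0:
                continue  # term absent from the whole collection: ignored in this sketch
            p_doc = tf[term] / doc_len if doc_len else 0.0
            log_p += math.log((1 - lam) * p_doc + lam * p_coll)
        scores[doc_id] = log_p
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    toy = {
        1: "language model retrieval",
        2: "vector space model",
        3: "boolean retrieval",
    }
    for doc_id, score in query_likelihood(toy, "language model"):
        print(doc_id, round(score, 3))
```

Smoothing with the collection model keeps a document from scoring zero just because it lacks one query term.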
MRA (Microsoft Research Asia): Towards Next Generation Web Search
• From Pages to Blocks
  • Analyze the Web at finer granularity
• From Surface Web to Deep Web
  • Unleash the huge assets of high-value information
• From Unstructured to Structured
  • Provide well-organized results
• From Relevance to Intelligence
  • Contribute to knowledge discovery with search
• From Desktop Search to Mobile Search
  • Bridge physical world search to digital world search
The Stanford Univ. DB Group
• WebBase
  • Crawling, storage, indexing, and querying of large collections of Web pages
• Digital Libraries
  • Infrastructure and services for creating, disseminating, sharing and managing information
TREC Conference
• Established in 1992 to evaluate large-scale IR
  • Retrieving documents from a gigabyte collection
• Has run continuously since then
  • The TREC 2004 (13th) meeting is in November
• Run by NIST's Information Access Division
• Probably the best-known IR evaluation setting
• Started with 25 participating organizations in the 1992 evaluation
  • In 2003, there were 93 groups from 22 different countries
• Proceedings available online (http://trec.nist.gov)
• Overview of TREC 2003 at http://trec.nist.gov/pubs/trec12/papers/OVERVIEW.12.pdf
TREC General Format
• TREC consists of IR research tracks
  • Ad hoc, routing, confusion (scanned documents, speech recognition), video, filtering, multilingual (cross-language, Spanish, Chinese), question answering, novelty, high precision, interactive, Web, database merging, NLP, …
• Each track works on roughly the same model
  • November: track approved by the TREC community
  • Winter: track's members finalize the format for the track
  • Spring: researchers train systems based on the specification
  • Summer: researchers carry out the formal evaluation
    • Usually a "blind" evaluation: researchers do not know the answers
  • Fall: NIST carries out evaluation
  • November: group meeting (TREC) to find out:
    • How well your site did
    • How others tackled the problem
• Many tracks are run by volunteers outside of NIST (e.g. Web)
• "Coopetition" model of evaluation
  • Successful approaches generally adopted in the next cycle
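CWT100g Construction Timeline
[Table: schedule with four items checked off (√); the item details are not preserved.]
One small step for me, one giant leap for mankind!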