Search Engine Technology

Search Engine Technology Yan Hongfei (闫宏飞), yhf@net.pku.edu.cn Network Laboratory, Department of Computer Science, Peking University December 24, 2004 @CERNET2004

Outline • How search engines work • Information retrieval research and institutions

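Web Search Engines • Definition: a system that lets users submit a query, retrieves the web pages relevant to that query, and returns them as a ranked result list. • Index construction methods • Manual indexing • Automatic indexing • System architecture • Centralized architecture • Distributed architecture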

[Figure: two service extremes (Browsing Services, Search Engine Services) and two semantics extremes (Web Pages, Bag of Words), with "???" marked between each pair.]

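The three-stage workflow of a search engine • Crawling • Batch crawling, incremental crawling; crawl targets, crawl strategies • Preprocessing • Keyword extraction; duplicate page elimination; link analysis; indexing • Serving • Query modes and matching; result ranking; document summaries [Figure labels: Crawl, Organize, Serve]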
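The preprocessing stage lists duplicate page elimination without saying how it is done. The block below is a minimal sketch of one simple approach, assuming that pages count as duplicates when their visible text is identical after tags are stripped and whitespace and case are normalized. The function names `fingerprint` and `drop_duplicates` and the sample pages are hypothetical, and production systems typically use more tolerant near-duplicate techniques.

```python
import hashlib
import re


def fingerprint(html_text):
    """Hash the normalized visible text of a page (assumed duplicate criterion)."""
    text = re.sub(r"<[^>]+>", " ", html_text)          # drop HTML tags
    text = re.sub(r"\s+", " ", text).strip().lower()   # normalize whitespace and case
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def drop_duplicates(pages):
    """Keep the first page seen for each fingerprint, preserving crawl order."""
    seen = set()
    unique = []
    for url, html_text in pages:
        fp = fingerprint(html_text)
        if fp not in seen:
            seen.add(fp)
            unique.append((url, html_text))
    return unique


# Hypothetical crawl output: pages "a" and "b" differ only in markup and case.
pages = [
    ("a", "<p>Hello  World</p>"),
    ("b", "<div>hello world</div>"),
    ("c", "<p>other</p>"),
]
unique_urls = [url for url, _ in drop_duplicates(pages)]
print(unique_urls)
```

Hashing the whole normalized text only catches exact copies; a page that differs by a single word gets a different fingerprint.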

Search engine system workflow

Tianwang search engine system workflow

Architecture of the distributed Web crawling system [Figure: a scheduling module; coordinator processes (one per node), each with crawler processes; ……]
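The slide names the components of the distributed crawler but not how URLs are divided among nodes. The block below is a sketch of one common scheme, assuming each URL is assigned to a node by hashing its host name so that all pages of a site go to the same coordinator. The function `node_for_url` and the node count are hypothetical and are not taken from the Tianwang system.

```python
import hashlib
from urllib.parse import urlsplit


def node_for_url(url, num_nodes):
    """Map a URL to a node index in [0, num_nodes) by hashing its host name."""
    host = (urlsplit(url).hostname or "").lower()
    digest = hashlib.md5(host.encode("utf-8")).digest()
    # Use a stable hash (not Python's randomized hash()) so that every
    # process computes the same assignment.
    return int.from_bytes(digest[:8], "big") % num_nodes


NUM_NODES = 3  # assumed cluster size, for illustration only
node_a = node_for_url("http://www.pku.edu.cn/a.html", NUM_NODES)
node_b = node_for_url("http://www.pku.edu.cn/b/c.html", NUM_NODES)
print(node_a == node_b)  # both URLs share a host, so they share a node
```

Partitioning by host rather than by full URL keeps politeness control (request rate per site) local to one node.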

Tianwang storage format
version: 1.0 // version number
url: http://www.pku.edu.cn/ // URL
origin: http://www.somewhere.cn/ // original URL
date: Tue, 15 Apr 2003 08:13:06 GMT // time of harvest
ip: 162.105.129.12 // IP address
unzip-length: 30233 // if included, the data must be compressed
length: 18133 // data length
// a blank line
XXXXXXXX // the data part follows
XXXXXXXX
….
XXXXXXXX // data end
// insert a new line
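A record in this format can be read by splitting the header from the data at the blank line and then taking `length` bytes of data. The block below is a minimal sketch of that idea. It assumes an ASCII header with `\n` line endings and zlib compression whenever `unzip-length` is present; the slide states neither, and the function name `parse_record` and the sample record are hypothetical.

```python
import zlib


def parse_record(raw):
    """Split one record into (header dict, data bytes)."""
    head, sep, rest = raw.partition(b"\n\n")  # header ends at the blank line
    if not sep:
        raise ValueError("no blank line between header and data")
    header = {}
    for line in head.decode("ascii").splitlines():
        key, _, value = line.partition(":")   # split at the first colon only
        header[key.strip()] = value.strip()
    length = int(header["length"])
    data = rest[:length]
    if len(data) != length:
        raise ValueError("record is shorter than its declared length")
    if "unzip-length" in header:
        # Assumption: compressed data uses zlib.
        data = zlib.decompress(data)
        if len(data) != int(header["unzip-length"]):
            raise ValueError("unexpected decompressed size")
    return header, data


# Build a small sample record (contents are made up for illustration).
body = b"<html>hello</html>"
packed = zlib.compress(body)
record = (
    b"version: 1.0\n"
    b"url: http://www.pku.edu.cn/\n"
    b"date: Tue, 15 Apr 2003 08:13:06 GMT\n"
    b"ip: 162.105.129.12\n"
    + f"unzip-length: {len(body)}\nlength: {len(packed)}\n".encode("ascii")
    + b"\n" + packed + b"\n"
)
header, payload = parse_record(record)
print(header["url"], len(payload))
```

Reading exactly `length` bytes, rather than searching for a terminator, matters because the data part may itself contain blank lines.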

File Organizations (Indexes) • Choices for accessing data during query evaluation • Scan the entire collection • Typical in early (batch) retrieval systems • Computational and I/O costs are O(characters in collection) • Practical for only “small” text collections • Large memory systems make scanning feasible • Use indexes for direct access • Evaluation time O(query term occurrences in collection) • Practical for “large” collections • Many opportunities for optimization • Hybrids: Use small index, then scan a subset of the collection

Indexes • What should the index contain? • Database systems index primary and secondary keys • This is the hybrid approach • Index provides fast access to a subset of database records • Scan subset to find solution set • IR Problem: • Cannot predict keys that people will use in queries • Every word in a document is a potential search term • IR Solution: Index by all keys (words): full-text indexes

Index Contents • The contents depend upon the retrieval model • Feature presence/absence • Boolean • Statistical (tf, df, ctf, doclen, maxtf) • Often about 10% the size of the raw data, compressed • Positional • Feature location within document • Granularities include word, sentence, paragraph, etc. • Coarse granularities are less precise, but take less space • Word-level granularity about 20-30% the size of the raw data, compressed
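The statistical features named above can be collected in one pass over the collection. The block below is a small sketch, assuming the usual readings of the abbreviations: tf is the term frequency within a document, df the number of documents containing a term, ctf the term's total count in the collection, doclen the document length in tokens, and maxtf the largest tf in a document. The sample documents are lines of the "pease porridge" nursery rhyme, chosen for illustration, and the tokenizer is a simplification.

```python
import re
from collections import Counter


def collect_statistics(docs):
    """Return (tf, df, ctf, doclen, maxtf) for a {doc_id: text} collection."""
    tf, doclen, maxtf = {}, {}, {}
    df, ctf = Counter(), Counter()
    for doc_id, text in docs.items():
        counts = Counter(re.findall(r"[a-z]+", text.lower()))
        tf[doc_id] = counts                        # per-document term frequencies
        doclen[doc_id] = sum(counts.values())      # tokens in this document
        maxtf[doc_id] = max(counts.values(), default=0)
        ctf.update(counts)                         # add occurrences to collection totals
        df.update(counts.keys())                   # count each document once per term
    return tf, df, ctf, doclen, maxtf


# Illustrative collection, one rhyme line per document.
docs = {
    1: "Pease porridge hot, pease porridge cold,",
    2: "Pease porridge in the pot,",
    3: "Nine days old.",
    4: "Some like it hot, some like it cold,",
    5: "Some like it in the pot,",
    6: "Nine days old.",
}
tf, df, ctf, doclen, maxtf = collect_statistics(docs)
print(df["porridge"], ctf["porridge"], doclen[1], maxtf[1])
```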

Indexes: Implementation • Common implementations of indexes • Bitmaps • Signature files • Inverted files • Common index components • Dictionary (lexicon) • Postings • document ids • word positions [Figure caption: no positional data indexed]

Inverted Files

Word-Level Inverted File

Inverted Search Algorithm • Find query elements (terms) in the lexicon • Retrieve postings for each lexicon entry • Manipulate postings according to the retrieval model
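The three steps of the inverted search algorithm can be sketched for the Boolean model as follows. This is a minimal illustration, assuming a document-level inverted file held in memory as a dictionary from term to a sorted list of document ids. The function names and the sample collection (lines of the "pease porridge" rhyme) are hypothetical.

```python
import re
from collections import defaultdict


def build_inverted_file(docs):
    """Lexicon maps each term to its postings: a sorted list of doc ids."""
    postings = defaultdict(list)
    for doc_id in sorted(docs):
        for term in set(re.findall(r"[a-z]+", docs[doc_id].lower())):
            postings[term].append(doc_id)
    return dict(postings)


def intersect(a, b):
    """Merge two sorted postings lists, keeping ids present in both."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def boolean_and(index, terms):
    """Step 1: look up terms. Step 2: fetch postings. Step 3: intersect."""
    if not terms or any(t not in index for t in terms):
        return []
    lists = sorted((index[t] for t in terms), key=len)  # shortest list first
    result = lists[0]
    for other in lists[1:]:
        result = intersect(result, other)
    return result


docs = {
    1: "Pease porridge hot, pease porridge cold,",
    2: "Pease porridge in the pot,",
    3: "Nine days old.",
    4: "Some like it hot, some like it cold,",
    5: "Some like it in the pot,",
    6: "Nine days old.",
}
index = build_inverted_file(docs)
answer = boolean_and(index, ["porridge", "pot"])
print(answer)
```

Starting with the shortest postings list keeps the intermediate result small, which is a standard ordering heuristic for conjunctive queries.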

Word-Level Inverted File [Figure: lexicon and postings] Query: 1. porridge & pot (BOOL) 2. “porridge pot” (BOOL) 3. porridge pot (VSM) Answer
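Queries 2 and 3 need more than document ids: a phrase query uses word positions, and a vector space model (VSM) query uses term weights. The block below is a sketch of both over a word-level inverted file. The slide's figure is not available, so the collection (lines of the "pease porridge" rhyme), the TF-IDF weighting with cosine-style document length normalization, and all function names are assumptions rather than the slide's method.

```python
import math
import re
from collections import defaultdict


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def build_positional_index(docs):
    """term -> {doc_id: [word positions]} (a word-level inverted file)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(tokenize(text)):
            index[term].setdefault(doc_id, []).append(pos)
    return dict(index)


def phrase_query(index, phrase):
    """Docs in which the query terms occur at consecutive positions."""
    terms = tokenize(phrase)
    if not terms or any(t not in index for t in terms):
        return []
    hits = []
    for doc_id, starts in index[terms[0]].items():
        for s in starts:
            if all(s + k in index[t].get(doc_id, ())
                   for k, t in enumerate(terms[1:], start=1)):
                hits.append(doc_id)
                break
    return sorted(hits)


def vsm_rank(index, num_docs, query):
    """Rank docs by TF-IDF dot product divided by document vector length."""
    idf = {t: math.log(num_docs / len(p)) for t, p in index.items()}
    norm = defaultdict(float)
    for t, p in index.items():
        for d, positions in p.items():
            norm[d] += (len(positions) * idf[t]) ** 2
    scores = defaultdict(float)
    for t in set(tokenize(query)):
        for d, positions in index.get(t, {}).items():
            scores[d] += len(positions) * idf[t] * idf[t]  # doc weight * query weight
    ranked = [(d, s / math.sqrt(norm[d])) for d, s in scores.items() if norm[d] > 0]
    return sorted(ranked, key=lambda pair: (-pair[1], pair[0]))


docs = {
    1: "Pease porridge hot, pease porridge cold,",
    2: "Pease porridge in the pot,",
    3: "Nine days old.",
    4: "Some like it hot, some like it cold,",
    5: "Some like it in the pot,",
    6: "Nine days old.",
}
index = build_positional_index(docs)
phrase_hits = phrase_query(index, "porridge pot")
ranking = [doc_id for doc_id, _ in vsm_rank(index, len(docs), "porridge pot")]
print(phrase_hits, ranking)
```

On this collection the phrase query finds no match, because "porridge" and "pot" are never adjacent, while the VSM query still returns every document containing either term, with the document containing both ranked first.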

Outline • How search engines work • Information retrieval research and institutions

A Brief History of Modern Information Retrieval • In 1945, Vannevar Bush published "As We May Think" in The Atlantic Monthly. • In the 1960s, Gerard Salton and his students developed the SMART system. • Cyril Cleverdon carried out the Cranfield evaluations. • The 1970s and 1980s saw many developments built on the advances of the 1960s. • In 1992, the Text Retrieval Conference (TREC) was launched. • From 1996, the algorithms developed in IR were employed for searching the Web.

Clustering of SIGIR papers by topic vs. year

Question answering

Clustering

Inverted files & Implementations

Message understanding & TDT

Filtering

Hypertext IR, Multiple evidence

Probabilistic & Language models

Distributed IR

Evaluation

Topic distillation & Linkage retrieval

Text categorisation

Document summarisation

Cross lingual

Information retrieval research and institutions • CIIR, University of Massachusetts • LTI, Carnegie Mellon University • The Stanford University DB Group • Microsoft Research Asia • TREC • Peking University, Network Laboratory, Tianwang Group

Introduction to Lemur • http://www-2.cs.cmu.edu/~lemur/

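Lemur Toolkit • Goal: a research system to facilitate language modeling (LM) and IR research • ad hoc, distributed retrieval, cross-language IR, summarization, filtering, and classification • Features: • Indexing of large-scale document collections • Building simple language models • Implementing systems based on language models and several other retrieval models • Implementation: • C and C++ • Unix / Windows • Current version 3.1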

MRA: Towards Next Generation Web Search • From Pages to Blocks • Analyze the Web at finer granularity • From Surface Web to Deep Web • Unleash the huge assets of high-value information • From Unstructured to Structured • Provide well-organized results • From Relevance to Intelligence • Contribute to knowledge discovery with search • From Desktop Search to Mobile Search • Bridge physical world search to digital world search

The Stanford Univ. DB Group • WebBase • Crawling, storage, indexing, and querying of large collections of Web pages. • Digital Libraries • Infrastructure and services for creating, disseminating, sharing and managing information

TREC Conference • Established in 1992 to evaluate large-scale IR • Retrieving documents from a gigabyte collection • Has run continuously since then • TREC 2004 (13th) meeting is in November • Run by NIST’s Information Access Division • Probably the best-known IR evaluation setting • Started with 25 participating organizations in the 1992 evaluation • In 2003, there were 93 groups from 22 different countries • Proceedings available online (http://trec.nist.gov) • Overview of TREC 2003 at http://trec.nist.gov/pubs/trec12/papers/OVERVIEW.12.pdf

TREC General Format • TREC consists of IR research tracks • Ad hoc, routing, confusion (scanned documents, speech recognition), video, filtering, multilingual (cross-language, Spanish, Chinese), question answering, novelty, high precision, interactive, Web, database merging, NLP, … • Each track works on roughly the same model • November: track approved by TREC community • Winter: track’s members finalize format for track • Spring: researchers train systems based on the specification • Summer: researchers carry out the formal evaluation • Usually a “blind” evaluation: researchers do not know the answers • Fall: NIST carries out evaluation • November: Group meeting (TREC) to find out: • How well your site did • How others tackled the problem • Many tracks are run by volunteers outside of NIST (e.g. Web) • “Coopetition” model of evaluation • Successful approaches generally adopted in next cycle

TREC Tracks

Summary of VLC/Web Track evaluation 1996 - 2003

Tianwang Group @PKU

http://www.infomall.cn/

CWT100g construction timeline √ √ √ √ One small step for me, one giant leap for mankind!
