
Search Engine Technology (搜索引擎技术)



Presentation Transcript


  1. Search Engine Technology 闫宏飞 (Yan Hongfei), yhf@net.pku.edu.cn Network Lab, Department of Computer Science, Peking University December 24, 2004 @CERNET2004

  2. Outline • How search engines work • Research and institutions in information retrieval

  3. Search Engines — Web Search Engines • Definition: a system that lets users submit queries and returns a ranked list of web pages relevant to each query. • Index construction methods • manual indexing • automatic indexing • System architecture • centralized architecture • distributed architecture

  4. Two service extremes: Browsing Services vs. Search Engine Services; two semantic extremes: Web Pages vs. Bag of Words. [diagram: a spectrum between the two extremes on each axis, with the middle ground marked "???"]

  5. The search engine's three-stage workflow • Crawling • batch crawling, incremental crawling; crawl targets, crawl strategies • Preprocessing • keyword extraction; duplicate page elimination; link analysis; indexing • Serving • query forms and matching; result ranking; document summarization [diagram: crawl → process → serve]
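
The three stages can be sketched end to end. Below is a minimal illustration in Python, assuming an in-memory page store in place of a real crawler; the tokenizer and AND-only matching are deliberate simplifications, not the Tianwang implementation:

    import re
    from collections import defaultdict

    # Stand-in for the crawler's output: url -> fetched text (a real system
    # would fetch over HTTP, in batch or incremental mode).
    PAGES = {
        "http://example.cn/a": "web search engines index web pages",
        "http://example.cn/b": "search engines rank pages by relevance",
    }

    def preprocess(pages):
        # Keyword extraction plus inverted indexing; duplicate elimination
        # and link analysis are omitted in this sketch.
        index = defaultdict(set)
        for url, text in pages.items():
            for term in re.findall(r"\w+", text.lower()):
                index[term].add(url)
        return index

    def serve(index, query):
        # AND-match all query terms; ranking and summaries are omitted.
        postings = [index.get(t, set()) for t in query.lower().split()]
        return set.intersection(*postings) if postings else set()

    index = preprocess(PAGES)
    print(serve(index, "search pages"))  # both example URLs match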

  6. Search engine system flow

  7. Tianwang search engine system flow

  8. Distributed Web crawling system architecture [diagram: a scheduling module coordinates several coordination processes (nodes), each running multiple crawling processes]
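
One way a scheduling module can partition work among nodes is to hash each URL's host, so every node crawls a disjoint partition; this is a common scheme sketched here as an assumption, since the slide does not specify Tianwang's actual policy. Hashing by host also keeps per-host politeness constraints on a single node:

    import hashlib
    from urllib.parse import urlparse

    NUM_NODES = 3  # hypothetical number of coordination nodes

    def assign_node(url: str) -> int:
        # Hash the host, not the full URL, so all pages of one host
        # land on the same node.
        host = urlparse(url).netloc
        return int(hashlib.md5(host.encode("utf-8")).hexdigest(), 16) % NUM_NODES

    for u in ["http://www.pku.edu.cn/a", "http://www.pku.edu.cn/b", "http://e.cn/"]:
        print(u, "-> node", assign_node(u))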

  9. Tianwang storage format
    version: 1.0                          // version number
    url: http://www.pku.edu.cn/           // URL
    origin: http://www.somewhere.cn/      // original URL
    date: Tue, 15 Apr 2003 08:13:06 GMT   // time of harvest
    ip: 162.105.129.12                    // IP address
    unzip-length: 30233                   // if present, the data part is compressed
    length: 18133                         // data length
                                          // a blank line ends the header
    XXXXXXXX                              // what follows is the data part
    ...
    XXXXXXXX                              // data end
                                          // a terminating newline follows each record
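
A hedged reading of this record layout as a parser: field semantics follow the comments above, but the header encoding and the zlib codec for the compressed case are assumptions for illustration, not documented Tianwang behavior:

    import zlib

    def parse_record(raw: bytes):
        # The header is "key: value" lines; a blank line separates it from data.
        header_blob, _, rest = raw.partition(b"\n\n")
        header = {}
        for line in header_blob.decode("utf-8").splitlines():
            key, _, value = line.partition(":")  # split at the first colon only
            header[key.strip()] = value.strip()
        data = rest[: int(header["length"])]     # 'length' bytes of data
        if "unzip-length" in header:             # presence implies compression
            data = zlib.decompress(data)         # codec choice is an assumption
        return header, data

    raw = b"version: 1.0\nurl: http://www.pku.edu.cn/\nlength: 5\n\nhello\n"
    print(parse_record(raw))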

  10. File Organizations (Indexes) • Choices for accessing data during query evaluation • Scan the entire collection • Typical in early (batch) retrieval systems • Computational and I/O costs are O(characters in collection) • Practical for only “small” text collections • Large memory systems make scanning feasible • Use indexes for direct access • Evaluation time O(query term occurrences in collection) • Practical for “large” collections • Many opportunities for optimization • Hybrids: Use small index, then scan a subset of the collection
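
A toy contrast of the two access paths, for illustration only: the scan touches every document on every query, while the index is built once and each lookup touches just that term's postings:

    docs = {1: "to be or not to be", 2: "ir systems index text"}

    def scan(term):
        # O(characters in collection): every document is read on every query.
        return [d for d, text in docs.items() if term in text.split()]

    index = {}
    for d, text in docs.items():  # built once, reused for every query
        for term in text.split():
            index.setdefault(term, set()).add(d)

    def lookup(term):
        # O(occurrences of the query term in the collection).
        return index.get(term, set())

    print(scan("index"), lookup("index"))  # [2] {2}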

  11. Indexes • What should the index contain? • Database systems index primary and secondary keys • This is the hybrid approach • Index provides fast access to a subset of database records • Scan subset to find solution set • IR problem: • Cannot predict the keys that people will use in queries • Every word in a document is a potential search term • IR solution: index by all keys (words): full-text indexes

  12. Index Contents • The contents depend upon the retrieval model • Feature presence/absence • Boolean • Statistical (tf, df, ctf, doclen, maxtf) • Often about 10% of the size of the raw data, compressed • Positional • Feature location within the document • Granularities include word, sentence, paragraph, etc. • Coarse granularities are less precise, but take less space • Word-level granularity is about 20-30% of the size of the raw data, compressed
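
The statistics named above can be computed in a few lines; here is a sketch over a toy collection, with the abbreviations as on the slide (tf per document; df and ctf over the collection; doclen and maxtf per document):

    from collections import Counter

    docs = {1: "a b a c".split(), 2: "b b d".split()}

    tf = {d: Counter(toks) for d, toks in docs.items()}        # term freq per doc
    df = Counter(t for c in tf.values() for t in c)            # docs containing term
    ctf = Counter()                                            # collection term freq
    for c in tf.values():
        ctf.update(c)
    doclen = {d: len(toks) for d, toks in docs.items()}        # document length
    maxtf = {d: max(c.values()) for d, c in tf.items()}        # peak tf per doc

    print(tf[1]["a"], df["b"], ctf["b"], doclen[2], maxtf[1])  # 2 2 3 3 2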

  13. Indexes: Implementation • Common implementations of indexes • Bitmaps • Signature files • Inverted files • Common index components • Dictionary (lexicon) • Postings • document ids (no positional data indexed) • word positions
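
For the inverted-file case, the two components look like this in a minimal sketch: the lexicon maps a term to its postings, and each posting carries a document id plus word positions (word-level granularity):

    from collections import defaultdict

    def build_index(docs):
        # lexicon: term -> postings, where postings map doc_id -> [positions].
        lexicon = defaultdict(dict)
        for doc_id, text in docs.items():
            for pos, term in enumerate(text.split()):
                lexicon[term].setdefault(doc_id, []).append(pos)
        return lexicon

    docs = {1: "and the pot of porridge", 2: "porridge pot porridge"}
    index = build_index(docs)
    print(index["porridge"])  # {1: [4], 2: [0, 2]}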

  14. Inverted Files

  15. Inverted Files

  16. Word-Level Inverted File

  17. Inverted Search Algorithm • Find query elements (terms) in the lexicon • Retrieve postings for each lexicon entry • Manipulate postings according to the retrieval model
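
Sketched for a Boolean AND query over the word-level structure from the previous sketch; the toy lexicon is hard-coded so the fragment runs on its own, and intersecting the shortest postings lists first is a standard optimization:

    def boolean_and(lexicon, terms):
        postings = []
        for term in terms:
            if term not in lexicon:   # any missing term empties an AND result
                return set()
            postings.append(set(lexicon[term]))
        if not postings:
            return set()
        postings.sort(key=len)        # intersect the shortest lists first
        result = postings[0]
        for p in postings[1:]:
            result &= p
        return result

    # Toy word-level lexicon (doc id -> positions), hard-coded for illustration.
    lexicon = {"porridge": {1: [4], 2: [0, 2]}, "pot": {1: [2], 2: [1]}}
    print(boolean_and(lexicon, ["porridge", "pot"]))  # {1, 2}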

  18. Word-Level Inverted File [diagram: lexicon, postings, and the answer to each query] • Query 1: porridge & pot (BOOL) • Query 2: "porridge pot" (BOOL, phrase) • Query 3: porridge pot (VSM)
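
The other two query types can be sketched on the same toy positional index: the phrase query checks for adjacent positions (so it matches only document 2, where the Boolean AND above matched both), and the VSM query ranks by a smoothed tf-idf sum; the exact weighting here is an assumption for illustration, not a formula from the slide:

    import math

    lexicon = {"porridge": {1: [4], 2: [0, 2]}, "pot": {1: [2], 2: [1]}}
    N = 2  # documents in the toy collection

    def phrase(lexicon, t1, t2):
        # Doc ids where t2 occurs exactly one position after t1.
        hits = set()
        for doc, pos1 in lexicon.get(t1, {}).items():
            pos2 = set(lexicon.get(t2, {}).get(doc, []))
            if any(p + 1 in pos2 for p in pos1):
                hits.add(doc)
        return hits

    def vsm(lexicon, terms):
        # Rank by a smoothed tf*idf sum; length normalization is omitted.
        scores = {}
        for t in terms:
            postings = lexicon.get(t, {})
            if not postings:
                continue
            idf = math.log(1 + N / len(postings))
            for doc, positions in postings.items():
                scores[doc] = scores.get(doc, 0.0) + len(positions) * idf
        return sorted(scores.items(), key=lambda kv: -kv[1])

    print(phrase(lexicon, "porridge", "pot"))  # {2}: adjacent only in doc 2
    print(vsm(lexicon, ["porridge", "pot"]))   # doc 2 outranks doc 1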

  19. Outline • How search engines work • Research and institutions in information retrieval

  20. A brief history of modern information retrieval • In 1945, Vannevar Bush published "As We May Think" in The Atlantic Monthly. • In the 1960s, Gerard Salton and his students built the SMART system. • The Cranfield evaluations were carried out by Cyril Cleverdon. • The 1970s and 1980s saw many developments built on the advances of the 1960s. • 1992 marked the inception of the Text REtrieval Conference (TREC). • From 1996, the algorithms developed in IR were employed for searching the Web.

  21. Clustering of SIGIR papers by topic vs. year

  22. Question answering

  23. Clustering

  24. Inverted files & Implementations

  25. Message understanding & TDT

  26. Filtering

  27. Hypertext IR, Multiple evidence

  28. Probabilistic & Language models

  29. Distributed IR

  30. Evaluation

  31. Topic distillation & Linkage retrieval

  32. Text categorisation

  33. Document summarisation

  34. Cross lingual

  35. Research and institutions in information retrieval • CIIR, University of Massachusetts • LTI, Carnegie Mellon University • The Stanford University DB Group • Microsoft Research Asia • TREC • Peking University, Network Lab, Tianwang Group

  36. Introduction to Lemur • http://www-2.cs.cmu.edu/~lemur/

  37. Lemur Toolkit • Goal: a research system to advance LM and IR research • ad hoc and distributed retrieval, cross-language IR, summarization, filtering, and classification • Features: • indexing large-scale document databases • building simple language models • implementing retrieval systems based on language models and several other retrieval models • Implementation: • C and C++ • Unix / Windows • Current version 3.1

  38. MSRA: Towards Next Generation Web Search • From Pages to Blocks • Analyze the Web at a finer granularity • From Surface Web to Deep Web • Unleash the huge assets of high-value information • From Unstructured to Structured • Provide well-organized results • From Relevance to Intelligence • Combine knowledge discovery with search • From Desktop Search to Mobile Search • Bridge physical-world search to digital-world search

  39. The Stanford Univ. DB Group • WebBase • Crawling, storage, indexing, and querying of large collections of Web pages. • Digital Libraries • Infrastructure and services for creating, disseminating, sharing and managing information

  40. TREC Conference • Established in 1992 to evaluate large-scale IR • Retrieving documents from a gigabyte collection • Has run continuously since then • The TREC 2004 (13th) meeting is in November • Run by NIST's Information Access Division • Probably the best-known IR evaluation setting • Started with 25 participating organizations in the 1992 evaluation • In 2003, there were 93 groups from 22 different countries • Proceedings available online (http://trec.nist.gov) • Overview of TREC 2003 at http://trec.nist.gov/pubs/trec12/papers/OVERVIEW.12.pdf

  41. TREC General Format • TREC consists of IR research tracks • Ad hoc, routing, confusion (scanned documents, speech recognition), video, filtering, multilingual (cross-language, Spanish, Chinese), question answering, novelty, high precision, interactive, Web, database merging, NLP, … • Each track works on roughly the same model • November: track approved by TREC community • Winter: track's members finalize the format for the track • Spring: researchers train systems based on the specification • Summer: researchers carry out the formal evaluation • Usually a "blind" evaluation: researchers do not know the answers • Fall: NIST carries out the evaluation • November: group meeting (TREC) to find out: • How well your site did • How others tackled the problem • Many tracks are run by volunteers outside of NIST (e.g. Web) • "Coopetition" model of evaluation • Successful approaches generally adopted in the next cycle

  42. TREC Tracks

  43. Summary of VLC/Web Track evaluation 1996 - 2003

  44. Tianwang Group @PKU

  45. http://www.infomall.cn/

  46. CWT100g construction timeline [table: four milestones, all checked off] "One small step for me, one giant leap for mankind!"
