
Tianwang Search Platform PARADISE



  1. Tianwang Search Platform PARADISE. Yan Hongfei, Network Laboratory, Department of Computer Science, Peking University. 2009/4/24

  2. Overview of Frontiers in Information Retrieval. Yan Hongfei, Network Laboratory, Department of Computer Science, Peking University. 2009/4/24

  3. Outline • Issues: search engines and web mining • Goal: framework • Principles: related components (how the parts of the task fit together) • Implementation: achievements (application results) • Case study (a concrete example)

  4. Search Engine and Web Mining • Crawling • Full-text indexing • Retrieval • Web archiving • Mining

  5. WebGraph: bowtie → teapot • Model: bag of words → • Infomall, CDAL: archive → histrace • Evaluation: manual → automatic • Platform: proprietary → open

  6. Outline • Motivation for Building PARADISE • Design and Implementation of PARADISE • Research Issues of Concern

  7. Corpus (1/3) • Shakespeare’s collected works • a gzipped tar file [2039k] • http://www.cs.su.oz.au/~matty/Shakespeare/shakespeare.tar.gz • RCV1, Reuters Corpus Volume 1 • one year of Reuters newswire, 1996-08-20 to 1997-08-19 • contains about 810,000 Reuters English-language news stories • requires about 2.5 GB for storage of the uncompressed files • http://trec.nist.gov/data/reuters/reuters.html

  8. Corpus (2/3) • CWT100GB, Chinese Web Test collection: crawled in June 2004; 5,712,710 web pages; 90 GB uncompressed • CWT200GB: crawled in November 2005; 37,482,913 web pages; 197 GB compressed • http://www.cwirf.org/

  9. Corpus (3/3) 2008/10/15

  10. Hardware assumptions • s, average seek time: 5 ms = 5 × 10^−3 s • b, transfer time per byte: 0.02 μs = 2 × 10^−8 s • processor clock cycle: 10^−9 s • p, low-level operation (e.g., compare & swap a word): 0.01 μs = 10^−8 s • size of main memory: several GB • size of disk space: 1 TB or more

  11. Reuters RCV1 statistics • N, documents: 800,000 • L, avg. # tokens per doc: 200 • M, terms (= word types): 400,000 • avg. # bytes per token: 6 (incl. spaces/punct.) • avg. # bytes per token: 4.5 (without spaces/punct.) • avg. # bytes per term: 7.5 • T, tokens: 100,000,000. Why 4.5 bytes per word token vs. 7.5 bytes per word type?

  12. Index construction: documents are parsed to extract words, and these are saved with the document ID. Doc 1: “I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.” Doc 2: “So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.”

  13. Key step After all documents have been parsed, the inverted file is sorted by terms. We focus on this sort step. We have 100M items to sort.

  14. Scaling index construction • In-memory index construction does not scale. • How can we construct an index for very large collections? • Taking into account the hardware constraints • Memory, disk, speed etc.

  15. Sort-based index construction As we build the index, we parse docs one at a time. While building the index, we cannot easily exploit compression tricks (we can, but it becomes much more complex). The final postings for any term are incomplete until the end. At 12 bytes per postings entry, this demands a lot of space for large collections: T = 100,000,000 in the case of RCV1. So we could do this in memory in 2008, but typical collections are much larger; e.g., the New York Times provides an index of >150 years of newswire. Thus: we need to store intermediate results on disk.

  16. Use the same algorithm for disk? Can we use the same index construction algorithm for larger collections, but by using disk instead of memory? No: Sorting T = 100,000,000 records on disk is too slow – too many disk seeks. We need an external sorting algorithm.

  17. Bottleneck Parse and build postings entries one doc at a time Now sort postings entries by term (then by doc within each term) Doing this with random disk seeks would be too slow – must sort T=100M records If every comparison took 2 disk seeks, and N items could be sorted with N log2N comparisons, how long would this take?
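The slide’s question can be answered with a quick back-of-the-envelope calculation, using the 5 ms seek time from the hardware-assumptions slide (a sketch; the constants are the slide’s own assumptions):

```python
import math

T = 100_000_000          # postings records to sort (RCV1)
seek = 5e-3              # average disk seek time: 5 ms

comparisons = T * math.log2(T)       # N log2 N comparisons
seconds = comparisons * 2 * seek     # 2 disk seeks per comparison
days = seconds / 86_400
print(f"{days:.0f} days")            # on the order of 300 days
```

Roughly ten months for a single sort, which is why the block-based external algorithms on the following slides avoid per-comparison seeks.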

  18. BSBI: blocked sort-based indexing (sorting with fewer disk seeks). 12-byte (4+4+4) records (term, doc, freq) are generated as we parse docs. We must now sort 100M such 12-byte records by term. Define a block as ~10M such records: we can easily fit a couple into memory, and will have 10 such blocks to start with. Basic idea of the algorithm: accumulate postings for each block, sort them, and write to disk; then merge the blocks into one long sorted order.
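The basic idea can be sketched in a few lines of Python (a toy illustration with made-up helper names, not the PARADISE implementation): sort each in-memory block of postings, spill it to disk, then heap-merge the sorted runs.

```python
import heapq, os, pickle, tempfile

def write_block(postings, block_files):
    # accumulate postings for one block, sort, write to disk
    postings.sort()                               # sort by (termID, docID)
    f = tempfile.NamedTemporaryFile(delete=False)
    pickle.dump(postings, f)
    f.close()
    block_files.append(f.name)

def merge_blocks(block_files):
    # merge all sorted runs into one long sorted order
    runs = []
    for name in block_files:
        with open(name, "rb") as f:
            runs.append(pickle.load(f))
        os.unlink(name)
    return list(heapq.merge(*runs))

# toy usage: two tiny blocks of (termID, docID) pairs
files = []
write_block([(2, 1), (1, 1)], files)
write_block([(1, 2), (3, 1)], files)
merged = merge_blocks(files)
print(merged)  # [(1, 1), (1, 2), (2, 1), (3, 1)]
```

A real implementation would stream records rather than pickling whole lists, but the seek pattern is the point: each run is read sequentially.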

  19. BSBI: Blocked sort-based Indexing

  20. Remaining problem with sort-based algorithm Our assumption was: we can keep the dictionary in memory. We need the dictionary (which grows dynamically) in order to implement a term to termID mapping. Actually, we could work with term,docID postings instead of termID,docID postings . . . . . . but then intermediate files become very large. (We would end up with a scalable, but very slow index construction method.)

  21. SPIMI: Single-pass in-memory indexing Key idea 1: Generate separate dictionaries for each block – no need to maintain term-termID mapping across blocks. Key idea 2: Don’t sort. Accumulate postings in postings lists as they occur. With these two ideas we can generate a complete inverted index for each block. These separate indexes can then be merged into one big index.
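A minimal sketch of one SPIMI block (illustrative only): postings are appended to per-term lists as tokens arrive, with no global term-to-termID mapping and no sorting of postings; only the dictionary terms are sorted when the block is written.

```python
from collections import defaultdict

def spimi_invert(token_stream):
    # key idea 2: don't sort; accumulate postings in lists as they occur
    index = defaultdict(list)
    for term, doc_id in token_stream:
        index[term].append(doc_id)
    # sort only the dictionary terms before writing the block out
    return {t: index[t] for t in sorted(index)}

block = spimi_invert([("caesar", 1), ("brutus", 1), ("caesar", 2)])
print(block)  # {'brutus': [1], 'caesar': [1, 2]}
```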

  22. SPIMI-Invert Merging of blocks is analogous to BSBI.

  23. Distributed indexing For web-scale indexing (don’t try this at home!): must use a distributed computing cluster Individual machines are fault-prone Can unpredictably slow down or fail How do we exploit such a pool of machines?

  24. Google data centers • Google data centers mainly contain commodity machines. • Data centers are distributed around the world. • Estimate: a total of 1 million servers, 3 million processors/cores (Gartner 2007) • Estimate: Google installs 100,000 servers each quarter. • Based on expenditures of 200–250 million dollars per year • This would be 10% of the computing capacity of the world!?!

  25. Google data centers If in a non-fault-tolerant system with 1000 nodes, each node has 99.9% uptime, what is the uptime of the system? Answer: 37% Calculate the number of servers failing per minute for an installation of 1 million servers.
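Both numbers asked for on the slide can be checked directly; the failure rate below assumes, for illustration, that a server fails on average once every three years (an assumption, not a figure from the slide):

```python
# uptime of a non-fault-tolerant system of 1000 nodes, each 99.9% available
uptime = 0.999 ** 1000
print(f"{uptime:.2f}")        # ~0.37

# failures per minute for 1,000,000 servers, assuming one failure
# per server per three years (illustrative assumption)
servers = 1_000_000
minutes = 3 * 365 * 24 * 60
failures_per_minute = servers / minutes
print(f"{failures_per_minute:.2f}")
```

Under that assumption the cluster loses a server roughly every couple of minutes, which is why the master must detect and reassign failed tasks.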

  26. Distributed indexing Maintain a master machine directing the indexing job – considered “safe”. Break up indexing into sets of (parallel) tasks. Master machine assigns each task to an idle machine from a pool.

  27. Parallel tasks We will use two sets of parallel tasks Parsers Inverters Break the input document corpus into splits Each split is a subset of documents (corresponding to blocks in BSBI/SPIMI)

  28. Parsers Master assigns a split to an idle parser machine Parser reads a document at a time and emits (term, doc) pairs Parser writes pairs into j partitions Each partition is for a range of terms’ first letters (e.g., a-f, g-p, q-z) – here j=3. Now to complete the index inversion
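The first-letter partitioning can be sketched as a small routing function (the ranges a-f, g-p, q-z are the slide’s example; the function name is made up):

```python
def partition_for(term, ranges=("af", "gp", "qz")):
    # route a (term, doc) pair to one of j partitions by the term's first letter
    first = term[0].lower()
    for j, (lo, hi) in enumerate(ranges):
        if lo <= first <= hi:
            return j
    return len(ranges) - 1   # fall back to the last partition (e.g., digits)

print(partition_for("brutus"), partition_for("killed"), partition_for("the"))  # 0 1 2
```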

  29. Inverters An inverter collects all (term,doc) pairs (= postings) for one term-partition. Sorts and writes to postings lists

  30. Data flow [figure]: splits feed the parsers; the master assigns splits to parsers and partitions to inverters. Parsers (map phase) write segment files partitioned by term range (a-f, g-p, q-z); each inverter (reduce phase) collects one partition and produces its postings.

  31. MapReduce The index construction algorithm we just described is an instance of MapReduce. MapReduce (Dean and Ghemawat 2004) is a robust and conceptually simple framework for distributed computing without having to write code for the distribution part. They describe the Google indexing system (ca. 2002) as consisting of a number of phases, each implemented in MapReduce.

  32. MapReduce Index construction was just one phase. Another phase: transforming a term-partitioned index into a document-partitioned index. Term-partitioned: one machine handles a subrange of terms. Document-partitioned: one machine handles a subrange of documents. Most search engines use a document-partitioned index (better load balancing, etc.).

  33. Schema for index construction in MapReduce Schema of map and reduce functions map: input → list(k, v) reduce: (k,list(v)) → output Instantiation of the schema for index construction map: web collection → list(termID, docID) reduce: (<termID1, list(docID)>, <termID2, list(docID)>, …) → (postings list1, postings list2, …) Example for index construction map: d2 : C died. d1 : C came, C c’ed. → (<C, d2>, <died,d2>, <C,d1>, <came,d1>, <C,d1>, <c’ed, d1>) reduce: (<C,(d2,d1,d1)>, <died,(d2)>, <came,(d1)>, <c’ed,(d1)>) → (<C,(d1:2,d2:1)>, <died,(d2:1)>, <came,(d1:1)>, <c’ed,(d1:1)>)
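The Caesar example can be run with plain Python functions standing in for the map and reduce phases (a single-process sketch, not a real MapReduce job; tokenization here is a crude split):

```python
from collections import Counter, defaultdict

def map_fn(doc_id, text):
    # map: document -> list of (term, docID) pairs, one per token
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    return [(t, doc_id) for t in tokens]

def reduce_fn(pairs):
    # reduce: collect (term, docID) pairs into postings lists with frequencies
    postings = defaultdict(Counter)
    for term, doc_id in pairs:
        postings[term][doc_id] += 1
    return {t: sorted(c.items()) for t, c in postings.items()}

pairs = map_fn("d1", "C came, C c'ed.") + map_fn("d2", "C died.")
index = reduce_fn(pairs)
print(index["c"])  # [('d1', 2), ('d2', 1)], matching <C,(d1:2,d2:1)> on the slide
```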

  34. Several problems need to be solved • It is hard to find suitable hardware for working with data at this scale • storage, operation, maintenance, … • Effective software tools for processing this data are lacking • analysis, retrieval, evaluation, … • How to maintain software tools with state-of-the-art algorithms • module replacement, functional testing, performance testing, …

  35. Research on intelligent Chinese search engine technology and platform construction • Composed of one platform and four tasks • The platform should be able to demonstrate the four tasks • The platform: • an open, intelligent Chinese search engine platform • The tasks: • a text content extraction system for the Chinese web, with structured analysis of web page text content • content-based multimedia information retrieval • query analysis based on natural language understanding • personalized retrieval based on user attribute mining http://sewm.pku.edu.cn/src/paradise/

  36. PARADISE design: Platform for Applying, Researching And Developing Intelligent Search Engine. [Architecture figure: bare-metal machines and racks of servers, managed by automation and cluster-management tools, serving as data nodes for Tianwang jobs and evaluation jobs.]

  37. An introduction to the PARADISE index design

  38. Basic features in the current design • Uses faster single-pass indexing • Highly compressed on-disk index data structures • Various compression algorithms (*) • Various inverted index file formats (*) • Indexes positions, frequency values, or impact values (*) • Indexes field information, such as title and date • Cache support (*)

  39. Features planned next • Indexing of various document formats, such as Word and PDF • Indexing with MapReduce and related techniques • Implementation and handling of multiple word segmentation methods • Incremental indexing and document deletion • Term vector storage • Merge policies for multiple segments

  40. Inverted index • Document-sorted index • the pointers in each inverted list are stored in increasing document order • using d-gaps • Frequency-sorted index • inverted lists are stored in decreasing term-frequency score • Impact-sorted index • the pointers are ordered according to the impact value • an impact value is a small integer representing the overall contribution of term t to the score of document d, including the factor used to normalize for document length
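The d-gap idea for document-sorted lists is easy to demonstrate (a toy encoder/decoder; a real index would then compress the small gaps with a variable-length code):

```python
def to_dgaps(doc_ids):
    # store the first docID, then differences between consecutive docIDs
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]

def from_dgaps(gaps):
    out = [gaps[0]]
    for g in gaps[1:]:
        out.append(out[-1] + g)
    return out

ids = [3, 7, 8, 15, 20]
gaps = to_dgaps(ids)
print(gaps)                      # [3, 4, 1, 7, 5]
assert from_dgaps(gaps) == ids   # round-trips losslessly
```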

  41. Query processing modes • Document-at-a-time • all inverted lists are accessed simultaneously • |q|-way processing is carried out to handle a query q of |q| terms • Term-at-a-time • only one inverted list is accessed at any given time • a sequence of |q|−1 binary operations is performed • not clearly better than document-at-a-time processing • Score-at-a-time • all of the term lists are open for reading • processed in decreasing order of score contribution rather than in increasing document number order • dynamic query pruning
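Document-at-a-time processing can be sketched as a parallel merge of the query’s document-sorted lists (a toy top-k ranker over (docID, weight) postings; the scoring is a bare sum, only to show the access pattern):

```python
import heapq

def document_at_a_time(inverted_lists, k=2):
    # all |q| lists are read simultaneously via one |q|-way merge,
    # so each document's full score is known before the next doc is touched
    scores = {}
    for doc, weight in heapq.merge(*inverted_lists):
        scores[doc] = scores.get(doc, 0.0) + weight
    return sorted(scores.items(), key=lambda x: -x[1])[:k]

# toy query of two terms, each with a document-sorted (docID, weight) list
lists = [[(1, 2.0), (3, 1.0)], [(1, 1.0), (2, 3.0)]]
top = document_at_a_time(lists)
print(top)  # [(1, 3.0), (2, 3.0)]
```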

  42. Index levels • Document numbers only • supports only Boolean queries • Document numbers and scores • scores can be impact scores, frequencies, or both • supports Boolean and ranked queries • Document numbers, scores, and positions • supports Boolean, ranked, phrase, and proximity queries

  43. Types of index and support query modes

  44. Interleaving • Pointer interleaving • each document number is immediately followed by the corresponding w, f, or position value • Term interleaving • keeps the various related parts of each inverted list in contiguous blocks • Non-interleaving • uses separate inverted files, each storing a particular type of information, with distinct disk pointers maintained in each dictionary entry

  45. Block-interleaved indexes • Each inverted list is built up as a sequence of one or more groups • Within each group there are four or more k-blocks

  46. Cache • A toolkit that provides caching to other modules • Cache policy is decoupled from storage • Multiple cache policies • MRU is currently implemented • Multiple storage policies • based on STL map • Implemented with templates to keep it generic
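The PARADISE cache is C++ (templates over an STL map); purely as a language-neutral illustration of the MRU policy it describes, here is a toy Python version (names and sizes are made up):

```python
from collections import OrderedDict

class MRUCache:
    """Toy MRU cache: on overflow, evict the most recently used entry."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()          # usage order, most recent last

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)         # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        elif len(self.data) >= self.capacity:
            self.data.popitem(last=True)   # evict the MRU entry (not LRU)
        self.data[key] = value

c = MRUCache(2)
c.put("a", 1); c.put("b", 2)
c.get("a")           # "a" is now the most recently used entry
c.put("c", 3)        # overflow: evicts "a", the MRU entry
print(list(c.data))  # ['b', 'c']
```

Separating the policy (when to evict) from the storage (here an OrderedDict, there an STL map) is exactly the decoupling the slide describes.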

  47. Document • The logical representation of a Document for indexing and searching • A Document is a collection of Fieldables • A Fieldable is a logical representation of a user's content that needs to be indexed or stored • Fieldables have a number of properties that tell the index how to treat the content (like indexed, tokenized, stored, etc.)
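The Document/Fieldable model above can be sketched as two small classes (an illustrative sketch, not the actual PARADISE API; field names and flags are made-up examples):

```python
from dataclasses import dataclass

@dataclass
class Fieldable:
    # a logical piece of user content plus flags telling the index how to treat it
    name: str
    value: str
    indexed: bool = True
    tokenized: bool = True
    stored: bool = False

class Document:
    """A Document is simply a collection of Fieldables."""
    def __init__(self):
        self.fields = []

    def add(self, field):
        self.fields.append(field)

    def get(self, name):
        return [f for f in self.fields if f.name == name]

doc = Document()
doc.add(Fieldable("title", "PARADISE overview", stored=True))
doc.add(Fieldable("date", "2009-04-24", tokenized=False, stored=True))
print([f.name for f in doc.fields])  # ['title', 'date']
```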
