Building a Distributed Full-Text Index for the Web

S. Melnik, S. Raghavan, B. Yang, H. Garcia-Molina

Outline: • Introduction. • Testbed architecture. • Design of the indexer. • Distributed indexing.

Example pages: • Page 1: Pig Cat Fish Cat • Page 2: Fly Dog Pig • Page 3: Dog Cat Fish Dog
Inverted index (one inverted list per term, each entry a location): • Cat -> (1,2), (1,4), (3,2) • Dog -> (2,2), (3,1), (3,4) • Fish -> (1,3), (3,3) • Pig -> (1,1), (2,3)

An inverted index consists of one inverted list for each index term, with the terms kept in sorted order. An inverted list consists of locations in sorted order. A location consists of (page identifier, position in the page). A posting consists of (index term, location).

Building an inverted index over a collection of web pages involves: 1. Processing each page to extract postings. 2. Building an inverted list for each term. 3. Writing the inverted lists out to disk.
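The three steps above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the authors' indexer: the page contents are the three example pages from the earlier slide, and the function names, lowercasing, and the tab-separated output format are assumptions made for the example.

```python
from collections import defaultdict

# Example pages from the slide: page identifier -> page text.
PAGES = {1: "Pig Cat Fish Cat", 2: "Fly Dog Pig", 3: "Dog Cat Fish Dog"}

def extract_postings(page_id, text):
    """Step 1: yield one (term, (page_id, position)) posting per word (1-based positions)."""
    for position, word in enumerate(text.split(), start=1):
        yield word.lower(), (page_id, position)

def build_inverted_index(pages):
    """Step 2: group postings by term; terms and locations are kept sorted."""
    lists = defaultdict(list)
    for page_id, text in pages.items():
        for term, location in extract_postings(page_id, text):
            lists[term].append(location)
    return {term: sorted(lists[term]) for term in sorted(lists)}

def write_index(index, path):
    """Step 3: write one line per term (hypothetical text format)."""
    with open(path, "w", encoding="utf-8") as out:
        for term, locations in index.items():
            out.write(f"{term}\t{locations}\n")

index = build_inverted_index(PAGES)
```

A real web-scale indexer cannot hold all postings in memory; it sorts and flushes them in runs, which is what the pipeline slides later in the deck address.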

Important problems when building a web-scale inverted index: 1. Scale and growth rate. 2. Rate of change.

Testbed architecture: • Distributors. • Indexers. • Query servers.

Distributed inverted index organization: • Local inverted files. • Global inverted files.

Global inverted files (the index is partitioned by term):
• Query server 1 (terms a-e): Cat -> (1,2), (1,4), (3,2); Dog -> (2,2), (3,1), (3,4)
• Query server 2 (terms f-z): Fish -> (1,3), (3,3); Pig -> (1,1), (2,3)
Pages: • Page 1: Pig Cat Fish Cat • Page 2: Fly Dog Pig • Page 3: Dog Cat Fish Dog

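Local inverted files (the index is partitioned by page):
• Query server 1 (pages 1 and 2): Cat -> (1,2), (1,4); Dog -> (2,2); Fish -> (1,3); Fly -> (2,1); Pig -> (1,1), (2,3)
• Query server 2 (page 3): Cat -> (3,2); Dog -> (3,1), (3,4); Fish -> (3,3)
Pages: • Page 1: Pig Cat Fish Cat • Page 2: Fly Dog Pig • Page 3: Dog Cat Fish Dog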
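The two organizations can be contrasted with a short sketch. It assumes the split shown in the figures (terms a-e versus f-z for global files; pages 1 and 2 versus page 3 for local files); the function names and the dictionary layout are hypothetical. The posting for Fly is included because the word occurs on page 2.

```python
# All postings for the three example pages, as (term, (page_id, position)).
POSTINGS = [
    ("cat", (1, 2)), ("cat", (1, 4)), ("cat", (3, 2)),
    ("dog", (2, 2)), ("dog", (3, 1)), ("dog", (3, 4)),
    ("fish", (1, 3)), ("fish", (3, 3)),
    ("fly", (2, 1)),
    ("pig", (1, 1)), ("pig", (2, 3)),
]

def add(server_index, term, location):
    """Append a location to a term's inverted list on one server."""
    server_index.setdefault(term, []).append(location)

def global_partition(postings):
    """Global inverted files: each server owns a range of terms (a-e, f-z)."""
    servers = {1: {}, 2: {}}
    for term, location in postings:
        server = 1 if term[0] <= "e" else 2
        add(servers[server], term, location)
    return servers

def local_partition(postings, page_to_server):
    """Local inverted files: each server indexes only its own subset of pages."""
    servers = {server: {} for server in set(page_to_server.values())}
    for term, location in postings:
        page_id = location[0]
        add(servers[page_to_server[page_id]], term, location)
    return servers

global_files = global_partition(POSTINGS)
local_files = local_partition(POSTINGS, {1: 1, 2: 1, 3: 2})
```

With global files, a single-term query touches one server; with local files, the same query must be sent to every server and the partial lists combined.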

Local vs. Global • Resilience to failures. • Network load.

Testbed environment: The indexers and the query servers are single-processor PCs with 350-500 MHz processors and 300-500 MB of main memory, each equipped with multiple disks. All the machines are interconnected by a 100 Mbps Ethernet LAN.

The WebBase collection: To study some properties of web pages that are relevant to text indexing, we analyzed 5 samples, of 100,000 pages each, from different portions of the WebBase repository.

Table 1: Properties of the WebBase collection

Design of the Indexer • Software pipeline. • The storage of the inverted files generated by the process.

Software pipeline • The indexing process can logically be split into 3 phases: • Loading -> network intensive. • Processing -> CPU intensive. • Flushing -> disk intensive.

The goal of our pipelining technique is to design an execution schedule for the different indexing phases that will result in minimal overall running time. [Figure: example execution of the pipeline, showing the loading (L), processing (P), and flushing (F) phases.]

[Figure: pipeline time, with time axis t.]
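To see why overlapping the phases shortens the run, the sketch below compares a purely sequential schedule with a simple pipelined one. It is an idealized model, not the schedule analyzed in the paper: it assumes every buffer needs the same loading, processing, and flushing time, that each phase uses a separate resource (network, CPU, disk), and that a buffer can wait between phases without blocking the previous one. All names and timings are hypothetical.

```python
def sequential_time(n_buffers, load, process, flush):
    """Each buffer is loaded, processed, and flushed before the next one starts."""
    return n_buffers * (load + process + flush)

def pipelined_time(n_buffers, load, process, flush):
    """Phases of different buffers overlap; each phase handles one buffer at a time."""
    phase_times = (load, process, flush)
    done = [0.0] * len(phase_times)  # when each phase finished its latest buffer
    for _ in range(n_buffers):
        previous_phase_done = 0.0
        for k, duration in enumerate(phase_times):
            # A phase starts once it is free AND the buffer left the previous phase.
            start = max(done[k], previous_phase_done)
            done[k] = start + duration
            previous_phase_done = done[k]
    return done[-1]

seq = sequential_time(3, load=2, process=5, flush=1)
pipe = pipelined_time(3, load=2, process=5, flush=1)
```

In this model the slowest phase (here processing) dominates the pipelined running time, so balancing the three phases is what makes pipelining pay off.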

Theoretical analysis vs. experimental results


Storage schemes: We considered three storage schemes for storing inverted files as sets of (key, value) pairs in a B-tree: 1. Full list. 2. Single payload. 3. Mixed list.
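The slide only names the three schemes, so the sketch below is one plausible reading of them rather than the authors' layout. It assumes that full list stores a term's whole inverted list under the term, that single payload stores each posting as its own key with an empty value, and that mixed list stores a posting as the key and a few of the following postings as the value. A plain dict stands in for the B-tree, and `chunk_size` is a hypothetical parameter.

```python
POSTINGS = [
    ("cat", (1, 2)), ("cat", (1, 4)), ("cat", (3, 2)),
    ("dog", (2, 2)), ("dog", (3, 1)), ("dog", (3, 4)),
    ("fish", (1, 3)), ("fish", (3, 3)),
]

def full_list(postings):
    """Key: index term. Value: the complete inverted list for that term."""
    store = {}
    for term, location in sorted(postings):
        store.setdefault(term, []).append(location)
    return store

def single_payload(postings):
    """Key: one posting (term, location). Value: empty payload."""
    return {posting: b"" for posting in sorted(postings)}

def mixed_list(postings, chunk_size=3):
    """Key: a posting. Value: the next postings in sorted order (may span terms)."""
    ordered = sorted(postings)
    return {
        ordered[i]: ordered[i + 1:i + chunk_size]
        for i in range(0, len(ordered), chunk_size)
    }

full = full_list(POSTINGS)
single = single_payload(POSTINGS)
mixed = mixed_list(POSTINGS)
```

Under this reading, full list has the fewest keys but the largest values, single payload is the reverse, and mixed list sits between the two.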

A qualitative comparison of these storage schemes: • Index size • Zig-zag joins • Hot updates

Zig-zag join using ordered indexes. First list: 1 2 3 4 7 9 18. Second list: 1 7 9 11 12 17 19.
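A zig-zag join intersects two sorted lists by alternately seeking in each list to the other list's current value, skipping entries that cannot match. The sketch below is a generic in-memory version that uses binary search for the seek; with a B-tree, the seek would be an index lookup instead. The function name is hypothetical and the inputs are the two lists from the slide.

```python
from bisect import bisect_left

def zigzag_join(a, b):
    """Return the values present in both sorted lists a and b."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Seek forward in a to the first entry >= b[j].
            i = bisect_left(a, b[j], i + 1)
        else:
            # Seek forward in b to the first entry >= a[i].
            j = bisect_left(b, a[i], j + 1)
    return result

first = [1, 2, 3, 4, 7, 9, 18]
second = [1, 7, 9, 11, 12, 17, 19]
common = zigzag_join(first, second)  # [1, 7, 9]
```

On these lists the join jumps from 2 directly to 7 in the first list and from 11 directly to 19 in the second, which is the saving over a linear merge.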

Experimental results (using mixed list)

Table 5: Mixed-list scheme index sizes. Only one posting was generated for all the occurrences of a word in a page.

Two problems that must be addressed when building an inverted index on a distributed architecture: • Page distribution: The question of when and how to distribute pages to the indexing nodes. • Collecting global statistics: The question of where, when, and how to compute and distribute global statistics.

Two strategies for page distribution: • A priori distribution. • Runtime distribution.

Three advantages of runtime distribution: • Space. • Load balancing. • Effective pipelining.

Collecting global statistics • A dedicated server known as the statistician. • Parallel computation. • Minimize the number of conversations among servers. • Avoid extra disk I/O. • Reduce network overhead.

Two strategies for sending information to the statistician: • ME Strategy: sending local information during merging. • FL Strategy: sending local information during flushing.
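The role of the statistician can be illustrated with a small sketch. The slides do not say which statistics are collected, so this example assumes document frequency (the number of pages containing a term); the class and method names are hypothetical. It shows only the aggregation itself and does not model when the local information is sent, which is the difference between the ME and FL strategies.

```python
from collections import Counter

class Statistician:
    """Dedicated server that sums local statistics into global ones."""

    def __init__(self):
        self.global_counts = Counter()

    def receive(self, local_counts):
        """Accept one batch of local (term -> count) information from an indexer."""
        self.global_counts.update(local_counts)

    def global_stats_for(self, terms):
        """Return the global counts an indexer needs for its own terms."""
        return {term: self.global_counts[term] for term in terms}

def local_document_frequencies(pages):
    """Count, for each term, the number of local pages that contain it."""
    counts = Counter()
    for text in pages.values():
        counts.update({word.lower() for word in text.split()})
    return counts

# Two indexers, each holding a subset of the example pages.
indexer_1 = local_document_frequencies({1: "Pig Cat Fish Cat", 2: "Fly Dog Pig"})
indexer_2 = local_document_frequencies({3: "Dog Cat Fish Dog"})

statistician = Statistician()
statistician.receive(indexer_1)
statistician.receive(indexer_2)
stats = statistician.global_stats_for(["cat", "dog", "fly"])
```

Sending the counts in batches that the indexer already has in memory is what lets a scheme like this avoid extra disk reads on the indexer side.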

Comparison
