Building a Distributed Full-Text Index for the Web S. Melnik, S. Raghavan, B. Yang, H. Garcia-Molina
• Introduction. • Testbed architecture. • Design of the indexer. • Distributed indexing.
Introduction
Example collection of three pages:
Page 1: Pig Cat Fish Cat
Page 2: Fly Dog Pig
Page 3: Dog Cat Fish Dog
Inverted index (one inverted list per term; each entry is a location):
Cat -> (1,2), (1,4), (3,2)
Dog -> (2,2), (3,1), (3,4)
Fish -> (1,3), (3,3)
Pig -> (1,1), (2,3)
An inverted index consists of one inverted list per index term, with the terms kept in sorted order. An inverted list consists of the locations of a term, in sorted order. A location consists of (page identifier, position in the page). A posting consists of (index term, location).
Building an inverted index over a collection of web pages involves: 1. Processing each page to extract postings. 2. Building an inverted list for each term. 3. Writing the inverted lists out to disk.
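The three steps above can be sketched on the three example pages. This is a minimal illustration, not the authors' indexer: the function names and the tab-separated on-disk format are assumptions made for the example, and words are split on whitespace only.

```python
from collections import defaultdict

# The three example pages from the slides, keyed by page identifier.
PAGES = {1: "Pig Cat Fish Cat", 2: "Fly Dog Pig", 3: "Dog Cat Fish Dog"}

def extract_postings(page_id, text):
    """Step 1: one (term, (page_id, position)) posting per word; positions start at 1."""
    return [(term, (page_id, pos)) for pos, term in enumerate(text.split(), start=1)]

def build_index(pages):
    """Step 2: sort postings by term, then location, and group them into inverted lists."""
    postings = []
    for page_id, text in pages.items():
        postings.extend(extract_postings(page_id, text))
    postings.sort()
    index = defaultdict(list)
    for term, location in postings:
        index[term].append(location)
    return dict(index)

def index_lines(index):
    """Render one inverted list per line (hypothetical format: term, tab, locations)."""
    for term, locations in index.items():
        yield term + "\t" + " ".join(f"{page},{pos}" for page, pos in locations)

def write_index(index, path):
    """Step 3: write the inverted lists out to disk."""
    with open(path, "w", encoding="utf-8") as out:
        for line in index_lines(index):
            out.write(line + "\n")

index = build_index(PAGES)
```

Here `index["Cat"]` holds the locations (1,2), (1,4), (3,2), matching the example. The sketch also produces Fly -> (2,1), which appears in the local inverted files example.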
Important problems when building a web-scale inverted index: 1. Scale and growth rate. 2. Rate of change.
Testbed architecture • Distributors. • Indexers. • Query servers.
Distributed inverted index organization: • Local inverted files. • Global inverted files.
Global inverted files (each query server stores the inverted lists for a range of terms):
Query server 1 (a-e): Cat -> (1,2), (1,4), (3,2); Dog -> (2,2), (3,1), (3,4)
Query server 2 (f-z): Fish -> (1,3), (3,3); Pig -> (1,1), (2,3)
Pages: 1: Pig Cat Fish Cat; 2: Fly Dog Pig; 3: Dog Cat Fish Dog
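Local inverted files (each query server stores an inverted index over its own subset of the pages):
Query server 1 (pages 1 and 2): Cat -> (1,2), (1,4); Dog -> (2,2); Fish -> (1,3); Fly -> (2,1); Pig -> (1,1), (2,3)
Query server 2 (page 3): Cat -> (3,2); Dog -> (3,1), (3,4); Fish -> (3,3)
Pages: 1: Pig Cat Fish Cat; 2: Fly Dog Pig; 3: Dog Cat Fish Dog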
Local vs. Global • Resilience to failures. • Network load.
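The two organizations can be compared with a small sketch on the example pages. It is an illustration under stated assumptions: the server names `qs1` and `qs2`, the helper functions, and the count of contacted servers as a stand-in for network load are all invented for the example.

```python
from collections import defaultdict

PAGES = {1: "Pig Cat Fish Cat", 2: "Fly Dog Pig", 3: "Dog Cat Fish Dog"}
TERM_RANGES = {"qs1": ("a", "e"), "qs2": ("f", "z")}   # global: split by term
PAGE_GROUPS = {"qs1": [1, 2], "qs2": [3]}              # local: split by page

def build_index(pages):
    """Inverted index with locations in sorted (page, position) order."""
    index = defaultdict(list)
    for page_id in sorted(pages):
        for pos, term in enumerate(pages[page_id].split(), start=1):
            index[term].append((page_id, pos))
    return dict(index)

def global_partition(pages, ranges):
    """Each server holds the complete inverted lists for its range of terms."""
    servers = {name: {} for name in ranges}
    for term, locations in build_index(pages).items():
        for name, (lo, hi) in ranges.items():
            if lo <= term[0].lower() <= hi:
                servers[name][term] = locations
    return servers

def local_partition(pages, groups):
    """Each server indexes only the pages assigned to it."""
    return {name: build_index({p: pages[p] for p in ids}) for name, ids in groups.items()}

def query_global(term, servers, ranges):
    """Route the query to the single server whose range covers the term."""
    for name, (lo, hi) in ranges.items():
        if lo <= term[0].lower() <= hi:
            return servers[name].get(term, []), 1
    return [], 0

def query_local(term, servers):
    """Broadcast the query: any server may hold part of the term's list."""
    merged = []
    for index in servers.values():
        merged.extend(index.get(term, []))
    return sorted(merged), len(servers)

GLOBAL = global_partition(PAGES, TERM_RANGES)
LOCAL = local_partition(PAGES, PAGE_GROUPS)
```

Both `query_global("Dog", GLOBAL, TERM_RANGES)` and `query_local("Dog", LOCAL)` return the same locations, but the first contacts one server and the second contacts every server. If one server fails, the global organization loses all answers for that term range, while the local organization still returns results from the surviving pages.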
Testbed environment: The indexers and the query servers are single-processor PCs with 350-500 MHz processors and 300-500 MB of main memory, each equipped with multiple disks. All the machines are interconnected by a 100 Mbps Ethernet LAN.
The WebBase collection: To study some properties of web pages that are relevant to text indexing, we analyzed 5 samples, of 100,000 pages each, from different portions of the WebBase repository.
Design of the Indexer • Software pipeline. • The storage of the inverted files generated by the process.
Software pipeline • The indexing process can logically be split into 3 phases: • Processing (P) -> CPU intensive. • Flushing (F) -> disk. • Loading (L) -> network.
The goal of our pipelining technique is to design an execution schedule for the different indexing phases that will result in minimal overall running time.
[Figure: execution of the pipeline, showing the loading (L), processing (P), and flushing (F) phases against time t.]
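The gain from overlapping the phases can be estimated with a simple model. The sketch below assumes that each phase uses its own resource (network, CPU, disk), that a buffer passes through loading, processing, and flushing in that order, and that phase times are constant per buffer. These are simplifying assumptions for illustration, not the schedule derived by the authors.

```python
def sequential_time(n_buffers, load, process, flush):
    """Total time when each buffer is loaded, processed, and flushed before the next starts."""
    return n_buffers * (load + process + flush)

def pipelined_time(n_buffers, load, process, flush):
    """Total time when the phases of different buffers overlap.

    A phase of buffer i starts once the same buffer has finished its previous
    phase and the previous buffer has released that phase's resource.
    """
    phase_times = (load, process, flush)
    prev_finish = [0.0] * len(phase_times)  # finish times of the previous buffer
    for _ in range(n_buffers):
        ready = 0.0
        cur_finish = []
        for k, duration in enumerate(phase_times):
            start = max(ready, prev_finish[k])
            ready = start + duration
            cur_finish.append(ready)
        prev_finish = cur_finish
    return prev_finish[-1]
```

With three buffers and hypothetical phase times of 2, 3, and 1 time units, the sequential schedule takes 18 units and the pipelined schedule takes 12. In this model the slowest phase bounds the pipelined running time.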
Storage schemes: We considered three schemes for storing inverted files as sets of (key, value) pairs in a B-tree: 1. Full list. 2. Single payload. 3. Mixed list.
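The slide names the schemes without defining them. The sketch below shows one reading of them and should be treated as an assumption: in the full-list scheme the key is a term and the value is its whole inverted list, in the single-payload scheme every posting is its own key, and in the mixed-list scheme the key is a posting and the value holds the next few postings in sorted order. A Python dict stands in for the B-tree, nothing is compressed, and the fixed `group_size` is a simplification.

```python
INDEX = {
    "Cat": [(1, 2), (1, 4), (3, 2)],
    "Dog": [(2, 2), (3, 1), (3, 4)],
    "Fish": [(1, 3), (3, 3)],
    "Pig": [(1, 1), (2, 3)],
}

def all_postings(index):
    """Every (term, location) posting, sorted by term and then by location."""
    return sorted((term, loc) for term, locs in index.items() for loc in locs)

def full_list(index):
    """One pair per term: key = term, value = complete inverted list."""
    return {term: list(locs) for term, locs in index.items()}

def single_payload(index):
    """One pair per posting: key = posting, value left empty here."""
    return {posting: None for posting in all_postings(index)}

def mixed_list(index, group_size=4):
    """One pair per group of postings: key = first posting, value = the rest of the group."""
    postings = all_postings(index)
    return {
        postings[i]: postings[i + 1:i + group_size]
        for i in range(0, len(postings), group_size)
    }
```

On the ten postings of the example, the three layouts produce 4, 10, and 3 (key, value) pairs respectively, which hints at why the schemes differ in index size and in how cheaply a single posting can be updated.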
A qualitative comparison of these storage schemes: • Index size • Zig-zag joins • Hot updates
Zig-zag join using ordered indexes:
List A: 1 2 3 4 7 9 18
List B: 1 7 9 11 12 17 19
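A zig-zag join intersects two sorted lists by alternately seeking forward in each list to the other list's current value, so runs of non-matching entries are skipped instead of scanned. A minimal sketch follows, with binary search on in-memory lists standing in for the ordered-index seek. The function name is chosen for the example.

```python
from bisect import bisect_left

def zigzag_join(a, b):
    """Intersect two sorted lists, skipping ahead with binary search."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            # Seek in a to the first entry >= b[j].
            i = bisect_left(a, b[j], i + 1)
        else:
            # Seek in b to the first entry >= a[i].
            j = bisect_left(b, a[i], j + 1)
    return result

matches = zigzag_join([1, 2, 3, 4, 7, 9, 18], [1, 7, 9, 11, 12, 17, 19])
```

On the two lists from the slide, `matches` is `[1, 7, 9]`. After matching 1, the join seeks straight from 2 to 7 in the first list without visiting 3 and 4.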
Table 5: Mixed-list scheme index sizes. Only one posting was generated for all the occurrences of a word in a page.
Distributed indexing
Two problems that must be addressed when building an inverted index on a distributed architecture: • Page distribution: the question of when and how to distribute pages to the indexing nodes. • Collecting global statistics: the question of where, when, and how to compute and distribute global statistics.
Two strategies for page distribution: • A priori distribution. • Runtime distribution.
Three advantages of runtime distribution: • Space. • Load balancing. • Effective pipelining.
Collecting global statistics • A dedicated server known as the statistician. • Parallel computation. • Minimizes the number of conversations among servers. • Avoids extra disk I/O. • Reduces network overhead.
Two strategies for sending information to the statistician: • ME Strategy: sending local information during merging. • FL Strategy: sending local information during flushing.
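The role of the statistician can be sketched as follows. Several details are assumptions made for illustration: the global statistic is taken to be document frequency per term, the FL strategy is modeled as one message per flushed run, and the ME strategy as one message per indexer after its runs are merged. The class and function names are hypothetical.

```python
from collections import Counter

class Statistician:
    """Dedicated server that sums the local statistics sent by the indexers."""

    def __init__(self):
        self.doc_freq = Counter()
        self.messages = 0

    def receive(self, local_counts):
        self.doc_freq.update(local_counts)
        self.messages += 1

def local_doc_freq(pages):
    """Number of pages in which each term occurs, for one run of pages."""
    counts = Counter()
    for text in pages:
        counts.update(set(text.split()))
    return counts

def send_fl(statistician, runs):
    """FL sketch: send a summary each time a run is flushed."""
    for run in runs:
        statistician.receive(local_doc_freq(run))

def send_me(statistician, runs):
    """ME sketch: combine the runs first, then send a single summary."""
    merged = Counter()
    for run in runs:
        merged.update(local_doc_freq(run))
    statistician.receive(merged)

# Two hypothetical indexers: the first flushed two runs, the second one run.
INDEXER_RUNS = [[["Pig Cat Fish Cat"], ["Fly Dog Pig"]], [["Dog Cat Fish Dog"]]]

fl, me = Statistician(), Statistician()
for runs in INDEXER_RUNS:
    send_fl(fl, runs)
    send_me(me, runs)
```

Both statisticians end with the same global counts (for example, Cat occurs in 2 pages), but in this model `fl` received 3 messages and `me` received 2. In both cases the statistics are gathered from data the indexer already has in memory, which is what avoids the extra disk I/O.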