Index Construction: sorting
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa
Reading: Chap. 4
Indexer steps
• Dictionary & postings: how do we construct them?
• Scan the texts
• Find the proper token list
• Append the docID to that token list
• How do we:
• Find? …a time issue…
• Append? …a space issue…
• Postings' size?
• Dictionary size? …in-memory issues…
Indexer steps: create tokens
• Sequence of pairs: <modified token, docID>
• What about variable-length strings?
Doc 1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me."
Doc 2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious"
Indexer steps: sort
• Sort by term
• Then sort by docID
This is the core indexing step.
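A minimal Python sketch of the create-and-sort steps on the two toy documents above (the lowercase/strip-punctuation tokenizer is a simplifying assumption, not the course's tokenizer):

# Build the sequence of <modified token, docID> pairs, then sort it.
docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
}

pairs = []
for doc_id, text in docs.items():
    for token in text.lower().replace(";", " ").replace(".", " ").split():
        pairs.append((token, doc_id))   # <modified token, docID>

pairs.sort()                            # by term first, then by docID
print(pairs[:5])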
Sec. 1.2 Indexer steps: Dictionary & Postings • Multiple term entries in a single document are merged. • Split into Dictionary and Postings • Doc. frequency information is added.
Sec. 1.2
Some key issues…
[Figure: the dictionary (terms and counts) with pointers to the postings lists of docIDs]
• Now:
• How do we sort?
• How much storage is needed?
Keep attention on the disk...
• If sorting needs to manage strings, two key observations:
• Array A is an "array of pointers to objects": you sort A, not the strings
• Each comparison A[i] vs A[j] costs 2 random accesses to the 2 memory locations holding the strings
• Θ(n log n) random memory accesses (I/Os??)
• Again, caching helps, but how much?
Binary Merge-Sort

Merge-Sort(A,i,j)
  if (i < j) then            // Divide
    m = (i+j)/2
    Merge-Sort(A,i,m)        // Conquer
    Merge-Sort(A,m+1,j)      // Conquer
    Merge(A,i,m,j)           // Combine

Merge is linear in the # of items to be merged.
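A direct, runnable Python translation of the pseudocode above (a sketch; the copy-based Merge is one of several possible implementations):

def merge_sort(a, i, j):
    if i < j:                          # Divide
        m = (i + j) // 2
        merge_sort(a, i, m)            # Conquer left half
        merge_sort(a, m + 1, j)        # Conquer right half
        merge(a, i, m, j)              # Combine

def merge(a, i, m, j):
    # Linear-time merge of the two sorted runs a[i..m] and a[m+1..j].
    left, right = a[i:m + 1], a[m + 1:j + 1]
    li = ri = 0
    for k in range(i, j + 1):
        if ri >= len(right) or (li < len(left) and left[li] <= right[ri]):
            a[k] = left[li]; li += 1
        else:
            a[k] = right[ri]; ri += 1

a = [15, 4, 10, 2, 13, 19, 5, 1]
merge_sort(a, 0, len(a) - 1)
print(a)   # [1, 2, 4, 5, 10, 13, 15, 19]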
A few key observations
• Items = (short) strings = atomic...
• On English Wikipedia, about 10^9 tokens to sort
• Θ(n log n) memory accesses (I/Os??)
• [5 ms] × n log2 n ≈ 3 years
In practice it is faster. Why?
SPIMI: Single-Pass In-Memory Indexing
• Key idea #1: generate a separate dictionary for each block (no need for a term → termID mapping)
• Key idea #2: don't sort; accumulate postings in postings lists as they occur (in internal memory)
• Generate an inverted index for each block:
• more space is available for postings
• compression is possible
• Merge the block indexes into one big index:
• easy append with 1 file per postings list (docIDs are increasing within a block), otherwise multi-way merge, as in merge-sort
A minimal sketch is given below.
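A minimal SPIMI sketch, under two assumptions not in the slides: the token stream yields (term, docID) pairs for one block that fits in memory and is grouped by document, and each block is written to an illustrative block_<id>.txt file:

from collections import defaultdict

def spimi_invert(token_stream, block_id):
    index = defaultdict(list)            # a separate dictionary per block
    for term, doc_id in token_stream:    # no sorting of the pairs
        postings = index[term]
        if not postings or postings[-1] != doc_id:
            postings.append(doc_id)      # accumulate postings as they occur
    # terms are sorted once, only so the final block merge is sequential
    with open(f"block_{block_id}.txt", "w") as out:
        for term in sorted(index):
            out.write(term + " " + " ".join(map(str, index[term])) + "\n")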
Use the same algorithm for disk?
• Multi-way merge-sort, aka BSBI: Blocked Sort-Based Indexing
• The term → termID mapping
• must be kept in memory to construct the pairs
• consumes memory during pair creation
• Needs two passes, unless you use hashing and thus accept some probability of collision.
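A hedged BSBI-style sketch of run generation: each block of (termID, docID) pairs is sorted in memory and written out as a sorted run; the block size and the run_<id>.pkl file names are illustrative. The runs are then merged with the multi-way merge shown later:

import pickle

def write_sorted_runs(pair_stream, block_size):
    run_files, block = [], []
    for pair in pair_stream:             # pair = (termID, docID)
        block.append(pair)
        if len(block) == block_size:     # block fills the memory budget M
            run_files.append(_flush(block, len(run_files)))
            block = []
    if block:
        run_files.append(_flush(block, len(run_files)))
    return run_files

def _flush(block, run_id):
    block.sort()                         # in-memory sort: no I/Os
    name = f"run_{run_id}.pkl"
    with open(name, "wb") as f:
        pickle.dump(block, f)
    return name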
[Figure: binary merge-sort recursion tree on 16 items (15 4 10 2 13 19 5 1 9 7 8 3 12 17 6 11); recursion depth = log2 N]
Implicit caching…
[Figure: the merge levels of binary merge-sort on N items]
• N/M runs, each sorted in internal memory (no I/Os)
• Each level makes 2 passes (one read, one write) = 2·(N/B) I/Os
• log2(N/M) levels ⇒ I/O-cost of binary merge-sort ≈ 2·(N/B)·log2(N/M)
A key inefficiency
• After a few steps, every run is longer than B!!!
[Figure: two input runs and the output run streamed through one page each: 2 input buffers + 1 output buffer, flushed to disk]
• We are using only 3 pages
• But memory contains M/B pages ≈ 2^30/2^15 = 2^15
Multi-way Merge-Sort
• Sort N items with main memory M and disk pages of size B:
• Pass 1: produce N/M sorted runs
• Pass i: merge X = M/B − 1 runs at a time ⇒ log_X(N/M) merge passes
[Figure: X main-memory buffers of B items, one page per run, plus one output page; runs live on disk]
A heap-based sketch of one X-way merge is given below.
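In the sketch, runs is a list of iterators over sorted runs (buffered file readers, in practice); Python's heapq.merge implements exactly this pattern:

import heapq

def xway_merge(runs, write):
    heap = []
    for r, run in enumerate(runs):       # one heap entry per run: its head
        first = next(run, None)
        if first is not None:
            heap.append((first, r))
    heapq.heapify(heap)
    while heap:
        item, r = heapq.heappop(heap)    # smallest head across all X runs
        write(item)                      # goes to the output buffer page
        nxt = next(runs[r], None)        # refill from the run just consumed
        if nxt is not None:
            heapq.heappush(heap, (nxt, r))

out = []
xway_merge([iter([1, 5, 9]), iter([2, 7]), iter([3, 4, 8])], out.append)
print(out)   # [1, 2, 3, 4, 5, 7, 8, 9]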
How it works
[Figure: X-way merging of the N/M sorted runs, level by level]
• N/M runs, each sorted in internal memory (no I/Os)
• Each level makes 2 passes (one read, one write) = 2·(N/B) I/Os
• log_X(N/M) levels ⇒ I/O-cost for X-way merge ≈ 2·(N/B) I/Os per level
Cost of Multi-way Merge-Sort
• Number of passes = log_X(N/M) ≈ log_{M/B}(N/M)
• Total I/O-cost is Θ((N/B)·log_{M/B}(N/M)) I/Os
• In practice
• M/B ≈ 10^5 ⇒ #passes = 1 ⇒ a few minutes
• Tuning depends on disk features
• A large fan-out (M/B) decreases #passes
• Compression would decrease the cost of a pass!
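A back-of-the-envelope check of the one-pass claim, with illustrative sizes (1 GiB of memory, 32 KiB pages, 1 TiB of data; not the slides' exact numbers):

import math

M = 2**30            # main memory, in bytes
B = 2**15            # disk-page size, in bytes
N = 2**40            # input size, in bytes

fan_out = M // B - 1                        # X = M/B - 1 runs merged at a time
runs = math.ceil(N / M)                     # after pass 1: N/M sorted runs
passes = math.ceil(math.log(runs, fan_out)) # merge levels needed
print(fan_out, runs, passes)                # 32767 1024 1  -> one merge pass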
Distributed indexing • For web-scale indexing: must use a distributed computing cluster of PCs • Individual machines are fault-prone • Can unpredictably slow down or fail • How do we exploit such a pool of machines?
Distributed indexing • Maintain a master machine directing the indexing job – considered “safe”. • Break up indexing into sets of (parallel) tasks. • Master machine assigns tasks to idle machines • Other machines can play many roles during the computation
Parallel tasks
• We will use two kinds of parallel tasks
• Parsers and Inverters
• The work can be partitioned in two ways:
• term-based partition: one machine handles a subrange of terms
• doc-based partition: one machine handles a subrange of documents
Data flow: doc-based partitioning
[Figure: the master assigns document sets Set1…Setk to parsers; each parser feeds one inverter, which builds a local index IL1…ILk]
Each query term goes to many machines.
Data flow: term-based partitioning
[Figure: the master assigns document sets to parsers; each parser splits its output into term-range segments (a-f, g-p, q-z); each inverter collects one term range from all parsers and builds its postings]
Each query term goes to one machine.
MapReduce
• A robust and conceptually simple framework for distributed computing…
• …without having to write code for the distribution part.
• The Google indexing system (ca. 2002) consisted of a number of phases, each implemented in MapReduce.
Data flow: term-based partitioning with MapReduce
[Figure: parsers implement the map phase, writing segment files (16-64 MB) to local disks; inverters implement the reduce phase, each collecting one term range (a-f, g-p, q-z)]
How do we guarantee that each term range fits on one machine?
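A toy, single-machine MapReduce-style sketch (all names are illustrative; a real system shards the map output by term range and runs the two phases on different machines):

from collections import defaultdict

def map_phase(doc_id, text):
    # Parser: emit one (term, docID) pair per token.
    return [(token, doc_id) for token in text.lower().split()]

def reduce_phase(pairs):
    # Inverter: collect all pairs for its term range into postings lists.
    index = defaultdict(list)
    for term, doc_id in sorted(pairs):
        if not index[term] or index[term][-1] != doc_id:
            index[term].append(doc_id)
    return dict(index)

pairs = map_phase(1, "so let it be with Caesar") + \
        map_phase(2, "Caesar was ambitious")
print(reduce_phase(pairs))   # {'be': [1], 'caesar': [1, 2], ...}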
Dynamic indexing
• Up to now, we have assumed static collections.
• It now happens more and more frequently that:
• documents come in over time
• documents are deleted and modified
• And this induces:
• postings updates for terms already in the dictionary
• new terms added to / old terms deleted from the dictionary
Simplest approach
• Maintain a "big" main index
• New docs go into a "small" auxiliary index
• Search across both, and merge the results
• Deletions:
• keep an invalidation bit-vector for deleted docs
• filter search results (i.e. docs) through the invalidation bit-vector
• Periodically, re-index into one main index
A sketch of this two-index search is given below.
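A minimal sketch, assuming indexes are plain term → sorted-docID-list dictionaries (an assumption for illustration):

def search(term, main_index, aux_index, invalidated):
    # Query both indexes, merge, then filter via the invalidation bit-vector.
    hits = main_index.get(term, []) + aux_index.get(term, [])
    return sorted(d for d in set(hits) if not invalidated[d])

main_index = {"caesar": [1, 2]}
aux_index = {"caesar": [5]}          # newly arrived doc
invalidated = [False] * 8
invalidated[2] = True                # doc 2 was deleted
print(search("caesar", main_index, aux_index, invalidated))   # [1, 5]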
Issues with 2 indexes
• Poor performance
• Merging the auxiliary index into the main index is efficient if we keep a separate file for each postings list.
• The merge is then a simple append [new docIDs are greater].
• But this needs a lot of files: inefficient for the O/S.
• In reality: use a scheme somewhere in between (e.g., split very large postings lists, collect postings lists of length 1 in one file, etc.)
Logarithmic merge
• Maintain a series of indexes, each twice as large as the previous one: 2^0·M, 2^1·M, 2^2·M, 2^3·M, …
• Keep a small index (Z) in memory (of size up to 2^0·M = M)
• Store I0, I1, … on disk (of sizes 2^0·M, 2^1·M, …)
• If Z gets too big (> M), write it to disk as I0, or merge it with I0 (if I0 already exists)
• Either write the merged Z to disk as I1 (if there is no I1), or merge it with I1 to form a larger Z
• etc. ⇒ # of indexes = logarithmic (see the sketch below)
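A compact sketch of the cascading merge (in-memory dicts stand in for the on-disk indexes I0, I1, …; merge_two just concatenates postings):

def merge_two(a, b):
    merged = dict(a)
    for term, postings in b.items():
        merged[term] = merged.get(term, []) + postings
    return merged

def add_block(levels, z):
    # levels[i] holds the index of size 2^i * M, or None if that slot is free.
    i = 0
    while i < len(levels) and levels[i] is not None:
        z = merge_two(levels[i], z)   # cascade, like incrementing a binary counter
        levels[i] = None
        i += 1
    if i == len(levels):
        levels.append(None)
    levels[i] = z                     # park Z at the first free level

levels = []
for z in ({"a": [1]}, {"a": [2]}, {"b": [3]}):
    add_block(levels, z)
print(levels)   # [{'b': [3]}, {'a': [1, 2]}]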
Some analysis (T = #postings = #tokens)
• Auxiliary and main index: index-construction time is O(T^2), as each posting is touched in each merge.
• Logarithmic merge: each posting is merged O(log(T/M)) times, so construction time is O(T·log(T/M)).
Logarithmic merge is more efficient for index construction, but its query processing requires merging O(log(T/M)) lists of results.
Web search engines
• Most search engines now support dynamic indexing
• news items, blogs, new topical web pages
• But they also (sometimes/typically) periodically reconstruct the index from scratch
• Query processing is then switched to the new index, and the old index is deleted.