INF 2914 Information Retrieval and Web Search

Presentation Transcript


  1. INF 2914 Information Retrieval and Web Search. Lecture 6: Index Construction. These slides are adapted from Stanford's class CS276 / LING 286 Information Retrieval and Web Mining.

  2. (Offline) Search Engine Data Flow. Four stages run in the background:
     1. Crawler – fetches web pages.
     2. Parse & Tokenize – parse and tokenize each page, with per-page analysis; produces the tokenized web pages and anchor text.
     3. Global Analysis – dup detection (dup table), static rank (rank table), anchor text, spam analysis (spam table), …
     4. Index Build – scan the tokenized web pages, anchor text, etc., and generate the inverted text index.

  3. Inverted index. For each term T, we must store a list of all documents that contain T; each entry in the list is a posting. The dictionary maps each term to its postings list, sorted by docID (more later on why):
     Brutus    → 2, 4, 8, 16, 32, 64, 128
     Caesar    → 1, 2, 3, 5, 8, 13, 21, 34
     Calpurnia → 13, 16
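
A minimal in-memory sketch of this structure in Python (illustrative names; the slides themselves give no code):

    # Dictionary mapping each term to its postings list.
    # Postings are kept sorted by docID, which the merge step relies on.
    index = {
        "brutus":    [2, 4, 8, 16, 32, 64, 128],
        "caesar":    [1, 2, 3, 5, 8, 13, 21, 34],
        "calpurnia": [13, 16],
    }

    def postings(term):
        """Return the (possibly empty) postings list for a term."""
        return index.get(term, [])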

  4. Inverted index construction. Documents to be indexed ("Friends, Romans, countrymen.") pass through a tokenizer, producing the token stream Friends, Romans, Countrymen. Linguistic modules then produce the modified tokens friend, roman, countryman. Finally, the indexer builds the inverted index (e.g., friend → 2, 4; roman → 1, 2; countryman → 13, 16).

  5. Indexer steps. Produce a sequence of (modified token, document ID) pairs.
     Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
     Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

  6. Sort by terms. Core indexing step.

  7. Multiple term entries in a single document are merged, and frequency information is added. Why frequency? Will discuss later.

  8. The result is split into a Dictionary file and a Postings file.
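
A compact Python sketch of slides 5–8 (illustrative code with crude token normalization standing in for the linguistic modules): extract (term, docID) pairs, sort by term then docID, merge duplicates while counting frequency, and split the result into a dictionary and postings.

    from collections import Counter

    docs = {
        1: "I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.",
        2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious",
    }

    # Slide 5: sequence of (modified token, docID) pairs.
    pairs = [(tok.strip(".,;'").lower(), doc_id)
             for doc_id, text in docs.items()
             for tok in text.split()]

    # Slide 6: sort by term, then by docID (the core indexing step).
    pairs.sort()

    # Slide 7: merge duplicate (term, doc) entries, keeping the frequency.
    freq = Counter(pairs)

    # Slide 8: split into a dictionary and a postings structure.
    dictionary = {}   # term -> list of (docID, term frequency)
    for (term, doc_id), tf in sorted(freq.items()):
        dictionary.setdefault(term, []).append((doc_id, tf))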

  9. The index we just built • How do we process a query?

  10. Query processing: AND. Consider processing the query Brutus AND Caesar:
     • Locate Brutus in the dictionary; retrieve its postings (2, 4, 8, 16, 32, 64, 128).
     • Locate Caesar in the dictionary; retrieve its postings (1, 2, 3, 5, 8, 13, 21, 34).
     • "Merge" (intersect) the two postings lists.

  11. The merge. Walk through the two postings lists simultaneously, in time linear in the total number of postings entries:
     Brutus: 2, 4, 8, 16, 32, 64, 128
     Caesar: 1, 2, 3, 5, 8, 13, 21, 34
     Intersection: 2, 8
     If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings sorted by docID.
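
A straightforward rendering of this merge in Python (a sketch of the standard two-pointer intersection; the slides give no code):

    def intersect(p1, p2):
        """Intersect two postings lists sorted by docID in O(x + y) time."""
        answer, i, j = [], 0, 0
        while i < len(p1) and j < len(p2):
            if p1[i] == p2[j]:
                answer.append(p1[i])   # docID present in both lists
                i += 1
                j += 1
            elif p1[i] < p2[j]:
                i += 1                 # advance the pointer at the smaller docID
            else:
                j += 1
        return answer

    # Brutus AND Caesar -> [2, 8]
    print(intersect([2, 4, 8, 16, 32, 64, 128], [1, 2, 3, 5, 8, 13, 21, 34]))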

  12. Index construction • How do we construct an index? • What strategies can we use with limited main memory?

  13. Our corpus for this lecture • Number of docs = n = 1M • Each doc has 1K terms • Number of distinct terms = m = 500K • 667 million postings entries

  14. How many postings? Group the m terms, in order of decreasing frequency, into blocks of J terms each. By a Zipf-style argument, the number of 1's (postings) contributed by the i-th block is about nJ/i. Summing this over the m/J blocks:
     sum over i = 1 .. m/J of nJ/i ≈ nJ ln(m/J)
     For our numbers, this should be about 667 million postings.

  15. Recall index construction. Documents are processed to extract words, and these are saved with the document ID.
     Doc 1: I did enact Julius Caesar I was killed i' the Capitol; Brutus killed me.
     Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious

  16. Key step. After all documents have been processed, the inverted file is sorted by terms. We focus on this sort step. We have 667M items to sort.

  17. Index construction. At 10–12 bytes per postings entry, the sort demands several gigabytes of temporary disk space.

  18. System parameters for design • Disk seek ~ 10 milliseconds • Block transfer from disk ~ 1 microsecond per byte (following a seek) • All other ops ~ 1 microsecond • E.g., compare two postings entries and decide their merge order

  19. Bottleneck. Build postings entries one doc at a time. Now sort postings entries by term (then by doc within each term). Doing this with random disk seeks would be too slow – must sort N = 667M records. If every comparison took 2 disk seeks, and N items could be sorted with N log2 N comparisons, how long would this take?

  20. Disk-based sorting. Build postings entries one doc at a time. Now sort postings entries by term. Doing this with random disk seeks would be too slow – must sort N = 667M records. If every comparison took 2 disk seeks, and N items could be sorted with N log2 N comparisons, how long would this take? 12.4 years!!! (2 seeks ≈ 20 ms per comparison; 667M × log2 667M ≈ 2 × 10^10 comparisons ≈ 4 × 10^8 seconds.)
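
A quick back-of-the-envelope check of that figure, using the slide's own parameters:

    import math

    N = 667e6                      # postings records to sort
    seek = 10e-3                   # one disk seek ~ 10 ms
    comparison = 2 * seek          # each comparison costs 2 seeks
    seconds = N * math.log2(N) * comparison
    print(seconds / (3600 * 24 * 365))   # -> about 12.4 years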

  21. Sorting with fewer disk seeks. 12-byte (4+4+4) records (term, doc, freq). These are generated as we process docs. Must now sort 667M such 12-byte records by term. Define a block as ~10M such records – we can "easily" fit a couple into memory. Will have 64 such blocks to start with. Will sort within blocks first, then merge the blocks into one long sorted order.

  22. Sorting 64 blocks of 10M records. First, read each block and sort within:
     • Quicksort takes 2N ln N expected steps; in our case, 2 × (10M ln 10M) steps.
     • Time to quicksort each block = 320 seconds.
     • Time to read each block from disk and write it back: 120MB × 2 × 10^-6 s/byte = 240 seconds.
     64 times these estimates gives us 64 sorted runs of 10M records each:
     • Total quicksort time = 5.6 hours.
     • Total read+write time = 4.2 hours.
     • Total for this phase ~ 10 hours.
     Need 2 copies of the data on disk throughout.
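
A sketch of this run-generation phase in Python (the file format and naming are hypothetical; the slides don't prescribe one): read one block of records at a time, sort it in memory, and write it back as a sorted run.

    import pickle

    def write_run(sorted_block, run_id):
        path = f"run_{run_id:02d}.bin"           # hypothetical naming scheme
        with open(path, "wb") as f:
            pickle.dump(sorted_block, f)
        return path

    def make_sorted_runs(records, block_size=10_000_000):
        """Sort fixed-size blocks of (term, doc, freq) records in memory and
        write each to disk as a sorted run. Returns the run file names."""
        run_files, block = [], []
        for rec in records:                      # records arrive one doc at a time
            block.append(rec)
            if len(block) == block_size:
                run_files.append(write_run(sorted(block), len(run_files)))
                block = []
        if block:                                # final, possibly short, run
            run_files.append(write_run(sorted(block), len(run_files)))
        return run_files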

  23. Merging 64 sorted runs. Merge tree of log2 64 = 6 layers. During each layer, read runs into memory in blocks of 10M, merge, and write back. [Diagram: pairs of sorted runs on disk being merged into a single merged run.]

  24. Merge tree. Bottom level of the tree: 64 sorted runs (1, 2, …, 63, 64) of 10M records each. Merging pairwise: 32 runs at 20M/run, then 16 runs at 40M/run, then 8 runs at 80M/run, then 4 runs, 2 runs, and finally 1 run.

  25. Merging 64 runs.
     • Time estimate for disk transfer: 6 × time to read+write 64 blocks = 6 × 4.2 hours ~ 25 hours.
     • Time estimate for the merge operation: 6 × 640M × 10^-6 s ≈ 1 hour.
     • Time estimate for the overall algorithm: sort time + merge time ~ 10 + 26 ~ 36 hours.
     Lower bound (main-memory sort): time to read+write = 4.2 hours; time to sort in memory = 10.7 hours; total time ~ 15 hours.

  26. Some indexing numbers. [Table of indexing statistics not reproduced in the transcript.]

  27. How to improve indexing time?
     • Compression of the sorted runs.
     • Multi-way merge: heap-merge all runs at once (see the sketch after the next slide).
     • Radix sort (linear-time sorting).
     • Pipelining the reading, sorting, and writing phases.

  28. Multi-way merge. [Diagram: a heap drawing the smallest next record from each of the 64 sorted runs (1, 2, …, 63, 64), producing the final sorted order in a single pass.]
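
A minimal multi-way merge in Python, assuming the run files from the earlier sketch; heapq.merge maintains exactly the heap of "current heads" the slide depicts:

    import heapq
    import pickle

    def read_run(path):
        """Yield records from one sorted run (format from the earlier sketch)."""
        with open(path, "rb") as f:
            yield from pickle.load(f)

    def merge_runs(run_files):
        """Single-pass k-way merge of all sorted runs via a heap."""
        return heapq.merge(*(read_run(p) for p in run_files))

    # Usage: for rec in merge_runs(run_files): append rec to the final index.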

  29. Indexing improvements.
     • Radix sort: linear-time sorting, with flexibility in defining the sort criteria.
     • Bigger sort buffers increase performance (contradicting previous literature; see the VLDB paper in the references).
     • Pipelining the read and the sort+write phases, so the next block is read while the current one is sorted and written. [Diagram: overlapping Read and Sort+Write timelines for successive blocks B1, B2, ….]

  30. Positional indexing. Given documents:
     D1: This is a test
     D2: Is this a test
     D3: This is not a test
     Reorganize by term (DATA holds a capitalization bit):
     TERM  DOC  LOC  DATA (caps)
     this   1    0    1
     is     1    1    0
     a      1    2    0
     test   1    3    0
     is     2    0    1
     this   2    1    0
     a      2    2    0
     test   2    3    0
     this   3    0    1
     is     3    1    0
     not    3    2    0
     a      3    3    0
     test   3    4    0

  31. Positional indexing. In "postings list" format:
     a    (1,2,0), (2,2,0), (3,3,0)
     is   (1,1,0), (2,0,1), (3,1,0)
     not  (3,2,0)
     test (1,3,0), (2,3,0), (3,4,0)
     this (1,0,1), (2,1,0), (3,0,1)
     Sort by <term, doc, loc>:
     TERM  DOC  LOC  DATA (caps)
     a      1    2    0
     a      2    2    0
     a      3    3    0
     is     1    1    0
     is     2    0    1
     is     3    1    0
     not    3    2    0
     test   1    3    0
     test   2    3    0
     test   3    4    0
     this   1    0    1
     this   2    1    0
     this   3    0    1
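
A small sketch of how these (term, doc, loc, caps) records could be generated and sorted (illustrative Python, not from the slides):

    docs = {1: "This is a test", 2: "Is this a test", 3: "This is not a test"}

    records = []
    for doc_id, text in docs.items():
        for loc, token in enumerate(text.split()):
            caps = 1 if token[0].isupper() else 0   # the DATA capitalization bit
            records.append((token.lower(), doc_id, loc, caps))

    # Sort by <term, doc, loc> to get the postings-file order shown above.
    records.sort()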

  32. Positional indexing with radix sort.
     • Radix key = token hash (8 bytes) + document ID (8 bytes).
     • Location = 4 bytes, but there is no need to sort by location: records are generated in location order, and radix sort is stable!
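
A sketch of a stable LSD radix sort on such keys (a generic byte-at-a-time implementation, not code from the paper): because every pass is stable, records that tie on the (hash, doc) key keep their original – i.e., location – order.

    def radix_sort(records, key_bytes=16):
        """Stable LSD radix sort. Each record is (key, payload), where key is
        a big-endian byte string (e.g., 8-byte token hash + 8-byte doc ID)."""
        for b in range(key_bytes - 1, -1, -1):       # least significant byte first
            buckets = [[] for _ in range(256)]
            for rec in records:
                buckets[rec[0][b]].append(rec)       # append keeps ties in order
            records = [rec for bucket in buckets for rec in bucket]
        return records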

  33. Distributed indexing.
     • Maintain a master machine directing the indexing job – considered "safe".
     • Break up indexing into sets of (parallel) tasks.
     • The master machine assigns each task to an idle machine from a pool.

  34. Parallel tasks. We will use two sets of parallel tasks: parsers and inverters.
     • Break the input document corpus into splits; each split is a subset of documents.
     • The master assigns a split to an idle parser machine.
     • A parser reads a document at a time and emits (term, doc) pairs.

  35. Parallel tasks. The parser writes pairs into j partitions, each for a range of terms' first letters (e.g., a-f, g-p, q-z) – here j = 3. Now to complete the index inversion.

  36. Data flow. [Diagram: the master assigns splits to parsers and partitions to inverters; each parser writes its (term, doc) pairs into the a-f, g-p, and q-z partitions, and one inverter per partition produces the postings.]

  37. Inverters.
     • Collect all (term, doc) pairs for one partition.
     • Sort them and write the postings lists.
     • Each partition thus contains a set of postings.
     The process flow above is a special case of MapReduce.
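
A toy single-machine simulation of this parser/inverter flow (the partition boundaries are hypothetical, matching the slide's a-f / g-p / q-z example):

    from collections import defaultdict

    def partition_of(term):
        """Route a term to a partition by its first letter: a-f, g-p, q-z."""
        c = term[0]
        return "a-f" if c <= "f" else ("g-p" if c <= "p" else "q-z")

    def parse(doc_id, text, partitions):
        """Parser task: emit (term, doc) pairs into the right partition."""
        for term in text.lower().split():
            partitions[partition_of(term)].append((term, doc_id))

    def invert(pairs):
        """Inverter task: sort one partition's pairs, build postings lists."""
        postings = defaultdict(list)
        for term, doc_id in sorted(pairs):
            postings[term].append(doc_id)
        return postings

    partitions = defaultdict(list)
    parse(1, "the weather is good", partitions)
    parse(2, "today is good", partitions)
    index = {p: invert(pairs) for p, pairs in partitions.items()}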

  38. MapReduce. A model for processing large data sets. Consists of map and reduce functions. Runs on a large cluster of machines. Many MapReduce programs are executed on Google's cluster every day.

  39. Motivation. The input data is large – the whole Web, billions of pages. There are lots of machines; use them efficiently.

  40. Real examples: term frequencies across the whole Web repository; counts of URL access frequency; the reverse web-link graph; ….

  41. Programming model. Input & output: each a set of key/value pairs. The programmer specifies two functions:
     map(in_key, in_value) -> list(out_key, intermediate_value)
       Processes an input key/value pair; produces a set of intermediate pairs.
     reduce(out_key, list(intermediate_value)) -> list(out_value)
       Combines all intermediate values for a particular key; produces a set of merged output values (usually just one).

  42. Example.
     Page 1: the weather is good
     Page 2: today is good
     Page 3: good weather is good

  43. Example: Count word occurrences
     map(String input_key, String input_value):
       // input_key: document name
       // input_value: document contents
       for each word w in input_value:
         EmitIntermediate(w, "1");

     reduce(String output_key, Iterator intermediate_values):
       // output_key: a word
       // output_values: a list of counts
       int result = 0;
       for each v in intermediate_values:
         result += ParseInt(v);
       Emit(AsString(result));
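
The slide uses the pseudocode from the MapReduce paper; here is a runnable Python equivalent that also simulates the framework's shuffle step between map and reduce (a sketch, not the real library):

    from collections import defaultdict

    def map_fn(doc_name, contents):
        return [(w, 1) for w in contents.split()]   # EmitIntermediate(w, "1")

    def reduce_fn(word, counts):
        return sum(counts)                          # Emit(AsString(result))

    pages = {"Page 1": "the weather is good",
             "Page 2": "today is good",
             "Page 3": "good weather is good"}

    # Map phase, then the shuffle: group intermediate values by key.
    grouped = defaultdict(list)
    for name, text in pages.items():
        for word, count in map_fn(name, text):
            grouped[word].append(count)

    # Reduce phase: one call per distinct key.
    result = {w: reduce_fn(w, c) for w, c in grouped.items()}
    print(result)   # {'the': 1, 'weather': 2, 'is': 3, 'good': 4, 'today': 1}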

  44. Map output.
     Worker 1: (the 1), (weather 1), (is 1), (good 1)
     Worker 2: (today 1), (is 1), (good 1)
     Worker 3: (good 1), (weather 1), (is 1), (good 1)

  45. Reduce input.
     Worker 1: (the 1)
     Worker 2: (is 1), (is 1), (is 1)
     Worker 3: (weather 1), (weather 1)
     Worker 4: (today 1)
     Worker 5: (good 1), (good 1), (good 1), (good 1)

  46. Reduce output.
     Worker 1: (the 1)
     Worker 2: (is 3)
     Worker 3: (weather 2)
     Worker 4: (today 1)
     Worker 5: (good 4)

  49. Fault tolerance.
     • Typical cluster: 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory.
     • Storage is on local IDE disks.
     • GFS, a distributed file system, manages the data (SOSP'03).
     • Job scheduling system: jobs are made up of tasks; the scheduler assigns tasks to machines.
     • The implementation is a C++ library linked into user programs.

  50. Fault tolerance.
     On worker failure:
     • Detect failure via periodic heartbeats.
     • Re-execute completed and in-progress map tasks.
     • Re-execute in-progress reduce tasks.
     • Task completion is committed through the master.
     Master failure: could be handled, but isn't yet (master failure is unlikely).
