Learn about creating an inverted index file, evaluating systems for recall and fallout, and understanding relevance judgment in information retrieval. Explore various methods, considerations, and challenges in building and utilizing inverted files.
CS533 Information Retrieval Dr. Michal Cutler Lecture #12 March 3, 1999
This lecture • Evaluation • Creating an inverted index file (sources: Managing Gigabytes, Witten, Moffat, and Bell, chapter 5; Information Retrieval, Grossman and Frieder, pages 137-142)
Fallout • Fallout = (number of non-relevant documents retrieved) / (total number of non-relevant documents in the collection) • A good system should have high recall and low fallout
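Stated as formulas (these are the standard definitions; the set notation is mine, not the slide's):

\[
\text{recall} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{relevant}|},
\qquad
\text{fallout} = \frac{|\text{non-relevant} \cap \text{retrieved}|}{|\text{non-relevant}|}
\]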
Relevance judgment • Exhaustive? • Assume 100 queries • 750,000 documents in collection • Requires 75 million relevance judgments
Relevance judgment • Sampling • With an average of 200 and a maximum of 900 relevant documents per query, the sample of the collection needed for good results is still too large
Relevance judgment • Pooling • 33 runs each contribute their top 200 documents, giving an average pool of 2,398 documents per topic to judge
Calculating recall & precision • 200 documents in collection • 6 relevant documents for query
Interpolated values • The interpolated precision at a given recall level is the maximum precision at that level and all higher recall levels
[Figure: recall-precision plot for the example query (200 documents, 6 relevant); the precision values 1.0, 0.75, 0.57, 0.5 and 0.46 correspond to relevant documents found at ranks 1, 2, 4, 7, 10 and 13]
[Figure: the same plot with the interpolated precision values drawn at each recall level]
[Figure: interpolation graphs for two queries, Query 1 and Query 2]
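A small sketch of how these recall/precision points and their interpolation are computed. The relevant ranks used below (1, 2, 4, 7, 10, 13) are the ones consistent with the precision values in the figure, reconstructed rather than taken directly from the slides:

# Sketch: recall/precision points and interpolated precision for one query.
# The relevant ranks are reconstructed from the figure, not given on the slide.

relevant_ranks = {1, 2, 4, 7, 10, 13}     # assumed ranks of the 6 relevant documents
total_relevant = 6                        # relevant documents for the query
collection_size = 200                     # documents in the collection

points = []                               # (recall, precision) at each relevant document
found = 0
for rank in range(1, collection_size + 1):
    if rank in relevant_ranks:
        found += 1
        points.append((found / total_relevant, found / rank))

def interpolated(r):
    """Interpolated precision: maximum precision at recall >= r."""
    return max(p for rec, p in points if rec >= r)

for level in [i / 10 for i in range(11)]:
    print(f"recall {level:.1f}: interpolated precision {interpolated(level):.2f}")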
Averaging performance • Average recall/precision for a set of queries can be computed in a user-oriented or a system-oriented way • User oriented: obtain the recall/precision values for each query and then average over all queries
Averaging performance • System oriented: use totals over all queries of relevant documents, relevant documents retrieved, and documents retrieved • User-oriented averaging is the more commonly used
User oriented recall-level average • Average the precision values at each recall level, after interpolation (see the sketch below)
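A minimal sketch of this user-oriented recall-level averaging; the per-query (recall, precision) points below are invented purely for illustration:

# Sketch: user-oriented averaging of interpolated precision at the 11
# standard recall levels. The per-query points are made up.

queries = {
    "q1": [(0.2, 1.0), (0.4, 0.8), (0.6, 0.6), (0.8, 0.5), (1.0, 0.4)],
    "q2": [(0.25, 0.9), (0.5, 0.7), (0.75, 0.5), (1.0, 0.3)],
}

def interpolated(points, r):
    """Maximum precision at any recall level >= r."""
    return max((p for rec, p in points if rec >= r), default=0.0)

for level in [i / 10 for i in range(11)]:
    avg = sum(interpolated(pts, level) for pts in queries.values()) / len(queries)
    print(f"recall {level:.1f}: average interpolated precision {avg:.2f}")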
Building an inverted file • Some size and time assumptions (Managing Gigabytes chapter 5) • The methods
Methods for creating an inverted file • Sort-based methods • Memory-based inversion • Using external sort • Uncompressed • Compressing the temporary files • Multiway merging of compressed runs • In-place multiway merging
Additional Methods for Creating an inverted file • Lexicon-based partitioning (FAST-INV) • Text based partitioning
Inverted file - creating a temporary file • Each document is parsed • Stop words are removed * • Words are stemmed * • Each keyword is stored in a record together with its document identifier and its tf or location • A dictionary is generated
The dictionary • Binary search tree • Worst case O(dictionary-size) • Average O(lg(dictionary-size)) • Needs space for left and right pointers • A sorted list is generated by traversal
The dictionary • A sorted array • Binary search to find a term in the array: O(log(dictionary-size)) • Insertion is slow: O(dictionary-size)
The dictionary • A hash table • Search is fast O(1) • Does not generate a sorted dictionary
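A minimal sketch of the hash-table option, assuming Python's built-in dict as the hash table (the field names are mine). Lookups are O(1) on average, but the sorted term order that the tree and the sorted array give for free must be produced explicitly when the inverted file is written:

# Sketch: hash-table dictionary for indexing (an assumed layout, not the
# lecture's exact data structure). Keys are terms; values hold df and the
# term's postings list.

dictionary = {}

def add_posting(term, doc_id):
    entry = dictionary.setdefault(term, {"df": 0, "postings": []})
    if not entry["postings"] or entry["postings"][-1] != doc_id:
        entry["df"] += 1                 # first occurrence of term in this document
        entry["postings"].append(doc_id)

add_posting("government", 5)
add_posting("government", 18)
add_posting("web", 5)

# A hash table keeps no order, so sort the terms when producing the
# final, lexicographically ordered inverted file.
for term in sorted(dictionary):
    print(term, dictionary[term]["df"], dictionary[term]["postings"])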
The inverted lists • Data stored in an inverted list: • The term, df and a list of DocIds: government, 3, <5, 18, 26> • Or a list of (DocId, tf) pairs: government, 3, <(5, 2), (18, 1), (26, 2)> • Or a list of DocIds with word positions: government, 3, <5: 25, 56>, <18: 4>, <26: 12, 43>
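The same "government" entry written out as data structures (a sketch; the variable names and tuple layout are mine):

# Three ways to store the postings for "government" (df = 3):

# 1. DocIds only
postings_docids = ("government", 3, [5, 18, 26])

# 2. (DocId, tf) pairs
postings_tf = ("government", 3, [(5, 2), (18, 1), (26, 2)])

# 3. DocId plus within-document positions
postings_positions = ("government", 3, [(5, [25, 56]), (18, [4]), (26, [12, 43])])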
The inverted file • A dictionary, stored in memory or on secondary storage • Each dictionary record contains the term and a pointer to its inverted list (and possibly df and a term number) • A postings file: a sequential file holding the inverted lists, sorted by term number
[Figure: the dictionary and postings files; each dictionary entry points to its postings list, and the postings hold the doc-ids (here 1 and 2) of the documents containing the term]
Memory based inversion • Creates a dictionary where each term points to a linked list • Each node in the list contains <d, f_{t,d}> and a pointer to the next node
[Figure: a dictionary with the terms file, search, spider, tool and web, each pointing to a linked list of <doc-id, frequency> nodes such as <1,1>, <2,1> and <3,1>]
Pseudo Code for Memory Based Inversion
/* Index the collection and create the data structure */
Create an empty dictionary S
for document d = 1 to N
    read d, index it and compute the f_{t,d} values
    for each term t of d
        if t is not in the dictionary, insert it into S
        append a node <d, f_{t,d}> to the list for t
Pseudo Code for Memory Based Inversion (continued)
/* Output the inverted file */
for each term t in S
    start a new inverted file entry
    copy all <d, f_{t,d}> pairs from t's list
    compress and append them to the inverted file
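A runnable sketch of the same idea, assuming documents arrive as (doc_id, text) pairs; Python lists stand in for the linked lists, and parsing, stop-word removal and stemming are reduced to a plain split:

from collections import Counter, defaultdict

def memory_based_inversion(documents):
    """documents: iterable of (doc_id, text). Returns term -> [(d, f_td), ...]."""
    index = defaultdict(list)                   # dictionary S; each term maps to its list
    for doc_id, text in documents:
        freqs = Counter(text.lower().split())   # f_{t,d} for every term t in d
        for term, f_td in freqs.items():
            index[term].append((doc_id, f_td))  # append node <d, f_{t,d}>
    return index

docs = [(1, "web search tool"), (2, "web spider"), (3, "file search")]
index = memory_based_inversion(docs)
for term in sorted(index):                      # output phase: one entry per term
    print(term, index[term])                    # compression omitted in this sketch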
Time • Time = B * t_r + F * t_p (read and index, ~5.1 hrs) + I * (t_d + t_r) (write compressed inverted file, ~0.6 hrs) ≈ 5.75 hours • Each node in a list takes 10 bytes • Main memory needed: 10 * 400,000,000 = 4 gigabytes
Linked lists stored on disk • If we do not have sufficient memory, an alternative is to store the linked-list records on disk • The problem is that the records of a list are scattered over different locations on disk, so traversing the lists requires a seek for each record
Time - linked lists on disk • Time = B * t_r + F * t_p (read and index, ~5.1 hrs) + 10 * f * t_r (store the list nodes on disk, ~0.6 hrs) + f * t_s (read the lists back from disk, ~6.6 weeks) + I * (t_d + t_r) (write compressed inverted file, ~0.6 hrs) ≈ 6.64 weeks
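To see why this term jumps from hours to weeks, a back-of-the-envelope check (assuming f ≈ 400 million list nodes, the figure used in the memory estimate above, and an average seek time of about 10 ms, which is my assumption rather than a number from the slides):

# Rough estimate of the "read lists from disk" step: one seek per list node.
f = 400_000_000          # list nodes / postings pointers (from the memory estimate)
t_s = 0.010              # assumed average disk seek time in seconds (~10 ms)

seconds = f * t_s                     # one seek per node when lists live on disk
weeks = seconds / (7 * 24 * 3600)
print(f"{seconds:,.0f} seconds, roughly {weeks:.1f} weeks of seeking")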
Pseudo Code for External Sort Based Inversion
/* Index the collection and write a temporary file */
Create an empty dictionary S and an empty temporary file
for document d = 1 to N
    read d, index it and compute the f_{t,d} values
    for each term t of d
        if t is not in the dictionary, insert it into S
        write the triple <t, d, f_{t,d}> to the temporary file
External Sort Based Inversion (continued)
/* Create sorted runs of k records */
while there are unsorted records
    read k records from the temporary file
    sort them by t and d and write them back to the file
/* Merge */
merge pairs of runs until a single sorted run remains
External Sort Based Inversion (continued)
/* Output the inverted file */
for each term t in S
    start a new inverted file entry
    read all triples <t, d, f_{t,d}> for t from the sorted run
    compress and append them to the inverted file
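A compact runnable sketch of the whole pipeline. The file names, the tiny run size K = 4, and the use of a single multiway merge (heapq.merge) in place of the repeated pairwise merging described above are all choices made for illustration; a real implementation would write binary, compressed records:

import heapq
from collections import Counter

K = 4                                   # records per in-memory run (tiny, for illustration)

def index_to_temp(documents, temp_path="temp_triples.txt"):
    """Pass 1: write one <t, d, f_td> triple per line to a temporary file."""
    with open(temp_path, "w") as temp:
        for doc_id, text in documents:
            for term, f_td in Counter(text.lower().split()).items():
                temp.write(f"{term}\t{doc_id}\t{f_td}\n")
    return temp_path

def make_runs(temp_path):
    """Pass 2: cut the file into runs of K records, each sorted by (t, d)."""
    with open(temp_path) as temp:
        lines = temp.readlines()
    runs = []
    for i in range(0, len(lines), K):
        run_path = f"run_{i // K}.txt"
        with open(run_path, "w") as run:
            run.writelines(sorted(lines[i:i + K]))
        runs.append(run_path)
    return runs

def merge_and_output(runs, inverted_path="inverted_file.txt"):
    """Pass 3: multiway merge of the sorted runs, then group triples by term."""
    files = [open(r) for r in runs]
    merged = heapq.merge(*files)        # lazily merges the already-sorted runs
    with open(inverted_path, "w") as out:
        current_term, postings = None, []
        for line in merged:
            term, doc_id, f_td = line.split("\t")
            if term != current_term and current_term is not None:
                out.write(f"{current_term}\t{postings}\n")   # compression omitted
                postings = []
            current_term = term
            postings.append((int(doc_id), int(f_td)))
        if current_term is not None:
            out.write(f"{current_term}\t{postings}\n")
    for fh in files:
        fh.close()

docs = [(1, "web search tool"), (2, "web spider"), (3, "file search tool")]
merge_and_output(make_runs(index_to_temp(docs)))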
Space for external sort • To do external sort we need a temporary file of 4Gbytes • At the peak of merge there are 2 copies of the temporary file requiring 8Gbytes
Merging the runs • Since main memory is 40 Mbytes, each run is at most 40 Mbytes, so there are 100 runs • ⌈lg 100⌉ = 7 pairwise merge passes over a 4-gigabyte file
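The run and pass arithmetic spelled out (the sizes are the ones given on the slides):

import math

memory = 40 * 10**6                   # 40 Mbytes of main memory = maximum run size
temp_file = 4 * 10**9                 # 4 Gbyte temporary file of <t, d, f_td> triples

runs = temp_file // memory            # 100 runs
passes = math.ceil(math.log2(runs))   # 7 pairwise merge passes
transferred = passes * 2 * temp_file  # each pass reads and writes the whole file
print(runs, "runs,", passes, "passes,", f"{transferred / 10**9:.0f} GB transferred")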
Time
1. B * t_r + F * t_p (read and index, ~5.1 hrs)
2. + 10 * f * t_r (store the triples on disk, ~0.6 hrs)
3. + 20 * f * t_r + R * (1.2 k lg k) * t_c (sort the runs, ~4 hrs)
4. + lg R * (20 * f * t_r + f * t_c) (merge pairs of runs, ~8 hrs, dominated by disk transfer time)
Time (continued)
5. + 10 * f * t_r + I * (t_d + t_r) (read and write the compressed inverted file, ~1 hr)
Total time ≈ 19.45 hours