Learn about creating an inverted index file, evaluating systems for recall and fallout, and understanding relevance judgment in information retrieval. Explore various methods, considerations, and challenges in building and utilizing inverted files.
CS533 Information Retrieval Dr. Michal Cutler Lecture #12 March 3, 1999
This lecture • Evaluation • Creating an inverted index file (sources: Managing Gigabytes, Witten, Moffat, and Bell, chapter 5; Information Retrieval, Grossman and Frieder, pages 137-142)
Fallout • Fallout = (number of non-relevant documents retrieved) / (total number of non-relevant documents in the collection) • A good system should have high recall and low fallout
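Stated as formulas (these are the standard definitions; the set notation is mine, not the slide's):

\[
\text{recall} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{relevant}|},
\qquad
\text{fallout} = \frac{|\text{non-relevant} \cap \text{retrieved}|}{|\text{non-relevant}|}
\]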
Relevance judgment • Exhaustive? • Assume 100 queries • 750,000 documents in collection • Requires 75 million relevance judgments
Relevance judgment • Sampling • With an average of 200 and a maximum of 900 relevant documents per query, the sample of the collection needed for good results is still too large
Relevance judgment • Pooling • 33 runs each contribute their top 200 documents, giving an average pool of 2,398 documents per topic to judge
Calculating recall & precision • 200 documents in collection • 6 relevant documents for query
Interpolated values • The interpolated precision at a given recall level is the maximum precision at that level and all higher recall levels
[Figure: recall-precision plot for the example query (200 documents, 6 relevant); the precision values 1.0, 0.75, 0.57, 0.5 and 0.46 correspond to relevant documents found at ranks 1, 2, 4, 7, 10 and 13]
[Figure: the same plot with the interpolated precision values drawn at each recall level]
[Figure: interpolation graphs for two queries, Query 1 and Query 2]
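A small sketch of how these recall/precision points and their interpolation are computed. The relevant ranks used below (1, 2, 4, 7, 10, 13) are the ones consistent with the precision values in the figure, reconstructed rather than taken directly from the slides:

# Sketch: recall/precision points and interpolated precision for one query.
# The relevant ranks are reconstructed from the figure, not given on the slide.

relevant_ranks = {1, 2, 4, 7, 10, 13}     # assumed ranks of the 6 relevant documents
total_relevant = 6                        # relevant documents for the query
collection_size = 200                     # documents in the collection

points = []                               # (recall, precision) at each relevant document
found = 0
for rank in range(1, collection_size + 1):
    if rank in relevant_ranks:
        found += 1
        points.append((found / total_relevant, found / rank))

def interpolated(r):
    """Interpolated precision: maximum precision at recall >= r."""
    return max(p for rec, p in points if rec >= r)

for level in [i / 10 for i in range(11)]:
    print(f"recall {level:.1f}: interpolated precision {interpolated(level):.2f}")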
Averaging performance • Average recall/precision for a set of queries can be computed in a user-oriented or a system-oriented way • User oriented: obtain the recall/precision values for each query and then average over all queries
Averaging performance • System oriented: use totals over all queries of relevant documents, relevant documents retrieved, and documents retrieved • User-oriented averaging is the more commonly used
User oriented recall-level average • Average the precision values at each recall level, after interpolation (see the sketch below)
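A minimal sketch of this user-oriented recall-level averaging; the per-query (recall, precision) points below are invented purely for illustration:

# Sketch: user-oriented averaging of interpolated precision at the 11
# standard recall levels. The per-query points are made up.

queries = {
    "q1": [(0.2, 1.0), (0.4, 0.8), (0.6, 0.6), (0.8, 0.5), (1.0, 0.4)],
    "q2": [(0.25, 0.9), (0.5, 0.7), (0.75, 0.5), (1.0, 0.3)],
}

def interpolated(points, r):
    """Maximum precision at any recall level >= r."""
    return max((p for rec, p in points if rec >= r), default=0.0)

for level in [i / 10 for i in range(11)]:
    avg = sum(interpolated(pts, level) for pts in queries.values()) / len(queries)
    print(f"recall {level:.1f}: average interpolated precision {avg:.2f}")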
Building an inverted file • Some size and time assumptions (Managing Gigabytes chapter 5) • The methods
Methods for creating an inverted file • Sort-based methods • Memory-based inversion • Using external sort • Uncompressed • Compressing the temporary files • Multiway merging of compressed runs • In-place multiway merging
Additional Methods for Creating an inverted file • Lexicon-based partitioning (FAST-INV) • Text based partitioning
Inverted file - creating a temporary file • Each document is parsed • Stop words are removed * • Words are stemmed * • Each keyword is stored in a record together with its document identifier and its tf or location • A dictionary is generated
The dictionary • Binary search tree • Worst case O(dictionary-size) • Average O(lg(dictionary-size)) • Needs space for left and right pointers • A sorted list is generated by traversal
The dictionary • A sorted array • Binary search to find a term in the array: O(log(dictionary-size)) • Insertion is slow: O(dictionary-size)
The dictionary • A hash table • Search is fast O(1) • Does not generate a sorted dictionary
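A minimal sketch of the hash-table option, assuming Python's built-in dict as the hash table (the field names are mine). Lookups are O(1) on average, but the sorted term order that the tree and the sorted array give for free must be produced explicitly when the inverted file is written:

# Sketch: hash-table dictionary for indexing (an assumed layout, not the
# lecture's exact data structure). Keys are terms; values hold df and the
# term's postings list.

dictionary = {}

def add_posting(term, doc_id):
    entry = dictionary.setdefault(term, {"df": 0, "postings": []})
    if not entry["postings"] or entry["postings"][-1] != doc_id:
        entry["df"] += 1                 # first occurrence of term in this document
        entry["postings"].append(doc_id)

add_posting("government", 5)
add_posting("government", 18)
add_posting("web", 5)

# A hash table keeps no order, so sort the terms when producing the
# final, lexicographically ordered inverted file.
for term in sorted(dictionary):
    print(term, dictionary[term]["df"], dictionary[term]["postings"])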
The inverted lists • Data stored in an inverted list: • The term, df and a list of DocIds: government, 3, <5, 18, 26> • Or a list of (DocId, tf) pairs: government, 3, <(5, 2), (18, 1), (26, 2)> • Or a list of DocIds with word positions: government, 3, <5: 25, 56>, <18: 4>, <26: 12, 43>
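The same "government" entry written out as data structures (a sketch; the variable names and tuple layout are mine):

# Three ways to store the postings for "government" (df = 3):

# 1. DocIds only
postings_docids = ("government", 3, [5, 18, 26])

# 2. (DocId, tf) pairs
postings_tf = ("government", 3, [(5, 2), (18, 1), (26, 2)])

# 3. DocId plus within-document positions
postings_positions = ("government", 3, [(5, [25, 56]), (18, [4]), (26, [12, 43])])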
The inverted file • A dictionary, stored in memory or on secondary storage • Each dictionary record contains the term and a pointer to its inverted list (and possibly df and a term number) • A postings file: a sequential file holding the inverted lists, sorted by term number
[Figure: the dictionary and postings files; each dictionary entry points to its postings list, and the postings hold the doc-ids (here 1 and 2) of the documents containing the term]
Memory based inversion • Creates a dictionary where each term points to a linked list • Each node in the list contains <d, f_{t,d}> and a pointer to the next node
[Figure: a dictionary with the terms file, search, spider, tool and web, each pointing to a linked list of <doc-id, frequency> nodes such as <1,1>, <2,1> and <3,1>]
Pseudo Code for Memory Based Inversion
/* Index the collection and create the data structure */
Create an empty dictionary S
for document d = 1 to N
    read d, index it and compute the f_{t,d} values
    for each term t of d
        if t is not in the dictionary, insert it into S
        append a node <d, f_{t,d}> to the list for t
Pseudo Code for Memory Based Inversion (continued)
/* Output the inverted file */
for each term t in S
    start a new inverted file entry
    copy all <d, f_{t,d}> pairs from t's list
    compress and append them to the inverted file
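A runnable sketch of the same idea, assuming documents arrive as (doc_id, text) pairs; Python lists stand in for the linked lists, and parsing, stop-word removal and stemming are reduced to a plain split:

from collections import Counter, defaultdict

def memory_based_inversion(documents):
    """documents: iterable of (doc_id, text). Returns term -> [(d, f_td), ...]."""
    index = defaultdict(list)                   # dictionary S; each term maps to its list
    for doc_id, text in documents:
        freqs = Counter(text.lower().split())   # f_{t,d} for every term t in d
        for term, f_td in freqs.items():
            index[term].append((doc_id, f_td))  # append node <d, f_{t,d}>
    return index

docs = [(1, "web search tool"), (2, "web spider"), (3, "file search")]
index = memory_based_inversion(docs)
for term in sorted(index):                      # output phase: one entry per term
    print(term, index[term])                    # compression omitted in this sketch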
Time • Time = B * t_r + F * t_p (read and index, ~5.1 hrs) + I * (t_d + t_r) (write compressed inverted file, ~0.6 hrs) ≈ 5.75 hours • Each node in a list takes 10 bytes • Main memory needed: 10 * 400,000,000 = 4 gigabytes
Linked lists stored on disk • If we do not have sufficient memory, an alternative is to store the linked-list records on disk • The problem is that the records of a list are scattered over different locations on disk, so traversing the lists requires a seek for each record
Time - linked lists on disk • Time = B * t_r + F * t_p (read and index, ~5.1 hrs) + 10 * f * t_r (store the list nodes on disk, ~0.6 hrs) + f * t_s (read the lists back from disk, ~6.6 weeks) + I * (t_d + t_r) (write compressed inverted file, ~0.6 hrs) ≈ 6.64 weeks
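To see why this term jumps from hours to weeks, a back-of-the-envelope check (assuming f ≈ 400 million list nodes, the figure used in the memory estimate above, and an average seek time of about 10 ms, which is my assumption rather than a number from the slides):

# Rough estimate of the "read lists from disk" step: one seek per list node.
f = 400_000_000          # list nodes / postings pointers (from the memory estimate)
t_s = 0.010              # assumed average disk seek time in seconds (~10 ms)

seconds = f * t_s                     # one seek per node when lists live on disk
weeks = seconds / (7 * 24 * 3600)
print(f"{seconds:,.0f} seconds, roughly {weeks:.1f} weeks of seeking")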
Pseudo Code for External Sort Based Inversion
/* Index the collection and write a temporary file */
Create an empty dictionary S and an empty temporary file
for document d = 1 to N
    read d, index it and compute the f_{t,d} values
    for each term t of d
        if t is not in the dictionary, insert it into S
        write the triple <t, d, f_{t,d}> to the temporary file
External Sort Based Inversion (continued)
/* Create sorted runs of k records */
while there are unsorted records
    read k records from the temporary file
    sort them by t and d and write them back to the file
/* Merge */
merge pairs of runs until a single sorted run remains
External Sort Based Inversion (continued)
/* Output the inverted file */
for each term t in S
    start a new inverted file entry
    read all triples <t, d, f_{t,d}> for t from the sorted run
    compress and append them to the inverted file
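A compact runnable sketch of the whole pipeline. The file names, the tiny run size K = 4, and the use of a single multiway merge (heapq.merge) in place of the repeated pairwise merging described above are all choices made for illustration; a real implementation would write binary, compressed records:

import heapq
from collections import Counter

K = 4                                   # records per in-memory run (tiny, for illustration)

def index_to_temp(documents, temp_path="temp_triples.txt"):
    """Pass 1: write one <t, d, f_td> triple per line to a temporary file."""
    with open(temp_path, "w") as temp:
        for doc_id, text in documents:
            for term, f_td in Counter(text.lower().split()).items():
                temp.write(f"{term}\t{doc_id}\t{f_td}\n")
    return temp_path

def make_runs(temp_path):
    """Pass 2: cut the file into runs of K records, each sorted by (t, d)."""
    with open(temp_path) as temp:
        lines = temp.readlines()
    runs = []
    for i in range(0, len(lines), K):
        run_path = f"run_{i // K}.txt"
        with open(run_path, "w") as run:
            run.writelines(sorted(lines[i:i + K]))
        runs.append(run_path)
    return runs

def merge_and_output(runs, inverted_path="inverted_file.txt"):
    """Pass 3: multiway merge of the sorted runs, then group triples by term."""
    files = [open(r) for r in runs]
    merged = heapq.merge(*files)        # lazily merges the already-sorted runs
    with open(inverted_path, "w") as out:
        current_term, postings = None, []
        for line in merged:
            term, doc_id, f_td = line.split("\t")
            if term != current_term and current_term is not None:
                out.write(f"{current_term}\t{postings}\n")   # compression omitted
                postings = []
            current_term = term
            postings.append((int(doc_id), int(f_td)))
        if current_term is not None:
            out.write(f"{current_term}\t{postings}\n")
    for fh in files:
        fh.close()

docs = [(1, "web search tool"), (2, "web spider"), (3, "file search tool")]
merge_and_output(make_runs(index_to_temp(docs)))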
Space for external sort • To do external sort we need a temporary file of 4Gbytes • At the peak of merge there are 2 copies of the temporary file requiring 8Gbytes
Merging the runs • Since main memory is 40 Mbytes, each run is at most 40 Mbytes, so there are 100 runs • ⌈lg 100⌉ = 7 pairwise merge passes over a 4-gigabyte file
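The run and pass arithmetic spelled out (the sizes are the ones given on the slides):

import math

memory = 40 * 10**6                   # 40 Mbytes of main memory = maximum run size
temp_file = 4 * 10**9                 # 4 Gbyte temporary file of <t, d, f_td> triples

runs = temp_file // memory            # 100 runs
passes = math.ceil(math.log2(runs))   # 7 pairwise merge passes
transferred = passes * 2 * temp_file  # each pass reads and writes the whole file
print(runs, "runs,", passes, "passes,", f"{transferred / 10**9:.0f} GB transferred")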
Time
1. B * t_r + F * t_p (read and index, ~5.1 hrs)
2. + 10 * f * t_r (store the triples on disk, ~0.6 hrs)
3. + 20 * f * t_r + R * (1.2 k lg k) * t_c (sort the runs, ~4 hrs)
4. + lg R * (20 * f * t_r + f * t_c) (merge pairs of runs, ~8 hrs, dominated by disk transfer time)
Time (continued)
5. + 10 * f * t_r + I * (t_d + t_r) (read and write the compressed inverted file, ~1 hr)
Total time ≈ 19.45 hours