CS533 Information Retrieval Dr. Michal Cutler Lecture #14 March 10, 1999
[Photo: the university as seen from my window]
This lecture • Creating an inverted index file
Building an inverted file • Some size and time assumptions (Managing Gigabytes chapter 5) • The methods
Methods for Creating an inverted file • Memory-based inversion • Sort-based methods • Use external sort • Uncompressed • Compressing the temporary files • Multiway merging of compressed runs • In-place multiway merging
Additional Methods for Creating an inverted file • Lexicon-based partitioning (FAST-INV) • Text based partitioning
Compression in IR • The dictionary • The inverted file
Fixed-length index compression (Grossman) • Entries in an inverted list are sorted by document number (4 bytes each) • To save space, the gap between consecutive document numbers is stored instead • Compression: the two leftmost bits of the first byte store the number of bytes (1 to 4); the gap itself is stored in the remaining 6, 14, 22, or 30 bits
Example • The inverted list is: 1, 3, 7, 70, 250 • After computing gaps: 1, 2, 4, 63, 180 • The number of bytes is reduced from 4*5 = 20 to 6 (each gap below 64 fits in one byte; 180 needs two)
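A minimal sketch of this byte-aligned scheme in Python. The prefix meanings (00/01/10/11 for 1/2/3/4 total bytes) and the big-endian payload layout are assumptions; the slides do not fix these details.

    def encode_gap(gap):
        # 2-bit length prefix, then the gap in the remaining 6/14/22/30 bits
        for nbytes, limit in ((1, 1 << 6), (2, 1 << 14), (3, 1 << 22), (4, 1 << 30)):
            if gap < limit:
                prefix = (nbytes - 1) << (8 * nbytes - 2)
                return (prefix | gap).to_bytes(nbytes, "big")
        raise ValueError("gap needs more than 30 bits")

    def decode_gaps(data):
        gaps, i = [], 0
        while i < len(data):
            nbytes = (data[i] >> 6) + 1                 # read the 2-bit prefix
            value = int.from_bytes(data[i:i + nbytes], "big")
            gaps.append(value & ((1 << (8 * nbytes - 2)) - 1))
            i += nbytes
        return gaps

    gaps = [1, 2, 4, 63, 180]
    data = b"".join(encode_gap(g) for g in gaps)
    assert len(data) == 6 and decode_gaps(data) == gaps  # 20 bytes -> 6 bytes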
Elias γ encoding • An integer x is represented with 2⌊lg x⌋ + 1 bits • The first ⌊lg x⌋ bits are the unary representation of ⌊lg x⌋, written as ⌊lg x⌋ ones • The next bit is a stop bit of 0 • At this point the highest power of 2 that does not exceed x is represented
Elias γ encoding • The next ⌊lg x⌋ bits represent the remainder x - 2^⌊lg x⌋ in binary • Let x = 14. ⌊lg x⌋ = 3. x - 2^⌊lg x⌋ = 14 - 8 = 6, so 14 is represented by 111 0 110 • Let x = 1,000,000. ⌊lg x⌋ = 19: nineteen 1s, then 0, then 1,000,000 - 2^19 = 475,712 in 19 bits
Example • The inverted list is: 1, 3, 7, 70, 250 • After computing gaps: 1, 2, 4, 63, 180 • The number of bits is reduced from 8*4*5 = 160 to 35
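A short sketch of γ encoding and decoding; bits are kept as Python strings for readability, not efficiency.

    def gamma_encode(x):
        # floor(lg x) ones, a 0 stop bit, then x - 2^floor(lg x) in floor(lg x) bits
        n = x.bit_length() - 1                     # floor(lg x)
        rem = x - (1 << n)
        return "1" * n + "0" + (format(rem, "0%db" % n) if n else "")

    def gamma_decode(bits, i=0):
        # returns (value, next position in the bit string)
        n = 0
        while bits[i] == "1":                      # count the unary prefix
            n, i = n + 1, i + 1
        i += 1                                     # skip the 0 stop bit
        rem = int(bits[i:i + n], 2) if n else 0
        return (1 << n) + rem, i + n

    assert gamma_encode(14) == "1110110"           # the x = 14 example above
    assert sum(len(gamma_encode(g)) for g in [1, 2, 4, 63, 180]) == 35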
Elias δ encoding • The unary prefix of the γ encoding is itself encoded in γ • More precisely, 1 + ⌊lg x⌋ is encoded using γ encoding • Let x = 9. After γ encoding it is 1110 001. Using γ encoding on the unary prefix value 4, we get 11000 001
Elias δ encoding • The number of bits is: 1 + 2⌊lg(1 + ⌊lg x⌋)⌋ (for the γ encoding of 1 + ⌊lg x⌋) + ⌊lg x⌋ (for the remainder of the original encoding) = 1 + 2⌊lg lg 2x⌋ + ⌊lg x⌋ • Better than γ for large values of x
Elias δ decoding • First decode ⌊lg x⌋ + 1 from 11000: 2^2 + 00 = 4, so ⌊lg x⌋ = 4 - 1 = 3. Now compute 2^3 + 001 = 9 • Let x = 1,000,000. lg x ≈ 19.93, so ⌊lg x⌋ = 19. The γ code would start with 19 ones followed by a 0, followed by a 19-bit remainder (39 bits). With δ, γ encoding 20 gives 11110 0100, requiring 9 bits (9 + 19 = 28 bits in all)
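The δ code can reuse gamma_encode and gamma_decode from the sketch above; this mirrors the x = 9 and x = 1,000,000 examples.

    def delta_encode(x):
        # gamma-encode 1 + floor(lg x), then the same floor(lg x) remainder bits
        n = x.bit_length() - 1
        rem = x - (1 << n)
        return gamma_encode(n + 1) + (format(rem, "0%db" % n) if n else "")

    def delta_decode(bits, i=0):
        n_plus_1, i = gamma_decode(bits, i)        # recover floor(lg x) + 1
        n = n_plus_1 - 1
        rem = int(bits[i:i + n], 2) if n else 0
        return (1 << n) + rem, i + n

    assert delta_encode(9) == "11000001"           # 11000 001, as above
    assert len(delta_encode(1000000)) == 28        # versus 39 bits for gamma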
Golomb code • Given b, x > 0 is coded in two parts. First q + 1 in unary, where q = ⌊(x - 1)/b⌋; then r = x - bq - 1 is coded in binary, requiring either ⌊lg b⌋ or ⌈lg b⌉ bits • Let b = 3. The remainders are coded 0 (0), 1 (10), 2 (11)
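A sketch of the Golomb encoder. The truncated-binary details (k = ⌈lg b⌉ and the threshold 2^k - b below which remainders save a bit) follow the standard construction; the slides only state the two possible remainder lengths.

    def golomb_encode(x, b):
        # unary quotient q = (x-1)//b, then r = (x-1) % b in truncated binary
        assert x >= 1 and b >= 2
        q, r = (x - 1) // b, (x - 1) % b
        k = (b - 1).bit_length()                   # ceil(lg b)
        t = (1 << k) - b                           # remainders below t save a bit
        unary = "1" * q + "0"
        if r < t:
            return unary + format(r, "0%db" % (k - 1))
        return unary + format(r + t, "0%db" % k)

    # b = 3: remainders 0, 1, 2 are coded 0, 10, 11, as on the slide
    assert [golomb_encode(x, 3)[1:] for x in (1, 2, 3)] == ["0", "10", "11"]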
Local Bernoulli Model • Frequent words are coded with small values of b • Words that appear in 10% of the documents in the collection get b = 7 • Rare words get a very large b
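Under the Bernoulli model, b is derived from the fraction of documents containing the term; a common approximation (from Managing Gigabytes) is b ≈ 0.69 * N / ft. A minimal sketch:

    import math

    def golomb_parameter(N, ft):
        # N documents, term t occurs in ft of them; b ~ 0.69 / p with p = ft / N
        # (clamped to 2 so it works with the encoder sketched above)
        return max(2, math.ceil(0.69 * N / ft))

    assert golomb_parameter(1000, 100) == 7        # term in 10% of documents -> b = 7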
Compression of the temporary file • Compress the <t, d, ft,d> triples • Since runs are sorted by t, we can compute t-gaps and use γ encoding • The <d, ft,d> pairs can be compressed to 1 byte (on average) even with simple compression methods such as γ and δ
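One possible layout for a compressed run, reusing gamma_encode from the sketch above. Storing t-gap + 1 (so that a gap of 0, meaning the same term again, stays encodable) is my convention, not the slides'.

    def compress_run(triples):
        # triples sorted by (t, d); all three fields are gamma-coded
        bits, prev_t, prev_d = [], 0, 0
        for t, d, f in triples:
            if t != prev_t:
                prev_d = 0                             # d-gaps restart per term
            bits.append(gamma_encode(t - prev_t + 1))  # 1 means "same term"
            bits.append(gamma_encode(d - prev_d))      # d-gap within the term
            bits.append(gamma_encode(f))               # within-document frequency
            prev_t, prev_d = t, d
        return "".join(bits)

    run = [(1, 2, 1), (1, 7, 3), (2, 4, 1)]
    print(len(compress_run(run)))                      # a few bits per posting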
Compression • Internal sorting is done concurrently with parsing the text • The dictionary is stored in memory • Initial runs become smaller, so there are more merge passes • Instead of 7 disk-intensive passes there are 9 processor-intensive passes
Compressing the temporary file
Time = B*tr + F*tp (read and index)
+ R*(1.2k lg k)*tc + I'*(tr + td) (sort runs)
+ (log R)*(2I'*(tr + td) + f*tc) (merge in log R passes)
+ (I' + I)*(td + tr) (recompress)
≈ 26 hours; I' ≈ 1.35*I, temp file 680 Mbytes
Multiway merging • The merge is now processor- and not disk-intensive • R-way merge (a 400-way merge of buffers of 100 Kbytes) • A 540-Mbyte compressed file requires 5400 transfers and 5400 seeks
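A minimal sketch of a single-pass R-way merge. Python's heapq.merge keeps a heap of R cursors, so each output triple costs O(lg R) comparisons, matching the f*(log R)*tc term in the formula below.

    import heapq

    def multiway_merge(runs):
        # runs: R sorted iterables of (t, d, f) triples, standing in for the
        # buffered, compressed runs on disk
        return heapq.merge(*runs)

    run1 = [(1, 2, 1), (3, 1, 2)]
    run2 = [(1, 5, 1), (2, 4, 3)]
    assert list(multiway_merge([run1, run2])) == [
        (1, 2, 1), (1, 5, 1), (2, 4, 3), (3, 1, 2)]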
Time - multiway merging
Time = B*tr + F*tp (read and index)
+ R*(1.2k lg k)*tc + I'*(tr + td) (sort runs, compress and write)
+ f*(log R)*tc + I'*(ts/b + tr + td) (merge in one pass)
+ (I' + I)*(td + tr) (recompress)
≈ 11 hours; I' ≈ 1.35*I, temp file 540 Mbytes
In-place multiway merging • All blocks are padded to be exactly b bytes • Each output block is written back into a vacant block of the temp file • To keep track of the output, a block table is generated • The block table is then used to permute the blocks into a sequentially sorted file (see the sketch below) • Requires less additional memory
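A sketch of the final permutation step: given a block table recording which physical block of the temp file holds each logical output block, the blocks are swapped into sequential order by following cycles. This is an in-memory stand-in; on disk each swap costs two block reads and two writes.

    def permute_in_place(blocks, block_table):
        # block_table[i] = physical slot currently holding logical block i
        n = len(blocks)
        dest = [0] * n                      # dest[s] = where slot s's block belongs
        for logical, slot in enumerate(block_table):
            dest[slot] = logical
        for i in range(n):
            while dest[i] != i:             # follow the cycle through slot i
                j = dest[i]
                blocks[i], blocks[j] = blocks[j], blocks[i]
                dest[i], dest[j] = dest[j], dest[i]

    blocks = ["C", "A", "B"]                # logical blocks A, B, C scattered
    permute_in_place(blocks, block_table=[1, 2, 0])
    assert blocks == ["A", "B", "C"]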
Time - in-place multiway merging
Time = B*tr + F*tp (read and index)
+ R*(1.2k lg k)*tc + I'*(tr + td) (sort runs, compress and write)
+ f*(log R)*tc + 2I'*(ts/b + tr + td) (merge and write into empty blocks)
+ 2I'*(ts/b + tr) (permute)
+ (I' + I)*(td + tr) (recompress)
≈ 13 hours; I' ≈ 1.35*I
Large memory inversion • The machine has a large memory (this method would need about 1.5 Gbytes of memory instead of 4 Gbytes; with better compression, about 420 Mbytes) • Saves space in 2 ways: 1. no need for pointers, 2. uses compression
Large memory inversion • The size of the inverted file is computed based on the size required for each inverted list • A pass over the collection will be needed to compute this data • An array of this size is allocated.
Large memory inversion • The lexicon has a pointer to the start of each inverted list and, during inversion, to the current empty location in the list • The size of each inverted list is: dft * ⌈lg N⌉ bits for its d components + dft * ⌈lg max ft,d⌉ bits for its ft,d components • Better compressed sizes can be used
[Diagram: lexicon rows (term, df, start, current) for terms 1-9 pointing into a single preallocated array, positions 0-14; source documents D1: 1, 4; D2: 12, 4; D3: 2, 4; ...]
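A minimal in-memory sketch of the two passes. A Python list stands in for the bit-packed array; the real savings come from the ⌈lg N⌉ / ⌈lg max ft,d⌉ bit widths and compression noted above.

    from collections import defaultdict

    def invert(docs):
        # docs: dict mapping doc id -> {term: within-document frequency}
        # pass 1: document frequencies fix the size of every inverted list
        df = defaultdict(int)
        for d in docs:
            for t in docs[d]:
                df[t] += 1
        start, offset = {}, 0
        for t in sorted(df):
            start[t] = offset
            offset += 2 * df[t]             # one (d, ft,d) pair per posting
        inverted = [None] * offset          # allocated once, exact size
        current = dict(start)               # next empty slot of each list
        # pass 2: drop each posting directly into its precomputed slot
        for d in sorted(docs):
            for t, f in sorted(docs[d].items()):
                inverted[current[t]:current[t] + 2] = [d, f]
                current[t] += 2
        return inverted, start

    inv, start = invert({1: {"tax": 4}, 2: {"pay": 2, "tax": 1}})
    assert inv[start["tax"]:start["tax"] + 4] == [1, 4, 2, 1]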
Time
Time = B*tr + F*tp (read and parse)
+ B*tr + F*tp + 2I'*td + I*(td + tr) (invert)
≈ 12 hours
Text-based partitioning • Inverted files are generated for chunks of text and merged • Each chunk uses the previous method and is done completely in memory • This method uses very little extra disk space (34 Mbytes)
Text-based partitioning • The merging for each new chunk can be done in place by copying the list for every term to its correct location on the disk • About 16 hours
Lexicon-based partitioning - FAST-INV • Developed by Fox • Inverts the file without an external sort • Main idea: dictionary-based partitioning
FAST-INV • Divide input into j load files: • Each can be loaded into main memory • Each has about the same number of concepts
FAST-INV • j is as small as possible • Concept numbers in load file i are greater than the concept numbers in load file k, for all k < i
Doc 1: New York slows its rate of tax growth. But residents pay more than in the other 49 states. State and local taxes went up less than the inflation rate in New York between 1994 and 1996, although they are still the highest in the nation, a new report shows…
Doc 2: Block that refund. Income tax refunds will break last year's record of $114 billion. They are nice to get, but many Americans pay too much. Instead of loaning money to Uncle Sam for a year you could invest it.
Creating the temp file • Each document is converted into a list of DocId/ConceptId pairs • All DocId/ConceptId pairs are stored in a temp file in secondary memory
The temp file
DocId, ConceptId:
1, 1 (growth)
1, 2 (New York)
1, 3 (pay)
1, 4 (resident)
1, 5 (state)
1, 6 (tax)
2, 7 (income)
2, 8 (loan)
2, 3 (pay)
2, 9 (refund)
2, 6 (tax)
Grouped by DocId: 1: 1, 2, 3, 4, 5, 6 and 2: 7, 8, 3, 9, 6
Phase 1 • Initialize concept counts to 0 • Read the temp file and increment the counts • Compute the number of documents (df) per concept • Start building Conptr and the Load-Table
Complete Conptr • Add load numbers using • the counts and • the amount of free memory space • Compute the offsets
Build Load-Table • When a load is determined, a row is added to the Load-Table (a minimal sketch of this phase follows)
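A sketch of how the per-concept counts and the memory limit determine the loads and offsets. The concrete shapes of the Conptr and Load-Table rows here are my assumptions; the slides only describe their roles.

    def partition_loads(df, memory_pairs):
        # df[c] = number of DocId/ConceptId pairs for concept c (0-based);
        # memory_pairs = how many pairs fit in main memory at once
        load_table, conptr = [], []
        first, offset = 0, 0
        for c, count in enumerate(df):
            if offset + count > memory_pairs and offset > 0:
                load_table.append((first, c - 1, offset))   # first/last concept, size
                first, offset = c, 0
            conptr.append((c, len(load_table), offset))     # concept, load no., offset
            offset += count
        load_table.append((first, len(df) - 1, offset))
        return conptr, load_table

    # counts for concepts 1-9 of the two-document example, memory for 6 pairs
    conptr, loads = partition_loads([1, 1, 2, 1, 1, 2, 1, 1, 1], memory_pairs=6)
    assert loads == [(0, 4, 6), (5, 8, 5)]   # concepts 1-5, then concepts 6-9

Scanning concepts in increasing order also guarantees the property above: every concept number in a later load is larger than every concept number in an earlier one.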