CS533 Information Retrieval Dr. Michal Cutler Lecture #14 March 10, 1999
[Photo: the university as seen from my window]
This lecture • Creating an inverted index file
Building an inverted file • Some size and time assumptions (Managing Gigabytes chapter 5) • The methods
Methods for Creating an inverted file • Memory-based inversion • Sort-based methods • Use external sort • Uncompressed • Compressing the temporary files • Multiway merging of compressed runs • In-place multiway merging
Additional Methods for Creating an inverted file • Lexicon-based partitioning (FAST-INV) • Text based partitioning
Compression in IR • The dictionary • The inverted file
Fixed-length index compression (Grossman) • Entries in an inverted list are sorted by document number (4 bytes each) • To save space, the gap between consecutive document numbers is stored instead • Compression: the two leftmost bits of the first byte store the number of bytes (1 to 4); the gap itself is stored in the remaining 6, 14, 22, or 30 bits
Example • The inverted list is: 1, 3, 7, 70, 250 • After computing gaps: 1, 2, 4, 63, 180 • The number of bytes is reduced from 4*5 = 20 to 6 (each gap below 64 fits in one byte; 180 needs two)
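A minimal sketch of this byte-aligned scheme in Python. The prefix meanings (00/01/10/11 for 1/2/3/4 total bytes) and the big-endian payload layout are assumptions; the slides do not fix these details.

    def encode_gap(gap):
        # 2-bit length prefix, then the gap in the remaining 6/14/22/30 bits
        for nbytes, limit in ((1, 1 << 6), (2, 1 << 14), (3, 1 << 22), (4, 1 << 30)):
            if gap < limit:
                prefix = (nbytes - 1) << (8 * nbytes - 2)
                return (prefix | gap).to_bytes(nbytes, "big")
        raise ValueError("gap needs more than 30 bits")

    def decode_gaps(data):
        gaps, i = [], 0
        while i < len(data):
            nbytes = (data[i] >> 6) + 1                 # read the 2-bit prefix
            value = int.from_bytes(data[i:i + nbytes], "big")
            gaps.append(value & ((1 << (8 * nbytes - 2)) - 1))
            i += nbytes
        return gaps

    gaps = [1, 2, 4, 63, 180]
    data = b"".join(encode_gap(g) for g in gaps)
    assert len(data) == 6 and decode_gaps(data) == gaps  # 20 bytes -> 6 bytes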
Elias γ encoding • An integer x is represented with 2⌊lg x⌋ + 1 bits • The first ⌊lg x⌋ bits are the unary representation of ⌊lg x⌋, written as ⌊lg x⌋ ones • The next bit is a stop bit of 0 • At this point the highest power of 2 that does not exceed x is represented
Elias γ encoding • The next ⌊lg x⌋ bits represent the remainder x - 2^⌊lg x⌋ in binary • Let x = 14. ⌊lg x⌋ = 3. x - 2^⌊lg x⌋ = 14 - 8 = 6, so 14 is represented by 111 0 110 • Let x = 1,000,000. ⌊lg x⌋ = 19: nineteen 1s, then 0, then 1,000,000 - 2^19 = 475,712 in 19 bits
Example • The inverted list is: 1, 3, 7, 70, 250 • After computing gaps: 1, 2, 4, 63, 180 • The number of bits is reduced from 8*4*5 = 160 to 35
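A short sketch of γ encoding and decoding; bits are kept as Python strings for readability, not efficiency.

    def gamma_encode(x):
        # floor(lg x) ones, a 0 stop bit, then x - 2^floor(lg x) in floor(lg x) bits
        n = x.bit_length() - 1                     # floor(lg x)
        rem = x - (1 << n)
        return "1" * n + "0" + (format(rem, "0%db" % n) if n else "")

    def gamma_decode(bits, i=0):
        # returns (value, next position in the bit string)
        n = 0
        while bits[i] == "1":                      # count the unary prefix
            n, i = n + 1, i + 1
        i += 1                                     # skip the 0 stop bit
        rem = int(bits[i:i + n], 2) if n else 0
        return (1 << n) + rem, i + n

    assert gamma_encode(14) == "1110110"           # the x = 14 example above
    assert sum(len(gamma_encode(g)) for g in [1, 2, 4, 63, 180]) == 35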
Elias δ encoding • The unary prefix of the γ encoding is itself encoded in γ • More precisely, 1 + ⌊lg x⌋ is encoded using γ encoding • Let x = 9. After γ encoding it is 1110 001. Using γ encoding on the unary prefix value 4, we get 11000 001
Elias δ encoding • The number of bits is: 1 + 2⌊lg(1 + ⌊lg x⌋)⌋ (for the γ encoding of 1 + ⌊lg x⌋) + ⌊lg x⌋ (for the remainder of the original encoding) = 1 + 2⌊lg lg 2x⌋ + ⌊lg x⌋ • Better than γ for large values of x
Elias δ decoding • First decode ⌊lg x⌋ + 1 from 11000: 2^2 + 00 = 4, so ⌊lg x⌋ = 4 - 1 = 3. Now compute 2^3 + 001 = 9 • Let x = 1,000,000. lg x ≈ 19.93, so ⌊lg x⌋ = 19. The γ code would start with 19 ones followed by a 0, followed by a 19-bit remainder (39 bits). With δ, γ encoding 20 gives 11110 0100, requiring 9 bits (9 + 19 = 28 bits in all)
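The δ code can reuse gamma_encode and gamma_decode from the sketch above; this mirrors the x = 9 and x = 1,000,000 examples.

    def delta_encode(x):
        # gamma-encode 1 + floor(lg x), then the same floor(lg x) remainder bits
        n = x.bit_length() - 1
        rem = x - (1 << n)
        return gamma_encode(n + 1) + (format(rem, "0%db" % n) if n else "")

    def delta_decode(bits, i=0):
        n_plus_1, i = gamma_decode(bits, i)        # recover floor(lg x) + 1
        n = n_plus_1 - 1
        rem = int(bits[i:i + n], 2) if n else 0
        return (1 << n) + rem, i + n

    assert delta_encode(9) == "11000001"           # 11000 001, as above
    assert len(delta_encode(1000000)) == 28        # versus 39 bits for gamma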
Golomb code • Given b, x > 0 is coded in two parts. First q + 1 in unary, where q = ⌊(x - 1)/b⌋; then r = x - bq - 1 is coded in binary, requiring either ⌊lg b⌋ or ⌈lg b⌉ bits • Let b = 3. The remainders are coded 0 (0), 1 (10), 2 (11)
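A sketch of the Golomb encoder. The truncated-binary details (k = ⌈lg b⌉ and the threshold 2^k - b below which remainders save a bit) follow the standard construction; the slides only state the two possible remainder lengths.

    def golomb_encode(x, b):
        # unary quotient q = (x-1)//b, then r = (x-1) % b in truncated binary
        assert x >= 1 and b >= 2
        q, r = (x - 1) // b, (x - 1) % b
        k = (b - 1).bit_length()                   # ceil(lg b)
        t = (1 << k) - b                           # remainders below t save a bit
        unary = "1" * q + "0"
        if r < t:
            return unary + format(r, "0%db" % (k - 1))
        return unary + format(r + t, "0%db" % k)

    # b = 3: remainders 0, 1, 2 are coded 0, 10, 11, as on the slide
    assert [golomb_encode(x, 3)[1:] for x in (1, 2, 3)] == ["0", "10", "11"]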
Local Bernoulli Model • Frequent words are coded with small values of b • Words that appear in 10% of the documents in the collection get b = 7 • Rare words get a very large b
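Under the Bernoulli model, b is derived from the fraction of documents containing the term; a common approximation (from Managing Gigabytes) is b ≈ 0.69 * N / ft. A minimal sketch:

    import math

    def golomb_parameter(N, ft):
        # N documents, term t occurs in ft of them; b ~ 0.69 / p with p = ft / N
        # (clamped to 2 so it works with the encoder sketched above)
        return max(2, math.ceil(0.69 * N / ft))

    assert golomb_parameter(1000, 100) == 7        # term in 10% of documents -> b = 7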
Compression of the temporary file • Compress the <t, d, ft,d> triples • Since runs are sorted by t, we can compute t-gaps and use γ encoding • The <d, ft,d> pairs can be compressed to 1 byte (on average) even with simple compression methods such as γ and δ
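One possible layout for a compressed run, reusing gamma_encode from the sketch above. Storing t-gap + 1 (so that a gap of 0, meaning the same term again, stays encodable) is my convention, not the slides'.

    def compress_run(triples):
        # triples sorted by (t, d); all three fields are gamma-coded
        bits, prev_t, prev_d = [], 0, 0
        for t, d, f in triples:
            if t != prev_t:
                prev_d = 0                             # d-gaps restart per term
            bits.append(gamma_encode(t - prev_t + 1))  # 1 means "same term"
            bits.append(gamma_encode(d - prev_d))      # d-gap within the term
            bits.append(gamma_encode(f))               # within-document frequency
            prev_t, prev_d = t, d
        return "".join(bits)

    run = [(1, 2, 1), (1, 7, 3), (2, 4, 1)]
    print(len(compress_run(run)))                      # a few bits per posting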
Compression • Internal sorting is done concurrently with parsing the text • The dictionary is stored in memory • Initial runs become smaller, so there are more merge passes • Instead of 7 disk-intensive passes there are 9 processor-intensive passes
Compressing the temporary file
Time = B*tr + F*tp (read and index)
+ R*(1.2k lg k)*tc + I'*(tr + td) (sort runs)
+ (log R)*(2I'*(tr + td) + f*tc) (merge in log R passes)
+ (I' + I)*(td + tr) (recompress)
≈ 26 hours; I' ≈ 1.35*I, temp file 680 Mbytes
Multiway merging • The merge is now processor- and not disk-intensive • R-way merge (a 400-way merge of buffers of 100 Kbytes) • A 540-Mbyte compressed file requires 5400 transfers and 5400 seeks
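A minimal sketch of a single-pass R-way merge. Python's heapq.merge keeps a heap of R cursors, so each output triple costs O(lg R) comparisons, matching the f*(log R)*tc term in the formula below.

    import heapq

    def multiway_merge(runs):
        # runs: R sorted iterables of (t, d, f) triples, standing in for the
        # buffered, compressed runs on disk
        return heapq.merge(*runs)

    run1 = [(1, 2, 1), (3, 1, 2)]
    run2 = [(1, 5, 1), (2, 4, 3)]
    assert list(multiway_merge([run1, run2])) == [
        (1, 2, 1), (1, 5, 1), (2, 4, 3), (3, 1, 2)]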
Time - multiway merging
Time = B*tr + F*tp (read and index)
+ R*(1.2k lg k)*tc + I'*(tr + td) (sort runs, compress and write)
+ f*(log R)*tc + I'*(ts/b + tr + td) (merge in one pass)
+ (I' + I)*(td + tr) (recompress)
≈ 11 hours; I' ≈ 1.35*I, temp file 540 Mbytes
In-place multiway merging • All blocks are padded to be exactly b bytes • Each output block is written back into a vacant block of the temp file • To keep track of the output, a block table is generated • The block table is then used to permute the blocks into a sequentially sorted file (see the sketch below) • Requires less additional memory
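A sketch of the final permutation step: given a block table recording which physical block of the temp file holds each logical output block, the blocks are swapped into sequential order by following cycles. This is an in-memory stand-in; on disk each swap costs two block reads and two writes.

    def permute_in_place(blocks, block_table):
        # block_table[i] = physical slot currently holding logical block i
        n = len(blocks)
        dest = [0] * n                      # dest[s] = where slot s's block belongs
        for logical, slot in enumerate(block_table):
            dest[slot] = logical
        for i in range(n):
            while dest[i] != i:             # follow the cycle through slot i
                j = dest[i]
                blocks[i], blocks[j] = blocks[j], blocks[i]
                dest[i], dest[j] = dest[j], dest[i]

    blocks = ["C", "A", "B"]                # logical blocks A, B, C scattered
    permute_in_place(blocks, block_table=[1, 2, 0])
    assert blocks == ["A", "B", "C"]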
Time - in-place multiway merging
Time = B*tr + F*tp (read and index)
+ R*(1.2k lg k)*tc + I'*(tr + td) (sort runs, compress and write)
+ f*(log R)*tc + 2I'*(ts/b + tr + td) (merge and write into empty blocks)
+ 2I'*(ts/b + tr) (permute)
+ (I' + I)*(td + tr) (recompress)
≈ 13 hours; I' ≈ 1.35*I
Large memory inversion • The machine has a large memory (this method would need about 1.5 Gbytes of memory instead of 4 Gbytes; with better compression, about 420 Mbytes) • Saves space in 2 ways: 1. no need for pointers, 2. uses compression
Large memory inversion • The size of the inverted file is computed based on the size required for each inverted list • A pass over the collection will be needed to compute this data • An array of this size is allocated.
Large memory inversion • The lexicon has a pointer to the start of each inverted list and, during inversion, to the current empty location in the list • The size of each inverted list is: dft * ⌈lg N⌉ bits for its d components + dft * ⌈lg max ft,d⌉ bits for its ft,d components • Better compressed sizes can be used
[Diagram: lexicon rows (term, df, start, current) for terms 1-9 pointing into a single preallocated array, positions 0-14; source documents D1: 1, 4; D2: 12, 4; D3: 2, 4; ...]
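A minimal in-memory sketch of the two passes. A Python list stands in for the bit-packed array; the real savings come from the ⌈lg N⌉ / ⌈lg max ft,d⌉ bit widths and compression noted above.

    from collections import defaultdict

    def invert(docs):
        # docs: dict mapping doc id -> {term: within-document frequency}
        # pass 1: document frequencies fix the size of every inverted list
        df = defaultdict(int)
        for d in docs:
            for t in docs[d]:
                df[t] += 1
        start, offset = {}, 0
        for t in sorted(df):
            start[t] = offset
            offset += 2 * df[t]             # one (d, ft,d) pair per posting
        inverted = [None] * offset          # allocated once, exact size
        current = dict(start)               # next empty slot of each list
        # pass 2: drop each posting directly into its precomputed slot
        for d in sorted(docs):
            for t, f in sorted(docs[d].items()):
                inverted[current[t]:current[t] + 2] = [d, f]
                current[t] += 2
        return inverted, start

    inv, start = invert({1: {"tax": 4}, 2: {"pay": 2, "tax": 1}})
    assert inv[start["tax"]:start["tax"] + 4] == [1, 4, 2, 1]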
Time
Time = B*tr + F*tp (read and parse)
+ B*tr + F*tp + 2I'*td + I*(td + tr) (invert)
≈ 12 hours
Text-based partitioning • Inverted files are generated for chunks of text and merged • Each chunk uses the previous method and is done completely in memory • This method uses very little extra disk space (34 Mbytes)
Text-based partitioning • The merging for each new chunk can be done in place by copying the list for every term to its correct location on the disk • About 16 hours
Lexicon-based partitioning - FAST-INV • Developed by Fox • Inverts the file without an external sort • Main idea: dictionary-based partitioning
FAST-INV • Divide input into j load files: • Each can be loaded into main memory • Each has about the same number of concepts
FAST-INV • j is as small as possible • Concept numbers in load file i are greater than the concept numbers in load file k, for all k < i
Doc 1: New York slows its rate of tax growth. But residents pay more than in the other 49 states. State and local taxes went up less than the inflation rate in New York between 1994 and 1996, although they are still the highest in the nation, a new report shows…
Doc 2: Block that refund. Income tax refunds will break last year's record of $114 billion. They are nice to get, but many Americans pay too much. Instead of loaning money to Uncle Sam for a year you could invest it.
Creating the temp file • Each document is converted into a list of DocId/ConceptId pairs • All DocId/ConceptId pairs are stored in a temp file in secondary memory
The temp file
DocId, ConceptId:
1, 1 (growth)
1, 2 (New York)
1, 3 (pay)
1, 4 (resident)
1, 5 (state)
1, 6 (tax)
2, 7 (income)
2, 8 (loan)
2, 3 (pay)
2, 9 (refund)
2, 6 (tax)
Grouped by DocId: 1: 1, 2, 3, 4, 5, 6 and 2: 7, 8, 3, 9, 6
Phase 1 • Initialize concept counts to 0 • Read the temp file and increment the counts • Compute the number of documents (df) per concept • Start building Conptr and the Load-Table
Complete Conptr • Add load numbers using • the counts and • the amount of free memory space • Compute the offsets
Build Load-Table • When a load is determined, a row is added to the Load-Table (a minimal sketch of this phase follows)
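A sketch of how the per-concept counts and the memory limit determine the loads and offsets. The concrete shapes of the Conptr and Load-Table rows here are my assumptions; the slides only describe their roles.

    def partition_loads(df, memory_pairs):
        # df[c] = number of DocId/ConceptId pairs for concept c (0-based);
        # memory_pairs = how many pairs fit in main memory at once
        load_table, conptr = [], []
        first, offset = 0, 0
        for c, count in enumerate(df):
            if offset + count > memory_pairs and offset > 0:
                load_table.append((first, c - 1, offset))   # first/last concept, size
                first, offset = c, 0
            conptr.append((c, len(load_table), offset))     # concept, load no., offset
            offset += count
        load_table.append((first, len(df) - 1, offset))
        return conptr, load_table

    # counts for concepts 1-9 of the two-document example, memory for 6 pairs
    conptr, loads = partition_loads([1, 1, 2, 1, 1, 2, 1, 1, 1], memory_pairs=6)
    assert loads == [(0, 4, 6), (5, 8, 5)]   # concepts 1-5, then concepts 6-9

Scanning concepts in increasing order also guarantees the property above: every concept number in a later load is larger than every concept number in an earlier one.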