Using Fingerprints in n-Gram Indices Stefan Selbach selbach@informatik.uni-wuerzburg.de Digital Libraries: Advanced Methods and Technologies, Digital Collections 17.09.2009
Using Fingerprints in n-Gram Indices Overview • Introduction • Inverted Index • N-Gram Index • Bitmaps • Signature Files • n-Gram Fingerprints • n-Gram Fingerprints in Combination with Posting Lists • Fingerprint Compression • Conclusion and Future Work
Inverted Index • Very common index structure • Term-oriented • Every term is linked to its postings
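To make "every term is linked to its postings" concrete, here is a minimal sketch of a term-oriented inverted index. The function name `build_inverted_index`, the whitespace tokenizer, and the `(doc_id, position)` posting format are assumptions made for illustration, not details from the talk.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to its postings, here (doc_id, word_position) pairs."""
    index = defaultdict(list)
    for doc_id, text in enumerate(docs):
        # Naive tokenizer (an assumption): lowercase and split on whitespace.
        for pos, term in enumerate(text.lower().split()):
            index[term].append((doc_id, pos))
    return dict(index)

docs = ["the skin is an organ", "the organ of touch"]
inverted = build_inverted_index(docs)
organ_postings = inverted["organ"]  # every place the term "organ" occurs
```

Looking up a term is a single dictionary access, which is why this structure is so common for word-level search.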
n-Gram Index • Uses n-grams as indexing terms • Any kind of subsequence can be searched • An n-gram is a subsequence of a text with length n • Postings for longer subsequences can be calculated from the postings of their n-grams
n-Gram Index • Index structure is very similar to an inverted index • Searching is more complex
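The extra search complexity comes from combining n-gram postings into postings for a longer query. The following is a minimal sketch of one way to do that, assuming postings are `(doc_id, offset)` pairs and character-level n-grams; the helper names `build_ngram_index` and `search_substring` are hypothetical.

```python
from collections import defaultdict

def build_ngram_index(docs, n=2):
    """Map each character n-gram to the set of (doc_id, offset) postings."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs):
        for off in range(len(text) - n + 1):
            index[text[off:off + n]].add((doc_id, off))
    return index

def search_substring(index, query, n=2):
    """Compute postings of `query` (len >= n) from the postings of its n-grams."""
    if len(query) < n:
        raise ValueError("query must be at least n characters long")
    result = None
    for k in range(len(query) - n + 1):
        gram = query[k:k + n]
        # The k-th n-gram at offset o implies the query starts at offset o - k.
        aligned = {(d, o - k) for d, o in index.get(gram, ())}
        result = aligned if result is None else result & aligned
        if not result:
            break  # one n-gram is missing, so the query cannot occur
    return sorted(result)

ngram_index = build_ngram_index(["banana", "bandana"], n=2)
ana_postings = search_substring(ngram_index, "ana", n=2)
```

Each query costs one posting-list intersection per n-gram, which is the overhead the fingerprints in the following slides aim to reduce.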
Bitmaps • Bitmaps are occurrence maps • Each bit signals an occurrence of a specific term in a specific document
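As a rough illustration of an occurrence map, the sketch below stores one Python integer per term and sets bit `d` when the term occurs in document `d`. The function names and the whitespace tokenizer are assumptions for the example.

```python
def build_bitmaps(docs):
    """Return {term: int}; bit d of the int is set iff the term occurs in document d."""
    bitmaps = {}
    for doc_id, text in enumerate(docs):
        for term in set(text.lower().split()):
            bitmaps[term] = bitmaps.get(term, 0) | (1 << doc_id)
    return bitmaps

def docs_with_all(bitmaps, terms):
    """AND the bitmaps of all terms and list the documents whose bit survives."""
    combined = -1  # all bits set, the neutral element for AND
    for term in terms:
        combined &= bitmaps.get(term, 0)
    return [d for d in range(combined.bit_length()) if (combined >> d) & 1]

bitmaps = build_bitmaps(["the skin is an organ", "the organ of touch", "skin care"])
both = docs_with_all(bitmaps, ["skin", "organ"])
```

A conjunctive query becomes a bitwise AND, but the bitmap size grows with the number of documents, unlike the fixed-size fingerprints introduced next.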
N-Gram Fingerprint The idea: Create fingerprints that: • Have a fixed size • Contain information about the postings
N-Gram Fingerprint A 2D-Fingerprint is a bit-matrix
N-Gram Fingerprint • Given two 1-grams w1 and w2 and their fingerprints Bw1 and Bw2, the fingerprint Bw1w2 can be approximated using a shifted copy B’w2 of Bw2 • B’w2 is constructed by cyclically shifting each column of Bw2 by one position to the left.
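The combination formula itself is not preserved in this transcript, so the following is only a sketch under stated assumptions: a fingerprint bit at (file class, offset class) is set when some posting has `file_id % FILE_CLASSES` and `offset % OFFSET_CLASSES` equal to those classes, and the 2-gram fingerprint is approximated by a bitwise AND of the first fingerprint with the shifted second one. The matrix layout (one integer per file class, one bit per offset class), the toy sizes, and all function names are hypothetical.

```python
FILE_CLASSES = 8      # toy size, chosen for the example
OFFSET_CLASSES = 16   # toy size, chosen for the example

def fingerprint(postings):
    """Fixed-size bit matrix: rows[f] has bit o set iff some posting falls in (f, o)."""
    rows = [0] * FILE_CLASSES
    for file_id, offset in postings:
        rows[file_id % FILE_CLASSES] |= 1 << (offset % OFFSET_CLASSES)
    return rows

def shift_offsets(rows, k=1):
    """Cyclically move every bit from offset class j to class (j - k) mod OFFSET_CLASSES."""
    mask = (1 << OFFSET_CLASSES) - 1
    k %= OFFSET_CLASSES
    return [((r >> k) | (r << (OFFSET_CLASSES - k))) & mask for r in rows]

def approximate_bigram(fp_w1, fp_w2):
    """Assumed combination: w2 must follow w1 by one position, so shift by one and AND."""
    return [a & b for a, b in zip(fp_w1, shift_offsets(fp_w2, 1))]

# 1-gram postings of "a" and "n" in the toy documents "banana" (0) and "bandana" (1).
fp_a = fingerprint([(0, 1), (0, 3), (0, 5), (1, 1), (1, 4), (1, 6)])
fp_n = fingerprint([(0, 2), (0, 4), (1, 2), (1, 5)])
approx_an = approximate_bigram(fp_a, fp_n)
exact_an = fingerprint([(0, 1), (0, 3), (1, 1), (1, 4)])  # true postings of "an"
```

In this toy case the approximation equals the exact fingerprint. In general it can only contain extra bits: two different files in the same residue class can each contribute one half of the 2-gram, which is why a fingerprint hit on its own still has to be verified.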
N-Gram Fingerprint Search Speed Results from the “Online Encyclopedia of Dermatology from P. Altmeyer”
Combining Fingerprints and Posting Lists By combining fingerprints and posting lists • No verification step is needed • Posting lists are partitioned into smaller subsets. Each bit of the fingerprint corresponds to a separate posting list • The cost of intersecting posting lists is reduced
Managing n-Gram Posting Lists • A very large number of posting subsets has to be managed. For example: 1024 residue classes for the fileID, 128 residue classes for the offset, 14,000 different n-grams • Subsets are stored in a hash table • The hash value is a function of the residue classes
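The partitioning described on the last two slides could be sketched as follows. Python's built-in `dict` stands in for the hash table, and the key `(gram, file class, offset class)` is an assumption; the actual hash function used in the talk is not given. Class counts are toy values and the function names are hypothetical.

```python
from collections import defaultdict

FILE_CLASSES = 4     # toy size; the slide mentions 1024
OFFSET_CLASSES = 8   # toy size; the slide mentions 128

def partition(gram, postings, table):
    """Store each posting in the subset selected by its residue classes."""
    for file_id, offset in postings:
        key = (gram, file_id % FILE_CLASSES, offset % OFFSET_CLASSES)
        table[key].append((file_id, offset))

def find_bigram(table, g1, g2):
    """Exact postings of g1 directly followed by g2, intersecting only matching subsets."""
    hits = []
    for f in range(FILE_CLASSES):
        for o in range(OFFSET_CLASSES):
            left = table.get((g1, f, o))
            right = table.get((g2, f, (o + 1) % OFFSET_CLASSES))
            if not left or not right:
                continue  # an empty subset plays the role of a zero fingerprint bit
            right_set = set(right)
            hits.extend((d, off) for d, off in left if (d, off + 1) in right_set)
    return sorted(hits)

table = defaultdict(list)
partition("a", [(0, 1), (0, 3), (0, 5), (1, 1), (1, 4), (1, 6)], table)
partition("n", [(0, 2), (0, 4), (1, 2), (1, 5)], table)
an_postings = find_bigram(table, "a", "n")
```

Because the subsets hold real postings, the result is exact and no separate verification pass is required; each intersection only touches the small subsets whose residue classes are compatible.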
Managing n-Gram Posting Lists • Figure: hash collisions and collision resolving; series: ... collisions, ... comparisons, ... comparisons after sorting; vertical axis: frequency (0 to 40000); horizontal axis: number of ... (0 to 140)
Results • Performance improved by 40% compared to the setup without posting lists
Fingerprint Compression • Fingerprints with high or low densities do not contain much information • Fingerprints can be compressed by reducing the resolution • Dictionary based compression
Fingerprint Compression • Results: Fingerprint convolution • In combination with dictionary-based compression, the index size is reduced by an additional 30%
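The transcript does not say how the resolution is reduced. One plausible reading, sketched here purely as an assumption, is to fold the fingerprint so that offset classes j and j + width/2 share one bit (a bitwise OR of the two halves). The layout of one integer per file class and the names `density` and `halve_offset_resolution` are hypothetical.

```python
def density(rows, width):
    """Fraction of set bits in a fingerprint with `width` offset classes per row."""
    set_bits = sum(bin(r).count("1") for r in rows)
    return set_bits / (len(rows) * width)

def halve_offset_resolution(rows, width):
    """Fold each row in half: bit j of the result is bit j OR bit j + width/2."""
    if width % 2:
        raise ValueError("width must be even")
    half = width // 2
    mask = (1 << half) - 1
    return [(r & mask) | (r >> half) for r in rows]

sparse = [0b10010010, 0b00000000]           # 2 file classes, 8 offset classes
folded = halve_offset_resolution(sparse, 8) # 2 file classes, 4 offset classes
before, after = density(sparse, 8), density(folded, 4)
```

Folding with OR never clears a bit that should be set, so a folded fingerprint can only add candidates. That makes it a natural fit for low-density fingerprints, where little information is lost.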
Conclusion • Fingerprints improve the scalability of n-gram indices • Fingerprints improve the performance of n-gram indices • The index structure can be adjusted to user behavior, so that common queries can be processed more efficiently • The fingerprints can be stored in a compressed index while losing only a minimum of performance
Future Work • Combination of term based inverted index and n-Gram fingerprint index • Profit from the advantages of both using terms and n-Grams as indexing terms • Substring search • Ranking • Thesaurus information
Thank You! Digital Libraries: Advanced Methods and Technologies, Digital Collections 17.09.2009