530 likes | 550 Views
This comprehensive guide delves into the principles and methods of indexing and searching in modern information retrieval systems, exploring concepts such as inverted files, signatures, PAT trees, and sequential searching. It covers the preprocessing of documents, association of keywords with relevance weights, and the construction of inverted file structures. Learn about vocabulary and dictionary structures, techniques for retrieval and manipulation of occurrences, and the advantages and disadvantages of using signature files. Discover the complexities of sequential signature files, false drop probabilities, and the classification of signature-based methods by storage overhead, response time, and insertion performance. Master techniques like vertical partitioning and explore various methods for storing and optimizing signature matrices.
E N D
Indexing and Searching(File Structures) Modern Information Retrieval (Chapter 8) With G. Navarro
File Struces • Inverted Files • Signatures • PAT Trees • Sequential Searching • Compression
Inverted Files Information Retrieval: Data Structures and Algorithms (Chapters 3) W.B. Frakes and R. Baeza-Yates (Eds.) 1992.
Inverted Files • Characteristics • A word-oriented mechanism based on sorted list of keywords, with each keyword having links to the documents containing that keyword. • Preprocessing • Each document is assigned a list of keywords or attributes. • Each keyword (attribute) is associated with relevance weights.
Inversion of Word List 1. The input text is parsed into a list of words along with their location in the text. (time and storage consuming operation) 2. This list is inverted from a list of terms in location order to a list of terms in alphabetical order. 3. Add term weights, or reorganize or compress the files.
Structure and Construction • Structure (split the index into two files) • Vocabulary: O(nb) according to Heaps’ Law • Occurrences : depends on the addressing granularity • Construction • The vocabulary is stored in lexicographical order and points to posting list. • Posting file:the lists of occurrences are stored contiguously
Dictionary and Postings File (document #, frequency)
Structures used in Inverted Files • Vocabulary • Sorted Arrays • Hashing Structures • Keyword Trees: Tries (digital search trees) • The Search Procedure • Vocabulary search • Retrieval of occurrences • Manipulation of occurrences
Size of an Inverted File Block addressing The text is divided in blocks, and the occurrences point to the blocks instead of full inverted indices where exact occurrences are recorded
Cost • Advantage • easy to implement • Disadvantage • updating the index is expensive
Signature Files Information Retrieval: Data Structures and Algorithms (Chapters 4) W.B. Frakes and R. Baeza-Yates (Eds.) Englewood Cliffs, NJ: Prentice Hall, 1992.
Signature Files • Characteristics • Word-oriented index structures based on hashing • Low overhead (10%~20% over the text size) at the cost of forcing a sequential search over the index • Suitable for not very large texts • Inverted files outperform signature files for most applications
Construction and Search • Word-oriented index structures base on hashing • Maps words to bit masks of B bits • Divides the text in blocks of b words each • The mask is obtained by bitwise ORing the signatures of all the words in the text block. • Search • Hash the query to a bit mask W • If W & Bi = W, the text block may contain the word
Block 4: 001100 OR 100001 101101 Example • Four blocks: • This is a text. A text has many words. Words are made from letters. 000101 110101 100100 101101 • Hash(text) = 000101 • Hash(many)= 110000 • Hash(words)= 100100 • Hash(made)= 001100 • Hash(letters)= 100001
False Drop • Assumes that m bits are randomly set in the mask • Let a=m/B • For b words, the probability that a given bit of the mask is set is 1-(1-1/B)bm1-e-ba • Hence, the probability that the l random bits are also set is Fd =(1-e-ba)aB False alarm • Fd is minimized for a=ln(2)/b • Fd = 2-mm = B ln2/b
Sequential Signature File (SSF) Assume documents span exactly one logical block the size of document signature F = the size of block signature B
Classification of Signature-Based Methods • Horizontal partitioningGrouping similar signatures together and/or providing an index on the signature matrix may result in better-than-linear search. • Vertical partitioningStoring the signature matrix column-wise improves the response time on the expense of insertion time.
Classification of Signature-Based Methods • Vertical partitioning • without compression bit-sliced signature files (BSSF, B’SSF) frame sliced (FSSF) generalized frame-sliced (GFSSF) • with compression compressed bit slices (CBS) doubly compressed bit slices (DCBS) no-false-drop method (NFD)
Classification of Signature-Based Methods • Sequential storage of the signature matrix • without compression sequential signature files (SSF) • with compression bit-block compression (BC) variable bit-block compression (VBC) • Horizontal partitioning • data independent partitioning Gustafson’s method partitioned signature files • data dependent partitioning 2-level signature files 5-trees
Criteria • The storage overhead • The response time on single word queries • The performance on insertion, as well as whether the insertion maintains the “append-only” property
Vertical Partitioning • Ideaavoid bringing useless portions of the document signature in main memory • Methods • store the signature file in a bit-sliced form or in a frame-sliced form • store the signature matrix column-wise to improve the response time on the expense of insertion time
Bit-Sliced Signature Files (BSSF) Transposed bit matrix documents (document signature) transpose documents represent
documents F bit-files search: (1) retrieve m bit-files. e.g., the word signature of free is 001 000 110 010 the document contains “free”: 3rd, 7th, 8th, 11th bit are set i.e., only 3rd, 7th, 8th, 11th files are examined. (2) “and” these vectors. The 1s in the result N-bit vector denote the qualifying logical blocks (documents). (3) retrieve text file through pointer file. insertion: require F disk accesses for a new logical block (document), one for each bit-file, but no rewriting
Frame-Sliced Signature File (FSSF) • Ideas • Random disk accesses are more expensive than sequential ones • Force each word to hash into bit positions that are closer to each other in the document signature • these bit files are stored together and can be retrieved with a few random accesses • Procedures • The document signature (F bits long) is divided into k frames of s consecutive bits each. • For each word in the document, one of the k frames will be chosen by a hash function. • Using another hash function, the word sets m bits in that frame.
Frame-Sliced Signature File (Cont.) documents frames Each frame will be kept in consecutive disk blocks.
FSSF (Continued) • Example (n=2, B=12, s=6, f=2, m=3)Word Signature free 000000 110010 text 010110 000000 doc. signature 010110 110010 • Search • Only one frame has to be retrieved for a single word query. I.E., only one random disk access is required.e.g., search documents that contain the word “free”->because the word signature of “free” is placed in 2nd frame,only the 2nd frame has to be examined. • At most k frames have to be scanned for an k word query. • Insertion • Only f frames have to be accessed instead of F bit-slices.
Horizontal Partitioning 1. Goal: group the signatures into sets, partitioning the signature matrix horizontally. 2. Grouping criterion documents
Partitioned Signature Files • Using a portion of a document signature as a signature key to partition the signature file. • All signatures with the same key will be grouped into a so-called “module”. • When a query signature arrives, • examine its signature key and look for the corresponding modules • scan all the signatures within those modules that have been selected
Suffix Trees and Suffix Arrays • Each position in the text is considered as a text suffix • Index points are selected form the text, which point to the beginning of the text positions which will be retrievable
Suffix arrays • The main drawbacks of Suffix Array are its costlyconstruction process. • Allow binary searches done by comparing the contents of each pointer. • Supra-indices (for large suffix array)
Algorithms • Brute Force • Knuth-Morris-Pratt • Boyer-Moore Family • Shift-Or • Suffix Automaton
Algorithms • Searching allowing errors • Dynamic Programming • Automaton • Regular Expressions and Extended patterns • Pattern Matching Using Indices • Inverted files • Suffix Trees and Suffix Arrays
Pattern Matching Using Indices • Inverted Files • The types of queries such as suffix or substring queries, searching allowing errors and regular expressions, are solved by a sequential search • The restriction is to find approximate matches or regular expressions that span many word.