CS 430: Information Discovery Lecture 4 File Structures for Inverted Files
Course Administration • Assignment 1 has been posted on the web site.
Right Threaded Binary Tree Threaded tree: A binary search tree in which each node uses an otherwise-empty left child link to refer to the node's in-order predecessor and an empty right child link to refer to its in-order successor. Right-threaded tree: A variant of a threaded tree in which only the right thread, i.e. link to the successor, of each node is maintained. Knuth vol 1, 2.3.1, page 325.
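The right-threaded variant can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture or from Knuth: the names `Node`, `insert`, `inorder`, and the `right_is_thread` flag are assumptions made for this example. The point is that an in-order traversal needs neither recursion nor a stack, because each empty right link is reused as a thread to the successor.

```python
class Node:
    """Node of a right-threaded binary search tree (illustrative names)."""

    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None             # real right child, or thread to successor
        self.right_is_thread = False  # True when `right` is a thread


def insert(root, key):
    """Insert key and return the (possibly new) root."""
    new = Node(key)
    if root is None:
        return new
    cur = root
    while True:
        if key < cur.key:
            if cur.left is None:
                # A new left child has its parent as in-order successor.
                new.right = cur
                new.right_is_thread = True
                cur.left = new
                return root
            cur = cur.left
        else:
            if cur.right is None or cur.right_is_thread:
                # A new right child inherits the parent's thread (if any).
                new.right = cur.right
                new.right_is_thread = cur.right is not None
                cur.right = new
                cur.right_is_thread = False
                return root
            cur = cur.right


def leftmost(node):
    """Follow left links as far as possible."""
    while node is not None and node.left is not None:
        node = node.left
    return node


def inorder(root):
    """In-order traversal using threads instead of a stack."""
    out = []
    cur = leftmost(root)
    while cur is not None:
        out.append(cur.key)
        if cur.right_is_thread:
            cur = cur.right            # jump straight to the successor
        else:
            cur = leftmost(cur.right)  # descend into the right subtree
    return out
```

For example, inserting 50, 30, 70, 20, 40, 60, 80 in that order and calling `inorder` visits the keys in ascending order.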
Right Threaded Binary Tree From: Robert F. Rossa
Definitions Keyword: A term that is used to describe the subject matter in a document. It is sometimes called an index term. In full text indexing, every word in the text is treated as a keyword (with the exception of stopwords). Keywords can be extracted automatically from a document or assigned by a human cataloguer or indexer. Controlled vocabulary: A list of words that can be used as keywords. For example, in a retrieval system used for research papers in medicine, the controlled vocabulary might be a list of medical terms.
Restrictions in Building Inverted Files • Underlying character set, e.g., printable ASCII, Unicode, UTF8. • Whether to use a controlled vocabulary. If so, what words to include. • List of stopwords. • Rules to decide the beginning and end of words, e.g., spaces or punctuation. • Character sequences not to be indexed, e.g., sequences of numbers.
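The choices listed above can be made concrete with a small tokenizer. This is only a sketch under stated assumptions: the stopword list, the rule that a word is a run of ASCII letters or digits, and the function name `tokenize` are all invented for illustration and are not prescribed by the lecture.

```python
import re

# Illustrative stopword list; a real system would use a longer one.
STOPWORDS = {"a", "an", "and", "in", "is", "of", "the", "to"}


def tokenize(text, stopwords=STOPWORDS, index_numbers=False):
    """Split text into index terms.

    Assumed rules: lower-case everything, treat any run of ASCII letters
    or digits as a word, drop stopwords, and (unless index_numbers is
    True) drop sequences that consist only of digits.
    """
    terms = []
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        if token in stopwords:
            continue
        if not index_numbers and token.isdigit():
            continue
        terms.append(token)
    return terms
```

With these rules, punctuation ends a word, so "B-tree" is indexed as the two terms "b" and "tree". A different word-boundary rule would give a different index, which is why these decisions have to be fixed before the inverted file is built.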
Representation of Inverted Files Index file: Stores list of terms (keywords). Designed for rapid searching and processing range queries. May be held in memory. Postings file: Stores list of postings for each term. Designed for rapid evaluation of Boolean operators. May be stored sequentially. Document file: [Repositories for the storage of document collections are covered in CS 502.]
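A toy version of the index file and postings file might look like the following. It is a sketch, not the layout used by any particular system: both files are held as in-memory Python lists, and the function names are assumptions. The sorted term list supports binary search and range queries; the sorted postings lists support a Boolean AND by a linear merge.

```python
import bisect


def build_inverted_file(docs):
    """docs maps doc_id -> list of terms. Returns (index, postings).

    index is a sorted list of terms; postings[i] is the sorted list of
    doc_ids that contain index[i].
    """
    by_term = {}
    for doc_id in sorted(docs):
        for term in set(docs[doc_id]):
            by_term.setdefault(term, []).append(doc_id)
    index = sorted(by_term)
    postings = [by_term[term] for term in index]
    return index, postings


def lookup(index, postings, term):
    """Binary search the index file; return the postings list or []."""
    i = bisect.bisect_left(index, term)
    if i < len(index) and index[i] == term:
        return postings[i]
    return []


def range_query(index, lo, hi):
    """All terms t with lo <= t <= hi, using the sorted index."""
    return index[bisect.bisect_left(index, lo):bisect.bisect_right(index, hi)]


def intersect(p1, p2):
    """Boolean AND of two sorted postings lists by a linear merge."""
    i = j = 0
    out = []
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            out.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return out
```

Because each postings list is kept in document order, the merge in `intersect` reads both lists once from front to back, which fits the remark that the postings file may be stored sequentially.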
Sizes of Inverted Files
Set   Records   Unique Terms
A       2,653          5,123
B      38,304       c.25,000
Set A has an average of 14 postings per term and a maximum of over 2,000 postings per term. Set B has an average of 88 postings per record. Examples from Harman and Candela, 1990
B-trees B-tree of order m: A balanced, multiway search tree: • Each node stores many keys • Root has between 2 and 2m keys. All other internal nodes have between m and 2m keys. • If k_i is the ith key in a given internal node, then all keys in the (i-1)th child are smaller than k_i and all keys in the ith child are bigger than k_i • All leaves are at the same depth
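The key-ordering rule is what makes search work: at each node, the position of the search key among the node's keys selects the child to descend into. The sketch below shows search only (no insertion or rebalancing). The class name `BTreeNode`, the function `search`, and the small hand-built tree of order 2 are assumptions made for illustration.

```python
import bisect


class BTreeNode:
    """A B-tree node: sorted keys plus len(keys) + 1 children (or none)."""

    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children or []


def search(node, key):
    """Return True if key is stored anywhere in the tree rooted at node."""
    while node is not None:
        # i is the number of keys in this node that are smaller than key.
        i = bisect.bisect_left(node.keys, key)
        if i < len(node.keys) and node.keys[i] == key:
            return True
        if not node.children:
            return False            # reached a leaf without finding key
        node = node.children[i]     # keys in child i lie between keys[i-1] and keys[i]
    return False


# Hypothetical tree of order 2: every node holds between 2 and 4 keys.
root = BTreeNode(
    [50, 65],
    [BTreeNode([10, 25]), BTreeNode([55, 59]), BTreeNode([70, 81, 90])],
)
```

Here `search(root, 59)` compares against the root keys, descends into the middle child, and finds the key there; `search(root, 60)` follows the same path and fails at the leaf.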
B+-tree B+-tree: • A B-tree is used as an index • Data is stored in the leaves of the tree, known as buckets
[Figure: Example B+-tree of order 2, bucket size 4. Index keys shown: 50 65 at the top; 10 25, 55 59, and 70 81 90 below. Buckets shown: D9, D51 ..., D54, D66 ..., D81 ...]
B-tree Discussion For a discussion of B-trees, see Frakes, Section 2.3.1, pages 18-20. • B-trees combine fast retrieval with moderately efficient updating. • Bottom-up updating is usually fast, but may require recursive tree climbing to the root. • The main weakness is poor storage utilization; typically buckets are only about 69% full. • Various algorithmic improvements increase storage utilization at the expense of updating performance.
Signature Files Inexact filter: A quick test which discards many of the non-qualifying items. Advantages • Much faster than full text scanning -- 1 or 2 orders of magnitude • Modest space overhead -- 10% to 15% of file • Insertion is straightforward Disadvantages • Sequential searching no good for very large files • Some hits are false hits
Signature Files Signature size. Number of bits in a signature, F. Word signature. A bit pattern of size F with m bits set to 1 and the others 0. The word signature is calculated by a hash function. Block. A sequence of text that contains D distinct words. Block signature. The logical OR of all the word signatures in a block of text.
Signature Files Example
Word              Signature
free              001 000 110 010
text              000 010 101 001
block signature   001 010 111 011
F = 12 bits in a signature, m = 4 bits per word, D = 2 words per block
Signature Files A query term is processed by matching its signature against the block signature. (a) If the term is in the block, its word signature will always match the block signature. (b) A word signature may match the block signature even though the word is not in the block. This is a false hit (also called a false drop). The design challenge is to minimize the false drop probability, Fd. Frakes, Section 4.2, page 47 discusses how to minimize Fd. The rest of that chapter discusses enhancements to the basic algorithm.
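The matching rule can be sketched as follows. The hash-based `word_signature` is an assumption: the lecture does not specify a hash function, so this one (built on SHA-256) will not reproduce the bit patterns in the example slide. The constants at the bottom copy the slide's signatures directly so that the match and false-hit cases can be checked by hand; `FALSE_HIT` is a made-up signature chosen for illustration.

```python
import hashlib

F = 12  # signature size in bits
M = 4   # bits set per word signature


def word_signature(word, f=F, m=M):
    """Hash a word to an f-bit pattern with exactly m bits set.

    Assumption: bit positions come from successive SHA-256 digests of
    the word plus a counter; any hash that sets m distinct bits would do.
    """
    sig = 0
    counter = 0
    while bin(sig).count("1") < m:
        digest = hashlib.sha256(f"{word}:{counter}".encode()).digest()
        sig |= 1 << (int.from_bytes(digest[:4], "big") % f)
        counter += 1
    return sig


def block_signature(words, f=F, m=M):
    """Logical OR of the word signatures of all words in the block."""
    sig = 0
    for word in words:
        sig |= word_signature(word, f, m)
    return sig


def may_contain(block_sig, word_sig):
    """True if every bit of the word signature is set in the block signature."""
    return (block_sig & word_sig) == word_sig


# Signatures copied from the example slide.
FREE = 0b001_000_110_010
TEXT = 0b000_010_101_001
BLOCK = FREE | TEXT

# Hypothetical signature of a word that is NOT in the block, yet all four
# of its bits happen to be covered by the block signature: a false hit.
FALSE_HIT = 0b001_010_100_001
```

`may_contain(BLOCK, FREE)` and `may_contain(BLOCK, TEXT)` are both true, as case (a) requires. `may_contain(BLOCK, FALSE_HIT)` is also true, which is case (b): the filter can only say "possibly present", so candidate blocks must still be checked against the text.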
Tries Basic concept The text is divided into unique semi-infinite strings, or sistrings. Each sistring has a starting position in the text, and continues to the right until it is unique. The sistrings are stored in (the leaves of) a tree, the suffix tree. Common parts are stored only once. Each sistring can be associated with a location within a document where the sistring occurs. Subtrees below a certain node represent all occurrences of the substring represented by that node. Suffix trees (and similar suffix arrays) have a size of the same order of magnitude as the input documents.
Tries: Suffix Tree Example: suffix tree for the following words: begin beginning between bread break
b
  e
    gin
      _        (begin)
      ning     (beginning)
    tween      (between)
  rea
    d          (bread)
    k          (break)
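The idea that common parts are stored only once, and that the subtree below a node represents every string sharing that prefix, can be tried out with a small dictionary-based trie. This is a simplified sketch: it stores one character per node rather than merging runs such as "gin" or "tween", and the names `build_trie`, `words_with_prefix`, and the `"$"` end marker are assumptions for this example.

```python
END = "$"  # assumed end-of-word marker; must not occur in the words


def build_trie(words):
    """Build a nested-dict trie with one character per edge."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def words_with_prefix(trie, prefix):
    """Return, in sorted order, every stored word that starts with prefix."""
    node = trie
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    found = []
    stack = [(node, prefix)]
    while stack:
        current, spelled = stack.pop()
        for ch, child in current.items():
            if ch == END:
                found.append(spelled)
            else:
                stack.append((child, spelled + ch))
    return sorted(found)


trie = build_trie(["begin", "beginning", "between", "bread", "break"])
```

Looking up the prefix "beg" walks three edges and then collects everything below that node, which yields "begin" and "beginning".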
Tries: Sistrings A binary example
String: 01 100 100 010 111
Sistrings:
1  01 100 100 010 111
2  11 001 000 101 11
3  10 010 001 011 1
4  00 100 010 111
5  01 000 101 11
6  10 001 011 1
7  00 010 111
8  00 101 11
Tries: Lexical Ordering
7  00 010 111
4  00 100 010 111
8  00 101 11
5  01 000 101 11
1  01 100 100 010 111
6  10 001 011 1
3  10 010 001 011 1
2  11 001 000 101 11
(On the original slide, the unique remaining subtrie is indicated in red.)
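The binary example can be reproduced in a few lines. The sketch below treats the text as a Python string of "0" and "1" characters with the grouping spaces removed; the function names are assumptions for this example.

```python
def sistrings(text, count):
    """Return (start_position, sistring) pairs for the first `count`
    starting positions. Positions are numbered from 1, as on the slide."""
    return [(i + 1, text[i:]) for i in range(count)]


def lexical_order(pairs):
    """Starting positions sorted by the lexical order of their sistrings."""
    return [pos for pos, s in sorted(pairs, key=lambda pair: pair[1])]


TEXT = "01100100010111"   # 01 100 100 010 111 without the spaces
ORDER = lexical_order(sistrings(TEXT, 8))
```

`ORDER` comes out as 7, 4, 8, 5, 1, 6, 3, 2, the same sequence as in the lexical ordering listed above.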
Trie: Basic Concept [Figure: binary trie built from the eight sistrings. Branches are labelled 0 and 1; the leaves hold the sistring numbers 1 to 8.]
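Patricia Tree [Figure: Patricia tree for the eight sistrings. Internal nodes are labelled with bit numbers, branches with 0 and 1, and the leaves hold the sistring numbers 1 to 8.] Single-descendant nodes are eliminated. Each internal node stores the number of the bit that it tests.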
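One way to see where the bit numbers come from, without building the tree, is to look at neighbouring sistrings in lexical order: the first bit at which two neighbours differ is the bit that some internal node must test. The sketch below uses the eight sistrings of the binary example 01 100 100 010 111. It is an illustration of that property, not a Patricia tree implementation, and the names are assumptions.

```python
def first_differing_bit(a, b):
    """1-based position of the first bit at which a and b differ."""
    for position, (x, y) in enumerate(zip(a, b), start=1):
        if x != y:
            return position
    raise ValueError("one string is a prefix of the other")


BITS = "01100100010111"
sorted_sistrings = sorted(BITS[i:] for i in range(8))

# One bit number per pair of lexical neighbours: 8 leaves, 7 internal nodes.
bit_numbers = [
    first_differing_bit(a, b)
    for a, b in zip(sorted_sistrings, sorted_sistrings[1:])
]
```

For this input `bit_numbers` is 3, 5, 2, 3, 1, 4, 2: seven values for seven internal nodes, with bit 1 tested once (at the root), bits 2 and 3 twice each, and bits 4 and 5 once each.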