Sets of Digital Data CSCI 2720 Fall 2005 Kraemer
Digital Data • In earlier work with BSTs and various balanced trees, we compared keys for order or equality • Here, we take advantage of the structure of the key: • Use it as an index, or • Decompose a string key into characters, or • Treat the key as a numerical quantity on which we can perform operations
Assumptions • We will construct and manipulate sets that are drawn from a universe U of size N • U = {u_0, …, u_{N-1}} • A relatively simple procedure exists by which we can compute, for an element u ∈ U, the index i such that u = u_i • Easy if U is a set of integers • Also easy if U is a set of characters with character codes in a contiguous interval
Bit Vector • Used to represent a subset S ⊆ U • A table of N bits, Bits[0..N-1] • Bits[i] == 1 if u_i ∈ S • Bits[i] == 0 if u_i ∉ S • Example: today’s attendance (1 = present, 0 = absent)
student number: 0 1 2 3 4 5 6
bit:            1 1 0 1 0 1 1
Bit Vectors • Assume: • determining an element's index takes constant time • accessing a position in the table takes constant time • Each may actually take several operations, and depend somewhat on N (the size of the universe), but not on the size of the set represented • Then: • Insert, Delete, and Member are constant-time operations
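The constant-time operations above can be sketched as follows. This is a minimal illustration, assuming the universe is the integers 0..N-1 (so the index of an element is the element itself) and a 32-bit word; the class name `BitVector` and its method names are hypothetical, not from the slides.

```python
class BitVector:
    """Subset of the universe {0, ..., n-1}, one bit per potential element."""

    WORD = 32  # assumed machine word size in bits

    def __init__(self, n):
        self.n = n
        # One word holds WORD bits; round up so every index has a slot.
        self.words = [0] * ((n + self.WORD - 1) // self.WORD)

    def _locate(self, i):
        # Map index i to (word number, single-bit mask) by shifting and masking.
        if not 0 <= i < self.n:
            raise IndexError(i)
        return i // self.WORD, 1 << (i % self.WORD)

    def insert(self, i):
        w, mask = self._locate(i)
        self.words[w] |= mask

    def delete(self, i):
        w, mask = self._locate(i)
        self.words[w] &= ~mask

    def member(self, i):
        w, mask = self._locate(i)
        return (self.words[w] & mask) != 0


# The attendance example: students 0, 1, 3, 5, 6 are present.
attendance = BitVector(7)
for student in (0, 1, 3, 5, 6):
    attendance.insert(student)
```

None of the three operations loops over the set, so their cost does not depend on how many elements are present.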
Bit Vectors • A subset of a set of size N always takes N bits to represent, independent of size of subset • Makes sense if: • N is not too large • need to represent sets of size comparable to N
Storage Efficiency • Bit Vector vs. Binary Trees • Binary tree, set of size n • Requires n(2p + K) bits • K >= lg N, size of the field that represents a key value • p = number of bits in a pointer • Bit vector takes N bits • If n is comparable to N, the bit vector is more efficient • If p = K = 32, the tree becomes more space efficient once n/N drops below about 1% • More precisely, the break-even point is n(2p + K) = N, which is n/N = 1/96
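The break-even arithmetic can be checked with a few lines of Python. This is only a sketch of the formulas on the slide; the function names are hypothetical, and the defaults p = K = 32 are the slide's example values.

```python
from fractions import Fraction


def tree_bits(n, p=32, K=32):
    """Bits for a binary tree holding n keys: two pointers plus one key per node."""
    return n * (2 * p + K)


def bit_vector_bits(N):
    """Bits for a bit vector over a universe of size N."""
    return N


def break_even_ratio(p=32, K=32):
    """The ratio n/N at which n(2p + K) = N."""
    return Fraction(1, 2 * p + K)


def tree_is_smaller(n, N, p=32, K=32):
    return tree_bits(n, p, K) < bit_vector_bits(N)
```

With the default values, `break_even_ratio()` is 1/96, a little over 1%: for a universe of N = 96000, the tree is smaller only when it holds fewer than 1000 keys.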
When to use Bit Vectors? • When universe is relatively small • When sets are large in relation to size of universe
Advantages of Bit Vectors • O(1) implementation of Insert, Delete, Member • Union and Intersection are easy • Implement via Boolean OR (union) and AND (intersection) operations • May actually take less than one operation per element, as operations are performed on full machine words • If the machine word is 32 bits, then one machine operation handles 32 potential elements of the set
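A word-at-a-time union and intersection might look like the sketch below. It assumes both sets are stored as equal-length lists of 32-bit words over the same universe; the helper names `to_words` and `to_elements` are hypothetical and exist only to make the example easy to inspect.

```python
WORD = 32  # assumed machine word size in bits


def to_words(elements, n):
    """Pack a collection of indices from {0, ..., n-1} into a list of words."""
    words = [0] * ((n + WORD - 1) // WORD)
    for i in elements:
        words[i // WORD] |= 1 << (i % WORD)
    return words


def to_elements(words):
    """Unpack a list of words back into a Python set of indices."""
    return {w * WORD + b
            for w, word in enumerate(words)
            for b in range(WORD)
            if (word >> b) & 1}


def union(a, b):
    # One OR per word handles WORD potential elements at once.
    return [x | y for x, y in zip(a, b)]


def intersection(a, b):
    # One AND per word handles WORD potential elements at once.
    return [x & y for x, y in zip(a, b)]


a = to_words({0, 1, 33}, 40)
b = to_words({1, 2, 33, 39}, 40)
```

Here a universe of 40 elements fits in two words, so each set operation takes two Boolean operations regardless of how many elements the sets contain.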
Disadvantages of Bit Vectors • On some computers, access to individual bits can require shifting and masking operations (expensive) • Result is that Member may be much more expensive per element than Union • Initialization takes Θ(N): zero all the bits in the vector • Can use a constant-time initialization algorithm instead • But that makes the storage requirement go to 2p + 1 bits per element • So, in practice, just use machine operations to set the words to zero, which is efficient
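The slide does not say which constant-time initialization algorithm it means. One well-known scheme that matches the 2p + 1 bits per element figure keeps, for each position, a forward pointer into a stack and a back pointer from the stack, and trusts a bit only when the two pointers agree. The sketch below assumes that scheme; Python cannot leave memory uninitialized, so the arrays are deliberately filled with random garbage to simulate it, and the class and attribute names are hypothetical.

```python
import random


class LazyBitVector:
    """Bit vector that never zeroes its storage (simulated with random garbage)."""

    def __init__(self, n, seed=0):
        rng = random.Random(seed)
        # Garbage stands in for uninitialized memory; nothing below relies on it.
        self.bits = [rng.randrange(2) for _ in range(n)]  # 1 bit per element
        self.ptr = [rng.randrange(n) for _ in range(n)]   # forward pointer (p bits)
        self.back = [rng.randrange(n) for _ in range(n)]  # back pointer (p bits)
        self.top = 0  # number of positions initialized so far

    def _valid(self, i):
        # Position i is trustworthy only if its stack slot lies below top
        # and that slot points back at i.
        j = self.ptr[i]
        return j < self.top and self.back[j] == i

    def _touch(self, i):
        # First use of position i: claim a stack slot and clear the bit.
        if not self._valid(i):
            self.ptr[i] = self.top
            self.back[self.top] = i
            self.top += 1
            self.bits[i] = 0

    def insert(self, i):
        self._touch(i)
        self.bits[i] = 1

    def delete(self, i):
        self._touch(i)
        self.bits[i] = 0

    def member(self, i):
        return self._valid(i) and self.bits[i] == 1
```

Only `top = 0` has to be set at creation time, which is why initialization is constant time in a language that can leave the arrays untouched. The price is the two extra pointers per element.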
Tries and Digital Search Trees • If the key can be decomposed into characters, then the characters of the key can be used as indices • Tries are based on this idea • “trie” is the middle syllable of “retrieval”, a pun on “tree”, but pronounced “try”
Tries • Assume k possible character values • A trie is a (k+1)-ary tree • Each node is a table of k+1 pointers • One pointer for each possible character • One for the end-of-string character
Tries • The path for a key of m characters has length m, ending with the end-of-string pointer • Don’t need to store the key itself: it is the path followed • The info field might be pointed to by that table element
Tries: Analysis • Let: • n be the number of keys stored in the trie • l be the length (in characters) of the longest key • s be the number of nodes in the trie • k be the size of the alphabet • Pro: • Access time is O(l), independent of k, n, and s • Con: • Size: requires (k+1) * s * p bits, where p is the number of bits in a pointer • Most pointers are null, so lots of wasted space
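A minimal trie along these lines is sketched below, assuming an alphabet of the uppercase letters A-Z (so k = 26) and using the last table slot as the end-of-string pointer. The names `Trie`, `TrieNode`, `slot`, and `size_bits` are hypothetical, and storing `True` in the end-of-string slot stands in for a pointer to an info field.

```python
K = 26    # assumed alphabet: uppercase A-Z
END = K   # slot k holds the end-of-string pointer


def slot(ch):
    """Constant-time map from a character to its table index."""
    i = ord(ch) - ord("A")
    if not 0 <= i < K:
        raise ValueError(f"character outside assumed alphabet: {ch!r}")
    return i


class TrieNode:
    def __init__(self):
        self.child = [None] * (K + 1)  # k character pointers + end-of-string


class Trie:
    def __init__(self):
        self.root = TrieNode()
        self.nodes = 1  # s in the analysis

    def insert(self, key, info=True):
        node = self.root
        for ch in key:
            i = slot(ch)
            if node.child[i] is None:
                node.child[i] = TrieNode()
                self.nodes += 1
            node = node.child[i]
        node.child[END] = info  # the key itself is never stored

    def lookup(self, key):
        """Return the info stored for key, or None if key is absent."""
        node = self.root
        for ch in key:
            node = node.child[slot(ch)]
            if node is None:
                return None
        return node.child[END]

    def size_bits(self, p=32):
        return (K + 1) * self.nodes * p


t = Trie()
for name in ("MENDEL", "MENDELEEV", "TURING"):
    t.insert(name)
```

Lookup does one table access per character, so its cost depends only on the key length. The three keys above create 16 nodes, each with 27 slots, of which only a handful are non-null.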
Strategies for reducing storage requirements of tries • Implement a k-ary trie with m nodes as a 2-D, m by k table
[Figure: table with one column per character (A, B, C, D, E, …, M, …, P, …, T, …) and rows numbered 0, 1, 2, 3, 4, 5]
Table approach • Number the nodes in the diagram of slide 13 from 1 to m • The table entry corresponding to the jth child of the ith node is the index of the child node • How does that save space? There are just as many nodes and elements as on slide 13 • But each entry needs only ceil(lg(m)) bits, which is smaller than a pointer
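The table approach can be sketched as a list of rows holding small integers instead of pointers. The sketch assumes an alphabet of A-Z, nodes numbered from 1 with the root as node 1, and the value 0 reserved to mean "no child"; that reservation is an assumption of this example and means an entry needs enough bits for m + 1 values rather than m. The class name `TableTrie` and the extra end-of-string column are likewise hypothetical.

```python
K = 26    # assumed alphabet: uppercase A-Z
END = K   # extra column marking "a key ends here"


def column(ch):
    c = ord(ch) - ord("A")
    if not 0 <= c < K:
        raise ValueError(f"character outside assumed alphabet: {ch!r}")
    return c


class TableTrie:
    def __init__(self):
        # rows[0] is unused so that entry 0 can mean "no child"; rows[1] is the root.
        self.rows = [None, [0] * (K + 1)]

    def insert(self, key):
        node = 1
        for ch in key:
            c = column(ch)
            if self.rows[node][c] == 0:
                self.rows.append([0] * (K + 1))
                self.rows[node][c] = len(self.rows) - 1  # index of the new node
            node = self.rows[node][c]
        self.rows[node][END] = 1

    def member(self, key):
        node = 1
        for ch in key:
            node = self.rows[node][column(ch)]
            if node == 0:
                return False
        return self.rows[node][END] == 1

    def node_count(self):
        return len(self.rows) - 1

    def bits_per_entry(self):
        # Smallest width that can hold every value in 0..m.
        return self.node_count().bit_length()


tt = TableTrie()
for name in ("MENDEL", "MENDELEEV", "TURING"):
    tt.insert(name)
```

With 16 nodes, each entry fits in 5 bits under these assumptions, compared with 32 bits for a pointer in the slide's running example.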
Patricia Tree: Another strategy for reducing space in a trie • Patricia tree • Practical Algorithm to Retrieve Information Coded in Alphanumeric • Eliminate nodes with only one nonempty child • Can now skip right from T to the end-of-string marker in TURING in our example • Skip from MA … to E or the end-of-string marker in the MENDEL, MENDELEEV chain • But need to store with each node the index of the character on which it discriminates • And need to store the key itself at the leaf
de la Briandais trees • Another strategy to save space vs. standard tries • Use a linked list instead of a table at the node level • Each pointer labeled with the character it indexes • longer search time than tries; depends on size of character set • saves significant amounts of memory
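A de la Briandais tree can be sketched with a child/sibling representation: each node carries one character, a pointer to the next alternative at the same level, and a pointer to the list for the next level. The names below (`DLBNode`, `DLBTree`, `is_key`) are hypothetical, and the `is_key` flag stands in for an end-of-string entry.

```python
class DLBNode:
    def __init__(self, char):
        self.char = char       # the character this node is labeled with
        self.sibling = None    # next alternative character at this level
        self.child = None      # head of the list for the next level
        self.is_key = False    # True if a key ends at this node


class DLBTree:
    def __init__(self):
        self.root = DLBNode(None)  # dummy node; its child list holds first characters

    def _find(self, parent, ch, create):
        # Linear scan of the sibling list: this is where the search time
        # picks up its dependence on the size of the character set.
        prev, node = None, parent.child
        while node is not None and node.char != ch:
            prev, node = node, node.sibling
        if node is None and create:
            node = DLBNode(ch)
            if prev is None:
                parent.child = node
            else:
                prev.sibling = node
        return node

    def insert(self, key):
        node = self.root
        for ch in key:
            node = self._find(node, ch, create=True)
        node.is_key = True

    def member(self, key):
        node = self.root
        for ch in key:
            node = self._find(node, ch, create=False)
            if node is None:
                return False
        return node.is_key


d = DLBTree()
for name in ("MENDEL", "MENDELEEV", "TURING"):
    d.insert(name)
```

Only characters that actually occur get a node, so no space is spent on null table slots. Each step down the tree may have to walk a list instead of indexing a table.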
Another strategy … • Use tries at the first few levels • Use ordinary BSTs or de la Briandais at the lower levels • reasoning: • speed advantage at the top, but not too much extra memory required • save space at lower levels
Digital Search Trees • Treat keys as bit strings • (strings over the alphabet {0,1}) • Binary tree – search directed left on 0, right on 1 • Each node contains not only two pointers, but also contains a key that matches that string prefix • Compare for equality before searching left or right • If frequencies are known, store higher frequency keys nearer root • Can be grown dynamically • Expected Search time: O(log n)
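A digital search tree can be sketched as follows, assuming keys are non-negative integers of a fixed bit width that are read from the most significant bit down. The class names and the `width` parameter are hypothetical, and the frequency-based placement mentioned above is not modeled.

```python
class DSTNode:
    def __init__(self, key):
        self.key = key
        self.left = None   # followed when the current bit is 0
        self.right = None  # followed when the current bit is 1


class DigitalSearchTree:
    def __init__(self, width=8):
        self.width = width  # assumed fixed key length in bits
        self.root = None

    def _bit(self, key, depth):
        # Bit number `depth`, counting from the most significant bit.
        return (key >> (self.width - 1 - depth)) & 1

    def _in_range(self, key):
        return 0 <= key < (1 << self.width)

    def insert(self, key):
        if not self._in_range(key):
            raise ValueError(key)
        if self.root is None:
            self.root = DSTNode(key)
            return
        node, depth = self.root, 0
        while True:
            if node.key == key:  # compare for equality before branching
                return
            side = "right" if self._bit(key, depth) else "left"
            nxt = getattr(node, side)
            if nxt is None:
                setattr(node, side, DSTNode(key))
                return
            node, depth = nxt, depth + 1

    def member(self, key):
        if not self._in_range(key):
            return False
        node, depth = self.root, 0
        while node is not None:
            if node.key == key:
                return True
            node = node.right if self._bit(key, depth) else node.left
            depth += 1
        return False


dst = DigitalSearchTree(width=4)
for k in (0b1010, 0b0011, 0b1100, 0b1011):
    dst.insert(k)
```

Every key stored at depth d agrees with the path to it on its first d bits, so a search never has to look at more than `width` nodes.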