200 likes | 423 Views
Efficient in-memory indexing with Generalized Prefix trees. Presented by: Poorna Chandra Veladi (pv13b) Date: April 17,2014. Outline. Motivation Generalized Prefix tree Optimization Techniques Conclusion. Motivation. Sorted Arrays:
E N D
Efficient in-memory indexing with Generalized Prefix trees Presented by: Poorna Chandra Veladi(pv13b) Date: April 17,2014
Outline • Motivation • Generalized Prefix tree • Optimization Techniques • Conclusion
Motivation • Sorted Arrays: • Often sorted arrays are used in combination with binary search • Although sorted arrays are advantageous for several application scenarios, they fall short on update-intensive workloads due to the need for maintaining the sorted order of records. • Tree-Based Structures: • Balanced tree-structures are used for in-memory indexing. • All of those balanced tree structures exhibit logarithmic time complexity of O(log N) in terms of accessed nodes for all operations. • Additionally, they require a total number of O(log N) key comparisons.
Motivation • Hash-based structures : • These structures rely on a hash function to determine the slot of a key within the hash table • No reorganization is required because the size of the hash table is fixed. However, in the worst case, it degenerates to a linked list and thus, the worst-case search time complexity is O(N) • Tree-Based Structures: • Prefix trees are mainly used for string indexing, where each node holds an array of references according to the used character alphabet and therefore they were neglected for in-memory indexing of arbitrary data types . • the worst-case time complexity is O(k) for all operations because, in general, the number of accessed nodes is independent of the number of records N • Compared to tree-based structures, more nodes are accessed because k >= log N, where k = log N only if N’ = N.
Generalized Trie • The generalized trie is a prefix tree with variable prefix length of k’ bits. We define that k’ = 2i, where iεZ, and k’ must be a divisor of the maximum key length k. • Given an arbitrary data type of length k and a prefix length k’, the trie exhibits a fixed height h = k/k’ • Each node of the trie includes an array of s = 2k’pointers (node size).
Generalized Trie • Deterministic Property: • Each key has only one path within the trie that is independent of any other keys. • Due to these deterministic trie paths, only a single key comparison and no dynamic reorganization are required for any operations • Worst-case Time Complexity: • Based on the given key type of length k, the generalized trie has a time complexity of O(h) • Due to the dynamic configuration of k’, we access at most h = k/k’ different nodes and compare at most O(1) keys for any operations.
Physical Data Structure- Generalized Trie • The IXByte is designed as a hierarchical data structure in order to leverage the deterministic property of the generalized trie. • A single index (L0Item) is created with the data type length as a parameter. • In order to support also duplicates of keys, we use a further hierarchical structure of key partitions ki (L1Item) and payloads αi(L2Items)
Generalized Trie- Problems • Trie Height: • a VARCHAR(256) index with k’ = 4 would result in a trie height h = 512. • To ensure an order-preserving deterministic structure, keys with a variable key length k’ smaller than h are logically padded with zeros at the end. • The advantage of tries - in the sense of constant time for all operations - might be a disadvantage because the constant number of nodes to be accessed is really high. • Memory Consumption: • Due to the fixed trie height, an inserted key can cause the full expansion (creation of trie nodes) of all levels of the trie if no prefix can be reused. • In the worst case, an insert causes the creation of h-1 nodes.
Bypass Jumper Array • The core concept of the technique bypass jumper array is to bypass trie nodes for leading zeros of a key. • Preallocate all direct 0-pointers of the complete trie • Create a jumper-array of size h, where each cell points to one of those preallocatednodes. • Use the jumper array to bypass higher trie levels if possible.
Bypass Jumper Array • First, for data types with variable-length |ki|(e.g., VARCHAR), we change the index layout. • Keys ki with a length smaller than h are now logically padded with zeros at the beginning and thus, we loose the property of being order-preserving. However, if this is acceptable the benefit is significant. • The number of accessed trie nodes is reduced from h to h/2 in expectation. • This causes significant performance improvements due to fewer accessed nodes
Bypass Jumper Array • This technique also works for fixed-length data types such as INT, by counting leading zeros and without any index layout modification. • Unfortunately, for uniformly distributed keys, the impact is lower because the probability of having a key length |ki| < k depends on k’ with P([|ki|/k’] < k/k’) < 1/2k’ . • The probability of a specific length is given by P([|ki|/k’] = k/k’-x) = (1/2k’)x , where x denotes the number of nodes that can be bypassed.
Trie Expansion • The core concept is to defer the access to and creation of trie nodes during insert operations until it is required with respect to the deterministic property. • It is possible to reference a record instead of an inner node if and only if there is only one record that exhibits a particular prefix. • For data types of variable length (e.g.,VARCHAR), the logical padding with zeros ensures the deterministic property even in case of trie expansion.
Trie Expansion • This technique does not only bypass existing nodes but influences the number of created nodes . • It significantly reduces the trie height and therefore also the memory consumption for sparsely populated subtries. • A node is now defined as an array of s = 2k’void* pointers and an array of [s/8] bytes to signal the expansion i.e., now a single node has a size of size(node) = 2k’ .size(ptr) + [2k’/8 ] = 2k’ .size(ptr) + [2k’-3/8 ]. • The downside of this technique are costs for splitting and merging of nodes. • The node size is no longer a factor or divisor of the cache line size
Evaluation Results • The IXByte achieves high performance for arbitrary data types, where an increasing key length causes only moderate overhead. • It turns out that the IXByte is particularly appropriate for skewed data. • A variable prefix length (specific to our IXByte) of k’ = 4 turned out to be most efficient and robust due to • a node size equal to the cache line size (64 byte) • a good trade-off between trie height and node utilization • The IXByte shows improvements compared with a B+-tree and a T-tree on different data sets. Most importantly, it exhibits linear scaling on skewed data. Even on uniform data (worst case) it is comparable to a hash map.
Conclusion • Prefix tree with certain modifications make them applicable for in-memory indexing. • With deterministic property that eliminates the need for dynamic reorganization (prefix splits) and multiple key comparisons and optimization techniques that reduce the number of accessed and created trie nodes these data structures are as good as their counterparts. • There is wide scope for research • adaptive determination of the optimal prefix length k0 during runtime. • hybrid solution with different prefix lengths on different levels of a generalized trie. • query processing on prefix trees, where database operations such as joins or set operations can exploit the deterministic trie paths.