Sets of Digital Data

Sets of Digital Data CSCI 2720 Fall 2005 Kraemer

Digital Data
• In earlier work with BSTs and various balanced trees, we compared keys for order or equality
• Here, we take advantage of the structure of the key:
  • Use it as an index, or
  • Decompose a string key into characters, or
  • Treat the key as a numerical quantity on which we can perform operations

Assumptions
• We will construct and manipulate sets that are drawn from a universe U of size N
  • U = {u_0, …, u_{N-1}}
• A relatively simple procedure exists by which we can compute, for an element u ∈ U, the index i such that u = u_i
  • Easy if U is a set of integers
  • Also easy if U is a set of characters with character codes in a contiguous interval

Bit Vector
• Used to represent a subset S ⊆ U
• A table of N bits, Bits[0..N-1]
  • Bits[i] == 1 if u_i ∈ S
  • Bits[i] == 0 if u_i ∉ S
• Example: today's attendance (1 = present, 0 = absent)

  student number:  0 1 2 3 4 5 6
  Bits:            1 1 0 1 0 1 1

Bit Vectors
• Assume:
  • Determining the index of an element takes constant time
  • Accessing a position in the table takes constant time
  • Both may actually take several operations and depend somewhat on N (the size of the universe), but not on the size of the set represented
• Then:
  • Insert, Delete, and Member are constant-time operations
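A minimal sketch of these three operations, assuming elements have already been mapped to their indices 0..N-1. The class name `BitVector` and the byte-packed layout are illustrative choices, not something the slides prescribe.

```python
class BitVector:
    """Subset of a universe {u_0, ..., u_{N-1}}, stored as N bits."""

    def __init__(self, universe_size):
        self.n = universe_size
        # One byte holds 8 bits; round up so every index has a slot.
        self.bits = bytearray((universe_size + 7) // 8)

    def _locate(self, i):
        if not 0 <= i < self.n:
            raise IndexError("index outside the universe")
        # Byte offset, and a mask selecting the bit inside that byte.
        return i >> 3, 1 << (i & 7)

    def insert(self, i):
        byte, mask = self._locate(i)
        self.bits[byte] |= mask

    def delete(self, i):
        byte, mask = self._locate(i)
        self.bits[byte] &= ~mask & 0xFF

    def member(self, i):
        byte, mask = self._locate(i)
        return bool(self.bits[byte] & mask)


# Attendance example: students 0, 1, 3, 5 and 6 are present.
attendance = BitVector(7)
for student in (0, 1, 3, 5, 6):
    attendance.insert(student)
```

Each operation touches exactly one byte, so its cost does not depend on how many elements the set holds.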

Bit Vectors
• A subset of a set of size N always takes N bits to represent, independent of the size of the subset
• Makes sense if:
  • N is not too large
  • We need to represent sets of size comparable to N

Storage Efficiency
• Bit Vector vs. Binary Trees
• Binary tree, set of size n
  • Requires n(2p + K) bits
  • K >= lg N, the size of the field that represents a key value
  • p = number of bits in a pointer
• Bit vector takes N bits
• If n is close to N, then the bit vector is more efficient
• If p = K = 32, then the tree becomes more space efficient when n/N falls below about 1%
  • Actually, the break-even point is n(2p + K) = N, which is when n/N = 1/96
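The break-even arithmetic can be checked with a few lines of code. This is only a sketch of the formulas on this slide; the function names are made up for illustration.

```python
from fractions import Fraction


def tree_bits(n, p, K):
    """Bits used by a binary tree of n nodes: two pointers and one key per node."""
    return n * (2 * p + K)


def bit_vector_bits(N):
    """Bits used by a bit vector over a universe of size N."""
    return N


def break_even_fraction(p, K):
    """The ratio n/N at which n(2p + K) = N."""
    return Fraction(1, 2 * p + K)


# With p = K = 32 the two representations cost the same when n/N = 1/96.
N = 96_000
n = N // 96
same_cost = tree_bits(n, 32, 32) == bit_vector_bits(N)
```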

When to use Bit Vectors?
• When the universe is relatively small
• When sets are large in relation to the size of the universe

Advantages of Bit Vectors
• O(1) implementation of Insert, Delete, Member
• Union and Intersection are easy
  • Implement via Boolean "and" and "or" operations
  • May actually take less than one operation per element, as operations are performed on a full machine word
  • If the machine word is 32 bits, then one machine operation handles 32 potential elements of the set
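As a rough illustration, a Python integer can stand in for the bit vector, so that `|` and `&` process many elements per operation. The helpers `to_bits` and `to_indices` are hypothetical names, and Python integers are arbitrary precision rather than fixed 32-bit words, so this shows the idea rather than the machine-level cost.

```python
def to_bits(indices):
    """Pack a collection of element indices into one integer bit vector."""
    bits = 0
    for i in indices:
        bits |= 1 << i
    return bits


def to_indices(bits):
    """List the indices whose bit is set, in increasing order."""
    return [i for i in range(bits.bit_length()) if bits >> i & 1]


a = to_bits([0, 1, 3, 5, 6])
b = to_bits([1, 2, 3, 4])

union = a | b          # Boolean "or" of the two vectors
intersection = a & b   # Boolean "and" of the two vectors
```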

Disadvantages of Bit Vectors
• On some computers, access to individual bits can require shifting and masking operations (expensive)
  • The result is that Member may be much more expensive than Union
• Initialization takes Θ(N): zero all the bits in the vector
  • We can use a constant-time initialization algorithm instead
  • But that makes the storage requirement go to 2p + 1 bits per element
  • So, in practice, just use machine operations to set the bits to zero, which are efficient
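The slides do not say which constant-time initialization algorithm is meant. One well-known scheme that matches the 2p + 1 bits per element figure keeps two pointer-sized fields next to each bit: a back pointer into a stack, and a stack of the positions written so far. The sketch below assumes that scheme. Python cannot allocate truly uninitialized memory, so random garbage is used to simulate it, and the name `LazyBitVector` is made up.

```python
import random


class LazyBitVector:
    def __init__(self, universe_size, seed=0):
        n = universe_size
        rng = random.Random(seed)
        # Simulated uninitialized memory: every array starts as garbage.
        self.bits = [rng.randrange(2) for _ in range(n)]
        self.back = [rng.randrange(n) for _ in range(n)]   # p bits per element
        self.stack = [rng.randrange(n) for _ in range(n)]  # p bits per element
        self.top = 0  # the only field that must really be initialized

    def _valid(self, i):
        # bits[i] is trustworthy only if the stack vouches for position i.
        j = self.back[i]
        return j < self.top and self.stack[j] == i

    def member(self, i):
        return self._valid(i) and self.bits[i] == 1

    def insert(self, i):
        if not self._valid(i):
            # First write to position i: record it on the stack.
            self.back[i] = self.top
            self.stack[self.top] = i
            self.top += 1
        self.bits[i] = 1

    def delete(self, i):
        # A position that was never written is already absent.
        if self._valid(i):
            self.bits[i] = 0


v = LazyBitVector(16)
v.insert(5)
```

A garbage back pointer can never pass the check, because every stack slot below `top` was written by an insert and names the position that owns it.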

Tries and Digital Search Trees
• If the key can be decomposed into characters, then the characters of the key can be used as indices
• Tries are based on this idea
• "Trie" comes from the middle of "retrieval"; it is a pun on "tree", but is pronounced "try"

Tries
• Assume k possible character values
• A trie is a (k+1)-ary tree
  • Each node is a table of k+1 pointers
  • One pointer for each possible character
  • One for the end-of-string character

Trie Example

Tries
• The path for a key of m characters has length m, with a pointer at the end-of-string entry of the last node
• We don't need to store the key itself: it is the path followed
• The info field might be pointed to by the end-of-string element
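A minimal sketch of such a trie, assuming keys are made of the uppercase letters A..Z (so k = 26) and that slot k of each node plays the role of the end-of-string entry. The names `TrieNode`, `insert` and `lookup` are illustrative.

```python
K = 26    # alphabet size k: uppercase letters A..Z (an assumption)
END = K   # slot k is reserved for the end-of-string character


class TrieNode:
    def __init__(self):
        # A table of k+1 pointers; most of them stay None.
        self.child = [None] * (K + 1)


def insert(root, key, info=True):
    node = root
    for ch in key:
        i = ord(ch) - ord("A")
        if node.child[i] is None:
            node.child[i] = TrieNode()
        node = node.child[i]
    # The key itself is never stored; the end-of-string slot holds the info.
    node.child[END] = info


def lookup(root, key):
    """Return the info stored for key, or None if key is absent."""
    node = root
    for ch in key:
        node = node.child[ord(ch) - ord("A")]
        if node is None:
            return None
    return node.child[END]


root = TrieNode()
for word in ("TURING", "MENDEL", "MENDELEEV"):
    insert(root, word)
```

Looking up a key of m characters follows exactly m pointers, however many keys the trie holds.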

Tries: Analysis
• Let:
  • n be the number of keys stored in a trie
  • l be the length (in characters) of the longest key
  • s be the number of nodes in the trie
  • k be the size of the alphabet
• Pro:
  • Access time is O(l), independent of k, n and s
• Con:
  • Size: requires (k+1) * s * p bits
  • Most pointers are null, so lots of wasted space

Strategies for reducing storage requirements of tries
• Implement a k-ary trie with m nodes as a 2-D, m by k table
(Table figure: columns labeled A, B, C, D, E, …, M, …, P, …, T, … and the end-of-string character; rows labeled 0, 1, 2, 3, 4, 5)

Table approach
• Number the nodes in the diagram of slide 13 from 1 to m
• The table entry corresponding to the jth child of the ith node is the index of the child node
• How does that save space? There are just as many nodes and elements as on slide 13
  • But each entry needs only ceil(lg(m)) bits to represent, which is smaller than a pointer
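The table form can be sketched as a list of rows of small integers. This sketch assumes the alphabet A..Z, numbers nodes from 1, and reserves 0 to mean "no child", which is one possible convention rather than the one the slides necessarily use. The end-of-string column holds a 0/1 flag here instead of a pointer to an info field.

```python
K = 26    # alphabet size: uppercase letters A..Z (an assumption)
END = K   # extra column for the end-of-string character


def new_table():
    # Row 0 is unused so that 0 can mean "no child"; row 1 is the root.
    return [[0] * (K + 1), [0] * (K + 1)]


def insert(table, key):
    node = 1
    for ch in key:
        col = ord(ch) - ord("A")
        if table[node][col] == 0:
            table.append([0] * (K + 1))       # allocate the next node number
            table[node][col] = len(table) - 1
        node = table[node][col]
    table[node][END] = 1  # a flag, not a node index


def member(table, key):
    node = 1
    for ch in key:
        node = table[node][ord(ch) - ord("A")]
        if node == 0:
            return False
    return table[node][END] == 1


table = new_table()
for word in ("TURING", "MENDEL", "MENDELEEV"):
    insert(table, word)

m = len(table) - 1            # number of trie nodes
entry_bits = m.bit_length()   # bits that suffice for any index in 0..m
```

Because every entry is a node number rather than an address, its width grows with lg m instead of being fixed at the pointer size p.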

Patricia Tree: Another strategy for reducing space in a trie
• Patricia tree
  • Practical Algorithm to Retrieve Information Coded in Alphanumeric
• Eliminate nodes with only one nonempty child
  • Can now skip right from T to the end-of-string character in TURING in our example
  • Skip from MA …. to E or the end-of-string character in the MENDEL, MENDELEEV chain
• But need to store with each node the index of the character on which it discriminates
• And need to store the key itself at the leaf
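A simplified sketch of the idea. To stay short it builds the tree from a complete set of keys instead of inserting them one at a time, uses "$" as a stand-in end-of-string character, and keeps children in a dictionary rather than a table; all three are assumptions made for illustration, as are the names `Node`, `Leaf`, `build` and `member`.

```python
END = "$"  # stand-in end-of-string character (an assumption)


class Leaf:
    def __init__(self, key):
        self.key = key  # the full key must be stored at the leaf


class Node:
    def __init__(self, index, children):
        self.index = index        # position of the character this node tests
        self.children = children  # character -> subtree


def build(keys, start=0):
    """Build a tree from distinct keys that each end in END."""
    if len(keys) == 1:
        return Leaf(keys[0])
    # Skip every position on which all keys agree.
    i = start
    while len({k[i] for k in keys}) == 1:
        i += 1
    groups = {}
    for k in keys:
        groups.setdefault(k[i], []).append(k)
    return Node(i, {c: build(g, i + 1) for c, g in groups.items()})


def member(root, word):
    key = word + END
    node = root
    while isinstance(node, Node):
        if node.index >= len(key):
            return False
        node = node.children.get(key[node.index])
        if node is None:
            return False
    # Skipped characters were never examined, so compare the whole key.
    return node.key == key


root = build([w + END for w in ("TURING", "MENDEL", "MENDELEEV")])
```

With these three keys the root discriminates on character 0, the T branch leads straight to a leaf, and the M branch jumps to character 6, where MENDEL ends and MENDELEEV continues with E.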

Patricia tree

de la Briandais trees
• Another strategy to save space vs. standard tries
• Use a linked list instead of a table at the node level
  • Each pointer is labeled with the character it indexes
• Longer search time than tries; it depends on the size of the character set
• Saves significant amounts of memory
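A minimal sketch of the linked-list layout, assuming "$" as a stand-in end-of-string character. Each node carries its character label, a pointer to the next alternative at the same level, and a pointer to the list for the next level; the names are illustrative.

```python
END = "$"  # stand-in end-of-string character (an assumption)


class DLBNode:
    def __init__(self, char):
        self.char = char      # the character this entry is labeled with
        self.sibling = None   # next alternative at the same level
        self.child = None     # head of the list for the next character


def _find(parent, ch):
    """Scan the list hanging below parent for the entry labeled ch."""
    cur = parent.child
    while cur is not None and cur.char != ch:
        cur = cur.sibling
    return cur


def insert(root, word):
    node = root
    for ch in word + END:
        nxt = _find(node, ch)
        if nxt is None:
            nxt = DLBNode(ch)
            nxt.sibling = node.child  # push onto the front of the list
            node.child = nxt
        node = nxt


def member(root, word):
    node = root
    for ch in word + END:
        node = _find(node, ch)
        if node is None:
            return False
    return True


root = DLBNode(None)  # dummy root; its child list holds the first characters
for word in ("TURING", "MENDEL", "MENDELEEV"):
    insert(root, word)
```

Only characters that actually occur get an entry, which is where the memory saving comes from; the price is the linear scan in `_find`, whose length is bounded by the size of the character set.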

de la Briandais

Another strategy …
• Use tries at the first few levels
• Use ordinary BSTs or de la Briandais trees at the lower levels
• Reasoning:
  • Speed advantage at the top, but not too much extra memory required
  • Save space at the lower levels

Digital Search Trees
• Treat keys as bit strings (strings over the alphabet {0,1})
• Binary tree: search is directed left on 0, right on 1
• Each node contains not only two pointers, but also a key whose prefix matches the path to that node
  • Compare for equality before searching left or right
• If frequencies are known, store higher-frequency keys nearer the root
• Can be grown dynamically
• Expected search time: O(log n)
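A minimal sketch, assuming keys are non-negative integers that fit in W = 8 bits and that bits are examined from the most significant end. The width, the bit order, and the names below are choices made for this example.

```python
W = 8  # key width in bits (an assumption for this sketch)


class DSTNode:
    def __init__(self, key):
        self.key = key
        self.left = None    # followed when the current bit is 0
        self.right = None   # followed when the current bit is 1


def bit(key, depth):
    """Bit of key examined at the given depth, most significant bit first."""
    return key >> (W - 1 - depth) & 1


def insert(root, key):
    """Insert key and return the (possibly new) root."""
    if root is None:
        return DSTNode(key)
    node, depth = root, 0
    while node.key != key:              # compare for equality first
        side = "right" if bit(key, depth) else "left"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, DSTNode(key))
            return root
        node, depth = child, depth + 1
    return root


def search(root, key):
    node, depth = root, 0
    while node is not None and node.key != key:
        node = node.right if bit(key, depth) else node.left
        depth += 1
    return node is not None


root = None
for key in (0b10110000, 0b00101100, 0b11100001, 0b10010011, 0b00000111):
    root = insert(root, key)
```

The first key inserted becomes the root, so inserting frequent keys first is one simple way to place them near the top.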

Digital Search Tree
