the BSTree<TE, KF> class

the BSTree<TE, KF> class • BSTreeNode has same structure as binary tree nodes • elements stored in a BSTree are a key-value pair • must be a class (or a struct) which has • a data member for the value • a data member for the key • a method with the signature: KF key( ) const; where KF is the type of the key

an example struct treeItem { int id; // key string data; // value int key( ) const { return id; } }; BSTree<treeItem, int> myBSTree;

basic BST search algorithm void search (bstree, searchKey) { if (bstree is empty) //base case: item not found // take needed action else if (key in bstree's root == search Key) // base case: item found // take needed action else if (searchKey < key in bstree's root ) search (leftSubtree, searchKey); else search (rightSubtree, searchKey); }

deletion cases • item to be deleted is in a leaf node • pointer to its node (in parent) must be changed to NULL • item to be deleted is in a node with one empty subtree • pointer to its node (in parent) must be changed to the non-empty subtree • item to be deleted is in a node with two non-empty subtrees

36 20 42 45 12 24 39 21 40 the easy cases

36 20 42 45 12 24 39 21 40 the “hard” case

36 20 42 45 12 24 39 21 40 the “hard” case replace with smallest in right subtree (inorder successor) replace with largest in left subtree (inorder predecessor)

traversing a binary search tree • can use any of the binary tree traversal orders – preorder, inorder, postorder • base case is reaching an empty tree • inorder traversal visits the elements in order of their key values • how would you visit the elements in descending order of key values?

big Oh of BST operations • measured by length of the search path • depends on the height of the BST • height determined by order of insertion • height of a BST containing n items is • minimum: floor (log2 n) • maximum: n - 1 • average: ?

faster searching • "balanced" search trees guarantee O(log2 n) search path by controlling height of the search tree • AVL tree • 2-3-4 tree • red-black tree (used by STL associative container classes) • hash table allows for O(1) search performance • search time does not increase as n increases

Hash Table • a hash table is an array of size Tsize • has index positions 0 .. Tsize-1 • two types of hash tables (Nyhoff – Ch.9.3) • open hash table • array element type is a <key, value> pair • all items stored in the array • chained hash table • element type is a pointer to a linked list of nodes containing <key, value> pairs • items are stored in the linked list nodes • keys are used to generate an array index • home address (0 .. Tsize-1)

Considerations • How big an array? • load factor of a hash table is n/Tsize • Hash function to use? • int hash(KeyType key) -> 0 .. Tsize-1 • Collision resolution strategy? • hash function is many-to-one

Hash Function • a hash function is used to map a key to an array index (home address) • search starts from here • insert, retrieve, update, delete all start by applying the hash function to the key

Some hash functions • if KeyType is int - key % TSize • if KeyType is a string - convert to an integer and then % Tsize • goals for a hash function • fast to compute • even distribution • cannot guarantee no collisions unless all key values are known in advance

An Open Hash Table Hash (key) produces an index in the range 0 to 6. That index is the “home address” 0 1 2 3 4 5 6 K3 K3info K1 K1info Some insertions: K1 --> 3 K2 --> 5 K3 --> 2 K2 K2info key value

Handling Collisions 0 1 2 3 4 5 6 K6 K6info Some more insertions: K4 --> 3 K5 --> 2 K6 --> 4 K3 K3info K1 K1info K4 K4info K2 K2info Linear probing collision resolution strategy K5 K5info

Search Performance Average number of probes needed to retrieve the value with key K? 0 1 2 3 4 5 6 K6 K6info K hash(K) #probes K1 3 1 K2 5 1 K3 2 1 K4 3 2 K5 2 5 K6 4 4 K3 K3info K1 K1info K4 K4info K2 K2info 14/6 = 2.33 (successful) K5 K5info unsuccessful search?

0 1 2 3 4 5 6 K3 K3info K5 K5info K1 K1info K4 K4info K6 K6info K2 K2info linked lists of synonyms A Chained Hash Table insert keys: K1 --> 3 K2 --> 5 K3 --> 2 K4 --> 3 K5 --> 2 K6 --> 4

0 1 2 3 4 5 6 K3 K3info K5 K5info K1 K1info K4 K4info K6 K6info K2 K2info Search Performance Average number of probes needed to retrieve the value with key K? K hash(K) #probes K1 3 1 K2 5 1 K3 2 1 K4 3 2 K5 2 2 K6 4 1 8/6 = 1.33 (successful) unsuccessful search?

successful search performance open addressing open addressing chaining (linear probing) (double hashing) load factor 0.5 1.50 1.39 1.25 0.7 2.17 1.72 1.35 0.9 5.50 2.56 1.45 1.0 ---- ---- 1.50 2.0 ---- ---- 2.00

Factors affecting Search Performance • quality of hash function • how uniform? • depends on actual data • collision resolution strategy used • load factor of the HashTable • N/Tsize • the lower the load factor the better the search performance

Traversal • Visit each item in the hash table • Open hash table • O(Tsize) to visit all n items • Tsize is larger than n • Chained hash table • O(Tsize + n) to visit all n items • Items are not visited in order of key value

Deletions? • search for item to be deleted • chained hash table • find node and delete it • open hash table • must mark vacated spot as “deleted” • is different than “never used”

Hash Table Summary • search speed depends on load factor and quality of hash function • should be less than .75 for open addressing • can be more than 1 for chaining • items not kept sorted by key • very good for fast access to unordered data with known upper bound • to pick a good TSize

the BSTree<TE, KF> class