CS503: Eighth Lecture, Fall 2008. Binary Trees. Michael Barnathan
Project Ideas • General idea: something that won’t take more than a couple of weeks but that you can use in a portfolio. These are just my ideas; feel free to come up with your own. • Web crawler. • Sockets, Trees, Recursion. • Caveats: dead links, status codes, redirects. • Fast file indexer based on frequent words. • File I/O, Trees, Analysis of Algorithms. • Compression (storing the index). • Works well because of Zipf/power law distributions. • Caveats: binary files, errors opening files. Make sure you open read-only. • Trend analysis tool using regression. • Sorting, Recursion, Analysis. • Predictors: I can help you with these. • AI opponent for a simple game (such as checkers or Reversi) • Trees, Storage, Recursion, Heuristic Search, Analysis of Algorithms. • Distributed command server (to order around a bunch of machines). • Sockets, File I/O, Priority Queues (especially for synchronization), Analysis of Algorithms.
Grading • Not having exams is going to require adjusting the percentages a bit. • Assignments 40%, Project 30%, Labs 20%, Participation 10%?
Here’s what we’ll be learning: • Data structures: • Binary Trees. • Binary Search Trees. • Theory: • Tree traversals. • Preorder. • Inorder. • Postorder. • Balanced and complete trees. • Recursion on Binary Trees.
Linear Structures • Arrays, Linked Lists, Stacks, and Queues are linear data structures. • Even circularly linked lists. • One element follows another: there is always just one “next” element. • As we mentioned, these usually yield recurrences of the form T(n) = T(n-1) + f(n). • What if we violate this assumption? What if a structure had two next elements? • In a way, we started with the most restrictive structure (arrays), which we are progressively relaxing.
Frankenstein’s Data Structures It hardly even makes sense to talk about a node having more than one successor in an array. What would data[2] be here? It’s not clear. Linked lists make more sense, but what is 1->next now? We need more information. [Figure: the nodes 1 through 7, where some nodes have two successors.]
Binary Trees • There are now two next nodes, not one. • Clearly, we need two pointers to model them. • So let’s rotate that list… • Let’s call the pointers “left” and “right”. • This structure is known as a binary tree. • Binary: branches into two nodes. • Higher-order trees exist too; we’ll talk about these later. • Why we call it a tree should be obvious. [Figure: the tree rooted at 1, whose “Left” and “Right” pointers lead to 2 and 3; 2’s children are 4 and 5, and 3’s children are 6 and 7.]
Nomenclature • Some from botany, some from genealogy. • Root: The “highest” node in the tree; that is, the one without a parent. • Almost all tree algorithms start at the root. • Child: A node directly below the current node. Traversing the left or right pointers will bring you to a node’s “left” or “right” child. • Subtree: A child together with all of its descendants. • Leaf: A node at the “bottom” of the tree; i.e. one without children (or really with two null children). • Parent: The node directly above. • Siblings: Nodes with the same parent. • “Complete” tree: a tree where every node has either 0 or 2 children and all leaves are at the same level. (Basically, it’s fully “filled in”. Many texts call this a “perfect” tree and use “complete” for a tree whose last level may be only partly filled, from left to right.)
Recursive Definition • Binary trees have a nice recursive definition: • A binary tree is a value, a left binary tree, and a right binary tree. • Thus, each individual node is itself a tree. • Base case: the empty tree. • Leaves’ left and right children are both empty. • We usually represent this with nulls.
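The recursive definition above maps almost word for word onto a Java class. The following is a minimal sketch, not the lecture's own code: the wrapper class `RecursiveDefinitionDemo`, the constructors, and the helpers `size` and `exampleTree` are names assumed here for illustration. Only the field names `value`, `left`, and `right` follow the code on the later slides.

```java
class RecursiveDefinitionDemo {
    // A binary tree is a value, a left binary tree, and a right binary tree.
    // The empty tree (the base case) is represented by null.
    static class BinaryTree {
        int value;
        BinaryTree left;
        BinaryTree right;

        BinaryTree(int value, BinaryTree left, BinaryTree right) {
            this.value = value;
            this.left = left;
            this.right = right;
        }

        // A leaf: both children are empty trees.
        BinaryTree(int value) {
            this(value, null, null);
        }
    }

    // Counts nodes by following the recursive definition directly.
    static int size(BinaryTree node) {
        if (node == null) return 0;            // Empty tree.
        return 1 + size(node.left) + size(node.right);
    }

    // The seven-node tree used in the lecture's illustrations.
    static BinaryTree exampleTree() {
        return new BinaryTree(1,
            new BinaryTree(2, new BinaryTree(4), new BinaryTree(5)),
            new BinaryTree(3, new BinaryTree(6), new BinaryTree(7)));
    }

    public static void main(String[] args) {
        System.out.println(size(exampleTree())); // 7
        System.out.println(size(null));          // 0
    }
}
```

Because every node is itself a tree, `size` needs no special handling for leaves: a leaf is simply a node whose two subtrees are empty.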
Node Access • All you must store is a reference to the root. • You can get to the rest of the nodes by traversing the tree. • Example: Accessing 5. • There’s a problem. • Anyone see it? [Figure: the tree rooted at 1; 1’s children are 2 and 3; 2’s children are 4 and 5; 3’s children are 6 and 7.]
Traversal • We have no way of knowing where 5 is. • In order to find it, we need to check every node in the tree. • So what’s the complexity of access? • “Check every” should ring alarm bells by now. • This is a nonlinear data structure, so we have more than one way to traverse, however. • There are three common tree traversals: • Preorder, inorder, and postorder. • There are some more exotic ones, too: traversals based on pointer inversion, threaded traversal, Robson traversal… • Since binary trees are recursive structures, tree algorithms are usually recursive. Traversals are no exception. • Remember how we reversed the output of printTo(n) by moving the output above or below the recursive call? • It turns out you can change the order of the traversal in the same way.
Traversals and “Visiting” • You can do anything to a node inside of a tree traversal algorithm! • You certainly can search for a value. • But you can also output its value, modify its value, insert a node there, etc. • This generic action is simply called “visiting” the node when discussing traversals.
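One way to make "visiting" concrete is to pass the action into the traversal as a parameter. The sketch below assumes a `Consumer<BinaryTree>` callback from `java.util.function`; the names `VisitDemo`, `traverse`, and `smallTree` are hypothetical, and visiting a node before its children is an arbitrary choice made for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class VisitDemo {
    static class BinaryTree {
        int value;
        BinaryTree left, right;
        BinaryTree(int value) { this.value = value; }
    }

    // Walks the whole tree and hands each node to the caller's "visit" action.
    static void traverse(BinaryTree node, Consumer<BinaryTree> visit) {
        if (node == null) return;   // Ran out of nodes.
        visit.accept(node);         // The generic "visit".
        traverse(node.left, visit);
        traverse(node.right, visit);
    }

    // A three-node tree: 1 at the root, 2 on the left, 3 on the right.
    static BinaryTree smallTree() {
        BinaryTree root = new BinaryTree(1);
        root.left = new BinaryTree(2);
        root.right = new BinaryTree(3);
        return root;
    }

    public static void main(String[] args) {
        BinaryTree root = smallTree();

        // Visit = output the value.
        traverse(root, n -> System.out.println(n.value));

        // Visit = modify the value.
        traverse(root, n -> n.value *= 10);

        // Visit = collect the values into a list.
        List<Integer> seen = new ArrayList<>();
        traverse(root, n -> seen.add(n.value));
        System.out.println(seen); // [10, 20, 30]
    }
}
```

The traversal code never changes; only the action passed in does.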
Preorder Traversal • If I gave you a binary tree and asked you to search for an element, how would you go about it? • You would check the value. • You would search the left subtree. • You would search the right subtree. • This is how preorder traversal works. • We check/output/do something with the node value. • We recurse on the left subtree. • We recurse on the right subtree. • We stop once we run out of nodes. • This is also called depth-first traversal, because it first focuses on traversing down specific nodes before broadly visiting others. • Like taking a lot of CS courses first before satisfying a core curriculum. • Preorder traversal is a special case of depth-first search on data structures called graphs, which we will discuss soon.
Preorder Traversal
BinaryTree preorder(BinaryTree node, int targetvalue) {
    if (node == null) //Base case.
        return null;
    else if (node.value == targetvalue) //Found it.
        return node;
    BinaryTree lchild = preorder(node.left, targetvalue); //Traverse left.
    if (lchild != null)
        return lchild; //Found on the left.
    return preorder(node.right, targetvalue); //Traverse right.
}
Preorder Traversal: Illustration [Figure: the tree rooted at 1; 1’s children are 2 and 3; 2’s children are 4 and 5; 3’s children are 6 and 7.] Order: [1 2 4 5 3 6 7] The root is always the first node to be visited.
Inorder Traversal • We have two recursive calls in the preorder traversal: left and right. • In preorder, we checked the node before calling either of them. • In an inorder traversal, we check in-between the two calls. • We dive down all the way on the left before outputting, then we visit the right. • To use recursive stack language, we output after popping on the left but before pushing on the right. • There is no inherent advantage to choosing one traversal over another on a regular binary tree unless you deliberately want a certain ordering. • However, inorder traversal is important on a variation of the binary tree. More on that in just a moment.
Inorder Traversal
BinaryTree inorder(BinaryTree node, int targetvalue) {
    if (node == null) //Base case.
        return null;
    //All we did was move the node check between the two recursive calls.
    BinaryTree lchild = inorder(node.left, targetvalue); //Traverse left.
    if (lchild != null)
        return lchild; //Found on the left.
    if (node.value == targetvalue) //Found it.
        return node;
    return inorder(node.right, targetvalue); //Traverse right.
}
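Inorder Traversal: Illustration [Figure: the tree rooted at 1; 1’s children are 2 and 3; 2’s children are 4 and 5; 3’s children are 6 and 7.] Order: [4 2 5 1 6 3 7] The root is visited after everything in its left subtree and before everything in its right subtree, so in this tree it is the middle node.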
Postorder Traversal • The obvious next step: output after both recursive calls. • This causes the algorithm to dive down to the bottom of the tree and output/visit the node when going back up. • Similar to what we did in printTo(), actually. • We are outputting on the pop.
Postorder Traversal
BinaryTree postorder(BinaryTree node, int targetvalue) {
    if (node == null) //Base case.
        return null;
    BinaryTree lchild = postorder(node.left, targetvalue); //Traverse left.
    BinaryTree rchild = postorder(node.right, targetvalue); //Traverse right.
    if (lchild != null)
        return lchild; //Found on the left.
    if (rchild != null)
        return rchild; //Found on the right.
    if (node.value == targetvalue) //Found it: checked after both calls.
        return node;
    return null; //Not in this subtree.
}
Postorder Traversal: Illustration [Figure: the tree rooted at 1; 1’s children are 2 and 3; 2’s children are 4 and 5; 3’s children are 6 and 7.] Order: [4 5 2 6 7 3 1] The root is always the last node to be visited.
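The three traversals differ only in where the visit sits relative to the two recursive calls. The sketch below records the visit order on the seven-node example tree so the three sequences can be compared side by side. The class name `TraversalOrders` and the choice to collect values into a `List` (rather than search for a target) are assumptions of this example.

```java
import java.util.ArrayList;
import java.util.List;

class TraversalOrders {
    static class BinaryTree {
        int value;
        BinaryTree left, right;
        BinaryTree(int value, BinaryTree left, BinaryTree right) {
            this.value = value;
            this.left = left;
            this.right = right;
        }
        BinaryTree(int value) { this(value, null, null); }
    }

    static void preorder(BinaryTree node, List<Integer> out) {
        if (node == null) return;
        out.add(node.value);          // Visit BEFORE both calls.
        preorder(node.left, out);
        preorder(node.right, out);
    }

    static void inorder(BinaryTree node, List<Integer> out) {
        if (node == null) return;
        inorder(node.left, out);
        out.add(node.value);          // Visit BETWEEN the calls.
        inorder(node.right, out);
    }

    static void postorder(BinaryTree node, List<Integer> out) {
        if (node == null) return;
        postorder(node.left, out);
        postorder(node.right, out);
        out.add(node.value);          // Visit AFTER both calls.
    }

    // Root 1; children 2 and 3; 2's children 4 and 5; 3's children 6 and 7.
    static BinaryTree exampleTree() {
        return new BinaryTree(1,
            new BinaryTree(2, new BinaryTree(4), new BinaryTree(5)),
            new BinaryTree(3, new BinaryTree(6), new BinaryTree(7)));
    }

    public static void main(String[] args) {
        BinaryTree root = exampleTree();
        List<Integer> pre = new ArrayList<>();
        List<Integer> in = new ArrayList<>();
        List<Integer> post = new ArrayList<>();
        preorder(root, pre);
        inorder(root, in);
        postorder(root, post);
        System.out.println(pre);  // [1, 2, 4, 5, 3, 6, 7]
        System.out.println(in);   // [4, 2, 5, 1, 6, 3, 7]
        System.out.println(post); // [4, 5, 2, 6, 7, 3, 1]
    }
}
```

Each routine visits every node exactly once, so all three run in O(n) time.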
CRUD: Binary Trees. • Insertion: ? • Access: ? • Updating an element: ? • Deleting an element: ? • Search/Traversal: O(n). • All three traversals are linear: They visit every node in sequence. • They each just follow different sequences. • You can search by traversing, so search is also O(n). • How long would it take to access a node, though? • If I knew I wanted the left child’s left child, how many pointers would I need to follow to get to it?
Tree Height. • To analyze worst-case access, we need to talk about tree height. • The height of a tree is the number of vertical levels it contains, not including the root level. • Or you can think of it as the number of times you’d have to traverse down the tree to get from the root to the lowest leaf node. • Nodes in the tree are said to have a depth, based on how many vertical levels they are down from the root. • The root itself has a depth of 0. • The root’s children have a depth of 1. • Their children have a depth of 2… • Etc. • The height is thus also the depth of the lowest node.
Height [Figure: the example tree with its levels labeled. Node 1 is at depth 0; nodes 2 and 3 are at depth 1; nodes 4, 5, 6, and 7 are at depth 2.] Height = 2. Remember, don’t count the root level.
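Height and depth, as defined above, can both be computed recursively. This is a sketch under two assumptions that are not stated in the lecture: the empty tree is given height -1 so that a single node comes out at height 0 (consistent with not counting the root level), and `depth` returns -1 when the value is absent. The names `HeightDemo`, `height`, and `depth` are chosen here for illustration.

```java
class HeightDemo {
    static class BinaryTree {
        int value;
        BinaryTree left, right;
        BinaryTree(int value, BinaryTree left, BinaryTree right) {
            this.value = value;
            this.left = left;
            this.right = right;
        }
        BinaryTree(int value) { this(value, null, null); }
    }

    // Height = depth of the lowest node. Assumption: the empty tree has
    // height -1, so a lone root has height 0.
    static int height(BinaryTree node) {
        if (node == null) return -1;
        return 1 + Math.max(height(node.left), height(node.right));
    }

    // Depth of the first node (in preorder) holding target, or -1 if absent.
    static int depth(BinaryTree node, int target) {
        if (node == null) return -1;
        if (node.value == target) return 0;
        int below = depth(node.left, target);
        if (below < 0) below = depth(node.right, target);
        return below < 0 ? -1 : below + 1;   // One level further from the root.
    }

    static BinaryTree exampleTree() {
        return new BinaryTree(1,
            new BinaryTree(2, new BinaryTree(4), new BinaryTree(5)),
            new BinaryTree(3, new BinaryTree(6), new BinaryTree(7)));
    }

    public static void main(String[] args) {
        BinaryTree root = exampleTree();
        System.out.println(height(root));   // 2
        System.out.println(depth(root, 1)); // 0
        System.out.println(depth(root, 3)); // 1
        System.out.println(depth(root, 5)); // 2
        System.out.println(depth(root, 9)); // -1 (not in the tree)
    }
}
```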
Height Balance • A tree is considered balanced (or height-balanced) if the depth of the highest and lowest leaves differs by no more than 1. • This turns out to be an important property because it puts an upper bound on the access time of the tree and lets us find the height. • Question: If we have n nodes in a balanced binary tree, what is the height of the tree? • floor(log2 n) • Note that we had 7 nodes in the previous tree, but a height of 2. The tree was full; adding an 8th node would take the height to 3. • The time to access a node depends on the height, thus we know it is O(log n) on a balanced tree.
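The floor(log2 n) claim can be checked numerically. The sketch below builds a tree of n nodes filled level by level (one particular balanced shape, assumed here for convenience) and compares its measured height with floor(log2 n). The helper names `build`, `height`, and `floorLog2` are hypothetical; `Integer.numberOfLeadingZeros` is the standard library method.

```java
class BalancedHeightDemo {
    static class BinaryTree {
        int value;
        BinaryTree left, right;
        BinaryTree(int value) { this.value = value; }
    }

    // Builds a tree of n nodes filled level by level, left to right.
    // Node i (counting from 1) gets children 2i and 2i+1.
    static BinaryTree build(int i, int n) {
        if (i > n) return null;
        BinaryTree node = new BinaryTree(i);
        node.left = build(2 * i, n);
        node.right = build(2 * i + 1, n);
        return node;
    }

    // Height with the root level not counted; the empty tree is -1.
    static int height(BinaryTree node) {
        if (node == null) return -1;
        return 1 + Math.max(height(node.left), height(node.right));
    }

    // floor(log2 n) for n >= 1, from the position of the highest set bit.
    static int floorLog2(int n) {
        return 31 - Integer.numberOfLeadingZeros(n);
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 7, 8, 15, 16}) {
            System.out.println(n + " nodes: height " + height(build(1, n))
                + ", floor(log2 n) = " + floorLog2(n));
        }
        // 7 nodes give height 2; the 8th node pushes the height to 3.
    }
}
```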
Degeneracy • Trees with only left or right pointers degenerate into linked lists. • Which gives you another perspective on why Quicksort became quadratic with everything on one side of the pivot. • Access on linked lists is O(n). • Performance gets worse even as we approach this condition, so we want to keep trees balanced. [Figure: two degenerate trees, each a single chain of the nodes 1, 2, 3.]
CRUD: Balanced Binary Trees. • Insertion: ? • Access: O(log n). • Updating an element: ? • Deleting an element: ? • Search/Traversal: O(n). • Once we know where to insert, insertion is simple. • Just add a new leaf there: O(1). • However, discovering where to insert is a bit trickier. • Anywhere that a null child used to be will work. • We don’t want to upset the balance of the tree. • A good strategy is to traverse down the tree based on the value of each node. This creates a partitioning at each level.
Binary Tree Insertion
//Returns the root of the tree after insertion. Java passes references by value,
//so assigning to the parameter alone would not change the caller's root.
//Usage: root = insert(root, newtree);
BinaryTree insert(BinaryTree root, BinaryTree newtree) {
    //This can only happen now if the user passes in an empty tree.
    if (root == null)
        return newtree; //Empty. The new node becomes the root.
    else if (newtree.value < root.value) { //Go left if <.
        if (root.left == null) //Found a place to insert.
            root.left = newtree;
        else
            insert(root.left, newtree); //Keep traversing.
    }
    else { //Go right if >=.
        if (root.right == null)
            root.right = newtree; //Found a place to insert.
        else
            insert(root.right, newtree); //Keep traversing.
    }
    return root;
}
Insertion Analysis • This is similar to a traversal, but guided by the value of the node. • We choose left or right based on whether the new value is < or >= the node’s value. • On a balanced tree, we split into one subproblem of size n/2 each time we traverse. • What recurrence would we have for this? • What would be the solution?
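One way to work out the answer to the two questions above, assuming the tree is balanced, n is a power of 2, and each level costs a constant amount of work c (one comparison and one pointer step):

```latex
\begin{aligned}
T(n) &= T(n/2) + c, \qquad T(1) = c \\
     &= T(n/4) + 2c \\
     &= \dots \\
     &= T(n/2^{k}) + kc \\
     &= T(1) + c \log_2 n \qquad (k = \log_2 n) \\
     &= c\,(\log_2 n + 1) = O(\log n)
\end{aligned}
```

This is the same recurrence as binary search: one subproblem of half the size, plus constant work per step.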
CRUD: Balanced Binary Trees. • Insertion: O(log n). • Access: O(log n). • Updating an element: O(1). • Deleting an element: ? • Search/Traversal: O(n). • If we’re already at the element we need to update, we can just change the value, thus O(1). • Note that we can say the same for insertion, but finding a place to put the node is usually considered part of it. • Deletion is quite complex, on the other hand. • If there are no children, just remove the node – O(1). • If there is one child, just replace the node with its child – O(1). • If there are two… well, that’s the tricky case.
Deletion • If we need to delete a node with two children, we need to find a suitable node to replace it with. • One good choice is the inorder successor of the node, which will be the leftmost node in the right subtree of the node we’re deleting. • Inorder successor meaning the next node in an inorder traversal. • The successor has no left child, but it may still have a right subtree. • So our course is clear: walk to the leftmost node of the right subtree, copy its value into the node being deleted, then remove the successor node.
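Deletion
//Precondition: targetnode has both a left and a right subtree.
void deleteWithTwoSubtrees(BinaryTree targetnode) {
    //Deleting a null is a no-op; nodes without a right subtree belong to the simpler cases.
    if (targetnode == null || targetnode.right == null)
        return;
    //Find the inorder successor and its parent.
    BinaryTree inorder_succ;
    BinaryTree inorder_parent = targetnode;
    for (inorder_succ = targetnode.right; inorder_succ.left != null; inorder_succ = inorder_succ.left)
        inorder_parent = inorder_succ; //Keep track of the parent.
    //Set the value of the target node to that of the inorder successor…
    targetnode.value = inorder_succ.value;
    //Unlink the inorder successor (here’s why we needed the parent).
    //It has no left child, but its right subtree (if any) must take its place.
    if (inorder_parent == targetnode)
        inorder_parent.right = inorder_succ.right; //The successor was the right child itself.
    else
        inorder_parent.left = inorder_succ.right;
}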
CRUD: Balanced Binary Trees. • Insertion: O(log n). • Access: O(log n). • Updating an element: O(1). • Deleting an element: O(log n). • Search/Traversal: O(n). • Finding the inorder successor requires time proportional to the height of the tree. If the tree is balanced, this is O(log n).
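The three deletion cases (no children, one child, two children) can be combined into a single recursive routine. The following is a sketch, not the lecture's code: it assumes the tree was built with value-guided insertion (< goes left, >= goes right), and the names `DeleteDemo`, `delete`, and `inorder`, along with the convention of returning the new subtree root, are choices made for this example.

```java
import java.util.ArrayList;
import java.util.List;

class DeleteDemo {
    static class BinaryTree {
        int value;
        BinaryTree left, right;
        BinaryTree(int value) { this.value = value; }
    }

    // Value-guided insertion: < goes left, >= goes right.
    static BinaryTree insert(BinaryTree root, int value) {
        if (root == null) return new BinaryTree(value);
        if (value < root.value) root.left = insert(root.left, value);
        else root.right = insert(root.right, value);
        return root;
    }

    // Removes one node holding value; returns the root of the updated subtree.
    static BinaryTree delete(BinaryTree root, int value) {
        if (root == null) return null;                   // Not found.
        if (value < root.value) {
            root.left = delete(root.left, value);
            return root;
        }
        if (value > root.value) {
            root.right = delete(root.right, value);
            return root;
        }
        // Found the node. Cases 1 and 2: at most one child.
        if (root.left == null) return root.right;        // Also covers a leaf.
        if (root.right == null) return root.left;
        // Case 3: two children. Copy the inorder successor's value here,
        // then delete the successor from the right subtree.
        BinaryTree succ = root.right;
        while (succ.left != null) succ = succ.left;      // Leftmost node on the right.
        root.value = succ.value;
        root.right = delete(root.right, succ.value);
        return root;
    }

    static List<Integer> inorder(BinaryTree node) {
        List<Integer> out = new ArrayList<>();
        collect(node, out);
        return out;
    }

    private static void collect(BinaryTree node, List<Integer> out) {
        if (node == null) return;
        collect(node.left, out);
        out.add(node.value);
        collect(node.right, out);
    }

    public static void main(String[] args) {
        BinaryTree root = null;
        for (int v : new int[] {4, 2, 6, 1, 3, 5, 7}) root = insert(root, v);
        root = delete(root, 1);   // Leaf.
        root = delete(root, 4);   // Two children (the root itself).
        root = delete(root, 6);   // One child.
        System.out.println(inorder(root)); // [2, 3, 5, 7]
    }
}
```

Each call follows a single path down the tree, so the cost is proportional to the height: O(log n) when the tree is balanced.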
CRUD: Unbalanced Binary Trees. • Insertion: O(n). • Access: O(n). • Updating an element: O(1). • Deleting an element: O(1). • Search/Traversal: O(n). • The worst sort of unbalanced tree is just a linked list. • The deletion algorithm would always hit the second case (only one child), so we’d never experience O(log n) behavior… • But the insertion algorithm is not as efficient as that of a linked list. • Unless we check for this condition explicitly, in which case we get O(1).
Binary Search Trees • Binary Search Trees (BSTs) capture the notion of “splitting into two”. • Or, to use the Quicksort term, partitioning. • The value of a node is the pivot. • The left tree contains elements < the pivot. • The right tree contains elements >= the pivot. • They are simply binary trees that are kept sorted in the manner stated above.
BSTs: What do they entail? • Like priority queues and sorted arrays, binary search trees are inherently sorted containers. • This means inserting a sequence of elements and then reading them back will get them out in sorted order. • Ah, but this time we have three ways to read them back out. All three can’t give us the same order. • The elements of an inorder traversal are sorted in binary search trees. • It also means that we’ll have to do some extra work to ensure that this guarantee is true. • But will this work influence the asymptotic performance?
A Binary Search Tree [Figure: a BST rooted at 4; 4’s children are 2 and 6; 2’s children are 1 and 3; 6’s children are 5 and 7.] Inorder traversal: [1 2 3 4 5 6 7] < on the left, >= on the right.
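To see that only the inorder traversal reads a BST back in sorted order, the sketch below builds the tree from the illustration by inserting 4, 2, 6, 1, 3, 5, 7 and prints all three traversals. The insertion order and the names `BstOrderDemo`, `insert`, and `walk` are assumptions of this example.

```java
import java.util.ArrayList;
import java.util.List;

class BstOrderDemo {
    static class BinaryTree {
        int value;
        BinaryTree left, right;
        BinaryTree(int value) { this.value = value; }
    }

    // BST insertion: < goes left, >= goes right.
    static BinaryTree insert(BinaryTree root, int value) {
        if (root == null) return new BinaryTree(value);
        if (value < root.value) root.left = insert(root.left, value);
        else root.right = insert(root.right, value);
        return root;
    }

    // mode 0 = preorder, 1 = inorder, 2 = postorder.
    static void walk(BinaryTree node, int mode, List<Integer> out) {
        if (node == null) return;
        if (mode == 0) out.add(node.value);
        walk(node.left, mode, out);
        if (mode == 1) out.add(node.value);
        walk(node.right, mode, out);
        if (mode == 2) out.add(node.value);
    }

    static List<Integer> traverse(BinaryTree root, int mode) {
        List<Integer> out = new ArrayList<>();
        walk(root, mode, out);
        return out;
    }

    static BinaryTree exampleBst() {
        BinaryTree root = null;
        for (int v : new int[] {4, 2, 6, 1, 3, 5, 7}) root = insert(root, v);
        return root;
    }

    public static void main(String[] args) {
        BinaryTree root = exampleBst();
        System.out.println(traverse(root, 0)); // [4, 2, 1, 3, 6, 5, 7]
        System.out.println(traverse(root, 1)); // [1, 2, 3, 4, 5, 6, 7]
        System.out.println(traverse(root, 2)); // [1, 3, 2, 5, 7, 6, 4]
    }
}
```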
CRUD: Binary Search Trees. • Insertion: O(log n). • Access: O(log n). • Updating an element: ? • Deleting an element: O(log n). • Search: O(log n). • Traversal: O(n). • Search and traversal are no longer the same operation! • Traversal is analogous to linear search: look at every element, one at a time, and try to find the target. • Search on a BST is analogous to binary search: the data is sorted around the value of the node we’re at, so it guides us to eliminate half of the remaining elements at each step. • Just like other unsorted containers, we have to traverse to search a standard binary tree. And like other sorted containers, a BST lets us do a binary search. • Remember, BSTs are sorted in an inorder traversal. • Therefore, the deletion algorithm we previously specified will preserve the ordering.
Access on a BST • Use the same strategy we used in binary search: • Compare the node. • If the target value is less than the node’s value, go left (eliminates the right subtree). • If the target value is greater than the node’s value, go right (eliminates the left subtree). • If it’s equal, we’ve found the target. • If we hit a NULL, the target isn’t in the tree. • This exhibits the same performance: O(log n). • If the tree is balanced. In the degenerate case, we are binary searching a linked list, which is O(n).
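The access strategy above translates into a short loop. This is a minimal sketch with hypothetical names (`BstSearchDemo`, `search`, `exampleBst`); it is written iteratively, although a recursive version in the style of the traversal slides would work just as well.

```java
class BstSearchDemo {
    static class BinaryTree {
        int value;
        BinaryTree left, right;
        BinaryTree(int value) { this.value = value; }
    }

    // BST insertion: < goes left, >= goes right.
    static BinaryTree insert(BinaryTree root, int value) {
        if (root == null) return new BinaryTree(value);
        if (value < root.value) root.left = insert(root.left, value);
        else root.right = insert(root.right, value);
        return root;
    }

    // Returns the node holding target, or null if it is not in the tree.
    static BinaryTree search(BinaryTree node, int target) {
        while (node != null) {
            if (target < node.value) node = node.left;        // Eliminates the right subtree.
            else if (target > node.value) node = node.right;  // Eliminates the left subtree.
            else return node;                                 // Equal: found it.
        }
        return null;                                          // Hit a null: not present.
    }

    static BinaryTree exampleBst() {
        BinaryTree root = null;
        for (int v : new int[] {4, 2, 6, 1, 3, 5, 7}) root = insert(root, v);
        return root;
    }

    public static void main(String[] args) {
        BinaryTree root = exampleBst();
        System.out.println(search(root, 5) != null); // true
        System.out.println(search(root, 8) != null); // false
    }
}
```

The loop takes one step per level, so its cost is bounded by the height of the tree: O(log n) when balanced, O(n) in the degenerate case.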
Insertion on a BST • The algorithm I gave you for insertion was actually the BST insertion algorithm as well. • That was one of the reasons why I chose that strategy, although it does result in a fairly balanced tree if the data distribution is uniform. • In order to keep elements partitioned around the pivot, we need to traverse left when the new element has a value < the pivot and right when it’s >=. • It was O(log n) before, and it still is.
Deletion on a BST • I also gave you the BST deletion algorithm. • As the inorder traversal is in sorted order, the inorder successor is the next element after the one we’re deleting in sorted order. • If we replace the element we’re deleting with the next element in the sequence, the sequence is still sorted. • e.g., [ 1 3 5 8 13 ] after deleting 3 -> [ 1 5 8 13]. • It was O(log n) before, and it still is.
Updating a BST • Ah, here’s something different. • Updating unsorted containers is usually a constant-time operation, while updating sorted containers usually takes longer. • When we change the value of a node in a BST, we may be required to change the node’s position in the tree to preserve the ordering. • This is why updating sorted containers is usually a slow operation. • No one seems to want to deal with updating these, so most sources (including your textbook) just define it as “delete and reinsert”. • Which fully works and is very simple to do. • Don’t be afraid to do “quick and dirty” things if they don’t harm your performance. • So does this harm performance? • Insertion is O(log n). • Deletion is O(log n). • Unless we can update in O(1) on a BST (we can’t), then no.
CRUD: Binary Search Trees. • Insertion: O(log n). • Access: O(log n). • Updating an element: O(log n). • Deleting an element: O(log n). • Search: O(log n). • Traversal: O(n). • This is the ultimate compromise data structure. • Arrays, Lists, Stacks, and Queues all did some things in constant time and other things in linear time. • This does everything (except traversal, which is inherently a linear operation) in logarithmic time. • But remember, logarithmic time isn’t much worse than constant. • So these are pretty good data structures. • As usual, there’s a catch…
The Importance of Balance • Every operation on a tree begins to degenerate when balance is lost. • And in the worst case, you end up with a less efficient linked list. • Keeping the tree balanced is thus important. • There is one who is prophesied to bring balance to the Force, but I don’t think that includes your trees. • So the burden falls on you, my young padawan. • Since BSTs are the structural analogue of Quicksort, you may have an idea of what insertion sequence will produce the worst case. • Yep, sorted or inverse-sorted, just as in Quicksort. • Most data is not arranged like this already, and on average, BSTs stay fairly well balanced. • But this is enough of a problem that various self-balancing structures have been invented. We will discuss these next week.
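The worst case is easy to reproduce. The sketch below inserts the same seven values into a BST in three different orders and measures the resulting height; the names `DegenerateDemo`, `buildFrom`, and `height` are hypothetical, and height is counted without the root level, with the empty tree at -1.

```java
class DegenerateDemo {
    static class BinaryTree {
        int value;
        BinaryTree left, right;
        BinaryTree(int value) { this.value = value; }
    }

    // BST insertion: < goes left, >= goes right.
    static BinaryTree insert(BinaryTree root, int value) {
        if (root == null) return new BinaryTree(value);
        if (value < root.value) root.left = insert(root.left, value);
        else root.right = insert(root.right, value);
        return root;
    }

    static BinaryTree buildFrom(int[] values) {
        BinaryTree root = null;
        for (int v : values) root = insert(root, v);
        return root;
    }

    // Height with the root level not counted; the empty tree is -1.
    static int height(BinaryTree node) {
        if (node == null) return -1;
        return 1 + Math.max(height(node.left), height(node.right));
    }

    public static void main(String[] args) {
        int[] sorted = {1, 2, 3, 4, 5, 6, 7};
        int[] reversed = {7, 6, 5, 4, 3, 2, 1};
        int[] mixed = {4, 2, 6, 1, 3, 5, 7};
        System.out.println(height(buildFrom(sorted)));   // 6: a chain of right children
        System.out.println(height(buildFrom(reversed))); // 6: a chain of left children
        System.out.println(height(buildFrom(mixed)));    // 2: fully balanced
    }
}
```

With sorted input every new value lands at the end of a single chain, so the height is n - 1 and each operation degrades to O(n).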
A General Note • Although I put numbers in most of my examples, any sort of data can go in these. • Strings, Objects, Employees. • Caveat: When using Java’s sorted containers, make sure your class implements Comparable. • Java doesn’t give you a BinaryTree class outright, but it does give you TreeSet and TreeMap. • TreeMap in particular is very neat; check it out. • We’ll do some things with these in Thursday’s lab.
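As a small illustration of the Comparable caveat, the sketch below stores a hypothetical `Employee` class in a `TreeSet` and some word counts in a `TreeMap`. The `Employee` class, its fields, and the decision to order employees by id are all assumptions made for this example; `TreeSet` and `TreeMap` are the standard `java.util` classes.

```java
import java.util.TreeMap;
import java.util.TreeSet;

class SortedContainersDemo {
    // Hypothetical class for illustration; ordered by id.
    static class Employee implements Comparable<Employee> {
        final String name;
        final int id;

        Employee(String name, int id) {
            this.name = name;
            this.id = id;
        }

        @Override
        public int compareTo(Employee other) {
            return Integer.compare(id, other.id);
        }

        @Override
        public String toString() {
            return id + ":" + name;
        }
    }

    public static void main(String[] args) {
        // TreeSet keeps its elements in compareTo order.
        TreeSet<Employee> staff = new TreeSet<>();
        staff.add(new Employee("Carol", 30));
        staff.add(new Employee("Alice", 10));
        staff.add(new Employee("Bob", 20));
        System.out.println(staff);         // [10:Alice, 20:Bob, 30:Carol]
        System.out.println(staff.first()); // 10:Alice

        // TreeMap keeps its keys sorted; String already implements Comparable.
        TreeMap<String, Integer> counts = new TreeMap<>();
        counts.put("tree", 3);
        counts.put("array", 1);
        counts.put("queue", 2);
        System.out.println(counts);            // {array=1, queue=2, tree=3}
        System.out.println(counts.firstKey()); // array
    }
}
```

Without `implements Comparable<Employee>` (or a `Comparator` passed to the constructor), adding an `Employee` to the `TreeSet` would fail at runtime with a `ClassCastException`.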
“In all my life I’ll never see / a thing so beautiful as a tree.” • The study of trees goes very deep. • We’ve just scratched the surface. • We’ll come back to self-balancing trees, heaps, and perhaps splay trees. • The lesson: • Ideas are universal. They can come from your study. They can come from outside of your study. They can come from nature. They can come from anywhere. • Next class: Linear-time sorting, B+ trees, lab.