15-211 Fundamental Data Structures and Algorithms

More LZW / Midterm Review
Margaret Reid-Miller
1 March 2005

Midterm
Thursday, 3 March 2005, 12:00 noon, WeH 7500.
Worth a total of 125 points.
Closed book, but you may have one page of notes.
If you have a question, raise your hand and stay in your seat.

Last Time: Lempel & Ziv

Reminder: Compressing
We scan a sequence of symbols A = a1 a2 a3 … ak where each prefix is in the dictionary.
We stop when we fall out of the dictionary: A b is not in the dictionary.

Reminder: Compressing
Then send the code for A = a1 a2 a3 … ak, enter A b in the dictionary, and resume scanning at b.
This is the classical algorithm.
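The scan-and-send loop above can be sketched in a few lines of Python. This is a minimal sketch, not the course's reference code: the function name lzw_compress, the alphabet parameter, and the choice to number the initial single-symbol entries 0, 1, 2, … are assumptions made for illustration, and every input symbol is assumed to be in the alphabet.

```python
def lzw_compress(text, alphabet):
    """Return the list of LZW code numbers for text.

    Assumption of this sketch: the dictionary starts with one entry per
    alphabet symbol, numbered 0, 1, 2, ... in the order given.
    """
    dictionary = {symbol: code for code, symbol in enumerate(alphabet)}
    codes = []
    word = ""                      # A: the longest dictionary word scanned so far
    for symbol in text:            # b: the next input symbol
        if word + symbol in dictionary:
            word += symbol         # still inside the dictionary, keep scanning
        else:
            codes.append(dictionary[word])               # send code(A)
            dictionary[word + symbol] = len(dictionary)  # enter A b
            word = symbol          # resume the scan at b
    if word:
        codes.append(dictionary[word])  # flush the final word
    return codes
```

With a three-symbol alphabet "abc" (the third symbol is a guess; only a and b occur in the text), the text aabbbaabbaaabaababb from the Example slide compresses to 0 0 1 5 3 6 7 9 5.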

LZW: Compress bad case
Input: …sW…^sWsWsb…
Dictionary: α: sW
W is a word (possibly empty)
Output: ….

LZW: Compress bad case (time t)
Input: …sW…^sWsWsb…
Dictionary: α: sW
Output: ….

LZW: Uncompress bad case (time t)
Input: …^αβ…
Dictionary: α: sW
Output: ……

LZW: Compress bad case (step t+1)
Input: …sW…sW^sWsb…
Dictionary: α: sW, β: sWs
Output: ….α

LZW: Uncompress bad case (time t+1)
Input: …α^β…
Dictionary: α: sW
Output: ……sW

LZW: Compress bad case (time t+2)
Input: …sW…sWsWs^b…
Dictionary: α: sW, β: sWs, β+1: sWsb
Output: ….αβ

LZW: Uncompress bad case (time t+2)
Input: …α^β…
What is β??
Dictionary: α: sW
Output: ……sW

LZW: Uncompress bad case (time t+2)
Input: …α^β…
What is β?? It codes for sWs!
Dictionary: α: sW, β: sWs
Output: ……sWsWs

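Example
Input codes: 0 0 1 5 3 6 7 9 5
Output text: aabbbaabbaaabaababb

Input   Output   Add to D
0       a
0       + a      3: aa
1       + b      4: ab
5       - bb     5: bb
3       + aa     6: bba
6       + bba    7: aab
7       + aab    8: bbaa
9       - aaba   9: aaba
5       + bb     10: aabab

A "-" marks a code that is not yet in the decompressor's dictionary when it arrives.
For code 9 the text has the form s W s W s b with s = a and W = ab.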
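The table above can be replayed with a small decompressor. The following is a sketch under assumptions, not the lecture's own code: the name lzw_decompress, the alphabet parameter, and the initial numbering 0, 1, 2, … of single symbols are illustrative choices. The one unusual branch is the bad case: a code equal to the next free dictionary slot must stand for the previous word followed by its own first symbol.

```python
def lzw_decompress(codes, alphabet):
    """Rebuild the text from a list of LZW code numbers.

    Assumption of this sketch: single symbols are pre-numbered
    0, 1, 2, ... in the order they appear in alphabet.
    """
    dictionary = {code: symbol for code, symbol in enumerate(alphabet)}
    if not codes:
        return ""
    previous = dictionary[codes[0]]
    output = [previous]
    for code in codes[1:]:
        if code in dictionary:
            word = dictionary[code]            # easy case: code already known
        elif code == len(dictionary):
            word = previous + previous[0]      # bad case: the word is s W s
        else:
            raise ValueError("invalid code sequence")
        output.append(word)
        # Enter the word the compressor entered one step earlier:
        # the previous word followed by the first symbol of this one.
        dictionary[len(dictionary)] = previous + word[0]
        previous = word
    return "".join(output)
```

Calling lzw_decompress([0, 0, 1, 5, 3, 6, 7, 9, 5], "abc") takes the bad-case branch twice, for codes 5 and 9, matching the two "-" rows of the table.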

LZW Correctness
So we know that when this case occurs, decompression works.
Is this the only bad case? How do we know that decompression always works? (Note that compression is not an issue here.)
Formally, we have two maps:
  comp : texts → int sequences
  decomp : int sequences → texts
We need, for all texts T: decomp(comp(T)) = T

Getting Personal
Think about
  Ann: compresses T, sends the int sequence.
  Bob: decompresses the int sequence, tries to reconstruct T.
Question: Can Bob always succeed?
Assuming, of course, that the int sequence is valid (the map decomp() is not total).

How?
How do we prove that Bob can always succeed? Think of Ann and Bob working in parallel.
Time 0: both initialize their dictionaries.
Time t: Ann determines the next code number c and sends it to Bob. Bob must be able to convert c back into the corresponding word.

Induction
We can use induction on t.
The problem is: What property should we establish by induction? It has to be a claim about Bob’s dictionary.
How do the two dictionaries compare over time?

The Claim
At time t = 0 both Ann and Bob have the same dictionary. But at any time t > 0 we have
Claim: After processing the last code Ann sent, Bob’s dictionary misses exactly the last entry in Ann’s dictionary.
(Ann can add A b to the dictionary, but Bob won’t know b until the next message he receives.)
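The claim can be checked empirically by running Ann and Bob in lockstep. The sketch below is illustrative only: ann_steps and check_claim are hypothetical names, dictionaries are kept as plain lists indexed by code number, and single symbols are assumed to be pre-numbered 0, 1, 2, …. After every code, it asserts that Bob's dictionary is a prefix of Ann's and is at most one entry behind. (After the very last code Ann enters nothing, so the two dictionaries agree there.)

```python
def ann_steps(text, alphabet):
    """Yield (code, snapshot of Ann's dictionary) after each code Ann sends."""
    words = list(alphabet)                     # words[code] is the word for code
    index = {word: code for code, word in enumerate(words)}
    word = ""
    for symbol in text:
        if word + symbol in index:
            word += symbol
        else:
            index[word + symbol] = len(words)  # Ann enters A b ...
            words.append(word + symbol)
            yield index[word], list(words)     # ... and sends code(A)
            word = symbol
    if word:
        yield index[word], list(words)         # final code, no new entry


def check_claim(text, alphabet):
    """Run Bob against Ann; return True if Bob reconstructs text."""
    bob = list(alphabet)
    previous = None
    output = []
    for code, ann in ann_steps(text, alphabet):
        # Hard case: the code is the one entry Bob is still missing.
        word = bob[code] if code < len(bob) else previous + previous[0]
        if previous is not None:
            bob.append(previous + word[0])     # Bob catches up by one entry
        output.append(word)
        previous = word
        # The claim: Bob lacks at most the newest entry of Ann's dictionary.
        assert bob == ann[:len(bob)] and len(ann) - len(bob) <= 1
    return "".join(output) == text
```

A run such as check_claim("aaaaaaa", "a") exercises the hard case at every step after the first, since each code Ann sends is the entry she made one step earlier.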

The Easy Case
Suppose at time t Ann enters A b with code number C and sends c = code(A).
Easy case: c < C-1
By the inductive hypothesis, Bob has codes up to and including C-2 in his dictionary. That is, c is already in Bob’s dictionary. So Bob can decode and now knows A.
But then Bob can update his dictionary with entry C-1: it is the previously decoded word followed by the first letter of A.

The Easy Case
Suppose at time t Ann enters A b with code number C and sends c = code(A).
Easy case: c < C-1
[Diagram: the text … A b …; Ann sends c and enters C.]

The Hard Case
Now suppose c = C-1. Recall, at time t Ann had entered A b with code number C and sent c = code(A).
[Diagram, built up in four steps; in each, Ann sends c and enters C:]
  Text: … A b …
  Text: … A’ s’ … b …          where A = A’ s’ and a1 = s’
  Text: … s’ W s’ … b …        where A’ = s’ W
  Text: … s’ W s’ W s’ b …

The Hard Case
Now suppose c = C-1. Recall, at time t Ann had entered A b with code number C and sent c = code(A).
So we have
  Time t-1: entered c = code(A), sent code(A’), where A = A’ s’
  Time t: entered C = code(A b), sent c = code(A), where a1 = s’
But then A’ = s’ W.

The Hard Case
In other words, the text must have looked like this:
  … s’ W s’ W s’ b …
But Bob already knows A’ and thus can reconstruct A = A’ s’, since s’ is the first letter of A’.
QED

Midterm Review

Basic Data Structures
• List
• Persistence
• Tree
  • Height of tree, Depth of node, Level
  • Perfect, Complete, Full
  • Min & Max number of nodes

Recurrence Relations
• E.g., T(n) = T(n-1) + n/2
• Solve by repeated substitution
• Solve resulting series
• Prove by guessing and substitution
• Master Theorem
  • T(N) = aT(N/b) + f(N)

Solving recurrence equations
Repeated substitution:
t(n) = n + t(n-1)
     = n + (n-1) + t(n-2)
     = n + (n-1) + (n-2) + t(n-3)
     and so on…
     = n + (n-1) + (n-2) + (n-3) + … + 1

Incrementing series
• This is an arithmetic series that comes up over and over again, because it characterizes many nested loops:
  for (i=1; i<n; i++) {
    for (j=1; j<i; j++) {
      f();
    }
  }
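To make the connection between the loop and the series concrete, here is a small Python sketch. It counts the calls to f() in the nested loop and evaluates the recurrence t(n) = n + t(n-1) directly. The names count_calls and t are illustrative, and the base case t(0) = 0 is an assumption, since the slide leaves it implicit.

```python
def count_calls(n):
    """Count how often f() runs in the nested loop from the slide."""
    calls = 0
    for i in range(1, n):        # for (i=1; i<n; i++)
        for j in range(1, i):    # for (j=1; j<i; j++)
            calls += 1           # stands in for f()
    return calls


def t(n):
    """t(n) = n + t(n-1), with t(0) = 0 assumed as the base case."""
    return 0 if n == 0 else n + t(n - 1)
```

With these loop bounds the inner body runs 0 + 1 + … + (n-2) = (n-1)(n-2)/2 times, and the recurrence sums to n(n+1)/2. Both are quadratic in n.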

“Big-Oh” notation
T(N) = O(f(N)): “T(N) is order f(N)”
[Figure: running time plotted against N; the curve T(N) stays below c f(N) for all N beyond n0.]

Upper And Lower Bounds
f(n) = O( g(n) )   Big-Oh      f(n) ≤ c g(n) for some constant c > 0 and all n > n0
f(n) = Ω( g(n) )   Big-Omega   f(n) ≥ c g(n) for some constant c > 0 and all n > n0
f(n) = Θ( g(n) )   Theta       f(n) = O( g(n) ) and f(n) = Ω( g(n) )

Upper And Lower Bounds
f(n) = O( g(n) )   Big-Oh      Can only be used for upper bounds.
f(n) = Ω( g(n) )   Big-Omega   Can only be used for lower bounds.
f(n) = Θ( g(n) )   Theta       Pins down the running time exactly (up to a multiplicative constant).

Big-O characteristic
• Low-order terms “don’t matter”:
  • Suppose T(N) = 20N^3 + 10N log N + 5
  • Then T(N) = O(N^3)
• Question:
  • What constants c and n0 can be used to show that the above is true?
• Answer: c=35, n0=1 (for N ≥ 1, N log N ≤ N^3 and 5 ≤ 5N^3, so T(N) ≤ 20N^3 + 10N^3 + 5N^3 = 35N^3)
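The constants in the answer can be spot-checked numerically. This is only a sanity check over a finite range, not a proof. The base-2 logarithm and the helper names T and bound_holds are assumptions of this sketch.

```python
import math


def T(n):
    """T(N) = 20 N^3 + 10 N log N + 5, with log taken base 2 (an assumption)."""
    return 20 * n**3 + 10 * n * math.log2(n) + 5


def bound_holds(c, n0, limit=10_000):
    """Check T(N) <= c * N^3 for every integer N in [n0, limit]."""
    return all(T(n) <= c * n**3 for n in range(n0, limit + 1))
```

bound_holds(35, 1) succeeds, while c = 20 with n0 = 1 already fails at N = 1, where T(1) = 25.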

Big-O characteristic
• The bigger task always dominates eventually.
• If T1(N) = O(f(N)) and T2(N) = O(g(N)),
  • then T1(N) + T2(N) = O( max( f(N), g(N) ) ).
• Also:
  • T1(N) × T2(N) = O( f(N) × g(N) ).

Dictionary
• Operations:
  • Insert
  • Delete
  • Find
• Implementations:
  • Binary Search Tree
  • AVL Tree
  • Splay
  • Trie
  • Hash

Binary search trees
• Simple binary search trees can have bad behavior for some insertion sequences.
  • Average case O(log N), worst case O(N).
• AVL trees maintain a balance invariant to prevent this bad behavior.
  • Accomplished via rotations during insert.
• Splay trees achieve amortized running time of O(log N).
  • Accomplished via rotations during find.
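The bad insertion sequence for a simple binary search tree is easy to reproduce. The following is a minimal sketch with illustrative names (Node, insert, height, build). It implements only plain, unbalanced insertion and measures height in edges, with the empty tree at height -1 as an assumed convention.

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None


def insert(root, key):
    """Insert key into an unbalanced BST and return the root."""
    if root is None:
        return Node(key)
    node = root
    while True:
        if key == node.key:
            return root                      # duplicate: nothing to do
        side = "left" if key < node.key else "right"
        child = getattr(node, side)
        if child is None:
            setattr(node, side, Node(key))   # found the empty spot
            return root
        node = child


def height(root):
    """Height in edges; the empty tree has height -1 (assumed convention)."""
    if root is None:
        return -1
    return 1 + max(height(root.left), height(root.right))


def build(keys):
    """Insert the keys one by one, in the given order."""
    root = None
    for key in keys:
        root = insert(root, key)
    return root
```

Inserting the keys 1 to 100 in sorted order yields a chain of height 99, so a find costs O(N). Inserting 4, 2, 6, 1, 3, 5, 7 yields a perfect tree of height 2.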

AVL trees
• Definition
• Min number of nodes of height H
  • F_{H+3} - 1, where F_n is the nth Fibonacci number
• Insert - single & double rotations. How many?
• Delete - lazy. How bad?
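The Fibonacci formula follows from a simple recurrence: a smallest AVL tree of height H has a root, a smallest subtree of height H-1, and a smallest subtree of height H-2. The sketch below checks the two against each other. It assumes a single node has height 0 and the numbering F_1 = F_2 = 1. The function names are illustrative.

```python
def min_nodes(h):
    """Fewest nodes in an AVL tree of height h (a single node has height 0)."""
    smaller, current = 0, 1          # min_nodes(-1), min_nodes(0)
    for _ in range(h):
        # root + smallest tree of height h-1 + smallest tree of height h-2
        smaller, current = current, smaller + current + 1
    return current


def fib(n):
    """nth Fibonacci number with F_1 = F_2 = 1."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
```

For heights 0, 1, 2, 3 this gives 1, 2, 4, 7 nodes, which is F_3 - 1, F_4 - 1, F_5 - 1, F_6 - 1.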

Single rotation
• For the case of insertion into left subtree of left child:
[Figure: subtrees X, Y, Z before and after the rotation; the rotation reduces the depth of X by 1 and increases the depth of Z by 1.]
Deepest node of X has depth 2 greater than deepest node of Z.

Double rotation
• For the case of insertion into the right subtree of the left child.
[Figure: subtrees X, Y1, Y2, Z before and after the double rotation.]
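Both rotations are a few pointer updates. The sketch below shows them in Python with illustrative names (Node, rotate_right, rotate_left, rotate_left_right). It leaves out the height bookkeeping a full AVL insert needs, so it shows only the restructuring step.

```python
class Node:
    def __init__(self, key, left=None, right=None):
        self.key = key
        self.left = left
        self.right = right


def rotate_right(root):
    """Single rotation for an insertion into the left subtree of the left child."""
    child = root.left
    root.left = child.right      # subtree Y changes parent
    child.right = root
    return child                 # the left child is the new subtree root


def rotate_left(root):
    """Mirror image of rotate_right."""
    child = root.right
    root.right = child.left
    child.left = root
    return child


def rotate_left_right(root):
    """Double rotation for an insertion into the right subtree of the left child."""
    root.left = rotate_left(root.left)
    return rotate_right(root)


def inorder(node):
    """Keys in sorted order; rotations must not change this."""
    if node is None:
        return []
    return inorder(node.left) + [node.key] + inorder(node.right)
```

For example, the chain 3, 2, 1 (each the left child of the previous) becomes a tree rooted at 2 after rotate_right, and the zig-zag 3, 1, 2 becomes the same tree after rotate_left_right.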

Splay trees
• Splay trees provide a guarantee that any sequence of M operations (starting from an empty tree) will require O(M log N) time.
  • Hence, each operation has amortized cost of O(log N).
• It is possible that a single operation requires O(N) time.
  • But there are no bad sequences of operations on a splay tree.

Splaying, case 3
• Case 3: Zig-zag (left).
  • Perform an AVL double rotation.
[Figure: nodes a and b with subtrees X, Y1, Y2, Z before and after the rotation.]

Splaying, case 4
• Case 4: Zig-zig (left).
  • Special rotation.
[Figure: nodes a and b with subtrees W, X, Y, Z before and after the rotation.]

Tries
[Figure: a trie with digit-labeled edges storing the words I, you, like, love, and lovely.]
• Good for unequal length keys or sequences
• Find O(m), m sequence length
• But: Few to many children
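A trie over the words from the figure can be sketched as follows. This version branches on the letters themselves rather than on digits, and keeps each node's children in a Python dict so that a node with few children uses little space. The names TrieNode, trie_insert, and trie_find are illustrative assumptions.

```python
class TrieNode:
    def __init__(self):
        self.children = {}       # symbol -> TrieNode; few or many entries
        self.is_key = False      # True if a stored key ends at this node


def trie_insert(root, key):
    """Walk down one node per symbol, creating nodes as needed."""
    node = root
    for symbol in key:
        node = node.children.setdefault(symbol, TrieNode())
    node.is_key = True


def trie_find(root, key):
    """Return True if key was inserted; takes O(m) steps for a key of length m."""
    node = root
    for symbol in key:
        node = node.children.get(symbol)
        if node is None:
            return False
    return node.is_key
```

After inserting I, you, like, love, and lovely, a find for "love" succeeds, while "lov" fails because no key ends at that node, even though the path exists.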

Hash Tables
• Hash function h: h(key) = index
• Desirable properties:
  • Approximate random distribution
  • Easy to calculate
  • E.g., Division: h(k) = k mod m
• Perfect hashing
  • When all input keys are known in advance

Collisions
• Separate chaining
  • Linked list: ordered vs unordered
• Open addressing
  • Linear probing - clustering very bad with high load factor
  • *Quadratic probing - secondary clustering, table size must be prime
  • Double hashing - table size must be prime, too complex
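As an illustration of open addressing, here is a minimal linear-probing table for integer keys using the division hash h(k) = k mod m. The class and method names are hypothetical, the default table size 11 is an arbitrary choice, and deletion and rehashing are deliberately left out.

```python
class LinearProbingTable:
    def __init__(self, size=11):
        self.slots = [None] * size           # None marks an empty slot

    def _probe(self, key):
        """Yield slot indices h(k), h(k)+1, ... wrapping around the table."""
        m = len(self.slots)
        start = key % m                      # division hashing: h(k) = k mod m
        for step in range(m):
            yield (start + step) % m

    def insert(self, key):
        """Store key in the first free slot of its probe sequence; return the slot."""
        for i in self._probe(key):
            if self.slots[i] is None or self.slots[i] == key:
                self.slots[i] = key
                return i
        raise RuntimeError("table is full")

    def find(self, key):
        """Follow the probe sequence until the key or an empty slot is found."""
        for i in self._probe(key):
            if self.slots[i] is None:
                return False
            if self.slots[i] == key:
                return True
        return False
```

Inserting 1, 12, and 23 into a table of size 11 shows clustering: all three hash to slot 1 and end up in slots 1, 2, and 3, so a later insert of 2 is pushed to slot 4.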

Hash Tables
• Delete?
• Rehash when load factor high - double the table size (amortized cost is constant)
• Find & insert are near constant time!
• But: no min, max, next, … operations
• Trade space for time: load factors < 75%
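The amortized claim for doubling can be illustrated by counting how many keys get moved by rehashing. The sketch below only simulates the sizes, without storing any keys. The function name total_moves, the starting capacity of 8, and the policy of doubling exactly when the next insert would push the load factor above 75% are assumptions for illustration.

```python
def total_moves(n_inserts, max_load=0.75, capacity=8):
    """Count keys re-inserted by rehashing over a run of n_inserts inserts."""
    moves = 0
    size = 0
    for _ in range(n_inserts):
        if (size + 1) / capacity > max_load:
            moves += size        # rehash every stored key into the new table
            capacity *= 2        # double the table size
        size += 1
    return moves
```

Under these assumptions the total number of moves stays below 2n for n inserts, so the rehashing work averages out to a constant per insert.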
