Learn about the greedy strategy using the centerstring problem as an example, exploring the concept of Hamming distance and its optimization. Understand how Huffman encoding works for data compression, and how variable length codes give an efficient binary representation of data.
Greedy Algorithms Amihood Amir Bar-Ilan University
Idea: Simplest type of strategy: 1. Take a step that makes the problem smaller. 2. Iterate. Difficulty: Prove that this leads to an optimal solution. This is not always the case!
Example: Centerstring Problem. Input: k strings s1,…,sk of length ℓ over alphabet Σ, and a distance d. Find: a string s* such that the maximum of Ham(s*,si) over i=1,…,k is ≤ d, where Ham is the Hamming distance.
Our Problem: given k strings of length ℓ, find a string whose maximum distance from them is smallest.
Example:
s1: 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
s2: 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
s3: 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0
--------------------------------------------------
s*: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
The Hamming distance of the consensus s* from any string: 4
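The quantities in this example can be checked with a minimal Python sketch. It assumes the strings are given as equal-length Python strings; the names hamming and radius are illustrative and do not come from the slides.

```python
def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("strings must have equal length")
    return sum(a != b for a, b in zip(s, t))


def radius(center, strings):
    """Largest Hamming distance from `center` to any of the input strings."""
    return max(hamming(center, s) for s in strings)


# The three strings s1, s2, s3 from the example slide.
strings = [
    "1111000000000000",
    "0000111100000000",
    "0000000011110000",
]

consensus = "0" * 16
print(radius(consensus, strings))  # 4
```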
Suggestion: greedy strategy, column majority?
0 1 1 1 0 0 0 0 1 0 1 1 0 0 0 0 1 1 0 0 1 1 1 0 1 0 1 0
0 1 0 1 0 1 0 0 1 0 1 0 1 0 0 0 1 1 0 0 1 0 1 0 1 0 1 0
0 1 1 0 0 0 1 0 1 0 1 1 0 0 1 0 0 1 0 0 1 0 1 0 0 1 1 0
---------------------------------------
0 1 1 1 0 0 0 0 1 0 1 1 0 0 0 0 1 1 0 0 1 0 1 0 1 0 1 0
Problem: This works if we want to minimize the average distance, but not if we want to minimize the maximum!
Why?
1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
-----------------------------------------
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 (majority)
Hamming distance from last string: 16
But:
1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
-----------------------------------------
1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0
Hamming distance from any string: 8
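The failure of column majority on these five strings can be reproduced with a short Python sketch. The function names and the tie-breaking rule (smallest symbol wins) are assumptions made for illustration; the slides do not specify them.

```python
from collections import Counter


def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))


def radius(center, strings):
    """Largest Hamming distance from `center` to any input string."""
    return max(hamming(center, s) for s in strings)


def column_majority(strings):
    """Greedy candidate: the most frequent symbol in every column.

    Ties are broken in favor of the smallest symbol (an assumption).
    """
    result = []
    for column in zip(*strings):
        counts = Counter(column)
        result.append(max(sorted(counts), key=lambda c: counts[c]))
    return "".join(result)


strings = [
    "1111000000000000",
    "0000111100000000",
    "0000000011110000",
    "0000000000001111",
    "1111111111111111",
]

greedy = column_majority(strings)
better = "1100" * 4  # the alternative center shown on the slide

print(greedy, radius(greedy, strings))
print(better, radius(better, strings))
# The greedy center still has the smaller total (average) distance.
print(sum(hamming(greedy, s) for s in strings),
      sum(hamming(better, s) for s in strings))
```

The last line shows why the greedy choice is tempting: the majority string does minimize the sum of distances, just not the maximum.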
Example (that works): Huffman code. Computer Data Encoding: How do we represent data in binary? Historical Solution: Fixed length codes. Encode every symbol by a unique binary string of a fixed length. Examples: ASCII (7 bit code), EBCDIC (8 bit code), …
ASCII Example: AABCAA is encoded as 1000001 1000001 1000010 1000011 1000001 1000001 (A = 1000001, B = 1000010, C = 1000011).
Total space usage in bits: Assume an ℓ bit fixed length code. For a file of n characters we need nℓ bits.
Variable Length codes. Idea: In order to save space, use fewer bits for frequent characters and more bits for rare characters. Example: Suppose an alphabet of 3 symbols: { A, B, C }. Suppose the file has 1,000,000 characters. We need 2 bits per character for a fixed length code, for a total of 2,000,000 bits.
Variable Length codes - example. Suppose the frequency distribution of the characters is: A: 999,000, B: 500, C: 500. Encode: A = 0, B = 10, C = 11. Note that the code of A is of length 1, and the codes for B and C are of length 2.
Total space usage in bits: Fixed code: 1,000,000 x 2 = 2,000,000. Variable code: 999,000 x 1 + 500 x 2 + 500 x 2 = 1,001,000. A savings of almost 50%.
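The arithmetic can be replayed in a few lines of Python. This is only a sketch: the frequencies are read off the sum on this slide, and the concrete codewords (A = 0, B = 10, C = 11) are an assumption consistent with the stated code lengths 1, 2, 2.

```python
# Character counts taken from the sum on the slide.
freq = {"A": 999_000, "B": 500, "C": 500}

# Assumed codewords with lengths 1, 2, 2.
code = {"A": "0", "B": "10", "C": "11"}

fixed_bits_per_symbol = 2  # enough to distinguish 3 symbols
fixed_total = fixed_bits_per_symbol * sum(freq.values())
variable_total = sum(freq[s] * len(code[s]) for s in freq)

print(fixed_total)     # 2000000
print(variable_total)  # 1001000
print(f"savings: {1 - variable_total / fixed_total:.2%}")
```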
How do we decode? In the fixed length code, we know where every character starts, since they all have the same number of bits. Example: A = 00, B = 01, C = 10. The bit string 00 00 00 01 01 10 10 10 01 10 01 00 00 10 10 decodes to A A A B B C C C B C B A A C C.
How do we decode? In the variable length code, we use an idea called a prefix code, where no code is a prefix of another. Example: A = 0, B = 10, C = 11. None of the above codes is a prefix of another.
How do we decode? Example: A = 0, B = 10, C = 11. So, for the string A A A B B C C C B C B A A C C, the encoding is 0 0 0 10 10 11 11 11 10 11 10 0 0 11 11.
Prefix Code Example: A = 0, B = 10, C = 11. Decode the string 0001010111111101110001111 as 0 0 0 10 10 11 11 11 10 11 10 0 0 11 11, which gives A A A B B C C C B C B A A C C.
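Decoding a prefix code needs no separators: scan the bits left to right and emit a symbol as soon as the bits read so far form a codeword. The Python sketch below illustrates this for the code on the slide; the function names encode and decode are illustrative, and the approach relies on the code being prefix-free.

```python
def encode(text, code):
    """Concatenate the codeword of every symbol in `text`."""
    return "".join(code[ch] for ch in text)


def decode(bits, code):
    """Decode a bit string produced with a prefix code.

    Because no codeword is a prefix of another, the first codeword
    that matches the bits read so far is the only possible one.
    """
    inverse = {word: symbol for symbol, word in code.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            out.append(inverse[current])
            current = ""
    if current:
        raise ValueError("trailing bits do not form a codeword")
    return "".join(out)


code = {"A": "0", "B": "10", "C": "11"}
message = "AAABBCCCBCBAACC"

bits = encode(message, code)
print(bits)
print(decode(bits, code))
```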
Desiderata: Construct a variable length code for a given file with the following properties: • Prefix code. • Using shortest possible codes. • Efficient. • As close to entropy as possible.
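Idea: Consider a binary tree, with 0 meaning a left turn and 1 meaning a right turn. [Figure: binary tree with edges labeled 0 (left) and 1 (right); A is the left child of the root, B is the left child of the root's right child, and C and D are the left and right children one level further down.]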
Idea: Consider the paths from the root to each of the leaves A, B, C, D: A : 0, B : 10, C : 110, D : 111. [Figure: the same binary tree, with edges labeled 0 (left) and 1 (right).]
Observe: • This is a prefix code, since each of the leaves has a path ending in it, without continuation. • If the tree is full then we are not “wasting” bits. • If we make sure that the more frequent symbols are closer to the root then they will have a smaller code. [Figure: the same binary tree with leaves A, B, C, D.]
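To make the tree-to-code correspondence concrete, here is a small Python sketch. It assumes a tree is represented as nested pairs (left, right) with symbols as string leaves; that representation and the function names are illustrative, not taken from the slides.

```python
def codes_from_tree(tree, prefix=""):
    """Map each leaf to its root-to-leaf path (0 = left, 1 = right)."""
    if isinstance(tree, str):  # a leaf holding a symbol
        return {tree: prefix}
    left, right = tree
    codes = codes_from_tree(left, prefix + "0")
    codes.update(codes_from_tree(right, prefix + "1"))
    return codes


def is_prefix_free(codes):
    """True if no codeword is a proper prefix of another codeword."""
    words = list(codes.values())
    return not any(
        u != v and v.startswith(u) for u in words for v in words
    )


# The tree from the slide: A at depth 1, B at depth 2, C and D at depth 3.
tree = ("A", ("B", ("C", "D")))
codes = codes_from_tree(tree)
print(codes)
print(is_prefix_free(codes))
```

Since symbols sit only at leaves, no path to one symbol can continue on to another, which is exactly the prefix property.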
Greedy Algorithm: 1. Consider all pairs: <frequency, symbol>. 2. Choose the two lowest frequencies, and make them siblings, with their new parent node having the combined frequency. 3. Iterate.
Greedy Algorithm Example: Alphabet: A, B, C, D, E, F. Frequency table: A: 10, B: 20, C: 30, D: 40, E: 50, F: 60. Total File Length: 210
Algorithm Run (each tree is written as root(left child, right child)):
Start: A10, B20, C30, D40, E50, F60
Merge A10 and B20 into X30: X30(A10, B20), C30, D40, E50, F60
Merge X30 and C30 into Y60: Y60(X30, C30), D40, E50, F60
Reorder by frequency: D40, E50, Y60, F60
Merge D40 and E50 into Z90: Z90(D40, E50), Y60, F60
Reorder by frequency: Y60, F60, Z90
Merge Y60 and F60 into W120: W120(Y60, F60), Z90
Reorder by frequency: Z90, W120
Merge Z90 and W120 into V210: V210(Z90(D40, E50), W120(Y60(X30(A10, B20), C30), F60)), with every left edge labeled 0 and every right edge labeled 1.
The Huffman encoding: A: 1000, B: 1001, C: 101, D: 00, E: 01, F: 11. File Size: 10x4 + 20x4 + 30x3 + 40x2 + 50x2 + 60x2 = 40 + 80 + 90 + 80 + 100 + 120 = 510 bits
Note the savings: The Huffman code: Required 510 bits for the file. Fixed length code: Need 3 bits for 6 characters. File has 210 characters. Total: 630 bits for the file.
Note also: For a uniform character distribution, the Huffman encoding will be equal to the fixed length encoding (exactly so when the alphabet size is a power of 2). Why? Assignment.
Formally, the algorithm: Initialize trees of a single node each. Keep the roots of all subtrees in a priority queue. Iterate until only one tree is left: Merge the two smallest frequency subtrees into a single subtree with two children, and insert it into the priority queue.
Algorithm time: Each priority queue operation (e.g. heap): O(log n). In each iteration: one fewer subtree. Initially: n subtrees, so there are n − 1 iterations. Total: O(n log n) time.
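The priority-queue formulation can be sketched in Python with the standard heapq module. This is a minimal illustration under stated assumptions, not the lecture's own implementation: symbols are assumed to be strings, subtrees are nested tuples, and a running counter breaks frequency ties so that subtrees themselves are never compared. Different tie-breaking can give different codewords, but the total cost stays the same.

```python
import heapq
from itertools import count


def huffman_codes(freq):
    """Build a Huffman code for a dict {symbol: frequency}."""
    if not freq:
        return {}
    tiebreak = count()  # unique sequence number per heap entry
    heap = [(f, next(tiebreak), sym) for sym, f in freq.items()]
    heapq.heapify(heap)
    if len(heap) == 1:  # a single symbol still needs one bit
        return {heap[0][2]: "0"}

    # n - 1 merges, each costing O(log n) heap work.
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, next(tiebreak), (t1, t2)))

    # Read the codewords off the final tree: 0 = left, 1 = right.
    codes = {}
    stack = [(heap[0][2], "")]
    while stack:
        node, prefix = stack.pop()
        if isinstance(node, tuple):
            stack.append((node[0], prefix + "0"))
            stack.append((node[1], prefix + "1"))
        else:
            codes[node] = prefix
    return codes


freq = {"A": 10, "B": 20, "C": 30, "D": 40, "E": 50, "F": 60}
codes = huffman_codes(freq)
total_bits = sum(freq[s] * len(codes[s]) for s in freq)
print(codes)
print(total_bits)  # 510
```

On the frequency table from the example, the codeword lengths come out as 4, 4, 3, 2, 2, 2 for A through F, matching the 510-bit file size computed on the slides.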
Algorithm correctness: Need to prove two things for greedy algorithms: Greedy Choice Property: The choice of local optimum is indeed part of a global optimum. Optimal Substructure Property: When we recurse on the remaining subproblem and combine its solution with the local optimum of the greedy choice, we get a global optimum.
Centerstring Algorithm correctness: Greedy Choice Property: The choice of majority at a column turns out not necessarily to be part of a global optimum. Optimal Substructure Property: A global optimum means that the overall max distance, including the first greedy choice, is smallest.
Example:
0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
-----------------------------------------
1
For the optimum the second index needs to be 0, but if we ignore the first index, a global optimum of the remaining problem may be 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
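The same effect can be checked by brute force on a scaled-down instance. The Python sketch below uses two hypothetical strings of length 4 (0011 and 1111) rather than the longer strings on the slide, and the function names are illustrative.

```python
from itertools import product


def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(s, t))


def radius(center, strings):
    """Largest Hamming distance from `center` to any input string."""
    return max(hamming(center, s) for s in strings)


def best_radius(strings, alphabet="01"):
    """Smallest achievable radius, found by trying every candidate."""
    length = len(strings[0])
    return min(
        radius("".join(c), strings)
        for c in product(alphabet, repeat=length)
    )


strings = ["0011", "1111"]         # scaled-down version of the slide
tails = [s[1:] for s in strings]   # the problem without the first column

first = "1"       # greedy choice for column 1 (a tie, resolved to 1)
tail_opt = "111"  # one optimal center for the remaining columns

print(best_radius(strings))               # optimum of the full problem
print(best_radius(tails), radius(tail_opt, tails))
print(radius(first + tail_opt, strings))  # greedy choice + tail optimum
```

The tail solution 111 is optimal for the remaining columns, yet combining it with the first greedy choice gives radius 2 while the full problem admits radius 1.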
Huffman Algorithm correctness: Need to prove two things: Greedy Choice Property: There exists a minimum cost prefix tree where the two smallest frequency characters are indeed siblings at the end of a longest path from the root. This means that the greedy choice does not hurt finding the optimum.
Algorithm correctness: Optimal Substructure Property: Once we combine the two least frequent elements into a single element, producing a smaller problem, an optimal solution to the smaller problem yields an optimal solution to the original problem when the two elements are added back as children of the combined element.
Algorithm correctness: There exists a minimum cost tree where the minimum frequency elements are longest path siblings: Assume that is not the situation. Then there are two other elements that are siblings at the end of a longest path. Say a,b are the elements with smallest frequency and x,y the sibling elements at the end of a longest path.
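Algorithm correctness: In the code tree CT we know about depth and frequency: $d_a \le d_y$ and $f_a \le f_y$. [Figure: code tree CT with leaf a at depth $d_a$ and sibling leaves x, y at depth $d_y$.]
Algorithm correctness: We also know about code tree CT: $\sum_{\sigma} f_\sigma d_\sigma$ is the smallest possible. Now exchange a and y. [Figure: code tree CT with leaf a at depth $d_a$ and sibling leaves x, y at depth $d_y$.]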
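Algorithm correctness: Let CT′ be the tree obtained from CT by exchanging a and y. Then
$\mathrm{Cost}(CT) = \sum_{\sigma} f_\sigma d_\sigma = \sum_{\sigma \ne a,y} f_\sigma d_\sigma + f_a d_a + f_y d_y \ge \sum_{\sigma \ne a,y} f_\sigma d_\sigma + f_y d_a + f_a d_y = \mathrm{Cost}(CT')$.
The inequality holds because $d_a \le d_y$ and $f_a \le f_y$ give $(f_a d_a + f_y d_y) - (f_y d_a + f_a d_y) = (f_y - f_a)(d_y - d_a) \ge 0$. Since CT has the smallest possible cost, CT′ is optimal as well. [Figure: code tree CT′ with y at depth $d_a$ and sibling leaves x, a at depth $d_y$.]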
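A numeric sanity check of the exchange step, as a Python sketch. The frequencies are those of the earlier Huffman example; the starting depth assignment is a hypothetical non-optimal tree chosen for illustration (the Huffman depths with A and D swapped), and the function names are not from the slides.

```python
def cost(freq, depth):
    """Cost of a code tree: sum over symbols of frequency times depth."""
    return sum(freq[s] * depth[s] for s in freq)


def exchange(depth, a, y):
    """Depths after swapping the positions of leaves a and y."""
    swapped = dict(depth)
    swapped[a], swapped[y] = depth[y], depth[a]
    return swapped


freq = {"A": 10, "B": 20, "C": 30, "D": 40, "E": 50, "F": 60}

# Hypothetical tree in which the rarest symbol A sits high (depth 2)
# while the more frequent D sits on a longest path (depth 4).
depth = {"A": 2, "B": 4, "C": 3, "D": 4, "E": 2, "F": 2}

a, y = "A", "D"
before = cost(freq, depth)
after = cost(freq, exchange(depth, a, y))

# The change equals (f_y - f_a) * (d_y - d_a), which is never negative.
gain = (freq[y] - freq[a]) * (depth[y] - depth[a])
print(before, after, gain)
```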
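Algorithm correctness: Now do the same thing for b and x. [Figure: code tree with leaf b at depth $d_b$ and sibling leaves x, a at depth $d_x$.]
Algorithm correctness: And get an optimal code tree CT″ where a and b are siblings with the longest paths. [Figure: code tree CT″ with x at depth $d_b$ and sibling leaves b, a at depth $d_x$.]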