110 likes | 239 Views
Homework #5. New York University Computer Science Department Data Structures Fall 2008 Eugene Weinstein. Homework #4 Review. Huffman coding is a variable-length binary encoding for text We implemented Huffman's optimal code finding algorithm (book 389-395)
E N D
Homework #5 New York University Computer Science Department Data Structures Fall 2008 Eugene Weinstein
Homework #4 Review • Huffman coding is a variable-length binary encoding for text • We implemented Huffman's optimal code finding algorithm (book 389-395) • Builds tree representing shortest possible code • Input for HW#4: letters, frequencies: • A 20 E 24 ... • Construct Huffman tree • Navigate tree to find code: • c: 0, a: 10, b: 11
Homework #5 Overview • Given a document • Calculate letter frequencies • Construct Huffman code • Encode document • Calculate memory savings of Huffman binary encoding vs 8-bit ASCII • Correctly decode document • We can use Huffman code building algorithm from HW#4 • So we will keep HuffmanTree and HuffmanNode
Organization • The new code for this assignment should go into HuffmanConverter.java • The filename of file to encode is passed as a parameter on the command line • So if my file is foo.txt, I should be able to run • java HuffmanConverter foo.txt • Then foo.txt show up in args[0] • If you use an IDE, specify command-line options through the menus • Test inputs and outputs linked from assignment page (2007 version)
HuffmanConverter Instance Vars • String contents - stores file to process • Lines are separated by '\n' - line break character • e.g., twoLines = line1 + '\n' + line2; • HuffmanTree huffmanTree - output of HW4 • int count[] - frequencies in input file • Indexed on ASCII value of characters, e.g., count[(int)'a'] is frequency of 'a' • String code[] - binary string per character • Also indexed on ASCII value, e.g., code[(int)'a'] == "10001"
To Implement • readContents() - reads in a file and stores in String contents • recordFrequencies() - process file stored in contents and store frequencies in count[] • frequenciesToTree() - use HW4 code to produce Huffman tree • treeToCode() - slight modification of HW4: traverse Huffman tree and populate code[] • encodeMessage() - use code[] to encode • decodeMessage() - use inverse of code[]
Implementation Notes • readContents() can use Scanner • Read a line at a time, and append to contents inserting '\n' to separate lines • recordFrequencies(): iterate over contents one character at a time • frequenciesToTree() • Very similar to main() method of HW4 • Create a BinaryHeap object • For every non-zero-count letter, create a HuffmanNode object, insert into heap • Then run Huffman algorithm
Implementation Notes, Cont'd • treeToCode() • Similar to printCode() of HW4 • Instead of printing code, store in code[] • encodeMessage() • For each character of contents, look up its binary string in code[], append
Implementation Notes, Cont'd • decodeMessage() • Need to implement inverse mapping of code[]: binary strings to characters • Several possible implementations • Traverse Huffman tree as you read binary string, output character when you reach a leaf • Build HashMap mapping strings to ASCII values of characters
HashMap • An array maps integers to Objects • e.g., String args[]: args[i] returns ith String • A HashMap maps Objects to Objects • Access with put() and get(), e.g., • HashMap ids = new HashMap(); • ids.put("Alice", 123456789); • ids.put("Ben", 321654987); • int id = (Integer) ids.get("Alice"); • // id gets 123456789 • For decode, map bit Strings to characters
Homework #5 Tips • Keep checking intermediate results • Make use of sample outputs here • Print out intermediate results! • You might need special cases for newline ('\n') • Your encoding might differ from the examples • Depends on the BinaryHeap implementation • Same-frequency items are returned in arbitrary order (e.g., in love_poem_58, 'N', '-', '.', 'W', and 'p' all have frequency one) • However, Huffman encoding length must match! • Guaranteed to be shortest-length encoding