Problem Set 7: Huff

Assigned: Wednesday November 16, 2016
Due: Monday November 28, 2016, midnight
Points: 12

In this and the next problem set you will team up with one other person to design and develop a pair of programs Huff.java and Puff.java that perform Huffman encoding and decoding (resp.) of text files. Huffman coding is a lossless compression algorithm that achieves a very respectable rate of compression. Your team's pair of programs should have the property that for every text file F.txt, feeding F.txt as a command line argument to Huff should produce a compressed file F.zip. Then feeding F.zip as a command line argument to Puff should produce a file that is indistinguishable from the original F.txt.

Testing Your Huff Program

You can download the reference implementation in HuffTest.zip. Unzip it, then run the Puff program, providing the output file produced by your Huff program as input. Puff should produce a file that is identical to the file you started with.

Generalities

This is a somewhat complicated program with two major parts (i.e., Huff and Puff). You should think carefully about the overall design of the program with an eye toward figuring out what ADTs might be useful and which pieces of the implementation should be shared between the Huff and Puff programs. Obviously, the shared parts should be encapsulated in separate files with appropriate methods and documentation.

You are going to make mistakes. In order to figure out what is wrong it will be important to be able to print out string representations of the various items in play. For this reason, it is really pretty essential that each ADT that you create have a reasonable toString method.

Working with Binary Files

The Huff and Puff programs write and read binary files, so it will be helpful to have a tool that lets you view the contents of binary files. If you are using a Unix-based system such as OS X then you can always use the built-in emacs editor from the command line, as in:

    > emacs file.zip

The emacs editor is very powerful but super-arcane. You can put emacs into "hexadecimal mode" by typing esc x and then hexl-mode. Note that esc refers to the key labeled "esc", upper left on your keyboard. You can then move around the file using the arrow keys. You can exit emacs by typing control-x control-c (i.e., holding down the key labeled control and then hitting x and then c while still holding down the control key). And if something goes wrong you can cancel things in emacs by typing control-g.

If you are using Windows (or if emacs doesn't excite you) you can always look around on the web for free hex editors. I found HxD for Windows and HexEditor for the Mac.
I tried the latter and it worked well enough. I didn't try the former but the reviews seemed OK.

Ingredients

Input/Output

You can download FileIO.zip and use the four routines there to support the I/O required of your program.

Working with Bits and Variable Length Patterns of Bits

Huffman coding represents text using variable length bit strings. Short bit strings can be represented in high-level programming languages like Java by using values of type int. Ints in Java are 32 bits long, but for the purposes of this explanation, let's say they are 4 bits. For example, given int x = 0;, we can "turn on" the rightmost bit of x by using the bitwise or operator '|' (NB: one bar, not two!) as in x = x | 1;. After this line is executed, our 4-bit version of x has the binary value 0001. By executing x = x * 2;, we can shift the bits in x one position to the left, which leaves x with the value 0010. Thus, we can turn bits on and move them around using bitwise or and multiplication (or division) by 2.

Huffman codes are variable length codes. For example, the letter 'A' may be represented by the 3-bit string 101 while the letter 'B' may be represented by the 2-bit string 11. So in order to represent a variable length bit string in Java it will be convenient to pair the int described above with a second integer specifying the length of the string of bits. Of course, when we have two values that are related in this way, it usually makes sense to think about encapsulating them in an ADT with appropriate operations. As you know, in Java, an ADT can be specified by an interface and implemented by a class.

Symbol Tables: Representing Information about Input Symbols

Both the Huff and Puff programs will require a table data structure that allows them to look up information about symbols (i.e., characters) that occur in the input text. Tables that associate symbols with information are usually called symbol tables. This is the subject of section 3.1 of our textbook.
For the purposes of this project, the symbols are characters. As you know, characters are represented by small integers, usually 8 or 16 bits. For example, the ASCII-assigned integer representation of the letter 'A' is 65. This is a base 10 numeral; it is more common to use its hexadecimal equivalent 0x41. (Note that 0x41 = 4 x 16^1 + 1 x 16^0 = 4 x 16 + 1 x 1 = 64 + 1 = 65.) In order to print the character associated with one of these numbers, the number would need to be cast to the char type. For most purposes in this application though, you will find it more convenient to work with the characters in the source file as ints or even wrapped as Integers.

There are many ways to implement symbol tables in Java. For the purposes of this problem set, you'll want to look at the java.util.Map interface. You can use any implementing class that you would like but I recommend using the java.util.HashMap implementation of the Map interface.

What information will you need to store in the symbol table? Two different pieces of information about each input symbol will be required in the Huff program. First, an integer frequency will need to be computed that represents the number of occurrences of the given symbol in the input text. The second piece of information required for each symbol is the binary bit pattern assigned to the symbol by the algorithm. This latter piece will ultimately be written to the output file.

The frequency information is easily computed from the input file simply by reading the characters in the file and counting their occurrences. The frequency information will be needed in order to construct the Huffman coding tree as described below. The binary bit pattern can be represented as an object encapsulating the pair of ints as described above. Since we now have another pair of related values (symbol frequency and bit string), it again makes sense to think of encapsulating these two items in an ADT of some kind.
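
To make the two encapsulation ideas above concrete, here is one possible sketch, not a required design. The names SymbolTableSketch, BitPattern, SymbolInfo and countFrequencies are hypothetical and are not prescribed by the assignment. The sketch reads symbols from a String rather than through the FileIO routines, whose signatures are not shown here, and it uses the shift operator << in place of multiplication by 2, which has the same effect on an int.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; class and method names are assumptions, not part of the assignment.
class SymbolTableSketch {

    /** A variable-length bit string: the low `length` bits of `bits`. */
    static final class BitPattern {
        private final int bits;
        private final int length;

        BitPattern(int bits, int length) {
            if (length < 0 || length > 32) {
                throw new IllegalArgumentException("length must be in 0..32");
            }
            this.bits = bits;
            this.length = length;
        }

        int bits() { return bits; }
        int length() { return length; }

        /** Returns a new pattern with one more bit (0 or 1) appended on the right. */
        BitPattern append(int bit) {
            // Shift left by one (same as multiplying by 2), then turn on the
            // rightmost bit with bitwise or if the new bit is 1.
            return new BitPattern((bits << 1) | (bit & 1), length + 1);
        }

        @Override
        public String toString() {
            StringBuilder sb = new StringBuilder();
            for (int i = length - 1; i >= 0; i--) {
                sb.append((bits >> i) & 1);   // extract bit i, most significant first
            }
            return sb.toString();
        }
    }

    /** Per-symbol information: frequency plus the assigned code (null until assigned). */
    static final class SymbolInfo {
        int frequency;
        BitPattern code;

        @Override
        public String toString() {
            return "freq=" + frequency + ", code=" + code;
        }
    }

    /** Builds a symbol table by counting occurrences of each character in text. */
    static Map<Integer, SymbolInfo> countFrequencies(String text) {
        Map<Integer, SymbolInfo> table = new HashMap<>();
        for (int i = 0; i < text.length(); i++) {
            int symbol = text.charAt(i);          // work with the character as an int
            SymbolInfo info = table.get(symbol);  // autoboxed to Integer for the lookup
            if (info == null) {
                info = new SymbolInfo();
                table.put(symbol, info);
            }
            info.frequency++;
        }
        return table;
    }

    public static void main(String[] args) {
        Map<Integer, SymbolInfo> table = countFrequencies("ABBA");
        // Build the example code 101 for 'A' one bit at a time.
        table.get((int) 'A').code = new BitPattern(0, 0).append(1).append(0).append(1);
        System.out.println((char) 65 + " -> " + table.get(65)); // A -> freq=2, code=101
    }
}
```

Note the cast in table.get((int) 'A'): with Integer keys, passing a bare char would be boxed as a Character and the lookup would return null. This is one reason it is convenient to treat symbols as ints throughout.
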

Huffman Trees

Another important ingredient for the program is the Huffman coding tree. A Huffman tree is a simple binary tree data structure in which each node has an integer weight. Huffman trees are Comparable: one tree is compared against another by comparing their weights.

What other fields are required? Leaf nodes require an additional integer symbol field while interior nodes (i.e., non-leaf nodes) require two Huffman trees, left and right. In order to simplify the trees it's probably best to just have one tree node type with all four fields. When processing leaf nodes the symbol and weight fields would be used and the left and right fields would be ignored. For interior nodes the weight, left and right fields would be used and the symbol field would be ignored. I would like to stress the importance of having a reasonable toString method for Huffman trees. It will be essential for debugging.

As we have discussed, the Huffman coding tree can be constructed from the information in the symbol table using a priority queue. See the description of the algorithm below.

Priority Queues

See the Java JRE documentation of the PriorityQueue class.

Protocols

In order to ensure that we are all working on the same problem, your program is required to conform to the following protocols:
Huff : the Huffman Coding Algorithm