The Digital World-Lab 4


Assigned: February 13
Due: Monday, February 23


These are some exercises on data compression.   You are to perform a number of simple experiments with the sample text, image, and audio files that I have provided, and then try as best you can to explain the results.  Huffman coding, which we explained in detail in class, can provide a good intuition for this and help you to explain most of the results in Exercise 1, but the actual lossless compression method that you have built in to your computer (zip) works on different principles.  Zip uses a dictionary-based approach and will compress well if it encounters long sequences of bytes that occur repeatedly in the file.  For example, if the bytes in the file are (represented here in decimal)

0 1 2 .....255 0 1 2 ......255 0 1 2 ....255

repeated over and over again, zip will compress these to a very small file, whereas Huffman coding will not compress at all, since it  gives equal weight to all possible byte values.  (On the other hand, Huffman coding in which each pair of successive bytes is treated as a single symbol will perform very well).

1.  The sample files for this exercise consist of (a) the text of The Telltale Heart; (b) a file consisting of ten thousand randomly-generated ASCII characters; (c) bitmapped images of three paintings, along with a bitmapped image that is entirely blue; (d) our wave file of a brief passage played on the flute; (e) a square wave (the solution to one of the problems in Lab 2).

Download each of these files, compress it, note both the original file size and the size of the compressed file, and compute the compression ratio.  Try to explain the different compression ratios obtained for the various text files, various image files, and various audio files.

Solution:  The exact results may vary depending upon exactly how the compression algorithm used by your computer functions.  I obtained the following results:

File
Original Size
Compressed Size
Compression Ratio
Telltale Heart
11,176 bytes
5,000 bytes
44.7%
Random Text
9,999 bytes
8,412 bytes
84.1%
Blue bitmap
645,102 bytes
1,157 bytes
0.18%
Flute
72,244 bytes
42,218 bytes
58.4%
Square Wave
28,204 bytes
338 bytes
1.2%

First let's try to account for the files that compress to very small sizes:  Zip looks for long sequences of bytes that occur repeatedly in files.  Since both the blue bitmap file and the square wave file consist of the same of bytes repeated many times over, there is a lot of compression.  Note that Huffman coding would also work well for this example, since the bitmap uses only three byte values (see 3 below) and the square wave only two.  But Zip works better.  The opposite extreme is the random text:  There is not a lot of repetition of long patterns, and every one of the 95 printable characters appears with equal frequency.  We could actually encode 95 characters with seven, rather than eight bits, and indeed, the compression ratio is close to 7/8.   Note that Huffman coding would do as badly.

For the intermediate examples,  the English text and the flute, there is a good deal of repetition (common words, the same cycle repeated for as long as a note is held) and so there is a respectable amount of compression.

2.  (a) What happens if you try to compress a file that already has been compressed?  Neither Mac OS nor Windows will let you take a file of type .zip and try to compress it, but you can trick the operating system by renaming the files---for instance, just change the '.zip' to .txt'.  This doesn't change the bytes in the file at all, but you can now go ahead and apply zip again.  Do this with several of the zip files you created in Exercise 1.  Note the results, and try to account for them.

Solution: When I tried this with the zipped version of the random text file, it actually INCREASED in size to about 8900 bytes;and when I tried it with the blue bitmap, it actually did decrease further, to 457 bytes.  But when I tried to compress this yet another time, the file size increased back to 573 bytes.

*(b) Explain why it is not possible to devise a data compression method that is guaranteed to reduce the size of every file.

If this were possible, you could repeatedly compress the file and reduce it to 0 bytes!

3. Suppose we applied Huffman coding to the .bmp file that is entirely blue.  What would the size of the compressed file be?  (For purposes of this problem, you can assume that the .bmp file consists EXCLUSIVELY of the encodings of the pixels, and ignore the 54-bit header.)

We would have 2/3 of the bytes equal to 255 and 1/3 of the bytes equal to 0.  Huffman coding would choose just one bit for each of the possible values, so the file would compress to 1/8 of its original size.  Note that Zip gives a better result because it works with long sequences of bytes.