The Digital World-Lab 4
Assigned: February 13
Due: Monday, February 23
These are some exercises on data compression. You are to perform
a number of simple experiments with the sample text, image, and audio
files that I have provided, and then try as best you can to explain the
results. Huffman coding, which we explained in detail in class,
can provide a good intuition for this and help you to explain most of
the results in Exercise 1, but the actual lossless compression method
that you have built in to your computer (zip) works on different
principles. Zip uses a dictionary-based approach and will
compress well if it encounters long sequences of bytes that occur
repeatedly in the file. For example, if the bytes in the file are
(represented here in decimal)
0 1 2 .....255 0 1 2 ......255 0 1 2 ....255
repeated over and over again, zip will compress these to a very small
file, whereas Huffman coding will not compress at all, since it
gives equal weight to all possible byte values. (On the other
hand, Huffman coding in which each pair of successive bytes is treated
as a single symbol will perform very well).
1. The sample files for this exercise consist of (a) the text of The Telltale Heart; (b) a file
consisting of ten thousand randomly-generated ASCII characters; (c)
bitmapped images of three paintings, along with a bitmapped image that
is entirely blue; (d) our wave file of a brief passage played on the
flute; (e) a square wave (the solution to one of the problems in Lab 2).
Download each of these files, compress it, note both the original file
size and the size of the compressed file, and compute the compression
ratio. Try to explain the different compression ratios obtained
for the various text files, various image files, and various audio
files.
Solution: The exact results may
vary depending upon exactly how the compression algorithm used by your
computer functions. I obtained the following results:
File
|
Original Size
|
Compressed
Size
|
Compression
Ratio
|
Telltale Heart
|
11,176 bytes
|
5,000 bytes
|
44.7%
|
Random Text
|
9,999 bytes
|
8,412 bytes
|
84.1%
|
Blue bitmap
|
645,102 bytes
|
1,157 bytes
|
0.18%
|
Flute
|
72,244 bytes
|
42,218 bytes
|
58.4%
|
Square Wave
|
28,204 bytes
|
338 bytes
|
1.2%
|
First let's try to account for the
files that compress to very small sizes: Zip looks for long
sequences of bytes that occur repeatedly in files. Since both the
blue bitmap file and the square wave file consist of the same of bytes
repeated many times over, there is a lot of compression. Note
that Huffman coding would also work well for this example, since the
bitmap uses only three byte values (see 3 below) and the square wave
only two. But Zip works better. The opposite extreme is the
random text: There is not a lot of repetition of long patterns,
and every one of the 95 printable characters appears with equal
frequency. We could actually encode 95 characters with seven,
rather than eight bits, and indeed, the compression ratio is close to
7/8. Note that Huffman coding would do as badly.
For the intermediate examples,
the English text and the flute, there is a good deal of repetition
(common words, the same cycle repeated for as long as a note is held)
and so there is a respectable amount of compression.
2. (a) What happens if you try to compress a file that already
has been compressed? Neither Mac OS nor Windows will let you take
a file of type .zip and try to compress it, but you can trick the
operating system by renaming the files---for instance, just change the
'.zip' to .txt'. This doesn't change the bytes in the file at
all, but you can now go ahead and apply zip again. Do this with
several of the zip files you created in Exercise 1. Note the
results, and try to account for them.
Solution: When I tried this with the
zipped version of the random text file, it actually INCREASED in size
to about 8900 bytes;and when I tried it with the blue bitmap, it
actually did decrease further, to 457 bytes. But when I tried to
compress this yet another time, the file size increased back to 573
bytes.
*(b) Explain why it is not possible to devise a data compression method
that is guaranteed to reduce the size of every file.
If this were possible, you could
repeatedly compress the file and reduce it to 0 bytes!
3. Suppose we applied Huffman coding to the .bmp file that is entirely
blue. What would the size of the compressed file be? (For
purposes of this problem, you can assume that the .bmp file consists
EXCLUSIVELY of the encodings of the pixels, and ignore the 54-bit
header.)
We would have 2/3 of the bytes equal
to 255 and 1/3 of the bytes equal to 0. Huffman coding would
choose just one bit for each of the possible values, so the file would
compress to 1/8 of its original size. Note that Zip gives a
better result because it works with long sequences of bytes.