## Description

Assignment 2

data compression with Hu?man codes

overview

Data compression involves reducing the amount of space taken up by ?les. Anyone who has listened to an

MP3 ?le or extracted a ?le from a zip archive has used compression. Reasons for compressing a ?le include

saving disk space and reducing the time required to transfer the ?le to another computer.

There are two broad classes of compression algorithms: lossy compression and lossless compression. Lossy

compression means that the ?le gets smaller, but that some information is lost. MP3 is a form of lossy

compression: it saves space by throwing away some audio information, and therefore the original audio ?le

can not be recovered. Lossless compression means that the ?le gets smaller and that the original ?le can be

recovered from the compressed ?le. FLAC is a form of lossless audio compression: it can’t save as much

space as MP3, but it does let you perfectly recover the original ?le.

In this assignment, we’ll be working with a lossless kind of compression called Hu?man coding. When

you’re ?nished the assignment, you’ll be able to compress a ?le, and then decompress that ?le to get back

the original ?le.

Background

Fixed- and Variable-Length Codes

Suppose that we had an alphabet of four letters: a, b, c, and d. Computers store only 0s and 1s (each 0 and

1 is known as a bit), not letters directly. So, if we want to store these letters in a computer, it is necessary

to choose a mapping from these letters to bits. That is, it’s necessary to agree on a unique code of bits for

each letter.

How should this mapping work? One option is to decide on the length of the codes, and then assign

letters to unique codes in whatever way we wish. Choosing a code length of 2, we could assign codes as

follows: a gets 00, b gets 01, c gets 10, and d gets 11. (Note that we chose a code length of 2 because it is

the minimum that supports an alphabet of four letters. Using just one bit gives you only two choices | 0

and 1 | and that isn’t su?cient for our four-letter alphabet.) How many bits are required to encode text

using this scheme? If we have the text aaaaa, for example, then the encoding takes 10 bits (two per letter).

Another option is to drop the requirement that all codes be the same length, and assign the shortest

possible code to each letter. This might give us the following: a gets 0, b gets 1, c gets 10, and d gets 11.

Notice that the codes for a and b are shorter than before, and that the codes for c and d are the same length

as before. Using this scheme, we’ll very likely use fewer bits to encode text. For example, the text aaaaa

now takes only 5 bits, not 10!

1

Unfortunately, there’s a catch. Suppose that someone gives us the encoding 0011. What text does this

code represent? It could be aabb if you take one character at a time from the code. However, it could also

equally be aad if you break up the code into the pieces 0 (a), 0 (a), and 11 (d). Our encoding is ambiguous!

: : : Which is really too bad, because it seemed that we had a way to reduce the number of bits required to

encode text. Is there some way we can still proceed with the idea of using variable-length codes?

To see how we can proceed, consider the following encoding scheme: a gets 1, b gets 00, c gets 010, and

d gets 011. The trick here is that no code is a pre?x of any other code. This kind of code is called a prefix

code, and it leads to unambiguous decoding. For example, 0000 decodes to the text bb; there is no other

possibility.

What we have here is seemingly the best of both worlds: variable-length codes (not ?xed-length codes),

and unambiguous decoding. However, note that some of our codes are now longer than before. For example,

the code for c is now 010. That’s a three-bit code, even longer than the 2-bit codes we were getting with

the ?xed-length encoding scheme. This would be OK if c letters didn’t show up very often. If c letters were

rare, then we don’t care much that they cause us to use 3 bits. If c letters did show up often, then we’d

worry because we’d be spending 3 bits for every occurrence of c.

In general, then, we want to have short codes for frequently-occurring letters and longer codes for letters

that show up less often. For example, if the frequency of a in our text is 10000 and the frequency of b in

our text is 50, we’d hope that the code for a would be much shorter than the code for b.

Goal of Algorithm

Our goal is to generate a best pre?x code for a given text. The best pre?x code depends on the frequencies

of the letters in the text. What do we mean by \best” pre?x code? It’s the one that minimizes the average

number of bits per symbol or, alternately, the one that maximizes the amount of compression that we get.

If we let C be our alphabet, f(x) denote the frequency of symbol x, and d(x) denote the number of bits

used to encode x, then we’re seeking to minimize the sum P

x2C

f(x)d(x). That is, we want to minimize

the sum of bits over all symbols, where each symbol contributes its frequency times its length.

Let’s go back to our four-letter alphabet: a, b, c, and d. Suppose that the frequencies of the letters in

our text are as follows: a occurs 600 times, b occurs 200 times, c occurs 100 times, and d occurs 100 times.

A best pre?x code for these frequencies turns out to be the one we gave earlier: a gets 1, b gets 00, c gets

010, and d gets 011. Calculate the above sum and you’ll get a value of 1600. There are other pre?x codes

that are worse; for example: a gets 110, b gets 0, c gets 10, and d gets 111. Calculate the sum for this and

convince yourself that it is worse than 1600!

2

Hu?man’s algorithm makes use of binary trees to represent codes. Each leaf is labeled with a symbol

from the alphabet. To determine the code for such a symbol, trace a path from the root to the symbol’s

leaf. Whenever you take a left branch, append a 0 to the code; whenever you take a right branch, append a

1 to the code. The Hu?man tree corresponding to the best pre?x code above is the following:

a

b

c d

Hu?man’s algorithm generates a tree corresponding to a best pre?x code. You’ll see proofs of this fact

in future courses; for now, you’ll implement Hu?man’s algorithm.

Preliminaries

For this assignment, we’re asking you to do some background reading on several topics that are not explicitly

taught in the course. (We don’t often give students su?cient opportunity to research on their own and build

on the available knowledge. This assignment is an attempt to remedy this.) Here’s what you’ll want to

learn:

? Hu?man’s algorithm. You’ll want to understand how Hu?man’s algorithm works on a conceptual level

before working on the implementation.

? Python bytes objects. We’ll be reading and writing binary ?les, so we’ll be using sequences of bytes

(not sequences of characters). bytes objects are like strings, except that each element of a bytes

object holds an integer between 0 and 255 (the minimum and maximum value of a byte, respectively).

We will consider each byte in an uncompressed ?le as a symbol.

? Binary numbers. Background on binary numbers will be useful, especially when debugging your code.

Although it’s not necessarily crucial for this assignment, you might want to also familiarize yourselves

with “endianness” (little endian in particular), to get an idea of how computers store a sequence of

bytes in memory.

You’re strongly encouraged to ?nd and use online resources to learn this background material. Be sure to

cite your sources in your assignment submission; include Python comments giving the locations of resources

that you found useful.

Your Task

Your task is to complete the functions in huffman.py. Ultimately, you will be able to call the compress and

uncompress functions to compress and uncompress a ?le, respectively.

3

The code in huffman.py uses two types of nodes from nodes.py. The ?rst, HuffmanNode, is a node in

a Hu?man tree. Hu?man trees are used to compress and uncompress data. The second, ReadNode, is the

representation of a node in a compressed ?le. It will be necessary to reconstruct the root HuffmanNode of a

Hu?man tree from the ReadNodes stored in the ?le.

We now give a suggested order for writing the functions in huffman.py.

Building the Tree

? Implement make_freq_dict. This function generates and returns a frequency dictionary.

? Implement huffman_tree. This function takes a frequency dictionary and returns the root

HuffmanNode of the Hu?man tree corresponding to a best pre?x code. (There’s a tricky case to watch

for here: you have to think of what to do when the frequency dictionary has only one symbol!)

? Implement get_codes. This function takes a Hu?man tree and returns a dictionary that maps each

byte in the tree to its code. The dictionary returned by get_codes is very useful for compressing data,

as it gives the mapping between a symbol and the encoding for that symbol.

? Implement number_nodes. This assigns a number to each internal node. These numbers will be written

when compressing a ?le to help represent relationships between nodes.

? Now is a good time to implement avg_length. This function returns the average number of bits

required to compress a symbol of text.

Compressing Data

Now we know how to generate a Hu?man tree, and how to generate codes from that tree. Function

generate_compressed uses the given frequency dictionary to generate the compressed form of the provided text.

Every byte comprises 8 bits. For example, 00001010 is a byte. Recall that each symbol in the text is

mapped to a sequence of bits, such as 00 or 111. generate_compressed takes each symbol in turn from

text, looks-up its code, and puts the code’s bits into a byte. Bytes are ?lled-in from bit 7 (leftmost bit) to

bit 0 (rightmost bit). Helper function bits_to_byte will be useful here. You might also ?nd it helpful to

use byte_to_bits for debugging; it gives you the bit-representation of a byte.

Note that a code may end up spanning multiple bytes. For example, suppose that you’ve generated six

bits of the current byte, and that your next code is 001. There’s room for only two more bits in the current

byte, so only the 00 can go there. Then, a new byte is required and its leftmost bit will be 1.

Implement generate_compressed.

Writing a Compressed File

generate_compressed gives us the compressed version of the text. Suppose that you send someone a

le containing that compressed text. How can they decompress that result to get back the original ?le?

Unfortunately, they can’t! They won’t have the Hu?man tree or codes used to compress the text, so they

won’t know how the bits in the compressed ?le correspond to text in the original ?le.

This means that some representation of the tree must also be stored in the compressed ?le. There are

many ways to do this: we have chosen one that is suboptimal but hopefully instructive (DanZ at UTM

recalls the technique from authors of old PC games who used Hu?man coding to compress a game onto a

single disc).

4

Follow along in function compress throughout this discussion.

Here is what we generate so that someone can later recover the original ?le from the compressed version:

? (1) The number of internal nodes in the tree

? (2) A representation of the Hu?man tree

? (3) The size of the uncompressed ?le

? (4) The compressed version of the ?le

Functions are already available (in starter code or written by you) for all steps except (2). Here is what

we do for step 2, in function tree_to_bytes:

? Internal nodes are written out in postorder.

? Each internal node of the tree takes up four bytes. Leaf nodes are not output, but instead are stored

along with their parent nodes.

? For a given internal node n, the four bytes are as follows. The ?rst byte is a 0 if the left subtree of n

is a leaf; otherwise, it is 1. The second byte is the symbol of the left child if the left subtree of n is

a leaf; otherwise, it is the node number of the left subtree. The third and fourth bytes are analogous

for the right subtree of n. That is, the third byte is a 0 if the right subtree of n is a leaf; otherwise, it

is 1. The fourth byte is the symbol of the right child if the right subtree of n is a leaf; otherwise, it is

the node number of the right subtree.

Implement tree_to_bytes.

Decompressing Text

There are two functions that work together to decompress text: a generate_tree_xxx function to generate

a tree from its representation in a ?le, and a generate_uncompressed function to generate the text itself.

First, the functions to generate a Hu?man tree from its ?le representation. Note that you are being asked

to write two different generate_tree_xxx functions. Once you get one of them working, you can successfully

uncompress text. But we’re asking for both, so don’t forget to do the other one!

? generate_tree_general takes a list of nodes representing a tree in the form that it is stored in a

le. It reconstructs the Hu?man tree and returns its root HuffmanNode. This function assumes nothing

about the order in which the nodes are given in the list. As long as it is given the index of the root

node, it will be able to reconstruct the tree. This means that the tree nodes could have been written

in postorder (as your assignment does), preorder, inorder, level-order, random-order, whatever. Note

that the number of each node in the list corresponds to the index of the node in the list. That is, the

node at index 0 in the list has node-number 0, the node at index 1 in the list has node-number 1, etc.

? generate_tree_postorder takes a list of nodes representing a tree in the form that it is stored in a

le. It reconstructs the Hu?man tree and returns its root HuffmanNode. This function assumes that

the tree is given in the list in postorder (i.e. in the form generated by your compression code). Unlike

in generate_tree_general, you are not allowed to use the node-numbers stored in the ReadNodes to

reconstruct the tree (in fact, these node-numbers may not even be set correctly | check the docstring

for an example). That is, if byte 1 of an internal node is 1 (meaning that the left subtree is not simply

a leaf), then you are not allowed to look at byte 2. Similarly, if byte 3 is 1 (meaning that the right

subtree is not simply a leaf), then you are not allowed to look at byte 4.

5

To test these two functions as you write them, notice that tree_to_bytes, combined with

bytes_to_nodes, generates tree representations in exactly the form required by generate_tree_postorder.

To test generate_tree_general, things are not so simple. (You have no function that generates a tree in

anything except postorder. And generate_tree_general is supposed to work no matter what, as long as

you have the index of the root node in the list.) You could, for example, write new number_nodes functions that number the trees in di?erent ways (e.g. preorder, random-order, etc.), and write corresponding

tree_to_bytes functions that write out the tree according to the order speci?ed by the numbering.

Next, the function to decompress text using a Hu?man tree:

? generate_uncompressed takes a Hu?man tree, compressed text, and number of bytes to produce,

and returns the decompressed text. There are two possible approaches: one involves generating a

dictionary that maps codes to symbols, and the other involves using the compressed text to traverse

the tree and discovering a symbol whenever a leaf is reached.

Another Function

For more practice, we are requiring you (as part of your mark) to implement improve_tree. This has

nothing to do with the rest of the program, so it won’t be called by any of your other functions. It is still a

required function for you to write, however.

Suppose that you are given a Hu?man tree and a frequency dictionary. We say that the Hu?man tree

is suboptimal for the given frequencies if it is possible to swap a higher-frequency symbol lower in the tree

with a lower-frequency symbol higher in the tree.

improve_tree performs all such swaps, improving the tree as much as possible without changing the

tree’s shape. Note, of course, that Hu?man’s algorithm will never give you a suboptimal tree, so you can’t

hope to improve those trees. But you can imagine that someone gives you a suboptimal tree, and you want

to improve it as much as possible without changing it’s shape.

Testing Your Work

In addition to the doctests provided in huffman.py, you are encouraged to test your functions on small

sample inputs. Make sure that your functions are doing the right thing in these small cases. Finding a

bug on a huge example tells you that you have a bug but doesn’t give you much information about what is

causing the bug.

Once you’ve suitably tested your functions, you can try the compressor and decompressor on arbitrary

les. You’ll be happy to know that you can compress and decompress arbitrary ?les of any type | although

the amount of compression that you get will vary widely!

When you run huffman.py, you have a choice of compressing or uncompressing a ?le. Choose an option

and then type a ?lename to compress or decompress that ?le. When you compress ?le a, a compressed

version will be written as a.huf. Similarly, when you decompress ?le a.huf, a decompressed version will be

written as a.huf.orig. (We can’t decompress ?le a.huf back to simply a, because then your original a ?le

would be overwritten!)

We’ve provided some sample ?les to get you started:

? book.txt is a public-domain book from Project Gutenberg. Text ?les like this compress extremely

well using Hu?man compression.

? dan.bmp is a picture of Dan in bitmap (bmp) format. This is an uncompressed image format and so

compresses very well.

6

? music.wav is a few seconds of a song from Mystic Ark (one of Dan’s favourite game soundtracks). The

song is in wav format; this ?le compresses terribly with Hu?man compression.

Throw other ?les at your compressor and see what kind of compression you can get!

Property Tests

In ?le test_huffman_properties.py, we have provided some property tests that you can run on your code

as you start completing functions. Please install the hypothesis Python package to use these property tests

at home (hypothesis is already installed for you in the labs).

What are property tests? Regular unittests work like this: you provide parameter values and expected

return value for each case, and unittest determines whether your function works appropriately in those

cases. Property tests, by contrast, work like this: you tell hypothesis about relationships between parameter

values and return values, and hypothesis automatically generates tons of examples that try to falsify the

correctness of your function. So, with hypothesis, you focus on general behaviour of your functions rather

than speci?c behaviour on prespeci?ed examples.

For a little on the syntax that hypothesis uses, take a look at the ?rst class in

test_huffman_properties.py. The @given line speci?es that integers between 0 and 255 should be passed

to the test_byte_to_bits function as parameter b. We then have two tests (assertXxx statements) that

must pass, no matter the value of b: byte_to_bits must return an object where everything in it is a 0 or

1, and it must return something of length 8. Of course, this particular property test is for a starter code

function, not one that you write; but later in the ?le there are some property tests for the functions that

you write.

The property tests that we’ve given you do not fully characterize what it means for a function to be

correct. They’re just a starting point for testing. You’re encouraged to write further property tests to help

increase your con?dence in your functions, though any tests that you write are not being marked.

Python TA

As in Assignment 1, we are requiring you to use python_ta to improve the quality of your code.

declaring your assignment team

You may do this assignment alone, with one other student, or with two other students. Your partner(s)

may be from any current section of the course at StG. You must declare your team (even if you are working

alone) using the MarkUs online system.

Navigate to the MarkUs page for the assignment and ?nd \Group Information”. If you are working alone,

click \work individually”. If you are working in a team:

First: one of you must invite” the other(s) to be partners, providing MarkUs with their utorids.

Second: the invited students must accept the invitation.

Important: there must be only one inviter; the other group member(s) accept after being invited.

To accept an invitation, ?nd \Group Information” on the Assignment page, ?nd the invitation listed there,

and click on \Join”.

7

submitting your work

Submit the huffman.py ?le on MarkUs. To submit the ?le, click on the \Submissions” tab near the top,

click \Add a New File”, and type a ?lename or click \Browse” to ?nd the ?le. Finally, click \Submit”. You

can submit a new version of a ?le later (before the deadline, of course); look in the \Replace” column.

In addition to huffman.py, you are free to submit a info.txt ?le that tells us what is working and

not working in your assignment, and anything else that you think we’d be interested in reading about your

assignment! For example, what other ?les did you compress, how good is the compression, how does the

compression ratio compare to other tools, etc.

If you are working in a group of 2 or 3, then only one team member should submit the assignment.

Everyone in the team will get credit for the work.

Marking Scheme

Here is how your assignment will be marked.

Passing the provided property tests is worth 20%. Passing new property tests that we will run (that

fully characterize the correctness of your code) is worth 30%. Passing our traditional unittests (where we

hard-code particular parameter and return values) is worth 20%. Therefore, in total, correctness is worth

70% of the assignment.

Method and function design is worth 20%. Avoid code duplication, use helper functions as appropriate,

design clean code, use and write ADTs to abstract implementation details.

Docstrings and python_ta compliance is worth 10%. You must include docstrings in all functions and

methods that you write. Docstrings include type contract, description, and examples (for functions/methods

that return non-None). Please also use consistent Python code style, and comment your code appropriately.

You can read the CSC108 docstring and code style rules here. You should use the class design recipe

and the function design recipe.

8