Project 3, CSC 202
1 Huffman coding
For this project, you will implement a program to encode and decode text using Huffman coding . Huffman coding is a method of lossless data compression. Each character is assigned a variable length code consisting of ’0’s and ’1’s. The length of the code is based on how frequently the character occurs; more frequent characters are assigned shorter codes.
The basic idea is that you use a binary tree, where each leaf node represents a character and frequency count. The Huffman code of a character is then determined by the unique path from the root of the binary tree to that leaf node, where each ’left’ accounts for a ’0’ and a ’right’ accounts for a ’1’. In order to encode frequent characters with shorter codes, frequent characters should have a shorter path from the root to their node than infrequent characters, i.e. nodes representing frequent characters are positioned less deep in the tree than nodes for infrequent characters. The key problem is therefore to construct such a binary tree based on the frequency of occurrence of characters. A detailed example of how to construct such a Huffman tree is provided here , with the input text file and the resulting encoded file in ’0’ and ’1’ characters.
Your program will use your list implementations from lab 3, project 2, respectively. Copy both your array_list.py and linked_list.py file into your repository for project 3.
Note: For this assignment and every assignment in the class, every function must come with a signature and a purpose statement. You do not need to provide test cases for functions that produce output or consume input, but you must test (directly or indirectly) all functions that do not involve I/O (Input/Output), and you should make an attempt to design (or refactor) your code so that the functions that involve I/O are as small as possible.
2 Detailed Instructions
The following bullet points provide a guide to implement the individual functions of your program. Always follow the design recipe, write data definitions, function signatures and purpose statements in order to get full credit. Pick useful names for your defined data structures and functions, if there are no explicit names specified. If not otherwise stated, you will not get full credit if you use built-in functions. Start by creating a file huffman.py and add functions and data definitions to this file, if not otherwise stated.
2.1 Count Occurrences
Implement a function that opens a text file with a given file name and counts the frequency of occurrences of all the characters within that file. Choose an efficient list implementation (array list or linked list) for counting the occurrences of characters, since your final program will be tested with large text files. You can assume that in your text file there are only 8-bit characters resulting in a total of 256 possible character values. Use the built-in function ord(c) to get an integer value between 0...255 for a given character c according to the ASCII table in order to calculate an index for your list. Return the list with the occurrences.
2.2 Data Definition for Huffman Tree
Write a data definition for a Huffman tree, which is either a Huffman node or a leaf.
Define a Huffman Leaf class that holds a character (or, perhaps better, its integer ASCII representation) and the occurrence count (frequency) of that character. Also define a Huffman Node class representing an internal node that contains a character value and an occurrence count, as well as a reference to the left and right subtree. The character value and occurrence count of an internal node are assigned according to the construction of a Huffman tree.
Define a function that creates a string from a given Huffman tree by traversing the tree in a pre-order traversal and appending the characters of the visited leaf nodes.
2.3 Build a Huffman Tree
Start by defining a function comes_before(a, b) for Huffman trees that returns true if tree a comes before tree b. A Huffman tree a comes before Huffman tree b if the occurrence count of a is smaller than that of b. In case of a tie (i.e. the same occurrence count) you have to fall back to the ASCII value of the character to determine the order. If, for example, the characters ’d’ and ’k’ appear exactly the same number of times in your file, then ’d’ comes before ’k’, since the ASCII value of character ’d’ is less than that of ’k’.
Add to your linked_list.py file an insert_sorted() function, that inserts a value into a sorted list in ascending order. This insert_sorted() function accepts in that order a (sorted) linked list, a value, and your implemented comes_before() function to determine the position of a value in the list. The insert_sorted() function returns the list including the inserted value.
Write a function that builds a Huffman tree from a given list of occurrences of characters from task 1.1 and returns the root node of that tree. Start by creating a sorted list of individual Huffman trees consisting of single Leaf nodes for each character and their occurrence counts. Insert these Leaf nodes into a sorted list using the insert_sorted() and comes_before() function implemented above. Building the actual tree involves removing the two nodes with the lowest frequency count from the sorted list and connecting them to the left and right field of a new created Huffman Node according to this example.
It is important to note, that when connecting two Huffman nodes to the left and right field of a new parent Node, that this new node is not a Leaf node, containing an actual character to encode. Instead this new parent node should contain an occurrence count that is the sum of the left and right child occurrence counts as well as the minimum of the left and right character representation in order to resolve ties in the comes_before() function.
Once a new parent node has been created from the two nodes with the lowest occurrence count as described above, that parent node is being inserted into the list of sorted nodes again using your insert_sorted() and comes_before() function.
This process of connecting nodes from the front of the sorted list is being continued until there is a single node left in the sorted list, which is the root node of the Huffman tree that you finally return.
2.4 Build a List for the Character Codes
We have completed our Huffman tree, but we are still lacking a way to get our Huffman codes. For that purpose you implement a simple function that traverses our Huffman tree from the root (that you need to pass into that function) to each node and add a ’0’ when we go ’left’ and ’1’ when we go ’right’. As always follow the design recipe and deploy the (recursive) template here to write that function. You may want to use an accumulator to collect the individual ’0’s and ’1’s while visiting each node in order to obtain the final string when reaching a leaf node. You may want to use the built-in ’+’ operator to concatenate strings of ’0’s and ’1’s here. In addition you may want to pass in a list (using your implementation) that finally stores for each character (or its respective integer ASCII representation) the resulting Huffman code, represented in a sequence of ’0’s and ’1’s in a string. Take a moment to decide, what type of list you deem appropriate for storing Huffman codes, which, once created, you need to access often.
2.5 Huffman Encoding
Write a function called huffman_encode() (use that exact name) that reads an input text file and writes the compressed text into an output file, according to Huffman encoding. The function just accepts two file names in that order: input file name and output file name, represented as strings, and returns a pre-order string representation of the Huffman tree of the input file. If the specified output file already exists, its old contents will be erased. The returned string representation should list the characters in the (leaf nodes of the) Huffman tree using pre-order traversal (cf. chapter 1.2).
- Writing the generated code for each character into a file so far would actually enlarge the file size instead compress it. The explanation is simple: although we encoded our input text characters in sequences of ’0’s and ’1’s, representing actual single bits, we write them as individual ’0’ and ’1’ characters again, i.e. 8 bits. The real benefit of a file compression can be achieved, by writing these ’0’ and ’1’ characters in fact as individual bits. Writing individual bits, however, is not supported by python or the operating system, so we have to do some bit accumulation into bytes and write them into a binary file. This involves some bit shifting operations that most of you will most likely not have used before. We therefore provide to you an implementation of a HufmannBitReader and HufmannBitWriter class in huffman_bits_io.py, that you may want to copy into your repository. Do not modify them for test purposes. Using these classes to write to and read from a compressed file is very simple:
from huffman_bits_io import *
hb_writer = HuffmanBitsWriter(fname) # open file 'fname' for writing in binary mode
hb_writer.write_byte(val) # write integer 'val' as a 1 byte
hb_writer.write_int(val) # write integer 'val' as a 4 bytes
hb_writer.write_code(str) # write a code string 'str' consisting of '0' and '1' characters
hb_writer.close() # clears output buffer and closes file
hb_reader = HuffmanBitsReader(fname) # open file 'fname' for reading in binary mode
val = hb_reader.read_byte() # read 1 byte as an unsigned integer value
val = hb_reader.read_int() # read 4 bytes represented as an integer value
bit = hb_reader.read_bit() # read a single bit and return a Boolean (1 = True, 0 = False)
hb_reader.close() # closes file
- Reading a compressed file (see below) requires knowledge of the code table that generated the bits. This is accomplished using a header that needs to be written before the actual compressed data. You will be using the HuffmanBitsWriter class to open an output file and write the following information first in that exact order):
The number of codes represented in the Huffman Tree as a single 1 byte integer value.
Each character c and the number of occurrences n of c in the original text file as pairs c n, where the character c is written as a 1 byte and the number of occurrences n as a 4 byte integer. The sequence of character and occurrence pairs need to be in ASCII order of the characters.
Finally write the sequence of bits as produced by the Huffman encoder.
- In summary your huffman_encode() function must do the following, using your implemented functions above:
Process the input file, count the number of occurrences of each ASCII character, and sort these occurrences in ascending order.
Create a Huffman Tree using the sorted counts and the algorithm as specified above.
When joining two trees, the lesser of the trees must go on the left (the ’0’ side).
When sorting Huffman trees, break any ties by comparing the ASCII values of the characters in each tree. The tree containing the smallest character (as determined by ASCII values) is considered the lesser of the two.
Write the header information as described above into the output file.
Use the tree to generate codes for each ASCII character that occurred in the file and write them as described above into the output file.
Return a string representation of the Huffman tree. The string consists of all the leafs of the tree in the order they are visited by a pre-order traversal. There are no delimiters between adjacent characters. For example, for the sample file0.txt file, this would return the string " bdca" (including the space, which is part of the characters).
- Since you are writing bits and bytes into a binary file, you may want to use a HEX editor to verify the result of your compressed file. For example, for the sample file0.txt file, the compressed binary file file0_encoded.bin should look like:
Address | 0| 1| 2| 3| 4| 5| 6| 7| 8| 9| a| b| c| d| e| f|
00000000: 05 20 00 00 00 03 61 00 00 00 04 62 00 00 00 03
00000010: 63 00 00 00 02 64 00 00 00 01 db 0d a6 98
The first byte says 5 codes. It follows ’ ’ (space character) = (ASCII) 20 with 00 00 00 03 = 3 occurrences, then ’a’ = (ASCII) 61 with 00 00 00 04 = 4 occurrences etc., and eventually the last four bytes represent the bit sequence "11011011 00001101 10100110 10011 000" of the compressed text "abcd abc ab a", including three padding 0 bits at the end.
2.6 Huffman Decoding
Write a function called huffman_decode() (use that exact name) that reads a compressed text file and writes the decompressed text into an output text file. The function just accepts, in this order, the compressed input file name and the output text file name. If the specified output file already exists, its old contents will be erased.
- This function must do the following, using your implemented functions above:
Open the input file using a HufmannBitReader and read the header information, i.e. the number of codes and the list of occurrences as described above.
Build a Huffman Tree from the number of codes and the list of occurrences.
Use that Huffman tree to decode the compressed data read bit by bit from the file and write it to the output file. Hint: when using a Windows platform for testing your program, you may want to avoid incompatibilities between Windows and Unix/Mac by opening your decoded output text file for writing also in "binary" mode using open(fname, ’wb’).
3 2. Tests
Write sufficient tests using unittest to ensure full functionality and correctness of your program. Start by using the supplied huffman_tests.py test and test file textfile.txt to compare encoded text files with a solution or decoded text file with your input text file. Place all your tests at the bottom of your huffman.py file.
Make sure that your own tests do call 100% of your code using the coverage tool.
When copying and enhancing array_list.py and linked_list.py from Lab 3 or project 2, don’t forget to also copy array_list_tests.py and linked_list_tests.py and enhance these with additional tests for your additional functions you implemented.
When testing, always consider edge conditions like empty files or files containing an arbitrary number of just a single character (think for a minute to conclude, what you actually are to expect in case of a file containing only something like "aaaaa"). In case of an empty input text file, your program should also produce an empty encoded file. If an input file does not exist, your program should abort with an error message.
4 3. Some Notes
When writing your own test files or using copy and paste from an editor, take into account that most text editors will add a newline character to the end of the file. If you end up having an additional newline character ’\n’ in your text file, that wasn’t there before, then this ’\n’ character will also be added to the Huffman tree and will therefore appear in your generated string from the Huffman tree. In addition, an actual newline character is treated different, depending on your computer platform. Different operating systems have different codes for "newline". Windows uses ’\r\n’ = 0x0d 0x0a, Unix and Mac use ’\n’ = 0x0a. It is always useful to use a hex editor to verify why certain files that appear identical in a regular text editor are actually not identical.
Since you are using and modifying your linked list and array list implementation of lab 3, you should make a copy of array_list.py and linked_list.py in your repository of project 3.
5 4. Submission
You must submit the following files:
|
|
linked_list.py — Linked List implementation |
linked_list_tests.py — Linked List test cases |
array_list.py — Array List implementation |
array_list_tests.py — Array List test cases |
huffman.py — Your huffman encoder and decoder application including tests |
<Any text file you used for testing> |
|
Note that the files you use for testing must end with .txt or .bin, or the handin server will cruelly destroy them.