Note: You are looking at a static copy of the former PineWiki site, used for class notes by James Aspnes from 2003 to 2012. Many mathematical formulas are broken, and there are likely to be other bugs as well. These will most likely not be fixed. You may be able to find more up-to-date versions of some of these notes at http://www.cs.yale.edu/homes/aspnes/#classes.

1. A table-based compression algorithm

For this assignment, you are to implement a data compressor and decompressor designed for files containing many duplicate lines.

The input to your compressor is a sequence of lines of text on stdin, each terminated by a newline. The state of the compressor is a table of all distinct lines that have previously been read. When a line is read, the compressor first checks to see if it is in the table.

If the line is already in the table, the compressor outputs to stdout the rank of the line, defined as the number of lines strictly less than it in the table (in strcmp order). For example, if the table contains a, aa, b, and c, then a has rank 0, aa rank 1, b rank 2, etc. This rank is formatted as follows:

If the maximum rank of any string in the table is less than 0xff, the rank is output as a single byte (e.g., using putchar).
If the maximum rank of any string in the table is less than 0xff00, the rank is output as two bytes, most significant byte first.
If the maximum rank of any string in the table is less than 0xff0000, the rank is output as three bytes, again with the most significant byte first.
Otherwise, the rank is output as four bytes, with the MSB first. In the unlikely event that the table comes to contain more than 0xff000000 entries, your program may do whatever it likes.

If the line is not in the table, the compressor adds it to the table, and outputs the character 0xff, followed by the line (not including the terminating newline), followed by the character 0x00 (i.e. '\0').

For example, on input

abc
abc
def
def
abc
aaa
abc

the output would be (in hexadecimal) ff 61 62 63 00 00 ff 64 65 66 00 01 00 ff 61 61 61 00 01. The actual output would be an unprintable 18-byte string.

2. Your task

Your task is to write both a compressor and a decompressor for this format. Your submission should consist of whatever source files you need together with a Makefile that creates two programs compress and decompress when run with make with no arguments. The compress program should implement the above algorithm. The decompress should take the output of compress and reconstruct the original input: it should be the case that the output of ./compress | ./decompress is equal to its input for any file in which all lines end with a newline. In the case that the last line ends with EOF, your compress program should process it as if it contained a final terminating newline.

3. Hints

It is probably best to build and thoroughly debug the table as a separate module before attempting to use it inside your compressor or decompressor. You may wish to write a test program just for the table.
To read the output of your compressor as hexadecimal bytes, feed it to od -t x1, as in echo foo | ./compress | od -t x1. You can do od -c if you want characters (with escape sequences for the non-printing characters) instead. You can even use both options (od -t x1 -c) and get both output formats, but they won't line up exactly right. If your compressor is working right, looking at the raw output will generally not work well.

4. Sample solution

Here is the sample solution in tar.bz2 format; unpack it with tar jxvf 223-05-09-solution.tar.bz2: 223-05-09-solution.tar.bz2.