diff options
Diffstat (limited to 'doc/compression.txt')
| -rw-r--r-- | doc/compression.txt | 55 |
1 files changed, 55 insertions, 0 deletions
diff --git a/doc/compression.txt b/doc/compression.txt new file mode 100644 index 0000000..539e8bc --- /dev/null +++ b/doc/compression.txt @@ -0,0 +1,55 @@ +ALF Compressed Data Stream +-------------------------- + +This describes the compressed data for each file in an ALF +archive. Each archive member has a 29-byte header (described in +fileformat.txt), followed by its compressed data. + +The compression algorithm is an implementation of LZW. See: + +https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch + +ALF uses a variable bit-width encoding, 9 to 12 bits per code. Codes +are numbered 0 to 4095 (aka 2^12-1). Each code represents either +an uncompressed data byte, a control code, or a multibyte string +made from previously-seen input, thus: + +Code Purpose +------------+---------------------------------------------- +0 to 255 | 1-byte string (predefined) +256 | Reset (clear token table, set bit size to 9) +257 | End of compressed data +258 to 4095 | Multibyte strings +------------+---------------------------------------------- + +Codes 0 to 255 are mapped to their values (e.g. code 65 represents +the byte 65 in the input). + +As the encoder reads the input, it builds a dictionary of codes. +Initially, codes are stored in the output as 9 bits each. When the +number of codes exceeds 511 (the maximum number that can be stored in +9 bits), the code bit size is increased to 10. Likewise, when the code +count exceeds a 10-bit value (code 1023), the bit size becomes 11, +then after code 2047 it becomes 12. When all 12-bit code numbers are +used (code 4095), the encoder writes a reset/clear code, clears the +dictionary, and resets the bit size to 9. Notice that, other than the +change from 12 back to 9, there is no explicit "next bit size" code +in the file. The encoder and decoder must agree upon when the bit size +increases. + +When the bit size changes, the last code is written at the current bit +size (including the reset/clear code), and the following code is at +the new bit size. This means ALF doesn't use an "early change" variant +of LZW. + +Bits are stored to the file in MSB-first order. If the compressed data +doesn't end on a byte boundary (has a number of bits not divisible +by 8), the least significant bits of the last byte are filled with +0 bits. + +ALF compressed streams always begin with a reset code (256), *except* +for empty files (length of 0 bytes). The stream always ends with +an end code (257). This means a 0-byte file consists of one token +(the end code), 9 bits long, which occupies 2 bytes. A 1-byte file +compresses to a reset code, the one byte, and an end code (9 bits +each, 27 bits total, or 4 bytes). |
