aboutsummaryrefslogtreecommitdiff
path: root/doc/compression.txt
diff options
context:
space:
mode:
Diffstat (limited to 'doc/compression.txt')
-rw-r--r--doc/compression.txt55
1 files changed, 55 insertions, 0 deletions
diff --git a/doc/compression.txt b/doc/compression.txt
new file mode 100644
index 0000000..539e8bc
--- /dev/null
+++ b/doc/compression.txt
@@ -0,0 +1,55 @@
+ALF Compressed Data Stream
+--------------------------
+
+This describes the compressed data for each file in an ALF
+archive. Each archive member has a 29-byte header (described in
+fileformat.txt), followed by its compressed data.
+
+The compression algorithm is an implementation of LZW. See:
+
+https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch
+
+ALF uses a variable bit-width encoding, 9 to 12 bits per code. Codes
+are numbered 0 to 4095 (aka 2^12-1). Each code represents either
+an uncompressed data byte, a control code, or a multibyte string
+made from previously-seen input, thus:
+
+Code Purpose
+------------+----------------------------------------------
+0 to 255 | 1-byte string (predefined)
+256 | Reset (clear token table, set bit size to 9)
+257 | End of compressed data
+258 to 4095 | Multibyte strings
+------------+----------------------------------------------
+
+Codes 0 to 255 are mapped to their values (e.g. code 65 represents
+the byte 65 in the input).
+
+As the encoder reads the input, it builds a dictionary of codes.
+Initially, codes are stored in the output as 9 bits each. When the
+number of codes exceeds 511 (the maximum number that can be stored in
+9 bits), the code bit size is increased to 10. Likewise, when the code
+count exceeds a 10-bit value (code 1023), the bit size becomes 11,
+then after code 2047 it becomes 12. When all 12-bit code numbers are
+used (code 4095), the encoder writes a reset/clear code, clears the
+dictionary, and resets the bit size to 9. Notice that, other than the
+change from 12 back to 9, there is no explicit "next bit size" code
+in the file. The encoder and decoder must agree upon when the bit size
+increases.
+
+When the bit size changes, the last code is written at the current bit
+size (including the reset/clear code), and the following code is at
+the new bit size. This means ALF doesn't use an "early change" variant
+of LZW.
+
+Bits are stored to the file in MSB-first order. If the compressed data
+doesn't end on a byte boundary (has a number of bits not divisible
+by 8), the least significant bits of the last byte are filled with
+0 bits.
+
+ALF compressed streams always begin with a reset code (256), *except*
+for empty files (length of 0 bytes). The stream always ends with
+an end code (257). This means a 0-byte file consists of one token
+(the end code), 9 bits long, which occupies 2 bytes. A 1-byte file
+compresses to a reset code, the one byte, and an end code (9 bits
+each, 27 bits total, or 4 bytes).