diff options
| author | B. Watson <urchlay@slackware.uk> | 2025-12-15 05:58:43 -0500 |
|---|---|---|
| committer | B. Watson <urchlay@slackware.uk> | 2025-12-15 05:58:43 -0500 |
| commit | 9df70083b92e3f130ff34497d05585478e593f1f (patch) | |
| tree | d4d075bf1a96ce0135cf2f2a9656f678aa518100 | |
| parent | 3e9426452656267dec8544a4517208b90849e6c3 (diff) | |
| download | alftools-9df70083b92e3f130ff34497d05585478e593f1f.tar.gz | |
Add doc/compression.txt.
| -rw-r--r-- | doc/compression.txt | 55 | ||||
| -rw-r--r-- | src/crunch.c | 46 |
2 files changed, 80 insertions, 21 deletions
diff --git a/doc/compression.txt b/doc/compression.txt new file mode 100644 index 0000000..539e8bc --- /dev/null +++ b/doc/compression.txt @@ -0,0 +1,55 @@ +ALF Compressed Data Stream +-------------------------- + +This describes the compressed data for each file in an ALF +archive. Each archive member has a 29-byte header (described in +fileformat.txt), followed by its compressed data. + +The compression algorithm is an implementation of LZW. See: + +https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch + +ALF uses a variable bit-width encoding, 9 to 12 bits per code. Codes +are numbered 0 to 4095 (aka 2^12-1). Each code represents either +an uncompressed data byte, a control code, or a multibyte string +made from previously-seen input, thus: + +Code Purpose +------------+---------------------------------------------- +0 to 255 | 1-byte string (predefined) +256 | Reset (clear token table, set bit size to 9) +257 | End of compressed data +258 to 4095 | Multibyte strings +------------+---------------------------------------------- + +Codes 0 to 255 are mapped to their values (e.g. code 65 represents +the byte 65 in the input). + +As the encoder reads the input, it builds a dictionary of codes. +Initially, codes are stored in the output as 9 bits each. When the +number of codes exceeds 511 (the maximum number that can be stored in +9 bits), the code bit size is increased to 10. Likewise, when the code +count exceeds a 10-bit value (code 1023), the bit size becomes 11, +then after code 2047 it becomes 12. When all 12-bit code numbers are +used (code 4095), the encoder writes a reset/clear code, clears the +dictionary, and resets the bit size to 9. Notice that, other than the +change from 12 back to 9, there is no explicit "next bit size" code +in the file. The encoder and decoder must agree upon when the bit size +increases. + +When the bit size changes, the last code is written at the current bit +size (including the reset/clear code), and the following code is at +the new bit size. This means ALF doesn't use an "early change" variant +of LZW. + +Bits are stored to the file in MSB-first order. If the compressed data +doesn't end on a byte boundary (has a number of bits not divisible +by 8), the least significant bits of the last byte are filled with +0 bits. + +ALF compressed streams always begin with a reset code (256), *except* +for empty files (length of 0 bytes). The stream always ends with +an end code (257). This means a 0-byte file consists of one token +(the end code), 9 bits long, which occupies 2 bytes. A 1-byte file +compresses to a reset code, the one byte, and an end code (9 bits +each, 27 bits total, or 4 bytes). diff --git a/src/crunch.c b/src/crunch.c index 6f5dad6..30eb4f7 100644 --- a/src/crunch.c +++ b/src/crunch.c @@ -351,34 +351,38 @@ $ echo -n abababcabc > ABC.TXT $ ./alf -vv 1 ABC.TXT Crunching ABC.TXT as ABC.TXT <0 >29:0 #256 RESET -<1 >30:1 #97 'a' -<2 >31:2 #98 'b' -<4 >32:3 #258 -<6 >33:4 #258 -<7 >34:5 #99 'c' +<1 >30:1 #97 a new: #258 +<2 >31:2 #98 b new: #259 +<4 >32:3 #258 new: #260 +<6 >33:4 #258 new: #261 +<7 >34:5 #99 c new: #262 <10 >35:6 #261 <10 >36:7 #257 END final token table contents: -#97/'a' - #258/'b' - #260/'a' - (no kids) - #261/'c' - (no kids) - #258 has 2 kids +#97, used 1, len 1: a +| `-#258, used 2, len 2: ab +| | `-#260, used 0, len 3: aba +| | | `-(no kids) +| | `-#261, used 1, len 3: abc +| | | `-(no kids) +| `-#258 has 2 kids #97 has 1 kids -#98/'b' - #259/'a' - (no kids) + +#98, used 1, len 1: b +| `-#259, used 0, len 2: ba +| | `-(no kids) #98 has 1 kids -#99/'c' - #262/'a' - (no kids) + +#99, used 1, len 1: c +| `-#262, used 0, len 2: ca +| | `-(no kids) #99 has 1 kids -maxkidcount 2, maxlevel = 3, totalkidcount = 5 -nodeswithkids 4, avgkidcount: 1.25 + +max kid count 2, max length = 3, total kid count = 5 +tokens with kids 4, avg kid count: 1.25 +total tokens 263, used: 7, unused 256 -- 10/38 (-279%) @@ -387,7 +391,7 @@ Backed up old '1' to '1~'. The first part of the output (all the lines that start with <) is the actual compressed data, in human-readable form. Single-byte tokens -(ones that map to a single charater) are displayed as their ASCII +(ones that map to a single character) are displayed as their ASCII value, if printable. The tokens numbered 258 and above represent multiple bytes. |
