Add doc/compression.txt.

author: B. Watson <urchlay@slackware.uk> 2025-12-15 05:58:43 -0500
committer: B. Watson <urchlay@slackware.uk> 2025-12-15 05:58:43 -0500
commit: 9df70083b92e3f130ff34497d05585478e593f1f (patch)
tree: d4d075bf1a96ce0135cf2f2a9656f678aa518100
parent: 3e9426452656267dec8544a4517208b90849e6c3 (diff)
download: alftools-9df70083b92e3f130ff34497d05585478e593f1f.tar.gz
2 files changed, 80 insertions, 21 deletions
diff --git a/doc/compression.txt b/doc/compression.txt
new file mode 100644
index 0000000..539e8bc
--- /dev/null
+++ b/doc/compression.txt
@@ -0,0 +1,55 @@
+ALF Compressed Data Stream
+--------------------------
+
+This describes the compressed data for each file in an ALF
+archive. Each archive member has a 29-byte header (described in
+fileformat.txt), followed by its compressed data.
+
+The compression algorithm is an implementation of LZW. See:
+
+https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch
+
+ALF uses a variable bit-width encoding, 9 to 12 bits per code. Codes
+are numbered 0 to 4095 (aka 2^12-1). Each code represents either
+an uncompressed data byte, a control code, or a multibyte string
+made from previously-seen input, thus:
+
+Code         Purpose
+------------+----------------------------------------------
+0 to 255    | 1-byte string (predefined)
+256         | Reset (clear token table, set bit size to 9)
+257         | End of compressed data
+258 to 4095 | Multibyte strings
+------------+----------------------------------------------
+
+Codes 0 to 255 are mapped to their values (e.g. code 65 represents
+the byte 65 in the input).
+
+As the encoder reads the input, it builds a dictionary of codes.
+Initially, codes are stored in the output as 9 bits each. When the
+number of codes exceeds 511 (the maximum number that can be stored in
+9 bits), the code bit size is increased to 10. Likewise, when the code
+count exceeds a 10-bit value (code 1023), the bit size becomes 11,
+then after code 2047 it becomes 12. When all 12-bit code numbers are
+used (code 4095), the encoder writes a reset/clear code, clears the
+dictionary, and resets the bit size to 9. Notice that, other than the
+change from 12 back to 9, there is no explicit "next bit size" code
+in the file. The encoder and decoder must agree upon when the bit size
+increases.
+
+When the bit size changes, the last code is written at the current bit
+size (including the reset/clear code), and the following code is at
+the new bit size. This means ALF doesn't use an "early change" variant
+of LZW.
+
+Bits are stored to the file in MSB-first order. If the compressed data
+doesn't end on a byte boundary (has a number of bits not divisible
+by 8), the least significant bits of the last byte are filled with
+0 bits.
+
+ALF compressed streams always begin with a reset code (256), *except*
+for empty files (length of 0 bytes). The stream always ends with
+an end code (257). This means a 0-byte file consists of one token
+(the end code), 9 bits long, which occupies 2 bytes. A 1-byte file
+compresses to a reset code, the one byte, and an end code (9 bits
+each, 27 bits total, or 4 bytes).
diff --git a/src/crunch.c b/src/crunch.c
index 6f5dad6..30eb4f7 100644
--- a/src/crunch.c
+++ b/src/crunch.c
@@ -351,34 +351,38 @@ $ echo -n abababcabc > ABC.TXT
 $ ./alf -vv 1 ABC.TXT
 Crunching ABC.TXT as ABC.TXT
 <0 >29:0 #256 RESET
-<1 >30:1 #97 'a'
-<2 >31:2 #98 'b'
-<4 >32:3 #258
-<6 >33:4 #258
-<7 >34:5 #99 'c'
+<1 >30:1 #97 a	new: #258
+<2 >31:2 #98 b	new: #259
+<4 >32:3 #258	new: #260
+<6 >33:4 #258	new: #261
+<7 >34:5 #99 c	new: #262
 <10 >35:6 #261
 <10 >36:7 #257 END
 
 final token table contents:
-#97/'a'
-  #258/'b'
-    #260/'a'
-      (no kids)
-    #261/'c'
-      (no kids)
-  #258 has 2 kids
+#97, used 1, len 1: a
+| `-#258, used 2, len 2: ab
+| | `-#260, used 0, len 3: aba
+| | | `-(no kids)
+| | `-#261, used 1, len 3: abc
+| | | `-(no kids)
+| `-#258 has 2 kids
 #97 has 1 kids
-#98/'b'
-  #259/'a'
-    (no kids)
+
+#98, used 1, len 1: b
+| `-#259, used 0, len 2: ba
+| | `-(no kids)
 #98 has 1 kids
-#99/'c'
-  #262/'a'
-    (no kids)
+
+#99, used 1, len 1: c
+| `-#262, used 0, len 2: ca
+| | `-(no kids)
 #99 has 1 kids
 
-maxkidcount 2, maxlevel = 3, totalkidcount = 5
-nodeswithkids 4, avgkidcount: 1.25
+
+max kid count 2, max length = 3, total kid count = 5
+tokens with kids 4, avg kid count: 1.25
+total tokens 263, used: 7, unused 256
 --
 
   10/38 (-279%)
@@ -387,7 +391,7 @@ Backed up old '1' to '1~'.
 
 The first part of the output (all the lines that start with <) is the
 actual compressed data, in human-readable form. Single-byte tokens
-(ones that map to a single charater) are displayed as their ASCII
+(ones that map to a single character) are displayed as their ASCII
 value, if printable. The tokens numbered 258 and above represent
 multiple bytes.
author	B. Watson <urchlay@slackware.uk>	2025-12-15 05:58:43 -0500
committer	B. Watson <urchlay@slackware.uk>	2025-12-15 05:58:43 -0500
commit	9df70083b92e3f130ff34497d05585478e593f1f (patch)
tree	d4d075bf1a96ce0135cf2f2a9656f678aa518100
parent	3e9426452656267dec8544a4517208b90849e6c3 (diff)
download	alftools-9df70083b92e3f130ff34497d05585478e593f1f.tar.gz