aboutsummaryrefslogtreecommitdiff
path: root/src
diff options
context:
space:
mode:
authorB. Watson <urchlay@slackware.uk>2025-12-11 04:30:47 -0500
committerB. Watson <urchlay@slackware.uk>2025-12-11 04:30:47 -0500
commitf65c9025dbb47cc252a5cecd1c750f46a36378b5 (patch)
tree2fa4b0e756d6c1637500b09569cd648059544d31 /src
parentae06ce5a0317b949a74562a09e74177e78267a9b (diff)
downloadalftools-f65c9025dbb47cc252a5cecd1c750f46a36378b5.tar.gz
Update and expand explanation at the bottom of alf.c. Perhaps I'll actually understand it, when I look at this again years later.
Diffstat (limited to 'src')
-rw-r--r--src/crunch.c95
1 files changed, 90 insertions, 5 deletions
diff --git a/src/crunch.c b/src/crunch.c
index 0458840..cb1ce5f 100644
--- a/src/crunch.c
+++ b/src/crunch.c
@@ -240,13 +240,14 @@ void crunch(void) {
/* ********************************************************************
-The tokens are stored in a tree with 256 roots (tokens[0 to 256]).
-There's one root per character code (0-255). This serves as the
-hard-coded initial dictionary (each token number at this level maps to
-its character code).
+The tokens are stored in a tree with 256 roots (tokens[0] to
+tokens[255]). There's one root per character code (0-255). This
+serves as the hard-coded initial dictionary (each token number at this
+level maps to its character code).
Each element of tokens[] has an array of 256 shorts, which are the
-token numbers for each possible character that follows it.
+token numbers for each possible character that follows it. These token
+numbers are just more indices into the tokens[] array.
When creating a token, you pass make_token() the index of an existing
token. For example if your new token is for the string "ab", you pass
@@ -261,4 +262,88 @@ and it avoids a time-consuming linear search of each level of the
tree to see if the current character matches an existing token. Array
lookups are O(1).
+An example:
+
+$ echo -n abababcabc > ABC.TXT
+$ ./alf -vv 1 ABC.TXT
+Crunching ABC.TXT as ABC.TXT
+<0 >29:0 #256 RESET
+<1 >30:1 #97 'a'
+<2 >31:2 #98 'b'
+<4 >32:3 #258
+<6 >33:4 #258
+<7 >34:5 #99 'c'
+<10 >35:6 #261
+<10 >36:7 #257 END
+
+final token table contents:
+#97/'a'
+ #258/'b'
+ #260/'a'
+ (no kids)
+ #261/'c'
+ (no kids)
+ #258 has 2 kids
+#97 has 1 kids
+#98/'b'
+ #259/'a'
+ (no kids)
+#98 has 1 kids
+#99/'c'
+ #262/'a'
+ (no kids)
+#99 has 1 kids
+
+maxkidcount 2, maxlevel = 3, totalkidcount = 5
+nodeswithkids 4, avgkidcount: 1.25
+--
+
+ 10/38 (-279%)
+Elapsed time: 0.001s (0.01MB/s).
+Backed up old '1' to '1~'.
+
+The first part of the output (all the lines that start with <) is the
+actual compressed data, in human-readable form. Single-byte tokens
+(ones that map to a single charater) are displayed as their ASCII
+value, if printable. The tokens numbered 258 and above represent
+multiple bytes.
+
+Where does the number 258 come from? Token numbers are sequential,
+in the order the tokens are created (based on the input data). Since
+#0-#255 are the predefined 1-byte tokens, and #256 and #257 are
+reserved for the RESET and END control codes, the first 2-byte token
+created will be numbered #258.
+
+And what does token #258 mean? Take a look at the next bit of output
+(final token table contents). #258 shows up as a child of #97. So,
+#258 represents the string "ab". #258 has 2 children itself, #260, which
+represents "aba" (not used in the compressed data) and #261 ("abc", which
+does get used).
+
+The way the token structure gets stored in memory... say, token
+#261, which represents "abc". "a" through "c" are ASCII 97 to 99.
+So, tokens[97][98] has the value 258 (the token for "ab"). Then,
+tokens[258][99] has the value 261, which is the token for "abc" (the
+previous token, with a "c" appended). If there were tokens of 4 or
+more bytes that begin with "abc" (e.g. "abcd"), tokens[261]'s array
+of 256 would have non-zero items... but in this case, there aren't,
+so token[261] has an array of all 0 (shows as "no kids", above).
+
+Notice that the first time "abc" appeared in the input, it was
+compressed as 2 tokens: #258 (for the "ab") and #99 (the "c"). When
+the encoder saw the string "abc" for the first time, it couldn't use
+the "abc" token because it didn't exist yet. Instead, it creates the
+"abc" token. The next time "abc" appears in the input, *then* it can
+use the "abc" token to represent it.
+
+If it makes it easier to visualize, here is the above token table
+represented as a tree, with crude ASCII art.
+
+Roots (1-byte tokens): a (97) b (98) c (99)
+ / / /
+2-byte tokens: b (258) a (259) a (262)
+ / \
+3-byte tokens: (260) a c (261)
+
+
********************************************************************* */