crunch.c: some commentary to remind myself how it works.

author: B. Watson <urchlay@slackware.uk> 2025-12-01 05:50:39 -0500
committer: B. Watson <urchlay@slackware.uk> 2025-12-01 05:50:39 -0500
commit: 884e281ffad76fd22709de1dda6a4634055f2bf9 (patch)
tree: 7353e59365dad7d7cf7b65859d8a02117fe52227 /src
parent: 313050769f98ad6159f7399e5418d6608be5c531 (diff)
download: alftools-884e281ffad76fd22709de1dda6a4634055f2bf9.tar.gz
1 files changed, 60 insertions, 0 deletions
diff --git a/src/crunch.c b/src/crunch.c
index 0d56723..2bef18d 100644
--- a/src/crunch.c
+++ b/src/crunch.c
@@ -1,3 +1,8 @@
+/*
+ * See the bottom of this file for an explanation of the data
+ * structure, complete with high-tech ASCII art.
+ */
+
 #include <stdio.h>
 #include <stdlib.h>
 #include <string.h>
@@ -185,3 +190,58 @@ void crunch(void) {
 
 	init_table();
 }
+
+/* ********************************************************************
+
+The tokens are stored in a tree with 256 roots (root_tokens[256]).
+There's one root per character code (0-255). This serves as the
+hard-coded initial dictionary (each token number at this level maps to
+its character code).
+
+Each node's siblings pointer points to the next node at that level.
+This makes each level of a tree a linked list.
+
+Each node that has children, its kids pointer points to the head of
+the siblings list (the first of its child nodes).
+
+When creating a token, you pass make_token() a pointer to an existing
+token. For example if your new token is for the string "ab", you pass
+make_token() the address of root_tokens['a']. The new token will be
+added to the 'a' token's kids as the sibling of the last kid in the
+list (if there are already kids) or the first kid if there are none.
+
+The number in the token_t increases sequentially as tokens are
+created, and is the compressed token that actually get written to the
+ALF file. The chr is the last character of the token (the previous
+ones are at higher levels of the tree).
+
+What match_token() does is walk the tree, starting from the current
+input position and from the root node for the byte at the current
+position, and search each level for a token whose chr matches the
+next byte in the input. It keeps searching until there's no match, and
+returns the last token it matched. It can return the root, which means
+no other tokens start with that character.
+
+An example: the input begins with "abcdaZ".
+
+
+Partial token structure:
+
+root_token[['a'] = token #97, no siblings
+ \- kids: token #130, b; token #<whatever>, Z
+    \- kids: token #131, c
+
+Later on in the file, if we run into the string "abcX", match_token()
+will start with the root token for 'a', walk through its kids looking
+for 'b' (which it finds), then walk through the 'b' token's kids
+looking for a 'c' (which it finds). Suppose the letter 'X' has never
+occured yet in the input. match_token() will return token #131 (follow
+the tree, that's the token for "abc").
+
+A new token is created for every character in the file that doesn't
+match one of the existing tokens.
+
+Walking the tree is fast, but it could be made even faster by sorting
+the kids at each level, and doing a binary search instead of linear.
+
+********************************************************************* */
author	B. Watson <urchlay@slackware.uk>	2025-12-01 05:50:39 -0500
committer	B. Watson <urchlay@slackware.uk>	2025-12-01 05:50:39 -0500
commit	884e281ffad76fd22709de1dda6a4634055f2bf9 (patch)
tree	7353e59365dad7d7cf7b65859d8a02117fe52227 /src
parent	313050769f98ad6159f7399e5418d6608be5c531 (diff)
download	alftools-884e281ffad76fd22709de1dda6a4634055f2bf9.tar.gz