diff options
| -rw-r--r-- | src/crunch.c | 60 |
1 files changed, 60 insertions, 0 deletions
diff --git a/src/crunch.c b/src/crunch.c index 0d56723..2bef18d 100644 --- a/src/crunch.c +++ b/src/crunch.c @@ -1,3 +1,8 @@ +/* + * See the bottom of this file for an explanation of the data + * structure, complete with high-tech ASCII art. + */ + #include <stdio.h> #include <stdlib.h> #include <string.h> @@ -185,3 +190,58 @@ void crunch(void) { init_table(); } + +/* ******************************************************************** + +The tokens are stored in a tree with 256 roots (root_tokens[256]). +There's one root per character code (0-255). This serves as the +hard-coded initial dictionary (each token number at this level maps to +its character code). + +Each node's siblings pointer points to the next node at that level. +This makes each level of a tree a linked list. + +Each node that has children, its kids pointer points to the head of +the siblings list (the first of its child nodes). + +When creating a token, you pass make_token() a pointer to an existing +token. For example if your new token is for the string "ab", you pass +make_token() the address of root_tokens['a']. The new token will be +added to the 'a' token's kids as the sibling of the last kid in the +list (if there are already kids) or the first kid if there are none. + +The number in the token_t increases sequentially as tokens are +created, and is the compressed token that actually get written to the +ALF file. The chr is the last character of the token (the previous +ones are at higher levels of the tree). + +What match_token() does is walk the tree, starting from the current +input position and from the root node for the byte at the current +position, and search each level for a token whose chr matches the +next byte in the input. It keeps searching until there's no match, and +returns the last token it matched. It can return the root, which means +no other tokens start with that character. + +An example: the input begins with "abcdaZ". + + +Partial token structure: + +root_token[['a'] = token #97, no siblings + \- kids: token #130, b; token #<whatever>, Z + \- kids: token #131, c + +Later on in the file, if we run into the string "abcX", match_token() +will start with the root token for 'a', walk through its kids looking +for 'b' (which it finds), then walk through the 'b' token's kids +looking for a 'c' (which it finds). Suppose the letter 'X' has never +occured yet in the input. match_token() will return token #131 (follow +the tree, that's the token for "abc"). + +A new token is created for every character in the file that doesn't +match one of the existing tokens. + +Walking the tree is fast, but it could be made even faster by sorting +the kids at each level, and doing a binary search instead of linear. + +********************************************************************* */ |
