Diffstat (limited to 'src/crunch.c')
-rw-r--r--  src/crunch.c  95
1 file changed, 90 insertions, 5 deletions
diff --git a/src/crunch.c b/src/crunch.c
index 0458840..cb1ce5f 100644
--- a/src/crunch.c
+++ b/src/crunch.c
@@ -240,13 +240,14 @@ void crunch(void) {
 /* ********************************************************************
-The tokens are stored in a tree with 256 roots (tokens[0 to 256]).
-There's one root per character code (0-255). This serves as the
-hard-coded initial dictionary (each token number at this level maps to
-its character code).
+The tokens are stored in a tree with 256 roots (tokens[0] to
+tokens[255]). There's one root per character code (0-255). This
+serves as the hard-coded initial dictionary (each token number at this
+level maps to its character code).
 
 Each element of tokens[] has an array of 256 shorts, which are the
-token numbers for each possible character that follows it.
+token numbers for each possible character that follows it. These token
+numbers are just more indices into the tokens[] array.
 
 When creating a token, you pass make_token() the index of an existing
 token. For example if your new token is for the string "ab", you pass
@@ -261,4 +262,88 @@ and it avoids a time-consuming linear search of each level of the tree
 to see if the current character matches an existing token. Array
 lookups are O(1).
 
+An example:
+
+$ echo -n abababcabc > ABC.TXT
+$ ./alf -vv 1 ABC.TXT
+Crunching ABC.TXT as ABC.TXT
+<0 >29:0 #256 RESET
+<1 >30:1 #97 'a'
+<2 >31:2 #98 'b'
+<4 >32:3 #258
+<6 >33:4 #258
+<7 >34:5 #99 'c'
+<10 >35:6 #261
+<10 >36:7 #257 END
+
+final token table contents:
+#97/'a'
+  #258/'b'
+    #260/'a'
+      (no kids)
+    #261/'c'
+      (no kids)
+  #258 has 2 kids
+#97 has 1 kids
+#98/'b'
+  #259/'a'
+    (no kids)
+#98 has 1 kids
+#99/'c'
+  #262/'a'
+    (no kids)
+#99 has 1 kids
+
+maxkidcount 2, maxlevel = 3, totalkidcount = 5
+nodeswithkids 4, avgkidcount: 1.25
+--
+
+ 10/38 (-279%)
+Elapsed time: 0.001s (0.01MB/s).
+Backed up old '1' to '1~'.
+
+The first part of the output (all the lines that start with <) is the
+actual compressed data, in human-readable form. Single-byte tokens
+(ones that map to a single character) are displayed as their ASCII
+value, if printable. The tokens numbered 258 and above represent
+multiple bytes.
+
+Where does the number 258 come from? Token numbers are sequential,
+in the order the tokens are created (based on the input data). Since
+#0-#255 are the predefined 1-byte tokens, and #256 and #257 are
+reserved for the RESET and END control codes, the first 2-byte token
+created will be numbered #258.
+
+And what does token #258 mean? Take a look at the next bit of output
+(final token table contents). #258 shows up as a child of #97. So,
+#258 represents the string "ab". #258 has 2 children itself: #260, which
+represents "aba" (not used in the compressed data), and #261 ("abc",
+which does get used).
+
+The way the token structure gets stored in memory... say, token
+#261, which represents "abc". "a" through "c" are ASCII 97 to 99.
+So, tokens[97][98] has the value 258 (the token for "ab"). Then,
+tokens[258][99] has the value 261, which is the token for "abc" (the
+previous token, with a "c" appended). If there were tokens of 4 or
+more bytes that begin with "abc" (e.g. "abcd"), tokens[261]'s array
+of 256 would have non-zero items... but in this case, there aren't,
+so tokens[261] has an array of all 0 (shown as "no kids", above).
+
+Notice that the first time "abc" appeared in the input, it was
+compressed as 2 tokens: #258 (for the "ab") and #99 (the "c"). When
+the encoder saw the string "abc" for the first time, it couldn't use
+the "abc" token because it didn't exist yet. Instead, it created the
+"abc" token. The next time "abc" appears in the input, *then* it can
+use the "abc" token to represent it.
+
+If it makes it easier to visualize, here is the above token table
+represented as a tree, with crude ASCII art.
+
+Roots (1-byte tokens):    a (97)     b (98)     c (99)
+                         /          /          /
+2-byte tokens:        b (258)    a (259)    a (262)
+                      /      \
+3-byte tokens:  (260) a      c (261)
+
+ ********************************************************************* */
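
To make the layout above a bit more concrete, here is a small standalone C
sketch of the tokens[] table the new comment describes. It is not the real
alf code: the table size, the make_token() signature, and the main() driver
are assumptions for illustration only. The parts taken from the comment are
the indexing scheme: roots 0-255 are the character codes, #256/#257 are
reserved for RESET/END, and tokens[t][c] holds the token number for "token t
followed by byte c", with 0 meaning no such token exists yet.

#include <stdio.h>

#define MAX_TOKENS 4096   /* assumed capacity, for illustration only      */
#define TOK_RESET  256    /* reserved control codes, per the comment      */
#define TOK_END    257
#define FIRST_FREE 258    /* first token number available for new strings */

/* tokens[t][c] = token number for "token t followed by byte c", 0 = none.
   Indices 0-255 are the roots: token t is just character code t.         */
static short tokens[MAX_TOKENS][256];
static int next_token = FIRST_FREE;

/* Hypothetical stand-in for make_token(): register "parent plus byte c". */
static int make_token(int parent, int c)
{
    tokens[parent][c] = (short)next_token;
    return next_token++;
}

int main(void)
{
    /* Rebuild the "ab" and "abc" tokens from the example by hand.        */
    int ab  = make_token('a', 'b');   /* tokens[97][98] gets #258         */
    int abc = make_token(ab, 'c');    /* tokens[258][99]; #259 here,
                                         #261 in the real run, where
                                         "ba" and "aba" come first        */

    printf("\"ab\"  -> #%d\n", ab);
    printf("\"abc\" -> #%d\n", abc);
    printf("tokens['a']['b'] = %d, tokens[#%d]['c'] = %d\n",
           tokens['a']['b'], ab, tokens[ab]['c']);
    return 0;
}

Compiled and run, it prints #258 for "ab"; the "abc" token comes out as #259
here rather than #261 only because this sketch skips the "ba" and "aba" tokens
the real run creates in between.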

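The "emit the longest token that already exists, then register a one-byte-longer
token for next time" behavior described in the comment can be sketched the same
way. Again, this is an illustration rather than the actual crunch() loop (no
RESET/END codes, no table-full handling, no real output format), but fed the
example input "abababcabc" it emits the same token sequence shown in the
verbose output above: #97 #98 #258 #258 #99 #261.

#include <stdio.h>

#define MAX_TOKENS 4096               /* assumed capacity, as in the sketch above */

static short tokens[MAX_TOKENS][256]; /* tokens[t][c] = child token, 0 = none     */
static int next_token = 258;          /* 256/257 reserved for RESET/END           */

static int make_token(int parent, int c)   /* hypothetical stand-in               */
{
    tokens[parent][c] = (short)next_token;
    return next_token++;
}

/* Greedy matcher: extend the current token while a child exists, emit it,
   then register the one-byte-longer string so it can be used next time.   */
static void crunch_sketch(const unsigned char *in, int len)
{
    int i = 0;
    while (i < len) {
        int cur = in[i++];                        /* start from a 1-byte root */
        while (i < len && tokens[cur][in[i]] != 0)
            cur = tokens[cur][in[i++]];           /* longest existing match   */
        printf("#%d\n", cur);                     /* "emit" the token         */
        if (i < len && next_token < MAX_TOKENS)
            make_token(cur, in[i]);               /* new token for next time  */
    }
}

int main(void)
{
    /* The input from the example run above. */
    const unsigned char input[] = "abababcabc";
    crunch_sketch(input, (int)sizeof input - 1);
    return 0;
}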