An email interview with Alfred, author of AlfCrunch for the
Atari 8-bit.
Date: Thu, 20 Nov 2025 12:35:25 -0500
From: Alfred
To: B. Watson <urchlay@slackware.uk>
Subject: Re: UnAlf
On 2025-11-20 12:37 a.m., B. Watson wrote:
> 1. Was AlfCrunch public domain, shareware, or...?
1. AlfCrunch was public domain, although I never did distribute the
source code, and as I recall nobody ever asked me for it. The programmer
at ICD, Mike Gustafson, did the same as you. He disassembled the DZ and
added it to their SpartaDos X along with all the ARC routines so they
could handle almost anything. Bob Puff at CSS did the same, so he could
add it to his SuperUnArc. He phoned me one night to say his code was
faster than mine at decompressing an AlfCrunch file. We had a good laugh
about it.
> 2. Do you have any old disks (or disk images), or paper
> notes, anything that you used or referred to while developing
> AlfCrunch? Source would be awesome (if you still have it and are
> willing to share it). Even just the original distribution disk would
> be good to have.
2. I didn't distribute it on disk that I can recall; it was either the
two files posted on my BBS, or perhaps they were Arc'd, I just don't
recall now. Probably Arc'd, because there was a doc file with it.
I've attached the source code for LZ/DZ. This isn't the original, which
was done with Mac/65; it's broken up to use the Six Forks assembler,
which I had just started using for a bit around then.
> 3. Why not ARC compatible compression? You said you ported a PC
> program you had the source to... was it just not a matter of having
> the source for ARC? Or did you decide to go for speed rather than
> compatibility?
3. I didn't have any source code for ARC and I didn't know what the
various compression algorithms were. I vaguely knew about Huffman from
work, as one of the big software programs used it, but I had no idea how
it was implemented. I read the LZW paper but I didn't really understand
it then. Everyone hated Walden's ARC because it was so slow and it was
bugged, but it was all there was. One day somewhere I ran across some
guy's implementation of LZW for the PC, and I thought I'd try porting it
because it had to be faster than ARC. It was in assembler, so I could
kind of understand it. I'd seen some of the ARC source, but the C code
was just gibberish to me. That's why my version is so clunky: I was
doing the same as you, just porting each x86 instruction to its sort-of
6502 equivalent. I couldn't make changes to the code because I didn't understand
what it was doing back then.
After I released the first version, someone called me and said their
Arcviewer didn't work on .alf files, so I quickly fixed the header to be
Arc-compatible, to the extent that you could see what the files were,
and that's the 1.4 version. So if you run across a 1.2, it's the same
except for the header. I don't think anyone saw 1.2 except for some
local people, because I released 1.4 so fast.
> 4. Did you ever work on AlfCrunch after the 1.4 release? You mention a
> couple of possibilities for the next version in your doc file. Did any
> of that ever materialize (even if unreleased)?
4. I did some work on an LZ 2.0, but I guess I quit on it; I only have
some source for the LZ part. I must have thought 1.4 was good enough and
moved on to something else.
> 5. Are you OK with me redistributing the decompression code from UnAlf
> under the WTFPL license?
>
> 6. Are you OK with me including your AtariAge handle in my unalf
> documentation (man page)?
5 & 6. Sure, you can distribute whatever you like and mention me in it.
It's not like there are many people around who would remember me, heh.
LZW is fairly straightforward (now, lol) but it can be a bit hard to get
the idea from just reading code. The way it works is that a single
token is output that represents a string, hopefully a really long one:
for example, token $122 might have "went to the store and bought" as
the string associated with it. However, I think tokens start as 9-bit
values, so you actually output 9 bits, not just 8.
So on the compress side, you start with a table of, I think, 8-bit
tokens, where the value is 0-$FF, which covers every possible input
value. If you were only doing, say, ASCII text, you could cut it down
to maybe 100 tokens; I'm not sure how many non-printables there are,
like SOL etc.
Anyway, you start reading in bytes of data. You check a byte against
the table, and if you didn't find it, you'd add the next token value,
which would be $100, and save that byte as that token's entry. That
can't actually happen with the first token, because a single byte has
to be one of the starting 256 values. If you find the token, then you
remember it and continue by reading the next character. Now you're
checking the table to see if there's a token for 1st+2nd, which there
isn't. So you create a new token, $100, add the 2-byte string as its
value, and output the token for the first byte. The second byte carries
over as the start of the next match, and the same process repeats:
since there's no string for 2nd+3rd or 3rd+4th yet, you output the
token for each byte and add a new token for each two-byte string. Now
with a good matching data file, like text, you'll probably see 1st+2nd
again. So when it sees that first byte value, it says, ok, I have a
token for that, so it keeps reading; it sees the second byte and goes,
I have a token for 1+2 too; then it reads the third byte and goes, ok,
I don't have a token for 1+2+3, so it outputs the token for 1+2,
creates a new token, and stores the string 1+2+3 as its value.
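
[A minimal sketch of that compression loop, in Python rather than the
6502 assembly LZ is actually written in; it just collects whole integer
tokens and leaves the variable-width bit packing, which comes up
further down, for later.]

    def lzw_compress(data: bytes) -> list[int]:
        # Start with a table of every single-byte string, tokens $00-$FF.
        table = {bytes([i]): i for i in range(256)}
        next_token = 0x100          # first token created during compression
        output = []
        current = b""               # longest string matched so far

        for byte in data:
            candidate = current + bytes([byte])
            if candidate in table:
                # The longer string is already known; keep extending it.
                current = candidate
            else:
                # Output the token for the longest known match, add the new
                # string (match + one more byte) to the table as a new token,
                # and carry the mismatched byte over as the next match's start.
                output.append(table[current])
                table[candidate] = next_token
                next_token += 1
                current = bytes([byte])

        if current:
            output.append(table[current])   # flush the final match
        return output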
So this process just goes on until you run out of data. With a good
match you'll get longer and longer runs of bytes that match earlier
strings, so you can get to the point where one token is the same as 40
characters. That's why LZW is so good. However, you run into trouble
with something like a GIF or JPG, because the bytes are all different;
you don't get runs of bytes. Especially not in JPG, because it's already stripped out
all the runs, which is why JPG files are so small.
The decompress is similar; it just works backwards. You start with the
same 256-byte table. You read the first token, and it matches, so you
output the value (the token is the character initially). Since it
matched, what you would normally do is take the string that you output
just before this and concatenate the first letter of this string to the
last output string and add it to the table as a new token value. Since
there is no previous string when you read the first byte, you do
nothing. So even if you didn't know what the starting table was, you
could rebuild it here, because all the initial tokens will be <$100
because they didn't match anything longer in the beginning of the
compression, so eventually you will reconstruct the 256-entry table. You
short-circuit that part by starting with the known table.
So starting with the second token, you end up creating a bunch of
second-level entries, which are an initial table value plus something.
As long as the next token is in the table, you just keep going,
outputting strings and adding new tokens. Now, what happens if you get
a token that isn't in the table? This is where the math becomes magic;
I don't really understand the theory, but it works. You know it had to
have just been created by the compressor. So the string that this new
token represents has to start with the last string you output, with
some new character added. So you take the last string output and
concatenate its first character to itself. So if the last string was
the value "ABCD", you create a new token in the table, add "ABCDA" as
its value, and output the string "ABCDA". And so on.
Now, you start with 9-bit tokens, I think. At some point on the
compress side, and on the decompress side, when you go to add a new
token and it's going to take more than 9 bits, you up the bit size to
10, which also changes the highest token value you can have, which I
think is what the MAXCDE value represents. Because of the limited
memory, I think I send the clear code when the 12-bit table fills up,
and start over with 9. Fortunately, on the Atari you don't have giant
files, so it doesn't reset too often.
There are a couple of special tokens: Clear, which tells you to clear
the table when you see it, and the End token, which tells you that it's
the last token of the data.
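
[A sketch of that bookkeeping in the same Python style: a growing bit
width with a MAXCDE-style ceiling, plus Clear and End tokens. The
specific values ($100 for Clear, $101 for End, widths 9 through 12) are
assumptions for illustration, not values read out of the LZ/DZ source.]

    CLEAR = 0x100       # assumed: tells the decompressor to reset the table
    END   = 0x101       # assumed: marks the last token of the data
    FIRST_FREE = 0x102  # first token available for new strings

    class TokenSizer:
        """Tracks the current token bit width (9..12) as the table fills."""

        def __init__(self):
            self.reset()

        def reset(self):
            self.bits = 9                         # tokens start 9 bits wide
            self.next_token = FIRST_FREE
            self.max_code = (1 << self.bits) - 1  # like MAXCDE: biggest token
                                                  # that fits in 'bits' bits

        def assign(self) -> int:
            """Hand out the next token number, widening the codes as needed.

            When the 12-bit table is full, the compressor would emit CLEAR
            and call reset() to start over with a fresh 9-bit table.
            """
            token = self.next_token
            self.next_token += 1
            if self.next_token > self.max_code and self.bits < 12:
                self.bits += 1                    # 9 -> 10 -> 11 -> 12
                self.max_code = (1 << self.bits) - 1
            return token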
A lot of the code in LZ/DZ is the bit twiddling to concatenate the
varying-bitsize tokens in memory. I can't do that sort of math in my
head, so it's a lot of brute-force shifting of bits left and right to
align a new token with the last one in memory. The other thing I didn't
understand is that I don't think the code actually stores every full
string; maybe it does, but at the time I thought the guy was using some
scheme whereby he only stored one character with each new token, and it
was somehow chained to the previous string.
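
[A sketch of the packing that all that shifting accomplishes, again in
Python; the low-bits-first order is an assumption, not something taken
from the LZ source. As for the chaining scheme mentioned above, that is
indeed the standard LZW trick: each table entry stores only a pair of
(previous token, one new byte), and the full string is recovered by
walking that chain backwards.]

    class BitWriter:
        """Packs variable-width tokens into a byte stream, low bits first."""

        def __init__(self):
            self.out = bytearray()
            self.acc = 0        # bits accumulated but not yet written out
            self.nbits = 0      # how many bits are sitting in acc

        def write(self, token: int, width: int):
            # Append 'width' bits of 'token' after whatever is already pending.
            self.acc |= token << self.nbits
            self.nbits += width
            while self.nbits >= 8:              # emit every complete byte
                self.out.append(self.acc & 0xFF)
                self.acc >>= 8
                self.nbits -= 8

        def flush(self) -> bytes:
            if self.nbits:                      # pad out the final partial byte
                self.out.append(self.acc & 0xFF)
                self.acc = 0
                self.nbits = 0
            return bytes(self.out)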
That's about all I can tell you about it.