An email interview with Alfred, author of AlfCrunch for the
Atari 8-bit.

Date: Thu, 20 Nov 2025 12:35:25 -0500
From: Alfred
To: B. Watson <urchlay@slackware.uk>
Subject: Re: UnAlf

On 2025-11-20 12:37 a.m., B. Watson wrote:

> 1. Was AlfCrunch public domain, shareware, or...?

1. AlfCrunch was public domain, although I never did distribute the 
source code, and as I recall nobody ever asked me for it. The programmer 
at ICD, Mike Gustafson, did the same as you. He disassembled the DZ and 
added it to their SpartaDos X along with all the ARC routines so they 
could handle almost anything. Bob Puff at CSS did the same, so he could 
add it to his SuperUnArc. He phoned me one night to say his code was 
faster than mine at decompressing an AlfCrunch file. We had a good laugh 
about it.

> 2. Do you have any old disks (or disk images), or paper
> notes, anything that you used or referred to while developing
> AlfCrunch? Source would be awesome (if you still have it and are
> willing to share it). Even just the original distribution disk would
> be good to have.

2. I didn't distribute it on disk that I can recall; it was either the
two files posted on my BBS, or perhaps they were Arc'd, I just don't
recall now. Probably Arc'd, because there was a doc file with it.

I've attached the source code for LZ/DZ. This isn't the original, which
was done with Mac/65; it's broken up to use the Six Forks assembler,
which I had just started using for a bit around then.

> 3. Why not ARC compatible compression? You said you ported a PC
> program you had the source to... was it just not a matter of having
> the source for ARC? Or did you decide to go for speed rather than
> compatibility?

3. I didn't have any source code for ARC, and I didn't know what the
various compression algorithms were. I vaguely knew about Huffman from
work, as one of the big software programs used it, but I had no idea how
it was implemented. I read the LZW paper but I didn't really understand
it then. Everyone hated Walden's ARC because it was so slow and buggy,
but it was all there was. One day somewhere I ran across some guy's
implementation of LZW for the PC, and I thought to try porting it
because it had to be faster than ARC. It was in assembler, so I could
kind of understand it. I'd seen some of the ARC source, but the C code
was just gibberish to me. That's why my version is so clunky: I was
doing like you, just porting each x86 instruction to its sort-of 6502
equivalent. I couldn't make changes to the code because I didn't
understand what it was doing back then.

  After I released the first version, someone called me and said their
Arcviewer didn't work on .alf files, so I quickly fixed the header to be
Arc compatible, to the extent that you could see what the files were,
and that's the 1.4 version. So if you run across a 1.2, it's the same
except for the header. I don't think anyone saw 1.2 except for some
local people, because I released 1.4 so fast.

> 4. Did you ever work on AlfCrunch after the 1.4 release? You mention a
> couple of possibilities for the next version in your doc file. Did any
> of that ever materialize (even if unreleased)?

4. I did some work on an LZ 2.0, but I guess I quit on it; I only have
some source for the LZ part. I must have thought 1.4 was good enough and
moved on to something else.

> 5. Are you OK with me redistributing the decompression code from UnAlf
> under the WTFPL license?
> 
> 6. Are you OK with me including your AtariAge handle in my unalf
> documentation (man page)?

5 & 6. Sure you can distribute whatever you like and mention me in it. 
It's not like there's many people around who would remember me, heh.

LZW is fairly straightforward (now, lol) but it can be a bit hard to get
the idea from just reading code. The way it works is that a single token
is output that represents a string, hopefully a really long one, like:

token $122 = "went to the store and bought"

i.e. that whole string is associated with that one token. However, I
think tokens start as 9-bit values, so you actually output 9 bits, not
just 8.

So on the compress side, you start with a table of, I think, 8-bit
tokens, where the value is 0-$FF, which covers every possible input
value. If you were only doing, say, ASCII text, you could cut it down to
maybe 100 tokens; I'm not sure how many non-printables there are, like
SOL etc.
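
As a rough illustration (Python, not the LZ/DZ code; the names here are
made up), that starting table is just:

    # Every possible byte value 0-$FF is its own token; the first token a
    # compressor can create is $100, which is why output starts at 9 bits.
    table = {bytes([b]): b for b in range(256)}
    next_token = 0x100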

Anyway, you start reading in bytes of data. You check a byte against the
table, and if you don't find it, you add the next token value, which
would be $100, and save that byte to that token's entry. Now that can't
happen with the first token, because it has to be one of the starting
256 bytes. If you find the token, then you remember it and continue by
reading the next character. Now you're checking the table to see if
there's a token for 1st+2nd, which there isn't. So you create a new
token, $100, and add the two-byte string as its value, and you output
the token for the first byte. Now the third byte probably doesn't match
the first byte, so it'll be the same process. Since there's no string of
3rd+4th, you'll output the token for the third byte, and add a new token
that represents those two bytes. Now with a good matching data file,
like text, you'll probably see 1st+2nd again. So when it sees that first
byte value, it says ok, I have a token for that, so it keeps reading. It
sees the second byte and goes, I have a token for 1+2 too, so then it
reads the third byte, and now it goes, ok, I don't have a token for
1+2+3, so it outputs the token for 1+2, creates a new token, and stores
the string 1+2+3 as its value.
   So this process just goes on until you run out of data. With a good
match you'll get longer and longer runs of bytes that match earlier
strings, so you can get to the point where one token is the same as 40
characters. That's why LZW is so good. However, you run into trouble
with something like a GIF or JPG, because the bytes are all different;
you don't get runs of bytes. Especially not in JPG, because it's already
stripped out all the runs, which is why JPG files are so small.
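
To make that walkthrough concrete, here is a minimal sketch of the
compress loop in Python. This is an illustration, not the LZ code: it
leaves out the variable bit widths, the Clear/End tokens and the table
size limit, and all the names in it are made up.

    def lzw_compress(data):
        """Greedy LZW: emit integer tokens for the longest known strings.
        Bit packing, Clear/End tokens and table limits are omitted."""
        table = {bytes([b]): b for b in range(256)}  # the starting 256 entries
        next_token = 0x100                           # first created token
        out = []
        current = b""                                # longest string matched so far
        for byte in data:
            candidate = current + bytes([byte])
            if candidate in table:
                current = candidate                  # keep extending the match
            else:
                out.append(table[current])           # emit token for the known string
                table[candidate] = next_token        # new token = known string + 1 byte
                next_token += 1
                current = bytes([byte])              # restart the match with this byte
        if current:
            out.append(table[current])               # flush the last match
        return out

On input like b"ABABAB" this emits [$41, $42, $100, $100]: the token for
"AB" gets created while reading the second byte and then reused twice.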

The decompress is similar; it just works backwards. You start with the
same 256-entry table. You read the first token, and it matches, so you
output the value (the token is the character initially). Since it
matched, what you would normally do is take the string you output just
before this one, concatenate the first letter of this string onto that
last output string, and add it to the table as a new token value. Since
there is no previous string when you read the first byte, you do
nothing. So even if you didn't know what the starting table was, you
could rebuild it here, because all the initial tokens will be <$100,
since they didn't match anything longer at the beginning of the
compression, so eventually you would reconstruct the 256-entry table.
You short-circuit that part by starting with the known table.

So starting with the second token, you end up creating a bunch of
second-level entries, which are an initial table value plus something.
As long as the next token is in the table, you just keep going,
outputting strings and adding new tokens. Now, what happens if you get a
token that isn't in the table? This is where the math becomes magic; I
don't really understand the theory, but it works. You know it had to
have just been created by the compressor. So the string that this new
token represents has to at least start with the last string output, with
some new character added. So you take the last string output and
concatenate its first character to itself. So if the last string was the
value "ABCD", you create a new token in the table and add "ABCDA" as its
value, and you output the string "ABCDA". And so on.

Now you start with 9-bit tokens, I think. At some point, on the compress
side and on the decompress side, when you go to add a new token and it's
going to take more than 9 bits, you up the bit size to 10, which also
changes the highest token value you can have; I think that's what the
MAXCDE value represents. Because of the limited memory, I think I send
the clear code at the end of filling a 12-bit table, and start over with
9. Fortunately on the Atari you don't have giant files, so it doesn't
reset too often.

   There are a couple of special tokens: Clear, which when you see it
means you clear the table, and End, which tells you that it's the last
token of the data.
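
In Python terms, that width bookkeeping comes out something like the
sketch below. The Clear/End values and the 12-bit limit here are only
illustrative; the actual token values and the MAXCDE limit are whatever
the LZ/DZ source defines.

    # Illustrative values only; the real Clear/End tokens are set by LZ/DZ.
    # (If Clear and End really took $100 and $101, created tokens would
    # start at $102 rather than $100 as in the simplified sketches above.)
    CLEAR_TOKEN = 0x100
    END_TOKEN   = 0x101
    MAX_BITS    = 12                     # tokens top out at $FFF

    def token_width(next_token):
        """Bits needed for the tokens in use: 9 until $200 is reached,
        then 10 until $400, 11 until $800, and 12 from there on."""
        width = 9
        while width < MAX_BITS and next_token > (1 << width) - 1:
            width += 1
        return width

    # When the 12-bit table fills up, the compressor can send CLEAR_TOKEN,
    # throw the string table away, and start over with 9-bit tokens;
    # END_TOKEN marks the last token of the data.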

A lot of the code in LZ/DZ is the bit twiddling to concatenate the
varying-bitsize tokens in memory. I can't do that sort of math in my
head, so it's a lot of brute-force shifting of bits left and right to
align a new token to the last one in memory. The other thing I didn't
understand is that I don't think the code actually stores every full
string; maybe it does, but at the time I thought the guy was using some
scheme whereby he was only storing one character with each new token,
and it somehow was chained to the previous string.
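
For what it's worth, here is roughly what those two pieces look like in
Python: packing variable-width tokens into bytes with shifts, and the
"one character per token" table where each new entry just points back at
an earlier one. This is a sketch under assumptions, not a transcription
of LZ/DZ; in particular the bit order (low bits first) is a guess.

    def pack_tokens(tokens_and_widths):
        """Concatenate (token, width) pairs into bytes, low bits first."""
        acc = 0                              # bit accumulator
        nbits = 0                            # bits currently sitting in acc
        out = bytearray()
        for token, width in tokens_and_widths:
            acc |= token << nbits            # stack the token above what's there
            nbits += width
            while nbits >= 8:                # peel off whole bytes
                out.append(acc & 0xFF)
                acc >>= 8
                nbits -= 8
        if nbits:
            out.append(acc & 0xFF)           # flush any leftover bits
        return bytes(out)

    # The "one character per token" scheme: each created token stores only
    # (previous token, last byte), and the full string is rebuilt by walking
    # that chain backwards until it hits a literal byte value.
    def expand(chain_table, token):
        parts = []
        while token >= 0x100:                # created tokens point at earlier ones
            token, last_byte = chain_table[token]
            parts.append(last_byte)
        parts.append(token)                  # finally a literal byte, 0-$FF
        return bytes(reversed(parts))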

That's about all I can tell you about it.