| author | B. Watson <urchlay@slackware.uk> | 2025-11-22 05:48:59 -0500 |
|---|---|---|
| committer | B. Watson <urchlay@slackware.uk> | 2025-11-22 05:48:59 -0500 |
| commit | d3755243556975550188bf659787ba3599dcca0f (patch) | |
| tree | a4c42c33c8f23e0789ce913b357b5bdff3dcad30 /doc/interview.txt | |
| parent | ceb07d961981f1e553bb6ab0f6145c2aa84f9f5e (diff) | |
| download | unalf-d3755243556975550188bf659787ba3599dcca0f.tar.gz | |
Add doc/interview.txt.
Diffstat (limited to 'doc/interview.txt')
| -rw-r--r-- | doc/interview.txt | 166 |
1 files changed, 166 insertions, 0 deletions
diff --git a/doc/interview.txt b/doc/interview.txt
new file mode 100644
index 0000000..e7d375e
--- /dev/null
+++ b/doc/interview.txt
@@ -0,0 +1,166 @@

An email interview with Alfred, author of AlfCrunch for the Atari 8-bit.

Date: Thu, 20 Nov 2025 12:35:25 -0500
From: Alfred
To: B. Watson <urchlay@slackware.uk>
Subject: Re: UnAlf

On 2025-11-20 12:37 a.m., B. Watson wrote:

> 1. Was AlfCrunch public domain, shareware, or...?

1. AlfCrunch was public domain, although I never did distribute the source code, and as I recall nobody ever asked me for it. The programmer at ICD, Mike Gustafson, did the same as you. He disassembled the DZ and added it to their SpartaDos X along with all the ARC routines, so they could handle almost anything. Bob Puff at CSS did the same, so he could add it to his SuperUnArc. He phoned me one night to say his code was faster than mine at decompressing an AlfCrunch file. We had a good laugh about it.

> 2. Do you have any old disks (or disk images), or paper notes, anything that you used or referred to while developing AlfCrunch? Source would be awesome (if you still have it and are willing to share it). Even just the original distribution disk would be good to have.

2. I didn't distribute it on disk that I can recall; it was either the two files posted on my BBS, or perhaps they were Arc'd, I just don't recall now. Probably Arc'd, because there was a doc file with it.

I've attached the source code for LZ/DZ. This isn't the original, which was done with Mac/65; it's broken up to use the Six Forks assembler, which I had just started using for a bit around then.

> 3. Why not ARC-compatible compression? You said you ported a PC program you had the source to... was it just not a matter of having the source for ARC? Or did you decide to go for speed rather than compatibility?

3. I didn't have any source code for ARC and I didn't know what the various compression algorithms were. I vaguely knew about Huffman from work, as one of the big software programs used it, but I had no idea how it was implemented. I read the LZW paper but I didn't really understand it then. Everyone hated Walden's ARC because it was so slow and it was bugged, but it was all there was. One day somewhere I ran across some guy's implementation of LZW for the PC, and I thought to try porting it because it had to be faster than ARC. It was in assembler, so I could kind of understand it. I'd seen some of the ARC source, but the C code was just gibberish to me. It's why my version is so clunky: I was doing like you, just porting each x86 instruction to its sort-of 6502 variant. I couldn't make changes to the code because I didn't understand what it was doing back then.

After I released the first version, someone called me and said their Arcviewer didn't work on .alf files, so I quickly fixed the header to be ARC-compatible to the extent that you could see what the files were, and that's the 1.4 version. So if you run across a 1.2, it's the same except for the header. I don't think hardly anyone saw 1.2 except for some local people, because I released 1.4 so fast.

> 4. Did you ever work on AlfCrunch after the 1.4 release? You mention a couple of possibilities for the next version in your doc file. Did any of that ever materialize (even if unreleased)?

4. I did some work on an LZ 2.0 but I guess I quit on it; I only have some source for the LZ part. I must have thought 1.4 was good enough and moved on to something else.

> 5. Are you OK with me redistributing the decompression code from UnAlf under the WTFPL license?
>
> 6. Are you OK with me including your AtariAge handle in my unalf documentation (man page)?

5 & 6. Sure, you can distribute whatever you like and mention me in it. It's not like there's many people around who would remember me, heh.

LZW is fairly straightforward (now, lol) but it can be a bit hard to get the idea from just reading code. The way it works is that a single token is output that represents a string, hopefully a really long one, like:

token $122 = "went to the store and bought"

that is, the string associated with that token. However, I think tokens start as 9-bit values, so you actually output 9 bits, not just 8.

So on the compress side, you start with a table of, I think, 8-bit tokens, where the value is 0-$FF, which covers every possible input value. If we were only doing, say, ASCII text, you could cut it down to maybe 100 tokens; not sure how many non-printables there are, like SOL etc.

Anyway, you start reading in bytes of data. You check a byte against the table, and if you don't find it, you add the next token value, which would be $100, and save that byte to that token's entry. Now that can't happen with the first token, because it has to be one of the starting 256 bytes. If you find the token, then you remember it and continue by reading the next character. Now you're checking the table to see if there's a token for 1st+2nd, which there isn't. So you create a new token, $100, and add the 2-byte string as its value, and you output the token for the first byte. Now the third byte probably doesn't match the first byte, so it'll be the same process. Since there's no string of 3rd+4th, you'll output the token for the third byte, and add a new token that represents those two bytes. Now with a good matching data file, like text, you'll probably see 1st+2nd again. So when it sees that first byte value, it says ok, I have a token for that, so it keeps reading, and it sees the second byte and it goes, I have a token for 1+2 too, so then it reads the third byte and now it goes, ok, I don't have a token for 1+2+3, so it outputs the token for 1+2 and creates a new token and stores the string 1+2+3 as its value.

So this process just goes on until you run out of data. With a good match you'll get longer and longer runs of bytes that match earlier strings, so you can get to the point where one token is the same as 40 characters. That's why LZW is so good. However, you run into trouble with something like a GIF or JPG, because they're all different and you don't get runs of bytes. Especially not in JPG, because it's already stripped out all the runs, which is why JPG files are so small.
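[To make the walk-through above concrete, here is a minimal, generic LZW compressor sketch in Python. It follows the grow-the-table-as-you-go idea Alfred describes, but it is only an illustration: the function name, the plain list-of-integers output, and the lack of variable-width bit packing and Clear/End tokens are simplifications, not AlfCrunch's actual code or the .alf format.]

    # A minimal, generic LZW compressor sketch: start with one token per
    # byte value, grow the table as you go, and output the token for the
    # longest string already in the table. Codes come out as plain
    # integers; no 9-bit packing, no Clear/End handling.

    def lzw_compress(data: bytes) -> list[int]:
        table = {bytes([i]): i for i in range(256)}   # tokens $00-$FF
        next_code = 0x100                             # first "new" token
        out = []
        current = b""                                 # longest match so far

        for b in data:
            candidate = current + bytes([b])
            if candidate in table:
                current = candidate                   # keep extending the match
            else:
                out.append(table[current])            # emit token for the match
                table[candidate] = next_code          # learn match + next byte
                next_code += 1
                current = bytes([b])                  # start over at this byte
        if current:
            out.append(table[current])                # flush the last match
        return out

    if __name__ == "__main__":
        print(lzw_compress(b"ABABABAB"))   # [65, 66, 256, 258, 66]

[Running it on b"ABABABAB" shows the table paying off almost immediately: the $100 and $102 tokens in the output each stand for a two- or three-byte string the compressor only learned a step or two earlier.]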
The decompress is similar, it just works backwards. You start with the same 256-entry table. You read the first token, and it matches, so you output the value (the token is the character, initially). Since it matched, what you would normally do is take the string that you output just before this, concatenate the first letter of this string to that last output string, and add it to the table as a new token value. Since there is no previous string when you read the first byte, you do nothing. So even if you didn't know what the starting table was, you could rebuild it here, because all the initial tokens will be <$100, since they didn't match anything longer at the beginning of the compression, so eventually you would reconstruct the 256-entry table. You short-circuit that part by starting with the known table.

So starting with the second token, you end up creating a bunch of second-level entries, which are the initial table value plus something. As long as the next token is in the table, you just keep going, outputting strings and adding new tokens. Now, what happens if you get a token that isn't in the table? This is where the math becomes magic; I don't really understand the theory, but it works. You know it had to have just been created by the compressor. So the string that this new token represents has to at least start with the last token to which some new character was added. So you take the last string output and concatenate its first character to itself. So if the last string was the value "ABCD", you create a new token in the table and add "ABCDA" as its value, and you output the string "ABCDA". And so on.
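[Again as an illustration only, here is the matching generic decompressor sketch in Python, including the token-not-in-the-table special case Alfred describes (the "ABCD" to "ABCDA" step). It consumes the integer list produced by the lzw_compress() sketch above; real .alf decoding also has to deal with the variable-width codes and Clear/End tokens discussed below.]

    # Generic LZW decompressor for the token list produced by
    # lzw_compress() above. The interesting branch is the token that is
    # not in the table yet: it must be the previous string plus its own
    # first byte (the "ABCD" -> "ABCDA" case).

    def lzw_decompress(codes: list[int]) -> bytes:
        table = {i: bytes([i]) for i in range(256)}   # same known start table
        next_code = 0x100

        previous = table[codes[0]]          # first token is always a single byte
        out = bytearray(previous)

        for code in codes[1:]:
            if code in table:
                entry = table[code]
            elif code == next_code:
                entry = previous + previous[:1]       # the special case
            else:
                raise ValueError("bad token $%X" % code)
            out += entry
            # Mirror the compressor: previous string + first byte of this
            # string becomes the next new token.
            table[next_code] = previous + entry[:1]
            next_code += 1
            previous = entry
        return bytes(out)

    if __name__ == "__main__":
        print(lzw_decompress([65, 66, 256, 258, 66]))   # b'ABABABAB'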
Now, you start with 9-bit tokens, I think. At some point, on the compress side and on the decompress side, when you go to add a new token it's going to take more than 9 bits, so you up the bit size to 10, which also changes the highest token value you can have, which I think is what the MAXCDE value represents. Because of the limited memory, I think I send the clear code at the end of filling a 12-bit table, and start over with 9. Fortunately on the Atari you don't have giant files, so it doesn't reset too often.

There are a couple of special tokens: Clear, which when you see it you clear the table, and the End token, which tells you that it's the last token of the data.

A lot of the code in LZ/DZ is the bit twiddling to concatenate the varying-bitsize tokens in memory. I can't do that sort of math in my head, so it's a lot of brute-force shifting bits left and right to align a new token to the last one in memory. The other thing I didn't understand is that I don't think the code actually stores every full string (maybe it does), but at the time I thought the guy was using some scheme whereby he was only storing one character with each new token and it somehow was chained to the previous string.

That's about all I can tell you about it.
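[The bit twiddling Alfred mentions is usually the fiddliest part of an LZW codec, so here is one last small Python sketch of what it amounts to: pulling variable-width codes (9 bits growing toward 12) back out of a byte stream, plus the rule for when the width goes up. The LSB-first bit order, the exact switch-over point, and the assumption that Clear/End sit at $100/$101 are common LZW conventions used here for illustration; they have not been checked against the actual .alf layout.]

    # A tiny bit-reader: pull codes of arbitrary width out of a byte
    # stream, least-significant bit first. The LSB-first order is an
    # assumption (a common LZW convention), not a statement about .alf.

    class BitReader:
        def __init__(self, data: bytes):
            self.data = data
            self.pos = 0                               # absolute bit position

        def read_code(self, width: int) -> int:
            value = 0
            for i in range(width):
                byte = self.data[self.pos >> 3]        # which byte
                bit = (byte >> (self.pos & 7)) & 1     # which bit within it
                value |= bit << i
                self.pos += 1
            return value

    if __name__ == "__main__":
        # Two 9-bit codes, $041 and $142, hand-packed LSB-first into 3 bytes.
        r = BitReader(bytes([0x41, 0x84, 0x02]))
        print(hex(r.read_code(9)), hex(r.read_code(9)))    # 0x41 0x142

        # Width-growth rule: bump the code size whenever the next free
        # table slot no longer fits, capped at 12 bits. Starting the count
        # at $102 assumes Clear/End occupy $100/$101; the exact switch-over
        # point also varies between LZW implementations.
        width = 9
        for next_code in range(0x102, 0x1000):
            if next_code >= (1 << width) and width < 12:
                width += 1
                print("next_code $%03X -> %d-bit codes" % (next_code, width))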
