For the .xex build (but not the cartridge), the title screen uses
a crude form of compression, which I'll call ZRLE (zero-run length
encoding). It's a special-purpose compression scheme I came up with,
to meet the following requirements:

- Must be able to compress the Taipan title screen by at least 33%
  (66% compression ratio, where the decompress code is counted as part
  of the compressed file size).

- Must be able to decompress in less than 250 bytes of 6502 asm
  code (2 single-density sectors on disk).

- Must decompress title screen in less than 1/4 sec on the Atari.

All 3 requirements are exceeded slightly: the screen is compressed at a
60% ratio, the decompressor is around 160 bytes of object code, and it
runs in approximately 1/5 of a second.

Things that are NOT requirements: it doesn't need to compress anything
else well, just the title screen. It turns out that it works pretty well
for bitmapped graphics like the Atari uses, or any kind of file that's
likely to have large areas of all 0 bytes, but e.g. it can't compress an
ASCII or ATASCII text file at all (because there are pretty much never
any null bytes in a human-readable text file).

Apologia:
---------

I'm aware that cc65 has a zlib implementation, but it doesn't quite meet
the requirements: it's slow, large (linking it adds 781 bytes to the
executable size), and the best DEFLATE compression I could manage for
my title screen was 2806 bytes (using 7z). Add to that the 781 bytes of
zlib code and the other required libs for a C program, and I end up with
something like a 65% compression ratio (since I'm counting the code size
as part of the file size), and it takes 2/3 of a second to decompress. So,
it meets one of my requirements, but not all of them. Which should not be
taken as a criticism of cc65's zlib implementation: It's amazingly small,
efficient, and easy to use. It's just that deflate is a general-purpose
compression algorithm, and I thought I could do a better job with a
purpose-built one... which I think I've done.

I didn't know exomizer existed, before I started on my compression scheme.
If I'd known, and if I'd tried it out, I probably would have just used
it instead. It has a better compression ratio (55% for my title screen,
including decomp code) and decompresses fast enough.

Theory of operation:
--------------------

Unlike real RLE, ZRLE only compresses runs of 0-bytes. Non-zero bytes
are stored as-is, not compressed at all. Also unlike real RLE, ZRLE
won't work on arbitrary input data. Why?

ZRLE relies on some byte values being unused in the input. In other words,
if the file contains every byte value 0 to 255, it can't be compressed
with ZRLE. There needs to be at least one unused value per length of
null run found in the file, because the unused values are used as markers
telling the decoder how long each run of 0-bytes is.

There is no "escape" byte in ZRLE, or any sort of block structure. Each
byte is either a byte of pixels, or a marker indicating a run of 0-bytes.

In other words, if the input looks like:

01 00 00 02 00 00 03 00 00 00 04

...there are 2 lengths of zero-run: 2 and 3. The fact that there are
two runs of length 2 doesn't affect the algorithm (in fact it helps
the compression ratio). Here are the runs, illustrated in glorious
ASCII art:

01 00 00 02 00 00 03 00 00 00 04
   \___/    \___/    \______/
   run of   run of    run of
  2 zeroes  2 zeroes  3 zeroes

For each length of zero-run found in the file, an otherwise-unused byte
is assigned as a marker for that length. The above input might be encoded
as:

01 05 02 05 03 06 04

...where 05 represents a run of two zeroes, and 06 represents a run of
three zeroes.

This means there are no escape bytes or separate run counts stored in
the file. Each run of zero-bytes collapses into one byte of data,
which we'll call a code value. All bytes that aren't code values are
just uncompressed data bytes, used as-is.

If there aren't enough unused values, the encoder will fail, saying it's
out of codes.
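
The encoding steps above can be sketched in Python. This is only an
illustration, not the real titlecomp.pl: the name zrle_encode, the
first_code parameter, and the policy of handing out unused values in
ascending order are all assumptions made for the example.

```python
def zrle_encode(data, first_code=128):
    """Hypothetical ZRLE encoder sketch.

    Returns (encoded bytes, {code value: zero-run length}).
    """
    data = bytes(data)

    # Split the input into literal bytes and zero runs.
    tokens = []
    i = 0
    while i < len(data):
        if data[i] != 0:
            tokens.append(("lit", data[i]))
            i += 1
            continue
        j = i
        while j < len(data) and data[j] == 0:
            j += 1
        tokens.append(("run", j - i))
        i = j

    # Only runs of 2 or more zeroes need a code; a lone zero is stored as-is.
    run_lengths = sorted({n for kind, n in tokens if kind == "run" and n > 1})
    if run_lengths and run_lengths[-1] > 255:
        raise ValueError("zero run longer than 255 bytes")

    # Byte values that never occur in the input can serve as code values.
    used = set(data)
    free = [v for v in range(first_code, 256) if v not in used]
    if len(free) < len(run_lengths):
        raise ValueError("out of codes")

    # Assumption: assign free values to run lengths in ascending order.
    code_for = dict(zip(run_lengths, free))

    out = bytearray()
    for kind, n in tokens:
        if kind == "lit":
            out.append(n)
        elif n == 1:
            out.append(0)
        else:
            out.append(code_for[n])
    table = {code: length for length, code in code_for.items()}
    return bytes(out), table


example = bytes.fromhex("0100000200000300000004")
encoded, table = zrle_encode(example, first_code=5)
print(" ".join(f"{b:02x}" for b in encoded))   # 01 05 02 05 03 06 04
```

Passing first_code=5 is only there to reproduce the toy example, where
05 and 06 are the first unused values.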

But how can this be decoded?

The encoder creates a table for the decoder to use, of course. The
table is written to the file comptitle.s, along with the rest of the
decoder code from comptitle.s.in. This means the decoder is specific to
the encoded file, not general-purpose. If we weren't on such a limited
system as the Atari 800, the table would be part of the output file,
not the code (like the Huffman tables stored in a gzipped file, etc).

Each byte value in the compressed data either represents itself (is a
plain data value) or some number of consecutive zero bytes (is a code
value). The table has an entry for each byte value 0-255 (conceptually
anyway; see below), and if the byte value represents itself, the table
entry will be zero. If it represents a zero-run, the table entry will
be non-zero, and will be the number of bytes in the run.

For a "run" of one zero-byte, the encoder doesn't bother to encode it as
a run: a plain zero is stored in the output file instead. This doesn't
affect the file size, but does save one encoded value.

Here's the above example again:

01 00 00 02 00 00 03 00 00 00 04

It encoded to:

01 05 02 05 03 06 04

The table built into the decoder will look like:

byte value | definition
-----------+-----------
01         | 0
02         | 0
03         | 0
04         | 0
05         | 2
06         | 3

This tells the decoder that 01, 02, 03, and 04 represent themselves
(they're copied from compressed input to decompressed output as-is)...
the value 05 represents a run of 2 consecutive zeroes, and 06 represents
a run of 3 zeroes.

The decoder will look at the input (the compressed data), one byte at a
time, and consult the table to decide what to do with each byte. "Output"
here is the decompressed data, which you can think of as a list of bytes,
which starts out empty.

input| table lookup |                 |
data | result       | action          | output after action is taken
-----+--------------+-----------------+---------------------------------
01   | 0            | append 01       | 01
     |              |                 |
05   | 2            | append 2 zeroes | 01 00 00
     |              |                 |
02   | 0            | append 02       | 01 00 00 02
     |              |                 |
05   | 2            | append 2 zeroes | 01 00 00 02 00 00
     |              |                 |
03   | 0            | append 03       | 01 00 00 02 00 00 03
     |              |                 |
06   | 3            | append 3 zeroes | 01 00 00 02 00 00 03 00 00 00
     |              |                 |
04   | 0            | append 04       | 01 00 00 02 00 00 03 00 00 00 04

As you can see, the final output after the last byte of data is read,
matches the original input exactly.
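
The walkthrough above amounts to a very short loop. Here it is as a
Python sketch (the real decoder is 6502 assembly; the function name
zrle_decode and the use of a dict for the table are assumptions made
for illustration):

```python
def zrle_decode(encoded, table):
    """Hypothetical decoder sketch.

    table maps a byte value to its zero-run length; 0 (or a missing
    entry) means the byte is plain data and represents itself.
    """
    out = bytearray()
    for value in encoded:
        run = table.get(value, 0)
        if run == 0:
            out.append(value)        # plain data byte: copy as-is
        else:
            out.extend(bytes(run))   # code value: append `run` zero bytes
    return bytes(out)


# The table from the example above.
table = {0x01: 0, 0x02: 0, 0x03: 0, 0x04: 0, 0x05: 2, 0x06: 3}
decoded = zrle_decode(bytes.fromhex("01050205030604"), table)
print(" ".join(f"{b:02x}" for b in decoded))
# 01 00 00 02 00 00 03 00 00 00 04
```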

In this dumb example, it's pretty obvious that the encoded data plus
the table size is bigger than the original input, so the "compression"
is actually making it bigger. But it serves to illustrate, I hope.

The table lookup values are stored as one byte, so a run of up to 255
consecutive zeroes can be stored as a single byte. A run of more than
that could be stored as 2 or more runs in a row, but the current encoder
just aborts if it encounters a run of 256 or more zeroes (Taipan doesn't
need it so I didn't implement it).
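
For the curious, splitting an over-long run might look something like
the Python sketch below. To be clear, this is NOT implemented in the
actual encoder; split_run is a made-up name, and each resulting piece
would still need its own code value (or a plain 0, for a piece of
length 1).

```python
def split_run(length, max_run=255):
    """Split a zero-run length into pieces of at most max_run bytes each."""
    pieces = []
    while length > max_run:
        pieces.append(max_run)
        length -= max_run
    if length:
        pieces.append(length)
    return pieces


print(split_run(600))   # [255, 255, 90]
print(split_run(256))   # [255, 1]
```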

The astute reader will have noticed that a table lookup result of 1,
meaning a zero-run of length 1, is possible to encode, but useless
(might as well just store the single zero as plain data), and including
an entry in the table for 1 makes the table larger by one entry for
no purpose. The actual encoder (see below) doesn't encode zero-runs
of length 1, it just stores a 0 as-is.

For Taipan, the title screen (newtitle.png) contains large black areas.
A black pixel is a 0 bit. The title screen data ends up packed 8 pixels
per byte (1 bit per pixel), so those black areas are represented by runs
of consecutive zero bytes.

Each row of pixels in the image has 256 bits, or 32 bytes. The bytes
are arranged in consecutive order: the leftmost 8 visible pixels are the
first byte of the first row... after enough bytes to fill up the row, the
next byte in the file is the leftmost 8 pixels of the 2nd line, etc etc.

The algorithm treats the input as a stream of bytes, and doesn't know
nor care how they're arranged on the screen, but if you know how they're
stored you can see that a black area at the right-hand end of one line
is "joined" with a black area at the left-hand end of the next line,
as far as the compression is concerned.
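
Here's a small Python sketch of that layout, showing a black area
wrapping from the end of one row into the start of the next. The name
pack_rows is made up, and the bit order (leftmost pixel in the high
bit) is an assumption of the example, not something taken from
newtitle.pl.

```python
WIDTH = 256  # pixels per row, as described above


def pack_rows(rows):
    """Pack rows of 0/1 pixels into bytes, 8 pixels per byte, row after row."""
    out = bytearray()
    for row in rows:
        for x in range(0, len(row), 8):
            byte = 0
            for bit in row[x:x + 8]:
                byte = (byte << 1) | bit   # assumption: leftmost pixel = high bit
            out.append(byte)
    return bytes(out)


# Row 1 is lit only in its leftmost 8 pixels, row 2 only in its rightmost 8.
row1 = [1] * 8 + [0] * (WIDTH - 8)
row2 = [0] * (WIDTH - 8) + [1] * 8
packed = pack_rows([row1, row2])

# 31 zero bytes end row 1 and 31 more begin row 2: one run of 62 zeroes.
print(len(packed), packed[1:63] == bytes(62))   # 64 True
```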

Specifics:
----------

The original image newtitle.png is the authoritative source for the title
screen that ends up being built into the game binary. Its resolution is
256x184 pixels. If this image is ever modified, the Makefile will cause
all the stuff mentioned below to be rebuilt.

First, a script called newtitle.pl reads the PNG image and creates an
Atari loadable file called titledata.xex. It contains a 6-byte Atari
header followed by the pixel data, one bit per pixel. The pixel data
is 5888 bytes. Formerly, before the compression scheme was invented,
titledata.xex was used as-is in creating the game binary taipan.xex.

The compression code is written in Perl, in the file titlecomp.pl. It
reads titledata.xex, applies the compression, and writes a file called
comptitle.dat, containing the raw compressed data (minus the table
definition). It also creates an assembly source file comptitle.s,
containing the table definition and the rest of the decoding code.

If you look at titlecomp.pl, you'll see it takes an optional argument,
forcing it to use a particular byte value as the first code value. The
best value is found in two steps. First, run it with no arguments:

$ perl titlecomp.pl < titledata.dat
200 unique byte values
36 available run codes >= 128
1st code 128, last 189, table size 62
3437 bytes compressed data, 58.3% ratio
used 25 codes

The table size worked out to 62 bytes. We can likely do better than that,
but the process isn't automated. The "used 25 codes" bit is important:
we call titlecomp.pl with that number as an argument to its -l option:

$ perl titlecomp.pl -l 25 < titledata.dat
200 unique byte values
36 available run codes >= 128
133 57
147 57
148 62
149 65
151 64
154 62
162 57
163 67
164 71
166 70
== optimum firstcode value is 133

So now we compress the file for real, using 133 as the first code:

$ perl titlecomp.pl 133 < titledata.dat
200 unique byte values
36 available run codes >= 133
1st code 133, last 189, table size 57
3437 bytes compressed data, 58.3% ratio
used 25 codes

Plug the number 133 into the Makefile (under the comptitle.xex target) and
we're done. Whew, that was a lot of work just to save 5 bytes... the good
news is that this procedure only needs to be repeated if newtitle.png is
edited. The other good news is, those 5 bytes wouldn't have hurt anything
anyway. Worst case scenario, they would have added 1 sector to the file
size (and made it take ever so slightly longer to load).

Wait, what do I mean by table size, and why does the first code change
it? Well, I said the table
contains one entry for each byte value 0 to 255... This isn't really
true. Byte values less than 128 *always* represent themselves, so I
don't bother storing them in the table. Also, the perl script knows the
lowest and highest numbered code values, so it only emits the portion
of the table between these two numbers (and modifies the code to not do
a table lookup, if it's outside the range). If you can read 6502
assembly, looking at the code will make this much clearer.
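
If you can't read 6502 assembly, here's the same idea as a Python
sketch. The names trim_table and run_length are made up, and the run
lengths in the sample table are invented; only the 133..189 code range
comes from the titlecomp.pl output shown above.

```python
def trim_table(full_table):
    """Return (first_code, last_code, entries) covering only the code range."""
    codes = [value for value, run in enumerate(full_table) if run]
    first, last = min(codes), max(codes)
    return first, last, full_table[first:last + 1]


def run_length(value, first, last, entries):
    """Zero-run length for a byte value, or 0 if it's a plain data byte."""
    if value < first or value > last:
        return 0                     # outside the range: no table lookup
    return entries[value - first]


# Conceptual 256-entry table with a few invented run lengths.
full = [0] * 256
full[133] = 2
full[140] = 7
full[189] = 40

first, last, entries = trim_table(full)
print(first, last, len(entries))     # 133 189 57
```

Entries inside the range that aren't code values (134, say) are still
stored, as zeroes. That's why the spread between the lowest and highest
code determines the table size.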

comptitle.s is then assembled, to create comptitle.xex.

When comptitle.s is assembled, it includes the contents of comptitle.dat
with an ".incbin", so the data ends up in comptitle.xex along with the
table and decoder. This file contains 2 Atari binary load segments:
the data + table + decoder, and an initialization segment telling the
Atari to run the decoder subroutine immediately after loading.

comptitle.xex ends up being the first part of the complete taipan.xex game
binary, so after its init routine exits, the rest of the game continues
to load. In particular, the next segment is another init routine that
tells the Atari to actually display the decoded title screen data.

Compression Ratio:
------------------

If we want to calculate the compression ratio, the size of comptitle.xex
is the compressed file size, and the size of titledata.xex is the original
file size:

$ ls -l comptitle.xex titledata.xex 
-rw-r--r-- 1 urchlay users 3602 Jan  6 02:08 comptitle.xex
-rw-r--r-- 1 urchlay users 5894 Jan  6 01:55 titledata.xex

The percentage is:

$ perl -e 'printf "%.1f\n", 3602/5894*100'
61.1

Actually, we have to do the calculations in disk sectors. A standard Atari
floppy disk holds 125 data bytes per sector (or less, but we assume full
sectors because that's how taipan.xex is generated). It's impossible
to read less than a full sector from the drive. So titledata.xex is 48
sectors, and comptitle.xex is 29.

What we really care about is loading time, not disk space: the title
screen should be visible as soon as possible after the program starts
loading (to give the user something to look at, while the huge bloated
game code loads). Standard 810 or 1050 drives load around 10 sectors
per second, so the 19 sectors we saved amount to around 2 seconds. The
decompression takes 1/5 of a second, so it's a net gain.
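
The sector arithmetic, spelled out in Python. The file sizes come from
the ls output above. The drive speed and decompression time are the
approximate figures quoted in the text, so treat the result as a
ballpark number.

```python
import math

SECTOR_BYTES = 125       # data bytes per sector
SECTORS_PER_SEC = 10     # approximate 810/1050 load speed
DECOMPRESS_SEC = 0.2     # about 1/5 of a second


def sectors(size):
    """Number of whole sectors needed to hold `size` bytes."""
    return math.ceil(size / SECTOR_BYTES)


orig = sectors(5894)     # titledata.xex
comp = sectors(3602)     # comptitle.xex
saved = orig - comp
net = saved / SECTORS_PER_SEC - DECOMPRESS_SEC

print(orig, comp, saved)   # 48 29 19
print(round(net, 1))       # 1.7 (seconds saved, after decompressing)
```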