dla.xex notes
-------------

The way I use ca65 is unusual: I build Atari code with "-t none"
instead of "-t atari", and I generate my own XEX file headers (see
xex.inc) instead of using the cc65 linker. This is because cc65's
default atari linker script doesn't support multi-segment executables,
and it's a royal PITA to write custom cc65 linker scripts. Also
because it makes ca65 behave more like Mac/65, which was my go-to
assembler back in the old days.
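
For reference, here is a rough Python sketch of what a multi-segment XEX
file looks like on disk. It is not the xex.inc code, just the standard
Atari binary load format: a $FFFF marker, then for each segment a
little-endian start address, end address, and the data. The load
addresses and the build_xex/xex_segment names are made up for
illustration.

```python
import struct

def xex_segment(start, data, first=False):
    """One segment: optional $FFFF marker, start and end address, then data."""
    end = start + len(data) - 1
    header = struct.pack("<HH", start, end)  # both addresses little-endian
    if first:
        header = b"\xff\xff" + header
    return header + bytes(data)

def build_xex(segments, run_addr=None):
    """segments: list of (load_address, data). run_addr: optional entry point."""
    out = b""
    for i, (start, data) in enumerate(segments):
        out += xex_segment(start, data, first=(i == 0))
    if run_addr is not None:
        # RUNAD ($02E0-$02E1): DOS jumps through it after the file is loaded.
        out += xex_segment(0x02E0, struct.pack("<H", run_addr))
    return out

# Hypothetical layout: one byte of code at $2000, a small table at $4000.
xex = build_xex([(0x2000, b"\x60"), (0x4000, b"\x01\x02\x03")], run_addr=0x2000)
```

Each segment carries its own addresses, so a file can load into as many
separate memory regions as it likes.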

It might be possible to optimize this a bit further, maybe shave a
few percent off the run time. drunkwalk.s contains the innermost loop,
which has been unrolled and cycle-counted.

During generation, the ANTIC chip's DMA is disabled, to speed things
up. There wouldn't be anything to see anyway: the generation process
works with unpacked pixels (one per byte), so the ANTIC couldn't
display them properly. At the end, when all the particles are done,
the unpacked pixels are packed into bytes for display. Using unpacked
pixels is faster than doing all the shift-and-mask operations needed
for packed pixels, but it also uses a lot of memory (48K required, so
it won't run on my poor old 400).

Packing the pixels is a slow process: it takes about 0.6 seconds. Since
it happens only once at the end of a 3+ minute process, it's probably
not worth trying to optimize. See render.s. Also, at the start of
generation, 28K of memory has to be cleared, which takes 0.3 seconds.
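
The packing step amounts to something like the following Python sketch.
This is an illustration of the idea, not a translation of render.s; it
assumes any nonzero byte counts as a set pixel and that the leftmost
pixel goes into bit 7.

```python
def pack_pixels(unpacked):
    """Pack one-pixel-per-byte data into eight pixels per byte."""
    assert len(unpacked) % 8 == 0
    packed = bytearray(len(unpacked) // 8)
    for i in range(len(packed)):
        byte = 0
        for pixel in unpacked[i * 8:i * 8 + 8]:
            # Shift left and bring the next pixel in at bit 0, so the
            # first pixel of the group ends up in bit 7.
            byte = (byte << 1) | (1 if pixel else 0)
        packed[i] = byte
    return packed
```

Every source byte costs a shift, which is why this is slow compared to
a plain copy.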

There might be a quick way to limit how far the particles can wander
outside the initial circle. Right now, they're limited to a square area;
width and height are the diameter of the circle plus 10 pixels. The
corners of this square waste a lot of time; it'd be better to come up
with a way to do an octagon (the square with the corners cut off),
which shouldn't slow down the inner loop too much... I actually did
implement this, but it was too slow (the time spent in calculations
was longer than the time saved by doing them).
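
A Python sketch of the two bounds tests shows where the extra cost comes
from. The names, the choice of a regular octagon, and the 181/128
approximation of sqrt(2) are assumptions for illustration, not what the
asm did. dx and dy are offsets from the centre, and half is half the
square's width (radius plus 5).

```python
def in_square(dx, dy, half):
    # Two compares per step.
    return abs(dx) <= half and abs(dy) <= half

def in_octagon(dx, dy, half):
    # Same square test, plus an add and a compare to cut off the corners.
    # 181/128 is roughly sqrt(2), which gives a regular octagon.
    limit = (half * 181) >> 7
    return in_square(dx, dy, half) and abs(dx) + abs(dy) <= limit
```

The corners of the square come to roughly 17% of its area, so the extra
add and compare on every step has to cost less than the time saved by
not wandering around in them.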

Rather than calculate points on a circle in asm code, the tables of
points for the 4 circle sizes are pre-calculated by a perl script
and included in the executable verbatim. The tables bloat the code
some (2KB), but the speed boost is well worth it. Also, the graphics
mode used is "graphics 8", but in ANTIC narrow playfield mode, so
the X resolution is 256... meaning I don't need two bytes for the X
cursor position (which saves a good bit of time). The code that plots
pixels doesn't use CIO to do so (it writes directly to the screen
memory), which also saves time. There's no floating point math in the
generation process: if there were, the asm version wouldn't be all
that much faster than the BASIC one...
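
The table generator is along these lines. This is a Python sketch, not
the actual perl script; the centre (128, 96), the 256 points per table,
the radius, and the label names are all guesses made for the example.

```python
import math

def circle_table(radius, cx=128, cy=96, count=256):
    """Return (xs, ys): integer points evenly spaced around a circle."""
    xs, ys = [], []
    for i in range(count):
        angle = 2 * math.pi * i / count
        xs.append(cx + round(radius * math.cos(angle)))
        ys.append(cy + round(radius * math.sin(angle)))
    return xs, ys

def as_ca65(label, values, per_line=8):
    """Format a list of byte values as a labelled block of .byte lines."""
    lines = [f"{label}:"]
    for i in range(0, len(values), per_line):
        chunk = ", ".join(f"${v:02x}" for v in values[i:i + per_line])
        lines.append(f"    .byte {chunk}")
    return "\n".join(lines)

# Hypothetical radius and labels.
xs, ys = circle_table(40)
source = as_ca65("circle_x", xs) + "\n" + as_ca65("circle_y", ys)
```

With the X resolution at 256, every coordinate fits in a single .byte,
so the asm side only needs an indexed load to pick a starting point.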

It *does* use floating point to print integers (the default number of
particles in the prompt) and calculate the elapsed time in mmss.s. I
thought it would be easier to code that way; I'd forgotten what a PITA
the FP ROM is. It works now, so I won't change it.
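
For comparison, the elapsed-time arithmetic itself is just integer
division. The Python sketch below assumes the time is measured in
"jiffies" (one tick per video frame, 60 per second on NTSC and 50 on
PAL) and that the result is shown as MM:SS; neither detail is taken
from mmss.s, and jiffies_to_mmss is a made-up name.

```python
def jiffies_to_mmss(jiffies, hz=60):
    """Convert a frame count to an MM:SS string, dropping partial seconds."""
    seconds = jiffies // hz
    minutes, seconds = divmod(seconds, 60)
    return f"{minutes:02d}:{seconds:02d}"
```

For example, jiffies_to_mmss(11400) gives "03:10" at 60 Hz.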