dla.xex notes
-------------

The way I use ca65 is unusual: I build Atari code with "-t none" instead of "-t atari", and I generate my own XEX file headers (see xex.inc) instead of using the cc65 linker. This is because cc65's default atari linker script doesn't support multi-segment executables, and it's a royal PITA to write custom cc65 linker scripts. It also makes ca65 behave more like Mac/65, which was my go-to assembler back in the old days.

It might be possible to optimize this a bit further, maybe shave a few percent off the run time. drunkwalk.s contains the innermost loop, which has been unrolled and cycle-counted.

During generation, the ANTIC chip's DMA is disabled to speed things up. There wouldn't be anything to see anyway: the generation process works with unpacked pixels (one per byte), so the ANTIC couldn't display them properly. At the end, when all the particles are done, the unpacked pixels are packed into bytes for display. Using unpacked pixels is faster than doing all the shift-and-mask operations needed for packed pixels, but it also uses a lot of memory (48K required, so it won't run on my poor old 400).

Packing the pixels is a slow process: it takes about 0.6 seconds. Since it happens only once, at the end of a 3+ minute process, it's probably not worth trying to optimize. See render.s. Also, at the start of generation, 28K of memory has to be cleared, which takes 0.3 seconds.

There might be a quick way to limit the particles' movement outside the initial circle's radius. Right now, it's limited to a square area; width and height are the diameter of the circle plus 10 pixels. The corners of this square waste a lot of time; it'd be better to come up with a way to do an octagon (the square with the corners cut off), which shouldn't slow down the inner loop too much... I actually did implement this, but it was too slow (the time spent in the calculations was longer than the time they saved).

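To make the square-versus-octagon trade-off concrete, here is a rough sketch in C (not the 6502 code from dla.xex, and not necessarily how the octagon version was written). The names in_square, in_octagon, half and cut are made up for illustration: dx and dy are the particle's offsets from the center, half is half the side of the bounding square, and cut is how far each corner is trimmed along the edges.

```c
#include <stdlib.h>

/* Hypothetical helper: is the offset (dx, dy) inside a square of
   half-width `half` centered on the origin? Two comparisons. */
static int in_square(int dx, int dy, int half)
{
    return abs(dx) <= half && abs(dy) <= half;
}

/* Hypothetical helper: the same square with each corner trimmed by `cut`
   pixels along both edges. The corner at (half, half) is removed by the
   diagonal line |dx| + |dy| = 2*half - cut, which costs one extra
   addition and one extra comparison per test. */
static int in_octagon(int dx, int dy, int half, int cut)
{
    return in_square(dx, dy, half)
        && abs(dx) + abs(dy) <= 2 * half - cut;
}
```

The extra add-and-compare looks cheap in C, but in a cycle-counted inner loop on a 6502 it runs on every step of every particle, which fits the observation above that the checks cost more time than they saved.
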
Rather than calculate points on a circle in asm code, the tables of points for the 4 circle sizes are pre-calculated by a perl script and included in the executable verbatim. The tables bloat the code some (2KB), but the speed boost is well worth it.

Also, the graphics mode used is "graphics 8", but in ANTIC narrow playfield mode, so the X resolution is 256... meaning I don't need two bytes for the X cursor position (which saves a good bit of time). The code that plots pixels doesn't use CIO to do so (it writes directly to the screen memory), which also saves time.

There's no floating point math in the generation process: if there were, the asm version wouldn't be all that much faster than the BASIC one... It *does* use floating point to print integers (the default number of particles in the prompt) and to calculate the elapsed time in mmss.s. I thought it would be easier to code that way; I'd forgotten what a PITA the FP ROM is. It works now, so I won't change it.
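
As an illustration of the final packing step described earlier, here is a minimal sketch in C (the real code is 6502 asm in render.s and may work differently). It assumes one row of 256 unpacked pixels, one per byte, where any nonzero byte means "pixel on"; that encoding, and the names pack_row, ROW_PIXELS and ROW_BYTES, are assumptions for the example. The output is the 32 bytes per row that a 256-pixel-wide graphics 8 line needs, with the leftmost pixel in bit 7.

```c
#include <stddef.h>
#include <stdint.h>

enum { ROW_PIXELS = 256, ROW_BYTES = ROW_PIXELS / 8 };

/* Pack one row of unpacked pixels (one byte each, nonzero = on; an
   assumed encoding) into ROW_BYTES bytes of 1-bit-per-pixel data,
   leftmost pixel in the most significant bit. */
static void pack_row(const uint8_t *src, uint8_t *dst)
{
    for (size_t b = 0; b < ROW_BYTES; b++) {
        uint8_t out = 0;
        for (int bit = 0; bit < 8; bit++) {
            out = (uint8_t)(out << 1);      /* make room for the next pixel */
            if (src[b * 8 + bit] != 0)
                out |= 1;                   /* shift the pixel in from the right */
        }
        dst[b] = out;
    }
}
```

Calling pack_row once per scanline converts the whole image. The eight shifts per output byte are exactly the kind of work the generation loop avoids by keeping pixels unpacked until the end.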