Atari 8-Bit Self Relocator
--------------------------

This is a modified form of a technique I saw in Bill Wilkinson's
Insight: Atari column in Compute! magazine (Issue 21, Feb 1982).
It creates Atari executables that relocate themselves to just
above MEMLO.

To build the relocator and run the demo, you'll need:

- cc65 from https://cc65.github.io/
- axe from https://slackware.uk/~urchlay/repos/bw-atari8-tools

...as well as standard Linux packages like make and gcc.

To build the demo, just type "make". The result is "reloc.atr", which
is an Atari disk image with DOS 2.0S and the relocatable program as
AUTORUN.SYS. Boot the disk on an Atari or emulator to see it run.
There's also "reloc25.atr", which is the same thing except it's DOS
2.5 (with MEMLO a bit higher).

The demo shows "Hello World" with changing colors, along with its own
load address, end address, and the current MEMLO. The important part
is that it got relocated to MEMLO and run from there. The code isn't
relocatable (see the source, "hello.s"). The relocator adjusted all
the absolute addresses on the fly, at load time.

There's also a "native.atr", which is a DOS 2.0S bootable disk with
the relocator compiled for the Atari, as MKRELOC.XEX. This will load
with DOS's L command, and will read LO.XEX and HI.XEX (which are
non-relocatable) and create a relocatable AUTORUN.SYS. Reboot
to see the demo run.

Usage
-----

To create relocating executables of your own software, you can
use either a modern system running a 6502 cross-assembler (atasm,
xa65, ca65, dasm, etc) or an Atari 8-bit.

First, write your code. There are some limitations:

- All your code and data must be in a single segment. Generally
  this means, only set the origin once, and don't use *= or .org
  again until the end (for RUNAD and/or INITAD).
- Your code's origin (start address) must be on a page boundary
  ($xx00), at $2800 or higher.
- You can use only one init address.

Once your code is written and tested:

- Assemble the code at a start address of $2800 or higher, as a
  regular Atari executable (.xex/.com/.bin file). The executable must
  be called "lo.xex" if you're cross-assembling, or "D:LO.XEX" if
  you're using the Atari.
- Change the start address (*= or .org directive) so that it's
  one page higher. If you used $2800, you'd change it to $2900.
- Assemble the code again, to an executable called "hi.xex",
  or "D:HI.XEX" on the Atari.
- Make sure you have the reloc.xex (D:RELOC.XEX) and mkreloc
  (D:MKRELOC.XEX) files in the same directory (or on the same disk).
- Run the relocator. On a modern system the command will be
  "./mkreloc" (or possibly just "mkreloc" if you installed it
  somewhere on your $PATH). On the Atari, load D:MKRELOC.XEX
  from the DOS menu.
- If you're using an Atari, wait a bit. Listen to the disk I/O
  beeps... when it's finished, you'll be back at the DOS menu.
  You will have a brand-new AUTORUN.SYS, which is the self
  relocating version of your program. You can reboot to run it.
- If you're on a modern system, you'll have a (lowercase)
  autorun.sys, which you can copy to a DOS disk image. You
  can also test-run it by directly loading it with an
  emulator (e.g. "atari800 autorun.sys"), if it can run
  without DOS.

How it works
------------

You assemble the code twice. The 2nd time around, you set the origin
one page higher than the first. You have two executables that are
identical except for the high bytes of absolute addresses within the
code (which differ by one). Based on this information, the relocator
can move the code to just above MEMLO and adjust all the addresses so
it'll actually run in its new location.
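
A minimal sketch of that comparison in C, assuming both payloads have
already been read into memory as byte arrays of equal length. The
function name is made up for illustration and is not taken from mkreloc;
it only shows that every differing byte should be exactly one greater in
the second assembly.

```c
#include <stddef.h>

/* Hypothetical helper: compare the payload bytes of the two assemblies.
 * Returns the number of bytes that need relocating, or -1 if the images
 * differ in some way other than "second byte is exactly one greater". */
long count_reloc_bytes(const unsigned char *lo, const unsigned char *hi,
                       size_t len)
{
    long count = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        if (hi[i] == lo[i])
            continue;                        /* not an address high byte */
        if (hi[i] != (unsigned char)(lo[i] + 1))
            return -1;                       /* unexpected difference */
        count++;
    }
    return count;
}
```

A result of -1 would suggest the two files were not built from the same
source, or that the origins were not exactly one page apart.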

Unfortunately, the code can only be relocated by multiples of 256
bytes. The low bytes aren't adjusted. So unless MEMLO happens to
contain $FF in its low byte, some memory will be wasted (up to 255
bytes).

The code from Insight: Atari doesn't produce self-relocating
executables. What it produces is BASIC programs that have the
relocatable object code as DATA statements, POKEd into memory when
run. The relocator presented here gets appended to your standard
executable and relocates it "on the fly", then jumps to the
(relocated) run and/or init address of the relocated code.

Example: a subroutine call to within our own code:

 JSR print_banner

This is the first instruction in our program. Say we assemble at
$4000, so it will be found at $4000 for the first assembly pass, and
$4100 for the second.

Say print_banner ends up at $4123 when we assemble at $4000, and $4223
when assembling at $4100. Further, we determine MEMLO has $1D80. So,
when we relocate the program, it ends up at $1E00 (the start of the
next page). The target of the JSR instruction has to be adjusted
to match the new location where print_banner is going to be. After
relocation, the JSR $4123 reads JSR $1F23 ($4123 - $4000 + $1E00).
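
The arithmetic for this example, as a small C sketch. Both helper names
are hypothetical, and the first one assumes the relocator always moves
the program to the next page after the one MEMLO points into, as in the
example.

```c
/* Hypothetical helpers illustrating the arithmetic in the example. */

/* First page boundary above MEMLO, e.g. $1D80 -> $1E00. */
unsigned int first_page_above(unsigned int memlo)
{
    return ((memlo >> 8) + 1) << 8;
}

/* Move an absolute address from the original ORG to the new base.
 * Only the high byte changes, since both are page-aligned. */
unsigned int relocate_addr(unsigned int addr, unsigned int org,
                           unsigned int new_base)
{
    return (addr - org + new_base) & 0xFFFF;
}
```

With MEMLO at $1D80 and an ORG of $4000, print_banner at $4123 ends up
$0123 bytes past the new base of $1E00.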

The code that does the relocation, we'll call the relocator. The term
"relocating loader" is used elsewhere, but it's not accurate here: DOS
is the loader, and we're not replacing it.

The relocator is a small routine that gets appended to the first
executable (the $4000 one) as a segment, plus two data tables. The
first is 8 bytes, and has the original ORG, end address, run address,
and init address. The other is a bitmap of the addresses in the program, one
bit per byte in the program. The bit is set to 1 if that address needs
relocating, or 0 if not.
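
A sketch of how such a bitmap could be built from the two assemblies, in
C. This is not the actual mkreloc source; the function name is invented,
and the bit order (high bit first) follows the table format described
later in this file.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: fill `bitmap` (at least (len + 7) / 8 bytes) with
 * one bit per payload byte, most significant bit first. A bit is 1 when
 * the byte differs between the two assemblies. */
void build_bitmap(const unsigned char *lo, const unsigned char *hi,
                  size_t len, unsigned char *bitmap)
{
    size_t i;

    memset(bitmap, 0, (len + 7) / 8);    /* unused trailing bits stay 0 */
    for (i = 0; i < len; i++) {
        if (lo[i] != hi[i])
            bitmap[i / 8] |= (unsigned char)(0x80 >> (i % 8));
    }
}
```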

The tables are followed by an INITAD segment that runs the relocator
code. The relocator and the tables have to load at a fixed address,
but once they're finished running, they won't be needed again.

The relocator has to know the load address and the length of the
"payload" segment of the program (the part it's going to relocate). At
load time, it gets run via INITAD. What it does:

1. Add 1 to the high byte of MEMLO, then subtract the result from the
   high byte of the load address ($4000 in the example). This gives us
   a positive number (we hope!) that is the amount each address's high
   byte in the program should have subtracted from it.

2. Loop over the code to be relocated, copying it to the new
   address (start of the first page above MEMLO). As each byte is
   moved, it's also adjusted (has the offset subtracted from it) if
   its bit in the relocation table is set.

3. Set MEMLO to point to the byte after the end of the program
   to protect it from being overwritten by e.g. BASIC or ASM/ED.

4. If the program has an init address, subtract the offset from its
   high byte, then jump to it. This runs the payload program's init
   routine.

5. If the program has a run address, subtract the offset from its
   high byte, storing the result in RUNAD. Then do an RTS to hand
   control back to DOS. DOS will run the relocated code by jumping to
   the altered RUNAD, in the usual way.
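
The steps above can be modelled in C. This is only a model: the real
relocator is 6502 code, and the struct and function names here are
invented. It works on a 64K array standing in for Atari memory, and
computes the offset as the load address's high byte minus the target
page, which matches the worked examples in this file. It returns the new
run and init addresses instead of jumping anywhere.

```c
/* Hypothetical model of the relocator's work. Names are assumptions. */
struct reloc_info {
    unsigned int start, end;   /* original load and end address (inclusive) */
    unsigned int run, init;    /* original run/init address, 0 for none */
};

struct reloc_result {
    unsigned int new_start, new_memlo, new_run, new_init;
};

struct reloc_result relocate(unsigned char *mem, unsigned int memlo,
                             const struct reloc_info *info,
                             const unsigned char *bitmap)
{
    struct reloc_result r;
    unsigned int dest_page = (memlo >> 8) + 1;             /* first page above MEMLO */
    unsigned int offset = (info->start >> 8) - dest_page;  /* pages to move down */
    unsigned int len = info->end - info->start + 1;
    unsigned int i;

    r.new_start = dest_page << 8;
    /* Copy upward: the destination is below the source, so this is safe. */
    for (i = 0; i < len; i++) {
        unsigned char b = mem[info->start + i];
        if (bitmap[i / 8] & (0x80 >> (i % 8)))
            b = (unsigned char)(b - offset);               /* adjust a high byte */
        mem[r.new_start + i] = b;
    }
    r.new_memlo = r.new_start + len;                       /* byte after the program */
    r.new_run  = info->run  ? info->run  - (offset << 8) : 0;
    r.new_init = info->init ? info->init - (offset << 8) : 0;
    return r;
}
```

The model does not check that the load address is actually above MEMLO;
if it isn't, the offset wraps around and the result is garbage.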

Notes:

- If your program is a device driver or a "TSR", you should use an
  init address, NOT a run address. This allows users to append your
  program to e.g. an RS-232 driver, and maybe a RAMdisk driver too,
  etc. Each driver should have an init address, because Atari
  executables can have multiple init addresses.

- If your program is an application, it's usually better to use a run
  address. If you use an init address, your program will run, but DOS
  will still be "in the middle of" loading the executable, meaning
  IOCB #1 will still be open for reading.

- The program's end address must be below $71C0, since that's where
  the relocator and tables load. The reason for this restriction
  is to allow the relocatable executable to work with a 16K cartridge.
  The lowest sane start address for the program is probably $2800,
  which allows the program to be 18880 bytes (about 18.4KB) in size...
  though $3000 is a lot safer (16832 bytes, about 16.4KB max).

- Whatever start address (ORG) you use for the program, it has to
  be higher than the current MEMLO when the relocation is done.
  That's why I said $3000 is safer than $2800: if someone uses a fancy
  DOS and/or has lots of device drivers loaded, MEMLO could exceed
  $2800, which would cause your program to crash when loaded.

- Also, the start address has to start on a page boundary ($xx00).
  Since it gets relocated to another page boundary, this means
  JMP (indirect) is safe to use: if the operand doesn't cross a
  page boundary, it still won't after it's relocated.

- The original Wilkinson scheme was done entirely in Atari BASIC.
  I use a C program to create the relocation tables, and the
  relocator itself becomes part of the relocatable program, so BASIC
  is not required. The relocator-generator will run on either the
  Atari or on a modern POSIX system.

- The Insight: Atari article mentions that OSS languages use a scheme
  like this to relocate themselves when loaded. The sources for the
  OSS languages that have been released have a BASIC XL program that
  generates the bitmaps.

- If your code has really tight cycle-counted timing loops, the timing
  might get thrown off due to relocation causing a branch to cross a
  page boundary, when it was originally not supposed to. This kind of
  code generally only belongs in games and demos. Relocatable code is
  usually used for things like device drivers or programming utilities.
  Games "take over" the whole machine and don't have to care about MEMLO
  or other software needing free RAM.

Format of the relocatable executable
------------------------------------

- Segment with the original code, at the original load address. This is
  a copy of the first segment of lo.xex, actually.
- Segment with the relocator code (from reloc.xex) and relocation tables.
- INITAD segment that runs the relocator code.

Note that the original RUNAD and INITAD segments (if any) don't appear
in the relocatable file as segments.
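
For reference, a sketch of writing one segment in the standard Atari DOS
binary load format: an optional $FFFF marker, the start and end address
as little-endian words, then the data. The helper name is hypothetical
and this is not taken from mkreloc.

```c
#include <stddef.h>

/* Hypothetical helper: append one segment to `out` and return the number
 * of bytes written. The $FFFF marker is required only before the first
 * segment of a file. */
size_t put_segment(unsigned char *out, int with_marker,
                   unsigned int start, const unsigned char *data, size_t len)
{
    size_t n = 0, i;
    unsigned int end = start + (unsigned int)len - 1;   /* inclusive end */

    if (with_marker) {
        out[n++] = 0xFF;
        out[n++] = 0xFF;
    }
    out[n++] = (unsigned char)(start & 0xFF);
    out[n++] = (unsigned char)(start >> 8);
    out[n++] = (unsigned char)(end & 0xFF);
    out[n++] = (unsigned char)(end >> 8);
    for (i = 0; i < len; i++)
        out[n++] = data[i];
    return n;
}
```

An INITAD segment is just a two-byte segment loaded at $02E2. Assuming,
purely for illustration, that the relocator's entry point is $71C0, its
data would be the bytes $C0 $71.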

Relocation tables start immediately after the last byte of the relocator.

First table is 8 bytes (4 words):
- Original load address
- Original end address
- Original run address (or 0 for none)
- Original init address (or 0 for none)
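
A sketch of reading this table in C. The struct and function names are
made up, and the words are assumed to be stored low byte first, as in
the example table further down.

```c
/* Hypothetical reader for the 8-byte address table. */
struct addr_table {
    unsigned int start, end, run, init;
};

/* Fetch a 16-bit little-endian word. */
static unsigned int get_word(const unsigned char *p)
{
    return p[0] | ((unsigned int)p[1] << 8);
}

struct addr_table read_addr_table(const unsigned char *p)
{
    struct addr_table t;

    t.start = get_word(p);       /* original load address */
    t.end   = get_word(p + 2);   /* original end address */
    t.run   = get_word(p + 4);   /* 0 means no run address */
    t.init  = get_word(p + 6);   /* 0 means no init address */
    return t;
}
```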

The next N bytes are the relocation bitmap table. See below.

For the init address, if it's not zero, the relocator jumps to it (at
its new location). As usual, when the init code is done, it exits with
an RTS, which will hand control back to DOS.

For the run address, if it's not zero, the relocator adjusts RUNAD,
and DOS uses RUNAD as usual when the program's done loading. Again,
an RTS returns to DOS.

Example:

 *=$4000
start:
 jsr set_color    ; $4000 JSR $4007
 jsr set_cursor   ; $4003 JSR $400E
 rts              ; $4006
set_color:
 lda bgcolor       ; $4007 LDA $4015
 sta COLOR2        ; $400A
 rts               ; $400D
set_cursor:
 lda cursor        ; $400E LDA $4016
 sta CRSINH        ; $4011
 rts               ; $4014
bgcolor: .byte $00 ; $4015
cursor:  .byte $01 ; $4016
 *=INITAD
 .word start

The address table for the above program:

$00 40 - code_start
$16 40 - code_end
$00 00 - code_run (no run address)
$00 40 - code_init

Relocation bitmap table, in binary:

table byte:   addresses:
00100100      $4000 to $4007
01000000      $4008 to $400F
10000000      $4010 to $4017

The bits are read left to right. The first 1 bit is for
address $4002, which is the high byte of the first JSR operand.

The last byte of the table actually extends past the end
of the program. Extra bits in the last byte are set to 0.

The bitmap table is always 1/8 the size of the code, rounded up to
the next byte. It might be possible someday to save space by letting
the table end early, if e.g. the last part of the program is fully
relocatable code (or data). Currently this isn't done, and I'm not
sure it's worth the extra complexity to implement.

Program loads from $4000 to $4016. If MEMLO was $1CFC, the relocator
will move the program to $1D00 - $1D16 and set MEMLO to $1D17. The
operand of the first instruction (was JSR $4007) will be altered
to $1D07 (aka $4007 - $4000 + $1D00), which is the address that the
subroutine got relocated to.

The original program assembled to a 32-byte file. The relocatable
version will be around 400 bytes: 28 bytes for the original file
(minus its INITAD segment), ~300 bytes for the relocator code, 8
bytes for the address table, and 3 bytes for the relocation bitmap.
However, the relocator and tables are only used once, and can be
overwritten afterwards (so they count as free memory).

Relocation Table Format
-----------------------

Bitmap. One bit per byte in the file, read from high bit to low. 1 if
the address needs adjusting, 0 if not.

- The relocator is 256 bytes long or less.
- The GR.0 display list with a 16K cartridge inserted is at $7C20.
- We want the bitmap to end just below $7C00.
- The bitmap table will always be 1/8 the code size.

If your code is 18880 bytes, the bitmap size is 2360 bytes.
Supposing you ORG at $2800:

code         - $2800 to $71BF
relocator    - $71C0 to $72BF
8-byte table - $72C0 to $72C7
bitmap       - $72C8 to $7BFF

18880 bytes is the maximum size. Actually, the relocator is only 183
bytes, and the table could extend to $7C1F without overwriting the
display list.
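
The size limit can be checked with a few lines of C. This is a sketch
using the figures quoted above; the names are hypothetical, and it
reserves the full 256 bytes for the relocator as the layout listing
does, rather than the actual 183.

```c
/* Hypothetical layout check using the figures quoted above. */
#define RELOC_START   0x71C0u   /* where the relocator loads */
#define RELOC_MAXLEN  0x0100u   /* relocator is 256 bytes or less */
#define TABLE_LEN     8u        /* four 16-bit addresses */
#define DLIST_START   0x7C20u   /* GR.0 display list with a 16K cartridge */

/* Bitmap size: one bit per code byte, rounded up to a whole byte. */
unsigned int bitmap_size(unsigned int code_len)
{
    return (code_len + 7u) / 8u;
}

/* Address just past the last bitmap byte. */
unsigned int bitmap_end(unsigned int code_len)
{
    return RELOC_START + RELOC_MAXLEN + TABLE_LEN + bitmap_size(code_len);
}

/* Nonzero if code assembled at `org` ends below the relocator and the
 * bitmap stays clear of the display list. */
int layout_fits(unsigned int org, unsigned int code_len)
{
    return org + code_len <= RELOC_START
        && bitmap_end(code_len) <= DLIST_START;
}
```

For an ORG of $2800 and 18880 bytes of code, the bitmap's last byte is
at $7BFF, so the layout fits; one more byte of code would run into the
relocator's load address.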