Atari 8-Bit Self Relocator
--------------------------

This is a modified form of a technique I saw in Bill Wilkinson's
Insight: Atari column in Compute! magazine (Issue 21, Feb 1982).

To build the relocator and run the demo, you'll need:

- cc65 from https://cc65.github.io/
- axe from https://slackware.uk/~urchlay/repos/bw-atari8-tools

...as well as standard Linux packages like make and perl.

To build, just type "make". The result is "reloc.atr", which is
an Atari disk image with DOS 2.0S and the relocatable program as
AUTORUN.SYS. Boot the disk on an Atari or emulator to see it run.

The demo shows "Hello World" with changing colors, along with its own
load address, end address, and the current MEMLO. The important part
is that it got relocated to MEMLO and run from there. The code isn't
relocatable (see the source, "hello.s"). The relocator adjusted all the
absolute addresses on the fly (at load time).

How it works
------------

You assemble the code twice. The 2nd time around, you set the origin
one page higher than the first. You have two executables that are
identical except for the high bytes of absolute addresses within the
code (which differ by one). Based on this information, the relocator
can move the code to just above MEMLO and adjust all the addresses so
it'll actually run in its new location.
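
Comparing the two assemblies can be sketched in Python as below. This
is only an illustration, not the perl script from this repo. The
function name find_reloc_addrs is made up, and it assumes raw code
bytes with the Atari segment headers already stripped.

```python
def find_reloc_addrs(image_a, image_b, org_a):
    """Return addresses (in the first assembly) of bytes needing adjustment.

    image_a: raw code bytes assembled at org_a
    image_b: the same code assembled one page (256 bytes) higher
    """
    if len(image_a) != len(image_b):
        raise ValueError("the two assemblies must be the same length")
    addrs = []
    for offset, (a, b) in enumerate(zip(image_a, image_b)):
        if a == b:
            continue
        # The only legitimate difference is a high byte that went up by one.
        if (a + 1) & 0xFF != b:
            raise ValueError("unexpected difference at offset %d" % offset)
        addrs.append(org_a + offset)
    return addrs

# NOP / JMP $4001, hand-assembled at $4000 and again at $4100
first = bytes([0xEA, 0x4C, 0x01, 0x40])
second = bytes([0xEA, 0x4C, 0x01, 0x41])
print([hex(a) for a in find_reloc_addrs(first, second, 0x4000)])
```

Only the JMP operand's high byte differs, so the result is the single
address $4003.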

Unfortunately, the code can only be relocated by multiples of 256
bytes. The low bytes aren't adjusted. So unless MEMLO happens to
contain $FF in its low byte, some memory will be wasted (up to 255
bytes).

The code from Insight: Atari doesn't produce self-relocating
executables. What it produces is BASIC programs that have the
relocatable object code as DATA statements, POKEd into memory when
run. The relocator presented here gets appended to your standard
executable and relocates it "on the fly", then jumps to the start of
the relocated code.

Example: a subroutine call to within our own code:

 JSR print_banner

This is the first instruction in our program, so it will be found
at $4000 for the first assembly pass, and $4100 for the second.

Say print_banner ends up at $4123 when we assemble at $4000, and $4223
when assembling at $4100. Further, suppose MEMLO holds $1D80. So,
when we relocate the program, it ends up at $1E00 (the start of the
next page). The target of the JSR instruction has to be adjusted
to match the new location where print_banner is going to be: its high
byte drops by $40 - $1E = $22. After relocation, the JSR $4123 reads
JSR $1F23.
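
The arithmetic for this example can be checked with a few lines of
Python. This is a sketch of the address math only, not the relocator
itself. The names new_base and relocate are made up, and the rounding
rule (always the page after MEMLO's page) is my reading of the
description.

```python
def new_base(memlo):
    """Start of the first page above MEMLO (assumed rounding rule)."""
    return ((memlo >> 8) + 1) << 8

def relocate(addr, org, memlo):
    """Adjust only the high byte of addr; the low byte is never touched."""
    page_offset = (org >> 8) - (new_base(memlo) >> 8)
    hi = ((addr >> 8) - page_offset) & 0xFF
    return (hi << 8) | (addr & 0xFF)

print(hex(new_base(0x1D80)))                   # 0x1e00
print(hex(relocate(0x4123, 0x4000, 0x1D80)))   # 0x1f23
```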

The code that does the relocation, we'll call the relocator. The term
"relocating loader" is used elsewhere, but it's not accurate here: DOS
is the loader, and we're not replacing it.

The relocator is a small routine that gets appended to the first
executable (the $4000 one) as a segment, plus two data tables (one for
the original load, end, run, and init addresses, the other with
the addresses that need adjusting), plus an INITAD segment that runs
the relocator code. These all have to load at a fixed address, but
once they're finished running, they won't be needed again.

The relocator has to know the load address and the length of the main
segment of the program (the part it's going to relocate). What it
does:

1. Add 1 to the high byte of MEMLO, then subtract the result from the
   high byte of the load address ($4000 in the example). This gives us
   a positive number (we hope!) that is the amount each address's high
   byte in the program should have subtracted from it.

2. Iterate over the relocation data table, subtracting the
   offset. Each table entry is the two-byte address of a byte that
   needs to be changed (an absolute address that's "baked" into the
   program).

3. Move the main segment to the start of the first page above MEMLO.

4. Set MEMLO to point to the byte after the end of the program
   to protect it from being overwritten by e.g. BASIC or ASM/ED.

5. If the program has an init address, subtract the offset from it,
   then jump to it. This runs the payload program's init routine.

6. If the program has a run address, subtract the offset from it,
   storing the result in RUNAD. Then do an RTS to hand control back
   to DOS. DOS will run the relocated code by jumping to the altered
   RUNAD, in the usual way.
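
The steps above can be modelled in Python on a 64K bytearray. This is
a rough sketch for following the logic, not a translation of the real
6502 code. The helper names are made up, the table is passed in as a
plain list, and the init routine's new address is returned rather than
called. MEMLO ($02E7) and RUNAD ($02E0) are the standard OS locations.

```python
MEMLO, RUNAD = 0x02E7, 0x02E0   # standard Atari OS locations

def peek16(mem, addr):
    return mem[addr] | (mem[addr + 1] << 8)

def poke16(mem, addr, value):
    mem[addr] = value & 0xFF
    mem[addr + 1] = (value >> 8) & 0xFF

def relocate(mem, start, end, run, init, table):
    """Model of the relocator. end is inclusive; 0 means no run/init.

    Returns the relocated init address (or 0); the real relocator
    would jump there instead of returning it.
    """
    new_hi = (peek16(mem, MEMLO) >> 8) + 1     # first page above MEMLO
    offset = (start >> 8) - new_hi             # pages to move down
    if offset <= 0:
        raise ValueError("program must load above MEMLO")
    for addr in table:                         # patch baked-in high bytes
        mem[addr] = (mem[addr] - offset) & 0xFF
    length = end - start + 1
    dest = new_hi << 8
    mem[dest:dest + length] = mem[start:start + length]   # move the code
    poke16(mem, MEMLO, dest + length)          # protect the moved program
    if run:                                    # DOS jumps through RUNAD later
        poke16(mem, RUNAD, run - (offset << 8))
    return init - (offset << 8) if init else 0

mem = bytearray(0x10000)
poke16(mem, MEMLO, 0x1D80)
mem[0x4000:0x4003] = bytes([0x4C, 0x00, 0x40])   # JMP $4000
new_init = relocate(mem, 0x4000, 0x4002, run=0x4000, init=0, table=[0x4002])
print(mem[0x1E00:0x1E03].hex(" "), hex(peek16(mem, MEMLO)), hex(peek16(mem, RUNAD)))
```

With MEMLO at $1D80 the three-byte program lands at $1E00, its JMP
operand becomes $1E00, MEMLO moves up to $1E03, and RUNAD holds $1E00.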

Notes:

- To keep things simple, the program must consist of a single
  segment of code and data, followed by an init address and/or a run
  address.

- If your program is a device driver or a "TSR", you should use an
  init address, NOT a run address. This allows users to append your
  program to e.g. an RS-232 driver, and maybe a RAMdisk driver too,
  etc. Each driver should have an init address, because Atari
  executables can have multiple init addresses.

- If your program is an application, it's usually better to use a run
  address. If you use an init address, your program will run, but DOS
  will still be "in the middle of" loading the executable, meaning
  IOCB #1 will still be open for reading.

- The program's end address must be below $6C00, since that's where
  the relocator and tables load. The reason for this restriction
  is to allow the relocatable executable to work with a 16K cartridge.
  The lowest sane start address for the program is probably $2000,
  which allows the program to be 19KB in size... though $3000 is
  a lot safer (15KB max).

- Whatever start address (ORG) you use for the program, it has to
  be higher than the current MEMLO when the relocation is done.
  That's why I said $3000 is safer than $2000: if someone uses a fancy
  DOS and/or has lots of device drivers loaded, MEMLO could exceed
  $2000, which would cause your program to crash when loaded.

- Also, the start address has to start on a page boundary ($xx00).

- The data table size must not exceed 4K. The table is compressed; see
  "Relocation Table Format", below.

- The original Wilkinson scheme was done entirely in Atari BASIC.
  I use a perl script to create the relocation tables and the
  relocator itself becomes part of the relocatable program, so BASIC
  is not required. The perl script will be rewritten in C at some
  point, and the C program will run on either the Atari or on
  a modern POSIX system.

- Indirect JMP instructions should always be used with care on the
  6502. The two bytes of the vector have to be in the same page, due
  to a 6502 bug: if the vector starts at $xxFF, the high byte is
  fetched from $xx00. Most 6502 asm programmers know how to handle
  this... but with dynamically relocatable code, there's not really a
  good way to do it. Best to avoid indirect JMPs. One simple
  workaround is to use
  self-modifying code: Have an absolute JMP instruction in your code,
  and store the indirect jump's destination there. Example:

 JMP (VECTOR)

...becomes:

 LDA VECTOR
 STA TRAMPOLINE+1
 LDA VECTOR+1
 STA TRAMPOLINE+2
 JMP TRAMPOLINE
 ; somewhere in the code you have this:
TRAMPOLINE JMP $0000

  Another way to do it would be to use call-by-RTS (push the jump
  address minus one on the stack, then execute RTS).

- If your code has really tight cycle-counted timing loops, the timing
  might get thrown off due to relocation causing a branch to cross a
  page boundary, when it was originally not supposed to. This kind of
  code generally only belongs in games and demos. Relocatable code is
  usually used for things like device drivers or programming utilities.
  Games "take over" the whole machine and don't have to care about MEMLO
  or other software needing free RAM.
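
The address restrictions in the notes above can be turned into a quick
pre-flight check. This is a hypothetical helper, not part of the build.
The name check_layout and the wording of the messages are made up.

```python
RELOCATOR_BASE = 0x6C00   # where the relocator and tables load

def check_layout(org, end, memlo):
    """Return a list of problems; an empty list means the layout is usable.

    org and end are the program's first and last addresses; memlo is
    the MEMLO value expected on the target system.
    """
    problems = []
    if org & 0xFF:
        problems.append("ORG is not on a page boundary")
    if end >= RELOCATOR_BASE:
        problems.append("program overlaps the relocator at $6C00")
    if org <= memlo:
        problems.append("ORG is not above MEMLO")
    return problems

print(check_layout(0x3000, 0x4FFF, 0x1D80))
print(check_layout(0x2000, 0x2FFF, 0x2100))
```

The first call returns an empty list. The second reports that an ORG
of $2000 is not above a MEMLO of $2100.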

Format of the relocatable executable:

- Segment with the original code, at the original load address.
- Segment with the relocator code and relocation tables.
- INITAD segment that runs the relocator code.

Note that the original RUNAD and INITAD segments (if any) don't appear
in the relocatable file as segments.

Relocation tables start immediately after the last byte of the relocator.

First table is 8 bytes (4 words):
- Original load address
- Original end address
- Original run address (or 0 for none)
- Original init address (or 0 for none)

The next N bytes are the high-byte relocation table. See below.

For the init address, if it's not zero, the relocator JSR's to it (at its
new location).

For the run address, if it's not zero, the relocator adjusts RUNAD,
and DOS uses RUNAD as usual when the program's done loading.

Example:

 *=$4000
start:
 jsr set_color    ; $4000 JSR $4007
 jsr set_cursor   ; $4003 JSR $400E
 rts              ; $4006
set_color:
 lda bgcolor       ; $4007 LDA $4015
 sta COLOR2        ; $400A
 rts               ; $400D
set_cursor:
 lda cursor        ; $400E LDA $4016
 sta CRSINH        ; $4011
 rts               ; $4014
bgcolor: .byte $00 ; $4015
cursor:  .byte $01 ; $4016
 *=INITAD
 .word start

The address table for the above program:

$00 40 - code_start
$16 40 - code_end
$00 00 - code_run (no run address)
$00 40 - code_init
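
For reference, the same 8 bytes can be produced on the host side with
Python's struct module. This is just a sketch of the byte layout (four
little-endian words). The function name is made up and this is not the
perl script's code.

```python
import struct

def pack_address_table(start, end, run=0, init=0):
    """Four little-endian 16-bit words: load, end, run, init."""
    return struct.pack("<4H", start, end, run, init)

table = pack_address_table(0x4000, 0x4016, init=0x4000)
print(table.hex(" "))   # 00 40 16 40 00 00 00 40
```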

High byte relocation table:

$02 $40 ; hi byte of JSR $4007 operand
$05 $40 ; hi byte of JSR $400E operand
$09 $40 ; hi byte of LDA $4015 operand
$10 $40 ; hi byte of LDA $4016
$00 $00 ; terminator

Program loads from $4000 to $4016. If MEMLO was $1CFC, the relocator
will move the program to $1D00 - $1D16 and set MEMLO to $1D17. The
operand of the first instruction (was JSR $4007) will be altered
to $1D07 (aka $4007 - $4000 + $1D00), which is the address that the
subroutine got relocated to.

The original program assembled to a 35-byte file. The relocatable
version will be around 400 bytes: 29 bytes for the original file
(minus its INITAD segment), ~300 bytes for the relocator code, 8
bytes for the address table, and 10 bytes for the relocation table.
However, the relocator and tables are only used once, and can be
overwritten afterwards (so they count as free memory).

Relocation Table Format
-----------------------

Current implementation:

A list of addresses that need to be adjusted (high bytes of absolute
addresses), 2 bytes each, terminated with $00 $00.
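
A minimal Python sketch of this list format, for experimenting on the
host side. The function names are made up, and the check for address
$0000 is my own addition (such an entry would be indistinguishable
from the terminator).

```python
import struct

def encode_reloc_table(addrs):
    """Little-endian 16-bit addresses followed by a $00 $00 terminator."""
    if 0 in addrs:
        raise ValueError("address $0000 would be read as the terminator")
    return b"".join(struct.pack("<H", a) for a in addrs) + b"\x00\x00"

def decode_reloc_table(data):
    """Read addresses until the $0000 terminator."""
    addrs = []
    for (addr,) in struct.iter_unpack("<H", data):
        if addr == 0:
            break
        addrs.append(addr)
    return addrs

encoded = encode_reloc_table([0x4002, 0x4005, 0x4009, 0x4010])
print(encoded.hex(" "))   # 02 40 05 40 09 40 10 40 00 00
```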

Possible future implementation:

Bitmap. One bit per byte in the file. 1 if the address needs
adjusting, 0 if not. This *probably* will actually be smaller than
the list of addresses. Also has the advantage of being a fixed size,
easily calculated/predicted.

The relocator is 256 bytes long or less.
The GR.0 display list with a 16K cart inserted is at $7C20.
We want to end the bitmap at $7C00.
Bitmap table will always be 1/8 the code size.

If your code is 18880 bytes, the bitmap size is 2360 bytes.
Supposing you ORG at $2800:

code - $2800 to $71BF
relocator - $71C0 to $72BF
8-byte table - $72C0 to $72C7
bitmap - $72C8 to $7BFF
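
The bitmap idea and the memory layout can be sketched in Python as
follows. Nothing here is implemented yet, so treat it as a thought
experiment. The function names are made up, the bit order (most
significant bit first) is an assumption, and the relocator is taken to
be a full 256 bytes.

```python
def build_bitmap(code_len, reloc_offsets):
    """One bit per code byte, set where the byte is a relocatable high byte.

    Bit order (MSB first within each byte) is an assumption.
    """
    bitmap = bytearray((code_len + 7) // 8)
    for off in reloc_offsets:
        bitmap[off // 8] |= 0x80 >> (off % 8)
    return bytes(bitmap)

def layout(org, code_len, relocator_len=256, table_len=8):
    """Return (name, first, last) address ranges for the proposed layout."""
    sizes = [("code", code_len), ("relocator", relocator_len),
             ("table", table_len), ("bitmap", (code_len + 7) // 8)]
    ranges, addr = [], org
    for name, size in sizes:
        ranges.append((name, addr, addr + size - 1))
        addr += size
    return ranges

for name, first, last in layout(0x2800, 18880):
    print("%-9s $%04X to $%04X" % (name, first, last))
```

With an ORG of $2800 and 18880 bytes of code, the bitmap's last byte
is $7BFF, so everything ends just below $7C00.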