aboutsummaryrefslogtreecommitdiff
path: root/doc/compression.txt
blob: 539e8bcff62dfd666667b353877fea540fc6642d (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
ALF Compressed Data Stream
--------------------------

This describes the compressed data for each file in an ALF
archive. Each archive member has a 29-byte header (described in
fileformat.txt), followed by its compressed data.

The compression algorithm is an implementation of LZW. See:

https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch

ALF uses a variable bit-width encoding, 9 to 12 bits per code. Codes
are numbered 0 to 4095 (aka 2^12-1). Each code represents either
an uncompressed data byte, a control code, or a multibyte string
made from previously-seen input, thus:

Code         Purpose
------------+----------------------------------------------
0 to 255    | 1-byte string (predefined)
256         | Reset (clear token table, set bit size to 9)
257         | End of compressed data
258 to 4095 | Multibyte strings
------------+----------------------------------------------

Codes 0 to 255 are mapped to their values (e.g. code 65 represents
the byte 65 in the input).

As the encoder reads the input, it builds a dictionary of codes.
Initially, codes are stored in the output as 9 bits each. When the
number of codes exceeds 511 (the maximum number that can be stored in
9 bits), the code bit size is increased to 10. Likewise, when the code
count exceeds a 10-bit value (code 1023), the bit size becomes 11,
then after code 2047 it becomes 12. When all 12-bit code numbers are
used (code 4095), the encoder writes a reset/clear code, clears the
dictionary, and resets the bit size to 9. Notice that, other than the
change from 12 back to 9, there is no explicit "next bit size" code
in the file. The encoder and decoder must agree upon when the bit size
increases.

When the bit size changes, the last code is written at the current bit
size (including the reset/clear code), and the following code is at
the new bit size. This means ALF doesn't use an "early change" variant
of LZW.

Bits are stored to the file in MSB-first order. If the compressed data
doesn't end on a byte boundary (has a number of bits not divisible
by 8), the least significant bits of the last byte are filled with
0 bits.

ALF compressed streams always begin with a reset code (256), *except*
for empty files (length of 0 bytes). The stream always ends with
an end code (257). This means a 0-byte file consists of one token
(the end code), 9 bits long, which occupies 2 bytes. A 1-byte file
compresses to a reset code, the one byte, and an end code (9 bits
each, 27 bits total, or 4 bytes).