From c205a7ea2a7171b61dae4ac51a3a251cceb1dde1 Mon Sep 17 00:00:00 2001
From: "B. Watson"
Date: Wed, 18 Dec 2024 05:47:07 -0500
Subject: detect UTF-16 surrogates as bad, use red for overlong

---
 uxd.rst | 16 ++++++++++------
 1 file changed, 10 insertions(+), 6 deletions(-)

(limited to 'uxd.rst')

diff --git a/uxd.rst b/uxd.rst
index 535177d..1789efe 100644
--- a/uxd.rst
+++ b/uxd.rst
@@ -234,14 +234,10 @@ changed with the **-c** option (see above).
   dumpers do. The Unicode BOM (byte order marker, U+FEFF) is printed as a
   purple letter B.
 
-  Note: Overlong encodings (e.g. codepoints U+0000 to U+007F encoded
-  as 2 or more bytes) are rendered as � (U+0FFD) in reverse video
-  purple.
-
 **red**
   Invalid UTF-8 sequences. These are rendered as � (U+0FFD) with
-  a red background, to make them stand out. Examples of invalid
-  sequences:
+  a red background, to make them stand out. Invalid
+  sequences are:
 
   - Prefix bytes (>= 0x80) which are not followed by the correct
     number of continuation bytes (with their high 2 bits set to **10**).
@@ -250,8 +246,16 @@ changed with the **-c** option (see above).
 
   - Truncated UTF-8 sequence at EOF.
 
+  - UTF-16 surrogates (codepoints U+D800 to U+DFFF).
+
   - Codepoints above U+10FFFF, which are disallowed by RFC 3629.
 
+  - Overlong encodings (e.g. codepoints U+0000 to U+007F encoded
+    as 2 or more bytes).
+
+  Each occurrence of any of the above will increment the "Bad
+  Sequences" count, if the **-i** option is used.
+
 TERMINAL SUPPORT
 ================
 
--
cgit v1.2.3
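
The categories of invalid sequence listed in the patched text can be illustrated with a small classifier. This is only a sketch in Python, not uxd's actual code: the function name `classify`, the table `MIN_CODEPOINT`, and the returned labels are all hypothetical, chosen to mirror the bullet points in uxd.rst. It inspects the single UTF-8 sequence at the start of a byte string.

```python
# Hypothetical sketch; none of these names come from uxd itself.
# Smallest codepoint that legitimately needs N continuation bytes.
# Anything smaller decoded from that length is an overlong encoding.
MIN_CODEPOINT = {1: 0x80, 2: 0x800, 3: 0x10000}


def classify(seq: bytes) -> str:
    """Classify the UTF-8 sequence that starts at seq[0]."""
    if not seq:
        return "truncated"
    lead = seq[0]
    if lead < 0x80:
        return "ok"  # plain ASCII, always valid
    if 0xC0 <= lead <= 0xDF:
        need, cp = 1, lead & 0x1F
    elif 0xE0 <= lead <= 0xEF:
        need, cp = 2, lead & 0x0F
    elif 0xF0 <= lead <= 0xF7:
        need, cp = 3, lead & 0x07
    else:
        # 0x80-0xBF is a stray continuation byte; 0xF8-0xFF is never valid.
        return "bad prefix"
    for i in range(1, need + 1):
        if i >= len(seq):
            return "truncated"  # input ended mid-sequence
        byte = seq[i]
        if byte & 0xC0 != 0x80:  # high 2 bits must be 10
            return "bad continuation"
        cp = (cp << 6) | (byte & 0x3F)
    if cp < MIN_CODEPOINT[need]:
        return "overlong"
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    if cp > 0x10FFFF:
        return "above U+10FFFF"
    return "ok"


if __name__ == "__main__":
    samples = [b"A", b"\xc0\x80", b"\xed\xa0\x80", b"\xf4\x90\x80\x80", b"\xe2\x82"]
    for sample in samples:
        print(sample.hex(" "), classify(sample))
```

Under the documented behavior, every result other than "ok" would correspond to a sequence shown in red and counted as a bad sequence when **-i** is given. The overlong check runs before the surrogate and range checks, so a padded encoding of a small codepoint is reported as overlong rather than by its decoded value.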