From d0b8532b703ef515b89eb8f34c0402262f3d3f7e Mon Sep 17 00:00:00 2001
From: "B. Watson"
Date: Wed, 18 Dec 2024 07:05:01 -0500
Subject: add -j/-p/-w options.

---
 uxd.rst | 40 +++++++++++++++++++++++++++++++++++++---
 1 file changed, 37 insertions(+), 3 deletions(-)

(limited to 'uxd.rst')

diff --git a/uxd.rst b/uxd.rst
index 459de77..2220174 100644
--- a/uxd.rst
+++ b/uxd.rst
@@ -98,6 +98,15 @@ as *K*, *M*, and *G* for power-of-10 based (e.g. *1K* is 1000 bytes).
 
 .. print number of bytes/chars/ascii/multibyte/bad sequences.
 
+-j
+  Java mode (aka MUTF-8). Identical to UTF-8 except it allows the
+  overlong **0xc0 0x80** encoding for codepoint U+0000 (aka NUL),
+  which normally would be considered an error.
+  This may be useful for looking at serialized data created by Java
+  programs.
+
+.. java (MUTF-8) mode: allow 0xc0 0x80 for U+0000.
+
 -l length
   Stop dumping after *length* bytes (not characters). If the limit is
   reached in the middle of a multibyte character, the entire character
@@ -126,6 +135,11 @@ as *K*, *M*, and *G* for power-of-10 based (e.g. *1K* is 1000 bytes).
 
 .. added to hex offsets (decimal, 0x hex, 0 octal).
 
+-p
+  Permissive mode. Turns off error highlighting for overlongs, codepoints
+  above **U+10FFFF**, and surrogates. Only malformed sequences will be
+  highlighted in red.
+
 -r
   Highlight multi-byte sequences in reverse video, in the hex output.
   Ignored if **-m** given.
@@ -171,6 +185,11 @@ as *K*, *M*, and *G* for power-of-10 based (e.g. *1K* is 1000 bytes).
 
 .. print version of uxd.
 
+-w
+  WTF-8 mode. Surrogates **U+D800** to **U+DFFF** will not be considered errors.
+
+.. WTF-8 mode (allow surrogates).
+
 OUTPUT FORMAT
 =============
 
@@ -340,12 +359,27 @@ Failure status will only be returned if **uxd** failed to open the
 input file. Invalid input (non-UTF-8) doesn't count as an error; it'll
 just have lots of red in the output.
 
+LIMITATIONS
+===========
+
+These are not bugs, because they're part of the design.
+
+Only UTF-8 and a couple of variants (WTF-8, MUTF-8) are supported.
+There is no support for UTF-16, UTF-32, UTF-EBCDIC, or any other
+non-UTF-8 encoding.
+
+There's no support for any number base except hex.
+
+The input is read one byte at a time, so a search or regex match
+option would be difficult or impossible to implement.
+
+Seeking backwards from the end of the file is impossible when reading
+from standard input. The only way to fake this would be to read the
+whole file into memory at startup, which **uxd** doesn't do.
+
 BUGS
 ====
 
-There should be options and/or a config file to change the colors,
-rather than baking them into the binary.
-
 Combining characters are not handled well. Or at all, really: the 2
 characters being combined will have an ANSI color code in between.
 urxvt at least ignores the color code, so the composite character
-- 
cgit v1.2.3
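
The three options documented in this patch differ only in which well-formed but normally invalid sequences they accept. The sketch below illustrates that distinction in Python. It is not taken from the uxd source; the function name `classify`, its keyword arguments, and the three result labels are hypothetical, chosen to mirror -j, -w, and -p as the man page text describes them.

```python
def classify(seq, java=False, wtf8=False, permissive=False):
    """Classify one candidate UTF-8 sequence (a bytes object).

    Returns "ok", "invalid" (well-formed but disallowed), or
    "malformed" (structurally broken). Hypothetical helper, not uxd code.
    """
    if not seq:
        return "malformed"
    lead = seq[0]
    # The lead byte determines the sequence length and the initial payload bits.
    if lead < 0x80:
        length, cp = 1, lead
    elif 0xC0 <= lead <= 0xDF:
        length, cp = 2, lead & 0x1F
    elif 0xE0 <= lead <= 0xEF:
        length, cp = 3, lead & 0x0F
    elif 0xF0 <= lead <= 0xF7:
        length, cp = 4, lead & 0x07
    else:
        return "malformed"  # stray continuation byte or impossible lead byte
    if len(seq) != length:
        return "malformed"  # truncated or over-long input slice
    for b in seq[1:]:
        if b & 0xC0 != 0x80:
            return "malformed"  # not a continuation byte
        cp = (cp << 6) | (b & 0x3F)

    # From here on the sequence is structurally well-formed.
    if java and seq == b"\xc0\x80":
        return "ok"  # -j: overlong NUL is accepted
    if permissive:
        return "ok"  # -p: only malformed sequences count as errors
    smallest = (0x0, 0x80, 0x800, 0x10000)[length - 1]
    if cp < smallest:
        return "invalid"  # overlong encoding
    if cp > 0x10FFFF:
        return "invalid"  # beyond the Unicode range
    if 0xD800 <= cp <= 0xDFFF and not wtf8:
        return "invalid"  # surrogate, accepted only with -w
    return "ok"
```

For example, `classify(b"\xc0\x80")` is rejected as an overlong under strict rules but accepted with `java=True`, while a lone `b"\x80"` stays malformed in every mode.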