author | B. Watson <urchlay@slackware.uk> | 2024-12-12 06:21:05 -0500 |
---|---|---|
committer | B. Watson <urchlay@slackware.uk> | 2024-12-12 06:21:05 -0500 |
commit | 4df7bb4d762ff945fb7a823cb4c153cab7e3c273 (patch) | |
tree | bd58ead44eb3ff2c3e0d2935144bbe663d845b1a /uxd.rst | |
download | uxd-4df7bb4d762ff945fb7a823cb4c153cab7e3c273.tar.gz |
initial commit
Diffstat (limited to 'uxd.rst')
-rw-r--r-- | uxd.rst | 189 |
1 files changed, 189 insertions, 0 deletions
@@ -0,0 +1,189 @@
+.. |version| replace:: 0.0.1
+.. |date| date::
+
+===
+uxd
+===
+
+----------------
+UTF-8 hex dumper
+----------------
+
+:Manual section: 1
+:Manual group: Urchlay's Utilities
+:Date: |date|
+:Version: |version|
+
+SYNOPSIS
+========
+
+uxd [*file* | *-*]
+
+DESCRIPTION
+===========
+
+**uxd** is a hex dump utility that's aware of UTF-8 multibyte sequence
+semantics.
+
+Input is read from *file*, or standard input if *file* is missing or
+given as **-**. The input is treated as UTF-8 encoded Unicode. Since
+ASCII is a subset, **uxd** works fine on plain ASCII files too. Other
+encodings such as UTF-16, ISO-8859-*, Shift-JIS, etc, can be used, but
+**uxd** won't handle these any better than a regular hex-dump utility
+such as **xxd**.
+
+Output is written to standard output, which is normally a
+terminal. It's assumed that the terminal supports ANSI-style color and
+UTF-8. See **TERMINAL SUPPORT** below.
+
+Each line of output consists of eighteen columns: the offset from the
+start of the file (in hex; minimum 4 digits), 16 bytes of hex
+data (or empty cells, if the last line of the dump is for fewer than
+16 bytes), and the human-readable form of the same data.
+
+The hex bytes and human-readable data are colorized to make it obvious
+which bytes make up each character. Since UTF-8 is a variable-width
+encoding, this means that one character may be composed of up to
+4 bytes.
+
+OPTIONS
+=======
+
+There are no options yet.
+
+EXAMPLE
+=======
+
+It's hard to give a proper example, since man pages don't support
+color. You'll have to use your imagination. Also, this section of
+the man page requires your man command to support UTF-8 embedded in
+the man page. If the examples looks mangled, try viewing the source
+(uxd.rst) in a text editor.
+
+Japanese text example::
+
+  $ echo ¥ǥ£¥ | uxd
+  0000: c2 a5 c7 a5 c2 a3 c2 a5 0a ¥ǥ£¥↵
+        GG GG YY YY GG GG YY YY PP GYGYP
+
+The colors are indicated by G/Y/P, for green, yellow, and purple. The
+character above each letter is displayed in that color.
+
+From the colorization, it's obvious that the "c2 a5" is the hex
+representation of the first ¥ character, and that the ǥ is
+represented by "c7 a5".
+
+The newline is displayed in purple because it's not a regular
+printable character. Its human-readable representation is ↵. Note
+that if a regular ↵ character appears in the input, it'll be
+rendered in either green or yellow (as a regular character).
+
+COLORS
+======
+
+**green**, **yellow**
+  Printable characters (except the space, U+0020) alternate between green and yellow.
+
+**purple**
+  Spaces and unprintable characters ("control" characters, newlines, tabs, etc).
+  These are printed as "visible" characters, e.g. ␣ for the space, ↵ for a newline.
+  This is an improvement over the usual practice of printing these as periods, like
+  standard hex dumpers do.
+
+**red**
+  Invalid UTF-8 sequences. These are rendered with a red foreground, to make them
+  stand out. Examples of invalid sequences:
+
+  - Prefix bytes (>= 0x80) which are not followed by the correct number of continuation
+    bytes (with their high 2 bits set to **10**).
+
+  - Continuation bytes that aren't preceded by a valid prefix byte.
+
+  - Truncated UTF-8 sequence at EOF.
+
+TERMINAL SUPPORT
+================
+
+**uxd** should work with any modern terminal that supports color,
+ANSI-style escape sequences, Unicode, and UTF-8 rendering.
+
+The author's testing is done primarily with **urxvt**\(1). Other
+terminals aren't tested as often.
+
+Known to work: urxvt, xterm, st, xfce4-terminal, gnome-terminal, the Linux console (but
+see **FONTS**, below).
+
+Known **not** to work: rxvt (doesn't support Unicode at all).
+
+FONTS
+=====
+
+For the human-readable column to display correctly, you'll need a font
+with lots of glyphs. Try *Deja Vu Sans Mono*, *Symbola*, *Quivira*.
+If you use urxvt, it searches for glyphs in multiple fonts, so you can
+use all of the above at once.
+
+The Linux console is capable of rendering UTF-8, but it's incapable
+of displaying more than 512 glyphs. Most console fonts only define
+256, since using more than 256 means the console won't be able to
+do bold. Expect to see lots of solid or dotted boxes. This isn't
+specifically a problem with **uxd**.
+
+FILES
+=====
+
+**uxd** doesn't read any files other than the input file, and doesn't write to
+any files other than standard output. There's no config file.
+
+ENVIRONMENT
+===========
+
+**uxd** doesn't read anything from the environment. It's *not* necessary to
+have a UTF-8 locale set in e.g. **LANG** or **LC_ALL**. Also, the **TERM**
+variable is not used.
+
+EXIT STATUS
+===========
+
+Zero for success, non-zero for failure.
+
+Failure status should only be returned if **uxd** failed to open the
+input file. Invalid input (non-UTF-8) doesn't count as an error;
+it'll just have lots of red in the output.
+
+BUGS
+====
+
+**uxd** doesn't check for overlong UTF-8 encodings (e.g. a character
+that could be a 1-byte sequence, but is encoded as 2 or more).
+Sequences like this really should be colorized in red. Technically,
+this means **uxd** supports WTF-8, not UTF-8.
+
+RFC 3629 doesn't allow UTF-8 to use codepoints above U+10FFFF. 4-byte
+sequences can support codepoints U+110000 to U+1FFFFF, which are not
+valid Unicode. If these occur in the input, **uxd** should colorize
+them in red, but it doesn't (yet).
+
+There should be options and/or a config file to change the colors,
+rather than baking them into the binary.
+
+Combining characters are not handled well. Or at all, really: the 2
+characters being combined will have an ANSI color code in between.
+urxvt at least ignores the color code, so the composite character
+displays in the color of the first (non-combining) character. I'm not
+sure what a better solution would be...
+
+COPYRIGHT
+=========
+
+Licensed under the WTFPL. See http://www.wtfpl.net/txt/copying/ for details.
+
+AUTHORS
+=======
+
+B. Watson <urchlay@slackware.uk>.
+
+SEE ALSO
+========
+
+xxd(1), bvi(1), utf-8(7), unicode(7)
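
The classification rules in the COLORS section of the added man page can be sketched in Python. This is an illustration only, not code from the uxd source. The function names (`expected_length`, `split_sequences`, `colorize`) are hypothetical. Two behaviors are assumptions: purple bytes do not advance the green/yellow alternation, and only ASCII control characters, space, and DEL count as unprintable. Like the behavior described under BUGS, the sketch does not reject overlong encodings or codepoints above U+10FFFF.

```python
def expected_length(lead: int) -> int:
    """Sequence length implied by a lead byte, or 0 if it cannot start one."""
    if lead < 0x80:
        return 1                      # plain ASCII
    if 0xC0 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF7:
        return 4
    return 0                          # stray continuation byte, or 0xF8-0xFF


def split_sequences(data: bytes):
    """Yield (chunk, valid) pairs covering every byte of data in order."""
    i = 0
    while i < len(data):
        length = expected_length(data[i])
        chunk = data[i:i + length]
        # Valid only if the lead byte is legal, the sequence is not truncated,
        # and every following byte has its high two bits set to 10.
        if (length > 0 and len(chunk) == length
                and all(b & 0xC0 == 0x80 for b in chunk[1:])):
            yield chunk, True
            i += length
        else:
            # Flag just the offending byte and resynchronize on the next one.
            yield data[i:i + 1], False
            i += 1


def colorize(data: bytes):
    """Return a list of (hex string, color name) pairs, one per character."""
    result = []
    toggle = 0
    for chunk, valid in split_sequences(data):
        if not valid:
            color = "red"
        elif len(chunk) == 1 and (chunk[0] <= 0x20 or chunk[0] == 0x7F):
            color = "purple"          # space and ASCII controls (assumption)
        else:
            color = ("green", "yellow")[toggle]
            toggle ^= 1               # alternate only on printable characters
        result.append((" ".join(f"{b:02x}" for b in chunk), color))
    return result
```

Run on the bytes from the man page's EXAMPLE section, `colorize("¥ǥ£¥\n".encode("utf-8"))` yields green, yellow, green, yellow, purple, which matches the G/Y/P row shown there.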