.. |version| replace:: 0.0.1
.. |date| date::

===
uxd
===

----------------
UTF-8 hex dumper
----------------

:Manual section: 1
:Manual group: Urchlay's Utilities
:Date: |date|
:Version: |version|

SYNOPSIS
========

uxd [*file* | *-*]

DESCRIPTION
===========

**uxd** is a hex dump utility that's aware of UTF-8 multibyte sequence
semantics, and uses colorized output to indicate which byte sequences
go with which human-readable characters.

Input is read from *file*, or standard input if *file* is missing or
given as **-**.

The input is treated as UTF-8 encoded Unicode. Since ASCII is a subset
of UTF-8, **uxd** works fine on plain ASCII files too. Other encodings
such as UTF-16, ISO-8859-\*, Shift-JIS, etc., can be used as input, but
**uxd** won't handle these any better than a regular hex-dump utility
such as **xxd**.

Output is written to standard output, which is normally a terminal.
It's assumed that the terminal supports ANSI-style color and UTF-8.
See **TERMINAL SUPPORT** below.

Each line of output consists of eighteen columns: the offset from the
start of the file (in hex; minimum 4 digits), 16 bytes of hex data (or
empty cells, if the last line of the dump is for fewer than 16 bytes),
and the human-readable form of the same data.

The hex bytes and human-readable data are colorized to make it obvious
which bytes make up each character. Since UTF-8 is a variable-width
encoding, one character may be composed of anywhere from 1 to 4 bytes.

OPTIONS
=======

There are no options yet.

EXAMPLE
=======

It's hard to give a proper example, since man pages don't support
color. You'll have to use your imagination. Also, this section of the
man page requires your man command to support UTF-8 embedded in the
man page. If the examples look mangled, try viewing the source
(uxd.rst) in a text editor.

Japanese text example::

  $ echo ¥ǥ£¥ | uxd
  0000: c2 a5 c7 a5 c2 a3 c2 a5 0a  ¥ǥ£¥↵
        GG GG YY YY GG GG YY YY PP  GYGYP

The colors are indicated by G/Y/P, for green, yellow, and purple.
The character above each letter is displayed in that color. From the
colorization, it's obvious that "c2 a5" is the hex representation of
the first ¥ character, and that the ǥ is represented by "c7 a5".

The newline is displayed in purple because it's not a regular
printable character. Its human-readable representation is ↵. Note that
if a regular ↵ character appears in the input, it'll be rendered in
either green or yellow (as a regular character).

COLORS
======

**green**, **yellow**
  Printable characters (except the space, U+0020) alternate between
  green and yellow.

**purple**
  Spaces and unprintable characters ("control" characters, newlines,
  tabs, etc). These are printed as "visible" characters, e.g. ␣ for
  the space, ↵ for a newline. This is an improvement over the usual
  practice in standard hex dumpers, which print all of these as
  periods.

**red**
  Invalid UTF-8 sequences. These are rendered with a red foreground,
  to make them stand out. Examples of invalid sequences:

  - Prefix bytes (>= 0xc0) which are not followed by the correct
    number of continuation bytes (with their high 2 bits set to
    **10**).

  - Continuation bytes that aren't preceded by a valid prefix byte.

  - Truncated UTF-8 sequence at EOF.

TERMINAL SUPPORT
================

**uxd** should work with any modern terminal that supports color,
ANSI-style escape sequences, Unicode, and UTF-8 rendering.

The author's testing is done primarily with **urxvt**\(1). Other
terminals aren't tested as often.

Known to work: urxvt, xterm, st, xfce4-terminal, gnome-terminal, the
Linux console (but see **FONTS**, below).

Known **not** to work: rxvt (doesn't support Unicode at all).

FONTS
=====

For the human-readable column to display correctly, you'll need a font
with lots of glyphs. Try *Deja Vu Sans Mono*, *Symbola*, *Quivira*.
If you use urxvt, it searches for glyphs in multiple fonts, so you can
use all of the above at once.

The Linux console is capable of rendering UTF-8, but it's incapable of
displaying more than 512 glyphs. Most console fonts only define 256,
since using more than 256 means the console won't be able to do bold.
Expect to see lots of solid or dotted boxes. This isn't specifically a
problem with **uxd**.

FILES
=====

**uxd** doesn't read any files other than the input file, and doesn't
write to any files other than standard output. There's no config file.

ENVIRONMENT
===========

**uxd** doesn't read anything from the environment. It's *not*
necessary to have a UTF-8 locale set in e.g. **LANG** or **LC_ALL**.
Also, the **TERM** variable is not used.

EXIT STATUS
===========

Zero for success, non-zero for failure. Failure status should only be
returned if **uxd** failed to open the input file. Invalid input
(non-UTF-8) doesn't count as an error; it'll just have lots of red in
the output.

BUGS
====

**uxd** doesn't check for overlong UTF-8 encodings (e.g. a character
that could be a 1-byte sequence, but is encoded as 2 or more).
Sequences like this really should be colorized in red. Technically,
this means **uxd** accepts a superset of UTF-8, not strict UTF-8.

RFC 3629 doesn't allow UTF-8 to use codepoints above U+10FFFF. 4-byte
sequences can encode codepoints U+110000 to U+1FFFFF, which are not
valid Unicode. If these occur in the input, **uxd** should colorize
them in red, but it doesn't (yet).

There should be options and/or a config file to change the colors,
rather than baking them into the binary.

Combining characters are not handled well. Or at all, really: the 2
characters being combined will have an ANSI color code in between.
urxvt at least ignores the color code, so the composite character
displays in the color of the first (non-combining) character. I'm not
sure what a better solution would be...

COPYRIGHT
=========

Licensed under the WTFPL. See http://www.wtfpl.net/txt/copying/ for
details.

AUTHORS
=======

B. Watson.

SEE ALSO
========

xxd(1), bvi(1), utf-8(7), unicode(7)
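
NOTES
=====

The two missing checks described under **BUGS** (overlong encodings
and codepoints above U+10FFFF) could be sketched roughly as follows.
This is only an illustration in Python, not code from **uxd** itself:
the names ``decode_sequence`` and ``check_sequence`` are made up for
this example, and the sketch assumes each sequence has already passed
the structural checks listed under **COLORS** (correct prefix byte,
correct number of continuation bytes).

```python
# Hypothetical sketch; none of these names come from uxd.

# Smallest code point that legitimately needs each sequence length.
MIN_CODEPOINT = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}
MAX_CODEPOINT = 0x10FFFF  # upper limit from RFC 3629


def decode_sequence(seq: bytes) -> int:
    """Decode one structurally well-formed UTF-8 sequence to a code point.

    Assumes the prefix byte and continuation bytes already have the
    right bit patterns; only the payload bits are extracted here.
    """
    if len(seq) == 1:
        return seq[0]
    # The prefix byte of an n-byte sequence carries (7 - n) payload bits.
    codepoint = seq[0] & (0x7F >> len(seq))
    for byte in seq[1:]:
        # Each continuation byte contributes its low 6 bits.
        codepoint = (codepoint << 6) | (byte & 0x3F)
    return codepoint


def check_sequence(seq: bytes) -> str:
    """Return 'ok', 'overlong', or 'out of range' for one sequence."""
    codepoint = decode_sequence(seq)
    if codepoint < MIN_CODEPOINT[len(seq)]:
        return "overlong"
    if codepoint > MAX_CODEPOINT:
        return "out of range"
    return "ok"
```

For instance, ``check_sequence(b"\xc2\xa5")`` (the ¥ from **EXAMPLE**)
gives ``"ok"``, ``check_sequence(b"\xc0\xaf")`` (a 2-byte encoding of
"/") gives ``"overlong"``, and ``check_sequence(b"\xf4\x90\x80\x80")``
(U+110000) gives ``"out of range"``. Anything other than ``"ok"`` would
be a candidate for red.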