initial commit

author: B. Watson <urchlay@slackware.uk> 2024-12-12 06:21:05 -0500
committer: B. Watson <urchlay@slackware.uk> 2024-12-12 06:21:05 -0500
commit: 4df7bb4d762ff945fb7a823cb4c153cab7e3c273 (patch)
tree: bd58ead44eb3ff2c3e0d2935144bbe663d845b1a /uxd.rst
download: uxd-4df7bb4d762ff945fb7a823cb4c153cab7e3c273.tar.gz
1 files changed, 189 insertions, 0 deletions
diff --git a/uxd.rst b/uxd.rst
new file mode 100644
index 0000000..e5a8fff
--- /dev/null
+++ b/uxd.rst
@@ -0,0 +1,189 @@
+.. |version| replace:: 0.0.1
+.. |date| date::
+
+===
+uxd
+===
+
+----------------
+UTF-8 hex dumper
+----------------
+
+:Manual section: 1
+:Manual group: Urchlay's Utilities
+:Date: |date|
+:Version: |version|
+
+SYNOPSIS
+========
+
+uxd [*file* | *-*]
+
+DESCRIPTION
+===========
+
+**uxd** is a hex dump utility that's aware of UTF-8 multibyte sequence
+semantics.
+
+Input is read from *file*, or standard input if *file* is missing or
+given as **-**. The input is treated as UTF-8 encoded Unicode. Since
+ASCII is a subset, **uxd** works fine on plain ASCII files too. Other
+encodings such as UTF-16, ISO-8859-\*, Shift-JIS, etc., can be used, but
+**uxd** won't handle these any better than a regular hex-dump utility
+such as **xxd**.
+
+Output is written to standard output, which is normally a
+terminal. It's assumed that the terminal supports ANSI-style color and
+UTF-8. See **TERMINAL SUPPORT** below.
+
+Each line of output consists of eighteen columns: the offset from the
+start of the file (in hex; minimum 4 digits), 16 bytes of hex
+data (or empty cells, if the last line of the dump is for fewer than
+16 bytes), and the human-readable form of the same data.
+
+The hex bytes and human-readable data are colorized to make it obvious
+which bytes make up each character. Since UTF-8 is a variable-width
+encoding, a single character may occupy anywhere from 1 to
+4 bytes.
+
+OPTIONS
+=======
+
+There are no options yet.
+
+EXAMPLE
+=======
+
+It's hard to give a proper example, since man pages don't support
+color. You'll have to use your imagination. Also, this section of
+the man page requires your man command to support UTF-8 embedded in
+the man page. If the example looks mangled, try viewing the source
+(uxd.rst) in a text editor.
+
+Non-ASCII text example::
+
+   $ echo ¥ǥ£¥ | uxd
+   0000: c2 a5 c7 a5 c2 a3 c2 a5  0a                       ¥ǥ£¥↵
+         GG GG YY YY GG GG YY YY  PP                       GYGYP
+
+The colors are indicated by G/Y/P, for green, yellow, and purple. The
+character above each letter is displayed in that color.
+
+From the colorization, it's obvious that "c2 a5" is the hex
+representation of the first ¥ character, and that the ǥ is
+represented by "c7 a5".
+
+The newline is displayed in purple because it's not a regular
+printable character. Its human-readable representation is ↵. Note
+that if a literal ↵ character (U+21B5) appears in the input, it'll be
+rendered in either green or yellow (as a regular character).
+
+COLORS
+======
+
+**green**, **yellow**
+  Printable characters (except the space, U+0020) alternate between green and yellow.
+
+**purple**
+  Spaces and unprintable characters ("control" characters, newlines, tabs, etc).
+  These are printed as "visible" characters, e.g. ␣ for the space, ↵ for a newline.
+  This is an improvement over standard hex dumpers, which usually print these as
+  periods.
+
+**red**
+  Invalid UTF-8 sequences. These are rendered with a red foreground, to make them
+  stand out. Examples of invalid sequences:
+
+    - Prefix bytes (>= 0xc0) which are not followed by the correct number of continuation
+      bytes (with their high 2 bits set to **10**).
+
+    - Continuation bytes that aren't preceded by a valid prefix byte.
+
+    - Truncated UTF-8 sequence at EOF.
+
+TERMINAL SUPPORT
+================
+
+**uxd** should work with any modern terminal that supports color,
+ANSI-style escape sequences, Unicode, and UTF-8 rendering.
+
+The author's testing is done primarily with **urxvt**\(1).  Other
+terminals aren't tested as often.
+
+Known to work: urxvt, xterm, st, xfce4-terminal, gnome-terminal, the Linux console (but
+see **FONTS**, below).
+
+Known **not** to work: rxvt (doesn't support Unicode at all).
+
+FONTS
+=====
+
+For the human-readable column to display correctly, you'll need a font
+with lots of glyphs. Try *DejaVu Sans Mono*, *Symbola*, or *Quivira*.
+If you use urxvt, it searches for glyphs in multiple fonts, so you can
+use all of the above at once.
+
+The Linux console is capable of rendering UTF-8, but it's incapable
+of displaying more than 512 glyphs. Most console fonts only define
+256, since using more than 256 means the console won't be able to
+do bold. Expect to see lots of solid or dotted boxes. This isn't
+specifically a problem with **uxd**.
+
+FILES
+=====
+
+**uxd** doesn't read any files other than the input file, and doesn't write to
+any files other than standard output. There's no config file.
+
+ENVIRONMENT
+===========
+
+**uxd** doesn't read anything from the environment. It's *not* necessary to
+have a UTF-8 locale set in e.g. **LANG** or **LC_ALL**. Also, the **TERM**
+variable is not used.
+
+EXIT STATUS
+===========
+
+Zero for success, non-zero for failure.
+
+Failure status should only be returned if **uxd** failed to open the
+input file. Invalid input (non-UTF-8) doesn't count as an error;
+it'll just have lots of red in the output.
+
+BUGS
+====
+
+**uxd** doesn't check for overlong UTF-8 encodings (e.g. a character
+that could be a 1-byte sequence, but is encoded as 2 or more).
+Sequences like this really should be colorized in red. Technically,
+this means **uxd** accepts a superset of UTF-8, not strict UTF-8.
+
+RFC 3629 doesn't allow UTF-8 to use codepoints above U+10FFFF. 4-byte
+sequences can support codepoints U+110000 to U+1FFFFF, which are not
+valid Unicode. If these occur in the input, **uxd** should colorize
+them in red, but it doesn't (yet).
+
+There should be options and/or a config file to change the colors,
+rather than baking them into the binary.
+
+Combining characters are not handled well. Or at all, really: the 2
+characters being combined will have an ANSI color code in between.
+urxvt at least ignores the color code, so the composite character
+displays in the color of the first (non-combining) character. I'm not
+sure what a better solution would be...
+
+COPYRIGHT
+=========
+
+Licensed under the WTFPL. See http://www.wtfpl.net/txt/copying/ for details.
+
+AUTHORS
+=======
+
+B. Watson <urchlay@slackware.uk>.
+
+SEE ALSO
+========
+
+xxd(1), bvi(1), utf-8(7), unicode(7)
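
Note on the rules documented in the man page above: the following is a minimal Python sketch of how the COLORS rules and the two missing checks listed under BUGS could work. It is not taken from uxd's source, which isn't part of this commit. The function names (scan_utf8, strict_problem, color_letters), the letter R for red, and the choice that purple characters don't advance the green/yellow alternation are all assumptions made for illustration.

```python
def scan_utf8(data: bytes):
    """Split data into (chunk, status) pairs; status is "ok" or "invalid".

    Lax on purpose: like the behavior described under BUGS, it does not
    reject overlong encodings or code points above U+10FFFF.
    """
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            need = 0                      # plain ASCII
        elif 0xC0 <= b <= 0xDF:
            need = 1                      # 110xxxxx: one continuation byte
        elif 0xE0 <= b <= 0xEF:
            need = 2                      # 1110xxxx: two continuation bytes
        elif 0xF0 <= b <= 0xF7:
            need = 3                      # 11110xxx: three continuation bytes
        else:
            # 0x80..0xBF is a stray continuation byte; 0xF8..0xFF is never valid.
            out.append((data[i:i + 1], "invalid"))
            i += 1
            continue
        end = i + 1
        # Consume continuation bytes (high 2 bits == 10), at most `need` of them.
        while end < len(data) and end - i <= need and (data[end] & 0xC0) == 0x80:
            end += 1
        # Too few continuation bytes covers both a bad follower and EOF truncation.
        status = "ok" if end - i - 1 == need else "invalid"
        out.append((data[i:end], status))
        i = end
    return out


def strict_problem(chunk: bytes):
    """For a structurally complete sequence, name the stricter check it fails."""
    if len(chunk) == 1:
        return None
    cp = chunk[0] & (0x7F >> len(chunk))  # payload bits of the prefix byte
    for b in chunk[1:]:
        cp = (cp << 6) | (b & 0x3F)       # 6 payload bits per continuation byte
    smallest = {2: 0x80, 3: 0x800, 4: 0x10000}[len(chunk)]
    if cp < smallest:
        return "overlong"
    if cp > 0x10FFFF:
        return "out of range"
    return None


def color_letters(data: bytes) -> str:
    """One letter per character: G/Y alternate, P for space/control, R for invalid."""
    letters = []
    toggle = 0
    for chunk, status in scan_utf8(data):
        if status == "invalid":
            letters.append("R")
        elif len(chunk) == 1 and (chunk[0] <= 0x20 or chunk[0] == 0x7F):
            letters.append("P")
        else:
            letters.append("GY"[toggle])
            toggle ^= 1                   # assumption: only printables toggle
    return "".join(letters)
```

With the example input from the man page, color_letters("¥ǥ£¥\n".encode("utf-8")) gives GYGYP, matching the color row of the dump shown there. strict_problem(b"\xc0\xaf") reports "overlong" and strict_problem(b"\xf4\x90\x80\x80") reports "out of range", the two cases BUGS says should eventually be colored red.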