5 files changed, 721 insertions, 0 deletions
diff --git a/Makefile b/Makefile
new file mode 100644
index 0000000..a9ee0e3
--- /dev/null
+++ b/Makefile
@@ -0,0 +1,12 @@
+CFLAGS=-O2 -fPIC -Wall
+
+all: uxd man
+
+test: uxd
+	./uxd
+
+man: uxd.rst
+	rst2man uxd.rst > uxd.1
+
+clean:
+	rm -f uxd
diff --git a/README b/README
new file mode 100644
index 0000000..dead264
--- /dev/null
+++ b/README
@@ -0,0 +1,51 @@
+uxd (Unicode-aware Hex Dumper)
+
+Hex dump utility that uses color to indicate multi-byte UTF-8
+sequences.
+
+As usual for hex dumps, output is columnar. The rightmost column
+(which would be ASCII in a regular hex dump) shows one Unicode
+character for each UTF-8 sequence in the dump.
+
+Unicode sequences in the hex column are color-coded to match their
+character in the right column. Colors alternate between a set of 4,
+to help keep track of which character goes with which byte sequence.
+
+Sample output:
+
+00000000: 41 e2 98 af e2 98 ae c2 bf c3 a1 e2 88 9e 42 0a  A☯☮¿á∞B↵
+[colors]  1  2        3        4     1        2     3  5   12341235
+
+;   0 black (don't use)
+5 = 1 red
+1 = 2 green
+4 = 3 yellow
+;   4 blue (don't use)
+2 = 5 purple
+3 = 6 cyan
+;   7 white (don't use)
+
+Colors 1 to 4 are used for successive Unicode characters. For
+instance, color 3 is used for the ☮ character, and also for its hex
+representation "e2 98 ae" in the dump. Note that the "A" and "B" are
+in the ASCII subset of Unicode, and are treated as one-byte sequences.
+If there's a BOM, it'll be in reverse video color 1 (green), and the
+printable form of it will likely be "BOM".
+
+Color 5 is for unprintable characters, with Unicode codepoints below
+0x20 (aka "control characters"), plus a few others like 0x7f (delete).
+↵ is used for newlines... note that an actual ↵ character will
+also be displayed as ↵, but in one of the 4 alternating colors.
+
+Not shown in the dump: byte sequences that have the high bit(s) set,
+but are not valid UTF-8, will be shown in color 5 (red), but in
+reverse video.
+
+Usage: uxd [options] [<filename> ...]
+
+Options should be based on xxd(1) options, though not all of them will
+be supported. If uxd-specific options exist, they should ideally use
+letters that xxd doesn't, to avoid confusion.
+
+Ideas:
+support other encodings for Unicode, like UTF-16?
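Reviewer sketch of the README's color scheme, assuming the n-th printable character (counting from 0) simply cycles through README colors 1 to 4. The names `readme_color_for_char` and `ansi_digit` are hypothetical and do not appear in uxd.c, which currently alternates between only two colors.

```c
#include <assert.h>

/* README color numbers 1..5 mapped to ANSI color digits, following the
   table in the README. Index 0 is unused. */
static const int ansi_for_readme_color[] = { 0, 2, 5, 6, 3, 1 };

/* Hypothetical helper: README color (1 to 4) for the n-th printable
   character, counting from 0. */
int readme_color_for_char(unsigned n) {
	return (int)(n % 4) + 1;
}

/* Hypothetical helper: ANSI foreground digit, i.e. the N in ESC[3Nm. */
int ansi_digit(int readme_color) {
	assert(readme_color >= 1 && readme_color <= 5);
	return ansi_for_readme_color[readme_color];
}
```

In the README sample, ☮ is the third printable character (n = 2), so it gets README color 3, which the table maps to ANSI color 6 (cyan).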
diff --git a/uxd.1 b/uxd.1
new file mode 100644
index 0000000..87886b3
--- /dev/null
+++ b/uxd.1
@@ -0,0 +1,198 @@
+.\" Man page generated from reStructuredText.
+.
+.
+.nr rst2man-indent-level 0
+.
+.de1 rstReportMargin
+\\$1 \\n[an-margin]
+level \\n[rst2man-indent-level]
+level margin: \\n[rst2man-indent\\n[rst2man-indent-level]]
+-
+\\n[rst2man-indent0]
+\\n[rst2man-indent1]
+\\n[rst2man-indent2]
+..
+.de1 INDENT
+.\" .rstReportMargin pre:
+. RS \\$1
+. nr rst2man-indent\\n[rst2man-indent-level] \\n[an-margin]
+. nr rst2man-indent-level +1
+.\" .rstReportMargin post:
+..
+.de UNINDENT
+. RE
+.\" indent \\n[an-margin]
+.\" old: \\n[rst2man-indent\\n[rst2man-indent-level]]
+.nr rst2man-indent-level -1
+.\" new: \\n[rst2man-indent\\n[rst2man-indent-level]]
+.in \\n[rst2man-indent\\n[rst2man-indent-level]]u
+..
+.TH "UXD" 1 "2024-12-12" "0.0.1" "Urchlay's Utilities"
+.SH NAME
+uxd \- UTF-8 hex dumper
+.SH SYNOPSIS
+.sp
+uxd [\fIfile\fP | \fI\-\fP]
+.SH DESCRIPTION
+.sp
+\fBuxd\fP is a hex dump utility that\(aqs aware of UTF\-8 multibyte sequence
+semantics.
+.sp
+Input is read from \fIfile\fP, or standard input if \fIfile\fP is missing or
+given as \fB\-\fP\&. The input is treated as UTF\-8 encoded Unicode. Since
+ASCII is a subset, \fBuxd\fP works fine on plain ASCII files too. Other
+encodings such as UTF\-16, ISO\-8859\-*, Shift\-JIS, etc, can be used, but
+\fBuxd\fP won\(aqt handle these any better than a regular hex\-dump utility
+such as \fBxxd\fP\&.
+.sp
+Output is written to standard output, which is normally a
+terminal. It\(aqs assumed that the terminal supports ANSI\-style color and
+UTF\-8. See \fBTERMINAL SUPPORT\fP below.
+.sp
+Each line of output consists of eighteen columns: the offset from the
+start of the file (in hex; minimum 4 digits), 16 bytes of hex
+data (or empty cells, if the last line of the dump is for fewer than
+16 bytes), and the human\-readable form of the same data.
+.sp
+The hex bytes and human\-readable data are colorized to make it obvious
+which bytes make up each character. Since UTF\-8 is a variable\-width
+encoding, this means that one character may be composed of up to
+4 bytes.
+.SH OPTIONS
+.sp
+There are no options yet.
+.SH EXAMPLE
+.sp
+It\(aqs hard to give a proper example, since man pages don\(aqt support
+color. You\(aqll have to use your imagination. Also, this section of
+the man page requires your man command to support UTF\-8 embedded in
+the man page. If the example looks mangled, try viewing the source
+(uxd.rst) in a text editor.
+.sp
+Non\-ASCII characters:
+.INDENT 0.0
+.INDENT 3.5
+.sp
+.nf
+.ft C
+$ echo ¥ǥ£¥ | uxd
+0000: c2 a5 c7 a5 c2 a3 c2 a5  0a                       ¥ǥ£¥↵
+      GG GG YY YY GG GG YY YY  PP                       GYGYP
+.ft P
+.fi
+.UNINDENT
+.UNINDENT
+.sp
+The colors are indicated by G/Y/P, for green, yellow, and purple. The
+character above each letter is displayed in that color.
+.sp
+From the colorization, it\(aqs obvious that the "c2 a5" is the hex
+representation of the first ¥ character, and that the ǥ is
+represented by "c7 a5".
+.sp
+The newline is displayed in purple because it\(aqs not a regular
+printable character. Its human\-readable representation is ↵. Note
+that if a regular ↵ character appears in the input, it\(aqll be
+rendered in either green or yellow (as a regular character).
+.SH COLORS
+.INDENT 0.0
+.TP
+.B \fBgreen\fP, \fByellow\fP
+Printable characters (except the space, U+0020) alternate between green and yellow.
+.TP
+.B \fBpurple\fP
+Spaces and unprintable characters ("control" characters, newlines, tabs, etc).
+These are printed as "visible" characters, e.g. ␣ for the space, ↵ for a newline.
+This is an improvement over the usual practice of printing these as periods, like
+standard hex dumpers do.
+.TP
+.B \fBred\fP
+Invalid UTF\-8 sequences. These are rendered on a red background, to make them
+stand out. Examples of invalid sequences:
+.INDENT 7.0
+.INDENT 3.5
+.INDENT 0.0
+.IP \(bu 2
+Prefix bytes (>= 0xc0) which are not followed by the correct number of continuation
+bytes (with their high 2 bits set to \fB10\fP).
+.IP \(bu 2
+Continuation bytes that aren\(aqt preceded by a valid prefix byte.
+.IP \(bu 2
+Truncated UTF\-8 sequence at EOF.
+.UNINDENT
+.UNINDENT
+.UNINDENT
+.UNINDENT
+.SH TERMINAL SUPPORT
+.sp
+\fBuxd\fP should work with any modern terminal that supports color,
+ANSI\-style escape sequences, Unicode, and UTF\-8 rendering.
+.sp
+The author\(aqs testing is done primarily with \fBurxvt\fP(1).  Other
+terminals aren\(aqt tested as often.
+.sp
+Known to work: urxvt, xterm, st, xfce4\-terminal, gnome\-terminal, the Linux console (but
+see \fBFONTS\fP, below).
+.sp
+Known \fBnot\fP to work: rxvt (doesn\(aqt support Unicode at all).
+.SH FONTS
+.sp
+For the human\-readable column to display correctly, you\(aqll need a font
+with lots of glyphs. Try \fIDeja Vu Sans Mono\fP, \fISymbola\fP, \fIQuivira\fP\&.
+If you use urxvt, it searches for glyphs in multiple fonts, so you can
+use all of the above at once.
+.sp
+The Linux console is capable of rendering UTF\-8, but it\(aqs incapable
+of displaying more than 512 glyphs. Most console fonts only define
+256, since using more than 256 means the console won\(aqt be able to
+do bold. Expect to see lots of solid or dotted boxes. This isn\(aqt
+specifically a problem with \fBuxd\fP\&.
+.SH FILES
+.sp
+\fBuxd\fP doesn\(aqt read any files other than the input file, and doesn\(aqt write to
+any files other than standard output. There\(aqs no config file.
+.SH ENVIRONMENT
+.sp
+\fBuxd\fP doesn\(aqt read anything from the environment. It\(aqs \fInot\fP necessary to
+have a UTF\-8 locale set in e.g. \fBLANG\fP or \fBLC_ALL\fP\&. Also, the \fBTERM\fP
+variable is not used.
+.SH EXIT STATUS
+.sp
+Zero for success, non\-zero for failure.
+.sp
+Failure status should only be returned if \fBuxd\fP failed to open the
+input file. Invalid input (non\-UTF\-8) doesn\(aqt count as an error;
+it\(aqll just have lots of red in the output.
+.SH BUGS
+.sp
+\fBuxd\fP doesn\(aqt check for overlong UTF\-8 encodings (e.g. a character
+that could be a 1\-byte sequence, but is encoded as 2 or more).
+Sequences like this really should be colorized in red. Technically,
+this means \fBuxd\fP supports WTF\-8, not UTF\-8.
+.sp
+RFC 3629 doesn\(aqt allow UTF\-8 to use codepoints above U+10FFFF. 4\-byte
+sequences can support codepoints U+110000 to U+1FFFFF, which are not
+valid Unicode. If these occur in the input, \fBuxd\fP should colorize
+them in red, but it doesn\(aqt (yet).
+.sp
+There should be options and/or a config file to change the colors,
+rather than baking them into the binary.
+.sp
+Combining characters are not handled well. Or at all, really: the 2
+characters being combined will have an ANSI color code in between.
+urxvt at least ignores the color code, so the composite character
+displays in the color of the first (non\-combining) character. I\(aqm not
+sure what a better solution would be...
+.SH COPYRIGHT
+.sp
+Licensed under the WTFPL. See \fI\%http://www.wtfpl.net/txt/copying/\fP for details.
+.SH AUTHORS
+.INDENT 0.0
+.IP B. 3
+Watson <\fI\%urchlay@slackware.uk\fP>.
+.UNINDENT
+.SH SEE ALSO
+.sp
+xxd(1), bvi(1), utf\-8(7), unicode(7)
+.\" Generated by docutils manpage writer.
+.
diff --git a/uxd.c b/uxd.c
new file mode 100644
index 0000000..00a2686
--- /dev/null
+++ b/uxd.c
@@ -0,0 +1,271 @@
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+/* output looks like:
+
+       0  1  2  3  4  5  6  7   8  9  A  B  C  D  E  F
+0000: 00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  abcdefghijklmnop
+
+...first column will extend to more digits if needed.
+*/
+
+/* UTF-8 spec summary, taken from Wikipedia and elsewhere, kept here
+   for locality of reference.
+
+Codepoints 0-0x7f encode as themselves, one byte each, bit 7 always 0.
+
+0x80 and up are encoded as multiple bytes. The first byte's bit 7 is
+always 1. The top bits determine the byte length of the sequence:
+
+110 - 2 bytes
+1110 - 3 bytes
+11110 - 4 bytes
+
+Continuation (2nd and further bytes) have 10 as the top 2 bits. If
+we get a continuation that's not after a sequence-starter, that's an
+error. If we get a sequence-starter, but the sequence doesn't have
+the correct number of continuation bytes (e.g. 110xxxxx followed by
+anything that isn't 10xxxxxx), that's an error too.
+
+BOM: if the file contains ef bb bf (aka U+FEFF), it should be colorized
+as a special (non-printable).
+If the file begins with ff fe, it's UTF-16 (little endian). If it's
+fe ff, it's UTF-16 big-endian. Probably we should detect these and
+print a warning on stderr.
+*/
+
+/* max UTF-8 sequence length, in bytes */
+#define MAXUTF8 4
+
+/* ANSI color */
+#define BLACK 0 /* don't use */
+#define RED 1
+#define GREEN 2
+#define YELLOW 3
+#define BLUE 4 /* don't use */
+#define PURPLE 5
+#define CYAN 6
+#define WHITE 7 /* don't use */
+
+#define SPECIAL PURPLE
+
+#define BAD_FG BLACK
+#define BAD_BG RED
+
+// const int normal_colors[] = { GREEN, PURPLE, CYAN };
+const int normal_colors[] = { GREEN, YELLOW };
+int cur_normal_color = 0;
+int dump_color;
+
+const char *self;
+FILE *input;
+
+/* these buffers are bigger than they need to be really. */
+char left_buf[4096];
+char right_buf[4096];
+
+#define MAX_DUMP_COLS 16
+int dump_column = 0;
+int filepos = 0;
+
+void usage(void) {
+	printf("Usage: %s <file>\n", self);
+	printf("  With no <file>, or with -, read standard input.\n");
+	exit(0);
+}
+
+void open_input(const int argc, const char *argv1) {
+	if(argc == 1) {
+		input = stdin;
+		return;
+	}
+
+	if(argv1[0] == '-' && argv1[1] != '\0') {
+		usage();
+	}
+
+	if(argc == 2) {
+		if(strcmp(argv1, "-") == 0)
+			input = stdin;
+		else {
+			input = fopen(argv1, "rb");
+			if(!input) {
+				fprintf(stderr, "%s: ", self);
+				perror(argv1);
+				exit(1);
+			}
+		}
+	}
+}
+
+char * const special_symbols[] = {
+	"␀", "␁", "␂", "␃", "␄", "␅", "␆", "␇", "␈", "⇥", "↵", "␋", "␌", "␍", "␎", "␏",
+	"␐", "␑", "␒", "␓", "␔", "␕", "␖", "␗", "␘", "␙", "␚", "␛", "␜", "␝", "␞", "␟",
+	"␣",
+};
+
+char *get_special(unsigned char c) {
+	if(c == 0x7f) return "⌦";
+	if(c <= ' ') return special_symbols[c];
+	return "?"; /* should never happen */
+}
+
+void set_self(const char *argv0) {
+	self = strrchr(argv0, '/');
+
+	if(self)
+		self++;
+	else
+		self = argv0;
+}
+
+void next_normal_color() {
+	cur_normal_color++;
+	cur_normal_color %= (sizeof(normal_colors) / sizeof(int));
+}
+
+void append_color(char *buf, int fgcolor, int bgcolor) {
+	char tmpbuf[100];
+
+	sprintf(tmpbuf, "\x1b[0;3%d", fgcolor);
+	strcat(buf, tmpbuf);
+	if(bgcolor) {
+		sprintf(tmpbuf, ";4%d", bgcolor);
+		strcat(buf, tmpbuf);
+	}
+	sprintf(tmpbuf, "m");
+	strcat(buf, tmpbuf);
+}
+
+void print_line(void) {
+	int spacing = MAX_DUMP_COLS - dump_column;
+	printf("%s", left_buf);
+	printf("%*s", spacing * 3 + (dump_column < 8), ""); /* pad; +1 for the mid-line gap */
+	printf(" %s\n", right_buf);
+	left_buf[0] = right_buf[0] = '\0';
+}
+
+void append_color_off(char *buf) {
+	strcat(buf, "\x1b[0m");
+}
+
+void append_right(char *str) {
+	strcat(right_buf, str);
+}
+
+void append_left(unsigned char byte, int fgcolor, int bgcolor) {
+	char tmpbuf[100];
+
+	if(!dump_column)
+		sprintf(left_buf, "%04x: ", filepos);
+
+	append_color(left_buf, fgcolor, bgcolor);
+	sprintf(tmpbuf, "%02x", byte);
+	strcat(left_buf, tmpbuf);
+	append_color_off(left_buf);
+	strcat(left_buf, " ");
+
+	if(dump_column == 7) strcat(left_buf, " ");
+	dump_column++;
+	if(dump_column == MAX_DUMP_COLS) {
+		print_line();
+		dump_column = 0;
+	}
+
+	filepos++;
+}
+
+int dump_utf8_char(void) {
+	unsigned char bytes[] = { 0, 0, 0, 0, 0 };
+	unsigned char *cont_bytes = bytes + 1;
+	char *printable;
+	int bad = 0, special = 0;
+	int c, cont_count, i, fg, bg;
+
+	c = fgetc(input);
+	if(c == EOF)
+		return 0;
+
+	bytes[0] = (unsigned char)c;
+
+	if(c < 0x80) { /* 0xxxxxxx: ASCII, including 0x7f (delete) */
+		cont_count = 0;
+		if(c <= ' ' || c == 0x7f)
+			special = 1;
+	} else if((c & 0xe0) == 0xc0) /* 110xxxxx */
+		cont_count = 1;
+	else if((c & 0xf0) == 0xe0)   /* 1110xxxx */
+		cont_count = 2;
+	else if((c & 0xf8) == 0xf0)   /* 11110xxx */
+		cont_count = 3;
+	else {
+		cont_count = 0;
+		bad = 1;
+	}
+
+	for(i = 0; i < cont_count; i++) {
+		int cb;
+		c = fgetc(input);
+
+		if(c == EOF) {
+			/* EOF in mid-sequence */
+			cont_count = i;
+			bad = 1;
+			break;
+		}
+
+		cb = cont_bytes[i] = (unsigned char)c;
+		if((cb & 0xc0) != 0x80) {
+			/* Expected 10xxxxxx, got something else */
+			cont_count = i;
+			bad = 1;
+			ungetc(cb, input);
+			break;
+		}
+	}
+
+	/* TODO: handle BOM? what about combining diacritics? */
+	if(bad) {
+		fg = BAD_FG;
+		bg = BAD_BG;
+		/* replacement character � is U+FFFD */
+		printable = "�";
+	} else if(special) {
+		fg = SPECIAL;
+		bg = 0;
+		printable = get_special(bytes[0]);
+	} else {
+		fg = normal_colors[cur_normal_color];
+		bg = 0;
+		printable = (char *)bytes;
+		next_normal_color();
+	}
+
+	append_color(right_buf, fg, bg);
+	append_right(printable);
+	append_color_off(right_buf);
+
+	for(i = 0; i <= cont_count; i++) {
+		append_left(bytes[i], fg, bg);
+	}
+
+	return 1;
+}
+
+void dump_file(void) {
+	while(dump_utf8_char())
+		;
+
+	if(dump_column)
+		print_line();
+}
+
+int main(int argc, char **argv) {
+	set_self(argv[0]);
+	open_input(argc, argv[1]);
+	dump_file();
+	fclose(input);
+	return 0;
+}
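Reviewer sketch for the BOM note in the uxd.c header comment: classifying the first bytes of the input so a UTF-8 BOM could be colorized as a special and a UTF-16 BOM could trigger a warning on stderr. The enum and the function `detect_bom` are hypothetical names, not part of this patch. One stated limitation: `ff fe 00 00` is also the UTF-32 little-endian BOM, which this sketch would report as UTF-16.

```c
#include <assert.h>
#include <stddef.h>

enum bom_kind { BOM_NONE, BOM_UTF8, BOM_UTF16_LE, BOM_UTF16_BE };

/* Hypothetical helper: look at the first `len` bytes of the input and
   report which byte order mark, if any, they start with. */
enum bom_kind detect_bom(const unsigned char *buf, size_t len) {
	assert(buf != NULL || len == 0);

	if(len >= 3 && buf[0] == 0xef && buf[1] == 0xbb && buf[2] == 0xbf)
		return BOM_UTF8;     /* U+FEFF encoded as UTF-8 */
	if(len >= 2 && buf[0] == 0xff && buf[1] == 0xfe)
		return BOM_UTF16_LE; /* little-endian UTF-16 */
	if(len >= 2 && buf[0] == 0xfe && buf[1] == 0xff)
		return BOM_UTF16_BE; /* big-endian UTF-16 */
	return BOM_NONE;
}
```

A caller would need to peek at the first three bytes without consuming them, or buffer them and feed them back into the dump loop. That part is left out here.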
diff --git a/uxd.rst b/uxd.rst
new file mode 100644
index 0000000..e5a8fff
--- /dev/null
+++ b/uxd.rst
@@ -0,0 +1,189 @@
+.. |version| replace:: 0.0.1
+.. |date| date::
+
+===
+uxd
+===
+
+----------------
+UTF-8 hex dumper
+----------------
+
+:Manual section: 1
+:Manual group: Urchlay's Utilities
+:Date: |date|
+:Version: |version|
+
+SYNOPSIS
+========
+
+uxd [*file* | *-*]
+
+DESCRIPTION
+===========
+
+**uxd** is a hex dump utility that's aware of UTF-8 multibyte sequence
+semantics.
+
+Input is read from *file*, or standard input if *file* is missing or
+given as **-**. The input is treated as UTF-8 encoded Unicode. Since
+ASCII is a subset, **uxd** works fine on plain ASCII files too. Other
+encodings such as UTF-16, ISO-8859-\*, Shift-JIS, etc, can be used, but
+**uxd** won't handle these any better than a regular hex-dump utility
+such as **xxd**.
+
+Output is written to standard output, which is normally a
+terminal. It's assumed that the terminal supports ANSI-style color and
+UTF-8. See **TERMINAL SUPPORT** below.
+
+Each line of output consists of eighteen columns: the offset from the
+start of the file (in hex; minimum 4 digits), 16 bytes of hex
+data (or empty cells, if the last line of the dump is for fewer than
+16 bytes), and the human-readable form of the same data.
+
+The hex bytes and human-readable data are colorized to make it obvious
+which bytes make up each character. Since UTF-8 is a variable-width
+encoding, this means that one character may be composed of up to
+4 bytes.
+
+OPTIONS
+=======
+
+There are no options yet.
+
+EXAMPLE
+=======
+
+It's hard to give a proper example, since man pages don't support
+color. You'll have to use your imagination. Also, this section of
+the man page requires your man command to support UTF-8 embedded in
+the man page. If the example looks mangled, try viewing the source
+(uxd.rst) in a text editor.
+
+Non-ASCII text example::
+
+   $ echo ¥ǥ£¥ | uxd
+   0000: c2 a5 c7 a5 c2 a3 c2 a5  0a                       ¥ǥ£¥↵
+         GG GG YY YY GG GG YY YY  PP                       GYGYP
+
+The colors are indicated by G/Y/P, for green, yellow, and purple. The
+character above each letter is displayed in that color.
+
+From the colorization, it's obvious that the "c2 a5" is the hex
+representation of the first ¥ character, and that the ǥ is
+represented by "c7 a5".
+
+The newline is displayed in purple because it's not a regular
+printable character. Its human-readable representation is ↵. Note
+that if a regular ↵ character appears in the input, it'll be
+rendered in either green or yellow (as a regular character).
+
+COLORS
+======
+
+**green**, **yellow**
+  Printable characters (except the space, U+0020) alternate between green and yellow.
+
+**purple**
+  Spaces and unprintable characters ("control" characters, newlines, tabs, etc).
+  These are printed as "visible" characters, e.g. ␣ for the space, ↵ for a newline.
+  This is an improvement over the usual practice of printing these as periods, like
+  standard hex dumpers do.
+
+**red**
+  Invalid UTF-8 sequences. These are rendered on a red background, to make them
+  stand out. Examples of invalid sequences:
+
+    - Prefix bytes (>= 0xc0) which are not followed by the correct number of continuation
+      bytes (with their high 2 bits set to **10**).
+
+    - Continuation bytes that aren't preceded by a valid prefix byte.
+
+    - Truncated UTF-8 sequence at EOF.
+
+TERMINAL SUPPORT
+================
+
+**uxd** should work with any modern terminal that supports color,
+ANSI-style escape sequences, Unicode, and UTF-8 rendering.
+
+The author's testing is done primarily with **urxvt**\(1).  Other
+terminals aren't tested as often.
+
+Known to work: urxvt, xterm, st, xfce4-terminal, gnome-terminal, the Linux console (but
+see **FONTS**, below).
+
+Known **not** to work: rxvt (doesn't support Unicode at all).
+
+FONTS
+=====
+
+For the human-readable column to display correctly, you'll need a font
+with lots of glyphs. Try *Deja Vu Sans Mono*, *Symbola*, *Quivira*.
+If you use urxvt, it searches for glyphs in multiple fonts, so you can
+use all of the above at once.
+
+The Linux console is capable of rendering UTF-8, but it's incapable
+of displaying more than 512 glyphs. Most console fonts only define
+256, since using more than 256 means the console won't be able to
+do bold. Expect to see lots of solid or dotted boxes. This isn't
+specifically a problem with **uxd**.
+
+FILES
+=====
+
+**uxd** doesn't read any files other than the input file, and doesn't write to
+any files other than standard output. There's no config file.
+
+ENVIRONMENT
+===========
+
+**uxd** doesn't read anything from the environment. It's *not* necessary to
+have a UTF-8 locale set in e.g. **LANG** or **LC_ALL**. Also, the **TERM**
+variable is not used.
+
+EXIT STATUS
+===========
+
+Zero for success, non-zero for failure.
+
+Failure status should only be returned if **uxd** failed to open the
+input file. Invalid input (non-UTF-8) doesn't count as an error;
+it'll just have lots of red in the output.
+
+BUGS
+====
+
+**uxd** doesn't check for overlong UTF-8 encodings (e.g. a character
+that could be a 1-byte sequence, but is encoded as 2 or more).
+Sequences like this really should be colorized in red. Technically,
+this means **uxd** supports WTF-8, not UTF-8.
+
+RFC 3629 doesn't allow UTF-8 to use codepoints above U+10FFFF. 4-byte
+sequences can support codepoints U+110000 to U+1FFFFF, which are not
+valid Unicode. If these occur in the input, **uxd** should colorize
+them in red, but it doesn't (yet).
+
+There should be options and/or a config file to change the colors,
+rather than baking them into the binary.
+
+Combining characters are not handled well. Or at all, really: the 2
+characters being combined will have an ANSI color code in between.
+urxvt at least ignores the color code, so the composite character
+displays in the color of the first (non-combining) character. I'm not
+sure what a better solution would be...
+
+COPYRIGHT
+=========
+
+Licensed under the WTFPL. See http://www.wtfpl.net/txt/copying/ for details.
+
+AUTHORS
+=======
+
+B. Watson <urchlay@slackware.uk>.
+
+SEE ALSO
+========
+
+xxd(1), bvi(1), utf-8(7), unicode(7)
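Reviewer sketch for the first two items under BUGS: once a sequence has the right lead and continuation bit patterns, decoding it to a codepoint makes the overlong and out-of-range checks straightforward. The function `utf8_strict_decode` is a hypothetical name and is not part of this patch. It also rejects UTF-16 surrogates (U+D800 to U+DFFF), which RFC 3629 excludes but the BUGS section does not mention.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: decode one UTF-8 sequence of `len` bytes (1 to 4)
   whose lead and continuation bytes already have valid bit patterns.
   Returns the codepoint, or -1 if the sequence is overlong, encodes a
   surrogate, or lies above U+10FFFF. */
long utf8_strict_decode(const unsigned char *seq, size_t len) {
	/* smallest codepoint that genuinely needs 1, 2, 3 or 4 bytes */
	static const long min_cp[] = { 0, 0x80, 0x800, 0x10000 };
	/* payload bits of the lead byte for each sequence length */
	static const unsigned char lead_mask[] = { 0x7f, 0x1f, 0x0f, 0x07 };
	long cp;
	size_t i;

	assert(seq != NULL);
	if(len < 1 || len > 4)
		return -1;

	cp = seq[0] & lead_mask[len - 1];
	for(i = 1; i < len; i++)
		cp = (cp << 6) | (seq[i] & 0x3f); /* 6 payload bits per continuation */

	if(cp < min_cp[len - 1]) return -1;         /* overlong encoding */
	if(cp >= 0xd800 && cp <= 0xdfff) return -1; /* UTF-16 surrogate */
	if(cp > 0x10ffff) return -1;                /* beyond Unicode */
	return cp;
}
```

In dump_utf8_char(), a check like this could run after the continuation-byte loop and set `bad = 1` when it returns -1, so such sequences would be shown with the invalid-sequence coloring.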