From 4df7bb4d762ff945fb7a823cb4c153cab7e3c273 Mon Sep 17 00:00:00 2001 From: "B. Watson" Date: Thu, 12 Dec 2024 06:21:05 -0500 Subject: initial commit --- Makefile | 12 +++ README | 51 ++++++++++++ uxd.1 | 198 ++++++++++++++++++++++++++++++++++++++++++++++ uxd.c | 271 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ uxd.rst | 189 ++++++++++++++++++++++++++++++++++++++++++++ 5 files changed, 721 insertions(+) create mode 100644 Makefile create mode 100644 README create mode 100644 uxd.1 create mode 100644 uxd.c create mode 100644 uxd.rst diff --git a/Makefile b/Makefile new file mode 100644 index 0000000..a9ee0e3 --- /dev/null +++ b/Makefile @@ -0,0 +1,12 @@ +CFLAGS=-O2 -fPIC -Wall + +all: uxd man + +test: uxd + ./uxd + +man: uxd.rst + rst2man uxd.rst > uxd.1 + +clean: + rm -f uxd diff --git a/README b/README new file mode 100644 index 0000000..dead264 --- /dev/null +++ b/README @@ -0,0 +1,51 @@ +uxd (Unicode-aware Hex Dumper) + +Hex dump utility that uses color to indicate multi-byte UTF-8 +sequences. + +As usual for hex dumps, output is columnar. The rightmost column +(which would be ASCII in a regular hex dump) shows one Unicode +character for each UTF-8 sequence in the dump. + +Unicode sequences in the hex column are color-coded to match their +character in the right column. Colors alternate between a set of 4, +to help keep track of which character goes with which byte sequence. + +Sample output: + +00000000: 41 e2 98 af e2 98 ae c2 bf c3 a1 e2 88 9e 42 0a A☯☮¿á∞B↵ +[colors] 1 2 3 4 1 2 3 5 12341235 + +; 0 black (don't use) +5 = 1 red +1 = 2 green +4 = 3 yellow +; 4 blue (don't use) +2 = 5 purple +3 = 6 cyan +; 7 white (don't use) + +Colors 1 to 4 are used for successive Unicode characters. For +instance, color 3 is used for the ☮ character, and also for its hex +representation "e2 98 ae" in the dump. Note that the "A" and "B" are +in the ASCII subset of Unicode, and are treated as one-byte sequences. 
+If there's a BOM, it'll be in reverse video color 1 (green), and the +printable form of it will likely be "BOM". + +Color 5 is for unprintable characters, with Unicode codepoints below +0x20 (aka "control characters"), plus a few others like 0x7f (delete). +↵ is used for newlines... note that an actual ↵ character will +also be displayed as ↵, but in one of the 4 alternating colors. + +Not shown in the dump: byte sequences that have the high bit(s) set, +but are not valid UTF-8, will be shown in color 5 (red), but in +reverse video. + +Usage: uxd [options] [ ...] + +Options should be based on xxd(1) options, though not all of them will +be supported. If uxd-specific options exist, they should ideally use +letters that xxd doesn't, to avoid confusion. + +Ideas: +support other encodings for Unicode, like UTF-16? diff --git a/uxd.1 b/uxd.1 new file mode 100644 index 0000000..87886b3 --- /dev/null +++ b/uxd.1 @@ -0,0 +1,198 @@ +.\" Man page generated from reStructuredText. +. +. +.nr rst2man-indent-level 0 +. +.de1 rstReportMargin +\\$1 \\n[an-margin] +level \\n[rst2man-indent-level] +level margin: \\n[rst2man-indent\\n[rst2man-indent-level]] +- +\\n[rst2man-indent0] +\\n[rst2man-indent1] +\\n[rst2man-indent2] +.. +.de1 INDENT +.\" .rstReportMargin pre: +. RS \\$1 +. nr rst2man-indent\\n[rst2man-indent-level] \\n[an-margin] +. nr rst2man-indent-level +1 +.\" .rstReportMargin post: +.. +.de UNINDENT +. RE +.\" indent \\n[an-margin] +.\" old: \\n[rst2man-indent\\n[rst2man-indent-level]] +.nr rst2man-indent-level -1 +.\" new: \\n[rst2man-indent\\n[rst2man-indent-level]] +.in \\n[rst2man-indent\\n[rst2man-indent-level]]u +.. +.TH "UXD" 1 "2024-12-12" "0.0.1" "Urchlay's Utilities" +.SH NAME +uxd \- UTF-8 hex dumper +.SH SYNOPSIS +.sp +uxd [\fIfile\fP | \fI\-\fP] +.SH DESCRIPTION +.sp +\fBuxd\fP is a hex dump utility that\(aqs aware of UTF\-8 multibyte sequence +semantics. 
+.sp +Input is read from \fIfile\fP, or standard input if \fIfile\fP is missing or +given as \fB\-\fP\&. The input is treated as UTF\-8 encoded Unicode. Since +ASCII is a subset, \fBuxd\fP works fine on plain ASCII files too. Other +encodings such as UTF\-16, ISO\-8859\-*, Shift\-JIS, etc, can be used, but +\fBuxd\fP won\(aqt handle these any better than a regular hex\-dump utility +such as \fBxxd\fP\&. +.sp +Output is written to standard output, which is normally a +terminal. It\(aqs assumed that the terminal supports ANSI\-style color and +UTF\-8. See \fBTERMINAL SUPPORT\fP below. +.sp +Each line of output consists of eighteen columns: the offset from the +start of the file (in hex; minimum 4 digits), 16 bytes of hex +data (or empty cells, if the last line of the dump is for fewer than +16 bytes), and the human\-readable form of the same data. +.sp +The hex bytes and human\-readable data are colorized to make it obvious +which bytes make up each character. Since UTF\-8 is a variable\-width +encoding, this means that one character may be composed of up to +4 bytes. +.SH OPTIONS +.sp +There are no options yet. +.SH EXAMPLE +.sp +It\(aqs hard to give a proper example, since man pages don\(aqt support +color. You\(aqll have to use your imagination. Also, this section of +the man page requires your man command to support UTF\-8 embedded in +the man page. If the example looks mangled, try viewing the source +(uxd.rst) in a text editor. +.sp +Japanese characters: +.INDENT 0.0 +.INDENT 3.5 +.sp +.nf +.ft C +$ echo ¥ǥ£¥ | uxd +0000: c2 a5 c7 a5 c2 a3 c2 a5 0a ¥ǥ£¥↵ + GG GG YY YY GG GG YY YY PP GYGYP +.ft P +.fi +.UNINDENT +.UNINDENT +.sp +The colors are indicated by G/Y/P, for green, yellow, and purple. The +character above each letter is displayed in that color. +.sp +From the colorization, it\(aqs obvious that the "c2 a5" is the hex +representation of the first ¥ character, and that the ǥ is +represented by "c7 a5". 
+.sp +The newline is displayed in purple because it\(aqs not a regular +printable character. Its human\-readable representation is ↵. Note +that if a regular ↵ character appears in the input, it\(aqll be +rendered in either green or yellow (as a regular character). +.SH COLORS +.INDENT 0.0 +.TP +.B \fBgreen\fP, \fByellow\fP +Printable characters (except the space, U+0020) alternate between green and yellow. +.TP +.B \fBpurple\fP +Spaces and unprintable characters ("control" characters, newlines, tabs, etc). +These are printed as "visible" characters, e.g. ␣ for the space, ↵ for a newline. +This is an improvement over the usual practice of printing these as periods, like +standard hex dumpers do. +.TP +.B \fBred\fP +Invalid UTF\-8 sequences. These are rendered with a red foreground, to make them +stand out. Examples of invalid sequences: +.INDENT 7.0 +.INDENT 3.5 +.INDENT 0.0 +.IP \(bu 2 +Prefix bytes (>= 0x80) which are not followed by the correct number of continuation +bytes (with their high 2 bits set to \fB10\fP). +.IP \(bu 2 +Continuation bytes that aren\(aqt preceded by a valid prefix byte. +.IP \(bu 2 +Truncated UTF\-8 sequence at EOF. +.UNINDENT +.UNINDENT +.UNINDENT +.UNINDENT +.SH TERMINAL SUPPORT +.sp +\fBuxd\fP should work with any modern terminal that supports color, +ANSI\-style escape sequences, Unicode, and UTF\-8 rendering. +.sp +The author\(aqs testing is done primarily with \fBurxvt\fP(1). Other +terminals aren\(aqt tested as often. +.sp +Known to work: urxvt, xterm, st, xfce4\-terminal, gnome\-terminal, the Linux console (but +see \fBFONTS\fP, below). +.sp +Known \fBnot\fP to work: rxvt (doesn\(aqt support Unicode at all). +.SH FONTS +.sp +For the human\-readable column to display correctly, you\(aqll need a font +with lots of glyphs. Try \fIDeja Vu Sans Mono\fP, \fISymbola\fP, \fIQuivira\fP\&. +If you use urxvt, it searches for glyphs in multiple fonts, so you can +use all of the above at once. 
+.sp +The Linux console is capable of rendering UTF\-8, but it\(aqs incapable +of displaying more than 512 glyphs. Most console fonts only define +256, since using more than 256 means the console won\(aqt be able to +do bold. Expect to see lots of solid or dotted boxes. This isn\(aqt +specifically a problem with \fBuxd\fP\&. +.SH FILES +.sp +\fBuxd\fP doesn\(aqt read any files other than the input file, and doesn\(aqt write to +any files other than standard output. There\(aqs no config file. +.SH ENVIRONMENT +.sp +\fBuxd\fP doesn\(aqt read anything from the environment. It\(aqs \fInot\fP necessary to +have a UTF\-8 locale set in e.g. \fBLANG\fP or \fBLC_ALL\fP\&. Also, the \fBTERM\fP +variable is not used. +.SH EXIT STATUS +.sp +Zero for success, non\-zero for failure. +.sp +Failure status should only be returned if \fBuxd\fP failed to open the +input file. Invalid input (non\-UTF\-8) doesn\(aqt count as an error; +it\(aqll just have lots of red in the output. +.SH BUGS +.sp +\fBuxd\fP doesn\(aqt check for overlong UTF\-8 encodings (e.g. a character +that could be a 1\-byte sequence, but is encoded as 2 or more). +Sequences like this really should be colorized in red. Technically, +this means \fBuxd\fP supports WTF\-8, not UTF\-8. +.sp +RFC 3629 doesn\(aqt allow UTF\-8 to use codepoints above U+10FFFF. 4\-byte +sequences can support codepoints U+110000 to U+1FFFFF, which are not +valid Unicode. If these occur in the input, \fBuxd\fP should colorize +them in red, but it doesn\(aqt (yet). +.sp +There should be options and/or a config file to change the colors, +rather than baking them into the binary. +.sp +Combining characters are not handled well. Or at all, really: the 2 +characters being combined will have an ANSI color code in between. +urxvt at least ignores the color code, so the composite character +displays in the color of the first (non\-combining) character. I\(aqm not +sure what a better solution would be... +.SH COPYRIGHT +.sp +Licensed under the WTFPL. 
See \fI\%http://www.wtfpl.net/txt/copying/\fP for details. +.SH AUTHORS +.INDENT 0.0 +.IP B. 3 +Watson <\fI\%urchlay@slackware.uk\fP>. +.UNINDENT +.SH SEE ALSO +.sp +xxd(1), bvi(1), utf\-8(7), unicode(7) +.\" Generated by docutils manpage writer. +. diff --git a/uxd.c b/uxd.c new file mode 100644 index 0000000..00a2686 --- /dev/null +++ b/uxd.c @@ -0,0 +1,271 @@ +#include +#include +#include +#include + +/* output looks like: + + 0 1 2 3 4 5 6 7 8 9 A B C D E F +0000: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 abcdefghijklmnop + +...first column will extend to more digits if needed. +*/ + +/* UTF-8 spec summary, taken from Wikipedia and elsewhere, kept here + for locality of reference. + +Codepoints 0-0x7f encode as themselves, one byte each, bit 7 always 0. + +0x80 and up are encoded as multiple bytes. The first byte's bit 7 is +always 1. The top bits determine the byte length of the sequence: + +110 - 2 bytes +1110 - 3 bytes +11110 - 4 bytes + +Continuation (2nd and further bytes) have 10 as the top 2 bits. If +we get a continuation that's not after a sequence-starter, that's an +error. If we get a sequence-starter, but the sequence doesn't have +the correct number of continuation bytes (e.g. 110xxxxx followed by +anything that isn't 10xxxxxx), that's an error too. + +BOM: if the file contains ef bb bf (aka U+FEFF), it should be colorized +as a special (non-printable). +If the file begins with ff fe, it's UTF-16 (little endian). If it's +fe ff, it's UTF-16 big-endian. Probably we should detect these and +print a warning on stderr. 
+*/ + +/* max UTF-8 sequence length, in bytes */ +#define MAXUTF8 4 + +/* ANSI color */ +#define BLACK 0 /* don't use */ +#define RED 1 +#define GREEN 2 +#define YELLOW 3 +#define BLUE 4 /* don't use */ +#define PURPLE 5 +#define CYAN 6 +#define WHITE 7 /* don't use */ + +#define SPECIAL PURPLE + +#define BAD_FG BLACK +#define BAD_BG RED + +// const int normal_colors[] = { GREEN, PURPLE, CYAN }; +const int normal_colors[] = { GREEN, YELLOW }; +int cur_normal_color = 0; +int dump_color; + +const char *self; +FILE *input; + +/* these buffers are bigger than they need to be really. */ +char left_buf[4096]; +char right_buf[4096]; + +#define MAX_DUMP_COLS 16 +int dump_column = 0; +int filepos = 0; + +void usage(void) { + printf("Usage: %s \n", self); + printf(" With no , or with -, read standard input.\n"); + exit(0); +} + +void open_input(const int argc, const char *argv1) { + if(argc == 1) { + input = stdin; + return; + } + + if(argv1[0] == '-' && argv1[1] != '\0') { + usage(); + } + + if(argc == 2) { + if(strcmp(argv1, "-") == 0) + input = stdin; + else { + input = fopen(argv1, "rb"); + if(!input) { + fprintf(stderr, "%s: ", self); + perror(argv1); + exit(1); + } + } + } +} + +char * const special_symbols[] = { + "␀", "␁", "␂", "␃", "␄", "␅", "␆", "␇", "␈", "⇥", "↵", "␋", "␌", "␍", "␎", "␏", + "␐", "␑", "␒", "␓", "␔", "␕", "␖", "␗", "␘", "␙", "␚", "␛", "␜", "␝", "␞", "␟", + "␣", +}; + +char *get_special(unsigned char c) { + if(c == 0x7f) return "⌦"; + if(c <= ' ') return special_symbols[c]; + return "?"; /* should never happen */ +} + +void set_self(const char *argv0) { + self = strrchr(argv0, '/'); + + if(self) + self++; + else + self = argv0; +} + +void next_normal_color() { + cur_normal_color++; + cur_normal_color %= (sizeof(normal_colors) / sizeof(int)); +} + +void append_color(char *buf, int fgcolor, int bgcolor) { + char tmpbuf[100]; + + sprintf(tmpbuf, "\x1b[0;3%d", fgcolor); + strcat(buf, tmpbuf); + if(bgcolor) { + sprintf(tmpbuf, ";4%d", bgcolor); + 
strcat(buf, tmpbuf); + } + sprintf(tmpbuf, "m"); + strcat(buf, tmpbuf); +} + +void print_line(void) { + int spacing = MAX_DUMP_COLS - dump_column; + printf("%s", left_buf); + while(spacing--) printf(" "); + printf(" %s\n", right_buf); + left_buf[0] = right_buf[0] = '\0'; +} + +void append_color_off(char *buf) { + strcat(buf, "\x1b[0m"); +} + +void append_right(char *str) { + strcat(right_buf, str); +} + +void append_left(unsigned char byte, int fgcolor, int bgcolor) { + char tmpbuf[100]; + + if(!dump_column) + sprintf(left_buf, "%04x: ", filepos); + + append_color(left_buf, fgcolor, bgcolor); + sprintf(tmpbuf, "%02x", byte); + strcat(left_buf, tmpbuf); + append_color_off(left_buf); + strcat(left_buf, " "); + + if(dump_column == 7) strcat(left_buf, " "); + dump_column++; + if(dump_column == MAX_DUMP_COLS) { + print_line(); + dump_column = 0; + } + + filepos++; +} + +int dump_utf8_char(void) { + unsigned char bytes[] = { 0, 0, 0, 0, 0 }; + unsigned char *cont_bytes = bytes + 1; + char *printable; + int bad = 0, special = 0; + int c, cont_count, i, fg, bg; + + c = fgetc(input); + if(c == EOF) + return 0; + + bytes[0] = (unsigned char)c; + + if(c <= 0x7f) { /* bit 7 clear: one-byte sequence, 0x7f (delete) included */ + cont_count = 0; + if(c <= ' ' || c == 0x7f) + special = 1; + } else if((c & 0xe0) == 0xc0) /* 110xxxxx */ + cont_count = 1; + else if((c & 0xf0) == 0xe0) /* 1110xxxx */ + cont_count = 2; + else if((c & 0xf8) == 0xf0) /* 11110xxx */ + cont_count = 3; + else { + cont_count = 0; + bad = 1; + } + + for(i = 0; i < cont_count; i++) { + int cb; + c = fgetc(input); + + if(c == EOF) { + /* EOF in mid-sequence */ + cont_count = i; + bad = 1; + break; + } + + cb = cont_bytes[i] = (unsigned char)c; + if((cb & 0xc0) != 0x80) { + /* Expected 10xxxxxx, got something else */ + cont_count = i; + bad = 1; + ungetc(cb, input); + break; + } + } + + /* TODO: handle BOM? what about combining diacritics? 
*/ + if(bad) { + fg = BAD_FG; + bg = BAD_BG; + /* replacement character � is U+FFFD */ + printable = "�"; + } else if(special) { + fg = SPECIAL; + bg = 0; + printable = get_special(bytes[0]); + } else { + fg = normal_colors[cur_normal_color]; + bg = 0; + printable = (char *)bytes; + next_normal_color(); + } + + append_color(right_buf, fg, bg); + append_right(printable); + append_color_off(right_buf); + + for(i = 0; i <= cont_count; i++) { + append_left(bytes[i], fg, bg); + } + + return 1; +} + +void dump_file(void) { + while(dump_utf8_char()) + ; + + if(dump_column) + print_line(); +} + +int main(int argc, char **argv) { + set_self(argv[0]); + open_input(argc, argv[1]); + dump_file(); + fclose(input); + return 0; +} diff --git a/uxd.rst b/uxd.rst new file mode 100644 index 0000000..e5a8fff --- /dev/null +++ b/uxd.rst @@ -0,0 +1,189 @@ +.. |version| replace:: 0.0.1 +.. |date| date:: + +=== +uxd +=== + +---------------- +UTF-8 hex dumper +---------------- + +:Manual section: 1 +:Manual group: Urchlay's Utilities +:Date: |date| +:Version: |version| + +SYNOPSIS +======== + +uxd [*file* | *-*] + +DESCRIPTION +=========== + +**uxd** is a hex dump utility that's aware of UTF-8 multibyte sequence +semantics. + +Input is read from *file*, or standard input if *file* is missing or +given as **-**. The input is treated as UTF-8 encoded Unicode. Since +ASCII is a subset, **uxd** works fine on plain ASCII files too. Other +encodings such as UTF-16, ISO-8859-\*, Shift-JIS, etc, can be used, but +**uxd** won't handle these any better than a regular hex-dump utility +such as **xxd**. + +Output is written to standard output, which is normally a +terminal. It's assumed that the terminal supports ANSI-style color and +UTF-8. See **TERMINAL SUPPORT** below. 
+ +Each line of output consists of eighteen columns: the offset from the +start of the file (in hex; minimum 4 digits), 16 bytes of hex +data (or empty cells, if the last line of the dump is for fewer than +16 bytes), and the human-readable form of the same data. + +The hex bytes and human-readable data are colorized to make it obvious +which bytes make up each character. Since UTF-8 is a variable-width +encoding, this means that one character may be composed of up to +4 bytes. + +OPTIONS +======= + +There are no options yet. + +EXAMPLE +======= + +It's hard to give a proper example, since man pages don't support +color. You'll have to use your imagination. Also, this section of +the man page requires your man command to support UTF-8 embedded in +the man page. If the example looks mangled, try viewing the source +(uxd.rst) in a text editor. + +Japanese text example:: + + $ echo ¥ǥ£¥ | uxd + 0000: c2 a5 c7 a5 c2 a3 c2 a5 0a ¥ǥ£¥↵ + GG GG YY YY GG GG YY YY PP GYGYP + +The colors are indicated by G/Y/P, for green, yellow, and purple. The +character above each letter is displayed in that color. + +From the colorization, it's obvious that the "c2 a5" is the hex +representation of the first ¥ character, and that the ǥ is +represented by "c7 a5". + +The newline is displayed in purple because it's not a regular +printable character. Its human-readable representation is ↵. Note +that if a regular ↵ character appears in the input, it'll be +rendered in either green or yellow (as a regular character). + +COLORS +====== + +**green**, **yellow** + Printable characters (except the space, U+0020) alternate between green and yellow. + +**purple** + Spaces and unprintable characters ("control" characters, newlines, tabs, etc). + These are printed as "visible" characters, e.g. ␣ for the space, ↵ for a newline. + This is an improvement over the usual practice of printing these as periods, like + standard hex dumpers do. + +**red** + Invalid UTF-8 sequences. 
These are rendered with a red foreground, to make them + stand out. Examples of invalid sequences: + + - Prefix bytes (>= 0x80) which are not followed by the correct number of continuation + bytes (with their high 2 bits set to **10**). + + - Continuation bytes that aren't preceded by a valid prefix byte. + + - Truncated UTF-8 sequence at EOF. + +TERMINAL SUPPORT +================ + +**uxd** should work with any modern terminal that supports color, +ANSI-style escape sequences, Unicode, and UTF-8 rendering. + +The author's testing is done primarily with **urxvt**\(1). Other +terminals aren't tested as often. + +Known to work: urxvt, xterm, st, xfce4-terminal, gnome-terminal, the Linux console (but +see **FONTS**, below). + +Known **not** to work: rxvt (doesn't support Unicode at all). + +FONTS +===== + +For the human-readable column to display correctly, you'll need a font +with lots of glyphs. Try *Deja Vu Sans Mono*, *Symbola*, *Quivira*. +If you use urxvt, it searches for glyphs in multiple fonts, so you can +use all of the above at once. + +The Linux console is capable of rendering UTF-8, but it's incapable +of displaying more than 512 glyphs. Most console fonts only define +256, since using more than 256 means the console won't be able to +do bold. Expect to see lots of solid or dotted boxes. This isn't +specifically a problem with **uxd**. + +FILES +===== + +**uxd** doesn't read any files other than the input file, and doesn't write to +any files other than standard output. There's no config file. + +ENVIRONMENT +=========== + +**uxd** doesn't read anything from the environment. It's *not* necessary to +have a UTF-8 locale set in e.g. **LANG** or **LC_ALL**. Also, the **TERM** +variable is not used. + +EXIT STATUS +=========== + +Zero for success, non-zero for failure. + +Failure status should only be returned if **uxd** failed to open the +input file. Invalid input (non-UTF-8) doesn't count as an error; +it'll just have lots of red in the output. 
+ +BUGS +==== + +**uxd** doesn't check for overlong UTF-8 encodings (e.g. a character +that could be a 1-byte sequence, but is encoded as 2 or more). +Sequences like this really should be colorized in red. Technically, +this means **uxd** supports WTF-8, not UTF-8. + +RFC 3629 doesn't allow UTF-8 to use codepoints above U+10FFFF. 4-byte +sequences can support codepoints U+110000 to U+1FFFFF, which are not +valid Unicode. If these occur in the input, **uxd** should colorize +them in red, but it doesn't (yet). + +There should be options and/or a config file to change the colors, +rather than baking them into the binary. + +Combining characters are not handled well. Or at all, really: the 2 +characters being combined will have an ANSI color code in between. +urxvt at least ignores the color code, so the composite character +displays in the color of the first (non-combining) character. I'm not +sure what a better solution would be... + +COPYRIGHT +========= + +Licensed under the WTFPL. See http://www.wtfpl.net/txt/copying/ for details. + +AUTHORS +======= + +B. Watson <urchlay@slackware.uk>. + +SEE ALSO +======== + +xxd(1), bvi(1), utf-8(7), unicode(7) -- cgit v1.2.3
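Note on the BUGS section above: the two decoder gaps it lists (overlong encodings, and codepoints above U+10FFFF) could probably be closed with a check along the following lines. This is only a sketch and not code from uxd.c. The name utf8_seq_is_invalid() is hypothetical, and the function assumes the caller has already checked the lead byte and the 10xxxxxx continuation bytes, the way dump_utf8_char() does.

```c
/* Hypothetical helper, not part of uxd.c.
   b points at a sequence of len bytes (1 to 4) whose lead byte and
   continuation bytes already have the right bit patterns.
   Returns 1 if the sequence is overlong or decodes to a codepoint
   above U+10FFFF, 0 otherwise. */
int utf8_seq_is_invalid(const unsigned char *b, int len) {
	/* smallest codepoint that actually needs a sequence of each length;
	   index 0 is unused */
	static const unsigned long min_cp[] = { 0, 0x00, 0x80, 0x800, 0x10000 };
	unsigned long cp;
	int i;

	/* payload bits of the lead byte */
	switch(len) {
		case 1: cp = b[0];        break;
		case 2: cp = b[0] & 0x1f; break;
		case 3: cp = b[0] & 0x0f; break;
		case 4: cp = b[0] & 0x07; break;
		default: return 1;
	}

	/* each continuation byte contributes its low 6 bits */
	for(i = 1; i < len; i++)
		cp = (cp << 6) | (b[i] & 0x3f);

	if(cp < min_cp[len]) return 1; /* overlong encoding */
	if(cp > 0x10ffff) return 1;    /* beyond the RFC 3629 limit */
	return 0;
}
```

In dump_utf8_char(), a helper like this could be called with bytes and cont_count + 1 once the continuation loop finishes without setting bad; a nonzero result would then be handled like any other bad sequence.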