.\" Man page generated from reStructuredText.
.
.
.nr rst2man-indent-level 0
.
.de1 rstReportMargin
\\$1 \\n[an-margin]
level \\n[rst2man-indent-level]
level margin: \\n[rst2man-indent\\n[rst2man-indent-level]]
-
\\n[rst2man-indent0]
\\n[rst2man-indent1]
\\n[rst2man-indent2]
..
.de1 INDENT
.\" .rstReportMargin pre:
. RS \\$1
. nr rst2man-indent\\n[rst2man-indent-level] \\n[an-margin]
. nr rst2man-indent-level +1
.\" .rstReportMargin post:
..
.de UNINDENT
. RE
.\" indent \\n[an-margin]
.\" old: \\n[rst2man-indent\\n[rst2man-indent-level]]
.nr rst2man-indent-level -1
.\" new: \\n[rst2man-indent\\n[rst2man-indent-level]]
.in \\n[rst2man-indent\\n[rst2man-indent-level]]u
..
.TH "UXD" 1 "2024-12-12" "0.0.1" "Urchlay's Utilities"
.SH NAME
uxd \- UTF-8 hex dumper
.SH SYNOPSIS
.sp
uxd [\fIfile\fP | \fI\-\fP]
.SH DESCRIPTION
.sp
\fBuxd\fP is a hex dump utility that\(aqs aware of UTF\-8 multibyte sequence
semantics, and uses colorized output to indicate which byte
sequences go with which human\-readable characters.
.sp
Input is read from \fIfile\fP, or standard input if \fIfile\fP is missing or
given as \fB\-\fP\&. The input is treated as UTF\-8 encoded Unicode. Since
ASCII is a subset, \fBuxd\fP works fine on plain ASCII files too. Other
encodings such as UTF\-16, ISO\-8859\-*, Shift\-JIS, etc., can be used, but
\fBuxd\fP won\(aqt handle these any better than a regular hex\-dump utility
such as \fBxxd\fP\&.
.sp
Output is written to standard output, which is normally a
terminal. It\(aqs assumed that the terminal supports ANSI\-style color and
UTF\-8. See \fBTERMINAL SUPPORT\fP below. If you want to pipe the output
to a pager, try \fBless \-R\fP\&.
.sp
Each line of output consists of eighteen columns: the offset from the
start of the file (in hex; minimum 4 digits), 16 bytes of hex
data (or empty cells, if the last line of the dump is for fewer than
16 bytes), and the human\-readable form of the same data.
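.sp
As a rough illustration of the row layout described above, the following
Python sketch formats the offset and the hex cells of one row. It is not
taken from the \fBuxd\fP source: the name \fBformat_row\fP, its parameters,
and the space padding of short rows are assumptions, and both color and
the human\-readable column are left out.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def format_row(offset, chunks, width=16):
    """Format one dump row: the hex offset, then the hex cells.

    chunks holds one bytes object per character; the bytes of one
    character are joined with dashes, as in the EXAMPLE section.
    """
    cells = " ".join("-".join(format(b, "02x") for b in c) for c in chunks)
    # Each byte takes 2 hex digits plus 1 separator, except the last one.
    # Padding short rows with spaces is an assumption of this sketch.
    return format(offset, "04x") + ": " + cells.ljust(width * 3 - 1)


chunks = [bytes.fromhex(h) for h in ("c2a5", "c7a5", "c2a3", "c2a5", "0a")]
print(format_row(0, chunks).rstrip())
```
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The \fB04x\fP format gives the minimum of 4 hex digits for the offset and
simply grows wider for offsets past 0xffff.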
.sp
The hex bytes and human\-readable data are colorized to make it obvious
which bytes make up each character. Since UTF\-8 is a variable\-width
encoding, this means that one character may be composed of up to
4 bytes.
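.sp
The length of each sequence can be read off its first byte. The Python
sketch below shows the standard UTF\-8 lead\-byte ranges; the function name
\fButf8_sequence_length\fP is hypothetical, and this is not necessarily how
\fBuxd\fP itself does it.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def utf8_sequence_length(lead):
    """Return the sequence length implied by a lead byte.

    Returns 0 if the byte cannot start a sequence. Only the bit
    pattern is examined: 0xc0, 0xc1 and 0xf5..0xf7 match a pattern
    here even though no valid UTF-8 text contains them.
    """
    if lead < 0x80:
        return 1  # 0xxxxxxx: plain ASCII
    if lead < 0xC0:
        return 0  # 10xxxxxx: continuation byte, not a lead byte
    if lead < 0xE0:
        return 2  # 110xxxxx
    if lead < 0xF0:
        return 3  # 1110xxxx
    if lead < 0xF8:
        return 4  # 11110xxx
    return 0      # 0xf8 and above never appear in UTF-8


# The bytes from the EXAMPLE section: four 2-byte characters and a newline.
data = bytes.fromhex("c2a5c7a5c2a3c2a50a")
print([utf8_sequence_length(b) for b in data])
```
.ft P
.fi
.UNINDENT
.UNINDENT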
.SH OPTIONS
.sp
There are no options yet.
.SH EXAMPLE
.sp
It\(aqs hard to give a proper example, since man pages don\(aqt support
color. You\(aqll have to use your imagination. Also, this section of
the man page requires your man command to support UTF\-8 embedded in
the man page. If the example looks mangled, try viewing the source
(uxd.rst) in a text editor.
.sp
Example with non\-ASCII text:
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
$ echo ¥ǥ£¥ | uxd
0000: c2\-a5 c7\-a5 c2\-a3 c2\-a5 0a ¥ǥ£¥↵
      GGGGG YYYYY GGGGG YYYYY PP GYGYP
.ft P
.fi
.UNINDENT
.UNINDENT
.sp
The colors are indicated by G/Y/P, for green, yellow, and purple. The
character above each letter is displayed in that color.
.sp
From the colorization, and from the dashes between the bytes, it\(aqs
obvious that "c2 a5" is the hex representation of the first ¥
character, and that the ǥ is represented by "c7 a5".
.sp
The newline is displayed in purple because it\(aqs not a regular
printable character. Its human\-readable representation is ↵. Note
that if a regular ↵ character appears in the input, it\(aqll be
rendered in either green or yellow (as a regular character).
.SH COLORS
.INDENT 0.0
.TP
.B \fBgreen\fP, \fByellow\fP
Printable characters (except the space, U+0020) alternate between green and yellow.
.TP
.B \fBpurple\fP
Spaces and unprintable characters ("control" characters, newlines, tabs, etc).
These are printed as "visible" characters, e.g. ␣ for the space, ↵ for a newline.
Hopefully this is an improvement over the usual practice of printing these as periods, like
standard hex dumpers do. The Unicode BOM (byte order mark, U+FEFF) is printed
as a purple letter B.
.TP
.B \fBred\fP
Invalid UTF\-8 sequences. These are rendered with a red foreground, to make them
stand out. Examples of invalid sequences:
.INDENT 7.0
.INDENT 3.5
.INDENT 0.0
.IP \(bu 2
Prefix bytes (>= 0xc0) which are not followed by the correct number of continuation
bytes (with their high 2 bits set to \fB10\fP).
.IP \(bu 2
Continuation bytes that aren\(aqt preceded by a valid prefix byte.
.IP \(bu 2
Truncated UTF\-8 sequence at EOF.
.IP \(bu 2
Codepoints above U+10FFFF, which are disallowed by RFC 3629.
.UNINDENT
.UNINDENT
.UNINDENT
.UNINDENT
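.sp
The checks listed above can be sketched in Python roughly as follows.
This is an illustration under stated assumptions, not the \fBuxd\fP
implementation: the name \fBclassify\fP is hypothetical, and flagging only
the first byte of a bad sequence before resuming at the next byte is a
choice made for this sketch. Overlong encodings and surrogates are not
checked.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
def classify(data):
    """Split data into (chunk, valid) pairs, one per character or bad byte."""
    out = []
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            need, cp = 0, lead
        elif 0xC0 <= lead < 0xE0:
            need, cp = 1, lead & 0x1F
        elif 0xE0 <= lead < 0xF0:
            need, cp = 2, lead & 0x0F
        elif 0xF0 <= lead < 0xF8:
            need, cp = 3, lead & 0x07
        else:
            # Stray continuation byte (0x80..0xbf) or a byte >= 0xf8.
            out.append((data[i:i + 1], False))
            i += 1
            continue
        tail = data[i + 1:i + 1 + need]
        # A short tail means the sequence was truncated at EOF.
        ok = len(tail) == need and all((b & 0xC0) == 0x80 for b in tail)
        if ok:
            for b in tail:
                cp = (cp << 6) | (b & 0x3F)
            ok = cp <= 0x10FFFF  # limit set by RFC 3629
        if ok:
            out.append((data[i:i + 1 + need], True))
            i += 1 + need
        else:
            # Assumption: flag the lead byte alone, then resynchronize.
            out.append((data[i:i + 1], False))
            i += 1
    return out


for chunk, valid in classify(bytes.fromhex("c2a5a5e282")):
    print(chunk.hex(), "ok" if valid else "red")
```
.ft P
.fi
.UNINDENT
.UNINDENT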
.SH TERMINAL SUPPORT
.sp
\fBuxd\fP should work with any modern terminal that supports color,
ANSI\-style escape sequences, Unicode, and UTF\-8 rendering.
.sp
The author\(aqs testing is done primarily with \fBurxvt\fP(1). Other
terminals aren\(aqt tested as often.
.sp
Known to work: urxvt, xterm, st, xfce4\-terminal, gnome\-terminal, kitty, the Linux console (but
see \fBFONTS\fP, below).
.sp
Known \fBnot\fP to work: rxvt (doesn\(aqt support Unicode at all).
.SH FONTS
.sp
For the human\-readable column to display correctly, you\(aqll need a font
with lots of glyphs. Try \fIDejaVu Sans Mono\fP, \fISymbola\fP, or \fIQuivira\fP\&.
If you use urxvt, it searches for glyphs in multiple fonts, so you can
use all of the above at once.
.sp
The Linux console is capable of rendering UTF\-8, but it\(aqs incapable
of displaying more than 512 glyphs. Most console fonts only define
256, since using more than 256 means the console won\(aqt be able to
do bold. Expect to see lots of solid or dotted boxes. This isn\(aqt
specifically a problem with \fBuxd\fP\&.
.SH FILES
.sp
\fBuxd\fP doesn\(aqt read any files other than the input file, and doesn\(aqt write to
any files other than standard output. There\(aqs no config file.
.SH ENVIRONMENT
.sp
\fBuxd\fP doesn\(aqt read anything from the environment. It\(aqs \fInot\fP necessary to
have a UTF\-8 locale set in e.g. \fBLANG\fP or \fBLC_ALL\fP\&. Also, the \fBTERM\fP
variable is not used.
.SH EXIT STATUS
.sp
Zero for success, non\-zero for failure.
.sp
Failure status should only be returned if \fBuxd\fP failed to open the
input file. Invalid input (non\-UTF\-8) doesn\(aqt count as an error;
it\(aqll just have lots of red in the output.
.SH BUGS
.sp
\fBuxd\fP doesn\(aqt check for overlong UTF\-8 encodings (e.g. a character
that could be a 1\-byte sequence, but is encoded as 2 or more).
Sequences like this really should be colorized in red. Technically,
this means \fBuxd\fP supports WTF\-8, not UTF\-8.
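.sp
For reference, an overlong check could look something like the Python
sketch below. It is only an illustration of the missing test, not code
from \fBuxd\fP; the names \fBMIN_CODEPOINT\fP and \fBis_overlong\fP are
hypothetical, and the input is assumed to be a structurally valid
sequence of 1 to 4 bytes.
.INDENT 0.0
.INDENT 3.5
.sp
.nf
.ft C
```python
# Smallest code point that genuinely needs a sequence of each length.
MIN_CODEPOINT = {1: 0x00, 2: 0x80, 3: 0x800, 4: 0x10000}


def is_overlong(seq):
    """Return True if seq encodes a code point that fits in fewer bytes.

    seq must be one structurally valid UTF-8 sequence (1 to 4 bytes).
    """
    n = len(seq)
    if n == 1:
        return False
    # Payload bits of the lead byte: 5, 4 or 3 for lengths 2, 3 and 4.
    cp = seq[0] & (0xFF >> (n + 1))
    for b in seq[1:]:
        cp = (cp << 6) | (b & 0x3F)
    return cp < MIN_CODEPOINT[n]


# "c0 af" is a 2-byte encoding of "/" (U+002F), which fits in 1 byte.
print(is_overlong(bytes.fromhex("c0af")))
print(is_overlong(bytes.fromhex("c2a5")))
```
.ft P
.fi
.UNINDENT
.UNINDENT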
.sp
There should be options and/or a config file to change the colors,
rather than baking them into the binary.
.sp
Combining characters are not handled well. Or at all, really: the 2
characters being combined will have an ANSI color code in between.
urxvt at least ignores the color code, so the composite character
displays in the color of the first (non\-combining) character. I\(aqm not
sure what a better solution would be...
.SH COPYRIGHT
.sp
Licensed under the WTFPL. See \fI\%http://www.wtfpl.net/txt/copying/\fP for details.
.SH AUTHORS
.sp
B. Watson <\fI\%urchlay@slackware.uk\fP>
.SH SEE ALSO
.sp
xxd(1), bvi(1), utf\-8(7), unicode(7)
.\" Generated by docutils manpage writer.
.