.. include:: ver.rst

.. |date| date::

===
uxd
===

----------------
UTF-8 hex dumper
----------------

:Manual section: 1
:Manual group: Urchlay's Utilities
:Date: |date|
:Version: |version|

SYNOPSIS
========

uxd [**-n**] [**-c** *colors*] [**-l** *length*] [**-o** *offset*] [[**-s** | **-S**] *seekpos*] [-[**bchimnruv**]] [*file* | *-*]

DESCRIPTION
===========

**uxd** is a hex dump utility that's aware of UTF-8 multibyte sequence
semantics, and uses colorized output to indicate which byte
sequences go with which human-readable characters.

Input is read from *file*, or standard input if *file* is missing or
given as **-**. The input is treated as UTF-8 encoded Unicode. Since
ASCII is a subset, **uxd** works fine on plain ASCII files too. Other
encodings such as UTF-16, ISO-8859-\*, Shift-JIS, etc., can be used, but
**uxd** won't handle these any better than a regular hex-dump utility
such as **xxd**.

Output is written to standard output, which is normally a
terminal. It's assumed that the terminal supports ANSI-style color and
UTF-8. See **TERMINAL SUPPORT** below. If you want to pipe the output
to a pager, try **less -R**.

OPTIONS
=======

These options can be used on the command line, and/or in the
**UXD_OPTS** environment variable. The command line takes precedence
over the environment.

Options can be bundled: **-ubc1234** is the same as **-u** **-b** **-c
1234**. The one exception is the **-n** option, which should appear
by itself.

.. the comments are turned into the --help message by mkusage.pl.

-1
   Don't alternate colors for normal characters.

.. don't alternate colors for normal characters.

-b
  Bold output. This may be more or less readable, depending on your
  terminal and its color settings. Ignored if **-m** given.

.. bold color output.

-c nnnn
  Set the colors to use. Must be 2 to 4 digits, from 0 to 7. These are
  standard ANSI colors. The first 2 are the alternating colors for
  normal characters, the 3rd digit (optional) is the color for non-printable
  and space characters, and the 4th (optional) is for invalid UTF-8
  sequences. Default: **2351**. This option also disables a prior **-m**
  option.

.. colors (2 to 4 digits, 0 to 7).

-h, --help
  Print built-in usage message and exit.

.. print this help message.

-i
  After dumping, print information about the input: number of bytes,
  characters, ASCII (one-byte) characters, multi-byte characters, and
  bad sequences.

.. print number of bytes/chars/ascii/multibyte/bad sequences.

-l length
  Stop dumping after *length* bytes (not characters). If the limit is
  reached in the middle of a multibyte character, the entire character
  will be dumped.

.. stop dumping after <length> bytes (not characters).

-m
  Monochrome mode. Uses underline, bold, and reverse video instead of color.
  Use this if you have trouble distinguishing the colors, or if they
  look too much like angry fruit salad. Disables prior **-b**, **-c**
  options.

.. monochrome mode.

-n
  Ignore **UXD_OPTS** environment variable. This option should not be
  bundled with other options (e.g. use **-n -u**, not **-nu**).

.. ignore UXD_OPTS environment variable.

-o offset
  Add this amount to the hex offsets (left column). May be negative,
  if you can think of a reason to want it to be. Can be given in
  decimal, hex (with *0x* prefix), or octal (with *0* prefix).

.. added to hex offsets (decimal, 0x hex, 0 octal).

-r
  Highlight multi-byte sequences in reverse video, in the hex
  output. Ignored if **-m** given.

.. highlight multi-byte chars in reverse video.

-s pos
  Seek in input before starting to dump. *pos* is bytes, not
  characters. Positive *pos* means seek from the start of the
  input. Negative *pos* only works on files (not standard input);
  it means seek backward from EOF. Can be given in decimal, hex (with
  *0x* prefix), or octal (with *0* prefix).

.. seek in input before dumping (-pos = seek back from EOF).

-S pos
  Same as **-s**, but file offsets start at 0 rather than the
  position after seeking. **-S 100** is the same as **-s 100 -o -100**.
  Works with negative *pos*, too.
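
The number formats accepted by **-o**, **-s**, and **-S**, and the way
**-S** relates to **-o**, can be modeled in a few lines of Python. This
is a sketch for illustration only, not **uxd**'s actual parser; the
names ``parse_num`` and ``first_label`` are invented here, and negative
seek positions (relative to EOF) are not modeled::

  def parse_num(text):
      """Parse decimal, 0x-prefixed hex, or 0-prefixed octal, with optional sign."""
      sign = 1
      if text[:1] in ("+", "-"):
          sign = -1 if text[0] == "-" else 1
          text = text[1:]
      if text.lower().startswith("0x"):
          value = int(text[2:], 16)
      elif text.startswith("0") and len(text) > 1:
          value = int(text, 8)
      else:
          value = int(text, 10)
      return sign * value

  def first_label(seek, offset=0):
      """Offset shown in the left column for the first dumped byte."""
      return seek + offset

  # -s 100 alone labels the first byte 100 (hex 0064);
  # -s 100 -o -100, i.e. -S 100, labels it 0.
  print(first_label(parse_num("100")))
  print(first_label(parse_num("0x64"), parse_num("-100")))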

.. like -s, but also sets -o so addresses start at 0.

-u
  Use uppercase hex digits *A-F*. Default is lowercase.

.. uppercase hex digits.

-v, --version
  Print version number and exit.

.. print version of uxd.

OUTPUT FORMAT
=============

The output is designed to fit in an 80-column terminal.

Each line of output consists of eighteen columns: the offset from the
start of the file (in hex; minimum 4 digits), 16 bytes of hex
data (or empty cells, if the last line of the dump is for fewer than
16 bytes), and the human-readable form of the same data.

The hex bytes and human-readable data are colorized to make it obvious
which bytes make up each character. Since UTF-8 is a variable-width
encoding, this means that one character may be composed of up to
4 bytes.

The hex bytes that make up one character are displayed in the same
color, which alternates between yellow and green for successive
characters. In addition, they have dashes instead of spaces between
them. An example would be **c3-b1** (for an ñ character).
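
The byte counts and the dash-joined notation can be reproduced with a
few lines of Python (3.8 or later). This only illustrates the encoding
and is not part of **uxd**; the helper name ``dashed`` is made up for
this example::

  def dashed(ch):
      """Return the UTF-8 bytes of one character as dash-joined hex."""
      return ch.encode("utf-8").hex("-")

  # One to four bytes per character, depending on the codepoint.
  for ch in ["A", "ñ", "デ", "\U0001F600"]:
      print(f"U+{ord(ch):04X} {dashed(ch)}")

The second line printed is ``U+00F1 c3-b1``, matching the ñ example
above.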

The 16-byte hex display always has an extra "spacer" column in the
center. Normally this is a space, but if a multibyte character spans
it, it will be a dash (so there'll be two dashes: **c3--b1**).

Since the output lines are always 16 hex bytes, multibyte characters
can span two lines. When this happens, the character itself will be
printed on the first line, along with the first byte(s) in hex. The
last hex byte will be followed by a dash, and the next line of hex
dump will have the remaining bytes (in the same color as the first
bytes and character). This sounds complicated, but it's easy to
understand once you see it a few times.
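
The dash and spacer rules for a single output line can be sketched as
follows. This is a hypothetical Python model of the layout described
above, not **uxd**'s own code; ``hex_cells`` is an invented name, and
the sketch ignores the offset column, colors, padding of short lines,
and characters that continue onto the next line::

  def hex_cells(chars):
      """Format up to 16 bytes, given as one bytes object per character."""
      # Tag every byte with the index of the character it belongs to.
      tagged = [(b, i) for i, ch in enumerate(chars) for b in ch]
      out = []
      for pos, (byte, owner) in enumerate(tagged):
          out.append(f"{byte:02x}")
          if pos + 1 == len(tagged):
              break
          # Dash inside a character, space between characters.
          sep = "-" if tagged[pos + 1][1] == owner else " "
          # The extra spacer column sits after the 8th byte.
          out.append(sep * 2 if pos == 7 else sep)
      return "".join(out)

  chars = [c.encode("utf-8") for c in "デフォル\n"]
  print(hex_cells(chars))

For this input the sketch prints
``e3-83-87 e3-83-95 e3-82--a9 e3-83-ab 0a``, where the double dash
marks the third character spanning the center spacer.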

EXAMPLE
=======

It's hard to give a proper example, since man pages don't support
color. You'll have to use your imagination. Also, this section of
the man page requires your man command to support UTF-8 embedded in
the man page. If the example looks mangled, try viewing the source
(uxd.rst) in a text editor.

Example copied from the Japanese **ls**\(1) man page::

  $ echo デフォル | ./uxd
  0000: e3-83-87 e3-83-95 e3-82--a9 e3-83-ab 0a           デフォル↵
        GGGGGGGG YYYYYYYY GGGGGGGGG YYYYYYYY PP           G Y G Y P

The colors are indicated by G/Y/P, for green, yellow, and purple. The
character above each letter is displayed in that color.

From the colorization, and from the dashes between the bytes, it's
obvious that "e3 83 87" is the hex representation of the first
character, and that the 2nd is represented by "e3 83 95".

The newline is displayed in purple because it's not a regular
printable character. Its human-readable representation is ↵. Note
that if a regular ↵ character appears in the input, it'll be
rendered in either green or yellow (so you can tell it's not just
another newline).

COLORS
======

The colors in this description are the default ones. They can be
changed with the **-c** option (see above).
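
As a rough illustration of how a **-c** argument maps onto the four
color roles, here is a Python sketch. It assumes the usual ANSI SGR
codes (30 plus the digit for a foreground color, 40 plus the digit for
a background color); the role names and both function names are
invented for this example, and the escape sequences **uxd** really
emits may differ::

  DEFAULT = "2351"
  ROLES = ("normal_a", "normal_b", "special", "invalid")

  def parse_colors(arg):
      """Map 2 to 4 digits (0 to 7) to roles, filling gaps from the default."""
      if not 2 <= len(arg) <= 4 or any(d not in "01234567" for d in arg):
          raise ValueError("colors must be 2 to 4 digits, each 0 to 7")
      digits = arg + DEFAULT[len(arg):]
      return dict(zip(ROLES, (int(d) for d in digits)))

  def sgr(role, color):
      """ANSI escape for a role: background for invalid, else foreground."""
      base = 40 if role == "invalid" else 30
      return f"\x1b[{base + color}m"

  for role, color in parse_colors("64").items():
      print(role, color, repr(sgr(role, color)))

With the argument ``64``, the two alternating colors become cyan and
blue, while the third and fourth roles keep their defaults (5 and 1).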

**green**, **yellow**
  Printable characters (except the space, U+0020) alternate between green and yellow.

**purple**
  Spaces and unprintable characters ("control" characters, newlines, tabs, etc).
  These are printed as "visible" characters, e.g. ␣ for the space, ↵ for a newline.
  Hopefully this is an improvement over the usual practice of printing these as periods, like
  standard hex dumpers do. The Unicode BOM (byte order marker, U+FEFF) is printed
  as a purple letter B.

**red**
  Invalid UTF-8 sequences. These are rendered with a red background, to make them
  stand out. Examples of invalid sequences:

    - Prefix bytes (>= 0xc0) which are not followed by the correct number of continuation
      bytes (with their high 2 bits set to **10**).

    - Continuation bytes that aren't preceded by a valid prefix byte.

    - Truncated UTF-8 sequence at EOF.

    - Codepoints above U+10FFFF, which are disallowed by RFC 3629.
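
The rules listed above can be sketched as a small Python scanner. This
is an assumption-laden model, not **uxd**'s implementation: ``scan`` is
an invented name, and how **uxd** groups the bytes of a bad sequence may
differ. Like the behavior described under **BUGS**, the sketch does not
reject overlong encodings, and it does not check for surrogates
either::

  def scan(data):
      """Yield (chunk, ok) pairs; ok is False for invalid sequences."""
      i = 0
      while i < len(data):
          lead = data[i]
          if lead < 0x80:
              need, cp = 0, lead          # ASCII, no continuation bytes
          elif 0xC0 <= lead < 0xE0:
              need, cp = 1, lead & 0x1F   # 2-byte sequence
          elif 0xE0 <= lead < 0xF0:
              need, cp = 2, lead & 0x0F   # 3-byte sequence
          elif 0xF0 <= lead < 0xF8:
              need, cp = 3, lead & 0x07   # 4-byte sequence
          else:
              # Stray continuation byte (0x80-0xBF), or a byte that can
              # never start a sequence (0xF8-0xFF).
              yield data[i:i + 1], False
              i += 1
              continue
          # Count how many of the expected continuation bytes are present.
          good = 0
          for b in data[i + 1:i + 1 + need]:
              if b & 0xC0 != 0x80:
                  break
              cp = (cp << 6) | (b & 0x3F)
              good += 1
          if good < need:
              # Missing continuation bytes, including truncation at EOF.
              yield data[i:i + 1 + good], False
          else:
              yield data[i:i + 1 + need], cp <= 0x10FFFF
          i += 1 + good

  for chunk, ok in scan(b"n\xc3\xb1 \x80 \xf4\x90\x80\x80 \xe3\x83"):
      print(chunk.hex("-"), "ok" if ok else "INVALID")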

TERMINAL SUPPORT
================

**uxd** should work with any modern terminal that supports color,
ANSI-style escape sequences, Unicode, and UTF-8 rendering.

The author's testing is done primarily with **urxvt**\(1). Other
terminals aren't tested as often. Some terminals may need UTF-8
enabled, if it's not on by default (e.g. xterm).

Known to work: urxvt, xterm, st, xfce4-terminal, gnome-terminal,
kitty, konsole, the Linux console (but see **FONTS**, below).

Known **not** to work: rxvt (doesn't support Unicode at all), and its
derivatives such as aterm.

**uxd** also builds and runs correctly on a Mac running a recent
version of OSX (though I'm not sure what terminal was used).

FONTS
=====

For the human-readable column to display correctly, you'll need a font
with lots of glyphs. Try *DejaVu Sans Mono*, *Symbola*, or *Quivira*
(although it's not really a terminal font). If you use urxvt, it
searches for glyphs in multiple fonts, so you can use all of the above
at once.

Any glyph your font lacks, you'll see as a dotted box, or perhaps
a solid block. This isn't something **uxd** can do anything about;
you'll have to use a different font, or (if you use urxvt) add another
font to your URxvt*font resource.

The Linux console is capable of rendering UTF-8, but it's incapable
of displaying more than 512 glyphs. Most console fonts only define
256, since using more than 256 means the console won't be able to
do bold. Expect to see lots of solid or dotted boxes. This isn't
specifically a problem with **uxd**.

FILES
=====

**uxd** doesn't read any files other than the input file, and doesn't write to
any files other than standard output. There's no config file.

ENVIRONMENT
===========

**UXD_OPTS**
  If this is set, its value is treated as a set of options, which
  get applied before any command-line options (unless the command-line
  options include **-n**).

**NO_COLOR**
  If this is set (to any value), **uxd** runs in monochrome mode, just
  as though the **-m** option were given. This variable is also
  respected by **xxd**.

It's *not* necessary to have a UTF-8 locale set in e.g. **LANG** or
**LC_ALL**. Also, the **TERM** variable is not used.
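
The interplay of **UXD_OPTS**, **-n**, and **NO_COLOR** can be modeled
roughly as below. This Python sketch is illustrative only: the function
names are invented, splitting the variable with ``shlex.split`` is an
assumption about how **uxd** tokenizes it, and "later options override
earlier ones" is assumed as the way the command line takes
precedence::

  import shlex

  def effective_args(argv, environ):
      """Options from UXD_OPTS come first, unless -n is on the command line."""
      if "-n" in argv:
          return [a for a in argv if a != "-n"]
      return shlex.split(environ.get("UXD_OPTS", "")) + list(argv)

  def monochrome_by_default(environ):
      """NO_COLOR set to any value, even empty, selects monochrome mode."""
      return "NO_COLOR" in environ

  print(effective_args(["-u", "file"], {"UXD_OPTS": "-m -b"}))
  print(effective_args(["-n", "-u", "file"], {"UXD_OPTS": "-m -b"}))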

EXIT STATUS
===========

Zero for success, non-zero for failure.

Failure status will only be returned if **uxd** failed to open the
input file. Invalid input (non-UTF-8) doesn't count as an error;
it'll just have lots of red in the output.

BUGS
====

**uxd** doesn't check for overlong UTF-8 encodings (e.g. a character
that could be a 1-byte sequence, but is encoded as 2 or more).
Sequences like this really should be colorized in red. Technically,
this means **uxd** supports WTF-8, not UTF-8.

There should be options and/or a config file to change the colors,
rather than baking them into the binary.

Combining characters are not handled well. Or at all, really: the 2
characters being combined will have an ANSI color code in between.
urxvt at least ignores the color code, so the composite character
displays in the color of the first (non-combining) character. I'm not
sure what a better solution would be...

COPYRIGHT
=========

Licensed under the WTFPL. See http://www.wtfpl.net/txt/copying/ for details.

AUTHORS
=======

B. Watson <urchlay@slackware.uk>.

SEE ALSO
========

xxd(1), bvi(1), utf-8(7), unicode(7)