catdvi can be used to transform TeX DVI files into text, losing
formatting its main aim on SBo is to be used by recoll, when it cannot
extract text from pdf files by other means.


catdvi is a program that translates TeX Device Independent (DVI) files
into readable plain text. The program is under development. It
produces satisfactory results in many cases, but still has some issues
with complicated input.

Goals Actually, "translate to plain text" can mean several different
things, depending on the intended use:

Output formatted text that resembles the layout of the DVI file as
closely as possible, suitable for e.g. preview on a character cell
terminal or printing on a teletype style printer. Output unformatted
text in "read order". (Rather than "print order", which makes quite a
difference with e.g. multi-column page layouts). Useful for searching,
indexing and other kinds of postprocessing, and maybe also for export
to different text processors. Output (not completely plain) text in
read order with the formatting distilled into some kind of markup so
that paragraph breaks, subscripts, superscripts, etc. can still be
recognized. This functionality is essentially a (La-)TeX decompiler,
useful for recovery of lost or otherwise unavailable .tex files.
catdvi's principal target is to create human-readable text files from
DVI input, and hence the first kind of translation.

The second kind is supported as well because one of the developers
needed it and it could be obtained as an easy by-product (based on the
mostly true assumption that read order = order in the source file =
order in the DVI file).

The third kind of translation is the most difficult one to achieve
since a DVI file does not contain logical markup information. The
structure of the text has to be guessed from heuristic principles and
an analysis of certain characteristics of TeX's output. No attempt in
this direction has been made so far. But knowledge of some aspects of
text structure would also help to improve the quality of layout in
case 1. If it turns out these can reliably be guessed, an option to
show them as markup will probably follow. This feature has low
priority at the moment, especially since nobody has expressed a need
for it.