Paul DuBois
dubois@primate.wisc.edu
Wisconsin Regional Primate Research Center
Revision date: 20 May 1997
tc2rtf is a postprocessor for converting troffcvt
output to RTF. This document describes how it works and some of
the design issues involved in writing it.
In RTF, paragraph formatting properties can only be set once per
paragraph, which means that once a paragraph has begun its properties
are frozen. Some ways of resetting them are: (i) after the \par
at the end of the previous paragraph, issue a \pard followed
by new settings; (ii) put each paragraph in a group, and issue
settings within each group. The two approaches are similar in that
paragraph properties are reset to some default and can then be set as
appropriate for a new paragraph.
There are some differences between the approaches. The first approach
resets paragraph properties to the RTF defaults. The second resets
them to the paragraph state in effect at the time the group for
the first paragraph is begun. This means it's possible to set
up some arbitrary default which can be restored simply by beginning
a new group. But it also undoes any changes made to character
properties within the group. The first approach is "flatter"
because there are fewer groups, and simpler in the sense that
it's not necessary to restore any character formatting properties.
The second approach is simpler in the sense that it's likely fewer
paragraph properties will need to be reset, since the default
state is more likely to be close to the format used throughout
the document.
It's not obvious that either approach enjoys clear advantages
over the other. tc2rtf uses the first approach.
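The two approaches can be compared side by side with a small sketch. This is an illustration only, not code from tc2rtf: the function names are hypothetical, and placing \par inside the group in the second approach is an assumption about how a writer would lay the group out.

```python
def paragraphs_with_pard(paragraphs):
    """Approach (i): flat output, each paragraph starts from RTF defaults."""
    out = []
    for props, text in paragraphs:
        # \pard restores the RTF default paragraph properties; the new
        # settings follow it, before any paragraph text is written.
        out.append("\\pard" + props + " " + text + "\\par\n")
    return "".join(out)


def paragraphs_with_groups(paragraphs):
    """Approach (ii): one group per paragraph."""
    out = []
    for props, text in paragraphs:
        # Closing the group restores whatever state was in effect when
        # the group was opened, including character properties.
        out.append("{" + props + " " + text + "\\par}\n")
    return "".join(out)


paras = [("\\li720", "First"), ("\\li1440\\fi-360", "Second")]
print(paragraphs_with_pard(paras))
print(paragraphs_with_groups(paras))
```

The first function produces the flatter output described above; the second nests every paragraph one level deeper.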
The above discussion assumes all changes to paragraph properties
occur between paragraphs and not within paragraph text. It's possible
for troffcvt output to contain within-paragraph changes,
however, since troff requests can occur anywhere, and can
be specified with a no-break control character. If such changes
are written in the middle of a paragraph, they do bad things
to RTF readers (e.g., Microsoft Word 5.0 botches a paragraph
badly if \li or \fi is set in the middle). Two
ways to handle this problem are to force a \par if a paragraph
format change occurs within a paragraph, or to ignore the change
when it occurs and let it take effect after the paragraph ends
("lazy evaluation!"). It's not evident that either solution
is "correct." tc2rtf adopts the latter.
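The deferred ("lazy") handling can be sketched roughly as follows. The class and method names are hypothetical and this is not tc2rtf's actual implementation; it only shows the idea of recording a change when it arrives and writing it out when the next paragraph begins.

```python
class ParagraphState:
    """Defers paragraph-format changes that arrive in mid-paragraph."""

    def __init__(self):
        self.active = {}       # properties written for the current paragraph
        self.pending = {}      # changes waiting for the next paragraph
        self.in_paragraph = False
        self.out = []

    def set_property(self, name, value):
        # Never written immediately; only recorded for the next paragraph.
        self.pending[name] = value

    def text(self, s):
        if not self.in_paragraph:
            self._begin_paragraph()
        self.out.append(s)

    def end_paragraph(self):
        if not self.in_paragraph:
            self._begin_paragraph()
        self.out.append("\\par\n")
        self.in_paragraph = False

    def _begin_paragraph(self):
        # Pending changes take effect here, at the paragraph boundary.
        self.active.update(self.pending)
        self.pending.clear()
        props = "".join("\\%s%d" % (k, v) for k, v in sorted(self.active.items()))
        self.out.append("\\pard" + props + " ")
        self.in_paragraph = True

    def result(self):
        return "".join(self.out)


st = ParagraphState()
st.set_property("li", 720)
st.text("one ")
st.set_property("li", 1440)   # arrives in mid-paragraph, so it is deferred
st.text("two")
st.end_paragraph()
st.text("three")
st.end_paragraph()
print(st.result())
```

In this sketch the second \li value shows up only on the paragraph that follows the one in which it was requested.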
troff has concepts of page offset, indent, temporary indent,
and line length. (These are expressed in troffcvt output
as \offset, \indent, \temp-indent and \line-length.)
These are not isomorphic to RTF, which has concepts for left and
right margins, left and right indent, and first-line indent for
the first line of a paragraph. (These are expressed in RTF as
\margl, \margr, \li, \ri and \fi.)
The troff settings can be changed at any time. The RTF
left and right margin values are document formatting properties,
and can only be set once (before any document text). The indents
can only be set once per paragraph, as discussed above.
Differences between the two methods of expressing page layout
are handled as follows. Output is turned off while tc2rtf
is reading the setup section of troffcvt output. When the
setup information has been completely read (\setup-end
has been seen), tc2rtf assumes that the current offset+indent
should be the document left margin, and that any space on the
right not taken up by offset, indent or line length should be
the right margin. Thereafter, changes in offset or indent may
change the left indent, relative to the left margin. Changes in
offset, indent or line length may change the right indent, relative
to the right margin.
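The arithmetic might look something like the sketch below. Several things here are assumptions rather than facts about tc2rtf: the function names are hypothetical, all values are taken to be in twips, the paper width is taken to be the RTF default of 12240 twips (8.5 inches), and the line length is taken to be measured from the page offset, as it is in troff, so that the indent lies inside it.

```python
PAPER_WIDTH = 12240  # twips; assumed paper width (8.5 inches)


def initial_margins(offset, indent, line_length, paper_width=PAPER_WIDTH):
    """Document margins, fixed once the setup section has been read."""
    margl = offset + indent
    # Assumption: line length is measured from the page offset.
    margr = paper_width - (offset + line_length)
    return margl, margr


def indents(offset, indent, line_length, margl, margr, paper_width=PAPER_WIDTH):
    """Left and right indent, relative to the fixed margins."""
    li = (offset + indent) - margl
    ri = (paper_width - margr) - (offset + line_length)
    return li, ri


# Setup: 1 inch offset, no indent, 6.5 inch line length.
margl, margr = initial_margins(1440, 0, 9360)

# Later the indent grows by half an inch: only the left indent moves.
print(indents(1440, 720, 9360, margl, margr))

# Then the line length shrinks by half an inch: the right indent moves too.
print(indents(1440, 720, 8640, margl, margr))
```

Because the margins are computed only once, every later change in offset, indent or line length is absorbed by the two indents.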
Changes in the temporary indent are mapped onto first-line indent,
on the assumption that \temp-indent will normally occur
before the text of a paragraph. A difference between troff
and RTF is that the troff temporary indent is relative
to the page offset, while RTF first-line indent is relative to
the current left indent.
The temporary indent is reset to be equal to the left indent at
each \par since in troff the .ti setting
is transient.
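A minimal sketch of that mapping, assuming both troff values are held as distances from the page offset (the class and its attribute names are hypothetical, not taken from tc2rtf):

```python
class IndentTracker:
    """Tracks troff indent and temporary indent, both measured from the
    page offset, and derives the RTF first-line indent from them."""

    def __init__(self, indent=0):
        self.indent = indent
        self.temp_indent = indent   # no temporary indent pending

    def first_line_indent(self):
        # RTF's first-line indent is relative to the left indent, so the
        # troff value has to be converted by subtracting the indent.
        return self.temp_indent - self.indent

    def end_paragraph(self):
        # The .ti setting is transient: fall back to the regular indent.
        self.temp_indent = self.indent


t = IndentTracker(indent=720)
t.temp_indent = 0              # first line starts at the page offset
print(t.first_line_indent())   # negative value, i.e. a hanging indent
t.end_paragraph()
print(t.first_line_indent())   # back to zero for the next paragraph
```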
Another difference between troff and RTF is that the temporary
indent changes the tab settings for the first line of a paragraph,
whereas the first-line indent in RTF does not. tc2rtf does
not attempt to simulate troff's behavior, since there isn't
any way of knowing when the second line of a paragraph has been
reached. (RTF includes no mechanism for expressing or discovering
font metrics.)
\leader-char and \tab-char are both ignored. Leaders
and tabs are always written as plain tab characters.
A document containing tbl input is best handled by using
tblcvt to preprocess the document before feeding the result
to troffcvt and tc2rtf.
Tables are a pain to do well in RTF. As the RTF specification
says, "tables are probably the trickiest part of RTF to read
and write correctly." While adding support for tblcvt-related
output to tc2rtf, I found it alarmingly easy to crash or
lock Word (both Macintosh and Windows versions) unless table controls
were written just right. Even now I'm not overly confident that
tables are written correctly, though tc2rtf table output
no longer seems to cause crashes. One of the keys is to make sure
to write \intbl in each cell, even for empty cells. Further,
it's typically a good idea to emit \pard for each cell,
but it must be written before the \intbl, not after.
Otherwise Word seems to forget that it's in a cell. (This seems
silly. Surely if you've seen \intbl but not \cell
or \row it's reasonable to expect Word to consider itself
still in the cell? Apparently not.)
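The ordering described above can be shown with a short sketch that writes one table row. The function name and the fixed cell width are hypothetical, and real tc2rtf output carries more row and cell formatting than this; \trowd, \cellx, \cell and \row are the standard RTF row and cell controls.

```python
def table_row(cells, cell_width=2880):
    """Return RTF for one table row; cell_width is in twips."""
    out = ["\\trowd"]
    # Each \cellxN gives the right edge of a cell, measured from the left.
    for i in range(1, len(cells) + 1):
        out.append("\\cellx%d" % (i * cell_width))
    out.append("\n")
    for text in cells:
        # \pard comes first, then \intbl, and both are written even
        # when the cell has no text.
        out.append("\\pard\\intbl " + text + "\\cell\n")
    out.append("\\row\n")
    return "".join(out)


print(table_row(["left", "", "right"]))
```

The middle cell in the usage line is empty but still gets its own \pard\intbl pair.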
I don't know how to do bracket building or overstriking in RTF, so what tc2rtf does
is write out ugly but highly visible sequences to make it obvious
to the user that the document contains stuff that needs some hand
tuning. Bracket characters are written out surrounded by <BRACKET<
and >BRACKET>. Characters which should be overstruck
are written out surrounded by <OVERSTRIKE< and
>OVERSTRIKE>. Ick.