TeX2Star

A System for Converting TeX to OpenOffice, by Jeffrey Starr

Overview • Why does conversion matter? • Why has it not already been done? • Why is it difficult? • Proposal: TeX->OpenOffice • Proposal: TeX->DVI->OpenOffice • Solution • Unsolved problems

What is OpenOffice? • Open Source office suite • Based on StarOffice, currently owned by Sun Microsystems • Cross-Platform • XML based, standards driven • Semantic-based format

What is TeX? • Written by Donald E. Knuth • Solution to declining standards in mathematical typography • Heavily used in mathematics and physics • Both a program and a programming language • Presentation-based format

Why Bother to Convert? • TeX is rare outside mathematical circles • Conflicts with publishing software • Does not fit within the current word processing model • TeX's purpose is to produce journal-quality typography, not to facilitate editing of content

Aside: Editable Output • TeX has many presentation outputs: • DVI • PostScript • PDF • PNG • TIFF • Fax • TeX has no direct editable outputs

Solution: TeX->OpenOffice • Why use the outputs? Read the original document. • Perfect knowledge of content and (presentational) intent • Write a program that reads TeX and outputs OpenOffice, instead of DVI

Problems with TeX->OpenOffice • TeX is a large system • Eight years of development • Too large to reimplement in a semester • Irregular • Non-Balanced • Many special cases

TeX is Irregular • An irregular language is one in which typical rules of processing are violated • Irregular '\atop': (TeX) • {numerator \atop denominator} • Regular '\frac': (LaTeX) • \frac{numerator}{denominator}
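
The difference matters to anyone writing a parser. A minimal sketch in Python, assuming the group has already been split into tokens (the function name and the tuple layout are hypothetical, chosen only for illustration): with the infix '\atop', the reader discovers only after the fact that the material it has collected was a numerator.

```python
def parse_group(tokens):
    # Parse the tokens inside one {...} group into a small tuple.
    # A prefix command such as \frac announces its arguments in advance.
    # The infix \atop appears only after the numerator has been read, so
    # the material collected so far must be reinterpreted retroactively.
    collected = []
    for i, tok in enumerate(tokens):
        if tok == r"\atop":
            # Everything read so far turns out to be the numerator.
            return ("atop", collected, list(tokens[i + 1:]))
        collected.append(tok)
    return ("group", collected)


print(parse_group(["n", r"\atop", "k"]))
```

A prefix command never needs this backtracking step, which is why the LaTeX form is the easier one to convert.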

TeX is not balanced • A language that is balanced will have an explicit beginning and end to each grouping • Non-balanced font commands: (TeX) • \bf this is bold \rm this is normal, roman text • Balanced font commands: (LaTeX) • \textbf{this is bold} this is back to normal

TeX has many special cases • \par may either: • explicitly end a paragraph • do nothing (if in math mode) • do nothing (if in restricted horizontal mode) • tell TeX to build the current page • \par is also irregular (acts on material already processed and in the reverse direction) and unbalanced (may or may not be preceded by \indent, a primitive to start a paragraph)

Solution: TeX->DVI->OpenOffice • Let TeX deal with TeX • Run TeX on the original text • Read the resultant DVI output • Process the DVI output to OpenOffice
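
As a first step in reading the resultant DVI output, the fixed-layout preamble can be decoded. The following is a minimal sketch, not the presentation's actual reader: the function name and the returned dictionary are assumptions, while the byte layout (opcode 247, a format byte, three 4-byte integers, and a length-prefixed comment) follows the standard DVI format.

```python
import struct

DVI_PRE = 247  # opcode that opens every DVI file


def read_dvi_preamble(data: bytes) -> dict:
    """Decode the preamble at the start of a DVI byte string."""
    if len(data) < 15 or data[0] != DVI_PRE:
        raise ValueError("not a DVI file: missing preamble opcode")
    # Layout: pre i[1] num[4] den[4] mag[4] k[1] x[k], integers big-endian.
    version, num, den, mag, k = struct.unpack(">BIIIB", data[1:15])
    comment = data[15:15 + k].decode("ascii", errors="replace")
    return {"version": version, "num": num, "den": den,
            "mag": mag, "comment": comment}


# A hand-built preamble using the unit values TeX normally writes.
note = b" TeX output"
sample = (bytes([DVI_PRE, 2])
          + struct.pack(">III", 25400000, 473628672, 1000)
          + bytes([len(note)]) + note)
print(read_dvi_preamble(sample))
```

The remainder of the file is a sequence of opcodes (set character, move, select font, and so on) that a full converter would walk in the same byte-oriented way.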

Problem: Lack of semantic data • DVI contains font definitions, text stream, and description of black boxes • Fonts contain characters, but do not say what those characters are • Especially a problem with ligatures and kerning: the single glyph “ﬀ” vs. the two characters “ff” • Also a problem with bold and italic text: bold and italic are each their own font
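
One way to recover part of this information is a lookup table from font slots and font names back to meaning. A rough sketch, assuming the document uses the Computer Modern fonts in the OT1 encoding (where slots 0x0B through 0x0F hold the f-ligatures); the function names are hypothetical, and only plain letters and the five ligatures are handled.

```python
# OT1 (Computer Modern) slots that hold ligature glyphs rather than letters.
OT1_LIGATURES = {0x0B: "ff", 0x0C: "fi", 0x0D: "fl", 0x0E: "ffi", 0x0F: "ffl"}

# Font-name prefixes standing in for styles the DVI file never states.
FONT_STYLES = {"cmbx": "bold", "cmti": "italic", "cmr": "roman"}


def decode_glyphs(codes):
    """Turn character codes into text, expanding ligature glyphs.

    Assumption: every other code is treated as plain ASCII, which is
    only true for letters, digits, and some punctuation in OT1.
    """
    return "".join(OT1_LIGATURES.get(c, chr(c)) for c in codes)


def guess_style(font_name):
    """Guess a style from a Computer Modern font name such as 'cmbx10'."""
    for prefix, style in FONT_STYLES.items():
        if font_name.startswith(prefix):
            return style
    return "unknown"


print(decode_glyphs([ord("o"), 0x0B, ord("e"), ord("r")]))
print(guess_style("cmbx10"))
```

Such tables work only for fonts known in advance, which is part of why the DVI file alone is not enough.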

Solution: Add Annotations • Use interpositioning and the TeX primitive '\special' to send extra information to the DVI file • \special leaves comments that can be read later • Reading the DVI with proper annotation allows the text to retain some level of semantic information • Difference between knowing that the next character is smaller and raised versus knowing that the next character is a superscript
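
A small sketch of how such annotations might be consumed. Everything about the payload format here is an assumption made for illustration: the 't2s:begin' and 't2s:end' strings, the event tuples, and the function name are hypothetical, standing in for whatever the redefined macros would write with \special{...}.

```python
def apply_annotations(events):
    """Label each run of text with the innermost open annotation.

    events is a hypothetical decoded DVI stream: a list of
    ("special", payload) and ("text", string) pairs.
    """
    open_tags = []
    spans = []
    for kind, value in events:
        if kind == "special" and value.startswith("t2s:begin "):
            open_tags.append(value[len("t2s:begin "):])
        elif kind == "special" and value.startswith("t2s:end "):
            name = value[len("t2s:end "):]
            # Only close the annotation if it is the innermost open one.
            if open_tags and open_tags[-1] == name:
                open_tags.pop()
        elif kind == "text":
            spans.append((open_tags[-1] if open_tags else None, value))
    return spans


events = [
    ("text", "x"),
    ("special", "t2s:begin sup"),  # e.g. written by \special{t2s:begin sup}
    ("text", "2"),
    ("special", "t2s:end sup"),
    ("text", " + y"),
]
print(apply_annotations(events))
```

With the annotation present, the "2" is known to be a superscript rather than merely a smaller, raised character.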

Problem: Unbalanced Tags • Some primitives are balanced, but many are not • A tag may affect the document for an arbitrary stretch of text, or may be local to a paragraph or a specific block of text

Solution: Balancing • Algorithm: • Given: database of tags • start tag, end tag, 'insert end tag' tags • Go through the list of tags, find one that needs help balancing • Go forward along the list, finding the nearest tag that closes the previous tag, or the end of the document • Insert the end tag into the list of tags
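
The steps above can be sketched as follows. This is one possible reading of the algorithm, not the presentation's actual code: the tag names, the layout of the database, and the "#text" placeholder for ordinary content are all hypothetical.

```python
# Hypothetical tag database: start tag -> (end tag, 'insert end tag' tags).
# The second item lists tags whose appearance implies the start tag is over.
TAG_DB = {
    "bf": ("/bf", {"rm", "it", "bf"}),
    "it": ("/it", {"rm", "bf", "it"}),
}


def balance(tags):
    """Return a copy of tags with a missing end tag inserted per start tag."""
    result = list(tags)
    i = 0
    while i < len(result):
        entry = TAG_DB.get(result[i])
        if entry is not None:
            end_tag, closers = entry
            j = i + 1
            # Go forward to the nearest tag that closes this one,
            # or to the end of the document.
            while (j < len(result) and result[j] != end_tag
                   and result[j] not in closers):
                j += 1
            # Insert the end tag unless an explicit one was found.
            if j == len(result) or result[j] != end_tag:
                result.insert(j, end_tag)
        i += 1
    return result


print(balance(["bf", "#text", "rm", "#text"]))
```

Here the unbalanced '\bf ... \rm' sequence from the earlier slide gains an explicit end tag just before 'rm', which is the shape a well-formed XML writer needs.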

Post Document Editing • Further balancing and insertion of tags may be necessary after first sweep through file • Tables: • OpenOffice format requires number of columns to be specified • We don't know how many columns will be needed until after we read the entire table • Solution: After processing, go back and insert the needed information
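
A minimal sketch of the table case, assuming the rows have already been collected as lists of cell strings. The function name is hypothetical, and the element and attribute names are written as I understand the OpenOffice table markup; they should be checked against the format specification, and the namespace declarations are omitted.

```python
from xml.sax.saxutils import escape


def table_xml(rows):
    """Emit table markup, filling in the column count after reading all rows."""
    # The column count is only known once every row has been seen.
    columns = max((len(row) for row in rows), default=0)
    parts = ["<table:table>"]
    parts.append(
        f'<table:table-column table:number-columns-repeated="{columns}"/>'
    )
    for row in rows:
        cells = "".join(
            f"<table:table-cell><text:p>{escape(cell)}</text:p></table:table-cell>"
            for cell in row
        )
        parts.append(f"<table:table-row>{cells}</table:table-row>")
    parts.append("</table:table>")
    return "\n".join(parts)


print(table_xml([["a", "b", "c"], ["d & e"]]))
```

The column declaration appears before the rows in the output even though its value is computed last, which is the "go back and insert" step.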

Unsolved Problems • Footnotes: • Defined by position on the page • Automatic positioning conflicts with the paragraph detection tool • Unable to distinguish among footnotes, extra paragraphs, headers, and footers • Non-English alphabets

Conclusion • Semantics of the document are lost in TeX itself, so there is no hope of recovering them • Overt presentation can be recovered for editing • Method works to translate an irregular, non-well-formed language into a regular, well-formed language (XML)
