A System for Converting TeX to OpenOffice By Jeffrey Starr TeX2Star
Overview • Why does conversion matter? • Why has it not already been done? • Why is it difficult? • Proposal: TeX->OpenOffice • Proposal: TeX->DVI->OpenOffice • Solution • Unsolved problems
What is OpenOffice? • Open Source office suite • Based on StarOffice, currently owned by Sun Microsystems • Cross-Platform • XML based, standards driven • Semantic-based format
What is TeX? • Written by Donald E. Knuth • Solution to declining standards in mathematical typography • Heavily used in mathematics and physics • Both a program and a programming language • Presentation-based format
Why Bother to Convert? • TeX rare outside mathematical circles • Conflicts with publishing software • Does not fit within current word processing model • TeX's purpose is to produce journal-quality typography, not to facilitate editing of content.
Aside: Editable Output • TeX has many presentation outputs: • DVI • PostScript • PDF • PNG • TIFF • Fax • TeX has no direct editable outputs.
Solution: TeX->OpenOffice • Why use the outputs? Read the original document. • Perfect knowledge of content and (presentational) intent • Write a program that reads TeX and outputs OpenOffice, instead of DVI
Problems with TeX->OpenOffice • TeX is a large system • Eight years of development • Too large to reimplement in a semester • Irregular • Non-balanced • Many special cases
TeX is Irregular • An irregular language is one in which typical rules of processing are violated • Irregular '\atop': (TeX) • {numerator \atop denominator} • Regular '\frac': (LaTeX) • \frac{numerator}{denominator}
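To make the irregularity concrete, here is a minimal sketch in Python of what a reader has to do when it meets '\atop'. It assumes the contents of one brace group have already been split into a flat token list; the function name `split_atop` and the tuple it returns are hypothetical, chosen only for illustration. The point is that the operator arrives after its first argument, so the reader must go back over tokens it has already seen.

```python
ATOP = r"\atop"

def split_atop(group_tokens):
    """Split the tokens of one brace group around an infix \\atop.

    Returns ("atop", numerator_tokens, denominator_tokens), or None
    when the group contains no \\atop. Assumes at most one \\atop
    per group and no nested groups (a simplification).
    """
    if ATOP not in group_tokens:
        return None
    k = group_tokens.index(ATOP)
    # Everything already read in this group becomes the numerator.
    return ("atop", group_tokens[:k], group_tokens[k + 1:])

print(split_atop(["n", "+", "1", ATOP, "k"]))
```

With the prefix form '\frac{numerator}{denominator}', no such look-back is needed: the command name comes first and each argument is delimited by its own braces.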
TeX is not balanced • A language that is balanced will have an explicit beginning and end to each grouping • Non-balanced font commands: (TeX) • \bf this is bold \rm this is normal, roman text • Balanced font commands: (LaTeX) • \textbf{this is bold} this is back to normal
TeX has many special cases • \par may either: • explicitly end a paragraph • do nothing (if in math mode) • do nothing (if in restricted horizontal mode) • tell TeX to build the current page • \par is also irregular (acts on material already processed and in the reverse direction) and unbalanced (may or may not be preceded by \indent, a primitive to start a paragraph)
Solution: TeX->DVI->OpenOffice • Let TeX deal with TeX • Run TeX on the original text • Read the resultant DVI output • Process the DVI output to OpenOffice
Problem: Lack of semantic data • DVI contains font definitions, text stream, and description of black boxes • Fonts contain characters, but do not say what those characters are • Especially a problem with ligatures and kerning: the single glyph “ﬀ” vs. the two letters “ff” • Also a problem with bold and italic text --- bold and italics are their own fonts
Solution: Add Annotations • Use interpositioning and the TeX primitive '\special' to send extra information to DVI file • \special leaves comments that can be read later • Reading the DVI with proper annotation allows the text to retain some level of semantic information • Difference between knowing that the next character is smaller and raised versus knowing that the next character is a superscript
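The annotation idea can be sketched as follows. On the TeX side, a macro would emit something like `\special{tex2star:begin sup}` before the raised character and `\special{tex2star:end sup}` after it. The `tex2star:` prefix and the `begin`/`end` vocabulary are assumptions made for this illustration, not the format the talk actually used. The Python below assumes the DVI file has already been decoded into a list of `("char", text)` and `("special", text)` events, and it only shows how the annotations turn positional output back into named structure.

```python
PREFIX = "tex2star:"  # hypothetical annotation prefix

def apply_annotations(events):
    """Turn decoded DVI events into text with semantic tags.

    events: list of (kind, payload) pairs, where kind is "char" or
    "special". Specials without our prefix are ignored.
    """
    out = []
    for kind, payload in events:
        if kind == "special" and payload.startswith(PREFIX):
            action, name = payload[len(PREFIX):].split()
            out.append(f"<{name}>" if action == "begin" else f"</{name}>")
        elif kind == "char":
            out.append(payload)
    return "".join(out)

events = [
    ("char", "x"),
    ("special", "tex2star:begin sup"),
    ("char", "2"),
    ("special", "tex2star:end sup"),
]
print(apply_annotations(events))  # x<sup>2</sup>
```

Without the two specials, the converter would only see that "2" is set in a smaller font and shifted upward.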
Problem: Unbalanced Tags • Some primitives are balanced, but many are not • Tags may affect the document for an arbitrary length of time or are local to a paragraph or specific block of text
Solution: Balancing • Algorithm: • Given: database of tags • start tag, end tag, 'insert end tag' tags • Go through list of tags, find one that needs help balancing • Go forward along list, finding nearest tag that closes the previous tag, or end of document • Insert end of tag into the list of tags
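The balancing steps above can be sketched in Python. This is one reading of the algorithm, not the talk's implementation: the tag database maps each start tag to its end tag plus the set of 'insert end tag' tags, meaning tags whose appearance implies the earlier one has ended. The tag names and `TAG_DB` contents are hypothetical, and nesting of the same tag is ignored.

```python
# Hypothetical tag database:
# start tag -> (end tag, tags that force the end tag to be inserted)
TAG_DB = {
    "bf": ("/bf", {"rm", "it"}),
    "it": ("/it", {"rm", "bf"}),
}

def balance(tags):
    """Return a copy of `tags` with missing end tags inserted."""
    result = list(tags)
    i = 0
    while i < len(result):
        entry = TAG_DB.get(result[i])
        if entry is not None:
            end_tag, closers = entry
            j = i + 1
            while j < len(result):
                if result[j] == end_tag:
                    break  # already balanced, nothing to insert
                if result[j] in closers:
                    result.insert(j, end_tag)  # close just before the closer
                    break
                j += 1
            else:
                result.append(end_tag)  # reached end of document
        i += 1
    return result

print(balance(["bf", "this is bold", "rm", "this is normal"]))
# ['bf', 'this is bold', '/bf', 'rm', 'this is normal']
```

The example mirrors the unbalanced '\bf ... \rm ...' case: '\rm' never closes anything explicitly, so the end of the bold run is inserted just before it.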
Post Document Editing • Further balancing and insertion of tags may be necessary after first sweep through file • Tables: • OpenOffice format requires number of columns to be specified • We don't know how many columns will be needed until after we read the entire table • Solution: After processing, go back and insert the needed information
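A minimal sketch of this second pass for tables, in Python. It assumes the rows have already been collected as lists of cell strings; the function name `build_table` is hypothetical, and the XML is a simplified fragment in the style of the OpenOffice table markup (namespace declarations and style attributes are omitted).

```python
from xml.sax.saxutils import escape

def build_table(rows):
    """Emit a table fragment once every row has been read.

    The column count is only known at this point, so it is computed
    here and written ahead of the rows.
    """
    n_cols = max((len(row) for row in rows), default=0)
    lines = [
        "<table:table>",
        f'  <table:table-column table:number-columns-repeated="{n_cols}"/>',
    ]
    for row in rows:
        padded = row + [""] * (n_cols - len(row))  # pad short rows
        cells = "".join(
            f"<table:table-cell><text:p>{escape(c)}</text:p></table:table-cell>"
            for c in padded
        )
        lines.append(f"  <table:table-row>{cells}</table:table-row>")
    lines.append("</table:table>")
    return "\n".join(lines)

print(build_table([["a", "b"], ["c", "d", "e"]]))
```

Buffering the rows and writing the column declaration last-known-first is the simplest form of "go back and insert"; a streaming converter could instead leave a placeholder and patch it afterwards.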
Unsolved Problems • Footnotes: • Defined by position in the page • Automatic positioning conflicts with paragraph detection tool • Unable to distinguish a footnote from an extra paragraph, a header, or a footer • Non-English alphabets
Conclusion • Semantics of document are lost in TeX itself, so no hope of recovery • Overt presentation can be recovered for editing • Method works to translate an irregular, not well-formed language into a regular, well-formed language (XML)