Inferring Structure Information from Typography. Christian Fuß Dipl.-Inform. Felix Gatzemeier Michael Kirchhof Dipl.-Inform. Oliver Meyer Department of Computer Science III, RWTH Aachen. Overview. Context Deriving Structure Information: Partitioning Typographic abstraction
Inferring Structure Information from Typography
Christian Fuß, Dipl.-Inform. Felix Gatzemeier, Michael Kirchhof, Dipl.-Inform. Oliver Meyer
Department of Computer Science III, RWTH Aachen
Overview
• Context
• Deriving Structure Information:
  • Partitioning
  • Typographic abstraction
  • Determine Type
• Conclusion
• Cooperation project of
• Prototype aTool in the WEP group of the Global-Info Project (www.global-info.org)
Today’s Publication Chain
[Figure: publication chain. Roles and activities: Author (Writing), Publisher (Copy Editing, Typesetting, Web Publ.), Reader (Reading). Artifacts: Proprietary document format, Conversion, Standard format.]
Classification of Submissions
[Figure: classification tree. Submissions: TEX, MS Word, Structured (XML). Further distinctions shown: Unformatted vs. Formatted; Somehow Formatted vs. Correctly Formatted.]
Basic Assumptions
• Known target document type
• Textual Nature
• Typographic markup
• Consistent markup
Deriving Structure Information
In: MS Word document
• Record Formatting (Format Tuples)
• Locate the Elements
• Reduce Format Tuples to Patterns
• Determine Types
Out: XML document
Also interactively
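To make the "Out: XML document" end of the chain concrete, here is a minimal sketch of serializing already-typed text runs as one XML paragraph. It assumes the earlier steps have produced (element type, text) pairs. The paragraph tag `Para`, the function name `runs_to_xml`, and the assignment of types to the words of the sample sentence are illustrative assumptions, not taken from the prototype.

```python
import xml.etree.ElementTree as ET


def runs_to_xml(runs, paragraph_tag="Para", body_type="Body"):
    """Serialize (element_type, text) runs as one XML paragraph.

    Runs of the body type become plain paragraph text; every other
    run becomes a child element named after its element type.
    The tag name "Para" is a hypothetical choice for this sketch.
    """
    para = ET.Element(paragraph_tag)
    last_child = None
    for element_type, text in runs:
        if element_type == body_type:
            if last_child is None:
                # Body text before the first inline element.
                para.text = (para.text or "") + text
            else:
                # Body text following an inline element goes into its tail.
                last_child.tail = (last_child.tail or "") + text
        else:
            last_child = ET.SubElement(para, element_type)
            last_child.text = text
    return ET.tostring(para, encoding="unicode")


# Assumed typing of the sample sentence, for illustration only.
typed_runs = [
    ("Body", "Is "),
    ("FirstTerm", "this"),
    ("Body", " a "),
    ("Emphasis", "dagger"),
    ("Body", "?"),
]
xml_text = runs_to_xml(typed_runs)
print(xml_text)
```

The sketch handles a single paragraph with one level of inline elements; deeper nesting would need a recursive variant.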
Format Tuples
• The basic typographic abstraction
• FormatTuple("Is this a dagger?") = [Times, 22pt, regular, roman]
• Here: Font, Size, Weight, Variation
• Planned: Search expressions modulo Text
• More general: Including regular expressions of text content or context
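A format tuple as described on this slide could be modelled roughly as follows. This is a sketch under assumptions: the dictionary layout of a "run" and the function name `format_tuple` are made up for illustration; only the four components (font, size, weight, variation) come from the slide.

```python
from typing import NamedTuple


class FormatTuple(NamedTuple):
    """The four components named on the slide."""
    font: str
    size: str
    weight: str
    variation: str


def format_tuple(run: dict) -> FormatTuple:
    """Reduce a formatted text run to its format tuple, ignoring the text."""
    return FormatTuple(run["font"], run["size"], run["weight"], run["variation"])


# Hypothetical representation of one formatted run of text.
run = {
    "text": "Is this a dagger?",
    "font": "Times",
    "size": "22pt",
    "weight": "regular",
    "variation": "roman",
}
ft = format_tuple(run)
print(ft)
```

Because the text is dropped, two runs with different wording but identical formatting yield equal tuples, which is what makes the tuple usable as a typographic abstraction.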
Locate the Elements
• Tree-Partitioning of Formatted Character Streams on
  • Format Tuple changes
  • Paragraph breaks
• Nesting of Inline Elements
  • Is this a dagger? <ft1>
  • Is this a dagger? <ft1 <ft2> ft1>
  • Is this a dagger? <ft1 <ft2> >
  • Is this a dagger? <ft1 <ft2 <ft3> > >
• Format-To-Type Map:
  FormatTuple                          ElementType
  ft1 (times, 22pt, reg, roman)        dummyType1
  ft2 (times, 22pt, bold, roman)       dummyType2
  ft3 (times, 22pt, reg, italic)       dummyType3
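The partitioning criteria above (format tuple changes and paragraph breaks) can be sketched as a flat split of a formatted character stream. This is an illustration only: the stream representation as (character, format tuple) pairs, the helper names `styled` and `partition`, and the assignment of formats to words are assumptions, and the nesting of inline elements is not attempted here.

```python
from itertools import groupby

# Format tuples from the slide's table.
ft1 = ("times", "22pt", "reg", "roman")
ft2 = ("times", "22pt", "bold", "roman")
ft3 = ("times", "22pt", "reg", "italic")


def styled(text, ft):
    """Hypothetical helper: attach one format tuple to every character."""
    return [(ch, ft) for ch in text]


def partition(stream):
    """Split a (character, format_tuple) stream into paragraphs of runs.

    A newline character ends the current paragraph; within a paragraph,
    a new run starts whenever the format tuple changes.
    """
    paragraphs, current = [], []
    for ch, ft in stream:
        if ch == "\n":
            if current:
                paragraphs.append(current)
            current = []
        else:
            current.append((ch, ft))
    if current:
        paragraphs.append(current)
    return [
        [(ft, "".join(ch for ch, _ in group))
         for ft, group in groupby(para, key=lambda item: item[1])]
        for para in paragraphs
    ]


# Assumed formatting of the sample sentence, followed by a second paragraph.
stream = (
    styled("Is ", ft1) + styled("this", ft2) + styled(" a dagger?", ft1)
    + styled("\n", ft1)
    + styled("Is this a dagger?", ft3)
)
runs = partition(stream)
print(runs)
```

Each run keeps its format tuple, so the result can feed directly into a format-to-type lookup.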
Format patterns
• Identity too restrictive → wildcard generalization
• Example: "Is this a dagger?" (three format tuples and the resulting pattern)
                tuple 1    tuple 2    tuple 3    pattern
  Font          Times      Times      Times      *
  Size          22pt       22pt       22pt       *
  Weight        regular    bold       regular    bold
  Variation     roman      roman      roman      *
• (–, a, b) = (a, a, b); (a, b, –) = (a, b, b), where "–" stands for a missing argument
• (–, a, –) propagated to paragraph level
• Format-To-Type Map:
  FormatPattern                        ElementType
  fp1  (*, *, regular, *)              dummyType1
  fp2  (*, *, bold, *)                 dummyType2
  fp2b (*, *, bold, roman)             dummyType2
  fp3  (*, *, regular, italic)         dummyType3
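One possible reading of the wildcard generalization is sketched below: a component of the middle tuple is kept where it differs from a neighbouring tuple and replaced by "*" where all three agree, with a missing neighbour replaced by the middle tuple itself. This interpretation, and the names `generalize` and `matches`, are assumptions for illustration; the slide does not spell out the exact rule.

```python
WILDCARD = "*"


def generalize(prev, cur, nxt):
    """Reduce a format tuple to a pattern relative to its neighbours.

    Assumed rule: keep a component of `cur` if it differs from the
    corresponding component of either neighbour, otherwise use "*".
    A missing neighbour (None) is replaced by `cur` itself.
    """
    prev = cur if prev is None else prev
    nxt = cur if nxt is None else nxt
    return tuple(
        c if (c != p or c != n) else WILDCARD
        for p, c, n in zip(prev, cur, nxt)
    )


def matches(pattern, ft):
    """True if every non-wildcard component of the pattern equals ft's."""
    return all(p == WILDCARD or p == v for p, v in zip(pattern, ft))


regular = ("Times", "22pt", "regular", "roman")
bold = ("Times", "22pt", "bold", "roman")

pattern = generalize(regular, bold, regular)
print(pattern)
```

Under this reading, an element with no neighbours at all yields a pattern consisting only of wildcards, which carries no information at the inline level; that is one way to understand why the slide defers this case to the paragraph level.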
Determine Types
• Replace dummy types in Format-To-Type Map
• Preconfiguration by publisher
• Controlled Learning from the author
• Format-To-Type Map:
  FormatPattern                   ElementType
  (*, *, regular, *)              Body
  (*, *, bold, *)                 FirstTerm
  (*, *, bold, roman)             FirstTerm
  (*, *, regular, italic)         Emphasis
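A lookup against the Format-To-Type Map could work roughly as follows. The table entries are the ones on the slide; the tie-breaking rule (the pattern with the fewest wildcards wins) and the function names are assumptions for this sketch, since the slide does not say how overlapping patterns are resolved.

```python
# Format-To-Type Map from the slide: (pattern, element type).
FORMAT_TO_TYPE = [
    (("*", "*", "regular", "*"), "Body"),
    (("*", "*", "bold", "*"), "FirstTerm"),
    (("*", "*", "bold", "roman"), "FirstTerm"),
    (("*", "*", "regular", "italic"), "Emphasis"),
]


def matches(pattern, ft):
    """True if every non-wildcard component of the pattern equals ft's."""
    return all(p == "*" or p == v for p, v in zip(pattern, ft))


def specificity(pattern):
    """Number of non-wildcard components in a pattern."""
    return sum(p != "*" for p in pattern)


def element_type(ft, table=FORMAT_TO_TYPE):
    """Return the type of the most specific matching pattern, or None.

    Assumption: when several patterns match, the one with the fewest
    wildcards is preferred.
    """
    candidates = [(specificity(p), t) for p, t in table if matches(p, ft)]
    if not candidates:
        return None
    return max(candidates, key=lambda c: c[0])[1]


print(element_type(("Times", "22pt", "regular", "italic")))
```

Returning None for an unmatched tuple leaves room for the interactive step mentioned on the slide, where the author could be asked to supply the type.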
Further usable information
• Allowed context from the DTD
• Paragraph standard format
• Text patterns
• Bullets
• Enumeration
• Whitespace
• ASCII Markup (Is *this* a dagger?)
• Format pattern match confidence
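The ASCII markup case from the list above can be illustrated with a small regular-expression sketch. The regular expression, the function name `mark_ascii_emphasis`, and the mapping of asterisk-delimited text to an `Emphasis` element are assumptions for illustration, not the prototype's behaviour.

```python
import re

# Text between two asterisks that starts and ends with a non-space
# character, e.g. "*this*" but not the operators in "2 * 3 * 4".
ASCII_EMPHASIS = re.compile(r"\*(\S(?:[^*]*\S)?)\*")


def mark_ascii_emphasis(text, tag="Emphasis"):
    """Wrap asterisk-delimited spans in an XML element (hypothetical tag)."""
    return ASCII_EMPHASIS.sub(lambda m: f"<{tag}>{m.group(1)}</{tag}>", text)


marked = mark_ascii_emphasis("Is *this* a dagger?")
print(marked)
```

Requiring non-space characters directly inside the asterisks is a cheap way to avoid treating arithmetic or list markers as emphasis.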
Motivational aspects
• Quick feedback on formal correctness
• Publication preview while keeping format freedom
• (Via XSL) flexible previews of other formats
• New structure-based functionality:
  • Structure editing
  • Structure evaluation
  • Document templates
Conclusion
• Summary
  • 4-step inference
    • Record format tuples
    • Locate the elements
    • Reduce tuples to patterns
    • Determine types
  • Increase efficiency of publication chain
  • Provide unobtrusive structuring for non-expert authors
• Plans
  • Cautious extension of inference
  • Validation of document
  • Evaluation with authors