Illinois D-Lib Testbed: Technologies for Converting Legacy Mathematics for Display on the Web. Timothy W. Cole Thomas G. Habing William H. Mischo Grainger Engineering Library Information Center University of Illinois at Urbana-Champaign
Illinois D-Lib Testbed: Technologies for Converting Legacy Mathematics for Display on the Web
Timothy W. Cole, Thomas G. Habing, William H. Mischo
Grainger Engineering Library Information Center
University of Illinois at Urbana-Champaign
http://dli.grainger.uiuc.edu/Publications/MathMLConf/ • thabing@uiuc.edu
MathML & Math on the Web
Project Background & Objectives
• Funded 1994-98 under DLI-I (NSF, DARPA, & NASA)
• Continued 1998-2001 under CNRI’s D-Lib Test Suite
• Objectives:
  • Construct Large-Scale, Multipublisher, Markup-Based Full-Text Journal Testbed.
  • Investigate Processing, Indexing, Normalization, Retrieval, Rendering and Linking.
  • Study End-User Searching Behavior and Needs.
• Testbed contains 60,000 Articles from 50 Journal Titles
  • Received as SGML (various DTDs); converted to XML
  • Content & support from AIP, APS, ASCE, IEE, ASM, ACM, Elsevier
  • Additional support from IEEE, NRL, NTT Learning Systems
Project Background (cont.)
• Accomplishments:
  • Process & Retrieve from Multiple Publishers & Heterogeneous DTDs.
  • SGML to XML Conversion.
  • Metadata Extraction, Representation, Merging.
  • Dynamic Linking: Forward/Backward, from/to A & I DBs.
• Current Investigations:
  • Mathematics Markup & Rendering Issues
  • Metadata Harvesting: Replicative & Distributed
  • E-Journal Archiving
  • Local Resource Resolution
  • Asynchronous Searching of Multiple Resources
Converting Legacy Markup to MathML
• Goal: Convert publisher-specific XML math markup to standard presentation MathML
  • Desired result: can then focus on a single rendering solution
• Ground rules:
  • Minimize need for human intervention
  • Utilize standards-based techniques (e.g., XSLT, JavaScript, DOM)
  • Embed MathML in full XML document
  • Validate success of conversion based on quality of presentation
  • Strive for consistency across MathML viewers
• Scope:
  • E.g., in 17,000 APS articles, > 2.3 M instances of math (100 K block)
  • http://dli.grainger.uiuc.edu/MathMLStyle/math_sample.htm
Mathematics Markup Transformations
• Identify & translate mathematical character references
• Identify & tokenize mathematical content
• Recognize & transform mathematical markup (e.g., embellishments, script & limit schemata, etc.)

ISO 12083 Math
    <dformula>
      <g>a</g> <sup>2</sup> <inf>i</inf>
    </dformula>

Presentational MathML
    <math xmlns="http://www.w3.org/…">
      <msubsup>
        <mrow><mi>α</mi></mrow>
        <mrow><mi>i</mi></mrow>
        <mrow><mn>2</mn></mrow>
      </msubsup>
    </math>
Approach & Algorithm
• For each XML document: identify mathematical nodes (e.g., <dformula>, <formula>)
• Recursively apply templates to every child node within mathematical nodes:
  • Look up entities & special characters, convert them to the appropriate MathML characters & tokenize (JavaScript)
  • Tokenize remaining #PCDATA (JavaScript)
  • Convert postfix markup to MathML (e.g., <sup>, <inf>)
  • Re-tag one-to-one transformations (e.g., <sum>, <ul>, <ll>)
• Transformed mathematical nodes (<math>) replace original mathematical nodes in document
  • Include default namespace attribute
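The slides do not show the JavaScript tokenizer itself, so the following is only a minimal sketch of what the entity lookup and #PCDATA tokenization steps might look like. The classification rule (digits become <mn>, letters become <mi>, any other non-space character becomes <mo>) is an assumption, as are the names GREEK_ENTITIES, greekToken, tokenize and escapeXml. The lookup table holds only the single mapping implied by the example slide (<g>a</g> displayed as α).

```javascript
// Hypothetical lookup table: just the one mapping implied by the example
// slide, where <g>a</g> becomes the Greek letter alpha. A real table would
// cover the publisher's full entity set.
const GREEK_ENTITIES = { a: "\u03B1" };

// Escape characters that are not allowed verbatim in XML character data.
function escapeXml(s) {
  return s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
}

// Turn the content of a <g> element into an <mi> token.
function greekToken(code) {
  const ch = GREEK_ENTITIES[code];
  if (ch === undefined) {
    throw new Error(`no mapping for <g>${code}</g>`);
  }
  return `<mi>${ch}</mi>`;
}

// Split plain character data into MathML token elements.
// Assumed rule: numbers -> <mn>, single letters -> <mi>, other symbols -> <mo>.
// Whitespace matches none of the alternatives and is therefore skipped.
function tokenize(text) {
  const tokens = [];
  const pattern = /(\d+(?:\.\d+)?)|(\p{L})|(\S)/gu;
  for (const m of text.matchAll(pattern)) {
    if (m[1] !== undefined) {
      tokens.push(`<mn>${m[1]}</mn>`);
    } else if (m[2] !== undefined) {
      tokens.push(`<mi>${m[2]}</mi>`);
    } else {
      tokens.push(`<mo>${escapeXml(m[3])}</mo>`);
    }
  }
  return tokens;
}
```

Under these assumptions, tokenize("2x + 10") yields the four tokens <mn>2</mn>, <mi>x</mi>, <mo>+</mo> and <mn>10</mn>. Treating every letter as its own identifier is a simplification; multi-letter function names such as "sin" would need a separate rule.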
Approach & Algorithm (cont.)
• Illustrative XSLT:

    <xsl:when test="sup or inf">
      <xsl:for-each select="child::node()">
        <xsl:choose>
          <!-- Case: <sup> immediately followed by <inf> becomes <msubsup> -->
          <xsl:when test="name(self::node())='sup' and name(following-sibling::node()[1])='inf'">
            <xsl:element name="msubsup" namespace="http://www.w3.org/…">
              <!-- base: the node just before <sup> -->
              <xsl:element name="mrow" namespace="http://www.w3.org/…">
                <xsl:apply-templates select="preceding-sibling::node()[1]"/>
              </xsl:element>
              <!-- subscript: the <inf> that follows -->
              <xsl:element name="mrow" namespace="http://www.w3.org/…">
                <xsl:apply-templates select="following-sibling::node()[1]"/>
              </xsl:element>
              <!-- superscript: the current <sup> -->
              <xsl:element name="mrow" namespace="http://www.w3.org/…">
                <xsl:apply-templates select="self::node()"/>
              </xsl:element>
            </xsl:element>
          </xsl:when>
          . . .

There are four more cases to handle!
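The same sibling-based case analysis can be sketched outside XSLT. The JavaScript below is an illustration only, not the project's stylesheet: it assumes the children of a math node have already been transformed into { name, mathml } records (a hypothetical shape), and it covers base + sup, base + inf, and both orders of sup/inf following a base. How the actual stylesheets split their remaining cases is not shown on the slide. Namespace handling is omitted for brevity.

```javascript
// True when the node is one of the postfix script elements.
function isScript(node) {
  return node !== undefined && (node.name === "sup" || node.name === "inf");
}

// Wrap a fragment in <mrow>, as the XSLT on the slide does for every slot.
function row(fragment) {
  return `<mrow>${fragment}</mrow>`;
}

// children: array of { name, mathml } records (hypothetical shape), where
// `name` is the source element name and `mathml` its converted content.
// Returns a MathML string with postfix scripts attached to their bases.
function convertPostfix(children) {
  const out = [];
  let i = 0;
  while (i < children.length) {
    const node = children[i];
    const next = children[i + 1];
    const after = children[i + 2];
    if (!isScript(node) && isScript(next)) {
      if (isScript(after) && after.name !== next.name) {
        // Base followed by sup+inf or inf+sup: MathML order is
        // base, subscript, superscript regardless of source order.
        const sub = next.name === "inf" ? next : after;
        const sup = next.name === "sup" ? next : after;
        out.push(
          `<msubsup>${row(node.mathml)}${row(sub.mathml)}${row(sup.mathml)}</msubsup>`
        );
        i += 3;
      } else {
        // Base followed by a single script.
        const tag = next.name === "sup" ? "msup" : "msub";
        out.push(`<${tag}>${row(node.mathml)}${row(next.mathml)}</${tag}>`);
        i += 2;
      }
    } else {
      // Plain node, or a script with no base to attach to: pass the
      // converted content through unchanged (an assumption of this sketch).
      out.push(node.mathml);
      i += 1;
    }
  }
  return out.join("");
}
```

Applied to the records for <g>a</g>, <sup>2</sup>, <inf>i</inf> (with the Greek letter already converted), this produces an <msubsup> whose three <mrow> children hold α, i and 2 in that order, matching the presentation MathML on the earlier example slide.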
Remaining Issues
• JavaScript from within XSLT
  • Rely on MS-specific mechanisms to invoke extension functions
• Inconsistent Rendering by MathML Viewers
  • Validating against techexplorer, Amaya, Mozilla, MS IE (w/ CSS)
  • Incomplete MathML implementations
• Ambiguity & Overuse of <mrow>
  • Limited impact on appearance
  • Verbosity: 60% increase for inline, 15% increase for block
• Character / glyph issues
  • STIX project / Unicode update will provide some relief
• Automated Checking for Errors / Problems
• Rendering System Performance
Status
• Developing publisher-specific XSLT stylesheets
  • See sample transformed issue of Physical Review Letters
• XSLT allows us to generate standard MathML from publisher-dependent SGML math markup
  • Moves customization to pre-processing stage
  • Allows for single, common rendering solution
• MathML can be rendered in some browsers / tools without the need to style (Mozilla, techexplorer, Mathematica)