MathML and Digital Libraries
Timothy W. Cole (t-cole3@uiuc.edu), University of Illinois at Urbana-Champaign
Workshop on International Scientific Data, Standards & Digital Libraries @ JCDL 2005, Denver, CO

Digital Typography?
Typography, when considered “as the servant of mathematics,” should have as a goal “to communicate mathematics effectively by making it possible to publish mathematical papers and books of high quality, without excessive cost.” -- Donald Knuth (inventor of TeX), 1978
• Where do we go from TeX?
t-cole3@uiuc.edu University of Illinois at UC

MathML Antecedents
• HTML Math (Dave Raggett, 1994; HTML 3.0/3.2)
  • Limited functionality
  • Community preference to keep HTML simple, generic
• SGML ISO 12083 (1994) Mathematics DTD
  • Exclusively presentational
  • Not as sophisticated as MathML
• W3C Math Working Group formed 1997
  • MathML 1.0 released 1998
  • MathML 2.0 released 2001

MathML – Presentation Markup
• Used to describe the layout structure of mathematical notation, with a focus on visual constructs (about 30 elements). Includes:
  • Token elements – primarily PCDATA content models
  • Layout schemata, for building expressions – element content models
• Semantic hints via invisible characters are encouraged
<math xmlns='http://www.w3.org/1998/Math/MathML'>
  <mrow>
    <msqrt> <mi>z</mi> </msqrt>
    <mo>=</mo>
    <msup> <mi>z</mi> <mfrac><mn>1</mn><mn>2</mn></mfrac> </msup>
  </mrow>
</math>

MathML – Content (Semantic) Markup
• Intended to support encoding of the underlying mathematical structure of an expression, rather than any particular rendering for the expression – about 120 elements
• Limited scope, designed to express “commonplace mathematical constructs,” i.e., through about the first 2 years of college (calculus)
  • Can be extended, e.g., through the use of OpenMath
<math xmlns='http://www.w3.org/1998/Math/MathML'>
  <apply><eq/>
    <apply><root/> <degree><cn type='integer'>2</cn></degree> <ci>z</ci> </apply>
    <apply><power/> <ci>z</ci> <cn type='rational'>1<sep/>2</cn> </apply>
  </apply>
</math>

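To illustrate what encoding "the underlying mathematical structure" makes possible, here is a minimal Python sketch (not part of the original slides) that walks the content markup example above and prints it as an infix string. The names `local` and `to_infix` are illustrative only, and the sketch handles just the few elements used in this one example, not content MathML in general.

```python
import xml.etree.ElementTree as ET

MATHML_NS = "http://www.w3.org/1998/Math/MathML"

# The content markup example from the slide (straight quotes throughout).
CONTENT_MATHML = """
<math xmlns='http://www.w3.org/1998/Math/MathML'>
  <apply><eq/>
    <apply><root/>
      <degree><cn type='integer'>2</cn></degree>
      <ci>z</ci>
    </apply>
    <apply><power/>
      <ci>z</ci>
      <cn type='rational'>1<sep/>2</cn>
    </apply>
  </apply>
</math>
"""


def local(elem):
    """Return the element name without its namespace prefix."""
    return elem.tag.split("}", 1)[-1]


def to_infix(elem):
    """Render a small subset of content MathML as an infix string."""
    name = local(elem)
    if name == "math":
        return to_infix(elem[0])
    if name == "ci":
        return (elem.text or "").strip()
    if name == "cn":
        # A rational is written numerator <sep/> denominator; the
        # denominator is the text that follows the <sep/> element.
        sep = elem.find(f"{{{MATHML_NS}}}sep")
        if sep is not None:
            return f"({elem.text.strip()}/{sep.tail.strip()})"
        return (elem.text or "").strip()
    if name == "apply":
        children = list(elem)
        op, args = local(children[0]), children[1:]
        if op == "eq":
            return " = ".join(to_infix(a) for a in args)
        if op == "power":
            base, exponent = args
            return f"{to_infix(base)}^{to_infix(exponent)}"
        if op == "root":
            degree = "2"  # a root with no <degree> is a square root
            operands = []
            for a in args:
                if local(a) == "degree":
                    degree = to_infix(a[0])
                else:
                    operands.append(a)
            return f"root({to_infix(operands[0])}, {degree})"
    raise ValueError(f"unsupported element: {name}")


print(to_infix(ET.fromstring(CONTENT_MATHML)))
# prints: root(z, 2) = z^(1/2)
```

The same traversal would not work on the presentation markup of the previous slide, which records only layout (a radical sign, a superscript) and leaves the operators implicit.
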
Displaying MathML in Web Browsers
• Mozilla / Netscape 7.x renders MathML natively
  • But not all elements supported, or supported well
• Internet Explorer does not render MathML natively
  • But plug-ins available – TechExplorer, MathPlayer
• Syntax for using MathML differs between browsers
  • W3C provides XSLT that allows a page to be viewable in both
• Rendering quality also depends on fonts available
  • Thousands of code points have been added to Unicode
  • STIXFonts.org is creating more than 4,000 freely available new glyphs for these Unicode points (rest were pre-existing)

Other Software Support for MathML (2004)
• Editors: MathType + Word, Integre MathML Editor, SciWriter, MathFlow, Scientific Workplace, MathFlow + Epic/XMetaL, OpenOffice, others
• Interactive Components: Techexplorer, WebEQ
• TeX Converters: TeXML, Hermes, tex4ht, others
• Computation: Maple, Mathematica, REDUCE, others

Using MathML (as of 2004)
• MathML incorporated into a number of document types:
  • Docbook, NIH Journal Archiving & Publishing, etc.
• MathML is being used in STM publications:
  • In production: American Institute of Physics, US Patent Office
  • In pilot: Springer, Wiley, Marcel Dekker, Elsevier, others
• MathML behind the scenes in e-Learning software:
  • Blackboard, WebCT, eCollege, Diploma.EDU, MapleTA, others
• BUT adoption by author community limited to date

Current Status & Open Issues
• MathML 2.0 is stable, available, & in increasing use
  • Proving useful for many publication & educational applications
  • Tailored for use on the Web; interactivity demonstrated
  • Commitment to stay open & maintain backward compatibility
• Limitations, Issues, & Needs
  • Utility for research mathematics?
  • Use in / cross-walks with other Sci-Tech MLs
  • Low-level: fonts / glyphs, character entity definitions, …
  • Semantic utility not yet proven (being investigated):
    - Comparing MathML expressions for degree of equivalence
    - Searching by MathML expressions

But…
• Rendering of MathML is typically not exactly the same from application to application
• For most complex functions, MathML is not enough – must augment with annotations meaningful to Mathematica
• Applications still rely on something other than MathML for complex, behind-the-scenes functionality
• No canonical forms – no accepted normalization rules

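To make the missing-canonical-form problem concrete, the Python sketch below applies one possible normalization rule: sorting the operands of commutative operators so that a + b and b + a compare equal. This rule is an assumption chosen for illustration. It is not an accepted canonical form and is not the normalization used in the experiment described on the following slides. The names `COMMUTATIVE`, `canonical`, and `equivalent` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Operators whose operand order is treated as irrelevant (illustrative list).
COMMUTATIVE = {"plus", "times", "eq", "and", "or"}


def local(elem):
    """Return the element name without its namespace prefix."""
    return elem.tag.split("}", 1)[-1]


def canonical(elem):
    """Convert a content MathML tree into a nested tuple.

    Whitespace is stripped, and the operands of commutative operators
    are sorted so that operand order no longer affects comparison.
    """
    name = local(elem)
    text = (elem.text or "").strip()
    tail = (elem.tail or "").strip()  # keeps e.g. the denominator after <sep/>
    children = [canonical(child) for child in elem]
    if name == "apply" and children and children[0][0] in COMMUTATIVE:
        children = [children[0]] + sorted(children[1:], key=repr)
    return (name, text, tail, tuple(children))


def equivalent(xml_a, xml_b):
    """True if two fragments are identical after this one normalization."""
    return canonical(ET.fromstring(xml_a)) == canonical(ET.fromstring(xml_b))


A = "<apply><plus/><ci>a</ci><ci>b</ci></apply>"
B = "<apply><plus/><ci> b </ci><ci>a</ci></apply>"
C = "<apply><minus/><ci>a</ci><ci>b</ci></apply>"
D = "<apply><minus/><ci>b</ci><ci>a</ci></apply>"

print(equivalent(A, B))  # True: addition is commutative
print(equivalent(C, D))  # False: subtraction is not
```

Even this simple rule leaves a + b + c written as nested versus flat applications of plus, or z^(1/2) versus a square root, unequal, which is the kind of gap the slide points to.
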
Experiment: Using MathML to Link Literature
• Hypotheses:
  • Readers of a journal article may want to know more about mathematical functions used in the article … and vice versa
  • Functions site keywords/phrases & MathML found in body of articles may be useful for linking, search & discovery
• Approach:
  • Examine journal article testbed for presence of Functions site keywords; add links & terminology to article metadata
  • Extract selected MathML occurrences from article body; “normalize”; add to article metadata

Article Testbed: Illinois DLI-I / D-Lib Test Suite
• Funded 1994-98 under DLI-I (NSF, DARPA, & NASA); continued 1998-2001 under CNRI’s D-Lib Test Suite
• Featured:
  • Multi-publisher, Markup-Based Full-Text Journal Testbed
  • Studies of Processing, Indexing, Normalization, Retrieval, Rendering, Linking & End-User Searching Needs
• Testbed contains 75,000+ Articles from 60 Journal Titles
  • Received as SGML (various DTDs); converted to XML
  • Content from AIP, APS, ASCE, IEE, ASM, ACM, Elsevier
  • Additional support from IEEE, NRL, NTT Learning Systems

Searching for Occurrences of Functions Site Metadata Terms in Article Testbed
• Functions Site: 83,000 pages containing keywords in HTML <meta> elements
  • Approximately 7,500 unique keywords
• Not all keywords useful for full-text search
  • Eliminated numerals, Mathematica expressions, single-word phrases, phrases containing selected words
  • About 1,000 keywords left after automated filtering
• 367 of these keywords appear in journal article testbed
  • 44,000 articles contain at least one keyword
  • On average, each of those 44,000 contains 2 keywords

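The automated filtering step can be sketched in Python as follows. This is a hedged reconstruction from the four elimination rules named on the slide, not the script actually used. The contents of `EXCLUDED_WORDS`, the regular expression used to spot Mathematica expressions, and the choice to count hyphenated terms as multi-word (made here because "q-series" evidently survived the filter) are all assumptions.

```python
import re

# Hypothetical stand-in: the slide does not say which words were "selected".
EXCLUDED_WORDS = {"general", "case"}

# A keyword made up only of digits, signs, and separators counts as a numeral.
NUMERAL = re.compile(r"[\d\s.,+-]+")

# Crude heuristic (assumption): brackets and operators suggest Mathematica code.
MATHEMATICA_HINT = re.compile(r"[\[\]{}=^_*/]")


def keep_keyword(keyword, excluded_words=EXCLUDED_WORDS):
    """Return True if a keyword survives the four elimination rules."""
    kw = keyword.strip()
    if not kw or NUMERAL.fullmatch(kw):
        return False  # numerals
    if MATHEMATICA_HINT.search(kw):
        return False  # Mathematica expressions
    words = [w for w in re.split(r"[\s-]+", kw.lower()) if w]
    if len(words) < 2:
        return False  # single-word phrases
    return not any(w in excluded_words for w in words)  # selected words


candidates = ["Bessel function", "42", "BesselJ[n, z]", "gamma",
              "general case", "q-series"]
print([kw for kw in candidates if keep_keyword(kw)])
# prints: ['Bessel function', 'q-series']
```

Applied to the roughly 7,500 unique keywords, a filter of this kind is what would reduce the list to the approximately 1,000 used for full-text matching.
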
Issues in Using Functions Keywords/phrases for Characterizing Articles
• Vocabulary switching
  • Keywords in mathematics, not necessarily terms used in physics, electrical engineering, & computer science
• Meaning of words in full-text less precise
  • Synonyms, less specific use, etc.
• Context – same descriptive keywords may map to many different branches of Functions site
  • E.g., 300 occurrences of “q-series” in Functions site; which are relevant to 18 articles containing this keyword?
  • Use hierarchical tree of Functions Site to place user in reasonable proximity

Using MathML Found in Article Body
• 2.5 million occurrences of MathML in testbed, but most not useful for searching & linking
  • MathML used for Greek characters in text
  • Many instances short, inline fragments
  • Quality – many instances can’t be parsed by Mathematica
• Criteria
  • Explicitly equations (block display format)
  • Minimum length, 200 bytes
  • Accepted by Mathematica kernel
• About 3% of occurrences meet criteria

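The three selection criteria could be applied along the lines of the Python sketch below. It is a minimal illustration under stated assumptions: block display is detected through a `display='block'` (or older `mode='display'`) attribute on the `math` element, which may not match how the testbed marked display equations, and the Mathematica kernel check is represented by a caller-supplied function because no kernel interface is described in the slides. `meets_criteria`, `is_block_display`, and `accept_all` are hypothetical names.

```python
import xml.etree.ElementTree as ET

MIN_BYTES = 200  # minimum length criterion from the slide


def is_block_display(root):
    """Assumption: display equations carry display='block' or mode='display'."""
    return root.get("display") == "block" or root.get("mode") == "display"


def accept_all(xml_text):
    """Placeholder for the Mathematica kernel check, which is not modeled here."""
    return True


def meets_criteria(xml_text, accepted_by_kernel=accept_all):
    """Apply the three criteria to one MathML occurrence (a string)."""
    if len(xml_text.encode("utf-8")) < MIN_BYTES:
        return False
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return False  # not even well-formed XML
    if not is_block_display(root):
        return False
    return accepted_by_kernel(xml_text)


# A long display equation and a short inline Greek letter, both made up.
long_block = ("<math display='block'>"
              + "<mi>x</mi><mo>+</mo>" * 20
              + "<mi>y</mi></math>")
inline_greek = "<math><mi>\u03b1</mi></math>"

print(meets_criteria(long_block))    # True
print(meets_criteria(inline_greek))  # False
```

Passing a real kernel check as `accepted_by_kernel` would complete the third criterion; the cheap length and display tests run first so that the expensive check is only reached by plausible candidates.
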
Observations from this early experiment
• Currently, text cues in articles seem more effective for linking than mathematical cues
• Mathematics embedded in articles needs lots of processing to begin to be useful for discovery, linking, …
• We need to understand mathematical equivalents of language grammar, noun-phrases, …

Links
• Wolfram Functions Website
  http://functions.wolfram.com/
  http://functions.wolfram.com/formulasearch
• W3C Math Working Group
  http://www.w3.org/Math/
  http://www.w3.org/Math/testsuite/
• Design Science
  http://www.dessci.com/
  http://www.dessci.com/en/products/webeq/interactive/default.htm
• Enhancing the Searching of Mathematics (IMA Hot Topic)
  http://www.ima.umn.edu/complex/spring/searching.html