Extracting Math from PostScript Documents Michael Yang Univ. Calif., Irvine Richard Fateman Univ. Calif, Berkeley ISSAC-2004
Why Extract Math from Documents? • Current and recent publications of scholarly journals in mathematics are not adequately indexed. • Imagine a query: “Find papers that involve this differential equation: $x^2 y'' + x y' + (x^2 - m^2) y = 0$” • Or “Is there a common name for this equation?” [Ans: yes, Bessel’s]
Why Extract Math from Documents? • Find papers that may be relevant to a formula or a proof of a related theorem. • Find out if a discovery is actually novel or a rediscovery of a previous result. • Even: Is this formula true?
How can we search, anyway? • Search in integral tables using hashing, flexible pattern matching. • Example: TILU (Fateman, Einwohner) • The general problem looks like a huge challenge of unification with simplifications of analytic functions. Is $a=f(b)$ the same as $f^{-1}(a)=b$?
These are obviously hard questions • But we are much better off if we can start with a few decades of the most recent math papers and their formulas to search. • Prerequisite: encoding of formulas with semantic markup, the point of this paper.
Why start with PostScript or PDF? • We have many papers, including math journals, online, some of them free, with essentially all markup removed, stored for printing as PS or PDF. • Automation of inserting the markup, even if only partly successful, can help enable further work to make it possible to index and search for math.
Is this easier or harder than OCR? • It should be easier, because all the characters are known as error-free glyphs. • OCR tends to make erroneous symbol identifications if there is inadequate word-based context. • For example o0O°º, 1lI|!i , Illinois (!), -_= • Well-known sources of PS provide stereotypes for the font/glyph/location mapping. • But it could be harder if the PostScript is truly obscure (PS is Turing equivalent, after all)
An Example From a paper by Cyril Banderier et al., “Random Maps, Coalescing Saddles, Singularity Analysis, and Airy Phenomena,” Random Structures and Algorithms, 19 (3-4), 194-246 (2001), only slightly edited by inserting newlines. [explain origin] ....0.002 0.0025 200 400 600 800 1000 k Figure 3. Left: The standard Airy distribution. Right: Observed frequencies of core sizes k 2 [20; 1000] in 50,000 random maps of size 2,000, showing the bimodal character of the distribution. variety of integral or power series representations including (see [1, 45]) 1) Ai(z) 1 2 Z 1 1 e i(zt t 3 =3) dt = 1 3 2=3 1 X n=0 3 1=3 z n ( n 1) 3) n sin 2(n 1) 3 : Equipped with this de nition, we present the main character of the paper, a probability distribution closely related to the Airy function. De nition 1. The standard ....
What is this really? In this particular case, extraction of the document image shows two formulas in the middle of the citation:
How could we encode this image? Recognize the characters on the page as equivalent to a TeX expression, for example: $$\mbox{Ai}(z) = {1\over{2 \pi}}\int _{-\infty}^{+\infty} e^{i(zt+t^3/3)}dt$$ $$~~= {1 \over {\pi 3^{2/3}}}\sum_{n=0}^\infty (3^{1/3}z)^n {{\Gamma((n+1)/3)} \over {n!}} \sin {{2(n+1)\pi}\over 3}.$$ or some alternative in MathML or OpenMath. What are the barriers to getting to this point?
Detecting Math in the first place • Look for changes in font, italics, font size, and baseline. • Consider the density of text (formulas are low density). • Notice the presence of special characters unusual in text: = is common in math, but not in text (also +, -, parentheses).
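The special-character cue in the last bullet can be sketched in a few lines of Python. This is only an illustration, not the detector used in the paper: the character set, the function names `math_score` and `looks_like_math`, and the 0.25 threshold are all assumptions made for the example, and the font and density cues are not modeled.

```python
# Characters treated as "math-like". This set is an assumption for
# illustration, not a list taken from the paper.
MATH_CHARS = set("=+-()^_/<>|")


def math_score(line: str) -> float:
    """Return the fraction of non-space characters that look math-like."""
    chars = [c for c in line if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if c in MATH_CHARS or c.isdigit())
    return hits / len(chars)


def looks_like_math(line: str, threshold: float = 0.25) -> bool:
    """Flag a line as a formula candidate (hypothetical threshold)."""
    return math_score(line) >= threshold
```

A real detector would presumably combine this score with the font, size, and baseline changes that the PostScript interpreter reports, since hyphens and parentheses also occur in ordinary prose.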
Implementation • Run PostScript through a modified Ghostscript (PS interpreter) to output text file information suitable for geometric/math processing. • Run this file through previously developed OCR-based technology (in Lisp) that uses bounding boxes, contents, positions, … to create a geometric 2-D “relative position” tree. Process further to identify semantic relationships if possible and output a hierarchical tree representation of math formulas. • Convert this to TeX (could be MathML equally well).
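To make the “relative position” step concrete, here is a minimal Python sketch that classifies each glyph against the current base glyph by its baseline shift and emits flat TeX. It is not the authors' Lisp code: the `Glyph` record, the `relation` and `to_tex` functions, and the 0.15 tolerance are hypothetical, and only single-glyph superscripts and subscripts are handled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Glyph:
    char: str
    x: float         # left edge of the glyph
    baseline: float  # y of the baseline (PostScript: y grows upward)
    size: float      # font size in points


def relation(base: Glyph, nxt: Glyph, tol: float = 0.15) -> str:
    """Classify nxt relative to base as 'super', 'sub', or 'inline'.

    The shift is measured in units of the base font size; tol is an
    assumed tolerance, not a value from the paper.
    """
    shift = (nxt.baseline - base.baseline) / base.size
    if shift > tol:
        return "super"
    if shift < -tol:
        return "sub"
    return "inline"


def to_tex(glyphs: List[Glyph]) -> str:
    """Emit a flat TeX string, wrapping raised or lowered glyphs."""
    if not glyphs:
        return ""
    out = [glyphs[0].char]
    base = glyphs[0]
    for g in glyphs[1:]:
        rel = relation(base, g)
        if rel == "super":
            out.append("^{" + g.char + "}")
        elif rel == "sub":
            out.append("_{" + g.char + "}")
        else:
            out.append(g.char)
            base = g  # an inline glyph becomes the new reference
    return "".join(out)
```

For example, an `x` at baseline 100 followed by a smaller `2` at baseline 104 would be emitted as `x^{2}`. A full system would need a tree rather than a flat string, so that fractions, limits on sums and integrals, and nested scripts can be represented.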
Possible Future Work • Better font tools • Look at more producers of PS (not just TeX and dvips), e.g. Acrobat Distiller. • Run some tests (NEC) to see if we can extract sufficient formulas to add to the indexing information. • Examine the issue of “formula similarity”, e.g. parameter substitution, simplification, rearrangement. (This is relatively easy in the context of integration because there is a designated variable of integration.)
Conclusions • It’s possible to automatically revisit previously typeset documents and invent plausible versions of TeX source code for some, perhaps much, of published TeX. • This provides an additional link in a chain which may eventually lead to more widespread semantic encoding of math for indexing and retrieval. • Given the difficulties, a better route for the future is to have authors or editors use semantic mark-up for “born digital” mathematical documents. Publishers should encourage this kind of work, although standards are currently disappointing.
Another paper, not included • Submitted to ISSAC-2004 • Author: R. Fateman
Rational Function Computing with Poles and Residues • Here’s the idea: consider two forms for the same rational expression.
Which form is better? • Generality of representation • Complexity (Cost) of operations • Arithmetic (+, *, /) • Integration, derivatives, limits, series, … • Numerical evaluation • Display for human viewing
Keep constant numerators over (powers of) linear denominators ( + polynomial) • Works for encoding arbitrary rational functions (over complex numbers) in one variable. • Plausibly requires high-precision floats if you start with ratio of polynomials where the roots of the denominator cannot be expressed as exact rational numbers.
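As a small worked illustration (these two examples are chosen here and are not taken from the slides), a ratio of polynomials with rational poles splits exactly into a polynomial plus constant numerators over linear denominators: $$\frac{x^3}{x^2-1} = x + \frac{1/2}{x-1} + \frac{1/2}{x+1}$$ whereas a denominator with irrational roots already forces non-rational poles and residues, which is where approximate high-precision numbers would plausibly come in: $$\frac{1}{x^2-2} = \frac{1/(2\sqrt{2})}{x-\sqrt{2}} - \frac{1/(2\sqrt{2})}{x+\sqrt{2}}$$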
PRO: Once you have this representation • Addition of rational functions is essentially free, compared to standard representation since no polynomial GCD is required. • a/b + c/d is already simplified except for sorting and the possibility that b=d • Multiplication of rational functions is inexpensive also, again no GCD needed.
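The claim about addition can be seen in a minimal Python sketch. It assumes a hypothetical storage scheme, not the one in the paper: a rational function with no polynomial part is a dictionary mapping `(pole, power)` to a coefficient, standing for a sum of terms `coefficient / (x - pole)**power`. Exact `Fraction` coefficients are used here for clarity; the names `add` and `evaluate` are illustrative.

```python
from collections import defaultdict
from fractions import Fraction

# Assumed representation: {(pole, power): coefficient}, meaning
# the sum over all keys of coefficient / (x - pole)**power.


def add(f: dict, g: dict) -> dict:
    """Add two functions in pole/residue form; no polynomial GCD is needed."""
    total = defaultdict(Fraction)
    for terms in (f, g):
        for key, coeff in terms.items():
            total[key] += coeff  # the b = d case: same pole and power
    # Drop terms that cancelled exactly.
    return {k: c for k, c in total.items() if c != 0}


def evaluate(f: dict, x) -> Fraction:
    """Evaluate the sum of coefficient / (x - pole)**power at x."""
    return sum((c / (x - p) ** n for (p, n), c in f.items()), Fraction(0))
```

Adding two such dictionaries only merges keys, which matches the slide's remark that a/b + c/d is already simplified apart from sorting and the case b = d. Multiplication is not shown; it would need the partial-fraction expansion of each product of terms.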
CON: Do you want to use this representation? • Division is not fast, so it is more appropriate if division is infrequent. • If the input is not already in residue/pole form, or if you have to do division, finding zeros introduces approximations [maybe for the first time in a problem]. • Output forms may look longer.
Examples • Ordinary addition: orders of magnitude faster, e.g. 45,000 times faster. • Ordinary multiplication: maybe 2X faster • What about mixtures of + and * together? What important algorithms are there? • Sparse determinant calculation.
A determinant benchmark • Consider matrices with entries of this form: • Determinant of 8×8 matrix in Macsyma 2.4, on a 2.6 GHz Pentium 4 computer. • Using Gaussian Elimination: 112 sec • Using Minor Expansion: 109 sec • Using Residues/Poles (75% in bignum arithmetic): 41 sec • Using Residues/Poles and double-floats: 1.6 sec
Conclusions • No surprise that avoiding GCDs is a winner. • Using approximate calculations can provide huge speedups. Do we really need exact computation everywhere we provide it? • We have a potential application for high-precision zero-finding, as well as non-overflowing software floats (GMP, ARPREC).