Document Image Analysis. Lecture 5: Metrics. Richard J. Fateman (University of California, Berkeley) and Henry S. Baird (Xerox Palo Alto Research Center). UC Berkeley CS294-9 Fall 2000
The course so far… • Reminder: All course materials are online: http://www-inst.eecs.berkeley.edu/~cs294-9/ • Overview of the DIA Research Field • Some applications (Postal Addresses, Checks) • Research Objectives: more systematic modeling, design • Some basic engineering
How well are we doing? • Cost to achieve a useful result • Compare digital version to: hand keying/digitizing; verification; correction • Correction cost may dominate total system cost
When is a result nearly correct? • Character model: Correct; Reject; Error • String model: Insertion; Deletion; Rejection; Substitution [wrong letter identification]
Using ASCII character labels
s1 = ABCDEFGHIJKL
s2 = ACD~~OIIUKL [~ = reject]
• Insert B after A in s2 • Substitute E for ~, F for ~ • Substitute G for O in s2 • Substitute H for I in s2 • Substitute I for U • … etc.
(Really, H was recognized as II, and IJ was recognized as U.)
ASCII labels are inadequate • Unicode + • Font + • Point size + • Tag information <author> .. </author>
Simple measures may mislead • Increase the rejection rate and this “error rate” decreases. • Reject all characters to get 0/0? • Some applications (e.g. the post office) demand a very low error rate, even if that means correct but low-confidence results are sometimes rejected.
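A minimal sketch of why the rate can mislead, assuming the "error rate" is computed as errors divided by accepted (non-rejected) characters, which is the reading under which rejecting everything gives 0/0. The function name `rates`, the confidence scores, and the sample data are all hypothetical, not taken from the lecture.

```python
def rates(results, threshold):
    """Return (reject_rate, error_rate) for a confidence threshold.

    results: list of (true_char, recognized_char, confidence) triples.
    Characters with confidence below the threshold are rejected.
    Assumption: error_rate = errors / accepted characters.
    """
    accepted = [(t, g) for t, g, c in results if c >= threshold]
    rejected = len(results) - len(accepted)
    errors = sum(1 for t, g in accepted if t != g)
    reject_rate = rejected / len(results)
    # With nothing accepted the error rate is 0/0, so report it as undefined.
    error_rate = errors / len(accepted) if accepted else None
    return reject_rate, error_rate


# Hypothetical recognizer output: (truth, guess, confidence).
SAMPLE = [
    ("A", "A", 0.99),
    ("B", "8", 0.40),
    ("C", "C", 0.95),
    ("D", "O", 0.55),
    ("E", "E", 0.60),  # correct, but low confidence
    ("F", "F", 0.90),
]

if __name__ == "__main__":
    for threshold in (0.0, 0.5, 0.7, 1.01):
        print(threshold, rates(SAMPLE, threshold))
```

At threshold 0.7 the error rate in this toy data drops to zero, but half the characters are rejected, including the correctly read E. Reporting the reject rate alongside the error rate makes that trade-off visible.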
Some errors are acceptable • Keyword search: a key word that occurs many times will still be found even if it is occasionally rejected • Erroneous (nonsense) words are unlikely to be found by a search • Caveat: if a key word is consistently changed to a nearby word, it may be missed (e.g. if "dumptruck" is always read as "durnptruck", a search for it never finds it).
Example: UNLV-ISRI document collection • 20 million pages of scientific, legal, official memos from DOE and contractors • Rock mining • Maps • Safe transportation of nuclear waste • Average length 44 pages
Example: UNLV-ISRI document collection • DOE’s Licensing Support System Prototype • 104,000 Page images, 2,600 documents • Manually typed “correct” text • OCR text • To determine relevance to queries, 3 methods used • Geology students ranking (0/1) • OCR keyword search • “correct” text search
Example: UNLV-ISRI document collection • Exact match on 71 queries. • 632 returned by correct text • 617 returned by OCR. • Essentially: OCR is OK for this application. • Probabilistic ranking / frequency: • Excessive OCR errors affected ranking • On average, similar results • Feedback on relevance was not helpful for poor OCR • Benchmarking: similar relevance = good results
Example: UNLV-ISRI document collection • One surprising result is that for some standard tests of precision and recall, searching the OCR text did better than searching the actual text. [Crummy OCR meant that some terms were not recognized; but the documents were irrelevant…]
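The effect can be illustrated with the standard definitions of precision (fraction of returned documents that are relevant) and recall (fraction of relevant documents that are returned). The sketch below uses invented document IDs, not UNLV-ISRI data: an OCR error hides the query term in an irrelevant document, so the OCR search scores higher precision than the clean-text search.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else None
    recall = hits / len(relevant) if relevant else None
    return precision, recall


# Hypothetical example: d1..d3 are relevant; d4 and d5 merely mention the term.
RELEVANT = {"d1", "d2", "d3"}
CLEAN_HITS = {"d1", "d2", "d3", "d4", "d5"}  # clean text matches all five
OCR_HITS = {"d1", "d2", "d3", "d4"}          # OCR garbled the term in d5

if __name__ == "__main__":
    print("clean text:", precision_recall(CLEAN_HITS, RELEVANT))
    print("OCR text:  ", precision_recall(OCR_HITS, RELEVANT))
```

In this toy case both searches have the same recall, and the OCR search wins on precision only because the lost match happened to be irrelevant.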
A theory for computing accuracy • Consider the result of OCR to be a string • Idealization: most common errors involve mis-counting the number of spaces! • Ignores size, font, absolute position, etc.
Computing the shortest edit distance • Bioinformatics sequencing • Associate a cost with each correspondence. For example: • Match or substitute (cost 0 or 1) • Insert or delete (cost 2)
[Figure: dynamic-programming matrix with ACUGAUGUGA along one axis and AUGGAA along the other]
Attempt to align AUGGAA to ACUGAUGUGA. Distances were calculated using the following parameters: s(a,b) = 0 when a equals b; s(a,b) = 1 when a differs from b; insert or delete cost = 2. One of the possible optimal paths is indicated by a solid line connecting cells. It corresponds to the following alignment:
ACUGAUGUGA
A-UG--G-AA
[explain dynamic programming here?]
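The alignment above can be reproduced with a textbook dynamic-programming table. This is a minimal sketch using the slide's costs (match 0, substitute 1, insert or delete 2); the function name `align` and the tie-breaking order in the trace-back are choices made here, not part of the lecture. Because several optimal paths exist, the alignment it prints may differ from the one shown while having the same total cost. The slide's alignment costs 9: four gaps at 2 each plus one substitution.

```python
def align(a, b, sub_cost=1, gap_cost=2):
    """Return (distance, aligned_a, aligned_b) for a minimum-cost global alignment."""
    n, m = len(a), len(b)
    # dist[i][j] = minimum cost of aligning a[:i] with b[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * gap_cost
    for j in range(1, m + 1):
        dist[0][j] = j * gap_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = 0 if a[i - 1] == b[j - 1] else sub_cost
            dist[i][j] = min(
                dist[i - 1][j - 1] + s,      # match or substitute
                dist[i - 1][j] + gap_cost,   # gap in b
                dist[i][j - 1] + gap_cost,   # gap in a
            )
    # Trace one optimal path back from the bottom-right cell.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = 0 if a[i - 1] == b[j - 1] else sub_cost
            if dist[i][j] == dist[i - 1][j - 1] + s:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and dist[i][j] == dist[i - 1][j] + gap_cost:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return dist[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    distance, top, bottom = align("ACUGAUGUGA", "AUGGAA")
    print(distance)
    print(top)
    print(bottom)
```

The table has (n+1) × (m+1) cells and each cell is filled in constant time, which is where the quadratic running time comes from.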
Computing the shortest edit distance • Also useful for other tasks (e.g. recognizing speech) • Many ways to organize the dynamic programming, still O(n²) • Probably of more interest is word accuracy, or accuracy on non-stopwords (excluding "and", "the", "of", … etc.)
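A rough sketch of word accuracy with and without stopwords. Several things here are assumptions rather than the lecture's definitions: accuracy is taken as matched ground-truth words divided by total ground-truth words, matching is done with Python's `difflib.SequenceMatcher` (a heuristic longest-matching-block method, not a true minimum-cost alignment), and the stopword list is a tiny illustrative one.

```python
from difflib import SequenceMatcher

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = frozenset({"and", "the", "of", "a", "in", "to"})


def word_accuracy(truth, ocr, stopwords=frozenset()):
    """Fraction of ground-truth words that are matched, in order, in the OCR text.

    Words listed in `stopwords` are dropped from both texts before matching.
    Returns None when no ground-truth words remain.
    """
    truth_words = [w for w in truth.lower().split() if w not in stopwords]
    ocr_words = [w for w in ocr.lower().split() if w not in stopwords]
    if not truth_words:
        return None
    matcher = SequenceMatcher(None, truth_words, ocr_words, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(truth_words)


if __name__ == "__main__":
    truth = "the cost of the waste"
    ocr = "the cost of tbe wasle"
    print(word_accuracy(truth, ocr))             # all words
    print(word_accuracy(truth, ocr, STOPWORDS))  # non-stopwords only
```

In this toy pair the correctly read stopwords raise the all-words score above the non-stopword score, which is why the content-word figure is often the more informative one for retrieval.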
Correct zoning is essential • Read order in multi-column pages • How to compare competing programs on performance of repeated headers • What to do with figures, logos. [Figure: two page layouts with zones numbered 1 to 6]
Document Attribute Format Specification: DAFS • “While many formats exist for composing a document from electronic storage onto paper, no satisfactory standard exists for the reverse process. DAFS is intended to be a standard for document decomposition. It will be used in applications such as OCR and document image understanding. There are three storage formats: DAFS-Unicode, DAFS-ASCII and a more compact DAFS-Binary form. DAFS is a file format specification for documents with a variety of uses. It is developed under the Document Image Understanding (DIMUND) project funded by ARPA.” • www.raf.com, Illuminator, UW CD-ROMs (English and Japanese)
DAFS vs SGML • DAFS = SGML + Unicode + CCITT Group 4 fax • SGML requires a DTD (document type definition) • SGML is intended for structure, not appearance (e.g. not bold, italic) • Images that accidentally contain the ASCII version of <tag> can be problematic • Solved by putting images in separate files!
Perfect results: how to obtain ground truth? • Painfully enter it by hand, or • Painfully correct OCR results, or • Compute some kind of average of OCR programs
Perfect ground truth: a synthetic approach • (Kanungo, UMD): start with TeX • Produce the ground truth for layout from TeX • Extract character positions and glyphs by analyzing DVI files • This provides essentially every bit position of each character.
Ground truth • Next, commit to paper: • Print the DVI files • Scan a calibration page • Compute parameters of the 2D→2D transformation T imposed by physics • Scan the printout • Align the page • Run the recognizer • Compare reported positions (mapped through T⁻¹) to the correct ones
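The calibrate-then-compare steps can be sketched as follows, under a strong simplifying assumption: T is modeled as a plain affine map fitted by least squares to calibration marks. The slide only says "2D→2D transformation", so the affine model, the function names, and the sample coordinates are all hypothetical; a real print-and-scan process may need a richer model.

```python
def solve3(m, v):
    """Solve the 3x3 linear system m * x = v by Cramer's rule."""
    def det(a):
        return (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
                - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
                + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))
    d = det(m)
    solution = []
    for col in range(3):
        replaced = [[v[r] if c == col else m[r][c] for c in range(3)]
                    for r in range(3)]
        solution.append(det(replaced) / d)
    return solution


def fit_affine(src, dst):
    """Least-squares affine map T with T(src[k]) ~ dst[k].

    Returns (a, b, c, d, e, f), meaning x' = a*x + b*y + c, y' = d*x + e*y + f.
    """
    rows = [(x, y, 1.0) for x, y in src]
    normal = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    rhs_x = [sum(r[i] * p[0] for r, p in zip(rows, dst)) for i in range(3)]
    rhs_y = [sum(r[i] * p[1] for r, p in zip(rows, dst)) for i in range(3)]
    return tuple(solve3(normal, rhs_x) + solve3(normal, rhs_y))


def invert_affine(t):
    """Return the affine map that undoes t."""
    a, b, c, d, e, f = t
    det = a * e - b * d
    ia, ib, ic, id_ = e / det, -b / det, -d / det, a / det
    return (ia, ib, -(ia * c + ib * f), ic, id_, -(ic * c + id_ * f))


def apply_affine(t, point):
    a, b, c, d, e, f = t
    x, y = point
    return (a * x + b * y + c, d * x + e * y + f)


if __name__ == "__main__":
    # Hypothetical calibration marks: ideal (DVI) positions and where the scan found them.
    ideal = [(0.0, 0.0), (100.0, 0.0), (0.0, 100.0), (100.0, 100.0)]
    distortion = (1.02, -0.01, 3.0, 0.01, 1.02, -2.0)  # stand-in for the "physics"
    scanned = [apply_affine(distortion, p) for p in ideal]

    t = fit_affine(ideal, scanned)
    t_inv = invert_affine(t)

    # A position reported by the recognizer, mapped back into ideal coordinates.
    reported = apply_affine(distortion, (40.0, 60.0))
    print(apply_affine(t_inv, reported))  # should be close to (40, 60)
```

The residual between the mapped-back position and the DVI-derived position is then the recognizer's localization error, separated from the distortion introduced by printing and scanning.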
Change of Pace • Assignment 1 • What does it mean to write a program? • Documentation • Demo • Instructions for use • (perhaps optional) • Extensions, limitations, discussion • Discussion questions