Automated Form Processing for DTIC Documents • March 20, 2006 • Presented by K. Maly, M. Zubair, S. Zeil
Outline • Overall process for handling documents in batches • Issues • Results • Conclusion
Figure: Flowchart of Processing One Document
• Start: read the next page of the document (note 1).
• If a page was read, match it against all form templates. If the number of matched templates is greater than 0, add the page with its templates to the candidate set (note 2). Then read the next page.
• If no pages are left and there are no candidates, move the document to the “unresolved” folder. End.
• If no pages are left and there are candidates, get the best one (note 3), extract metadata, and store the “resolved” results. End.
Notes
• 1: The input is an Omnipage XML document having 10 pages (first 5 and last 5).
• 2: Possibly, more than one page will match more than one template. At this time, we do not check how well they matched.
• 3: The best candidate is determined by the ratio of the number of fields matched over the total number of fields.
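The flow above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Template` class, the `match_ratio` and `process_document` names, and the rule that a template "matches" a page when at least one of its field names occurs verbatim are all hypothetical, since the slides do not say how a match is decided.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Template:
    """Hypothetical stand-in for a form template: a name plus expected field names."""
    name: str
    field_names: List[str]


def match_ratio(page_text: str, template: Template) -> float:
    """Number of fields matched over the total number of fields (note 3)."""
    if not template.field_names:
        return 0.0
    hits = sum(1 for field in template.field_names if field in page_text)
    return hits / len(template.field_names)


def process_document(pages: List[str], templates: List[Template]) -> Optional[Tuple[int, str]]:
    """Return (page index, template name) of the best candidate, or None.

    None corresponds to moving the document to the "unresolved" folder.
    """
    candidates = []
    for index, text in enumerate(pages):
        for template in templates:
            ratio = match_ratio(text, template)
            if ratio > 0:  # assumption: any matched field makes the page a candidate
                candidates.append((ratio, index, template.name))
    if not candidates:
        return None
    best = max(candidates, key=lambda c: c[0])  # highest matched/total ratio wins
    return best[1], best[2]
```

Metadata extraction and storing the "resolved" result would follow from the returned page and template; they are left out because the slides do not describe them.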
Issues in form-based metadata extraction • Results of 246 Documents • Results of 100 Documents
Forms are missing some obvious fields • For example, in the following document, the POINT page (first page) has the author, but the form doesn’t. • http://128.82.7.208:9090/dtic/newdocs/sf298/formdocs/pdfs/ADA425677.pdf
In the following form, the caption “REPORT DOCUMENTATION PAGE” is OCRed incorrectly as “REPORT DOCUMENTA110N PAGE”. These types of OCR errors are resolved using edit distance.
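As a minimal sketch of how edit distance can absorb this kind of OCR error, the following computes the Levenshtein distance and accepts a caption within a small threshold. The function names and the threshold `max_dist=3` are assumptions chosen for illustration; the slides do not state the threshold actually used.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def is_caption(text: str, caption: str = "REPORT DOCUMENTATION PAGE",
               max_dist: int = 3) -> bool:
    """True if the OCR text is within max_dist edits of the expected caption."""
    return edit_distance(text.strip().upper(), caption) <= max_dist
```

Here “DOCUMENTA110N” differs from “DOCUMENTATION” by three substitutions, so the garbled caption is still accepted at this threshold.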
The following has no form caption. If the caption of a form page is missing, we recognize the page as a form if more than 10 metadata field names have been found.
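The caption-less rule amounts to a simple count against a threshold. The sketch below assumes verbatim substring matching and uses made-up helper names; in practice the field names would come from the form templates, and matching would presumably tolerate OCR noise.

```python
from typing import List


def count_field_names(page_text: str, field_names: List[str]) -> int:
    """Count how many known metadata field names occur in the page text."""
    return sum(1 for name in field_names if name in page_text)


def looks_like_form(page_text: str, field_names: List[str],
                    min_fields: int = 10) -> bool:
    """Treat a page without a caption as a form if MORE THAN min_fields names are found."""
    return count_field_names(page_text, field_names) > min_fields
```

Note the strict comparison: exactly 10 matched field names is not enough under the stated rule.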
The following form spans two pages. After finding a form page, we check the following pages using field name matching to see whether they are part of the form.
In the following form, we have word boundary detection errors for metadata field names. For example, “4. TITLE AND SUBTITLE” appears as “4 . T ITL E A ND SUB TI TL E”. We use the following sequence for matching field names: exact match, match after removing white spaces, and similar match (using edit distance).
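The three-stage sequence can be sketched as a cascade that stops at the first stage that succeeds. The function names, the returned stage labels, the threshold `max_dist=2`, and the choice to run the edit-distance stage on whitespace-stripped strings are all assumptions made for this illustration.

```python
import re
from typing import Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def match_field_name(ocr_text: str, field_name: str,
                     max_dist: int = 2) -> Optional[str]:
    """Return the stage that matched: 'exact', 'no_whitespace', 'similar', or None."""
    if ocr_text == field_name:
        return "exact"
    # Stage 2: remove all white space so broken word boundaries no longer matter.
    squeezed_ocr = re.sub(r"\s+", "", ocr_text)
    squeezed_name = re.sub(r"\s+", "", field_name)
    if squeezed_ocr == squeezed_name:
        return "no_whitespace"
    # Stage 3: tolerate a few character-level OCR errors.
    if edit_distance(squeezed_ocr, squeezed_name) <= max_dist:
        return "similar"
    return None
```

With these assumptions, the garbled “4 . T ITL E A ND SUB TI TL E” is caught at the second stage, before edit distance is needed.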
Following are parts of two forms, where we can see the variations for the field “17. LIMITATION OF ABSTRACT”. Here we recognize the field name by matching it part by part. If the cell boundary information is available (i.e., "17. LIMITATION OF ABSTRACT" is in one cell), we also rebuild the field name by connecting the texts in the cell (e.g., "17.", "LIMITATION", "OF", "ABSTRACT" ===> "17. LIMITATION OF ABSTRACT") and match it against the defined field name directly. It is worth noting that not all form pages have cell boundary information.
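Both strategies can be sketched briefly. The slides do not define "part by part" precisely, so the version below assumes it means that the words of the field name appear in order among the OCR text pieces, with other pieces allowed in between; the function names are hypothetical.

```python
from typing import List


def rebuild_cell_text(tokens: List[str]) -> str:
    """Connect the text pieces found in one cell into a single field name string."""
    return " ".join(t.strip() for t in tokens if t.strip())


def matches_part_by_part(tokens: List[str], field_name: str) -> bool:
    """True if every word of field_name occurs among the tokens, in order."""
    remaining = iter(t.strip() for t in tokens)
    # Each any() call consumes the iterator up to the next matching token,
    # so later parts can only match later tokens.
    return all(any(tok == part for tok in remaining)
               for part in field_name.split())
```

When cell boundaries are known, `rebuild_cell_text(["17.", "LIMITATION", "OF", "ABSTRACT"])` yields the full name for a direct comparison; otherwise the in-order check is the fallback.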
Coverage Type Missing in the Original Document • The title is missing in the third field of the PDF document; it should contain “REPORT TYPE AND DATES COVERED”.
Identified as sf298_1 • The current templates identified this form but failed to extract metadata because this was a new kind of form. We can handle this case by writing a new template.
OCR Error in the Dates Covered Field • OCR has produced garbage for the third field (From-To) in the Dates Covered field.
Results of 264 Documents • We are currently handling six types of forms (through templates): five are variations of the sf298 form (Report Documentation Page) and one is another type of form. For any new forms, templates can be written to handle them. Following are the recall and precision results based on 264 documents.
Conclusion • Execution time: the code took 21 hrs, 58 minutes to process our testbed of 10K PDF documents. • For the 10K documents, we are getting good results for most of the form classes and relatively poor performance for sf298_3 due to OCR errors.