NASA Feasibility Study Status Update
September 25, 2006
NASA Milestones
A. Feasibility Study to identify the NASA document types - Report - May 31, 2006
B. Form identification and template development - Template set - Aug 31, 2006
C. Enhance classification algorithm for two specific classes - software packaged - Oct 31, 2006
D. Process study for inter-organizational collections - configuration software - Dec 1, 2006
E. Enhance engine to recognize two major classes - software packaged - Dec 15, 2006
F. Evaluation of extraction process - report - Feb 28, 2006
Form Identification and Template Development: August 31 Deliverable
Form Identification and Template Development: August 31 Deliverable - DEMO
Active Tasks for Future NASA Milestones
• Standard Intermediate Representation of the Scanned Document (IDM)
• Design Classification Algorithm
Independent Document Model (IDM)
• Platform-independent document model
• Motivation
  • Dramatic XML schema change between OmniPage 14 and 15
  • Ties the template engine to a stable specification
  • Protects from linking directly to a specific OCR product
  • Allows us to include statistics for enhanced feature usage
    • Statistics (e.g. avgDocFontSize, avgPageFontSize, wordCount, avgDocWordCount)
    • Supports Pointpage Detection, Classification
• Use XSLT 2.0 stylesheets to transform
  • Supporting a new OCR schema only requires generating a new XSLT stylesheet; the engine does not change
  • Chain a series of sheets to add functionality (CleanML)
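To make the statistics bullet concrete, here is a minimal sketch of how such values could be computed. The slides do not show the IDM schema, so the `<page>` / `<word fontSize="...">` layout, the sample data, and the function name `idm_statistics` are all assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical IDM-like input; the real IDM schema is not shown in the slides.
SAMPLE = """
<document>
  <page number="1">
    <word fontSize="18">Validation</word>
    <word fontSize="18">Report</word>
  </page>
  <page number="2">
    <word fontSize="10">A</word>
    <word fontSize="10">lengthy</word>
    <word fontSize="12">discussion</word>
    <word fontSize="12">follows</word>
  </page>
</document>
"""


def idm_statistics(xml_text):
    """Return word count and average font sizes for the assumed layout."""
    root = ET.fromstring(xml_text.strip())
    doc_sizes = []            # font size of every word in the document
    avg_page_font_size = {}   # page number -> mean font size on that page
    for page in root.iter("page"):
        sizes = [float(w.get("fontSize")) for w in page.iter("word")]
        doc_sizes.extend(sizes)
        if sizes:
            avg_page_font_size[page.get("number")] = sum(sizes) / len(sizes)
    return {
        "wordCount": len(doc_sizes),
        "avgDocFontSize": sum(doc_sizes) / len(doc_sizes) if doc_sizes else 0.0,
        "avgPageFontSize": avg_page_font_size,
    }


stats = idm_statistics(SAMPLE)
```

A page whose average font size is well above the document average (page 1 here) is the kind of signal a classifier or page detector could use, which is presumably why the statistics are carried in the model.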
IDM Usage
• Each incoming XML schema requires a specific XSLT 2.0 stylesheet
• Resulting IDM doc used for “Form Based” templates
• IDM transformed into CleanML for “Non-form” templates
[Diagram: OmniPage 14 XML Doc via docTreeModelOmni14.xsl, OmniPage 15 XML Doc via docTreeModelOmni15.xsl, and Other OCR Output XML Doc via docTreeModelOther.xsl each produce the IDM XML Doc. The IDM XML Doc feeds Form Based Extraction; via docTreeModelCleanML.xsl it becomes the CleanML XML Doc, which feeds Non Form Extraction.]
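The routing described on this slide can be sketched as a small lookup. The stylesheet file names come from the slide; the source labels (`"omnipage14"` etc.), the `form_based` flag, and the function name `stylesheet_chain` are hypothetical. The sketch only decides which stylesheets to apply and in what order; actually running them would need an XSLT 2.0 processor such as Saxon.

```python
# Stylesheet names are taken from the slide; the dictionary keys and the
# function name are hypothetical labels chosen for this sketch.
STYLESHEET_FOR_SOURCE = {
    "omnipage14": "docTreeModelOmni14.xsl",
    "omnipage15": "docTreeModelOmni15.xsl",
    "other": "docTreeModelOther.xsl",
}
CLEANML_STYLESHEET = "docTreeModelCleanML.xsl"


def stylesheet_chain(source, form_based):
    """List the stylesheets to apply, in order, for one scanned document."""
    if source not in STYLESHEET_FOR_SOURCE:
        raise ValueError(f"no stylesheet registered for OCR source {source!r}")
    chain = [STYLESHEET_FOR_SOURCE[source]]  # OCR-specific XML -> IDM
    if not form_based:
        chain.append(CLEANML_STYLESHEET)     # IDM -> CleanML
    return chain


form_chain = stylesheet_chain("omnipage15", form_based=True)
nonform_chain = stylesheet_chain("omnipage14", form_based=False)
```

Supporting another OCR product then means adding one entry to the table, which mirrors the slide's point that the engine itself does not change.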
Classification Algorithm
• Two approaches:
  • Classification (switching) based on image classification
  • Post-hoc classification via validation
Post-hoc classification via validation
• Attempt metadata extraction with all plausible templates
• Validate each result set, assigning confidence scores
  • Field-specific validation rules, which may combine
    - statistical models derived for each field of
      - text length
      - % of words from English dictionary
      - % of phrases from knowledge base prepared for that field
    - pattern matching
• Select metadata set with highest confidence score
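The three steps above can be sketched as a simple loop. This is an illustration under assumptions, not the project's engine: `select_best_extraction`, the two toy templates, and the `fraction_filled` validator are hypothetical stand-ins for real extraction templates and validation rules.

```python
def select_best_extraction(document, templates, validate):
    """Run every template, score each result, keep the highest-scoring one.

    templates: mapping of template name -> callable(document) -> dict of fields
    validate:  callable(dict of fields) -> confidence score in [0, 1]
    Returns (template name, fields, score), or None if no templates are given.
    """
    best = None
    for name, extract in templates.items():
        fields = extract(document)
        score = validate(fields)
        if best is None or score > best[2]:
            best = (name, fields, score)
    return best


# Toy stand-ins, purely illustrative.
def form_template(doc):
    return {"title": doc.split("\n")[0], "author": ""}


def nonform_template(doc):
    lines = doc.split("\n")
    return {"title": lines[0], "author": lines[1] if len(lines) > 1 else ""}


def fraction_filled(fields):
    """Crude validator: share of fields that are non-empty."""
    return sum(1 for value in fields.values() if value) / len(fields)


doc = "Validation of Extracted Metadata\nSteven J. Zeil"
best_name, best_fields, best_score = select_best_extraction(
    doc, {"form": form_template, "nonform": nonform_template}, fraction_filled
)
```

No classifier runs before extraction in this approach; the validation score itself decides which template "fits" the document.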
Sample set of extracted metadata bindings
<metadata>
  <author>Steven J. Zeil</author>
  <organization>Old Dominion University Technical Report 2006-24</organization>
  <reportDate>September 12, 2006</reportDate>
  <title>Validation of Extracted Metadata</title>
  <abstract>A lengthy discussion of techniques for validating metadata is </abstract>
</metadata>
Validation template customized for the collection
<val:validate collection="dtic" xmlns:val="jelly:edu.odu.cs.dtic.validation.ValidationTagLibrary">
  <val:average>
    <val:field name="author">
      <val:min>
        <val:length/>
        <val:vocabulary/>
        <val:phrases length="2"/>
        <val:phrases length="3"/>
        <val:phrases length="4"/>
      </val:min>
    </val:field>
    <val:field name="organization">
      <val:min>
        <val:length/>
        <val:vocabulary/>
        <val:phrases length="2"/>
        <val:phrases length="3"/>
        <val:phrases length="4"/>
      </val:min>
    </val:field>
    <val:field name="reportNumber">
      <val:max>
        <val:regexp pattern="Technical Report +\d\d\d\d-\d\d"/>
      </val:max>
    </val:field>
    <val:field name="reportDate">
      <val:max>
        <val:dateFormat/>
      </val:max>
    </val:field>
    <val:field name="abstract">
      <val:min>
        <val:length/>
        <val:dictionary/>
      </val:min>
    </val:field>
  </val:average>
</val:validate>
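One way to read the template is as nested combinators: each field takes the min or max of its individual checks, and the overall score is the average across fields. The sketch below mirrors that reading in Python. It is not the Jelly tag library itself; the function names, the whole-value regex match, the word-list scoring rule, and the choice to score a missing field as 0.0 are all assumptions.

```python
import re


def regexp(pattern):
    """Score 1.0 if the whole value matches the pattern, else 0.0 (assumed)."""
    compiled = re.compile(pattern)
    return lambda text: 1.0 if compiled.fullmatch(text) else 0.0


def vocabulary(known_words):
    """Score = fraction of words found in a word list (assumed meaning)."""
    def check(text):
        words = text.lower().split()
        if not words:
            return 0.0
        return sum(w in known_words for w in words) / len(words)
    return check


def combine_min(*checks):
    """A field is only as good as its weakest check."""
    return lambda text: min(c(text) for c in checks)


def combine_max(*checks):
    """A field passes if any one check is satisfied."""
    return lambda text: max(c(text) for c in checks)


def average_fields(field_checks, metadata):
    """Mean of per-field scores; a missing field scores 0.0 (assumption)."""
    scores = {
        name: (check(metadata[name]) if name in metadata else 0.0)
        for name, check in field_checks.items()
    }
    return sum(scores.values()) / len(scores), scores


field_checks = {
    "reportNumber": combine_max(regexp(r"Technical Report +\d\d\d\d-\d\d")),
    "organization": combine_min(vocabulary({"old", "dominion", "university"})),
}
metadata = {
    "reportNumber": "Technical Report 2006-24",
    "organization": "Old Dominion University Technical Report 2006-24",
}
overall, per_field = average_fields(field_checks, metadata)
```

With this toy word list, the organization value scores low because half of its words belong to a report number rather than an organization name, which is the kind of mis-extraction the validation step is meant to surface.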
Annotated version of the metadata bindings
<metadata confidence="0.59">
  <author confidence="0.85">Steven J. Zeil</author>
  <organization confidence="0.42" warning="inappropriate vocabulary">Old Dominion University Technical Report 2006-24</organization>
  <reportDate confidence="1.0">September 12, 2006</reportDate>
  <title confidence="1.0">Validation of Extracted Metadata</title>
  <abstract confidence="0.3" warning="Unusually short"> A lengthy discussion of techniques for validating metadata is </abstract>
</metadata>
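A downstream consumer could use the annotations to pick out fields that need a second look. The following is a minimal sketch that parses the annotated bindings shown on this slide; the function name `fields_needing_review` and the 0.5 threshold are hypothetical choices, not part of the project.

```python
import xml.etree.ElementTree as ET

# Annotated bindings copied from the slide.
ANNOTATED = """<metadata confidence="0.59">
  <author confidence="0.85">Steven J. Zeil</author>
  <organization confidence="0.42" warning="inappropriate vocabulary">Old Dominion University Technical Report 2006-24</organization>
  <reportDate confidence="1.0">September 12, 2006</reportDate>
  <title confidence="1.0">Validation of Extracted Metadata</title>
  <abstract confidence="0.3" warning="Unusually short"> A lengthy discussion of techniques for validating metadata is </abstract>
</metadata>"""


def fields_needing_review(xml_text, threshold=0.5):
    """Return (field, confidence, warning) for fields scored below threshold."""
    root = ET.fromstring(xml_text)
    flagged = []
    for field in root:
        confidence = float(field.get("confidence"))
        if confidence < threshold:
            flagged.append((field.tag, confidence, field.get("warning")))
    return flagged


flagged = fields_needing_review(ANNOTATED)
```

Here the two flagged fields are exactly the ones carrying a `warning` attribute, so the warning text can be shown alongside the low score.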