Metadata Extraction

Metadata Extraction Progress Report 12/14/2006

Outline • System Overview • Detailed Structure with Recent Changes • IDM representation of documents • validation & post-hoc classification • Status of Recent & Upcoming Deliverables • Future Directions

System Overview

Detailed Structure with Recent Changes • Input Processing • Form Processing • Post Processing • Nonform Processing

Input Processing • OCR – Omnipage update radically changed XML output • Details later • Study of 10188 DTIC documents found none with POINT (Page Of INTerest) pages outside 1st and last 5 • suspended efforts at more sophisticated POINT page location

Form Processing • Bug fixes and Tuning • Omnipage XML converted to IDM • Main form template engine rewritten to work from IDM

Independent Document Model (IDM) • Platform independent Document Model • Motivation • Dramatic XML Schema Change between Omnipage 14 and 15 • Tie the template engine to stable specification • Protects from linking directly to specific OCR product • Allows us to include statistics for enhanced feature usage • Statistics (i.e. avgDocFontSize, avgPageFontSize, wordCount, avgDocWordCount, etc..)

Generating IDM • Use XSLT 2.0 stylesheets to transform • Supporting new OCR schema only requires generation of new XSLT stylesheet. -- Engine does not change • Chain a series of sheets to add functionality (CleanML) • Schema Specification Available (http://dtic.cs.odu.edu/devzone/IDM_Specification.doc)

IDM Usage • Each incoming XML schema requires specific XSLT 2.0 Stylesheet • Resulting IDM Doc used for “Form Based” templates • IDM transformed into CleanML for “Non-form” templates OmniPage 14 XML Doc Form Based Extraction docTreeModelOmni14.xsl docTreeModelOmni15.xsl docTreeModelCleanML.xsl OmniPage 15 XML Doc IDM XML Doc docTreeModelOther.xsl CleanML XML Doc Other OCR Output XML Doc Non Form Extraction

IDM Tool Status • Converters completed to generate IDM from Omnipage 14 and 15 XML • Omnipage 15 proved to have numerous errors in its representation of an OCR’d document • Consequently, not recommended • Form-based extraction engine revised to work from IDM • Non-form engine still works from our older “CleanXML” • convertor from IDM to CleanXML completed as stop-gap measure • direct use of IDM deferred pending review of other engine modifications

Post Processing • No significant changes

Nonform Processing • Bug fixes & tuning • Added new validation component • Post-hoc classification • replaces former a priori classification schemes

Validation • Given a set of extracted metadata • mark each field with a confidence value indicating how trustworthy the extracted value is • mark the set with a composite confidence score • Fields and Sets with low confidence scores may be referred for additional processing • automated post-processing • human intervention and correction

Validating Extracted Metadata • Techniques must be independent of the extraction method • A validation specification is written for each collection, combining • Field-specific validation rules • statistical models derived for each field of • text length • % of words from English dictionary • % of phrases from knowledge base prepared for that field • pattern matching

Sample Validation Specification • Combines results from multiple fields <val:validate collection="dtic" xmlns:val="jelly:edu.odu.cs.dtic.validation.ValidationTagLibrary"> <val:average> <val:field name="UnclassifiedTitle">...</val:field> <val:field name="PersonalAuthor">...</val:field> <val:field name="CorporateAuthor">...</val:field> <val:field name="ReportDate">...</val:field> </val:average> </val:validate>

Validation Spec: Field Tests • Each field is subjected to one or more tests … <val:field name="PersonalAuthor"> <val:average> <val:length/> <val:max> <val:phrases length="1"/> <val:phrases length="2"/> <val:phrases length="3"/> </val:max> </val:average> </val:field> <val:field name="ReportDate"> <val:reportFormat/> </val:field> ...

Sample Input Metadata Set <metadata> <UnclassifiedTitle>Thesis Title: The Military Extraterritorial Jurisdiction Act</UnclassifiedTitle> <PersonalAuthor>Name of Candidate: LCDR Kathleen A. Kerrigan</PersonalAuthor> <ReportDate>Accepted this 18th day of June 2004 by:</ReportDate> </metadata>

Sample Validator Output <metadata confidence="0.522"> <UnclassifiedTitle confidence="0.943">Thesis Title: The Military Extraterritorial Jurisdiction Act</UnclassifiedTitle> <PersonalAuthor confidence="0.622">Name of Candidate: LCDR Kathleen A. Kerrigan</PersonalAuthor> <ReportDate confidence="0.0" warning="ReportDate field does not match required pattern">Accepted this 18th day of June 2004 by:</ReportDate> </metadata>

Classification (a priori) • Previously, we had attempted various schemes for a priori classification • x-y trees • bin classification • Still investigating some • visual recognition

Post-Hoc Classification • Apply all templates to document • results in multiple candidate sets of metadata • Score each candidate using the validator • Select the best-scoring set

Demo & Experimental Results • Results of 157 documents http://128.82.7.147:8080/dtic/validsum157.jsp

Future Directions

Status of Recent & Upcoming Deliverables • DTIC - Classifier Development (9/19/06) • NASA - Enhance classification algorithm for two specific classes (10/31/2006) • NASA - Process study for inter-organizational collections – configuration software – (12/1/2006) • NASA - Enhance engine to recognize two major classes (Dec 15, 2006)

Classifier Development • DTIC - Classifier Development (9/19/06) • NASA - Enhance classification algorithm for two specific classes (10/31/2006) • Delayed by difficulties with a priori classification schemes • Now replaced by post hoc validation-based classification • some tuning of validation spec required • cleaning of metadata sources for statistical models • Demo posted 11/15/2006

Configuration • NASA - Process study for inter-organizational collections (12/1/2006) • extraction engines differentiate by collection-dependent template sets • validation specifications take collection name as a required attribute • used to locate distinct statistical models built for that collection • Regression test framework established • protects against changes or tuning to one collection degrading performance on others

Engine Enhancements • NASA - Enhance engine to recognize two major classes (12/15/2006) • in many ways, already satisfied • most planned enhancements deferred due to work on IDM • in short term, emphasis will be on expanding the template set to exploit existing engine features and availability of new post-hoc classifier

END • Questions?

Current System (Detailed)

Metadata Extraction