A Training and Classification System in Support of Automated Metadata Extraction PhD Proposal Paul K Flynn 14 May 2009
Motivation • bootstrap
Overview • Background • Metadata Extraction Overview • System Description • Nonform classification • Proposal • Classification • Background • Potential Methods • Complexities • Early experiments • Avenues of Investigation • Training System • Design Overview • Testing and Evaluation • Schedule
Metadata Extraction System • Large, heterogeneous, evolving collections of documents with diverse layout and structure • Defense Technical Information Center (DTIC) • U.S. Government Printing Office (GPO) – Environmental Protection Agency (EPA) • Two types of targets • Forms – documents containing a standardized form filled out with the metadata of interest • Non-forms – all others • We are developing a system that can be transitioned into a production system
Design concepts • Heterogeneity • A new document is classified, assigning it to a group of documents with similar layout – reducing the problem to multiple homogeneous collections • Associated with each class of document layouts is a template: a scripted description of how to associate blocks of text in the layout with metadata fields • Evolution • New classes of documents are accommodated by writing a new template • Templates are comparatively simple, so no lengthy retraining is required • Potentially rapid response to changes in the collection • Robustness • Validation techniques detect extraction problems and guide template selection
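To make the template idea concrete, here is a minimal sketch of what a layout template might look like. The real templates are scripted descriptions; this toy version maps each metadata field to a rule that selects one of the document's text blocks. All names, block attributes, and selection rules here are illustrative assumptions, not the system's actual template language.

```python
# Hypothetical template sketch: a template maps each metadata field to a
# selector function that picks one text block from the document's layout.

def extract(blocks, template):
    """Apply a template (field -> selector) to a list of text blocks."""
    metadata = {}
    for field, selector in template.items():
        match = next((b["text"] for b in blocks if selector(b)), None)
        metadata[field] = match
    return metadata

# Example: blocks carry a vertical position and text; the template picks
# fields by position on the page (positions are made up for illustration).
blocks = [
    {"y": 50,  "text": "Automated Metadata Extraction"},
    {"y": 120, "text": "Paul K Flynn"},
]
template = {
    "title":  lambda b: b["y"] < 100,          # title near the top of page 1
    "author": lambda b: 100 <= b["y"] < 200,   # author block below the title
}
print(extract(blocks, template))
```

Because a template is just data plus simple rules, supporting a new document class means writing one more template rather than retraining a model, which is the evolution property described above.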
[System diagram: Validation Scripts, Final Validation, Classification]
Post-Hoc Classification • Apply all templates to document • results in multiple candidate sets of metadata • Score each candidate using the validator • Select the best-scoring set
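The three steps above can be sketched in a few lines. The extractor and validator interfaces are assumed here; the toy stand-in functions at the bottom exist only to make the sketch runnable.

```python
# Minimal sketch of post hoc classification: apply every template, score each
# candidate metadata set with the validator, and keep the best-scoring one.

def post_hoc_classify(document, templates, extract, validate):
    """Return (best_template_name, best_metadata, best_score)."""
    best = (None, None, float("-inf"))
    for name, template in templates.items():
        candidate = extract(document, template)   # candidate metadata set
        score = validate(candidate)               # semantic validation score
        if score > best[2]:
            best = (name, candidate, score)
    return best

# Toy usage with stand-in extract/validate functions (illustrative only).
templates = {"rpt_layout": "T1", "thesis_layout": "T2"}
extract = lambda doc, t: {"title": doc if t == "T2" else ""}
validate = lambda md: len(md["title"])
name, md, score = post_hoc_classify("A Sample Thesis", templates, extract, validate)
print(name, score)   # thesis_layout 15
```

Note the cost implied by this loop: every template is applied to every document, which is what motivates the a priori classification step proposed next.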
Post Hoc Classification Shortcomings • [Example: selected template vs. correct template]
Classification (a priori) • Replace reliance on post hoc classification alone with a new a priori classification module • Continue to use the Validator to provide semantic verification of extracts
Focus: Non-Form Processing • Classification – compare document against known document layouts • Select template written for closest matching layout • Apply non-form extraction engine to document and template • Send to validator for scoring
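The four steps above can be sketched as a single pipeline. The similarity, extraction, and validation interfaces are assumptions standing in for the real modules; the classifier simply narrows the search to the closest-matching layouts so extraction and validation run on a few candidates instead of every template.

```python
# Sketch of the proposed non-form pipeline: rank known layouts by similarity,
# try only the top k, and return the best validated extraction.

def classify_then_extract(document, layouts, similarity, extract, validate, k=3):
    """Rank layouts by similarity to the document, validate the top k."""
    ranked = sorted(layouts, key=lambda name: similarity(document, name),
                    reverse=True)
    best = (None, None, float("-inf"))
    for name in ranked[:k]:
        metadata = extract(document, name)
        score = validate(metadata)
        if score > best[2]:
            best = (name, metadata, score)
    return best

# Toy usage: similarity scores and extraction results are hard-coded stubs.
sims = {"formA": 0.2, "layoutB": 0.9, "layoutC": 0.6}
similarity = lambda doc, name: sims[name]
extract = lambda doc, name: {"layout": name}
validate = lambda md: 1.0 if md["layout"] == "layoutB" else 0.0
print(classify_then_extract("doc", list(sims), similarity, extract, validate, k=2))
```

With k much smaller than the number of templates, the expensive extract/validate cycle runs only on likely candidates, while the validator still catches a wrong pre-classification.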
Overview • Background • Metadata Extraction Overview • System Description • Nonform classification • Proposal • Classification • Background • Potential Methods • Complexities • Early experiments • Avenues of Investigation • Training System • Design Overview • Testing and Evaluation • Schedule
Proposal • Investigate Classification methodologies and implementation • Create Training System for managing and creating templates • Specific questions we will attempt to answer: • Can the accuracy of the post hoc validation classification be improved by adding a pre-classification step to determine the most likely candidate templates? • Can we improve the reliability of final validation acceptance and rejection decisions by combining the layout similarity measures with the existing validation system? • Can we improve the process for creating document templates by building an integrated training system that can identify candidate groups for template development? • Can we significantly decrease the amount of time and manpower to tailor the system to a new collection?
Overview • Background • Metadata Extraction Overview • System Description • Nonform classification • Proposal • Classification • Background • Potential Methods • Complexities • Early experiments • Avenues of Investigation • Training System • Design Overview • Testing and Evaluation • Schedule
Previous Research • Extensive coverage in the literature • Model and methodology must match the purpose • No universal classifier • Many experiments use a limited number of classes • Approaches use visual similarity, logical structures, or textual features • Machine learning • Decision trees • Distance measures • Multiple classifiers
Issues and Complexities • Self-imposed constraints • Must be simple to maintain • Automated process for deploying from the Training System • Avoid dependencies on additional third-party packages
Second-Page Relevancy • [Example: Page 1 vs. Page 2 of a sample document]
Manual Classification • Documents may appear visually similar at thumbnail scale • Closer inspection reveals semantic differences
Incorrect Manual Classification • Position of the date field differs – detectable by post hoc classification
Initial experiments • Methods tested • Block distances – tries to match blocks and measures distances between them • MxN Overlap – divides the page into M×N bins and counts matching occupied bins • Common Vocab – finds words common to pages of a training class • Vocab1 – looks only at the 1st page • Vocab5 – looks at the first five pages • MXY tree – a variant that encodes page structure as a string and uses edit distance to measure similarity • MXY + MxN – sums the two methods' scores • Declared a match when at least 4 of 5 votes agreed
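Two of the pieces above are simple enough to sketch directly: the MxN Overlap measure and the 4-of-5 vote. The grid size, page dimensions, and block representation below are assumptions for illustration; the experiments' actual binning and normalization may differ.

```python
# Illustrative sketch: MxN Overlap divides the page into m*n bins and counts
# bins occupied by text in both pages; majority_match is the 4-of-5 vote.

def mxn_overlap(page_a, page_b, m=4, n=4, width=100, height=100):
    """Fraction of m*n bins occupied in both pages.
    Pages are lists of (x, y) text-block positions (assumed representation)."""
    def bins(page):
        return {(int(x * m / width), int(y * n / height)) for x, y in page}
    return len(bins(page_a) & bins(page_b)) / (m * n)

def majority_match(votes, needed=4):
    """Declare a match when at least `needed` method votes agree."""
    return sum(votes) >= needed

page_a = [(10, 10), (50, 50), (90, 90)]
page_b = [(12, 8), (52, 48)]            # two blocks near page_a's positions
print(mxn_overlap(page_a, page_b))      # -> 0.0625 (1 shared bin of 16)
print(majority_match([True, True, True, True, False]))   # -> True
```

Only the first two blocks of page_b fall in the same bins as page_a's blocks (the third lands one bin off), so the overlap is 1/16; the vote then combines such per-method judgments.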
Summary Sample Results • [Charts: MxN, Block Distance, MXY Tree, and MXY Tree + MxN]
Experimental Results • Precision = #correct / #answers returned • Recall = #correct / #total in class
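A small helper computing precision (#correct / #answers returned) and recall (#correct / #total in class) per the standard definitions, with guards for empty denominators:

```python
# Precision/recall for one class: `correct` is the number of true matches,
# `answers` the number of documents the classifier assigned to the class,
# `in_class` the number of documents actually belonging to the class.

def precision_recall(correct, answers, in_class):
    precision = correct / answers if answers else 0.0
    recall = correct / in_class if in_class else 0.0
    return precision, recall

print(precision_recall(8, 10, 16))   # -> (0.8, 0.5)
```

In this example the classifier returned 10 documents for the class, 8 of them correct (precision 0.8), out of 16 documents truly in the class (recall 0.5).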
Avenues of Investigation • Implement and test a variety of methods • Handling multiple-page classification • Multiple Classifiers • Methods of combining • Weighting the best method for each class • Deriving signatures for specifying combination rules • Clustering methods for bootstrapping
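One of the combination ideas listed above, per-class weighting of multiple classifiers, can be sketched as a weighted sum of per-method similarity scores. The method names and weight values below are hypothetical; in the proposed system the weights would be tuned per layout class during training.

```python
# Sketch of combining multiple classifiers: a weighted sum of per-method
# similarity scores, with weights chosen separately for each layout class.

def combined_score(scores, weights):
    """scores and weights are dicts keyed by method name; unknown methods
    contribute nothing."""
    return sum(weights.get(m, 0.0) * s for m, s in scores.items())

scores  = {"mxn": 0.7, "vocab": 0.4, "mxy": 0.9}   # per-method similarities
weights = {"mxn": 0.5, "vocab": 0.2, "mxy": 0.3}   # hypothetical class weights
print(round(combined_score(scores, weights), 2))   # -> 0.7
```

The 4-of-5 vote from the early experiments is the unweighted special case; letting each class favor the methods that separate it best is the refinement under investigation.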
Overview • Background • Metadata Extraction Overview • System Description • Nonform classification • Proposal • Classification • Background • Potential Methods • Complexities • Early experiments • Avenues of Investigation • Training System • Design Overview • Testing and Evaluation • Schedule
Training System • Manual classification • Inaccurate • Time consuming • Not dynamic • Need automated verification of extractions • Developers open original documents multiple times • Correctness open to interpretation • Regression testing only compares against previous extraction attempts • Need to allow multiple template writers to interact • Need to measure effects of individual templates against the whole
Metadata Training System • [Architecture diagram: Training Docs, Baseline Data Manager, Baseline Data Pool, Bootstrap Classifier, Clustered Docs, Candidates, Template Maker, Training Evaluator, Trained Pool, Templates and Classification Signatures, Production System]
Persistence Layer • Manage the complete set of training documents • Track baseline data input by users • Allows for independent confirmation • Change tracking • Track the subset of documents without templates developed • Track trained documents • Allow access by multiple users
Baseline Data Manager • GUI for establishing the baseline • Highlight-and-copy entry • Works on OCR output to account for recognition errors • Provides auditing and tracking of changes
Bootstrap Classifier • Dynamic GUI allowing: • Choice of classifier • Choice of clustering method • User can flag documents to ignore • User can designate a single document as the match target • Provides output to the Template Maker
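The clustering behind bootstrapping can be sketched with a simple greedy scheme: group documents whose similarity to a cluster's seed exceeds a threshold, so each cluster becomes a candidate group for template development. The threshold, the greedy strategy, and the word-overlap similarity below are all assumptions for illustration; the actual system would plug in the classifiers above.

```python
# Hypothetical bootstrapping sketch: greedy single-pass clustering by
# similarity to each cluster's seed document.

def greedy_cluster(docs, similarity, threshold=0.8):
    """Assign each doc to the first cluster whose seed is similar enough."""
    clusters = []          # list of lists; clusters[i][0] is the seed
    for doc in docs:
        for cluster in clusters:
            if similarity(doc, cluster[0]) >= threshold:
                cluster.append(doc)
                break
        else:
            clusters.append([doc])     # no match: doc seeds a new cluster
    return clusters

# Toy similarity: fraction of shared words (Jaccard) between two "documents".
def sim(a, b):
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

docs = ["report alpha 2009", "report alpha 2010", "thesis beta"]
print(greedy_cluster(docs, sim, threshold=0.4))
```

Large clusters identify layouts worth a template first, which is how the bootstrap classifier feeds candidate groups to the Template Maker.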
Template Maker • [Screenshot: results for a document; template being developed]
Template Maker • Final form depends on Engine replacement • Sample documents come from Bootstrap Classifier • Also prepares classification spec for export to production system
Training Evaluator • Verifies operation of template and classification spec • Identifies other documents in the pool which are correctly extracted • Moved to trained pool • Removed from further bootstrapping • Supports robust regression testing
Overview • Background • Metadata Extraction Overview • System Description • Nonform classification • Proposal • Classification • Background • Potential Methods • Complexities • Early experiments • Avenues of Investigation • Training System • Design Overview • Testing and Evaluation • Schedule
Testing and Evaluation • Proposed Tests • Evaluate effectiveness of pre-classification module • Evaluate effectiveness of adding similarity score to validation • Evaluate the effectiveness of the bootstrap classification • End to End evaluation • Other tests as needed
Evaluate effectiveness of pre-classification module • Use simple Baseline classifier to test • Select candidate templates • Adjust post hoc score • Can the accuracy of the post hoc validation classification be improved by adding a pre-classification step to determine the most likely candidate templates?
Evaluate effectiveness of adding similarity score to validation • Use Baseline classifier • Determine a baseline cluster of 5 documents to serve as the “signature” targets for measuring similarity. • Apply score to final validation • Remove templates to measure effects • Assessment: Evaluate the percent of documents which are correctly flagged as resolved. • Can we improve the reliability of final validation acceptance and rejection decisions by combining the layout similarity measures with the existing validation system?
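The accept/reject combination being evaluated above can be sketched as a weighted sum of the validation score and the layout-similarity score against a threshold. The weights and threshold here are placeholder assumptions; choosing them well is exactly what this test measures.

```python
# Sketch of folding layout similarity into the final validation decision:
# accept an extraction when the weighted combined score clears a threshold.

def accept(validation_score, similarity_score,
           w_val=0.7, w_sim=0.3, threshold=0.6):
    """Weights and threshold are illustrative, not the system's values."""
    return w_val * validation_score + w_sim * similarity_score >= threshold

print(accept(0.9, 0.2))   # strong validation, weak similarity -> True
print(accept(0.5, 0.3))   # weak on both -> False
```

The assessment then counts how often this combined decision flags documents as resolved correctly, compared with validation alone.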
Evaluate the effectiveness of the bootstrap classification • Measure the time to run completely through the training collection • Isolate template development time by providing an appropriate template • Assessment: Compare the time needed to classify the documents to the manual method • Can we improve the process for creating document templates by building an integrated training system that can identify candidate groups for template development?
End to End evaluation • Can we significantly decrease the amount of time and manpower to tailor the system to a new collection? • Create a mini-collection by downloading 100 documents from DTIC. • Assign two separate teams of trained template writers to create templates to correctly extract metadata from a minimum of 80 documents. • One team will perform the task using manual classification, a version of the Template Maker with the training system enhancements disabled and a production system (with no templates) for extraction. • The other team will use the complete training system. • Assessment: The teams will use logs to record work time. We will evaluate logs to assess time usage and conduct interviews to compile observations and impressions of the system.
Overview • Background • Metadata Extraction Overview • System Description • Nonform classification • Proposal • Classification • Background • Potential Methods • Complexities • Early experiments • Avenues of Investigation • Training System • Design Overview • Testing and Evaluation • Schedule