A Training and Classification System in Support of Automated Metadata Extraction PhD Proposal Paul K Flynn 14 May 2009
Motivation • bootstrap
Overview • Background • Metadata Extraction Overview • System Description • Nonform classification • Proposal • Classification • Background • Potential Methods • Complexities • Early experiments • Avenues of Investigation • Training System • Design Overview • Testing and Evaluation • Schedule
Metadata Extraction System • Large, heterogeneous, evolving collections of documents with diverse layout and structure • Defense Technical Information Center (DTIC) • U.S. Government Printing Office (GPO) – Environmental Protection Agency (EPA) • Two types of targets • Forms – documents containing a standardized form filled out with the metadata of interest • Non-forms – all others • We are developing a system that can be transitioned into a production system
Design concepts • Heterogeneity • A new document is classified, assigning it to a group of documents with similar layout – reducing the problem to multiple homogeneous collections • Associated with each class of document layouts is a template: a scripted description of how to associate blocks of text in the layout with metadata fields • Evolution • New classes of documents are accommodated by writing a new template • Templates are comparatively simple, so no lengthy retraining is required • Potentially rapid response to changes in the collection • Robustness • Validation techniques detect extraction problems and guide template selection
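To make the template idea concrete, here is a minimal sketch of what a layout template might look like. The real templates are scripted descriptions; this toy version maps each metadata field to a rule that selects one of the document's text blocks. All names, block attributes, and selection rules here are illustrative assumptions, not the system's actual template language.

```python
# Hypothetical template sketch: a template maps each metadata field to a
# selector function that picks one text block from the document's layout.

def extract(blocks, template):
    """Apply a template (field -> selector) to a list of text blocks."""
    metadata = {}
    for field, selector in template.items():
        match = next((b["text"] for b in blocks if selector(b)), None)
        metadata[field] = match
    return metadata

# Example: blocks carry a vertical position and text; the template picks
# fields by position on the page (positions are made up for illustration).
blocks = [
    {"y": 50,  "text": "Automated Metadata Extraction"},
    {"y": 120, "text": "Paul K Flynn"},
]
template = {
    "title":  lambda b: b["y"] < 100,          # title near the top of page 1
    "author": lambda b: 100 <= b["y"] < 200,   # author block below the title
}
print(extract(blocks, template))
```

Because a template is just data plus simple rules, supporting a new document class means writing one more template rather than retraining a model, which is the evolution property described above.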
[System diagram: Validation Scripts, Final Validation, Classification]
Post-Hoc Classification • Apply all templates to document • results in multiple candidate sets of metadata • Score each candidate using the validator • Select the best-scoring set
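The three steps above can be sketched in a few lines. The extractor and validator interfaces are assumed here; the toy stand-in functions at the bottom exist only to make the sketch runnable.

```python
# Minimal sketch of post hoc classification: apply every template, score each
# candidate metadata set with the validator, and keep the best-scoring one.

def post_hoc_classify(document, templates, extract, validate):
    """Return (best_template_name, best_metadata, best_score)."""
    best = (None, None, float("-inf"))
    for name, template in templates.items():
        candidate = extract(document, template)   # candidate metadata set
        score = validate(candidate)               # semantic validation score
        if score > best[2]:
            best = (name, candidate, score)
    return best

# Toy usage with stand-in extract/validate functions (illustrative only).
templates = {"rpt_layout": "T1", "thesis_layout": "T2"}
extract = lambda doc, t: {"title": doc if t == "T2" else ""}
validate = lambda md: len(md["title"])
name, md, score = post_hoc_classify("A Sample Thesis", templates, extract, validate)
print(name, score)   # thesis_layout 15
```

Note the cost implied by this loop: every template is applied to every document, which is what motivates the a priori classification step proposed next.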
Post Hoc Classification Shortcomings • [Example: selected template vs. correct template]
Classification (a priori) • Replace reliance on post hoc classification alone with a new a priori classification module • Continue to use the Validator to provide semantic verification of extracts
Focus: Non-Form Processing • Classification – compare document against known document layouts • Select template written for closest matching layout • Apply non-form extraction engine to document and template • Send to validator for scoring
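The four steps above can be sketched as a single pipeline. The similarity, extraction, and validation interfaces are assumptions standing in for the real modules; the classifier simply narrows the search to the closest-matching layouts so extraction and validation run on a few candidates instead of every template.

```python
# Sketch of the proposed non-form pipeline: rank known layouts by similarity,
# try only the top k, and return the best validated extraction.

def classify_then_extract(document, layouts, similarity, extract, validate, k=3):
    """Rank layouts by similarity to the document, validate the top k."""
    ranked = sorted(layouts, key=lambda name: similarity(document, name),
                    reverse=True)
    best = (None, None, float("-inf"))
    for name in ranked[:k]:
        metadata = extract(document, name)
        score = validate(metadata)
        if score > best[2]:
            best = (name, metadata, score)
    return best

# Toy usage: similarity scores and extraction results are hard-coded stubs.
sims = {"formA": 0.2, "layoutB": 0.9, "layoutC": 0.6}
similarity = lambda doc, name: sims[name]
extract = lambda doc, name: {"layout": name}
validate = lambda md: 1.0 if md["layout"] == "layoutB" else 0.0
print(classify_then_extract("doc", list(sims), similarity, extract, validate, k=2))
```

With k much smaller than the number of templates, the expensive extract/validate cycle runs only on likely candidates, while the validator still catches a wrong pre-classification.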
Overview • Background • Metadata Extraction Overview • System Description • Nonform classification • Proposal • Classification • Background • Potential Methods • Complexities • Early experiments • Avenues of Investigation • Training System • Design Overview • Testing and Evaluation • Schedule
Proposal • Investigate Classification methodologies and implementation • Create Training System for managing and creating templates • Specific questions we will attempt to answer: • Can the accuracy of the post hoc validation classification be improved by adding a pre-classification step to determine the most likely candidate templates? • Can we improve the reliability of final validation acceptance and rejection decisions by combining the layout similarity measures with the existing validation system? • Can we improve the process for creating document templates by building an integrated training system that can identify candidate groups for template development? • Can we significantly decrease the amount of time and manpower to tailor the system to a new collection?
Overview • Background • Metadata Extraction Overview • System Description • Nonform classification • Proposal • Classification • Background • Potential Methods • Complexities • Early experiments • Avenues of Investigation • Training System • Design Overview • Testing and Evaluation • Schedule
Previous Research • Extensive coverage in the literature • Model and methodology must match the purpose • No universal classifier • Many experiments use a limited number of classes • Approaches use visual similarity, logical structures, or textual features • Machine learning • Decision trees • Distance measures • Multiple classifiers
Issues and Complexities • Self-imposed constraints • Must be simple to maintain • Automated process for deploying from the Training System • Avoid dependencies on additional third-party packages
Second-Page Relevancy • [Example: Page 1 vs. Page 2 of a sample document]
Manual Classification • Documents may appear visually similar at thumbnail scale • Closer inspection reveals semantic differences
Incorrect Manual Classification • Position of the date field differs – detectable by post hoc classification
Initial experiments • Methods tested • Block distances – tries to match blocks and measures distances between them • MxN Overlap – divides the page into M×N bins and counts matching occupied bins • Common Vocab – finds words common to pages of a training class • Vocab1 – looks only at the 1st page • Vocab5 – looks at the first five pages • MXY tree – a variant that encodes page structure as a string and uses edit distance to measure similarity • MXY + MxN – sums the two methods' scores • Declared a match when at least 4 of 5 votes agreed
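Two of the pieces above are simple enough to sketch directly: the MxN Overlap measure and the 4-of-5 vote. The grid size, page dimensions, and block representation below are assumptions for illustration; the experiments' actual binning and normalization may differ.

```python
# Illustrative sketch: MxN Overlap divides the page into m*n bins and counts
# bins occupied by text in both pages; majority_match is the 4-of-5 vote.

def mxn_overlap(page_a, page_b, m=4, n=4, width=100, height=100):
    """Fraction of m*n bins occupied in both pages.
    Pages are lists of (x, y) text-block positions (assumed representation)."""
    def bins(page):
        return {(int(x * m / width), int(y * n / height)) for x, y in page}
    return len(bins(page_a) & bins(page_b)) / (m * n)

def majority_match(votes, needed=4):
    """Declare a match when at least `needed` method votes agree."""
    return sum(votes) >= needed

page_a = [(10, 10), (50, 50), (90, 90)]
page_b = [(12, 8), (52, 48)]            # two blocks near page_a's positions
print(mxn_overlap(page_a, page_b))      # -> 0.0625 (1 shared bin of 16)
print(majority_match([True, True, True, True, False]))   # -> True
```

Only the first two blocks of page_b fall in the same bins as page_a's blocks (the third lands one bin off), so the overlap is 1/16; the vote then combines such per-method judgments.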
Summary Sample Results • [Charts: MxN, Block Distance, MXY Tree, and MXY Tree + MxN]
Experimental Results • Precision = #correct / #answers returned • Recall = #correct / #total in class
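A small helper computing precision (#correct / #answers returned) and recall (#correct / #total in class) per the standard definitions, with guards for empty denominators:

```python
# Precision/recall for one class: `correct` is the number of true matches,
# `answers` the number of documents the classifier assigned to the class,
# `in_class` the number of documents actually belonging to the class.

def precision_recall(correct, answers, in_class):
    precision = correct / answers if answers else 0.0
    recall = correct / in_class if in_class else 0.0
    return precision, recall

print(precision_recall(8, 10, 16))   # -> (0.8, 0.5)
```

In this example the classifier returned 10 documents for the class, 8 of them correct (precision 0.8), out of 16 documents truly in the class (recall 0.5).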
Avenues of Investigation • Implement and test a variety of methods • Handling multiple-page classification • Multiple Classifiers • Methods of combining • Weighting the best method for each class • Deriving signatures for specifying combination rules • Clustering methods for bootstrapping
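One of the combination ideas listed above, per-class weighting of multiple classifiers, can be sketched as a weighted sum of per-method similarity scores. The method names and weight values below are hypothetical; in the proposed system the weights would be tuned per layout class during training.

```python
# Sketch of combining multiple classifiers: a weighted sum of per-method
# similarity scores, with weights chosen separately for each layout class.

def combined_score(scores, weights):
    """scores and weights are dicts keyed by method name; unknown methods
    contribute nothing."""
    return sum(weights.get(m, 0.0) * s for m, s in scores.items())

scores  = {"mxn": 0.7, "vocab": 0.4, "mxy": 0.9}   # per-method similarities
weights = {"mxn": 0.5, "vocab": 0.2, "mxy": 0.3}   # hypothetical class weights
print(round(combined_score(scores, weights), 2))   # -> 0.7
```

The 4-of-5 vote from the early experiments is the unweighted special case; letting each class favor the methods that separate it best is the refinement under investigation.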
Overview • Background • Metadata Extraction Overview • System Description • Nonform classification • Proposal • Classification • Background • Potential Methods • Complexities • Early experiments • Avenues of Investigation • Training System • Design Overview • Testing and Evaluation • Schedule
Training System • Manual classification • Inaccurate • Time consuming • Not dynamic • Need automated verification of extractions • Developers open original documents multiple times • Correctness open to interpretation • Regression testing only compares against previous extraction attempts • Need to allow multiple template writers to interact • Need to measure effects of individual templates against the whole
Metadata Training System • [Architecture diagram: Training Docs, Baseline Data Manager, Baseline Data Pool, Bootstrap Classifier, Clustered Docs, Candidates, Template Maker, Training Evaluator, Trained Pool, Templates and Classification Signatures, Production System]
Persistence Layer • Manage the complete set of training documents • Track baseline data input by users • Allows for independent confirmation • Change tracking • Track the subset of documents without templates developed • Track trained documents • Allow access by multiple users
Baseline Data Manager • GUI for establishing the baseline • Highlight-and-copy entry • Works on OCR output to account for recognition errors • Provides auditing and tracking of changes
Bootstrap Classifier • Dynamic GUI allowing: • Choice of classifier • Choice of clustering method • User can flag documents to ignore • User can designate a single document as the match target • Provides output to the Template Maker
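The clustering behind bootstrapping can be sketched with a simple greedy scheme: group documents whose similarity to a cluster's seed exceeds a threshold, so each cluster becomes a candidate group for template development. The threshold, the greedy strategy, and the word-overlap similarity below are all assumptions for illustration; the actual system would plug in the classifiers above.

```python
# Hypothetical bootstrapping sketch: greedy single-pass clustering by
# similarity to each cluster's seed document.

def greedy_cluster(docs, similarity, threshold=0.8):
    """Assign each doc to the first cluster whose seed is similar enough."""
    clusters = []          # list of lists; clusters[i][0] is the seed
    for doc in docs:
        for cluster in clusters:
            if similarity(doc, cluster[0]) >= threshold:
                cluster.append(doc)
                break
        else:
            clusters.append([doc])     # no match: doc seeds a new cluster
    return clusters

# Toy similarity: fraction of shared words (Jaccard) between two "documents".
def sim(a, b):
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / max(len(wa | wb), 1)

docs = ["report alpha 2009", "report alpha 2010", "thesis beta"]
print(greedy_cluster(docs, sim, threshold=0.4))
```

Large clusters identify layouts worth a template first, which is how the bootstrap classifier feeds candidate groups to the Template Maker.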
Template Maker • [Screenshot: results for a document; template being developed]
Template Maker • Final form depends on Engine replacement • Sample documents come from Bootstrap Classifier • Also prepares classification spec for export to production system
Training Evaluator • Verifies operation of template and classification spec • Identifies other documents in the pool which are correctly extracted • Moved to trained pool • Removed from further bootstrapping • Supports robust regression testing
Overview • Background • Metadata Extraction Overview • System Description • Nonform classification • Proposal • Classification • Background • Potential Methods • Complexities • Early experiments • Avenues of Investigation • Training System • Design Overview • Testing and Evaluation • Schedule
Testing and Evaluation • Proposed Tests • Evaluate effectiveness of pre-classification module • Evaluate effectiveness of adding similarity score to validation • Evaluate the effectiveness of the bootstrap classification • End to End evaluation • Other tests as needed
Evaluate effectiveness of pre-classification module • Use simple Baseline classifier to test • Select candidate templates • Adjust post hoc score • Can the accuracy of the post hoc validation classification be improved by adding a pre-classification step to determine the most likely candidate templates?
Evaluate effectiveness of adding similarity score to validation • Use Baseline classifier • Determine a baseline cluster of 5 documents to serve as the “signature” targets for measuring similarity. • Apply score to final validation • Remove templates to measure effects • Assessment: Evaluate the percent of documents which are correctly flagged as resolved. • Can we improve the reliability of final validation acceptance and rejection decisions by combining the layout similarity measures with the existing validation system?
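The accept/reject combination being evaluated above can be sketched as a weighted sum of the validation score and the layout-similarity score against a threshold. The weights and threshold here are placeholder assumptions; choosing them well is exactly what this test measures.

```python
# Sketch of folding layout similarity into the final validation decision:
# accept an extraction when the weighted combined score clears a threshold.

def accept(validation_score, similarity_score,
           w_val=0.7, w_sim=0.3, threshold=0.6):
    """Weights and threshold are illustrative, not the system's values."""
    return w_val * validation_score + w_sim * similarity_score >= threshold

print(accept(0.9, 0.2))   # strong validation, weak similarity -> True
print(accept(0.5, 0.3))   # weak on both -> False
```

The assessment then counts how often this combined decision flags documents as resolved correctly, compared with validation alone.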
Evaluate the effectiveness of the bootstrap classification • Measure the time to run completely through the training collection • Isolate template development time by providing an appropriate template • Assessment: Compare the time needed to classify the documents to the manual method • Can we improve the process for creating document templates by building an integrated training system that can identify candidate groups for template development?
End to End evaluation • Can we significantly decrease the amount of time and manpower to tailor the system to a new collection? • Create a mini-collection by downloading 100 documents from DTIC. • Assign two separate teams of trained template writers to create templates to correctly extract metadata from a minimum of 80 documents. • One team will perform the task using manual classification, a version of the Template Maker with the training system enhancements disabled and a production system (with no templates) for extraction. • The other team will use the complete training system. • Assessment: The teams will use logs to record work time. We will evaluate logs to assess time usage and conduct interviews to compile observations and impressions of the system.
Overview • Background • Metadata Extraction Overview • System Description • Nonform classification • Proposal • Classification • Background • Potential Methods • Complexities • Early experiments • Avenues of Investigation • Training System • Design Overview • Testing and Evaluation • Schedule