Data Mining Applied to Document Imaging Jeff Rekoske
Agenda • Introduction • Problem Definition • Solution and Methodology • Progress Report • Tools • Techniques Applied from CSC-288 • Lessons Learned/Reinforced • Summary
Introduction • Employed as SW Developer and DBA on document imaging project • Access to OCR statistics • Management staff has a few questions that can be answered by analysis of existing data
Problem Definition • Two Parts • Management questions • Data mining demonstration
Management Questions • Result of interviews • Fairly basic • What forms are processed the most? • What are the recognition rates for the top forms? • What is the percentage of forms that were presented to an operator for keying?
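The three management questions map onto simple aggregate queries. The sketch below is illustrative only: the deck does not show the real schema, so the `document` and `ocr_field` tables, their columns, and the sample rows are all hypothetical, and SQLite stands in for the Oracle database used on the project.

```python
import sqlite3

# Hypothetical schema and sample rows; the real tables are not shown in the deck.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE document (doc_id INTEGER PRIMARY KEY, form_type TEXT,
                       sent_to_operator INTEGER);
CREATE TABLE ocr_field (doc_id INTEGER, field_name TEXT, recognized INTEGER);
""")
conn.executemany("INSERT INTO document VALUES (?, ?, ?)", [
    (1, "A", 0), (2, "A", 1), (3, "A", 0), (4, "B", 1),
])
conn.executemany("INSERT INTO ocr_field VALUES (?, ?, ?)", [
    (1, "name", 1), (1, "date", 1),
    (2, "name", 1), (2, "date", 0),
    (3, "name", 1), (3, "date", 1),
    (4, "name", 0), (4, "date", 1),
])

# Question 1: which forms are processed the most?
forms_by_volume = conn.execute("""
    SELECT form_type, COUNT(*) AS n_docs
    FROM document
    GROUP BY form_type
    ORDER BY n_docs DESC, form_type
""").fetchall()

# Question 2: field-level recognition rate per form type.
recognition_rates = conn.execute("""
    SELECT d.form_type, AVG(f.recognized) AS rate
    FROM document d JOIN ocr_field f ON f.doc_id = d.doc_id
    GROUP BY d.form_type
    ORDER BY d.form_type
""").fetchall()

# Question 3: percentage of forms presented to an operator for keying.
pct_keyed = conn.execute(
    "SELECT 100.0 * SUM(sent_to_operator) / COUNT(*) FROM document"
).fetchone()[0]

print(forms_by_volume, recognition_rates, pct_keyed)
```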
Data Mining Demonstration • Purpose is to show the usefulness of data mining techniques. • Prediction of rates for new forms • Characteristics of highly recognized forms • Use mined data to develop new forms
Solution • Data mart • Answer management questions • Provide data for mining activities
Methodology • Choose a small timeframe to sample data • September – October 2004 • Use ETL to load data • Relatively “clean” process due to data location • Apply SQL statements to data mart to answer management questions
Methodology (continued) • Extract data from data mart to create WEKA files • Attribute-Relation File Format (ARFF) • Use WEKA to create classifier model using C4.5 algorithm (pass/fail recognition) • Validate model with 10-fold cross validation
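As a rough illustration of the ARFF step, the sketch below writes extracted rows in Attribute-Relation File Format. The project used PL/SQL scripts for this; Python is used here only for illustration, and the relation name, attribute names, and sample rows are hypothetical.

```python
def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes is a list of (name, kind) pairs, where kind is either an
    ARFF type string such as "numeric" or a list of nominal values.
    Values containing spaces or commas would need quoting, which this
    sketch does not handle.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            kind = "{" + ",".join(kind) + "}"
        lines.append(f"@attribute {name} {kind}")
    lines += ["", "@data"]
    for row in rows:
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical attributes; the class attribute is the pass/fail recognition result.
arff_text = to_arff(
    "ocr_fields",
    [("char_count", "numeric"),
     ("handprinted", ["yes", "no"]),
     ("result", ["pass", "fail"])],
    [(12, "no", "pass"), (30, "yes", "fail")],
)
print(arff_text)
```

WEKA conventionally treats the last attribute as the class, which is why `result` is listed last here.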
Progress Report • First part (management questions) complete • 14,210 imaged documents • 865,409 OCR fields • View created that joins tables • Lets non-technical personnel write basic queries • Management is pleased with results
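A view of this kind might look like the sketch below, which hides the join behind one flat object so a query needs no join syntax. The table, column, and view names are hypothetical, and SQLite stands in for Oracle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical base tables; the real ones are not shown in the deck.
CREATE TABLE document (doc_id INTEGER PRIMARY KEY, form_type TEXT);
CREATE TABLE ocr_field (doc_id INTEGER, field_name TEXT, recognized INTEGER);

-- The view pre-joins the tables into one flat result set.
CREATE VIEW ocr_results AS
    SELECT d.doc_id, d.form_type, f.field_name, f.recognized
    FROM document d JOIN ocr_field f ON f.doc_id = d.doc_id;
""")
conn.executemany("INSERT INTO document VALUES (?, ?)", [(1, "A"), (2, "B")])
conn.executemany("INSERT INTO ocr_field VALUES (?, ?, ?)", [
    (1, "name", 1), (1, "date", 0), (2, "name", 0), (2, "date", 0),
])

# A basic query against the view: unrecognized fields per form type.
unrecognized = conn.execute("""
    SELECT form_type, COUNT(*)
    FROM ocr_results
    WHERE recognized = 0
    GROUP BY form_type
    ORDER BY form_type
""").fetchall()
print(unrecognized)
```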
Progress Report (continued) • Second part (WEKA classifier) in progress • ARFF generation scripts complete • Need to run ARFF files through WEKA • Need to cross-validate results
Tools • Oracle 8i RDBMS • Oracle PL/SQL scripting language • WEKA implementation of C4.5 classifier • WEKA cross validation
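For readers unfamiliar with C4.5, the sketch below shows only its split criterion, the gain ratio, for a nominal attribute. It is not WEKA's implementation and omits the rest of the algorithm (continuous attributes, missing values, pruning); the attribute values and labels are made up.

```python
from collections import Counter
from math import log2


def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(items)
    return -sum((n / total) * log2(n / total) for n in Counter(items).values())


def gain_ratio(values, labels):
    """Information gain of splitting on values, divided by the split information."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Expected entropy of the class labels after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    info_gain = entropy(labels) - remainder
    split_info = entropy(values)
    return info_gain / split_info if split_info > 0 else 0.0


# Hypothetical data: does "handprinted" separate pass from fail?
handprinted = ["yes", "yes", "no", "no"]
result = ["fail", "fail", "pass", "pass"]
print(gain_ratio(handprinted, result))
```

At each node the tree builder evaluates candidate attributes this way and prefers those with a high gain ratio.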
Techniques Applied from CSC-288 • Data Mart • Snowflake Schema • ETL • OLAP Operations
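To make the snowflake schema and OLAP roll-up concrete, here is a minimal sketch. The fact and dimension tables, their columns, and the numbers are hypothetical; the deck does not show the actual data mart design, and SQLite stands in for Oracle.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflake: the form dimension is normalized into a separate category table.
CREATE TABLE form_category (category_id INTEGER PRIMARY KEY, category_name TEXT);
CREATE TABLE form (form_id INTEGER PRIMARY KEY, form_name TEXT,
                   category_id INTEGER REFERENCES form_category(category_id));
-- Fact table: one row per form per month, with additive measures.
CREATE TABLE ocr_fact (form_id INTEGER REFERENCES form(form_id),
                       scan_month TEXT, fields_read INTEGER,
                       fields_recognized INTEGER);
""")
conn.executemany("INSERT INTO form_category VALUES (?, ?)",
                 [(1, "claims"), (2, "enrollment")])
conn.executemany("INSERT INTO form VALUES (?, ?, ?)",
                 [(10, "F-10", 1), (11, "F-11", 1), (20, "F-20", 2)])
conn.executemany("INSERT INTO ocr_fact VALUES (?, ?, ?, ?)", [
    (10, "2004-09", 100, 90),
    (11, "2004-09", 100, 70),
    (20, "2004-10", 50, 25),
])

# Roll-up: aggregate from individual forms up to the category level.
rollup = conn.execute("""
    SELECT c.category_name,
           SUM(f.fields_recognized) * 1.0 / SUM(f.fields_read) AS rate
    FROM ocr_fact f
    JOIN form fm ON fm.form_id = f.form_id
    JOIN form_category c ON c.category_id = fm.category_id
    GROUP BY c.category_name
    ORDER BY c.category_name
""").fetchall()
print(rollup)
```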
Techniques Applied (continued) • Classification • C4.5 Algorithm • Supervised Learning • Credibility • Cross-Validation
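The idea behind 10-fold cross-validation can be sketched as follows: split the data into ten parts, then hold out each part once for testing while training on the other nine. This is a plain, unstratified illustration, not WEKA's code; the function name and the seed are assumptions made for the example.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


splits = list(k_fold_indices(25, k=10))
for train, test in splits:
    # A classifier would be trained on `train` and scored on `test` here.
    print(len(train), len(test))
```

Averaging the ten test scores gives an error estimate in which every instance is used for testing exactly once.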
Lessons Learned/Reinforced • Get firm requirements (if possible) • Data marts can get large quickly • OLAP operations should be run offline, away from the OLTP system • Demonstrations are useful for explaining concepts
Summary • Application of knowledge from CSC-288 to my work • Data mart can be used to answer multiple questions without affecting OLTP processing • Remaining goal: demonstrate using the data mart to create a classification model
References • "Data Mining: Concepts and Techniques," by Jiawei Han and Micheline Kamber, Morgan Kaufmann, San Francisco, 2001. • "Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations," by Ian H. Witten and Eibe Frank, Morgan Kaufmann, San Francisco, 2000.