Recognizing Table Structure from the Extracted Cells of Genealogical Microfilm
A Thesis Proposal Presented to the Department of Computer Science, Brigham Young University
Kenneth Martin Tubbs Jr.
Motivation
• Millions of people want genealogical information
• Acquiring microfilm is expensive and time-consuming
Problem
• Searching microfilm by hand is slow, error-prone, and tedious
• Extraction by hand requires enormous amounts of time and manpower
Problem
• Tables have different layouts and styles
• Tables contain different records
• Tables lack information and are ambiguous
Related Work
• Current work exploits the geometric properties of tables
• Approaches include regular expressions, grammars, probabilistic models, and templates
• These approaches ignore the ontological constraints of the information
Input Features
• Coordinates of each cell
• Printed text of each cell
• Whether or not each cell is empty
XML Input File
<cell rectangle="335,114,521,172" printed_text="NAME and Surname of each Person" empty="0" />
…
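A minimal sketch of reading the three input features from such a file. The slide shows only a single `<cell>` element, so the wrapping `<table>` root, the second sample cell, the `Cell` class, and the reading of `rectangle` as left, top, right, bottom are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Cell:
    # Assumed coordinate order: left, top, right, bottom.
    left: int
    top: int
    right: int
    bottom: int
    printed_text: str
    empty: bool


def parse_cells(xml_text):
    """Collect every <cell> element into a Cell record."""
    root = ET.fromstring(xml_text)
    cells = []
    for node in root.iter("cell"):
        left, top, right, bottom = (int(v) for v in node.get("rectangle").split(","))
        cells.append(
            Cell(left, top, right, bottom,
                 node.get("printed_text", ""),
                 node.get("empty") == "1")
        )
    return cells


# The <table> root and the second cell are hypothetical; only the first
# cell appears on the slide.
SAMPLE = """<table>
  <cell rectangle="335,114,521,172" printed_text="NAME and Surname of each Person" empty="0" />
  <cell rectangle="335,173,521,231" printed_text="" empty="1" />
</table>"""

cells = parse_cells(SAMPLE)
```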
Method
• Input: XML input file (preprocessed microfilm image) and genealogical ontology
• Algorithm: Collect Evidence, Apply Rules, Verify Results
• Output: SQL insert statements
Cell Types
• Label Cells
• Print Value Cells
• Empty Cells
Genealogical Ontology
[Figure: ontology diagram with object sets Age, Name, Gender, Address, Family, and Person; relationships are annotated with cardinalities (*, 1, 1.1, 1.3, 4.3).]
Collect Evidence: Extract Features
• The algorithm extracts features
• Each feature supports or refutes a geometric or ontological relationship
• Extracted features yield a confidence value between 0 and 1
Collect Evidence: 5 Relationships
• Associate value cells with label cells
• Associate label cells with label cells
• Associate value cells with value cells
• Match label cells to object sets in the genealogical ontology
• Identify label cells that factor other label cells
Collect Evidence: Evidence Matrix
[Figure: an evidence matrix indexed by label cells and value cells, with example confidence entries .75, .10, .20, .32.]
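A rough sketch of how one label-value evidence matrix could be filled with confidences between 0 and 1. The slides do not say which features are used, so the `horizontal_overlap` feature below is a hypothetical stand-in, and the second column of cells is invented sample data.

```python
def horizontal_overlap(a, b):
    """Hypothetical feature: share of the narrower cell's width that the
    two cells have in common, as a confidence in [0, 1].
    Cells are (left, top, right, bottom) tuples."""
    shared = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    narrower = min(a[2] - a[0], b[2] - b[0])
    return shared / narrower if narrower > 0 else 0.0


def label_value_matrix(labels, values):
    """Build LV, where LV[j][i] is the evidence that value cell i
    belongs to label cell j."""
    return [[horizontal_overlap(label, value) for value in values]
            for label in labels]


# First column uses coordinates from the slides; the second column is made up.
labels = [(335, 114, 521, 172), (522, 114, 600, 172)]
values = [(335, 173, 521, 231), (522, 173, 600, 231)]

lv = label_value_matrix(labels, values)
```

In this toy input each value cell sits directly under one label cell, so the matrix comes out as the identity. Real evidence would presumably blend several features into the intermediate values shown in the figure.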
Apply Rules
• A set of correlation rules associates the values of the evidence matrices
• The algorithm iterates over the set of correlation rules
Apply Rules: A Rule
[Figure: a value-value evidence entry of .90 shown next to a label-value evidence matrix with entries .75, .10, .20, .32; the entries .75 and .32 are highlighted.]
For label cell j and value cells i and k:
$$\min(LV_{ji}, LV_{jk}) \leftarrow \min\{\, \min(LV_{ji}, LV_{jk}) \cdot (VV_{ik} + 0.3),\ \max(LV_{ji}, LV_{jk}) \,\}$$
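One possible reading of this rule, sketched under assumptions: the weaker of two label-value entries for the same label is scaled by the value-value evidence plus 0.3 and capped at the stronger entry. The function name, the list-of-lists matrices, and the choice to write the result back into the weaker entry are illustrative, not taken from the slides.

```python
def apply_value_value_rule(lv, vv, j, i, k, bonus=0.3):
    """Raise the weaker of LV[j][i] and LV[j][k] when value cells i and k
    are strongly related, never exceeding the stronger entry."""
    a, b = lv[j][i], lv[j][k]
    weaker, stronger = min(a, b), max(a, b)
    updated = min(weaker * (vv[i][k] + bonus), stronger)
    # Assumption: the result replaces whichever entry was the weaker one.
    if a <= b:
        lv[j][i] = updated
    else:
        lv[j][k] = updated
    return updated


# Values from the slide: label-value evidence .75 and .32,
# value-value evidence .90.
lv = [[0.75, 0.32]]
vv = [[1.0, 0.90],
      [0.90, 1.0]]

new_value = apply_value_value_rule(lv, vv, j=0, i=0, k=1)
```

With these numbers the weaker entry rises from .32 to roughly .384, since .32 * (.90 + .3) stays below the cap of .75.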
Apply Rules: Factoring
• Observed cardinality ratio in the table: [Name] per [Address] = 9 / 2 = 4.5
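Apply Rules: Genealogical Ontology
• Expected cardinality ratio from the ontology: [Name] per [Address] = 1 * 4.3 * 1.1 = 4.73
[Figure: the genealogical ontology diagram, with object sets Age, Name, Gender, Address, Family, and Person and relationship cardinalities *, 1, 1.1, 1.3, and 4.3.]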
Apply Rules: A Factoring Rule
• Compare the expected cardinality ratio, O_ij, for a pair of label cells with the observed cardinality ratio, N_i/N_j.
$$FM_{ij} \leftarrow FM_{ij} \cdot \left(1 - \left|O_{ij} - N_i/N_j\right| + C\right)$$
With $O_{ij} = 4.73$, $N_i/N_j = 4.5$, and $C = 0.5$: $FM_{ij} \cdot (1 - |4.73 - 4.5| + 0.5) = FM_{ij} \cdot 1.27$
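The arithmetic on the factoring slides can be checked with a short sketch. The function names are hypothetical, and treating the expected ratio as a product of cardinalities along an ontology path is inferred from the 1 * 4.3 * 1.1 example rather than stated outright.

```python
import math


def expected_ratio(path_cardinalities):
    """Expected cardinality ratio O_ij: product of the cardinalities
    along the ontology path between two object sets (assumed reading)."""
    return math.prod(path_cardinalities)


def factoring_update(fm, expected, n_i, n_j, c=0.5):
    """Scale the factoring evidence FM_ij by how closely the observed
    ratio N_i / N_j matches the expected ratio O_ij."""
    observed = n_i / n_j
    return fm * (1 - abs(expected - observed) + c)


o_ij = expected_ratio([1, 4.3, 1.1])               # about 4.73
fm_ij = factoring_update(1.0, o_ij, n_i=9, n_j=2)  # about 1.27
```

Note that the multiplier falls below 1 once the mismatch exceeds C and turns negative once it exceeds 1 + C. The slides do not say whether the result is clamped, so this sketch leaves it unclamped.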
Score Results
• Score the extracted record structure
• Present the results to a human user for verification
Store Results: Database
• Create SQL insert statements to store table cell coordinates
INSERT INTO Person (Name) VALUES ('335,114,521,172')
INSERT INTO Person (Name) VALUES ('335,173,521,231')
[Figure: database table with columns … Name, Family …]
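A small sketch of emitting statements in the form shown above and loading them into a database. The `insert_statement` helper and the one-column `Person` schema are assumptions for illustration; the slides do not give the actual schema.

```python
import sqlite3


def insert_statement(object_set, attribute, rectangle):
    """Format one INSERT that stores a cell's coordinates as text."""
    coords = ",".join(str(v) for v in rectangle)
    return f"INSERT INTO {object_set} ({attribute}) VALUES ('{coords}')"


statements = [
    insert_statement("Person", "Name", (335, 114, 521, 172)),
    insert_statement("Person", "Name", (335, 173, 521, 231)),
]

# Hypothetical schema, in memory only, to show that the statements execute.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (Name TEXT)")
for statement in statements:
    conn.execute(statement)

rows = [row[0] for row in conn.execute("SELECT Name FROM Person ORDER BY rowid")]
conn.close()
```

String formatting is used here only because the stated output is SQL text. Code that inserts directly would normally use parameterized queries.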
Method
• Input: XML input file (preprocessed microfilm image) and genealogical ontology
• Algorithm: Collect Evidence, Apply Rules, Store Results
• Output: SQL insert statements
Measurements
• 5 to 7 concept tables
• Training set: 5 real-world tables
• Test set: 15 real-world tables
• Precision, recall, and accuracy of the cells written in the SQL statements
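As an illustration of the cell-level measures, precision and recall can be computed by comparing the set of cell assignments written to SQL against a hand-labeled set. The tuple layout and sample data below are hypothetical. Accuracy is left out because the slides do not define what counts as a true negative.

```python
def precision_recall(predicted, expected):
    """Precision and recall over sets of (object_set, attribute, rectangle)
    assignments."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Made-up example: two of three predicted assignments are correct,
# and four assignments were expected.
predicted = [
    ("Person", "Name", "335,114,521,172"),
    ("Person", "Name", "335,173,521,231"),
    ("Person", "Age", "335,232,521,290"),
]
expected = [
    ("Person", "Name", "335,114,521,172"),
    ("Person", "Name", "335,173,521,231"),
    ("Person", "Name", "335,232,521,290"),
    ("Person", "Name", "335,291,521,349"),
]

precision, recall = precision_recall(predicted, expected)
```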
Contributions
• Exploits both the constraints of a genealogical ontology and the geometry of the table
• Combines extracted features using correlation rules
Delimitations
• Tables of rows and columns
• Genealogical domain
• English-language documents
• Tables that do not span multiple documents
Artifacts
• Application/demo in the Java programming language