Recognizing Records from the Extracted Cells of Genealogical Microfilm Tables A Thesis Submitted to the Faculty of Brigham Young University Kenneth Martin Tubbs Jr.
Motivation • Millions of people want genealogical information • Acquiring microfilm is expensive and time-consuming
Extraction Problem • Searching microfilm by hand is slow, error-prone, and tedious • Extraction by hand requires enormous amounts of time and manpower
Difficulties • Tables have different layouts and styles • Tables contain different records • Tables do not use a uniform schema • Tables lack information and are ambiguous
Related Work • Current work exploits the geometric properties of tables • Techniques include regular expressions, grammars, probabilistic models, and templates • These approaches ignore the ontological constraints of genealogical information
Contributions • Exploit both ontological and geometric constraints • Identify complex records • Work with tables with hand-written values
Method • Input: XML input file (preprocessed microfilm image) and genealogical ontology • Algorithm: Generate Confidences, Enforce Constraints, Verify Results • Output: SQL insert statements
Training Set • 25 Tables from 5 different microfilm rolls • Used to: • Identify relationships between table cells • Create genealogical ontology • Define features to extract • Generate rules (constraints)
Input: Microfilm Table • Input Features • Coordinates of each cell. • Printed text for label cells. • Whether or not each value cell is empty.
Input: Microfilm Table
<index source="0444770/0444770_2.gif" ontology="ontology.xml">
  <cell rect="7,131,62,261" printed_text="Dwelling-houses number in the order of visitation." empty="0" />
  <cell rect="61,132,118,260" printed_text="Families number in order of visitation." empty="0" />
  <cell rect="119,132,436,261" printed_text="The Name of every Person whose usual place of abode on the first day of June, 1840, was in this family." empty="0" />
  <cell rect="62,260,120,295" printed_text="2" empty="0" />
  <cell rect="118,260,436,298" printed_text="3" empty="0" />
  <cell rect="7,458,62,497" printed_text="" empty="1" />
  . . .
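As an illustration of how the three input features could be read, here is a minimal Python sketch. It assumes `rect` holds `left,top,right,bottom` pixel coordinates and that `empty="1"` marks an empty value cell; both readings are inferred from the sample, not stated on the slide. The `Cell` class and `parse_cells` function are hypothetical names, and `SAMPLE` is a shortened, closed copy of the elided input.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Shortened, closed sample in the shape of the input above; the real file
# contains more cells, which the slide elides.
SAMPLE = """
<index source="0444770/0444770_2.gif" ontology="ontology.xml">
  <cell rect="7,131,62,261" printed_text="Dwelling-houses number in the order of visitation." empty="0" />
  <cell rect="62,260,120,295" printed_text="2" empty="0" />
  <cell rect="7,458,62,497" printed_text="" empty="1" />
</index>
"""


@dataclass(frozen=True)
class Cell:
    # Assumed coordinate order: left, top, right, bottom (pixels).
    left: int
    top: int
    right: int
    bottom: int
    printed_text: str
    empty: bool


def parse_cells(xml_text: str) -> list[Cell]:
    """Read every <cell> element into a Cell record."""
    root = ET.fromstring(xml_text.strip())
    cells = []
    for node in root.iter("cell"):
        left, top, right, bottom = (int(v) for v in node.get("rect").split(","))
        cells.append(
            Cell(left, top, right, bottom,
                 node.get("printed_text", ""), node.get("empty") == "1")
        )
    return cells


cells = parse_cells(SAMPLE)
```

Cells with non-empty `printed_text` would then be treated as label cells, and the remaining cells as value cells.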
Genealogical Ontology
<Ontology>
  <ObjectSet id="0" name="Person" syn="" lex="0"/>
  <ObjectSet id="1" name="Family" syn="families" lex="0"/>
  <ObjectSet id="2" name="Event" syn="" lex="0"/>
  <ObjectSet id="3" name="Age" syn="age birthday" lex="1"/>
  <ObjectSet id="4" name="Relationship" syn="relationship relation" lex="1"/>
  <ObjectSet id="5" name="Full Name" syn="full name whom who" lex="1"/>
  <ObjectSet id="6" name="First Name" syn="first given christian" lex="1"/>
  <ObjectSet id="7" name="Middle Name(s)" syn="middle initial" lex="1"/>
  <ObjectSet id="8" name="Last Name" syn="last surname" lex="1"/>
  <ObjectSet id="9" name="Title(s)" syn="title" lex="1"/>
  . . .
Generate Confidences • Estimate a confidence for each relationship between a pair of cells • Confidence values lie between 0 and 1
Relationships • A label cell describes a value cell • Value cells in same row or column • Label cells form a multi-level label • A label cell maps to an object set • Identify factoring
Label Cell and Value Cell • A continuous path between a label cell and a value cell • Confidence = 1 if a path exists, 0 if no path exists
Label Cell and Value Cell • Preferences for label-value orientations
Label Cell and Value Cell • Compare the height or width of each label cell with each value cell • Confidence ranges from 0 (not similar) to 1 (similar)
Value Cell and Value Cell (Same Row) • A continuous, horizontal path exists between a pair of value cells • Confidence = 1 if a path exists, 0 if no path exists
Value Cell and Value Cell (Same Column) • A continuous, vertical path exists between a pair of value cells • Confidence = 1 if a path exists, 0 if no path exists
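The slides do not define "continuous path". One plausible reading, sketched below in Python, is a chain of cells that touch edge to edge. The `touches_horizontally` and `same_row_confidence` functions, the `(left, top, right, bottom)` rectangle layout, and the 3-pixel tolerance are all assumptions for illustration.

```python
from collections import deque

# Assumed rectangle layout: (left, top, right, bottom) in pixels.
Rect = tuple[int, int, int, int]


def touches_horizontally(a: Rect, b: Rect, tol: int = 3) -> bool:
    """True if b starts where a ends (within tol pixels) and they overlap vertically."""
    vertical_overlap = min(a[3], b[3]) - max(a[1], b[1])
    return abs(a[2] - b[0]) <= tol and vertical_overlap > 0


def same_row_confidence(start: Rect, goal: Rect, cells: list[Rect], tol: int = 3) -> int:
    """Return 1 if a left-to-right chain of touching cells links start to goal, else 0."""
    queue = deque([start])
    seen = {start}
    while queue:
        current = queue.popleft()
        if current == goal:
            return 1
        for candidate in cells:
            if candidate not in seen and touches_horizontally(current, candidate, tol):
                seen.add(candidate)
                queue.append(candidate)
    return 0


row = [(109, 455, 267, 478), (267, 456, 291, 478), (291, 456, 314, 478), (314, 456, 336, 479)]
isolated = (500, 456, 600, 478)
connected = same_row_confidence(row[0], row[-1], row + [isolated])
disconnected = same_row_confidence(row[0], isolated, row + [isolated])
```

The search only walks left to right, so `start` must be the leftmost cell of the pair. A same-column check would follow the same pattern with the axes swapped.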
Value Cell and Value Cell (Geometrically Similar) • Compare height and width • Confidence ranges from 0 (not similar) to 1 (similar)
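The slide gives only the endpoints of the similarity scale, not the formula. As a hedged sketch, one simple measure that stays in the 0 to 1 range is the ratio of the smaller dimension to the larger one, with the worse of the width and height ratios taken as the score. The function names and the choice of `min` to combine the two ratios are assumptions.

```python
def dimension_similarity(a: float, b: float) -> float:
    """Ratio of the smaller to the larger length; 1.0 means identical."""
    if a <= 0 or b <= 0:
        return 0.0
    return min(a, b) / max(a, b)


def geometric_similarity(rect_a, rect_b) -> float:
    """Score two (left, top, right, bottom) rectangles by width and height."""
    width_a, height_a = rect_a[2] - rect_a[0], rect_a[3] - rect_a[1]
    width_b, height_b = rect_b[2] - rect_b[0], rect_b[3] - rect_b[1]
    # Assumption: both dimensions must agree, so the weaker ratio decides.
    return min(dimension_similarity(width_a, width_b),
               dimension_similarity(height_a, height_b))
```

For example, two cells of equal height where one is twice as wide as the other would score 0.5 under this measure.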
Multi-level Labels • Distance between the midpoints • A line through the midpoints • Share a common border
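Two of these features are easy to compute from cell rectangles. The Python sketch below covers the midpoint distance and the shared-border test; the line-through-midpoints feature is left out because the slide does not say how it is used. The function names, the `(left, top, right, bottom)` layout, and the 3-pixel tolerance are assumptions.

```python
import math


def midpoint(rect):
    """Center of a (left, top, right, bottom) rectangle."""
    return ((rect[0] + rect[2]) / 2, (rect[1] + rect[3]) / 2)


def midpoint_distance(a, b) -> float:
    """Euclidean distance between the centers of two cells."""
    (ax, ay), (bx, by) = midpoint(a), midpoint(b)
    return math.hypot(ax - bx, ay - by)


def shares_horizontal_border(upper, lower, tol: int = 3) -> bool:
    """True if upper's bottom edge meets lower's top edge and they overlap in x."""
    x_overlap = min(upper[2], lower[2]) - max(upper[0], lower[0])
    return abs(upper[3] - lower[1]) <= tol and x_overlap > 0


# Two label cells taken from the sample input: a name label and the
# numbered label directly beneath it.
name_label = (119, 132, 436, 261)
number_label = (118, 260, 436, 298)
distance = midpoint_distance(name_label, number_label)
stacked = shares_horizontal_border(name_label, number_label)
```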
Match Label Cells to Object Sets • Match synonyms of object sets to words in a label • Location of matched words • Order that object sets match words • Example object sets: Full Name, Location, Day, Family
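A minimal Python sketch of the synonym-matching step follows. It uses a few `syn` lists copied from the ontology sample and records the word position of each match. The `match_object_sets` function and the tokenization rule are assumptions, and the sketch does not turn positions or match order into a confidence.

```python
import re

# Synonym lists copied from the ontology sample (subset only).
OBJECT_SETS = {
    "Family": "families",
    "Age": "age birthday",
    "Full Name": "full name whom who",
    "First Name": "first given christian",
    "Last Name": "last surname",
}


def match_object_sets(label: str) -> dict[str, list[int]]:
    """Map each matching object set to the word positions where a synonym occurs."""
    words = re.findall(r"[a-z]+", label.lower())
    matches = {}
    for name, synonyms_text in OBJECT_SETS.items():
        synonyms = set(synonyms_text.split())
        positions = [i for i, word in enumerate(words) if word in synonyms]
        if positions:
            matches[name] = positions
    return matches


simple = match_object_sets("Families number in order of visitation.")
long_label = match_object_sets(
    "The Name of every Person whose usual place of abode on the "
    "first day of June, 1840, was in this family."
)
```

In the longer label, "first" in "first day of June" also matches the First Name synonyms. A spurious hit like this suggests why the location and order of matched words could matter.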
Enforce Constraints • A set of rules describes geometric and ontological constraints • For example: • Value cells of the same type have the same dimensions • A family can’t have 100 members • The algorithm iterates over the rules
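The slide does not say how the iteration ends. One way to picture it, sketched in Python below, is to apply every rule in turn and repeat until no confidence changes. The `enforce` function, the two example rules, and the stopping condition are hypothetical and not taken from the thesis.

```python
from typing import Callable

# A confidence table keyed by a pair of cell names (hypothetical layout).
Confidences = dict[tuple[str, str], float]
Rule = Callable[[Confidences], Confidences]


def zero_out_weak(conf: Confidences) -> Confidences:
    """Hypothetical rule: treat confidences below 0.2 as ruled out."""
    return {pair: (0.0 if c < 0.2 else c) for pair, c in conf.items()}


def cap_at_one(conf: Confidences) -> Confidences:
    """Hypothetical rule: keep confidences inside the 0 to 1 range."""
    return {pair: min(c, 1.0) for pair, c in conf.items()}


def enforce(conf: Confidences, rules: list[Rule], max_passes: int = 10) -> Confidences:
    """Apply all rules repeatedly until a full pass changes nothing."""
    for _ in range(max_passes):
        updated = conf
        for rule in rules:
            updated = rule(updated)
        if updated == conf:
            break
        conf = updated
    return conf


result = enforce(
    {("label_1", "value_1"): 0.9, ("label_1", "value_2"): 0.1, ("label_2", "value_2"): 1.3},
    [zero_out_weak, cap_at_one],
)
```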
1. Similar Value Cells • Lower confidence
2. Combine Aggregations
3. Multi-level Labels
4. Factoring • Check cardinality constraints • Observed cardinality: from the microfilm table • Expected cardinality: from the genealogical ontology
Observed Cardinality • [First Name] per [Family] = 45 / 9 = 4.67 . . .
Expected Cardinality • [First Name] per [Family] = 4.8 * 1 * 1 = 4.8
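The comparison can be sketched in Python as follows. The sketch reads the expected value as a product of per-step cardinalities along a path through the ontology, which is an interpretation of the `4.8 * 1 * 1` expression. The agreement score and the counts in the usage lines are illustrative assumptions, not figures from the thesis.

```python
import math


def observed_cardinality(value_count: int, group_count: int) -> float:
    """Average number of values per group, counted in the table."""
    return value_count / group_count


def expected_cardinality(path_cardinalities: list[float]) -> float:
    """Product of the cardinalities along an ontology path (assumed reading)."""
    return math.prod(path_cardinalities)


def cardinality_agreement(observed: float, expected: float) -> float:
    """Hypothetical score in [0, 1]; 1.0 means observed equals expected."""
    if observed <= 0 or expected <= 0:
        return 0.0
    return min(observed, expected) / max(observed, expected)


expected = expected_cardinality([4.8, 1, 1])
# Illustrative counts: 24 first-name cells grouped under 5 families.
observed = observed_cardinality(24, 5)
agreement = cardinality_agreement(observed, expected)
```

A factoring hypothesis whose observed cardinality falls far from the expected value would score low and could have its confidence reduced.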
5. Ontological Similarity • Increase the confidence of label-to-object-set mappings
6. Same Microfilm Roll • Tables from the same microfilm roll have the same structure and relationships • Generate the confidence values for multiple tables from the same roll • Take the average of the respective confidence values
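The averaging step could look like the Python sketch below. It assumes each table's confidences are stored under keys that identify the same relationship across tables of one roll. That keying scheme and the `average_confidences` function are hypothetical.

```python
from statistics import mean


def average_confidences(tables: list[dict[str, float]]) -> dict[str, float]:
    """Average each relationship's confidence over all tables that report it.

    Only relationships present in every table are averaged (an assumption).
    """
    shared_keys = set.intersection(*(set(table) for table in tables))
    return {key: mean(table[key] for table in tables) for key in shared_keys}


roll = [
    {"label_1->column_1": 0.8, "label_2->column_2": 0.2},
    {"label_1->column_1": 0.6, "label_2->column_2": 0.4},
]
averaged = average_confidences(roll)
```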
Verify Results • Create SQL insert statements to store value cell coordinates
INSERT INTO Person (Full_Name) VALUES ('335,114,521,172');
INSERT INTO Person (Full_Name) VALUES ('335,173,521,231');
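To show how value cell coordinates might end up in a database, here is a small Python sketch using the standard `sqlite3` module with an in-memory database. The one-column `Person` table and the `Full_Name` column spelling (borrowed from the later examples) are assumptions. The thesis may target a different database and schema.

```python
import sqlite3


def rect_to_text(rect) -> str:
    """Serialize (left, top, right, bottom) as the comma-separated string stored above."""
    return ",".join(str(v) for v in rect)


# Value cells recognized as Full Name entries (coordinates from the slide).
value_cells = [(335, 114, 521, 172), (335, 173, 521, 231)]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Person (Full_Name TEXT)")  # hypothetical schema
conn.executemany(
    "INSERT INTO Person (Full_Name) VALUES (?)",
    [(rect_to_text(rect),) for rect in value_cells],
)
rows = [row[0] for row in conn.execute("SELECT Full_Name FROM Person ORDER BY rowid")]
conn.close()
```

Storing coordinates rather than transcribed text fits tables with hand-written values: the stored rectangle points back to the image region for that cell.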
Experiments • 75 Tables from 15 different microfilm rolls • Precision, recall, and accuracy • Populated SQL fields • Each relationship
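For reference, precision and recall over a set of predicted items can be computed as in the Python sketch below. Treating each populated SQL field or each relationship as a set element is an assumption about how the experiments were scored. Accuracy is omitted because the slides do not define it.

```python
def precision_recall(predicted: set, actual: set) -> tuple[float, float]:
    """Precision = correct / predicted; recall = correct / actual."""
    correct = len(predicted & actual)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical (table, column) pairs for populated fields.
predicted = {("PERSON", "Full_Name"), ("PERSON", "Age"), ("PERSON", "Race"), ("EVENT", "Date")}
actual = {("PERSON", "Full_Name"), ("PERSON", "Age"), ("EVENT", "Location")}
precision, recall = precision_recall(predicted, actual)
```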
3 Success Examples • Specialized Records • Ontology Constraints • Factoring
1. Specialized Records
INSERT INTO PERSON (Person_Identifier, Full_Name, Age, Gender, Occupation, Race, Family_Identifier, Birth_Identifier) VALUES (1, '109,455,267,478', '314,456,336,479', '291,456,314,478', '505,457,637,480', '267,456,291,478', 1, 1);
INSERT INTO PERSON (Person_Identifier, Birth_Identifier) VALUES (2, 2);
INSERT INTO PERSON (Person_Identifier, Birth_Identifier) VALUES (3, 3);
INSERT INTO MOTHER_CHILD (Mother_Identifier, Child_Identifier) VALUES (3, 1);
INSERT INTO FATHER_CHILD (Father_Identifier, Child_Identifier) VALUES (2, 1);
INSERT INTO EVENT (Event_Identifier, Location) VALUES (1, '894,460,997,483');
INSERT INTO EVENT (Event_Identifier, Location) VALUES (2, '997,460,1076,483');
INSERT INTO EVENT (Event_Identifier, Location) VALUES (3, '1076,461,1153,484');
2. Ontology Constraints
INSERT INTO PERSON (Person_Identifier, Full_Name, Age, Family_Identifier, Burial_Identifier) VALUES (1, '70,243,331,373', '620,243,687,370', 1, 1);
INSERT INTO FAMILY (Family_Identifier, Location) VALUES (1, '331,243,508,372');
INSERT INTO EVENT (Event_Identifier, Date) VALUES (1, '508,243,620,371');
INSERT INTO PERSON (Person_Identifier, Full_Name) VALUES (2, '687,241,861,372');
3 Types of Errors • Ambiguous Factoring • Long Label Names • Ambiguous Columns