
Recognizing Records from the Extracted Cells of Genealogical Microfilm Tables

Recognizing Records from the Extracted Cells of Genealogical Microfilm Tables. A thesis submitted to the faculty of Brigham Young University by Kenneth Martin Tubbs Jr. Motivation: millions of people want genealogical information, but acquiring microfilm is expensive and time consuming.


Presentation Transcript


  1. Recognizing Records from the Extracted Cells of Genealogical Microfilm Tables A Thesis Submitted to the Faculty of Brigham Young University Kenneth Martin Tubbs Jr.

  2. Motivation • Millions of people want genealogical information • Acquiring microfilm is expensive and time consuming

  3. Extraction Problem • Searching microfilm by hand is slow, error prone, and tedious • Extraction by hand requires enormous amounts of time and manpower

  4. Difficulties • Tables have different layouts and styles • Tables contain different records • Tables do not use a uniform schema • Tables lack information and are ambiguous

  5. Related Work • Current work exploits the geometric properties of tables • Regular expressions, grammars, probabilistic models, and templates • They ignore the ontological constraints of this information

  6. Contributions • Exploit both ontological and geometric constraints • Identify complex records • Work with tables with hand-written values

  7. Method • Input: XML input file (preprocessed microfilm image) and genealogical ontology • Algorithm: Generate Confidences, Enforce Constraints, Verify Results • Output: SQL insert statements

  8. Training Set • 25 Tables from 5 different microfilm rolls • Used to: • Identify relationships between table cells • Create genealogical ontology • Define features to extract • Generate rules (constraints)

  9. Input: Microfilm Table

  10. Input: Microfilm Table

  11. Input: Microfilm Table • Input Features • Coordinates of each cell. • Printed text for label cells. • Whether or not each value cell is empty.

  12. Input: Microfilm Table
      <index source="0444770/0444770_2.gif" ontology="ontology.xml">
        <cell rect="7,131,62,261" printed_text="Dwelling-houses number in the order of visitation." empty="0" />
        <cell rect="61,132,118,260" printed_text="Families number in order of visitation." empty="0" />
        <cell rect="119,132,436,261" printed_text="The Name of every Person whose usual place of abode on the first day of June, 1840, was in this family." empty="0" />
        <cell rect="62,260,120,295" printed_text="2" empty="0" />
        <cell rect="118,260,436,298" printed_text="3" empty="0" />
        <cell rect="7,458,62,497" printed_text="" empty="1" />
        . . .
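A minimal sketch of how an input file like the one on this slide might be read. The sample below is a shortened, hypothetical stand-in: the `cell` element name, the closing `</index>` tag, and the left, top, right, bottom order of `rect` are assumptions, since the slide shows only a fragment.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample modeled on the slide. The <cell> element
# name and the closing </index> tag are assumptions, not shown in the source.
SAMPLE = """<index source="0444770/0444770_2.gif" ontology="ontology.xml">
  <cell rect="7,131,62,261" printed_text="Dwelling-houses number in the order of visitation." empty="0" />
  <cell rect="62,260,120,295" printed_text="2" empty="0" />
  <cell rect="7,458,62,497" printed_text="" empty="1" />
</index>"""


def parse_cells(xml_text):
    """Return one dict per cell with integer coordinates and an empty flag."""
    cells = []
    for node in ET.fromstring(xml_text).iter("cell"):
        # Assumed coordinate order: left, top, right, bottom.
        left, top, right, bottom = (int(v) for v in node.get("rect").split(","))
        cells.append({
            "rect": (left, top, right, bottom),
            "text": node.get("printed_text", ""),
            "empty": node.get("empty") == "1",
        })
    return cells


cells = parse_cells(SAMPLE)
```

Each parsed cell carries the three input features listed on slide 11: coordinates, printed label text, and whether the cell is empty.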

  13. Genealogical Ontology

  14. Genealogical Ontology

  15. Genealogical Ontology
      <Ontology>
        <ObjectSet id="0" name="Person" syn="" lex="0"/>
        <ObjectSet id="1" name="Family" syn="families" lex="0"/>
        <ObjectSet id="2" name="Event" syn="" lex="0"/>
        <ObjectSet id="3" name="Age" syn="age birthday" lex="1"/>
        <ObjectSet id="4" name="Relationship" syn="relationship relation" lex="1"/>
        <ObjectSet id="5" name="Full Name" syn="full name whom who" lex="1"/>
        <ObjectSet id="6" name="First Name" syn="first given christian" lex="1"/>
        <ObjectSet id="7" name="Middle Name(s)" syn="middle initial" lex="1"/>
        <ObjectSet id="8" name="Last Name" syn="last surname" lex="1"/>
        <ObjectSet id="9" name="Title(s)" syn="title" lex="1"/>
        . . .
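As a rough illustration of how the `syn` attributes could be used, the sketch below builds a lookup from each synonym word to the object sets that list it. The sample is a shortened copy of the slide's ontology with an assumed closing `</Ontology>` tag, and `synonym_index` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Shortened sample based on the slide; the closing tag is an assumption.
SAMPLE = """<Ontology>
  <ObjectSet id="1" name="Family" syn="families" lex="0"/>
  <ObjectSet id="3" name="Age" syn="age birthday" lex="1"/>
  <ObjectSet id="8" name="Last Name" syn="last surname" lex="1"/>
</Ontology>"""


def synonym_index(xml_text):
    """Map each lowercase synonym word to the names of the object sets listing it."""
    index = {}
    for node in ET.fromstring(xml_text).iter("ObjectSet"):
        for word in node.get("syn", "").split():
            index.setdefault(word.lower(), []).append(node.get("name"))
    return index


index = synonym_index(SAMPLE)
```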

  16. Generate Confidences • Confidence of relationships between pairs of cells • Generate confidence values between 0 and 1

  17. Relationships • A label cell describes a value cell • Value cells are in the same row or column • Label cells form a multi-level label • A label cell maps to an object set • Identify factoring

  18. Label Cell and Value Cell • A continuous path between a label cell and a value cell • Confidence = 1 if a path exists, 0 if no path exists

  19. Label Cell and Value Cell • Preferences for label-value orientations

  20. Label Cell and Value Cell • Compare the height or width of each label cell with each value cell • Confidence ranges from 0 (not similar) to 1 (similar)

  21. Value Cell and Value Cell (Same Row) • A continuous, horizontal path exists between a pair of value cells • Confidence = 1 if a path exists, 0 if no path exists
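The slides do not define what counts as a continuous path, so the following is only one possible reading: two value cells are linked if a left-to-right chain of cells connects them, where each cell's left edge meets the previous cell's right edge and their vertical spans overlap. The function names and the pixel tolerance `tol` are assumptions.

```python
from collections import deque


def touches_right(a, b, tol=3):
    """True if b starts where a ends horizontally and their vertical spans overlap."""
    overlap = min(a[3], b[3]) - max(a[1], b[1])
    return abs(a[2] - b[0]) <= tol and overlap > 0


def horizontal_path_confidence(start, goal, cells, tol=3):
    """Return 1.0 if a left-to-right chain of touching cells links start to goal, else 0.0."""
    queue, seen = deque([start]), {start}
    while queue:
        current = queue.popleft()
        if current == goal:
            return 1.0
        for nxt in cells:
            if nxt not in seen and touches_right(current, nxt, tol):
                seen.add(nxt)
                queue.append(nxt)
    return 0.0


# Rectangles are (left, top, right, bottom) tuples; the last one sits in another row.
cells = [(0, 0, 10, 10), (10, 0, 20, 10), (20, 0, 30, 10), (0, 20, 10, 30)]
same_row = horizontal_path_confidence(cells[0], cells[2], cells)
other_row = horizontal_path_confidence(cells[0], cells[3], cells)
```

The tolerance allows for cell borders that are off by a pixel or two, as in the input file where neighboring rectangles overlap slightly.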

  22. Value Cell and Value Cell (Same Column) • A continuous, vertical path exists between a pair of value cells • Confidence = 1 if a path exists, 0 if no path exists

  23. Value Cell and Value Cell (Geometrically Similar) • Compare height and width • Confidence ranges from 0 (not similar) to 1 (similar)
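The slides say only that height and width are compared and that the result lies between 0 and 1. One simple way to get such a score, offered here as an assumption rather than the thesis's formula, is the ratio of the smaller dimension to the larger one, multiplied across width and height.

```python
def dimension_similarity(a, b):
    """Ratio of the smaller to the larger dimension, in [0, 1]."""
    if a <= 0 or b <= 0:
        return 0.0
    return min(a, b) / max(a, b)


def cell_similarity(rect_a, rect_b):
    """Combine width and height similarity for two (left, top, right, bottom) rectangles."""
    width_a, height_a = rect_a[2] - rect_a[0], rect_a[3] - rect_a[1]
    width_b, height_b = rect_b[2] - rect_b[0], rect_b[3] - rect_b[1]
    return dimension_similarity(width_a, width_b) * dimension_similarity(height_a, height_b)


identical = cell_similarity((0, 0, 10, 10), (5, 5, 15, 15))
twice_as_wide = cell_similarity((0, 0, 10, 10), (0, 0, 20, 10))
```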

  24. Multi-level Labels • Distance between the midpoints • A line through the midpoints • Share a common border
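Two of the three features named on this slide can be sketched directly from cell rectangles. The helper names and the border tolerance are assumptions, and how the thesis turns these features into a confidence value is not stated in the slides.

```python
import math


def midpoint(rect):
    """Center point of a (left, top, right, bottom) rectangle."""
    return ((rect[0] + rect[2]) / 2, (rect[1] + rect[3]) / 2)


def midpoint_distance(a, b):
    """Euclidean distance between the centers of two cells."""
    return math.dist(midpoint(a), midpoint(b))


def shares_horizontal_border(upper, lower, tol=3):
    """True if upper's bottom edge meets lower's top edge and they overlap horizontally."""
    overlap = min(upper[2], lower[2]) - max(upper[0], lower[0])
    return abs(upper[3] - lower[1]) <= tol and overlap > 0


parent_label = (0, 0, 10, 10)
child_label = (0, 10, 10, 20)
distance = midpoint_distance(parent_label, child_label)
stacked = shares_horizontal_border(parent_label, child_label)
```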

  25. Match Label Cells to Object Sets • Match synonyms of object sets to words in a label • Location of matched words • Order that object sets match words • Example object sets: Full Name, Location, Day, Family
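A minimal sketch of the first bullet, matching synonyms to the words of a label. The synonym lists are copied from the ontology slide, but the scoring (the fraction of an object set's synonyms found in the label) is an assumption, and the sketch ignores the location and order of matched words that the slide also mentions.

```python
import re

# Synonyms copied from the ontology slide; the scoring below is an assumption.
SYNONYMS = {
    "Family": ["families"],
    "Age": ["age", "birthday"],
    "Full Name": ["full", "name", "whom", "who"],
}


def match_confidence(label_text, synonyms):
    """Fraction of an object set's synonyms that occur among the label's words."""
    words = set(re.findall(r"[a-z]+", label_text.lower()))
    hits = sum(1 for synonym in synonyms if synonym in words)
    return hits / len(synonyms) if synonyms else 0.0


def rank_object_sets(label_text):
    """Return (object set, confidence) pairs, best match first."""
    scores = {name: match_confidence(label_text, syns) for name, syns in SYNONYMS.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_object_sets("Families number in order of visitation.")
```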

  26. Enforce Constraints • A set of rules describes geometric and ontological constraints • For example: • Value cells of the same type have the same dimensions • A family can't have 100 members • The algorithm iterates over the rules
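The slides do not show how rules are represented, so the sketch below is a guess at the general shape: each rule inspects the current confidences and proposes adjustments, and the loop repeats until nothing changes. `apply_rules`, `make_family_size_rule`, the penalty size, and the hypothesis keys are all hypothetical.

```python
def clamp(value):
    """Keep a confidence inside [0, 1]."""
    return max(0.0, min(1.0, value))


def apply_rules(confidences, rules, max_passes=10):
    """Apply every rule repeatedly until no confidence value changes."""
    for _ in range(max_passes):
        changed = False
        for rule in rules:
            for key, delta in rule(confidences).items():
                updated = clamp(confidences[key] + delta)
                if updated != confidences[key]:
                    confidences[key] = updated
                    changed = True
        if not changed:
            break
    return confidences


def make_family_size_rule(members_per_family, limit=100, penalty=0.5):
    """Hypothetical rule: penalize groupings that imply an implausibly large family."""
    def rule(confidences):
        return {
            key: -penalty
            for key, size in members_per_family.items()
            if size >= limit and confidences.get(key, 0.0) > 0.0
        }
    return rule


confidences = {"grouping_a": 0.9, "grouping_b": 0.8}
rule = make_family_size_rule({"grouping_a": 5, "grouping_b": 120})
result = apply_rules(confidences, [rule])
```

In this toy run the penalty is reapplied on each pass, so the grouping that implies a 120-member family is driven down to 0 while the plausible grouping is left alone.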

  27. 1. Similar Value Cells

  28. 1. Similar Value Cells • Lower confidence

  29. 1. Similar Value Cells

  30. 2. Combine Aggregations

  31. 3. Multi-level Labels

  32. 4. Factoring • Check cardinality constraints • Observed cardinality: microfilm table • Expected cardinality: genealogy ontology

  33. Observed Cardinality • [First Name] per [Family] = 45 / 9 = 4.67 . . .

  34. Expected Cardinality • [First Name] per [Family] = 4.8 * 1 * 1 = 4.8
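To make the cardinality check concrete, here is a small sketch that compares an observed ratio from a table with the expected ratio from the ontology. The counts (48 first names, 10 families) are made up for illustration, the expected value 4.8 is the one shown on the slide, and the agreement score is an assumed formula, not the thesis's.

```python
def observed_cardinality(value_count, group_count):
    """Average number of values per group, e.g. first names per family."""
    return value_count / group_count


def cardinality_agreement(observed, expected):
    """Score in [0, 1]: 1 when observed equals expected, lower as they diverge."""
    if observed <= 0 or expected <= 0:
        return 0.0
    return min(observed, expected) / max(observed, expected)


# Hypothetical counts: 48 first-name cells spread over 10 families.
observed = observed_cardinality(48, 10)
agreement = cardinality_agreement(observed, 4.8)
half_expected = cardinality_agreement(2.4, 4.8)
```

A factoring hypothesis whose observed cardinality is far from the ontology's expectation would receive a low score under this scheme.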

  35. 5. Ontological Similarity • Increase confidence of label to object set mappings

  36. 6. Same Microfilm Roll • Tables from the same microfilm roll have the same structure and relationships • Generate the confidence values for multiple tables from the same roll • Take the average of the respective confidence values
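The averaging step can be sketched as follows, assuming each table's confidences are stored in a dict and that corresponding relationships share the same key across tables from one roll. How the thesis aligns cells between tables is not described in the slides.

```python
def average_confidences(tables):
    """Average the confidence of every relationship present in all tables."""
    if not tables:
        return {}
    shared_keys = set.intersection(*(set(table) for table in tables))
    return {
        key: sum(table[key] for table in tables) / len(tables)
        for key in shared_keys
    }


# Hypothetical confidences for the same two relationships in two tables.
roll = [
    {"label_1->value_1": 1.0, "label_2->value_1": 0.0},
    {"label_1->value_1": 0.5, "label_2->value_1": 1.0},
]
averaged = average_confidences(roll)
```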

  37. Verify Results

  38. Verify Results • Create SQL insert statements to store value cell coordinates
      INSERT INTO Person (Full Name) VALUES ('335,114,521,172')
      INSERT INTO Person (Full Name) VALUES ('335,173,521,231')
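A hedged sketch of storing value cell coordinates the way this slide describes, using Python's built-in `sqlite3` module. The table definition is an assumption, and the column name is quoted because it contains a space; the thesis's actual schema and database engine are not given in the slides.

```python
import sqlite3

# In-memory database with an assumed one-column schema.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE Person ("Full Name" TEXT)')

# Coordinates of the two value cells shown on the slide.
value_cells = [(335, 114, 521, 172), (335, 173, 521, 231)]
for rect in value_cells:
    coordinates = ",".join(str(v) for v in rect)
    conn.execute('INSERT INTO Person ("Full Name") VALUES (?)', (coordinates,))

rows = [row[0] for row in conn.execute('SELECT "Full Name" FROM Person ORDER BY rowid')]
conn.close()
```

Storing coordinates rather than transcribed text matches the stated goal of working with hand-written values, which are located but not read.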

  39. Method • Input: XML input file (preprocessed microfilm image) and genealogical ontology • Algorithm: Generate Confidences, Enforce Constraints, Verify Results • Output: SQL insert statements

  40. Training Set Results

  41. Ambiguous Factoring

  42. Experiments • 75 Tables from 15 different microfilm rolls • Precision, recall, and accuracy • Populated SQL fields • Each relationship
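For reference, precision and recall over a set of populated fields can be computed as in the sketch below. The field identifiers are invented, and accuracy is left out because the slides do not say how it was defined.

```python
def precision_recall(predicted, expected):
    """Precision and recall of predicted items against the expected items."""
    predicted, expected = set(predicted), set(expected)
    true_positives = len(predicted & expected)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical field identifiers: four were populated, three were expected.
precision, recall = precision_recall({1, 2, 3, 4}, {2, 3, 5})
```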

  43. Test Set Results

  44. 3 Success Examples • Specialized Record • Ontology Constraints • Factoring

  45. 1. Specialized Records

  46. 1. Specialized Records
      INSERT INTO PERSON (Person_Identifier, Full_Name, Age, Gender, Occupation, Race, Family_Identifier, Birth_Identifier) (1, '109,455,267,478', '314,456,336,479', '291,456,314,478', '505,457,637,480', '267,456,291,478', 1, 1)
      INSERT INTO PERSON (Person_Identifier, Birth_Identifier) (2, 2)
      INSERT INTO PERSON (Person_Identifier, Birth_Identifier) (3, 3)
      INSERT INTO MOTHER_CHILD (Mother_Identifier, Child_Identifier) (3, 1)
      INSERT INTO FATHER_CHILD (Father_Identifier, Child_Identifier) (2, 1)
      INSERT INTO EVENT (Event_Identifier, Location) (1, '894,460,997,483')
      INSERT INTO EVENT (Event_Identifier, Location) (2, '997,460,1076,483')
      INSERT INTO EVENT (Event_Identifier, Location) (3, '1076,461,1153,484')

  47. 2. Ontology Constraints

  48. 2. Ontology Constraints
      INSERT INTO PERSON (Person_Identifier, Full_Name, Age, Family_Identifier, Burial_Identifier) (1, '70,243,331,373', '620,243,687,370', 1, 1)
      INSERT INTO FAMILY (Family_Identifier, Location) (1, '331,243,508,372')
      INSERT INTO EVENT (Event_Identifier, Date) (1, '508,243,620,371')
      INSERT INTO PERSON (Person_Identifier, Full_Name) (2, '687,241,861,372')

  49. 3. Factoring

  50. 3 Types of Errors • Ambiguous Factoring • Long Label Names • Ambiguous Columns
