
Automatically Identifying Records from the Extracted Data Fields of Genealogical Microfilm

This system automatically identifies and extracts records from genealogical microfilm using table zones, coordinates, printed text, and empty cell data. It identifies structure, matches attributes, checks constraints, and produces record patterns, attributes, and XML files.


Presentation Transcript


  1. Automatically Identifying Records from the Extracted Data Fields of Genealogical Microfilm Kenneth Tubbs

  2. Microfilm Image

  3. Input Table Zones • The coordinates of each table cell • The printed text in ASCII for each cell, if any. • Whether or not the cell is empty.
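The three inputs listed on this slide can be pictured as one small record per table cell. The sketch below is only an illustration: the class name `TableZone`, its field names, and the pixel coordinates are assumptions, not taken from the presentation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TableZone:
    """One table cell (hypothetical field names)."""
    left: int
    top: int
    right: int
    bottom: int
    printed_text: Optional[str] = None  # ASCII text of a printed label, if any
    is_empty: bool = True               # True when the cell holds no content

    @property
    def width(self) -> int:
        return self.right - self.left

    @property
    def height(self) -> int:
        return self.bottom - self.top

    @property
    def is_label(self) -> bool:
        # A cell with printed text is treated here as a table label.
        return self.printed_text is not None


# A printed "Name" header cell and an empty value cell directly below it.
label = TableZone(0, 0, 120, 30, printed_text="Name", is_empty=False)
value = TableZone(0, 30, 120, 60)
```

Keeping width and height as derived properties matches the later slides, where column primitives compare cell widths and row primitives compare cell heights.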

  4. Algorithm • Inputs: Table Zones, Genealogical Ontology • Steps: Identify Structure, Match Attributes, Check Constraints • Output: Record Patterns

  5. Identify Structure • Identify Table Primitives • Aggregate Table Primitives • Sort Candidates

  6. Identify Structure: Identify Table Primitives • Column: [[table_label width] [table_value width]+] {below} [Figure: a column headed by the label "Name"]

  7. Identify Structure: Identify Table Primitives • Row: [[table_label height] [table_value height]+] {left} [Figure: a row headed by the label "Name"]

  8. Identify Structure: Identify Table Primitives [Figure: a Row Primitive and a Column Primitive, with cells marked as Printed Text or Hand-written Text]

  9. Identify Structure: Identify Table Primitives • Probabilistic rules are associated with each primitive type. • Examples: • Column primitives should be factored left to right. (.9) • Row primitives factor the Column primitives below them. (.7)

  10. Identify Structure: Aggregate Table Primitives [Figure: table zones labeled A through L]

  11. Identify Structure: Aggregate Table Primitives • Candidate groupings of zones G H I J K L: [G H I J K L] or [G] [H I J K L] or [K] [G H I J L] or [G] [H I J [K][L]] or others

  12. Identify Structure: Sort Candidates • The candidates are evaluated based on: • The confidence of the table primitive matches. • The probability that the rules used are correct.
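A minimal sketch of this ranking step follows. The slides do not say how the two criteria are combined, so multiplying the primitive confidences by the rule probabilities is an assumption, as are the `Candidate` class and the confidence values. Only the rule probabilities .9 and .7 come from the examples on slide 9.

```python
from dataclasses import dataclass
from math import prod
from typing import List


@dataclass
class Candidate:
    pattern: str
    primitive_confidences: List[float]  # illustrative values, not from the slides
    rule_probabilities: List[float]     # probabilities of the rules that were applied

    def score(self) -> float:
        # Assumed combination: treat all factors as independent and multiply them.
        return prod(self.primitive_confidences) * prod(self.rule_probabilities)


def sort_candidates(candidates: List[Candidate]) -> List[Candidate]:
    """Return the candidates with the highest combined score first."""
    return sorted(candidates, key=lambda c: c.score(), reverse=True)


candidates = [
    Candidate("[G H I J K L]", [0.8], [0.7]),
    Candidate("[G] [H I J K L]", [0.9, 0.8], [0.9]),
]
ranked = sort_candidates(candidates)
```

With these made-up inputs the second grouping ranks first, because 0.9 * 0.8 * 0.9 exceeds 0.8 * 0.7.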

  13. Identify Structure: Sort Candidates • [G] [H I J K L] • [G H I J K L] • [G] [H I J [K][L]] • [K] [G H I J L] • Others

  14. Match Attributes • Identify Possible Mappings • Sort Candidates

  15. Match Attributes: Identify Possible Mappings • Mapping types: • Identical Matches • Synonym Matches • Composite Matches • Human-Aided Matches [Figure: Printed Text labels mapped to the Genealogical Ontology: Name to Name, Sex to Gender, Female Age to Female, Age]

  16. Match Attributes: Sort Candidates • The candidates are evaluated based on: • The type of the match. • The confidence of the match.
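The mapping types and the sort order can be sketched as below. Everything here is hypothetical: the preference order of the match types, the tiny synonym table, the ontology attribute list, and the rule that a composite match is a label whose words are each ontology attributes. The presentation names the four match types but does not define how they are detected.

```python
from enum import IntEnum


class MatchType(IntEnum):
    # Assumed preference order: lower value sorts first.
    IDENTICAL = 0
    SYNONYM = 1
    COMPOSITE = 2
    HUMAN_AIDED = 3


SYNONYMS = {"sex": "gender"}  # hypothetical synonym table


def classify(label, ontology_attributes):
    """Return (MatchType, matched ontology attributes) for one printed label."""
    names = {a.lower(): a for a in ontology_attributes}
    key = label.lower()
    if key in names:
        return MatchType.IDENTICAL, [names[key]]
    if SYNONYMS.get(key) in names:
        return MatchType.SYNONYM, [names[SYNONYMS[key]]]
    parts = [names[word] for word in key.split() if word in names]
    if len(parts) > 1:
        return MatchType.COMPOSITE, parts
    # Nothing matched automatically, so a person would have to supply the mapping.
    return MatchType.HUMAN_AIDED, []


def rank(candidates):
    """Sort (label, match_type, confidence): better type first, then higher confidence."""
    return sorted(candidates, key=lambda c: (c[1], -c[2]))


ONTOLOGY = ["Name", "Gender", "Female", "Age"]  # illustrative attribute list
```

For example, `classify("Sex", ONTOLOGY)` yields a synonym match to `Gender`, while `classify("Female Age", ONTOLOGY)` yields a composite match to `Female` and `Age`.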

  17. Check Constraints • Identify the individual records. • Evaluate the records with the Genealogical Ontology.

  18. Check Constraints • Table(Address, Age) = 4.1 [Figure: table values linking Address, Name, Age, and Gender: 1, 1, 1, 4.1, 3.9, 4.2]

  19. Check Constraints • Ontology(Address, Age) = 1.5 * 4.3 * .9 = 5.805 [Figure: ontology graph over Address, Family, Person, Name, Age, and Gender with edge values 5, 1.1, 10, 1.1, .9, 1.1, 1.5, 1.3, 4.3, 1.3]

  20. Check Constraints • $\text{Constraint\_Score} = \dfrac{1}{2^{\frac{1}{2n} \sum_{i,j} \left| \text{Ontology}(i,j) - \text{Table}(i,j) \right|^{2}}}$ • The variables “i” and “j” are attributes. • The sum is over all combinations of “i” and “j”. • The variable “n” is the number of attributes.
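The score can be sketched in a few lines. This sketch assumes one particular reading of the slide's formula: 1 divided by 2 raised to the power (1/(2n)) times the sum of squared differences, so that a perfect agreement between table and ontology scores 1 and larger disagreements push the score toward 0. The function name, the dictionary layout keyed by attribute pairs, and the two-attribute example are illustrative; the values 5.805 and 4.1 are the Ontology(Address, Age) and Table(Address, Age) figures from the two preceding slides.

```python
from itertools import combinations


def constraint_score(ontology, table, attributes):
    """Score how well table values agree with ontology values.

    `ontology` and `table` map a sorted (i, j) attribute pair to a number.
    Assumed formula: 1 / 2 ** ((1 / (2n)) * sum(|Ontology - Table| ** 2)).
    """
    n = len(attributes)
    total = sum(
        abs(ontology[pair] - table[pair]) ** 2
        for pair in combinations(sorted(attributes), 2)
    )
    return 1 / 2 ** (total / (2 * n))


attributes = ["Address", "Age"]
ontology = {("Address", "Age"): 5.805}
table = {("Address", "Age"): 4.1}
score = constraint_score(ontology, table, attributes)
```

Under this reading the score stays between 0 and 1, which fits the next slide's remark that attributes receiving low constraint scores are the ones to stop factoring.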

  21. Check Constraints • The algorithm sorts the candidates by their constraint score. • The algorithm creates rules to prevent the factoring of the attributes that receive low constraint scores.

  22. Algorithm • Inputs: Table Zones, Genealogical Ontology • Steps: Identify Structure, Match Attributes, Check Constraints • Output: Record Patterns

  23. Final Remarks • The algorithm produces: 1. Record patterns • Attributes for each record • Geometry for each record 2. Attribute mappings from the table to the ontology.

  24. Final Remarks • Given extracted values for the information written by hand, the process can extract the records into an XML file. • Individuals can then query the XML files and index back into the original microfilm images.
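One way to picture this last step is with Python's standard `xml.etree.ElementTree` module. The element names, the `image` and `bbox` attributes, and the sample record are all made up for illustration; the presentation does not show its XML schema.

```python
import xml.etree.ElementTree as ET


def records_to_xml(records):
    """Build an XML tree with one <record> per extracted record (hypothetical schema)."""
    root = ET.Element("records")
    for rec in records:
        node = ET.SubElement(root, "record", {
            "image": rec["image"],                          # source microfilm image
            "bbox": ",".join(str(v) for v in rec["bbox"]),  # record geometry
        })
        for name, value in rec["attributes"].items():
            ET.SubElement(node, name.lower()).text = value
    return root


# Made-up sample record: attribute values plus where the record sits on the image.
records = [
    {"image": "film_0001.tif", "bbox": (40, 210, 980, 240),
     "attributes": {"Name": "Mary Smith", "Age": "34"}},
]
root = records_to_xml(records)
xml_text = ET.tostring(root, encoding="unicode")

# Query by name, then use the stored geometry to index back into the image.
hits = [r.get("bbox") for r in root.findall("record")
        if r.findtext("name") == "Mary Smith"]
```

Storing the image name and bounding box on each record is what lets a query result point back to the region of the original microfilm frame.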
