This research project aims to automate the extraction of data from microfilm tables to improve efficiency and accuracy. By utilizing a genealogical ontology algorithm, constraints are enforced to verify and organize the extracted data. Confidence matrices are generated to establish relationships between cells and labels, enabling the identification of records. The algorithm compares geometrical attributes and orientations to ensure accurate data extraction. The output includes SQL insert statements for easy integration with databases.
Recognizing Records from the Extracted Cells of Microfilm Tables Kenneth M. Tubbs David W. Embley Brigham Young University Supported by NSF
Motivation • Millions want microfilm information • 1880 census on-line, end of October • 3 million hits per hour on familysearch.org • Acquiring information from microfilm • Expensive and time consuming • 2.5 million rolls, 20,000 extractors, 100 hours per year: requires 104 years • Finding a way to automate: big win!
Difficulties • Different layouts and styles • Different types of data • Sometimes ambiguous • Type-written labels (OCR) • Hand-written data (?)
Objective: Identify Records Given a zoned image of a microfilm table, exploit: • Ontological as well as geometric constraints • Layout of handwritten values • Layout of empty cells Output: field coordinates, labeled with respect to the ontology and organized into records
Method • Input: XML Input File (Preprocessed Microfilm Image) and Genealogical Ontology • Algorithm: Generate Confidence, Enforce Constraints, Verify Results • Output: SQL Insert Statements
“Training” Set • 25 Tables from 5 different microfilm rolls • Used to: • Identify relationships between table cells • Create genealogical ontology • Define features to extract • Generate rules (constraints)
Input: Microfilm Table • Input Features • Coordinates of each cell • Printed text for label cells • Cell empty or not
Input: Microfilm Table
<index source="0444770/0444770_2.gif" ontology="ontology.xml">
<cell rect="7,131,62,261" printed_text="Dwelling-houses number in the order of visitation." empty="0" />
<cell rect="61,132,118,260" printed_text="Families number in order of visitation." empty="0" />
<cell rect="119,132,436,261" printed_text="The Name of every Person whose usual place of abode on the first day of June, 1840, was in this family." empty="0" />
<cell rect="62,260,120,295" printed_text="2" empty="0" />
<cell rect="118,260,436,298" printed_text="3" empty="0" />
<cell rect="7,458,62,497" printed_text="" empty="1" />
. . .
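A minimal sketch of reading this input, assuming each cell is a `cell` element with `rect`, `printed_text`, and `empty` attributes, and assuming `rect` lists left, top, right, bottom in pixels. The `Cell` class and `parse_cells` function are hypothetical names for illustration, not part of the original system.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class Cell:
    left: int
    top: int
    right: int
    bottom: int
    printed_text: str
    empty: bool


# A shortened copy of the sample input (three cells only).
SAMPLE = """<index source="0444770/0444770_2.gif" ontology="ontology.xml">
  <cell rect="7,131,62,261" printed_text="Dwelling-houses number in the order of visitation." empty="0" />
  <cell rect="62,260,120,295" printed_text="2" empty="0" />
  <cell rect="7,458,62,497" printed_text="" empty="1" />
</index>"""


def parse_cells(xml_text):
    """Turn every <cell> element into a Cell record."""
    root = ET.fromstring(xml_text)
    cells = []
    for node in root.iter("cell"):
        # Assumption: rect is "left,top,right,bottom".
        left, top, right, bottom = (int(v) for v in node.get("rect").split(","))
        cells.append(
            Cell(left, top, right, bottom,
                 node.get("printed_text", ""),
                 node.get("empty") == "1")
        )
    return cells


cells = parse_cells(SAMPLE)
label_cells = [c for c in cells if c.printed_text]
empty_cells = [c for c in cells if c.empty]
```

Cells with printed text are candidate label cells; the remaining cells are candidate value cells, whether filled or empty.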
Genealogical Ontology <Ontology> <ObjectSet id="0" name="Person" syn="" lex="0"/> <ObjectSet id="1" name="Family" syn="families" lex="0"/> <ObjectSet id="2" name="Event" syn="" lex="0"/> <ObjectSet id="3" name="Age" syn="age birthday" lex="1"/> <ObjectSet id="4" name="Relationship" syn="relationship relation" lex="1"/> <ObjectSet id="5" name="Full Name" syn="full name whom who" lex="1"/> <ObjectSet id="6" name="First Name" syn="first given christian" lex="1"/> <ObjectSet id="7" name="Middle Name(s)" syn="middle initial" lex="1"/> <ObjectSet id="8" name="Last Name" syn="last surname" lex="1"/> <ObjectSet id="9" name="Title(s)" syn="title" lex="1"/> . . .
Generate Confidence Matrices • Relationships between pairs of cells • Confidence values between 0 and 1
Relationships • Label cell describes value cells • Value cells in same row or column • Label cells form a multi-level label • Label cells correspond to object sets • Value factoring and nested values
Label Cell and Value Cell • A continuous path between a label cell and a value cell • Confidence = 1 if a path exists, 0 if no path exists
Label Cell and Value Cell • Preferences for label-value orientations
Label Cell and Value Cell • Compare the height or width of each label cell with each value cell • Confidence ranges from 0 (not similar) to 1 (similar)
Value Cell and Value Cell (Same Row) • A continuous, horizontal path exists between a pair of value cells • Confidence = 1 if a path exists, 0 if no path exists
Value Cell and Value Cell (Same Column) • A continuous, vertical path exists between a pair of value cells • Confidence = 1 if a path exists, 0 if no path exists
Value Cell and Value Cell (Geometrically Similar) • Compare height and width • Confidence ranges from 0 (not similar) to 1 (similar)
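One way to turn a height-and-width comparison into a confidence between 0 and 1 is sketched here. The ratio-based formula is an assumption chosen for illustration; the slides do not state the actual measure. Rectangles are assumed to be (left, top, right, bottom) tuples.

```python
def dimension_similarity(a, b):
    """Ratio of the smaller to the larger measurement, in [0, 1]."""
    if a <= 0 or b <= 0:
        return 0.0
    return min(a, b) / max(a, b)


def cell_similarity(rect_a, rect_b):
    """Confidence that two cells have the same shape.

    Assumption: the product of the width ratio and the height ratio,
    so the score is 1 only when both dimensions agree.
    """
    width_a, height_a = rect_a[2] - rect_a[0], rect_a[3] - rect_a[1]
    width_b, height_b = rect_b[2] - rect_b[0], rect_b[3] - rect_b[1]
    return (dimension_similarity(width_a, width_b)
            * dimension_similarity(height_a, height_b))


# Two stacked value cells of equal size.
same_shape = cell_similarity((335, 114, 521, 172), (335, 173, 521, 231))

# A narrow column next to a wide column.
different_shape = cell_similarity((7, 131, 62, 261), (119, 132, 436, 261))
```

The first pair scores 1 because both cells are 186 wide and 58 tall; the second pair scores low because the widths differ by a factor of almost six.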
Multi-level Labels • Distance between the midpoints • A line through the midpoints • Share a common border
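Two of these cues, the midpoint distance and the shared border, can be sketched as follows. The pixel `tolerance` and the function names are assumptions for illustration, and rectangles are assumed to be (left, top, right, bottom) tuples.

```python
import math


def midpoint(rect):
    """Center point of a (left, top, right, bottom) rectangle."""
    return ((rect[0] + rect[2]) / 2, (rect[1] + rect[3]) / 2)


def midpoint_distance(rect_a, rect_b):
    """Euclidean distance between the two cell centers."""
    (ax, ay), (bx, by) = midpoint(rect_a), midpoint(rect_b)
    return math.hypot(ax - bx, ay - by)


def share_border(rect_a, rect_b, tolerance=3):
    """True when the cells touch along an edge, within a pixel tolerance."""
    horizontal_overlap = min(rect_a[2], rect_b[2]) - max(rect_a[0], rect_b[0]) > 0
    vertical_overlap = min(rect_a[3], rect_b[3]) - max(rect_a[1], rect_b[1]) > 0
    # One cell sits directly above the other.
    stacked = horizontal_overlap and (
        abs(rect_a[3] - rect_b[1]) <= tolerance
        or abs(rect_b[3] - rect_a[1]) <= tolerance
    )
    # One cell sits directly beside the other.
    side_by_side = vertical_overlap and (
        abs(rect_a[2] - rect_b[0]) <= tolerance
        or abs(rect_b[2] - rect_a[0]) <= tolerance
    )
    return stacked or side_by_side


upper_label = (61, 132, 118, 260)   # "Families number in order of visitation."
lower_label = (62, 260, 120, 295)   # the numeric label below it
far_cell = (7, 458, 62, 497)

touching = share_border(upper_label, lower_label)
not_touching = share_border(upper_label, far_cell)
distance = midpoint_distance(upper_label, lower_label)
```

Label cells that touch and whose midpoints are close are stronger candidates for forming one multi-level label.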
Match Label Cells to Object Sets • Location of matched words • Order of matched words • Object sets: Full Name, Location, Day, Family
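A rough sketch of matching a label's printed text against object-set synonyms. The synonym lists are taken from the ontology excerpt; the scoring rule (earlier matches score higher) is only an assumed stand-in for the "location of matched words" cue, and `match_confidence` and `rank_object_sets` are hypothetical names.

```python
import re

# Synonyms copied from the ontology excerpt (a subset of the object sets).
OBJECT_SETS = {
    "Family": ["families"],
    "Age": ["age", "birthday"],
    "Full Name": ["full", "name", "whom", "who"],
    "Last Name": ["last", "surname"],
}


def match_confidence(label_text, synonyms):
    """Score in [0, 1]; a synonym found earlier in the label scores higher."""
    words = re.findall(r"[a-z]+", label_text.lower())
    if not words:
        return 0.0
    best = 0.0
    for position, word in enumerate(words):
        if word in synonyms:
            # Assumption: weight falls off linearly with word position.
            best = max(best, 1.0 - position / len(words))
    return best


def rank_object_sets(label_text):
    """Object sets ordered from most to least likely for this label."""
    scores = {
        name: match_confidence(label_text, synonyms)
        for name, synonyms in OBJECT_SETS.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


ranking = rank_object_sets("Families number in order of visitation.")
best_object_set, best_score = ranking[0]
```

An exact word match is deliberately naive: "family" does not match the synonym "families", which is the kind of gap stemming or a richer synonym list would close.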
Enforce Constraints • Rules for geometric and ontological constraints • Examples: • Same-type value cells have the same dimensions. • A family can’t have 100 members. • Iterate over the rules, seeking convergence
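The "iterate over the rules, seeking convergence" step can be sketched as a fixed-point loop. The two rules shown, the cell identifiers, the tolerance, and the round limit are all hypothetical and serve only to show the control flow.

```python
def enforce_constraints(confidence, rules, tolerance=1e-6, max_rounds=100):
    """Apply every rule repeatedly until no confidence value changes much.

    Returns the number of rounds that were run.
    """
    for round_number in range(1, max_rounds + 1):
        previous = dict(confidence)
        for rule in rules:
            rule(confidence)
        change = max(
            (abs(confidence[key] - previous[key]) for key in confidence),
            default=0.0,
        )
        if change < tolerance:
            return round_number
    return max_rounds


def lower_mismatched(confidence):
    # Hypothetical rule: cells v1 and v3 differ in size, so halve their link.
    confidence[("v1", "v3")] *= 0.5


def raise_corroborated(confidence):
    # Hypothetical rule: cells v1 and v2 agree, so move halfway toward 1.
    current = confidence[("v1", "v2")]
    confidence[("v1", "v2")] = current + 0.5 * (1.0 - current)


confidence = {("v1", "v2"): 0.8, ("v1", "v3"): 0.6}
rounds = enforce_constraints(confidence, [lower_mismatched, raise_corroborated])
```

Each round takes a snapshot, applies all rules, and stops once the largest change falls below the tolerance, so corroborated links drift toward 1 and contradicted links drift toward 0.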
Similar Value Cells • Lower Confidence
Combine Aggregations
Multi-level Labels
Factoring: Check Cardinality Constraints • Observed cardinality in microfilm table • Expected cardinality in genealogical ontology
Observed Cardinality [First Name] per [Family] = 45 / 9 = 4.67 . . .
Expected Cardinality [First Name] per [Family] = 4.8 * 1 * 1 = 4.8
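A small sketch of a cardinality check. It assumes the expected value is the product of average participation values along a path in the ontology (the factors 4.8, 1, 1 come from the slide), and that agreement is scored as a ratio. The observed counts below are hypothetical, and the scoring formula is an assumption rather than the documented method.

```python
from math import prod


def expected_cardinality(path_averages):
    """Multiply the average cardinalities along the ontology path."""
    return prod(path_averages)


def observed_cardinality(value_count, group_count):
    """Average number of values per group seen in the table."""
    return value_count / group_count


def cardinality_agreement(observed, expected):
    """Score in [0, 1]; 1 means observed and expected match exactly."""
    if observed <= 0 or expected <= 0:
        return 0.0
    return min(observed, expected) / max(observed, expected)


expected = expected_cardinality([4.8, 1, 1])

# Hypothetical table: 40 first-name cells grouped into 8 families.
plausible = cardinality_agreement(observed_cardinality(40, 8), expected)

# Hypothetical bad grouping: 100 first-name cells in a single family.
implausible = cardinality_agreement(observed_cardinality(100, 1), expected)
```

A grouping that yields about five names per family agrees closely with the ontology, while one that puts 100 people in a single family scores near zero and would be penalized.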
Ontological Similarity • Increase confidence of label to object set mappings
Same Microfilm Roll • Average confidence values across tables
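Averaging confidence values across tables from the same roll might look like the sketch below. The data layout (a roll identifier paired with a dictionary of label-to-object-set confidences) and the sample values are assumptions for illustration.

```python
from collections import defaultdict


def average_by_roll(table_confidences):
    """Average each mapping's confidence over all tables on the same roll.

    table_confidences: list of (roll_id, {mapping: confidence}) pairs.
    Returns {roll_id: {mapping: mean confidence}}.
    """
    collected = defaultdict(lambda: defaultdict(list))
    for roll_id, mappings in table_confidences:
        for mapping, value in mappings.items():
            collected[roll_id][mapping].append(value)
    return {
        roll_id: {
            mapping: sum(values) / len(values)
            for mapping, values in mappings.items()
        }
        for roll_id, mappings in collected.items()
    }


# Hypothetical confidences for two tables on one roll and one on another.
tables = [
    ("0444770", {("Name", "Full Name"): 0.9, ("Age", "Age"): 1.0}),
    ("0444770", {("Name", "Full Name"): 0.7}),
    ("0000001", {("Name", "Full Name"): 0.4}),
]
averaged = average_by_roll(tables)
```

Tables on one roll usually share a layout, so a mapping that is uncertain in one table can borrow strength from the others.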
Verify Results
Verify Results: SQL Statements • Insert value cell coordinates into the database • INSERT INTO Person (Full Name) VALUES ('335,114,521,172') • INSERT INTO Person (Full Name) VALUES ('335,173,521,231') • …
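A sketch of emitting such inserts against an in-memory SQLite database. The schema, the quoting of the `Full Name` column (an identifier with a space must be quoted in standard SQL), and the helper names are assumptions; the original system's database is not described.

```python
import sqlite3


def quote_identifier(name):
    """Quote a table or column name so that spaces are allowed."""
    return '"' + name.replace('"', '""') + '"'


def insert_records(connection, table, column, rects):
    """Store each value cell's coordinates as a 'left,top,right,bottom' string."""
    sql = (
        f"INSERT INTO {quote_identifier(table)} "
        f"({quote_identifier(column)}) VALUES (?)"
    )
    rows = [(",".join(str(v) for v in rect),) for rect in rects]
    connection.executemany(sql, rows)


connection = sqlite3.connect(":memory:")
# Hypothetical one-column schema for the Person object set.
connection.execute('CREATE TABLE "Person" ("Full Name" TEXT)')

insert_records(
    connection,
    "Person",
    "Full Name",
    [(335, 114, 521, 172), (335, 173, 521, 231)],
)

stored = [
    row[0]
    for row in connection.execute('SELECT "Full Name" FROM "Person" ORDER BY rowid')
]
```

Storing coordinates rather than transcribed text matches the slides: the handwritten values are located and labeled, not recognized.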
Experiments • 75 tables from 15 different microfilm rolls • Precision, recall, and accuracy • Populated SQL fields • Each relationship
Some Long Label Names Caused Confusion • State here the particular Religion or Religious Denomination, to which each person belongs. [Members of Protestant Denominations are requested not to describe themselves by the vague term ‘Protestant,’ but to enter the name of the Particular Church, Denomination, or Body, to which they belong.]
Ambiguous Columns Caused Confusion • Full Name
Conclusions • Identified records in microfilm tables • Geometric and ontological properties • Evidence matrices & corroboration rules • Accuracy: ~92% http://www.rdhd.byu.edu http://www.fht.byu.edu