Automatic Extraction From and Reasoning About Genealogical Records: A Prototype
By Charla J. Woodbury
A thesis submitted to the faculty of Brigham Young University in partial fulfillment of the requirements for the degree of Master of Science
Digital Images – Human Index
• Large number of competing family history websites
• Digital images
• Human indexes
• Researchers hunting through records and indexes to put families together
Problem
• Large amounts of primary genealogical data
• Big projects to index and extract records
• Two independent indexers and adjudication
• Millions of human hours used to index or match records for names and families
Automated Extraction Solution
• Create a specialized extraction ontology to interpret and label genealogical data
• Add rules and logic that
  • Label family roles - husband, daughter, etc.
  • Link family relationships
    • HUSBAND – WIFE
    • PARENT – CHILD
Outline
1. Data Preparation
2. Ontology Extraction System (OntoES)
3. OWL File and SWRL Rules
4. SPARQL Queries
5. Experimental Results
6. Conclusions
1. Data Preparation
• Collect machine-readable records from three different countries
• Format as HTML for extraction
• Prepare lexicons for names, places, etc.
New England Vital Records – Beverly, Massachusetts 1668-1849
Transcript of the original record
1576/1577 eodem die Nicholaus Patch Christinam Denman
26 Jan 1605 Richard Patch et Joanna Lavor
1613 Septembris 26 Johannes Elliott et Joanna Woodbery matrimonis cominguntur
1615 Augusti 7 Thoms Prime et Maria Patch matrimonio cominguntur
1616/1617 Januarij 29 Wilhelmus Woodbery et Elizabetha Patch matrimonio cominguntur
1620 Maij 2 : Wilhelmus Hillerd et Fortu: Patch
1622 Septembris 17 Nicholas Patch et Elizabetha Owsley matrimonio cominguntur
1627/1628 Januarij 22 : Richardus Patch et Maria White matrimonio cominguntur
1630/1631 Januarij 15 Andreas Elliott et Joanna Patch matrimonio cominguntur
1639/1640 Februarij 12 Andreas Elliott et Joanna Pittes matrimonio cominguntur
2. Ontology Extraction System
OntoES: automatically interpret and correctly label genealogical data using
• Data frames
• Regular expressions
• Lexicons
• Date conversion methods
Regular expressions
• MARDATE
  • Value expression
    • EXAMPLE type: 25-Sep 1613
    • (0\d|1\d|2\d|30|31|\d)-{Month}\.?\s*(\d\d\d\d)
  • Keyword expression
    • (\b(md\.?|marry|marriage|married|maried|wed|wedding)\b)
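The two MARDATE expressions can be tried out in plain Python. This is only a sketch: it assumes that `{Month}` expands to an alternation over the MONTH lexicon, and the `MONTHS` list and the `find_marriage_date` helper are hypothetical stand-ins rather than part of OntoES.

```python
import re

# Hypothetical subset of the MONTH lexicon; the real lexicon is larger.
MONTHS = ["jan", "januarij", "january", "feb", "february", "maij", "may",
          "aug", "augusti", "august", "sep", "septembris", "september"]

# Longest entries first so the alternation prefers the full spelling.
month_alt = "|".join(sorted(map(re.escape, MONTHS), key=len, reverse=True))

# Value expression from the slide, with {Month} replaced by the alternation.
MARDATE_VALUE = re.compile(
    r"(0\d|1\d|2\d|30|31|\d)-(" + month_alt + r")\.?\s*(\d\d\d\d)",
    re.IGNORECASE,
)

# Keyword expression from the slide, unchanged.
MARDATE_KEYWORD = re.compile(
    r"(\b(md\.?|marry|marriage|married|maried|wed|wedding)\b)",
    re.IGNORECASE,
)


def find_marriage_date(text):
    """Return (day, month, year) strings for the first date match, or None."""
    m = MARDATE_VALUE.search(text)
    return m.groups() if m else None


print(find_marriage_date("25-Sep 1613 John Elliott and Joan Woodbery"))
print(bool(MARDATE_KEYWORD.search("They were married at South Petherton")))
```

The value expression captures the day, month, and year separately, which is what a later canonicalization step needs; the keyword expression only signals that a nearby date is likely a marriage date.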
Sample MONTH LEXICON
10ber 7ber 8ber 9ber apr april aprilis aug august augusti augustus avr avril avrilis dec december decembr decembre decembri feb febr februari february jan januarij january jul juli julius july jun june
Object Level
CANONICALIZATION METHODS inside the ontology
• Regularize date (Julian format: YYYYddd)
  • 1620 2-May → 1620093
• Display stored Julian format as DD MMM YYYY
  • 1620093 → 2 MAY 1620
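A round trip between a calendar date and a YYYYddd string might look like the sketch below. It assumes that ddd is the ordinal day of the year counted from 1 January and that leap years follow the Julian rule (every fourth year); the function names are hypothetical. Under these assumptions 2 May 1620 encodes as 1620123 rather than the 1620093 shown on the slide, so the thesis may number days differently, and the encoding here is illustrative only.

```python
MONTH_ABBR = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
              "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def days_in_months(year):
    # Assumption: Julian-calendar leap rule (every fourth year is a leap year).
    feb = 29 if year % 4 == 0 else 28
    return [31, feb, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]


def to_ordinal_form(year, month, day):
    """Encode a date as the string YYYYddd (ddd = day of the year)."""
    day_of_year = sum(days_in_months(year)[:month - 1]) + day
    return f"{year:04d}{day_of_year:03d}"


def to_display_form(stored):
    """Decode a YYYYddd string back to 'D MMM YYYY'."""
    year, remaining = int(stored[:4]), int(stored[4:])
    for index, length in enumerate(days_in_months(year)):
        if remaining <= length:
            return f"{remaining} {MONTH_ABBR[index]} {year}"
        remaining -= length
    raise ValueError(f"day of year out of range: {stored}")


stored = to_ordinal_form(1620, 5, 2)
print(stored, "->", to_display_form(stored))
```

Storing dates in a sortable numeric form keeps comparisons simple, while the display function restores the form a researcher expects to read.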
Feast Dates
Dates expressed as a holy day
• Fixed Dates
  • Christmas 1720 → 25 DEC 1720
• Moveable Dates around Easter (36 possible Easter dates with leap year variation)
  • 1723 Dnica Septuagesima → 24 JAN 1723
• Same day as previous entry
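Resolving a feast name to a calendar date can be sketched as follows. The sketch assumes the Gregorian calendar and uses the well-known anonymous Gregorian Easter computation; records kept under the Julian calendar would need the Julian computation instead. The feast tables and the `feast_date` helper are hypothetical and cover only a few feasts.

```python
from datetime import date, timedelta

MONTH_ABBR = ["JAN", "FEB", "MAR", "APR", "MAY", "JUN",
              "JUL", "AUG", "SEP", "OCT", "NOV", "DEC"]


def gregorian_easter(year):
    """Easter Sunday by the anonymous Gregorian algorithm."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


# Offsets in days from Easter Sunday for a few moveable feasts.
MOVEABLE_FEASTS = {"septuagesima": -63, "easter": 0, "pentecost": 49}
# Fixed feasts as (month, day).
FIXED_FEASTS = {"christmas": (12, 25)}


def feast_date(year, feast):
    """Resolve a feast name in a given year to 'D MMM YYYY'."""
    feast = feast.lower()
    if feast in FIXED_FEASTS:
        month, day = FIXED_FEASTS[feast]
        resolved = date(year, month, day)
    else:
        offset = timedelta(days=MOVEABLE_FEASTS[feast])
        resolved = gregorian_easter(year) + offset
    return f"{resolved.day} {MONTH_ABBR[resolved.month - 1]} {resolved.year}"


print(feast_date(1720, "Christmas"))
print(feast_date(1723, "Septuagesima"))
```

Because every moveable feast is a fixed offset from Easter, only the Easter date has to be computed per year; the rest is a table lookup.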
Run Ontology
• Input
  • Ontology (created with OntoES)
  • HTML data (Hypertext Markup Language)
• Output
  • RDF database (Resource Description Framework)
  • OWL file (Web Ontology Language)
3. OWL File and SWRL Rules
• OWL HEADER
    <owl:Class rdf:ID="MarriageRecord"/>
    <owl:Class rdf:ID="Person"/>
    <owl:Class rdf:ID="NameU"/>
    <owl:DatatypeProperty rdf:ID="NameUValue">
      <rdfs:domain rdf:resource="#NameU"/>
      <rdfs:range rdf:resource="&xsd;string"/>
    </owl:DatatypeProperty>
• PERSON - NAMEU
    <owl:ObjectProperty rdf:ID="Person-NameU">
      <rdfs:domain rdf:resource="#Person"/>
      <rdfs:range rdf:resource="#NameU"/>
      <owl:inverseOf>
        <owl:ObjectProperty rdf:ID="NameU-Person"/>
      </owl:inverseOf>
    </owl:ObjectProperty>
Sample RDF Triples
Person_10 | sameAs | Person_10
Person_10 | type | Thing
Person_10 | type | Person
NameU_0 | NameUValue | "Christian Denman"
NameU_0 | sameAs | NameU_0
NameU_0 | type | Thing
NameU_0 | type | NameU
NameM_4 | NameMValue | "Nicholas Patch"
NameM_4 | sameAs | NameM_4
NameM_4 | type | Thing
NameM_4 | type | NameM
SWRL Rules
• Define OWL Class
  • Example – Husband
  • <owl:Class rdf:ID="Husband"/>
• Define Rule
  • Example – Person with male name is a Husband
  • Person-NameM(?x,?y) -> Husband(?x)
Marriage – Person_10 to Person_4
• Person_10
    <Person rdf:ID="Person_10">
      <Person-NameU rdf:resource="#NameU_0" />
    </Person>
• MarriageRecord_7
    <MarriageRecord rdf:ID="MarriageRecord_7">
      <MarriageRecord-Person rdf:resource="#Person_4" />
      <MarriageRecord-Person rdf:resource="#Person_10" />
    </MarriageRecord>
• NameM_4
    <NameM rdf:ID="NameM_4">
      <NameMValue>Nicholas Patch</NameMValue>
    </NameM>
• Person_4
    <Person rdf:ID="Person_4">
      <Person-NameM rdf:resource="#NameM_4" />
    </Person>
Rule HEAD in OWL file
<swrl:Imp rdf:ID="Def-Husband">
  <swrl:head rdf:parseType="Collection">
    <swrl:ClassAtom>
      <swrl:argument1 rdf:resource="#x"/>
      <swrl:classPredicate rdf:resource="#Husband"/>
    </swrl:ClassAtom>
  </swrl:head>
Rule BODY in OWL file
  <swrl:body>
    <swrl:IndividualPropertyAtom>
      <swrl:propertyPredicate rdf:resource="#Person-NameM"/>
      <swrl:argument1 rdf:resource="#x"/>
      <swrl:argument2 rdf:resource="#y"/>
    </swrl:IndividualPropertyAtom>
  </swrl:body>
</swrl:Imp>
Related Rules
• If NameF is populated for one person in a marriage record, the person whose name is in NameU is the Husband
    Person-NameF(?w,?v)
    MarriageRecord-Person(?z,?w)
    MarriageRecord-Person(?z,?x)
    Person-NameU(?x,?y)
    -> Husband(?x)
HusbandOf Rule
    Husband(?x)
    Wife(?y)
    MarriageRecord-Person(?z,?x)
    MarriageRecord-Person(?z,?y)
    -> HusbandOf(?x,?y)
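The effect of these rules can be illustrated with a small forward-chaining loop in plain Python. This is not how a SWRL reasoner is implemented; it is a sketch that applies the Husband and HusbandOf rules to the sample individuals until nothing new is derived. The Wife rule used here is an assumption modelled on the reasoning trace shown with the query results, and all function names are hypothetical.

```python
# Facts are tuples: (class, individual) or (property, subject, object).
SAMPLE_FACTS = {
    ("Person-NameM", "Person_4", "NameM_4"),
    ("Person-NameU", "Person_10", "NameU_0"),
    ("MarriageRecord-Person", "MarriageRecord_7", "Person_4"),
    ("MarriageRecord-Person", "MarriageRecord_7", "Person_10"),
}


def pairs(facts, prop):
    """Return (subject, object) pairs for a binary property."""
    return [(f[1], f[2]) for f in facts if f[0] == prop and len(f) == 3]


def members(facts, cls):
    """Return the individuals asserted to belong to a class."""
    return {f[1] for f in facts if f[0] == cls and len(f) == 2}


def infer(facts):
    """Apply the rules repeatedly until no new fact is produced."""
    facts = set(facts)
    changed = True
    while changed:
        derived = set()
        records = pairs(facts, "MarriageRecord-Person")
        male = {x for x, _ in pairs(facts, "Person-NameM")}
        unknown = {x for x, _ in pairs(facts, "Person-NameU")}
        # Person-NameM(?x,?y) -> Husband(?x)
        derived |= {("Husband", x) for x in male}
        # Assumed Wife rule: NameU person in the same record as a NameM person.
        for z, x in records:
            for z2, w in records:
                if z == z2 and x in unknown and w in male:
                    derived.add(("Wife", x))
        # Husband(?x) Wife(?y) same record -> HusbandOf(?x,?y)
        husbands, wives = members(facts, "Husband"), members(facts, "Wife")
        for z, x in records:
            for z2, y in records:
                if z == z2 and x in husbands and y in wives:
                    derived.add(("HusbandOf", x, y))
        changed = not derived <= facts
        facts |= derived
    return facts


print(sorted(f for f in infer(SAMPLE_FACTS) if f[0] == "HusbandOf"))
```

HusbandOf is only derived on the second pass, after Husband and Wife have been added, which is why the loop runs to a fixed point rather than applying each rule once.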
Auxiliary Name Rules
    NameM(?x) -> Name(?x)
    NameF(?x) -> Name(?x)
    NameU(?x) -> Name(?x)
    NameMValue(?x) -> NameValue(?x)
    NameFValue(?x) -> NameValue(?x)
    NameUValue(?x) -> NameValue(?x)
    Person-NameM(?x,?y) -> Person-Name(?x,?y)
    Person-NameF(?x,?y) -> Person-Name(?x,?y)
    Person-NameU(?x,?y) -> Person-Name(?x,?y)
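These auxiliary rules all have the same shape: a gender-specific class or property implies its general counterpart. A minimal sketch of that generalization step, using a hypothetical lookup table and tuple-based facts rather than the OWL file itself:

```python
# Specific class or property -> general one, mirroring the rules above.
GENERALIZES = {
    "NameM": "Name", "NameF": "Name", "NameU": "Name",
    "NameMValue": "NameValue", "NameFValue": "NameValue",
    "NameUValue": "NameValue",
    "Person-NameM": "Person-Name", "Person-NameF": "Person-Name",
    "Person-NameU": "Person-Name",
}


def add_general_facts(facts):
    """Return the facts plus a generalized copy of every specific fact."""
    general = {(GENERALIZES[f[0]],) + f[1:]
               for f in facts if f[0] in GENERALIZES}
    return set(facts) | general


facts = {
    ("NameM", "NameM_4"),
    ("NameMValue", "NameM_4", "Nicholas Patch"),
    ("Person-NameM", "Person_4", "NameM_4"),
}
for fact in sorted(add_general_facts(facts) - facts):
    print(fact)
```

With the general predicates in place, a query can ask for a person's name without knowing whether it was extracted as a male, female, or unknown-gender name.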
SPARQL Query
Who is the husband of Christian Denman?
PREFIX : <http://www.deg.byu.edu/ontology/Marriage#>
SELECT ?Husband
WHERE {
  ?X :NameValue "Christian Denman" .
  ?Y :Person-Name ?X .
  ?W :HusbandOf ?Y .
  ?W :Person-Name ?V .
  ?V :NameValue ?Husband
}
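How the five triple patterns join on shared variables can be seen in a small pattern matcher. This is a plain-Python sketch, not a SPARQL engine; the `TRIPLES` set is a hand-written stand-in for the inferred RDF database, and `match` and `select` are hypothetical helpers.

```python
# Stand-in for the RDF database after the rules have been applied.
TRIPLES = {
    ("NameU_0", "NameValue", "Christian Denman"),
    ("Person_10", "Person-Name", "NameU_0"),
    ("Person_4", "HusbandOf", "Person_10"),
    ("Person_4", "Person-Name", "NameM_4"),
    ("NameM_4", "NameValue", "Nicholas Patch"),
}

# The five triple patterns of the query; terms starting with "?" are variables.
PATTERNS = [
    ("?X", "NameValue", "Christian Denman"),
    ("?Y", "Person-Name", "?X"),
    ("?W", "HusbandOf", "?Y"),
    ("?W", "Person-Name", "?V"),
    ("?V", "NameValue", "?Husband"),
]


def match(pattern, triple, binding):
    """Extend binding so that pattern equals triple, or return None."""
    binding = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # Bind the variable, or check it against its existing value.
            if binding.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return binding


def select(variable, patterns, triples):
    """Join all patterns and return the distinct values of one variable."""
    bindings = [{}]
    for pattern in patterns:
        extended = []
        for binding in bindings:
            for triple in triples:
                result = match(pattern, triple, binding)
                if result is not None:
                    extended.append(result)
        bindings = extended
    return sorted({b[variable] for b in bindings})


print(select("?Husband", PATTERNS, TRIPLES))
```

Each pattern narrows the set of candidate bindings, so the query walks from the wife's name to her person, across HusbandOf to the husband, and on to his name.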
Query Results
Husband
=======================================
"Nicholas Patch"^^http://www.w3.org/2001/XMLSchema#string
South Petherton Marriages
same day 1576 Nicholas Patch and Christian Denman
26 Jan 1605 Richard Patch and Joan Lavor
25-Sep 1613 John Elliott and Joan Woodbery
7-Aug 1615 Thomas Prime and Maria Parry
29-Jan 1616 William Woodbery and Elizabeth Patch
2-May 1620 William Hillerd and Fortu: Patch
17-Sep 1622 Nicholas Patch and Elizabeth Owsley
22-Jan 1627 Richard Patch and Mary White
15-Jan 1630 Andrew Elliott and Joan Patch
12-Feb 1639 Andrew Elliott and Joan Pitts
"Nicholas Patch" because:
  NameValue("Nicholas Patch") and Name-NameValue(n1, "Nicholas Patch") and Name(n1) is NameM(n1) and Person-NameM(p1, n1)
  NameValue("Christian Denman") and Name-NameValue(n2, "Christian Denman") and Name(n2) is NameU(n2) and Person-NameU(p2, n2)
Husband(p1) because:
  Person-NameM(p1, n1)
Wife(p2) because:
  Person-NameU(p2, n2) and Person-MarriageRecord(p2, r1) and MarriageRecord-Person(r1, p1) and Person-NameM(p1, n1)
HusbandOf(p1, p2) because:
  Husband(p1) and Wife(p2) and MarriageRecord-Person(r1, p1) and MarriageRecord-Person(r1, p2)
5. Experimental Results
• Extraction Results
• American Extraction Problem
• Rule Results
American Difficulty
BIRTH
WOODBURY, Charles Henry [Charles William, P. R. 4.], s. Henry [housewright. dup.] and Henrietta (Galloup), Dec. 4, 1845.
• Extra information inside brackets & parentheses
  • Charles William – twin of Charles Henry
  • Henry [housewright – identified as NAME
  • Henrietta (Galloup) – identified as NAME
Rules Results
• 100% Precision and Recall (Once rules are well-defined, the results are perfect.)
• Database Size (The RDF database is much larger when rule triples are added.)
• NEW PROPERTIES – husband, wife, parent, child
• NEW LINKS
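Precision and recall for inferred relationships can be computed by comparing the inferred facts with a hand-checked set. The sketch below uses hypothetical fact sets purely for illustration; they are not the thesis's experimental data.

```python
def precision_recall(predicted, expected):
    """Return (precision, recall) for two collections of facts."""
    predicted, expected = set(predicted), set(expected)
    correct = len(predicted & expected)
    # Precision: share of inferred facts that are correct.
    precision = correct / len(predicted) if predicted else 1.0
    # Recall: share of expected facts that were inferred.
    recall = correct / len(expected) if expected else 1.0
    return precision, recall


# Hypothetical hand-checked relationships and two hypothetical rule outputs.
expected = {("HusbandOf", "Person_4", "Person_10")}
perfect = {("HusbandOf", "Person_4", "Person_10")}
noisy = perfect | {("HusbandOf", "Person_10", "Person_4")}

print(precision_recall(perfect, expected))
print(precision_recall(noisy, expected))
```

A spurious inferred link lowers precision without affecting recall, while a missed link does the opposite, so 100% on both means the rules produced exactly the expected facts.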
6. Conclusions
• Speed up data indexing
• Make production of a full index easier
• Ground the index in original documents
• Provide for inferred facts
• Simplify as well as augment record search
• Help link records and form family groups and ancestral lines