Automatic Extraction of Individual and Family Information from Primary Genealogical Records By Charla Woodbury October 17, 2006
Digital Images – Human Index • Large number of competing family history websites • Digital images • Human indexes – Double entry • Researchers hunting through records and indexes to put families together
Problem • Large amounts of primary genealogical data • Big projects to index and extract records • Two independent indexers and adjudication • Millions of human hours used to index or match records for names and families
Automated Extraction Solution • Create a specialized extraction ontology to interpret and label genealogical data • Develop expert logic and rules that • Match and merge individuals • Group them into families
Methods • Prepare for the records extraction • Run a 1st PASS to extract the information • Run a 2nd PASS to match individuals and link families • Evaluate and optimize the results
Prepare for Records Extraction • Build an Ontology • BYU ontology software Ontos to interpret and correctly label genealogical data using: Dataframes, Regular expressions, Lexicons, Conversion functions • “encapsulates knowledge about the appearance, behavior, and context of a collection of data elements” (Dr. David Embley) • Collect machine-readable records
Danish GIVEN NAME LEXICON
MALE: Anders – And.; Andreas; Christen – Kristen; Christian – Kristian; Erik – Eric; Gregers; Hans; Ib – Jep – Jeppe; Jacob; Jens; Johan – Johannes – Joh.; Jorgen – Jørgen; Knud; Lars – Laurs – Laurids – Lauritz; Mads – Mats
FEMALE: Ane – Anna – Anne; Birthe – Birte; Bodil; Caroline; Dorthe – Dorte; Ellen – Helene – Elene; Elisabeth – Elsbeth – Lisbeth; Else – Ilse; Ingeborg; Inger; Karen; Kirsten – Christen – Kirstine – Christine – Chirstine; Malene; Maren
DATE Lexicon Adds Thesaurus of Synonyms
MONTHS: January – Jan – Januar – 11br; February – Feb – Februar – 12br; March – Mar – Marts; April – Apr – Apl; May – Mai; June – Jun – Juni; July – Jul – Juli – 5br; August – Aug – Augst – 6br; September – Sep – Sept – 7br – Septembre; October – Oct – 8br – Octobre; November – Nov – 9br – Novembre; December – Dec – 10br
TIME: Year – yr – aar – år; Month – mo – maaned – m.; Week – uge – ug.; Day – dag – d.; Hour – h. – hr.
FEAST DATES: Easter – Paaske – Påske – Paasche – Påsche – P.; Pentecost – Pent – Pinse – Pin; Trinity – Tr – Trin – Trinitatis
DAYS OF WEEK: Sunday – Sun – Dominico – Dom.; Monday – Mon – Mondag – Mond.; Tuesday – Tue – Tirsdag – Tirsd.; Wednesday – Wed – Onsdag – Onsd.; Thursday – Thur – Tørsdag – Tørsd.; Friday – Fri – Fredag – Fred.; Saturday – Sat – Lørsdag – Lørs
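A lexicon that doubles as a thesaurus can be sketched as a lookup from every variant spelling to one canonical form. The snippet below is only an illustration in Python, not the Ontos lexicon format; the dictionary holds a few month entries copied from the slide, and the names `MONTH_LEXICON`, `build_lookup`, and `normalize_month` are hypothetical.

```python
# Hypothetical fragment of the date lexicon: canonical month -> variants.
# The entries are copied from the slide; the structure is an assumption.
MONTH_LEXICON = {
    "January": ["Jan", "Januar", "11br"],
    "September": ["Sep", "Sept", "7br", "Septembre"],
    "October": ["Oct", "8br", "Octobre"],
    "November": ["Nov", "9br", "Novembre"],
    "December": ["Dec", "10br"],
}


def build_lookup(lexicon):
    """Map every spelling (case-folded, trailing period removed) to its canonical form."""
    lookup = {}
    for canonical, variants in lexicon.items():
        for form in [canonical, *variants]:
            lookup[form.casefold().rstrip(".")] = canonical
    return lookup


MONTH_LOOKUP = build_lookup(MONTH_LEXICON)


def normalize_month(token):
    """Return the canonical month for a token, or None if it is not in the lexicon."""
    return MONTH_LOOKUP.get(token.strip().casefold().rstrip("."))


print(normalize_month("7br"))    # September
print(normalize_month("Sept."))  # September
```

The same lookup shape would serve the given-name lexicon, so that "Kristen" and "Christen" resolve to one name before individuals are matched.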
CONVERSION FUNCTIONS inside the ontology • Compute birth date from age at death: death date 22 Mar 1743, age 23 yr 2 m -> BIRTH Jan 1720 • Compute dates from feast dates: Sunday 23rd after Trinity 1751 -> 14 Nov 1751
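Both conversions can be reproduced with a little date arithmetic. The sketch below is not the ontology's own code. It assumes the Gregorian Easter computation (the widely used anonymous algorithm) and takes Trinity Sunday as eight weeks after Easter; the function names are hypothetical.

```python
from datetime import date, timedelta


def birth_from_age_at_death(death_year, death_month, age_years, age_months):
    """Estimate (year, month) of birth by subtracting the age from the death date."""
    months = death_year * 12 + (death_month - 1) - (age_years * 12 + age_months)
    return months // 12, months % 12 + 1


def gregorian_easter(year):
    """Easter Sunday by the anonymous Gregorian algorithm (an assumption for these records)."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


def sunday_after_trinity(n, year):
    """Date of the n-th Sunday after Trinity; Trinity is 8 weeks after Easter."""
    trinity = gregorian_easter(year) + timedelta(weeks=8)
    return trinity + timedelta(weeks=n)


print(birth_from_age_at_death(1743, 3, 23, 2))  # (1720, 1), i.e. Jan 1720
print(sunday_after_trinity(23, 1751))           # 1751-11-14
```

Both results agree with the examples on the slide. Records dated under a different calendar or Easter reckoning would need a different `gregorian_easter`.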
2 Run a 1st pass to extract the information • Annotate the genealogical record with the ontology • Populate RDF data file
Annotated Town Record
• SOURCE – Beverly town records
• [PAGE HEADER] Births page 391
• [BODY] WOODBURY, Benjamin, s. Nickolas and Anne, bp. 26 : 2 m : 1668.
Annotation labels: NAME <NAME> • DATE <DATE> • PLACE <PLACE> • RELATIONSHIP <RELATION> • OCCUPATION <OCCUPATION> • RECORD_TYPE <RTYPE> • SOURCE <SOURCE>
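As a rough illustration of how a regular expression could label the Beverly entry, the Python sketch below tags the name, relationship, date, and record type. It is not the Ontos ontology. The pattern covers only this one entry layout, and expanding "s." to son, "d." to daughter, "bp." to baptism, and "b." to birth is an assumption, as are all identifiers.

```python
import re

# Assumed expansions of the abbreviations used in the printed town records.
RELATIONS = {"s": "son", "d": "daughter"}
EVENTS = {"bp": "BAPTISM", "b": "BIRTH"}

# Hypothetical pattern for entries shaped like:
#   SURNAME, Given, s. Father and Mother, bp. DD : M m : YYYY.
ENTRY = re.compile(
    r"(?P<surname>[A-Z]+),\s*(?P<given>[A-Z][a-z]+),\s*"
    r"(?P<relation>s|d)\.\s*(?P<father>[A-Z][a-z]+)"
    r"(?:\s+and\s+(?P<mother>[A-Z][a-z]+))?,\s*"
    r"(?P<event>bp|b)\.\s*"
    r"(?P<day>\d{1,2})\s*:\s*(?P<month>\d{1,2})\s*m\s*:\s*(?P<year>\d{4})"
)


def annotate(line):
    """Return a dict of labeled fields for one entry, or None if it does not match."""
    m = ENTRY.search(line)
    if m is None:
        return None
    parents = [p for p in (m["father"], m["mother"]) if p]
    return {
        "NAME": f"{m['given']} {m['surname'].title()}",
        "RELATIONSHIP": {"role": RELATIONS[m["relation"]], "parents": parents},
        # The month is kept as the number written in the record, not converted.
        "DATE": {"day": int(m["day"]), "month_number": int(m["month"]), "year": int(m["year"])},
        "RECORD_TYPE": EVENTS[m["event"]],
    }


record = "WOODBURY, Benjamin, s. Nickolas and Anne, bp. 26 : 2 m : 1668."
print(annotate(record))
```

A real extraction ontology would combine many such patterns with the lexicons and conversion functions rather than rely on one fixed layout.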
Annotated Danish Parish
• SOURCE – Tvilum Parish Register
• [PAGE HEADER] Fødde 1751 page 3
• [BODY] Truust Dom. 23 p: Trinit: laest over Niels Baches SØREN fadd. Johannes Michelsens og Niels Mollers hustruer af Søebyevad, Peder Rasmussen af Søebyevad, Jens Bachis søn Peder og Niels Thylkes s. Peder af Truust
Populate RDF-data file • Hilton Campbell’s design • PERSON • EVENT • LINKS – PERSON(S) to EVENT
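The PERSON / EVENT / LINK layout can be pictured as three small record types flattened into subject-predicate-object triples. The Python sketch below shows only that general shape. It does not reproduce Hilton Campbell's actual schema, and the namespace, class fields, and role names are all hypothetical.

```python
from dataclasses import dataclass

NS = "http://example.org/genealogy#"  # hypothetical namespace, not from the source


@dataclass(frozen=True)
class Person:
    id: str
    name: str


@dataclass(frozen=True)
class Event:
    id: str
    type: str
    date: str


@dataclass(frozen=True)
class Link:
    person_id: str
    event_id: str
    role: str  # how the person takes part in the event


def to_triples(persons, events, links):
    """Flatten persons, events, and links into (subject, predicate, object) triples."""
    triples = []
    for p in persons:
        triples.append((NS + p.id, NS + "name", p.name))
    for e in events:
        triples.append((NS + e.id, NS + "type", e.type))
        triples.append((NS + e.id, NS + "date", e.date))
    for link in links:
        triples.append((NS + link.person_id, NS + link.role, NS + link.event_id))
    return triples


persons = [Person("p1", "Benjamin Woodbury"), Person("p2", "Nickolas"), Person("p3", "Anne")]
events = [Event("e1", "BAPTISM", "26 : 2 m : 1668")]
links = [Link("p1", "e1", "child"), Link("p2", "e1", "parent"), Link("p3", "e1", "parent")]

for triple in to_triples(persons, events, links):
    print(triple)
```

Keeping links separate from persons lets several people attach to one event, which is what the second pass needs when it groups individuals into families.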
3 Run a SECOND PASS to match individuals and to link families • FORMULATE RULES in Rule Engine language for RDF-data file • Match individuals • Check family data • Link families up • APPLY RULES through the Java Rules API
Evaluate and Optimize Results • Evaluate the preliminary results • Optimize the rules • Improve the whole process
VALIDATION I Classification by Record Type:
• RECALL = 240 entries CORRECTLY LABELED ‘BIRTH’ / 312 entries ACTUAL BIRTHS = .769
• PRECISION = 240 entries CORRECTLY LABELED ‘BIRTH’ / 246 entries TOTAL LABELED ‘BIRTH’ = .976
The higher the number, the better
VALIDATION II Correctness of the Extraction:
• RECALL = 950 entries CORRECTLY LABELED ‘NAME’ / 1000 entries ACTUAL NAMES = .95
• PRECISION = 950 entries CORRECTLY LABELED ‘NAME’ / 980 entries TOTAL LABELED ‘NAME’ = .969
The higher the number, the better
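The recall and precision figures in Validation I and II follow directly from the counts on the slides. A minimal Python check, with hypothetical function names:

```python
def recall(correctly_labeled, actual_total):
    """Share of the true items that the extractor labeled correctly."""
    return correctly_labeled / actual_total


def precision(correctly_labeled, labeled_total):
    """Share of the extractor's labels that are correct."""
    return correctly_labeled / labeled_total


# Validation I: classification by record type ('BIRTH')
print(round(recall(240, 312), 3))     # 0.769
print(round(precision(240, 246), 3))  # 0.976

# Validation II: correctness of the extraction ('NAME')
print(round(recall(950, 1000), 3))    # 0.95
print(round(precision(950, 980), 3))  # 0.969
```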
Isaac WOODBURY • Children • Robert 4 Jul 1672 • Mary 6 Oct 1674 • Christian 3 Mar 1677/8 • Isaac 6 Apr 1680 • Deliverance 1 Feb 1682/3 • Joshua 1 Jan 1684/5 • Elizabeth 17 Jan 1688 • Nickolas 12 Aug 1688 • Ann 29 Jun 1689 • Lidia 1 Feb 1691/2 • Elisabeth about 1694 • Isaac 20 Jul 1697 • Benjamin 20 Aug 1699
Isaac WOODBURY, SON of HUMPHREY • Mary WILKES • MARRIAGE 9 Oct 1671 • Children: Robert 4 Jul 1672 • Mary 6 Oct 1674 • Christian 3 Mar 1677/8 • Isaac 6 Apr 1680 • Deliverance 1 Feb 1682/3 • Joshua 1 Jan 1684/5 • Elizabeth 17 Jan 1688
Isaac WOODBURY, SON of NICHOLAS • Elizabeth • MARRIAGE ________ • Children: Nickolas 12 Aug 1688 • Ann 29 Jun 1689 • Lidia 1 Feb 1691/2 • Elisabeth about 1694 • Isaac 20 Jul 1697 • Benjamin 20 Aug 1699
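One kind of rule that could separate the two Isaac Woodbury families is a minimum spacing between births to the same mother: Elizabeth (17 Jan 1688) and Nickolas (12 Aug 1688) are under seven months apart. The Python sketch below applies only that single rule. It is not the rule set from the presentation; the 270-day threshold, the mid-year stand-in for "about 1694", and the use of the later year for dual-dated entries are all assumptions.

```python
from datetime import date

MIN_BIRTH_GAP_DAYS = 270  # assumed minimum spacing between births to one mother

# Children recorded under "Isaac WOODBURY". Dual dates such as 1677/8 use the later year.
CHILDREN = [
    ("Robert", date(1672, 7, 4)),
    ("Mary", date(1674, 10, 6)),
    ("Christian", date(1678, 3, 3)),
    ("Isaac", date(1680, 4, 6)),
    ("Deliverance", date(1683, 2, 1)),
    ("Joshua", date(1685, 1, 1)),
    ("Elizabeth", date(1688, 1, 17)),
    ("Nickolas", date(1688, 8, 12)),
    ("Ann", date(1689, 6, 29)),
    ("Lidia", date(1692, 2, 1)),
    ("Elisabeth", date(1694, 7, 1)),  # "about 1694": mid-year placeholder
    ("Isaac", date(1697, 7, 20)),
    ("Benjamin", date(1699, 8, 20)),
]


def split_by_birth_gap(children, min_gap_days=MIN_BIRTH_GAP_DAYS):
    """Start a new family whenever two consecutive births are implausibly close."""
    families = [[]]
    previous = None
    for name, born in sorted(children, key=lambda child: child[1]):
        if previous is not None and (born - previous).days < min_gap_days:
            families.append([])
        families[-1].append(name)
        previous = born
    return families


for family in split_by_birth_gap(CHILDREN):
    print(family)
```

On this data the rule yields the same two groups as the slide, but only because the two families do not overlap in time. Interleaved families would need further evidence, such as the mother's name or a second child named Isaac.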
VALIDATION III • Grouping by FAMILY: (total # merges + splits to correct families after 2nd PASS) / (total # merges + splits to correct families after 1st PASS) • The lower the number, the better
Optimize the Rules • Add • Remove • Fine-tune • Change the order • Improve the whole process • Repeat until the metrics no longer improve
AUTOMATIC EXTRACTION: Unstructured genealogical data -> Searchable annotated genealogical data -> Families in RDF-data file