Automatic Extraction of Individual and Family Information from Primary Genealogical Records

By Charla Woodbury, October 17, 2006.


Presentation Transcript


  1. Automatic Extraction of Individual and Family Information from Primary Genealogical Records • By Charla Woodbury • October 17, 2006

  2. Digital Images – Human Index • Large number of competing family history websites • Digital images • Human indexes – Double entry • Researchers hunting through records and indexes to put families together

  3. Problem • Large amounts of primary genealogical data • Big projects to index and extract records • Two independent indexers and adjudication • Millions of human hours used to index or match records for names and families

  4. Automated Extraction Solution • Create a specialized extraction ontology to interpret and label genealogical data • Develop expert logic and rules that • Match and merge individuals • Group them into families

  5. Methods • Prepare for the records extraction • Run a 1st PASS to extract the information • Run a 2nd PASS to match individuals and link families • Evaluate and optimize the results

  6. Prepare for Records Extraction • Build an Ontology • BYU ontology software Ontos to interpret and correctly label genealogical data using • Dataframes • Regular expressions • Lexicons • Conversion functions • “encapsulates knowledge about the appearance, behavior, and context of a collection of data elements” (Dr. David Embley) • Collect machine-readable records

  7. Ontology – Entity Level

  8. Danish GIVEN NAME LEXICON • MALE: Anders – And.; Andreas; Christen – Kristen; Christian – Kristian; Erik – Eric; Gregers; Hans; Ib – Jep – Jeppe; Jacob; Jens; Johan – Johannes – Joh.; Jorgen – Jørgen; Knud; Lars – Laurs – Laurids – Lauritz; Mads – Mats • FEMALE: Ane – Anna – Anne; Birthe – Birte; Bodil; Caroline; Dorthe – Dorte; Ellen – Helene – Elene; Elisabeth – Elsbeth – Lisbeth; Else – Ilse; Ingeborg; Inger; Karen; Kirsten – Christen – Kirstine – Christine – Chirstine; Malene; Maren
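
A lexicon like the one on this slide can be read as a lookup from every spelling variant to one head form. The sketch below is only an illustration of that idea in Python, not the Ontos lexicon format: the dictionary holds a small subset of the slide's entries, and treating the first name in each group as the canonical form is an assumption.

```python
# Subset of the slide's given-name groups; the first name in each group is
# ASSUMED to be the canonical form (the slide does not say which one is).
GIVEN_NAME_VARIANTS = {
    "Anders": ["And."],
    "Johan": ["Johannes", "Joh."],
    "Lars": ["Laurs", "Laurids", "Lauritz"],
    "Ane": ["Anna", "Anne"],
    "Elisabeth": ["Elsbeth", "Lisbeth"],
}


def build_lookup(variants):
    """Map every spelling (lower-cased) to its canonical form."""
    lookup = {}
    for canonical, alternates in variants.items():
        for form in [canonical, *alternates]:
            lookup[form.lower()] = canonical
    return lookup


LOOKUP = build_lookup(GIVEN_NAME_VARIANTS)


def normalize_given_name(name):
    """Return the canonical form, or the name unchanged if it is not in the lexicon."""
    return LOOKUP.get(name.strip().lower(), name)
```

Some spellings on the slide appear under more than one group (Christen is listed as a male name and as a variant of Kirsten), so a real lexicon would need context such as sex or record position to pick between them; this sketch sidesteps that by leaving those groups out.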

  9. DATE Lexicon Adds Thesaurus of Synonyms • MONTHS: January – Jan – Januar – 11br; February – Feb – Februar – 12br; March – Mar – Marts; April – Apr – Apl; May – Mai; June – Jun – Juni; July – Jul – Juli – 5br; August – Aug – Augst – 6br; September – Sep – Sept – 7br – Septembre; October – Oct – 8br – Octobre; November – Nov – 9br – Novembre; December – Dec – 10br • TIME: Year – yr – aar – år; Month – mo – maaned – m.; Week – uge – ug.; Day – dag – d.; Hour – h. – hr. • FEAST DATES: Easter – Paaske – Påske – Paasche – Påsche – P.; Pentecost – Pent – Pinse – Pin; Trinity – Tr – Trin – Trinitatis • DAYS OF WEEK: Sunday – Sun – Dominico – Dom.; Monday – Mon – Mondag – Mond.; Tuesday – Tue – Tirsdag – Tirsd.; Wednesday – Wed – Onsdag – Onsd.; Thursday – Thur – Tørsdag – Tørsd.; Friday – Fri – Fredag – Fred.; Saturday – Sat – Lørsdag – Lørs

  10. CONVERSION FUNCTIONS inside the ontology • Compute birth date from age at death: Death date 22 Mar 1743, Age 23 yr 2 m -> BIRTH Jan 1720 • Compute dates from feast dates: Sunday 23rd after Trinity 1751 -> 14 Nov 1751
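
Both conversions on this slide are plain date arithmetic. The following is a minimal Python sketch, not the conversion functions inside the ontology itself. The function names are made up for illustration, and the feast-date part assumes the standard Gregorian Easter computation (Trinity Sunday falls 56 days after Easter), which may not match historical practice in every place and year.

```python
from datetime import date, timedelta


def birth_year_month(death, years, months):
    """Subtract an age in years and months from a death date; return (year, month)."""
    total_months = death.year * 12 + (death.month - 1) - (years * 12 + months)
    return total_months // 12, total_months % 12 + 1


def gregorian_easter(year):
    """Easter Sunday by the anonymous Gregorian algorithm (an assumed calendar rule)."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


def sunday_after_trinity(year, n):
    """Date of the n-th Sunday after Trinity Sunday in the given year."""
    trinity = gregorian_easter(year) + timedelta(days=56)
    return trinity + timedelta(weeks=n)
```

With the slide's inputs, `birth_year_month(date(1743, 3, 22), 23, 2)` gives `(1720, 1)`, that is January 1720, and `sunday_after_trinity(1751, 23)` gives 14 November 1751.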

  11. Collect Machine-Readable Records

  12. English Parish – Wirksworth, Derby 1608-1813

  13. Danish Parish – Maglebye 1646-1813

  14. Sample Danish marriages

  15. New England – Beverly, Mass. 1668-1849

  16. Step 2: Run a 1st PASS to extract the information • Annotate the genealogical record with the ontology • Populate RDF data file

  17. Annotated Town Record • SOURCE – Beverly town records • [PAGE HEADER] Births page 391 • [BODY] WOODBURY, Benjamin, s. Nickolas and Anne, bp. 26 : 2 m : 1668. • Annotation labels: NAME <NAME> DATE <DATE> PLACE <PLACE> RELATIONSHIP <RELATION> OCCUPATION <OCCUPATION> RECORD_TYPE <RTYPE> SOURCE <SOURCE>
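
To make the labeling step concrete, here is a rough Python sketch that pulls fields out of the one town-record line shown on this slide with a single regular expression. It is a stand-in for illustration only, not the Ontos extraction ontology. The field names are hypothetical, and reading "s." as son, "d." as daughter, "bp." as baptized, and "b." as born is an assumption about the record's abbreviations.

```python
import re

# Pattern for entries shaped like the slide's example; spaces after
# punctuation are optional because transcriptions differ.
RECORD = re.compile(
    r"(?P<surname>[A-Z]+),\s*(?P<given>[A-Z][a-z]+),\s*"
    r"(?P<relation>s|d)\.\s*(?P<father>[A-Z][a-z]+)\s+and\s+(?P<mother>[A-Z][a-z]+),\s*"
    r"(?P<event>bp|b)\.\s*(?P<day>\d{1,2})\s*:\s*(?P<month>\d{1,2})\s*m\s*:\s*(?P<year>\d{4})"
)

# ASSUMED meanings of the abbreviations.
RELATIONS = {"s": "son", "d": "daughter"}
EVENTS = {"bp": "baptism", "b": "birth"}


def parse_record(line):
    """Return a dict of labeled fields, or None if the line does not fit the pattern."""
    match = RECORD.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["relation"] = RELATIONS[fields["relation"]]
    fields["event"] = EVENTS[fields["event"]]
    # The month is kept as the raw ordinal written in the record ("2 m"),
    # without guessing which calendar month it denotes.
    for key in ("day", "month", "year"):
        fields[key] = int(fields[key])
    return fields


record = parse_record("WOODBURY, Benjamin, s. Nickolas and Anne, bp. 26 : 2 m : 1668.")
```

A single pattern like this only covers one entry layout; the slides describe combining regular expressions with lexicons and conversion functions to handle the variety found in real records.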

  18. Annotated Danish Parish • SOURCE – Tvilum Parish Register • [PAGE HEADER] Fødde 1751 page 3 • [BODY] Truust Dom. 23 p: Trinit: laest over Niels Baches SØREN fadd. Johannes Michelsens og Niels Mollers hustruer af Søebyevad, Peder Rasmussen af Søebyevad, Jens Bachis søn Peder og Niels Thylkes s. Peder af Truust

  19. Populate RDF-data file • Hilton Campbell’s design • PERSON • EVENT • LINKS – PERSON(S) to EVENT

  20. EVENT – birth of Rachel • PERSONs – Sarah and Rachel
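
The PERSON / EVENT / LINKS layout can be pictured as subject-predicate-object triples. The sketch below uses plain Python tuples instead of an RDF library and is not Hilton Campbell's actual schema. All identifiers and predicate names (`type`, `name`, `childIn`, `parentIn`) are hypothetical, and Sarah's role as a parent is an assumption, since the slide only lists her as a person attached to the event.

```python
def event_triples(event_id, event_type, participants):
    """Build triples for one event and the persons linked to it.

    participants: list of (person_id, name, role) tuples, where role is the
    predicate that links the person to the event.
    """
    triples = [(event_id, "type", event_type)]
    for person_id, name, role in participants:
        triples.append((person_id, "name", name))
        triples.append((person_id, role, event_id))  # LINK: person -> event
    return triples


# Slide example: the birth of Rachel, with Sarah and Rachel as linked persons.
triples = event_triples(
    "event1",
    "birth",
    [
        ("person1", "Rachel", "childIn"),
        ("person2", "Sarah", "parentIn"),  # ASSUMED role
    ],
)

for triple in triples:
    print(triple)
```

Keeping persons and events as separate nodes joined by links means one person record can later be attached to several events, which is what the matching pass needs.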

  21. Step 3: Run a 2nd PASS to match individuals and to link families • FORMULATE RULES in Rule Engine language for RDF-data file • Match individuals • Check family data • Link families up • APPLY RULES through the Java Rules API

  22. Evaluate and Optimize Results • Evaluate the preliminary results • Optimize the rules • Improve the whole process

  23. VALIDATION I Classification by Record Type: • RECALL = 240 entries CORRECTLY LABELED ‘BIRTH’ / 312 entries ACTUAL BIRTHS = .769 • PRECISION = 240 entries CORRECTLY LABELED ‘BIRTH’ / 246 entries TOTAL LABELED ‘BIRTH’ = .976 • The higher the number, the better

  24. VALIDATION II Correctness of the Extraction: • RECALL = 950 entries CORRECTLY LABELED ‘NAME’ / 1000 entries ACTUAL NAMES = .95 • PRECISION = 950 entries CORRECTLY LABELED ‘NAME’ / 980 entries TOTAL LABELED ‘NAME’ = .969 • The higher the number, the better
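
The recall and precision figures on the two validation slides follow directly from the counts shown. A short Python check, with function names chosen here for illustration:

```python
def recall(correctly_labeled, actual_total):
    """Share of the entries that truly belong to the class which were labeled correctly."""
    return correctly_labeled / actual_total


def precision(correctly_labeled, labeled_total):
    """Share of the entries given the label which actually deserve it."""
    return correctly_labeled / labeled_total


# Validation I: record type 'BIRTH'
birth_recall = round(recall(240, 312), 3)
birth_precision = round(precision(240, 246), 3)

# Validation II: extracted 'NAME' fields
name_recall = round(recall(950, 1000), 3)
name_precision = round(precision(950, 980), 3)

print(birth_recall, birth_precision, name_recall, name_precision)
```

Rounded to three places, these reproduce the slide values .769, .976, .95 and .969.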

  25. Isaac WOODBURY • Children • Robert 4 Jul 1672 • Mary 6 Oct 1674 • Christian 3 Mar 1677/8 • Isaac 6 Apr 1680 • Deliverance 1 Feb 1682/3 • Joshua 1 Jan 1684/5 • Elizabeth 17 Jan 1688 • Nickolas 12 Aug 1688 • Ann 29 Jun 1689 • Lidia 1 Feb 1691/2 • Elisabeth about 1694 • Isaac 20 Jul 1697 • Benjamin 20 Aug 1699

  26. Isaac WOODBURY, SON of HUMPHREY • Mary WILKES • MARRIAGE 9 Oct 1671 • Robert 4 Jul 1672 • Mary 6 Oct 1674 • Christian 3 Mar 1677/8 • Isaac 6 Apr 1680 • Deliverance 1 Feb 1682/3 • Joshua 1 Jan 1684/5 • Elizabeth 17 Jan 1688 • Isaac WOODBURY, SON of NICHOLAS • Elizabeth • MARRIAGE ________ • Nickolas 12 Aug 1688 • Ann 29 Jun 1689 • Lidia 1 Feb 1691/2 • Elisabeth about 1694 • Isaac 20 Jul 1697 • Benjamin 20 Aug 1699
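
One kind of rule that could separate the two Isaac WOODBURY families is a check on the spacing between consecutive children. The Python sketch below is a hypothetical rule written for illustration; the slides do not state which rules were actually used, and the 270-day threshold is an assumption. Dual-dated entries such as 1677/8 are entered with the later year, and "Elisabeth about 1694" is left out because it has no exact date.

```python
from datetime import date

# Children listed under "Isaac WOODBURY" on the earlier slide.
CHILDREN = [
    ("Robert", date(1672, 7, 4)),
    ("Mary", date(1674, 10, 6)),
    ("Christian", date(1678, 3, 3)),
    ("Isaac", date(1680, 4, 6)),
    ("Deliverance", date(1683, 2, 1)),
    ("Joshua", date(1685, 1, 1)),
    ("Elizabeth", date(1688, 1, 17)),
    ("Nickolas", date(1688, 8, 12)),
    ("Ann", date(1689, 6, 29)),
    ("Lidia", date(1692, 2, 1)),
    ("Isaac", date(1697, 7, 20)),
    ("Benjamin", date(1699, 8, 20)),
]

MIN_GAP_DAYS = 270  # ASSUMED minimum spacing between siblings with the same mother


def split_families(children, min_gap_days=MIN_GAP_DAYS):
    """Start a new family whenever two consecutive dates are implausibly close."""
    families = [[]]
    previous = None
    for name, born in sorted(children, key=lambda child: child[1]):
        if previous is not None and (born - previous).days < min_gap_days:
            families.append([])
        families[-1].append(name)
        previous = born
    return families


families = split_families(CHILDREN)
```

On this data the only gap under the threshold is between Elizabeth (17 Jan 1688) and Nickolas (12 Aug 1688), so the rule yields the same two groups as the slide. It is a crude rule: twins would be split apart, and interleaved families would not be untangled, so it would need to be combined with other evidence such as the parents' names.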

  27. VALIDATION III • Grouping by FAMILY: (total # merges + splits to correct families after 2nd PASS) / (total # merges + splits to correct families after 1st PASS) • The lower the number, the better

  28. Optimize the Rules • Add • Remove • Fine-tune • Change the order • Improve the whole process • Until the metrics no longer improve

  29. AUTOMATIC EXTRACTION • Unstructured genealogical data -> Searchable annotated genealogical data -> Families in RDF-data file

  30. Questions?
