
An Automated Record Linkage System for the Canadian Census, 1871-1881

L. Antonie (University of Guelph), P. Baskerville (Universities of Alberta and Victoria), K. Inwood (University of Guelph), J. A. Ross (University of Guelph).



Presentation Transcript


  1. An Automated Record Linkage System for the Canadian Census, 1871-1881. L. Antonie (University of Guelph), P. Baskerville (Universities of Alberta and Victoria), K. Inwood (University of Guelph), J. A. Ross (University of Guelph). Record Linkage Workshop, May 24th-25th, 2010, University of Guelph.

  2. What we are working towards • ‘Unbiased’ links connecting individuals/households over several census years • A comprehensive infrastructure of longitudinal data • Censuses shown in the diagram: 1851, 1871, 1881, 1891, 1901, 1906, 1911, 1916, plus US 1880 and US 1900

  3. Current Work • Automatic linking of 100% of the 1871 Census (3,601,663 records) to 100% of the 1881 Census (4,277,807 records) • Partners and collaborators: FamilySearch, Church of Latter Day Saints, Minnesota Population Center, Université de Montréal, University of Alberta

  4. Existing (True) Links • Ontario Industrial Proprietors – 8429 links • Logan Township – 1760 links • St. James Church, Toronto – 232 links • Quebec City Boys – 1403 links • Bias: family context; others? [Map labels: Guelph, Logan Twp]

  5. Attributes for Automatic Linking • Last Name - string • First Name - string • Gender – binary • Age - number • Birthplace - number • Marital status – single, married, divorced, widowed, unknown

  6. Automatic Linkage • The challenges: 1) Identify the same person 2) Deal with attribute characteristics 3) Manage computational expense • The system:

  7. Data Cleaning and Standardization • Cleaning: names (remove non-alphanumeric characters; remove titles); age (transform non-numerical representations, e.g. 3 months, to the corresponding numbers); all attributes (deal with English/French notations, e.g. days/jours, married/mariee) • Standardization: birthplace codes and granularity; marital status
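The cleaning rules on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the title list, the unit table, the marital-status spellings and the function names are all hypothetical, since the slides do not enumerate them.

```python
from typing import Optional
import re

# Hypothetical lookup tables; the slides do not list the actual values used.
TITLES = {"mr", "mrs", "miss", "dr", "rev", "sir"}
UNIT_IN_YEARS = {
    "days": 1 / 365, "jours": 1 / 365,
    "weeks": 1 / 52, "semaines": 1 / 52,
    "months": 1 / 12, "mois": 1 / 12,
}
MARITAL = {
    "single": "single", "celibataire": "single",
    "married": "married", "marie": "married", "mariee": "married",
    "widowed": "widowed", "veuf": "widowed", "veuve": "widowed",
    "divorced": "divorced",
}


def clean_name(raw: str) -> str:
    """Lower-case, drop non-alphanumeric characters, remove titles."""
    kept = "".join(ch for ch in raw.lower() if ch.isalnum() or ch.isspace())
    return " ".join(tok for tok in kept.split() if tok not in TITLES)


def clean_age(raw: str) -> Optional[float]:
    """Convert '34', '3 months' or '15 jours' to an age in years."""
    match = re.fullmatch(r"\s*(\d+)\s*([a-z]*)\s*", raw.lower())
    if match is None:
        return None
    value, unit = int(match.group(1)), match.group(2)
    if unit == "":
        return float(value)
    factor = UNIT_IN_YEARS.get(unit)
    return None if factor is None else value * factor


def standardize_marital(raw: str) -> str:
    """Map English/French spellings to one code; anything else is 'unknown'."""
    return MARITAL.get(raw.strip().lower(), "unknown")
```

For example, `clean_name("Dr. John")` yields `"john"` and `clean_age("6 mois")` yields roughly `0.5`.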

  8. Computational Expense • Comparing all possible pairs of records is very expensive • Computing similarity between 3.5 million records (1871 census) and 4 million records (1881 census) • Run-time estimate: ((3.5M x 4M) record pairs x 2 attributes being compared) / (4M comparisons per second) / 60 (sec/min) / 60 (min/hour) / 24 (hours/day). As written, this expression evaluates to about 81 days; the 40.5 days quoted on the slide corresponds to one attribute comparison per pair.
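The run-time estimate can be checked with a few lines of arithmetic. The sketch below uses the rounded figures from the slide; `runtime_days` is a hypothetical helper name. One comparison per pair reproduces the 40.5 days on the slide, while two attribute comparisons per pair give about 81 days.

```python
RECORDS_1871 = 3.5e6            # rounded figure from the slide
RECORDS_1881 = 4.0e6            # rounded figure from the slide
COMPARISONS_PER_SECOND = 4.0e6  # throughput assumed on the slide
SECONDS_PER_DAY = 60 * 60 * 24


def runtime_days(attributes_per_pair: int) -> float:
    """Days needed to compare every 1871 record with every 1881 record."""
    pairs = RECORDS_1871 * RECORDS_1881
    comparisons = pairs * attributes_per_pair
    return comparisons / COMPARISONS_PER_SECOND / SECONDS_PER_DAY


print(round(runtime_days(1), 1))  # 40.5
print(round(runtime_days(2), 1))  # 81.0
```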

  9. Managing Computational Expense • Blocking: by first letter of last name; by birthplace • Using HPC: running the system on multiple processors
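Blocking restricts comparisons to records that share a key, so most of the cross product is never generated. A minimal sketch follows, assuming the two blocking criteria are combined into a single key (the slides do not say whether they are used together or separately); the record layout and function names are hypothetical, and the multi-processor part is omitted.

```python
from collections import defaultdict


def block_key(record):
    """Blocking key: first letter of the last name plus the birthplace code."""
    return (record["last_name"][:1].upper(), record["birthplace"])


def candidate_pairs(census_a, census_b):
    """Yield only the pairs whose records fall in the same block."""
    blocks_b = defaultdict(list)
    for rec in census_b:
        blocks_b[block_key(rec)].append(rec)
    for rec_a in census_a:
        for rec_b in blocks_b.get(block_key(rec_a), []):
            yield rec_a, rec_b


# Toy records, invented for illustration only.
census_1871 = [
    {"id": 1, "last_name": "Ross", "birthplace": 15},
    {"id": 2, "last_name": "Inwood", "birthplace": 15},
    {"id": 3, "last_name": "Roy", "birthplace": 14},
]
census_1881 = [
    {"id": 10, "last_name": "Ross", "birthplace": 15},
    {"id": 11, "last_name": "Reid", "birthplace": 15},
    {"id": 12, "last_name": "Roy", "birthplace": 14},
    {"id": 13, "last_name": "Irwin", "birthplace": 20},
]

pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(census_1871, census_1881)]
print(pairs)  # 3 candidate pairs instead of the 12 in the full cross product
```

Because each block is independent of the others, blocks are also a natural unit for spreading the work across processors.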

  10. Record Comparison • Comparing strings: Jaro-Winkler, Edit Distance, Double Metaphone • Age: +/- 2 years • Exact matches: Gender, Birthplace
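The comparison step turns a pair of records into a vector of per-attribute similarities. The sketch below covers only edit distance, the age tolerance and the exact matches; Jaro-Winkler and Double Metaphone are left out. It assumes that the +/- 2 years is measured after allowing for the ten years between the two censuses, which the slide does not state, and the record layout and function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row with dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,          # delete ch_a
                current[j - 1] + 1,       # insert ch_b
                previous[j - 1] + cost,   # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def name_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1], where 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


def compare(rec_1871, rec_1881, years_between=10, tolerance=2):
    """Feature vector: last name, first name, age, gender, birthplace."""
    expected_age = rec_1871["age"] + years_between
    return [
        name_similarity(rec_1871["last_name"], rec_1881["last_name"]),
        name_similarity(rec_1871["first_name"], rec_1881["first_name"]),
        1.0 if abs(rec_1881["age"] - expected_age) <= tolerance else 0.0,
        1.0 if rec_1871["gender"] == rec_1881["gender"] else 0.0,
        1.0 if rec_1871["birthplace"] == rec_1881["birthplace"] else 0.0,
    ]
```

A vector of this kind is what a match/non-match classifier would take as input.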

  11. Classification • Classifier: Support Vector Machines; 5-fold cross validation • Training data: true links found by experts (Ontario proprietors) • Classes: match, non-match
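To make the classification step concrete, here is a small self-contained sketch of 5-fold cross validation around a linear SVM. It is a stand-in, not the system's classifier: the slides do not name the SVM library, kernel or features, so this version fits a linear model by subgradient descent on the hinge loss, and the feature vectors are synthetic. All function names and hyperparameters are assumptions.

```python
import random


def train_linear_svm(rows, labels, epochs=200, lr=0.05, reg=0.001, seed=0):
    """Fit weights w and bias b by subgradient descent on the hinge loss."""
    rng = random.Random(seed)
    w, b = [0.0] * len(rows[0]), 0.0
    order = list(range(len(rows)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            x, y = rows[i], labels[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            w = [wj * (1 - lr * reg) for wj in w]   # L2 regularization step
            if margin < 1:                          # inside the margin: hinge step
                w = [wj + lr * y * xj for wj, xj in zip(w, x)]
                b += lr * y
    return w, b


def predict(model, x):
    """Return +1 (match) or -1 (non-match)."""
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


def cross_validate(rows, labels, k=5, seed=0):
    """Return the held-out accuracy of each of the k folds."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    accuracies = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in order if i not in held]
        model = train_linear_svm([rows[i] for i in train],
                                 [labels[i] for i in train])
        correct = sum(predict(model, rows[i]) == labels[i] for i in held_out)
        accuracies.append(correct / len(held_out))
    return accuracies


# Synthetic [name similarity, age within tolerance] vectors, for illustration.
rows = ([[0.90 + 0.01 * i, 1.0] for i in range(10)]     # matches
        + [[0.30 + 0.02 * i, 0.0] for i in range(10)])  # non-matches
labels = [1] * 10 + [-1] * 10
scores = cross_validate(rows, labels)
print(sum(scores) / len(scores))
```

In practice the training rows would come from expert-verified links such as the Ontario proprietors set, with non-matches drawn from other candidate pairs.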

  12. Linkage Results

  13. Linkage Results - Evaluation

  14. Linkage Results - Evaluation

  15. Directions to Improve • Common patterns in incorrect links: big age difference; change in marital status for females; first name change • Probability estimate score of the classifier
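One way to act on these directions is a post-filter that drops proposed links showing any of the listed patterns or scoring below a threshold. The sketch below is only an assumption about how such a filter could look: the slides do not define the age gap that counts as "big", which marital-status changes are treated as suspicious, or the score cut-off. The record layout, function names and default values are hypothetical.

```python
def flag_reasons(rec_1871, rec_1881, years_between=10, max_age_gap=2):
    """List the error patterns from the slide that a proposed link shows."""
    reasons = []
    aging = rec_1881["age"] - rec_1871["age"]
    if abs(aging - years_between) > max_age_gap:
        reasons.append("big age difference")
    if rec_1871["gender"] == "F" and rec_1871["marital"] != rec_1881["marital"]:
        reasons.append("marital status change (female)")
    if rec_1871["first_name"] != rec_1881["first_name"]:
        reasons.append("first name change")
    return reasons


def keep_link(rec_1871, rec_1881, score, min_score=0.9):
    """Keep a link only if it scores high enough and shows no error pattern."""
    return score >= min_score and not flag_reasons(rec_1871, rec_1881)
```

Raising `min_score` trades fewer links for fewer false links, which is why it is left as a parameter here.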

  16. Results – Common Patterns Before After

  17. Results – Common Patterns Before After

  18. Results – Classification Scores 0.8 0.85 0.9

  19. Conclusions • Linking people across 1871-1881 Canadian censuses • Preliminary automated linkage system • More evaluation and experimentation is needed

  20. Acknowledgements • University of Guelph • Ontario Ministry of Research and Innovation • SHARCNET • FamilySearch, Church of Latter Day Saints • Minnesota Population Center • University of Alberta • Université de Montréal/PRDH • Université Laval/CIEQ
