140 likes | 235 Views
Anonymised Integrated Event History Datasets for Researchers. Johan Heldal Statistics Norway. Contents. The Social Security Database FD-trygd About the Norwegian Social Science Data Service Event history data from FD-trygd Data to researchers – principles laid down Anonymisation
E N D
Anonymised Integrated Event History Datasets for Researchers Johan Heldal Statistics Norway
Contents • The Social Security Database FD-trygd • About the Norwegian Social Science Data Service • Event history data from FD-trygd • Data to researchers – principles laid down • Anonymisation • Establishing measure of disclosure risk
Social security event history data baseFD-trygd • Contains all events related to the Norwegian social security system for every person with residence I Norway since 1992. • All benefits and associated variables • All dates for events • All demographic histories (birth/immigration, sex, marital status, children) • All places of residence • Are kept in different oracle “tables” that can be exactly merged by a personal identification code.
FD-trygd • Has high data quality • Can also be merged to • Education histories • Incomes from the yearly assessment • Is extremely valuable for research purposes.
NSDThe Norwegian Social Science Data Services • Established (1971) to simplify access to data for researchers and students in Norway. • Distributes anonymised survey datasets from Statistics Norway and others. • NSD requested (2009) a 20 % sample of all individual histories in FD-trygd for anonymisation to researchers at their own premises. • The request has been approved by Statistics Norway, but • SN has (by law) the responsibility for the confidentiality • Anonymisation and dissemination must take place according to rules set by SN.
Classes of event-variables in FD-trygd • Demographic variables • Pensions • Supports • Rehabilitation • Labour market • Education • Income from assessment
Principles laid down • Combining tables in FD-trygd is Data Integration. • Must respect the principles laid down in Principles and guidelines on Confidentiality Aspects of Data integration Undertaken for Statistical or Related Research Purposes • It should not with reasonable means be possible to identify someone in the dataset. • Important for SN to establish clear rules for NSD’s anonymisation based on this.
Rules should • Manage realistic disclosure scenarios • Be able to stand scrutiny from investigating journalists • Be transparent for the researchers • Adapt to each researchers needs as well as possible (Need to know principle) • Creating one complete anonymised 20 % sample is out of question.
Restrict information to researchers wrt. • Sample size (from the 20% sample) • Variable scope • Length of event histories • Detail for each variable • Details for dates If too large: • Different samples with different variables for different analyses • To find the best balance is a challenge
For strict rules: • Need to establish a model for risk for this type of data. • Can the μ-Argus risk measure (Franconi &Polettini 2004) be extended to event history data? • Must take into account increased risk from • Identifying variables given at all times • Model for memory on event history • Precision of times for events
Preliminary rules • Limit sample sizes to 10 percent of the target population as represented in the 20 % sample, i.e. about 2 % of the total target. • Restrict detail for the most visible identifying variables. • Round economic benefits associated with states • Restrict all datings to YYYYMM. • Only five levels for education • Positive incomes only in quintiles of distribution. • NSD has started test deliveries based on these rules • The tests will be evaluated next year.
With a good measure of risk • the researchers could be able to choose larger sample size and less variable scope • or variable detail • or smaller sample size and larger variable scope and detail • as long as the total risk stays within a limit. • We hope the experiences from the test deliveries will be useful here