Understanding Research Data Centres Chuck Humphrey Data Library University of Alberta Atlantic DLI Workshop
Outline • Discuss a common goal of the DLI and RDC programs to show how they complement one another. • Discuss some differences between the DLI and RDC in how they provide access to data. • Discuss what happens behind the security door of an RDC.
Common Input Goal • A goal of DLI is to create affordable and equitable access to “standard data products” for post-secondary institutions. • A goal of the RDC program is to provide access to confidential data for approved research projects using procedures allowed under the conditions of the Statistics Act.
Continuum of Access • [Figure: statistics data access channels positioned along two scales, Open to Restricted and Free to Expensive. Channels shown: Custom Tabulations, Research Data Centres, Statistics Canada Website, Data Liberation Initiative, Remote Job Submission, Depository Service Program.]
Access Problem Being Solved • DLI: The problem being solved was the high cost of “standard data products.” • RDC: The problem being solved was access to the confidential files of the longitudinal surveys begun in the 1990s.
Access: Some Differences • DLI: Access is determined by a paid institutional membership and a license that defines approved users and uses of these data products. • RDC: Access is determined by a peer-approval process for projects, a security clearance prior to establishing “deemed employee” status, and a contract. Institutions must pay a $100,000 per year service fee to operate an RDC.
Access: Some Differences • DLI: Access is to “standard data products”, which have been created for public dissemination. • RDC: Access is to confidential data, which are protected under the Statistics Act and are only available to STC employees or “deemed employees” who have been given approval to use the data. These data products have not been created for dissemination.
Behind the Closed Doors • We’ve discussed in the past the conditions of working in an RDC: • Approved peer-reviewed research project; • Signed contract with STC to deliver a report based on the project; • Swear an oath to the Statistics Act; • Participate in an orientation; • Work only with the data approved in the project proposal; • Restricted printing and removal of output.
Behind the Closed Doors • What does the RDC Analyst do? • Administer researcher procedures, including the researcher orientation, contracts, security procedures, and setting up accounts. • Administer operations within the RDC, including the management of the data and supporting the Academic Director. • Provide support to researchers by consulting on the data and offering technical advice. • Participate in collaborative research and independent research. • Conduct Disclosure Analysis.
Disclosure Issues • Direct Identifiers (name, address, health services number, etc.) uniquely identify a respondent. These are all stripped from released data files. • Indirect Identifiers are variables (such as age, marital status, occupation, ethnicity, postal code, type of business, etc.) that when combined could identify a respondent. • Source: Irene Wong, RDC Analyst
Disclosure Issues • Sensitive variables refer to information or characteristics relating to a respondent’s private life or business which are usually unknown to others (income, illness, behaviour etc.).
Disclosure Risk • Combining indirect identifiers with sensitive variables poses a disclosure risk. • However, researchers often seek these kinds of relationships in data and try to explain them. • Control methods are therefore introduced: restricted access, data reduction, disclosure analysis.
Identity Disclosure • Identity Disclosure - When a respondent can be identified from the released data. • Arises when identifiers are combined with sensitive variables. Example: • Income, gender, occupation and residence within a Wolfville postal code
Attribute Disclosure • Attribute Disclosure - When confidential information is revealed and can then be attributed to an individual or a group. • All persons with characteristic x have characteristic y. Example: • 100% of female respondents of age 13 in Wolfville reported that they experimented with X
Residual Disclosure • Residual disclosure - When confidential information is disclosed by combining previously released output or information. • Risk of residual disclosure is high in: • Subsequent cycles of longitudinal data files (e.g. NLSCY, NPHS, etc.) • Samples from dependent surveys (e.g. SLID and LFS) • Research projects using the same data file • Overlapping geographical areas (e.g. Health Region and Economic Region)
Lowering Disclosure Risk General rules used with household sample surveys: • Do not report statistics or table cells based on a small number of respondents (e.g. fewer than 5 respondents) • No anecdotal information may be given about specific respondents • ‘Zero’ and ‘Full’ cell restriction • Min. and max. value restriction • Saturated models and covariance/correlation matrices are treated like their underlying tables
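The small-cell rule can be sketched in a few lines of Python. This is only an illustration, not an RDC tool: the function name `small_cells`, the record layout, and the counts are all hypothetical, and the threshold of 5 is simply the example given on the slide.

```python
from collections import Counter

MIN_CELL_COUNT = 5  # example threshold from the slide ("fewer than 5 respondents")


def small_cells(records, row_key, col_key, min_count=MIN_CELL_COUNT):
    """Return observed cells of a two-way table whose unweighted count is below min_count."""
    counts = Counter((r[row_key], r[col_key]) for r in records)
    return {cell: n for cell, n in counts.items() if n < min_count}


# Made-up records for illustration only (hypothetical variables "sex" and "smoker").
records = (
    [{"sex": "M", "smoker": 1}] * 12
    + [{"sex": "M", "smoker": 0}] * 9
    + [{"sex": "F", "smoker": 1}] * 7
    + [{"sex": "F", "smoker": 0}] * 2
)

flagged = small_cells(records, "sex", "smoker")
print(flagged)  # {('F', 0): 2}
```

Note that this sketch only sees cells that occur in the data; empty cells need the separate ‘zero’ cell check.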
Low frequency cells • (F, 0) is a low-frequency cell. Solution? • Collapse columns ‘M’ and ‘F’ into a column ‘total’ • Collapse rows ‘1’ and ‘0’ into a row ‘total’ • Report either column ‘M’ or row ‘1’, but not along with the ‘total’
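A minimal sketch of the two collapsing options follows. The slide's table is not reproduced in this text, so the counts below are invented, and the helper names `collapse_columns` and `collapse_rows` are hypothetical.

```python
def collapse_columns(table):
    """Merge all columns into a single 'total' per row."""
    return {row: sum(cols.values()) for row, cols in table.items()}


def collapse_rows(table):
    """Merge all rows into a single 'total' per column."""
    totals = {}
    for cols in table.values():
        for col, n in cols.items():
            totals[col] = totals.get(col, 0) + n
    return totals


# Invented counts: rows are the 1/0 response, columns are M/F.
# The (F, 0) cell holds only 2 respondents, so it should not be reported as is.
table = {1: {"M": 12, "F": 7}, 0: {"M": 9, "F": 2}}

print(collapse_columns(table))  # {1: 19, 0: 11}
print(collapse_rows(table))     # {'M': 21, 'F': 9}
```

Either collapsed version hides the small cell. Releasing a collapsed total together with the remaining detail would allow the small cell to be recovered by subtraction, which is why the slide warns against reporting both.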
Frequency distributions • Frequency curve, e.g.: user wishes to release the value of the observation at the 99th percentile.* • If fewer than 5 respondents are above the 99th percentile, there is a problem. One solution is to describe the distribution using the 95th percentile. • * If the survey is multilevel (NLSCY), then the 5 or more respondents from level 1 (child) must come from at least 3 different units from level 2 (household). Example: child 1: family 1; child 2: family 1; child 3: family 2; child 4: family 2; child 5: family 3 …
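The percentile rule, including the multilevel condition, might be checked as in the sketch below. The function name `tail_is_releasable`, the data layout, and all values are assumptions for illustration. The cutoff is passed in directly rather than computed, since the slide does not say how the percentile is to be calculated.

```python
MIN_RESPONDENTS = 5  # at least 5 level-1 respondents (children) above the cutoff
MIN_HOUSEHOLDS = 3   # drawn from at least 3 level-2 units (households)


def tail_is_releasable(observations, cutoff):
    """observations: iterable of (value, household_id) pairs.

    True when the respondents strictly above `cutoff` satisfy both minimums.
    """
    above = [(value, household) for value, household in observations if value > cutoff]
    households = {household for _, household in above}
    return len(above) >= MIN_RESPONDENTS and len(households) >= MIN_HOUSEHOLDS


# Made-up values mirroring the slide's example: five children from three families.
ok_tail = [(101, "family 1"), (102, "family 1"), (103, "family 2"),
           (104, "family 2"), (105, "family 3")]
# Five children, but from only two families.
bad_tail = [(101, "family 1"), (102, "family 1"), (103, "family 1"),
            (104, "family 2"), (105, "family 2")]

print(tail_is_releasable(ok_tail, cutoff=100))   # True
print(tail_is_releasable(bad_tail, cutoff=100))  # False
```

When the check fails, the slide's suggestion is to move to a lower percentile, such as the 95th, and test again.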
‘Zero’ and ‘Full’ cell • (F, 1) is a full cell • (F, 0) is a non-structural zero cell • Both could pose a confidentiality problem • (Married, age <12) is a structural zero cell • Not a data confidentiality problem • No one is expected to be in this category
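The cell types above can be told apart mechanically once the analyst lists which cells are structurally impossible. The following is a sketch under that assumption. The function name `classify_cells`, the table layout, and the counts are hypothetical.

```python
def classify_cells(table, structural_zeros=frozenset()):
    """Label each cell of {group: {category: count}}.

    A 'full' cell holds every respondent of its group. A 'non-structural zero'
    is an empty cell that could have held respondents. 'structural zero' cells
    are impossible by definition and must be listed by the analyst.
    """
    labels = {}
    for group, cats in table.items():
        group_total = sum(cats.values())
        for cat, n in cats.items():
            if (group, cat) in structural_zeros:
                labels[(group, cat)] = "structural zero"
            elif n == 0:
                labels[(group, cat)] = "non-structural zero"
            elif n == group_total:
                labels[(group, cat)] = "full"
            else:
                labels[(group, cat)] = "ok"
    return labels


# Invented counts: every female respondent answered 1.
table = {"M": {1: 12, 0: 9}, "F": {1: 9, 0: 0}}
labels = classify_cells(table)
print(labels[("F", 1)])  # full
print(labels[("F", 0)])  # non-structural zero
```

Only the ‘full’ and ‘non-structural zero’ labels signal a possible confidentiality problem. Small but non-empty cells are a separate check.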
Implied Tables - residual disclosure • Implied tables are tables produced by subtracting results from one or more published tables from another published table • In this example, ‘non-married’ individuals can easily be calculated
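A small worked sketch shows how an implied table arises by subtraction. The slide's own tables are not reproduced in this text, so the age groups and counts are invented, and `implied_table` is a hypothetical helper name. The threshold of 5 is the example used elsewhere in this deck.

```python
MIN_CELL_COUNT = 5  # example small-cell threshold


def implied_table(total, published_part):
    """Subtract one published table from another, cell by cell."""
    return {key: total[key] - published_part.get(key, 0) for key in total}


# Two tables that each look safe on their own (invented counts).
all_respondents = {"15-24": 40, "25-44": 120}
married = {"15-24": 37, "25-44": 80}

# Anyone holding both tables can derive the 'non-married' counts.
non_married = implied_table(all_respondents, married)
risky = [age for age, n in non_married.items() if n < MIN_CELL_COUNT]

print(non_married)  # {'15-24': 3, '25-44': 40}
print(risky)        # ['15-24']
```

Neither published table contains a small cell, yet the implied table does, which is why new output has to be vetted against what has already been released.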
Reporting Information • Writing a report is no different from working with table output. Avoid statements such as: • “… reported incomes ranging from $2,498 to $579,789.” • If necessary, give general indications (e.g. “no income was above $600,000”). • “… all respondents of age 16 reported experimenting with drugs.” • This is equivalent to a full cell situation.