Access to Confidential Data for Statistical Analysis. Kenneth Harris, Director of Research Data Center. National Center for Health Statistics (NCHS).
National Center for Health Statistics (NCHS) Despite the wide dissemination of its data through publications, CD-ROMs, etc., the inability to release files with, for instance, lower levels of geography, severely limits the utility of some data for research, policy, and programmatic purposes and sets a boundary on one of the Center’s goals to increase its capacity to provide state and local area estimates.
NCHS (cont.) In pursuit of this goal and in response to the research community’s interest in restricted data, NCHS established the Research Data Center (RDC), a mechanism whereby researchers can access detailed data files in a secure environment, without jeopardizing the confidentiality of the respondents.
Research Data Center The NCHS Research Data Center, established in 1998, is a facility at the NCHS headquarters in Hyattsville, Maryland, where researchers are granted access to restricted data files needed to complete approved projects. Restricted data files may contain information, such as lower levels of geography, but do not contain direct identifiers (e.g., name or social security number).
Data Restrictions Section 308 (d) of the Public Health Service Act and the NCHS Staff Confidentiality Manual do not permit the release of data that are either identified or identifiable to persons outside of NCHS.
Data Restrictions (cont.) Identifiable data include not only direct identifiers such as name, social security number, etc., but also data that can serve to allow inferential identification of either individual or institutional respondents by a number of means.
Data Restrictions (cont.) Research indicates that identifiability is greatly enhanced if geographic identifiers for state, county, census tract, block-group or block are released on public use files.
Key Issues for Research Data Availability CONFIDENTIALITY The dissemination of data in a manner that would allow public identification of the respondent or would in any way be harmful to him/her is prohibited and the data are immune from legal process.
Key Issues for Research Data Availability (cont.) DISCLOSURE Disclosure relates to inappropriate attribution of information to a data subject, whether an individual or an organization. Disclosure occurs when a data subject is identified from a released file (identity disclosure), sensitive information about a data subject is revealed through the released file (attribute disclosure), or the released data make it possible to determine the value of some characteristic of an individual more accurately than otherwise would have been possible (inferential disclosure).
Appendix I – Rules for the Release of Micro Data Files • The data file must not contain any detailed information about the subject that could facilitate identification and that is not essential for research purposes (e.g., exact date of the subject’s birth). • Geographic places that have fewer than 100,000 people are not to be identified on the data file. • Characteristics of an area are not to appear on the data file if they would uniquely identify an area of less than 100,000 people.
Appendix I – Rules for the Release of Micro Data Files (cont.) • Information on the drawing of the sample which might assist in identifying a data subject must not be released outside the Center. Thus, the identities of primary sampling units are not to be made available outside the Center. • Before any new or revised micro data files are published, they, together with their full documentation, must be approved for publication by the NCHS Director or Deputy Director. • A micro data file containing confidential data on unidentified individuals or facilities may not be released to any person or organization outside NCHS until that person, or a responsible representative of that organization, has first signed the statement on the Order Form, whereby he gives assurance that the data provided will be used only for statistical reporting or research purposes.
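The geographic rule in Appendix I (places with fewer than 100,000 people are not to be identified) can be illustrated with a short sketch. This is not NCHS's actual procedure: the function name, the `place` field, the population lookup, and the choice to treat places of unknown population as small are all assumptions made for illustration.

```python
POPULATION_THRESHOLD = 100_000  # threshold stated in the Appendix I rule


def mask_small_places(records, populations, place_key="place", masked_value=None):
    """Return copies of the records with the place identifier blanked
    wherever the place has fewer than POPULATION_THRESHOLD people.

    `records` is a list of dicts; `populations` maps place name -> population.
    Both structures are hypothetical, chosen only for this sketch.
    """
    masked = []
    for record in records:
        out = dict(record)  # copy so the input file is left untouched
        place = out.get(place_key)
        # Assumption: a place with unknown population is treated as small.
        if populations.get(place, 0) < POPULATION_THRESHOLD:
            out[place_key] = masked_value
        masked.append(out)
    return masked


# Made-up example data.
records = [
    {"id": 1, "place": "Alpha County", "age_group": "40-49"},
    {"id": 2, "place": "Beta Town", "age_group": "20-29"},
]
populations = {"Alpha County": 250_000, "Beta Town": 12_000}
public = mask_small_places(records, populations)
```

Here the record for the hypothetical "Beta Town" loses its place name while "Alpha County" keeps it. The companion rule about area characteristics that uniquely identify a small area would need a separate check and is not covered by this sketch.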
Why NCHS Does Not Release Files With Lower Levels of Geography Research suggests that in the case of personal surveys nine commonly collected variables result in the table below.
Why NCHS Does Not Release Files With Lower Levels of Geography (cont.) Notes: A geopolitical area may be a county, city, town, or other place with well-defined boundaries. In this case, identification refers to certainty identification.
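One way to see why a handful of commonly collected variables can identify respondents in a small area is to count how many records are unique on that combination of variables. The sketch below is illustrative only: the slides do not list the nine variables, so the variable names and data are hypothetical, and uniqueness within a sample is only a rough stand-in for certainty identification in a population.

```python
from collections import Counter


def unique_fraction(records, key_vars):
    """Fraction of records whose combination of key_vars occurs exactly once."""
    if not records:
        return 0.0
    combos = Counter(tuple(r[v] for v in key_vars) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[v] for v in key_vars)] == 1)
    return unique / len(records)


# Made-up records for one small geopolitical area (hypothetical variables).
area = [
    {"age": 34, "sex": "F", "race": "A", "marital": "M"},
    {"age": 34, "sex": "F", "race": "A", "marital": "M"},
    {"age": 71, "sex": "M", "race": "B", "marital": "W"},
    {"age": 52, "sex": "F", "race": "B", "marital": "D"},
]
share_all = unique_fraction(area, ["age", "sex", "race", "marital"])
share_sex = unique_fraction(area, ["sex"])
```

Adding variables, or narrowing the geographic area so that fewer people share each combination, pushes the unique share upward, which is the concern behind withholding lower levels of geography.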
How Does RDC Operate? • On-Site Access • Remote Access • Staff Assisted Analytical Session
User Procedures To gain access to NCHS restricted data through either method, the user must: • Submit a research proposal. • An advisory and proposal review committee receives, reviews, and approves researcher proposals. • Proposals are evaluated primarily on the confidentiality disclosure risk. • Scientific merit is not an evaluation criterion. • Sign an affidavit of confidentiality and promise not to use any method to attempt to identify respondents.
User Procedures (cont.) • Not take any materials or equipment into the RDC unless approved by RDC staff. • Submit data files to be merged onto NCHS data ahead of time – all merging is done by RDC staff. • Subject all output and/or materials removed from the RDC to a disclosure limitation review. • Not remove any NCHS restricted data files or linked data files.
Researcher Affidavit of Confidentiality I certify that no confidential data or information viewed or otherwise obtained while I am a researcher in the National Center for Health Statistics (NCHS), Research Data Center (RDC) will be removed from NCHS. Further, I understand that NCHS will perform a disclosure review and must provide approval to me before I remove any data from the RDC, whether it be in electronic or paper form. I acknowledge NCHS Confidentiality Statute, 308(d) of the Public Health Service Act stated below and fully understand my legal obligations to NCHS to protect all confidential data. Further I understand any violation I may perform is punishable under 18 United States Code (USC), 1001 which carries a fine of up to $10,000 or up to 5 years in prison.
Researcher Affidavit of Confidentiality (cont.) NCHS 308(d) Confidentiality Statute - No information, if an establishment or person supplying the information or described in it is identified, obtained in the course of activities undertaken or supported under section 304, 305, 306, 307, or 309 may be used for any purpose other than the purpose for which it was supplied unless such establishment or person has consented to its use for such other purpose and in the case of information obtained in the course of health statistical or epidemiological activities under section 304 or 306, such information may not be published or released in other form if the particular establishment or person supplying the information or described in it is identifiable unless such establishment or person has consented to its publication or release in other form.
Researcher Affidavit of Confidentiality (cont.) 18 United States Code, 1001 - Deliberately making a false statement in any matter within the jurisdiction of any Department or Agency of the Federal Government violates 18 USC 1001 and is punishable by a fine of up to $10,000 or up to 5 years in prison. ____________________ _______________ Researcher’s Signature Date ____________________ _______________ NCHS Witness Date
Can a Researcher Merge His/Her Data with NCHS Data? • Researchers must interact with RDC staff to ensure that their data can be merged with the NCHS data. • User-supplied data will be merged with NCHS data by RDC staff only. • The NCHS RDC policy states that merged and user-supplied data will not be made available for analysis to anyone without the written consent of the user.
The Cost per Project On Site $200 per day (2 day minimum) Remote Access • NSFG-CDF = $500/ year • NHIS-polio = $500/ year • NHIS Linked Mort. File = $250/Month • NHANES Linked Mort. File = $250/Month
The Cost per Project (cont.) • Files <= 130k records = $500 per month • Files > 130k records = $1000 per month Staff Assisted Variable File Construction and Setup For Mortality Files = $250 per day For all Other Files = $500 per day
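The fee schedule above lends itself to a small calculator. The rates are taken from the slides; the function names are hypothetical, and two readings are assumptions: that the "2 day minimum" means a shorter visit is billed as two days, and that the record-count tiers apply to remote access.

```python
def on_site_cost(days):
    """On-site access: $200 per day, with a 2-day minimum (assumed to be billed)."""
    billed_days = max(days, 2)
    return 200 * billed_days


def remote_cost_by_size(records, months):
    """Monthly fee by file size: $500 up to 130k records, $1000 above."""
    rate = 500 if records <= 130_000 else 1000
    return rate * months


def staff_setup_cost(days, mortality_file):
    """Staff-assisted file construction and setup: $250/day for mortality
    files, $500/day for all other files."""
    return (250 if mortality_file else 500) * days


# Example: a 3-day on-site visit plus 2 days of setup on a non-mortality file.
total = on_site_cost(3) + staff_setup_cost(2, mortality_file=False)
```

The flat annual and monthly fees listed for specific files (NSFG-CDF, NHIS-polio, and the linked mortality files) are not modeled here.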
Do Doctors Perform “Defensive Cesareans”? Overview: This project re-examined the effects of “defensive medicine” and of state reforms designed to limit malpractice risk on the use of cesarean section delivery. NCHS Data Used: National Hospital Discharge Survey (NHDS) Years of Data Used: 1980 through 1992, inclusive. User’s Data Merged with NCHS? Yes Method of Access to NCHS Data: Remote and On-site Access Statistical Software Used: SAS
Economic Model to Explain the Incidence of Sexual Activity, Contraceptive Use, STD, and Pregnancy Among Teenage Girls. Overview: National Survey of Family Growth data provide extensive socio-demographic information and reports of the sexual histories of these women. The researcher focused on the effects of a number of policies measured at the state level. These included: • Parental notification or consent laws. • Medicaid funding of abortions. • Welfare generosity. NCHS Data Used: National Survey of Family Growth (NSFG) User’s Data Merged with NCHS? Yes Method of Access to NCHS Data: Remote Access Statistical Software Used: SAS
Nursing Home Admission and Payment Source? Overview: This project tested whether patients with Medicare were being discriminated against because their reimbursement rate was significantly below the private pay rate for nursing homes. NCHS Data Used: National Nursing Home Survey (NNHS) Years of Data Used: 1985, 1995, and 1997 User’s Data Merged with NCHS? No Method of Access to NCHS Data: Remote Access Statistical Software Used: SAS
Hardware and Software All RDC hardware and software are standard. Hardware: Pentium IV computers with Windows 2000. Software: SAS (only language on ANDRE), Sudaan, Fortran, HLM, Stata, Limdep, text editors/viewers. • Onsite workstations do NOT have email or internet access • Only access to printer is through RDC staff
U.S. DEPARTMENT OF HEALTH AND HUMAN SERVICES Centers for Disease Control and Prevention National Center for Health Statistics Record Linkage for Epidemiologic Research: Accessing Linked data at the NCHS Research Data Center Christine S. Cox NCHS Data Users Conference July 12, 2006
What is Record Linkage? NCHS Surveys + Administrative records → Linked Data File
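The idea on this slide, survey records combined with administrative records to form a linked file, can be sketched as an exact-match join. The slides do not describe NCHS's matching methodology, so this is only a minimal deterministic example; the `link_id` key and all field names are hypothetical.

```python
def link_records(survey, admin, key="link_id"):
    """Attach administrative fields to each survey record that has an
    exact match on `key`; unmatched survey records are kept and flagged."""
    admin_by_key = {row[key]: row for row in admin}
    linked = []
    for row in survey:
        out = dict(row)  # copy so the survey input is not modified
        match = admin_by_key.get(row[key])
        out["matched"] = match is not None
        if match is not None:
            for field, value in match.items():
                if field != key:
                    out[field] = value
        linked.append(out)
    return linked


# Made-up example: two survey respondents, one with an administrative record.
survey = [{"link_id": "A1", "bmi": 27.4}, {"link_id": "B2", "bmi": 22.1}]
admin = [{"link_id": "A1", "vital_status": "deceased"}]
linked = link_records(survey, admin)
```

Keeping unmatched respondents with a `matched` flag, rather than dropping them, preserves the survey sample so that eligibility and match status can be analyzed alongside the linked fields.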
NCHS Linked Data: Major Activities • Mortality: National Death Index • Health Care Utilization and Costs: Medicare Data • Retirement and Disability: Social Security Data
NCHS Linked Data: Mortality • Eligibility status • Assigned vital status • Date of death • Age at death • Underlying and multiple causes of death • Adjusted sample weights
Research Potential of Linked Mortality Data • The Income-Associated Burden of Disease in the United States. P Muennig, P Franks, H Jia, E Lubetkin and MR Gold. • Excess Deaths Associated with Underweight, Overweight, and Obesity. KM Flegal, BI Graubard, DF Williamson, MH Gail. JAMA. 2005;293:1861-1867. • Living and Dying in the USA: Behavioral, Health, and Social Differentials of Adult Mortality. RG Rogers, CB Nam, RA Hummer. • A Semiparametric Analysis of the Body Mass Index’s Relationship to Mortality. JT Gronniger.
NCHS Linked Data: Medicare • Medicare entitlement and health care utilization and payment data for 1991-2000 • Denominator file • MEDPAR Inpatient hospitalization • MEDPAR Skilled nursing facility • Hospital outpatient • Home Health Care • Hospice • Carrier (physician/supplier Part B file) • Durable Medical Equipment
Research Potential of Linked Medicare Data • Examine risk factors for health conditions • Examine reliability of survey data • Examine survey report of disability with program participation eligibility criteria • Compare survey reported health conditions to claims records • Examine disparities in Medicare service utilization
NCHS Linked Data: Retirement/Disability • Social Security data from Retirement, Survivors, and Disability Insurance (RSDI) and Supplemental Security Income (SSI) programs • Master Beneficiary Record (MBR), 1962-2003 • Payment History Update System (PHUS), 1984-2003 • Supplemental Security Record (SSR), 1974-2003
Research Potential of Linked Social Security Data • Examine reliability of survey information for SSA program participation and benefits • Compare the health characteristics of those who take early (age 62) Social Security benefits to those who postpone benefits • Policy analysis using validated survey data • Predicting the number of people who will become disabled based upon survey reported health conditions • Determining whether current disability entitlement funding levels will be adequate as the population ages
Summary NCHS Data Linkage (columns: Mortality (NDI), Medicare (CMS), Retirement & Disability (SSA)) • NHIS 1986-2000: X • NHIS 1994-1998: X X X • LSOA II: X X X • NHANES I: X X X • NHANES II: X X • NHANES III: X X X • NNHS 1985: X X
www.cdc.gov/nchs/r&d/nchs_datalinkage/data_linkage_activities.htm
Why can’t you just give me the data? • NCHS does not “own” the linked administrative data • NCHS data confidentiality rules prohibit the release of potentially identifiable data – special considerations apply to the protection of linked data • The RDC is the only option for access for now.
Overview: Data Access Procedures • Proposal Requirements • Access Methods • Helpful Tips • Where to get help?
Proposal Requirements • Proposal is evaluated by review committee • Review criteria • Scientific and technical feasibility • Availability of RDC resources • Disclosure risk for restricted information • The extent to which project is in accordance with the mission of NCHS • Special note: NCHS does not try to determine if proposals are duplicative
Proposal Requirements • Cover letter • Project title • Abstract (maximum 300 words summarizing project) • Full contact information • Institutional affiliation • Mail address, phone, email • Dates of proposed time at RDC (or indication of using remote access) • Source of funding for proposed research
Proposal Requirements • Study background • Key study questions or hypotheses • Public health benefits • Methods • Analytic approach and statistical methods • Statistical software requirements • Description of intended output for nondisclosure review, e.g. • Table shells • Model equations • Test statistics that researcher plans to remove from RDC
Proposal Requirements • Explanation of why restricted data are needed, e.g. describe why publicly available data are insufficient • Summary of data requirements to be included in analytic file • Identification of sample • Identification of variables • Description of additional data to be supplied by researcher to be merged with NCHS or other data source (must clearly identify source of other data)
Proposal Requirements: Appendices • Current Curriculum Vitae or resume for each investigator • Data dictionary – complete listing of the specific data requested and their source(s), indicating whether each variable is public use or restricted access • specific files and years • sample • variables (dependent, independent, matching/linking)
Proposal Requirements: Appendices • For remote-access applicants • Description of the computer and email system to be used to receive output • Security provisions for the computer and email systems • For students • Letter from department chair or academic advisor stating that student is working under the direction of the department
Overview: RDC Data Access Procedures • Proposal Requirements • Access Methods • Helpful Tips • Where to get help?
Access Methods • Once approved, three methods to access restricted data • on-site - use local computing resources in the NCHS RDC, Hyattsville, MD • remote – submit programs electronically to be executed in the RDC with output returned by email • staff assisted – RDC staff provide on-site programming for off-site approved researchers • For all methods of access, restricted data files remain in RDC and output is inspected for disclosure violations
On-Site Access • RDC staff constructs necessary data files, including merged user data • Most statistical packages available with sufficient lead time • Output subject to disclosure review • Open only during normal working hours
Remote Access Method • RDC staff constructs necessary data files, including merged user data • SAS programs only (certain procedures and functions not allowed) – additional software options expected • Both submitted programs and output undergo a programmed disclosure limitation review
RDC Staff-assisted Programming Method • Subcontract with the RDC staff to perform programming tasks • Useful for those planning to use statistical software not available for the remote system and who are not able to travel to the RDC facility • Cost is estimated for each research project