Privacy Analytics • For organizations that want to safeguard and enable data for secondary purposes … • Automates the masking and de-identification of data using a risk-based approach to anonymization • Integrated capabilities to anonymize structured and unstructured data from multiple sources • Peer-reviewed methodologies and value-added services that certify data as de-identified
Our Product and Services • Anonymize unstructured data in text and XML documents • Automate the measurement of re-identification risk and anonymize data • Develop internal expertise around managing re-identification risks • Audit re-identification risk using threat models and scenarios
How PARAT CORE Works • Measure Risk • Manage Releases • Select Data • Anonymize Data
HIPAA De-identification Methods Our software automates statistical de-identification methods in an integrated way to 1) ensure that we safeguard our customers’ data; and 2) maximize its analytic utility
How We Enable Analytic Utility External Structured Data Internal Structured Data Structured & Unstructured Data • MRN: 589 • rwong@ • Robert Wong • Born Jan 29, 1978 • Zip code: 12346 • Chris Wright • Born Jan 15, 1978 • Zip code: 12345 • MRN: 123 • cwright@ Primary Structured and Unstructured Data • Income = $82,000 • Plan # 54678 Safe Harbor Method (Data Masking) • Income = $82,000 • Plan # 54678 • EMR data and notes at last PCP visit: • Admission date: 08/15/2012 • Discharge date: 08/17/2012 • Chris Wright • Born Jan 15, 1978 • Zip code: 12345 • MRN: 123 • cwright@ • MRN: 123 • cwright@ • Chris Wright • Born Jan 15, 1978 • Zip code: 12345 Expert Determination (Statistical De-identification and Data Masking) • Income = $82,000 • Plan # 54678 • Income = $82,000 • Plan # 65123 • EMR data and notes at last PCP visit: • Admission date: 08/18/2012 • Discharge date: 08/20/2012 • EMR data and notes at last PCP visit: • Admission date: 08/15/2012 • Discharge date: 08/17/2012 • EMR data and notes at last PCP visit: • Admission date: 08/15/2012 • Discharge date: 08/17/2012
Our Approach: De-identification Taking into Account the Risk of Disclosure • Set Risk Threshold: Based on the characteristics of the data recipient, the data, and precedents, a quantitative risk threshold is set. The mitigating controls in place can be strengthened to get a more forgiving threshold. • Measure Risk: Based on plausible attacks, appropriate metrics are selected and used to measure actual re-identification risk from the data. • Apply Transformations: If the measured risk does not meet the threshold, specific transformations (such as generalization and suppression) are applied to reduce the risk. • This is an iterative process.
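The iterative process on this slide can be sketched in a few lines. This is a minimal illustration under stated assumptions, not PARAT CORE's algorithm: the sample records, the threshold of 1/3, the risk metric (the largest value of 1 divided by equivalence-class size), and the ZIP-truncation generalization are all hypothetical choices made for the example.

```python
from collections import Counter

def max_risk(records, quasi_ids):
    """Highest per-record risk, taken here as 1 / (equivalence class size)."""
    keys = [tuple(r[q] for q in quasi_ids) for r in records]
    sizes = Counter(keys)
    return max(1 / sizes[k] for k in keys)

def generalize_zip(records, keep_digits):
    """Generalization step: keep the leading ZIP digits and mask the rest."""
    return [
        {**r, "zip": r["zip"][:keep_digits] + "*" * (len(r["zip"]) - keep_digits)}
        for r in records
    ]

def deidentify(records, quasi_ids, threshold):
    """Measure risk, then generalize further while it exceeds the threshold."""
    zip_len = len(records[0]["zip"])
    for keep in range(zip_len, -1, -1):
        candidate = generalize_zip(records, keep)
        risk = max_risk(candidate, quasi_ids)
        if risk <= threshold:
            break
    # If the loop ran out, the threshold was not reachable with this
    # transformation alone and the most generalized version is returned.
    return candidate, risk

# Hypothetical records; birth_year and zip are treated as quasi-identifiers.
records = [
    {"birth_year": 1978, "zip": "12345"},
    {"birth_year": 1978, "zip": "12346"},
    {"birth_year": 1978, "zip": "12399"},
    {"birth_year": 1980, "zip": "54321"},
    {"birth_year": 1980, "zip": "54388"},
    {"birth_year": 1980, "zip": "54377"},
]
QUASI_IDS = ["birth_year", "zip"]

risk_before = max_risk(records, QUASI_IDS)
released, risk_after = deidentify(records, QUASI_IDS, threshold=1 / 3)
```

A stricter threshold forces more generalization, while a more forgiving one (for example, when stronger mitigating controls are in place) lets the data be released with less distortion. Suppression of outlier records is left out of this sketch.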
Re-identification Risk: Example 3 Two quasi-identifiers matching in three cells within a dataset
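One way to read this example: when three records share the same values on two quasi-identifiers, someone who knows those two values for a target can narrow the match to three candidates but no further. A minimal sketch, assuming risk is measured as 1 divided by the number of matching records (a common convention that the slide itself does not state), with hypothetical field names and data:

```python
from collections import Counter

# Hypothetical data: birth_year and zip3 are the two quasi-identifiers.
records = [
    {"id": "A", "birth_year": 1978, "zip3": "123"},
    {"id": "B", "birth_year": 1978, "zip3": "123"},
    {"id": "C", "birth_year": 1978, "zip3": "123"},
    {"id": "D", "birth_year": 1965, "zip3": "987"},
]
QUASI_IDS = ("birth_year", "zip3")

def qi_key(row, quasi_ids):
    """The combination of quasi-identifier values for one record."""
    return tuple(row[q] for q in quasi_ids)

def per_record_risk(rows, quasi_ids):
    """Map each record id to 1 / (number of records sharing its key)."""
    sizes = Counter(qi_key(r, quasi_ids) for r in rows)
    return {r["id"]: 1 / sizes[qi_key(r, quasi_ids)] for r in rows}

risks = per_record_risk(records, QUASI_IDS)
```

Under this assumption, records A, B and C each carry a risk of one in three, while record D is unique on its quasi-identifiers and carries the maximum risk.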
Identifiability Spectrum A range of operational precedents exists based on the situational context of the data’s use and available mitigating controls that protect it. (Figure: a spectrum running from Little De-identification to Significant De-identification, with labels including Highly Secure and Trusted Recipients and Public Data Sets; scale values shown: 2, 3, 5, 8, 10, 11, 16, 20.)
Identifiability Spectrum Leading research organizations apply these precedents to data release for secondary purposes. We’ve embedded these precedents into PARAT CORE. (Figure: a spectrum running from Little De-identification to Significant De-identification, with labels including Highly Secure and Trusted Recipients and Public Data Sets; scale values shown: 2, 3, 5, 8, 10, 11, 16, 20.)
Why PARAT CORE? A scalable set of capabilities that enables the release of anonymized data for analysis, while safeguarding personal information to: • Automate • Audit • Analyze
Why Privacy Analytics? Half of Fortune 50 healthcare companies have engaged Privacy Analytics, and it’s because of our software, methodology, and expertise: • Software and professional services delivered to more than 100 customers • Serves complex, large heterogeneous and homogeneous data environments • Support for large structured and unstructured data sets • Defensible and auditable approaches to meet regulatory obligations for Canada and the U.S. • Methodology, approach, and algorithms peer reviewed in leading academic publications • Research into risk and statistical de-identification since 2004 • Recognized by Privacy by Design as an Ambassador
Thank You Contact name: Title: Phone: Email:
EMR Software Vendor: Post-marketing and Public Health Surveillance • Wanted to anonymize data on 550,000 patients from general practices • Longitudinal data needed to be used for ongoing and on-demand analytics Challenges: • Significant size of the data set, which held more than five years of clinical, prescription, laboratory, scheduling and billing data of patients • Numerous release requests from more than 2500 clinics and 5000 physicians Analytic Outcomes: De-identified data to analyze: • Post-marketing surveillance of adverse events • Public health surveillance • Prescription pattern analysis • Health services analysis
Clinical Data: Sharing Cancer Data for Health Services Research • De-identified and linked clinical cancer data with administrative data • For the last few years this was the only mechanism to release microdata Challenges: • Highly sensitive data on individual interactions with the health system • Multiple data sources of individual health information Analytic Outcomes: • Reduced Ethics Review Board approval time for data release from many months to two weeks • Made linked cancer data available for health services research • Provided richer levels of individual health information by linking multiple different data sets
Public Policy Data Registry: Research on Public Health • Large linked registry available to researchers and analysts in Canada and abroad • Data sharing needed to meet rigorous requirements of a prescribed registry Challenges: • Highly sensitive data on mother and child interactions with the health system • Required a defensible process to release high-quality individual-level data Analytic Outcomes: • Faster release of data for analysis to researchers and public health with auditable, automated data sharing agreements • Deeper, richer data sets from which to make public policy decisions • Streamlined interactions with ethics review
National Institutes of Health: Accelerate Research Using Unstructured Data • Wants to anonymize unstructured text data from more than 400,000 patients • Seeks to augment currently available data in de-identified format Challenges: • Large volume of free-form text data on thousands of patients that was difficult to analyze because it could not be shared • Limits utility of the clinical data Analytic Outcomes: De-identified data will allow researchers to: • Test hypotheses for new research • Confirm potential sample sizes for proposed research • Find collaborators for cross-disciplinary research studies
Open Clinical Data Competition: Inspiring Innovation through Competition • 6.7M claims from the State of Louisiana; anyone competing would have access • De-identified data that is realistic provides a compelling framework for innovation Challenges: • Anonymize a claims database of 200k patients for a competition aimed at improving healthcare • The data needed to look real, with the same data formats used before anonymization Analytic Outcomes: • Enabled researchers to explore new analytic approaches for a large data set • Established robust anonymization practices, mitigating controls, and standard practices to ensure data was used properly
Customer’s Data Landscape (Figure: customers positioned by data type, Homogeneous to Heterogeneous, against data set size, Mid-sized to Large; the State of Louisiana is the one labeled example.)
Balancing Privacy with Data Utility: The Analytic Benefits of our Approach • 1. Data Quality: Ensuring de-identified data has analytic usefulness by minimizing the amount of distortion while still ensuring that re-identification risk is very small • 2. Analytic Granularity: Allowing users to configure the extent of de-identification to match the characteristics of the analysis that is anticipated • 3. Depth of Insight: Enabling analysis of the total patient health experience, to compile a complete picture of this experience from multiple data sources and types
How We Enable Analytic Utility (Before) External Structured Data Internal Structured Data Structured & Unstructured Data • Chris Wright • Born Jan 15, 1978 • Zip code: 12345 • MRN: 123 • cwright@ Primary Structured and Unstructured Data • Income = $82,000 • Plan # 54678 Safe Harbor Method (Data Masking) • Income = $82,000 • Plan # 54678 • EMR data and notes at last PCP visit: • Admission date: 08/15/2012 • Discharge date: 08/17/2012 • Chris Wright • Born Jan 15, 1978 • Zip code: 12345 • MRN: 123 • cwright@ • MRN: 123 • cwright@ • Chris Wright • Born Jan 15, 1978 • Zip code: 12345 Expert Determination (Statistical De-identification and Data Masking) • Income = $82,000 • Plan # 54678 • EMR data and notes at last PCP visit: • Admission date: 08/15/2012 • Discharge date: 08/17/2012 • EMR data and notes at last PCP visit: • Admission date: 08/15/2012 • Discharge date: 08/17/2012
How We Enable Analytic Utility (After) External Structured Data Internal Structured Data Structured & Unstructured Data • MRN: 589 • rwong@ • Robert Wong • Born Jan 29, 1978 • Zip code: 12346 • Chris Wright • Born Jan 15, 1978 • Zip code: 12345 • MRN: 123 • cwright@ Primary Structured and Unstructured Data • Income = $82,000 • Plan # 54678 Safe Harbor Method (Data Masking) • Income = $82,000 • Plan # 54678 • EMR data and notes at last PCP visit: • Admission date: 08/15/2012 • Discharge date: 08/17/2012 • Chris Wright • Born Jan 15, 1978 • Zip code: 12345 • MRN: 123 • cwright@ Expert Determination (Statistical De-identification and Data Masking) • EMR data and notes at last PCP visit: • Admission date: 08/18/2012 • Discharge date: 08/20/2012 • Income = $82,000 • Plan # 65123 • EMR data and notes at last PCP visit: • Admission date: 08/15/2012 • Discharge date: 08/17/2012
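In the before and after slides, the admission and discharge dates both move by the same three days (08/15 to 08/18 and 08/17 to 08/20), so the length of stay is preserved. A minimal sketch of one way to get that behaviour, assuming a per-patient offset derived from a keyed hash; the key, the offset range, the patient identifier format, and the function names are hypothetical and are not taken from the product.

```python
import hashlib
import hmac
from datetime import date, timedelta

SECRET_KEY = b"example-key"  # hypothetical; a real key would be stored securely
MAX_SHIFT_DAYS = 30          # assumed range: offsets fall in [-30, +30] days

def patient_offset(patient_id: str) -> timedelta:
    """Derive a stable per-patient day offset from a keyed hash of the id."""
    digest = hmac.new(SECRET_KEY, patient_id.encode(), hashlib.sha256).digest()
    span = 2 * MAX_SHIFT_DAYS + 1
    days = int.from_bytes(digest[:4], "big") % span - MAX_SHIFT_DAYS
    return timedelta(days=days)

def shift_dates(patient_id: str, dates: list[date]) -> list[date]:
    """Shift every date for one patient by the same offset."""
    offset = patient_offset(patient_id)
    return [d + offset for d in dates]

admission, discharge = date(2012, 8, 15), date(2012, 8, 17)
new_admission, new_discharge = shift_dates("MRN-123", [admission, discharge])
```

Because every date for a patient moves by the same amount, intervals such as length of stay remain usable for analysis even though the calendar dates no longer match the originals.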