Anonymity through Data cubes Athos Antoniades
Introduction • Why Share Data? • What are the current legal and ethical limitations? • How have scientists shared medical data so far? • Key Problems • Perturbation • Cell Suppression
The Problem • Why share data: • Replication Testing • Statistical Power • Multiple Testing Problem • Legal and Ethical Issues • Anonymization vs Pseudonymization • Limitations derived from the consent form signed by subjects • Other regional, study, or subject-specific issues
How have scientists shared medical data? Contingency Table and Data Cube example
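The slide's example is shown as an image; the idea can be sketched with pandas, using invented patient records (all values and variable names below are hypothetical, not from the actual study):

```python
import pandas as pd

# Hypothetical patient-level records (illustrative values only)
records = pd.DataFrame({
    "Gender":   ["F", "M", "F", "F", "M", "M", "F", "M"],
    "Smoker":   ["Y", "N", "Y", "N", "N", "Y", "N", "Y"],
    "Diabetes": ["Y", "N", "N", "Y", "N", "Y", "N", "N"],
})

# A 2-D contingency table: counts per (Gender, Smoker) cell
table = pd.crosstab(records["Gender"], records["Smoker"])

# A 3-D "data cube": counts over all three variables at once
cube = records.groupby(["Gender", "Smoker", "Diabetes"]).size()
```

Releasing only such aggregated counts, rather than the row-level records, is what makes data cubes attractive for sharing medical data.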
The 16-Year-Old Widow Problem • A paper that analyzes data from a specific study reports:
Categorization Differences • Paper 1 that analyzes data from a specific study reports: • Paper 2 that analyzes data from the same study reports:
Perturbation and Cell Suppression • Perturbation (±1) and Cell Suppression (<5) applied to the Original Data
Evaluation • Most common parameters tested • Perturbation: [0], [-1, 1], [-3, 3], [-5, 5], [-10, 10] • Cell Suppression: <0, <=1, <=3, <=5, <=10 • Standard main effect test using Chi-Square • Pearson's Correlation Coefficient used to evaluate the deviation of each parameter combination from the original results • A priori defined threshold for Pearson's correlation coefficient <= 0.95
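A minimal sketch of one such parameter combination, assuming a simple 2x2 contingency table with invented counts (the actual study tables are not reproduced here):

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)

# Hypothetical 2x2 contingency table (illustrative counts only)
original = np.array([[120, 45],
                     [60, 90]])

def perturb(table, k, rng):
    """Add uniform integer noise in [-k, k] to every cell (never below 0)."""
    noise = rng.integers(-k, k + 1, size=table.shape)
    return np.clip(table + noise, 0, None)

def suppress(table, threshold):
    """Zero out cells whose counts fall below the suppression threshold."""
    out = table.copy()
    out[out < threshold] = 0
    return out

# Chi-square main effect test on the original data
chi2_orig, p_orig, _, _ = chi2_contingency(original)

# One parameter combination: perturbation [-1, 1], cell suppression <5
released = suppress(perturb(original, 1, rng), 5)
chi2_rel, p_rel, _, _ = chi2_contingency(released)

# Repeating this over many tables yields paired (original, released) test
# statistics; Pearson's correlation between the two series then measures
# how much each parameter combination distorts the analysis results.
```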
Linked2Safety’s Data Analysis Space Objectives: • Design and develop the data mining techniques and the scalable infrastructure for identifying phenotypic and genetic associations related to adverse events • Develop new, and implement existing, state-of-the-art analytical approaches for genetic data • Define and implement the knowledge extraction and filtering mechanisms and the knowledge base • Integrate the knowledge base into a lightweight decision support system (adverse event early detection mechanism)
Quality Control Subspace Provides the tools for identifying and removing erroneous data or data that do not conform to the quality standards that a user might define. • Tools: • Hardy-Weinberg Equilibrium Test • Allele Frequency Test • Missing Data Test
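The first tool above, the Hardy-Weinberg Equilibrium test, can be sketched as a standard chi-square goodness-of-fit test on genotype counts (the genotype counts below are invented; the platform's actual implementation is not shown in the slides):

```python
from scipy.stats import chi2

def hwe_chi2(n_aa, n_ab, n_bb):
    """Chi-square test for Hardy-Weinberg equilibrium from genotype counts
    (AA, Aa, aa) at a single biallelic SNP."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # frequency of allele A
    q = 1 - p
    expected = [p * p * n, 2 * p * q * n, q * q * n]
    observed = [n_aa, n_ab, n_bb]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    pval = chi2.sf(stat, df=1)        # 1 degree of freedom
    return stat, pval

# Hypothetical genotype counts at one SNP
stat, pval = hwe_chi2(500, 300, 60)
```

SNPs with a very small p-value here are flagged as likely genotyping errors and removed during quality control.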
Feature Selection Subspace Provides the tools for removing redundant or irrelevant features from a dataset. Tools: • Rough Set Feature Selection • Information Gain Feature Selection • Chi Squared Feature Selection
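As an illustration of the third tool, chi-squared feature selection can be sketched with scikit-learn's `SelectKBest` (the dataset below is randomly generated and purely illustrative):

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

rng = np.random.default_rng(42)

# Hypothetical dataset: 100 samples, 6 categorical features coded 0/1/2
X = rng.integers(0, 3, size=(100, 6))
# Outcome depends mostly on feature 0 (plus noise), so feature 0 should rank highly
y = (X[:, 0] + rng.integers(0, 2, size=100) > 2).astype(int)

# Keep the 3 features with the highest chi-squared score against the outcome
selector = SelectKBest(chi2, k=3)
X_reduced = selector.fit_transform(X, y)
kept = selector.get_support(indices=True)
```

Dropping low-scoring features this way reduces redundancy before the more expensive association tests are run.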
Single Hypothesis Testing Subspace Provides the tools for performing single hypothesis testing on a dataset and test for associations. • Tools: • Pearson’s Chi Square Test • Fisher’s Exact Test • Odds Ratio • Binomial Logistic Regression • Linkage Disequilibrium • Genetic Region Based Association Testing
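Several of these tests operate on a 2x2 contingency table; a minimal sketch with scipy, using an invented exposure-vs-adverse-event table:

```python
from scipy.stats import fisher_exact, chi2_contingency

# Hypothetical 2x2 table: exposure (rows) vs adverse event (columns)
table = [[30, 70],   # exposed:   30 with event, 70 without
         [10, 90]]   # unexposed: 10 with event, 90 without

# Fisher's exact test also returns the sample odds ratio as its statistic
odds_ratio, p_fisher = fisher_exact(table)

# Pearson's chi-square test on the same table
chi2_stat, p_chi2, dof, expected = chi2_contingency(table)
```

Fisher's exact test is preferred when expected cell counts are small; the chi-square test is the standard choice for larger tables.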
Data Mining Subspace Provides the tools for performing data mining analyses on a dataset and extract association rules. • Tools: • Association Rules (apriori) • Decision Trees with Percentage Split (C4.5) • Decision Trees with Cross Validation (C4.5) • Random Forest with Percentage Split • Random Forest with Cross Validation
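The two random forest variants can be sketched with scikit-learn (whose trees are CART-based rather than C4.5, so this is an approximation; the dataset is randomly generated):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

rng = np.random.default_rng(0)

# Hypothetical encoded clinical dataset: 200 samples, 8 categorical features
X = rng.integers(0, 3, size=(200, 8)).astype(float)
y = (X[:, 0] > 1).astype(int)   # outcome driven by feature 0, for illustration

# Random Forest with Percentage Split: hold out 30% for testing
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
split_acc = clf.score(X_te, y_te)

# Random Forest with Cross Validation: mean accuracy over 10 folds
cv_acc = cross_val_score(RandomForestClassifier(random_state=0), X, y, cv=10).mean()
```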
Knowledge Extraction and Filtering Mechanism • Knowledge Extraction Mechanism • This mechanism is responsible for storing statistically significant associations and important association rules in the Linked2Safety knowledge database • Has two steps: • Logging system • Storing important knowledge • Filtering mechanism • This mechanism allows users to insert or delete associations and association rules
Adverse Event Early Detection Mechanism • Uses the knowledge in the L2S knowledge base • Runs in the background to identify new associations and association rules • Reruns analyses when updated datasets are available • Creates alerts for patient profiles associated with adverse events
Patterns Discovery Common Variable Selection • Overlapping non-genetic data of at least 2 data providers:
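The "at least 2 data providers" criterion can be sketched as a set intersection over each provider's variable list (provider names and variables below are invented placeholders, not the project's actual partners):

```python
from itertools import combinations

# Hypothetical variable lists from three data providers (names invented)
provider_vars = {
    "providerA": {"age", "gender", "bmi", "smoking", "diabetes"},
    "providerB": {"age", "gender", "bmi", "alcohol"},
    "providerC": {"age", "gender", "smoking", "hypertension"},
}

# A variable qualifies for cross-provider pattern discovery if it is
# collected by at least two of the providers
shared = set()
for a, b in combinations(provider_vars.values(), 2):
    shared |= a & b
```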
Conclusion and future work on utilizing data cubes • We were able to identify, for a given dataset, the maximum noise that can be added to the data without significantly affecting the outcomes. • The results presented are relevant only to MASTOS; for any other dataset, the analytical approach described must be repeated to determine the maximum noise that can be added. • Further investigation is necessary to identify the minimum parameter settings that satisfy legal and ethical requirements.
Who to Contact • Athos Antoniades • University of Cyprus • email: athos@cs.ucy.ac.cy