Data Perturbation An Inference Control Method for Database Security

Data PerturbationAn Inference Control Method for Database Security Dissertation Defense Bob Nielson Oct 23, 2009

I. Introduction • Most security concerns can be handled with the grant command. • Others require a view approach • But what happens if we wish to disclose partial information in a table field but not the individual records?

I. Introduction – The Problem • The problem is to allow for statistical analysis of data but still protecting individual records. • Example: Given a database of cancer patients. Allow for a researcher to know what the cancer rate is, but not that patient X has cancer.

I. Introduction – The Problem

II. Related Work • Suppression • Anonymization • Partitioning • Data Logging • Conceptual • Hybrid • Perturbation

II. Related Work-Suppression • Must access n records • Only n queries per day • There are known methods to get around these protections.

II. Related Work-Anonymization • Replace the identifying fields with special characters. • This method can still be compromised.

II. Related Work- Anonymization

II. Related Work-Partitioning • All queries must access more than one band of records.

II. Related Work-Partitioning

II. Related Work –Logging • A log of every query ran is kept. • Before a query is allowed all possible inferences are checked. If it releases one record, then that query is not permitted. • Soon there are no queries allowed.

II. Related work – Conceptual • Design the database so that no confidential information is stored.

II. Related Work –Hybrid • Try using a combination of several of these methods.

II. Related Work - Perturbation • Output Perturbation • Data Perturbation • Liew Perturbation • Nielson Perturbation • Note: Perturbation means data changing

II. Related Work –Output Perturbation • Output perturbation works by changing the output of the query not the physical data.

II. Related Work – Output Perturbation

II. Related Work –Data Perturbation • Data perturbation works by changing the physical data. • Two common methods: • To add a random value to each value • To multiple each value by a random value

II. Related Work – Data Perturbation

II. Related Work –Data Perturbation

II. Related Work –Liew Perturbation • Liew perturbation steps: • Calculate the average, standard deviation, and count of the data • Generate a new data set with the same average, standard deviation and count • Sort both data sets in ascending order • Swap the perturbed values with each other.

II. Related Work–Liew Perturbation

II. Related Work – Liew Perturbation

III Hypothesis and Proof • Prove: • H1: Nielson perturbation is better than No Perturbation • H2: Nielson perturbation is better than data perturbation (20%) • H3: Nielson perturbation is better than Liew perturbation (20%)

III Hypothesis and Proof • Disprove: • H1: Nielson perturbation is not better than No Perturbation • H2: Nielson perturbation is not better than data perturbation (20%) • H3: Nielson perturbation is not better than Liew perturbation (20%)

IV. Methodology • What is Nielson Perturbation? • Calculating the absolute error . . . • Finding optimal values for Nielson perturbation . . . • Experimental design . . . • Conducting the experiment . . .

IV. Methodology-Nielson Perturbation • Nielson Perturbation is a form of data perturbation. • Each value is multiplied by a random value between alpha and beta for the first gamma records in the data set. • This value is randomly negated.

IV Methodology- Nielson Perturbation

IV. Methodology - Nielson

IV. Methodology-Alpha/Beta/Gamma • What are the best values? • An evolutionary algorithm was deployed. • The results after several days of computation were: • Alpha = 2.09 • Beta = 1.18 • Gamma = 66.87

IV. Methodology- Evolutionary Results

IV. Methodology-Nielson Perturbation

IV. Methodology- The Method • Calculate the average error of each method. • Use the law of large numbers: An average of averages approaches a normal distribution as the sample size grows.

IV. Methodology- The Method • Use a t-test to calculate whether two sample means are statistically different from each other with a significance of 95%

IV. Methodology- Monte Carlo Simulation • Randomly generate 100,000 databases and execute 100’s of queries. • I will use arrays to test the accuracy. Speed is of major importance here. • Arrays vs. databases do not matter for calculating the accuracy of query outputs

IV. Methodology- Calculating the average error • The error should be bigger with smaller query sizes. • The error should be smaller with larger query sizes.

Methodology-The Fitness Function e=|x-x’| If q < n/2 fitness=100-e Else fitness=e Smaller fitness scores are better

V. Results and Conclusions

V. Results and ConclusionsSignificance • There is a real need for partial disclosure of a field in a table. • My method insures a higher degree of security. • My method still allows for release of averages and totals.

VI. Further Studies • Transformation Times • On the fly perturbing

Data Perturbation An Inference Control Method for Database Security

Data Perturbation An Inference Control Method for Database Security

Presentation Transcript

Database and Data Mining Security

Data Security, Data Administration and Database Administration

Private Inference Control

Information Theory and the Security of Binary Data Perturbation

Database Security and Data Protection

Perturbation method, lexicographic method

Inference for Categorical Data

Database Security for Privacy

Data Security and Cryptology, XIII Database Security. Newtwork Security

WISER method database - wiser.eu/results/method-database/

History matching under geological control The probability perturbation method

Statistical Inference for Frequency Data

Private Inference Control

Best Practices for Database Security and Data Governance

Database, data quality and construction method

An Effective Method to Control Interrupt Handler for Data Race Detection

Partial Radiative Perturbation Cloud Radiative Forcing Method

Database Security: An Introduction

An Effective Method to Control Interrupt Handler for Data Race Detection

Information Theory and the Security of Binary Data Perturbation