420 likes | 564 Views
Data Perturbation An Inference Control Method for Database Security. Dissertation Defense Bob Nielson Oct 23, 2009. I. Introduction. Most security concerns can be handled with the grant command. Others require a view approach
E N D
Data PerturbationAn Inference Control Method for Database Security Dissertation Defense Bob Nielson Oct 23, 2009
I. Introduction • Most security concerns can be handled with the grant command. • Others require a view approach • But what happens if we wish to disclose partial information in a table field but not the individual records?
I. Introduction – The Problem • The problem is to allow for statistical analysis of data but still protecting individual records. • Example: Given a database of cancer patients. Allow for a researcher to know what the cancer rate is, but not that patient X has cancer.
II. Related Work • Suppression • Anonymization • Partitioning • Data Logging • Conceptual • Hybrid • Perturbation
II. Related Work-Suppression • Must access n records • Only n queries per day • There are known methods to get around these protections.
II. Related Work-Anonymization • Replace the identifying fields with special characters. • This method can still be compromised.
II. Related Work-Partitioning • All queries must access more than one band of records.
II. Related Work –Logging • A log of every query ran is kept. • Before a query is allowed all possible inferences are checked. If it releases one record, then that query is not permitted. • Soon there are no queries allowed.
II. Related work – Conceptual • Design the database so that no confidential information is stored.
II. Related Work –Hybrid • Try using a combination of several of these methods.
II. Related Work - Perturbation • Output Perturbation • Data Perturbation • Liew Perturbation • Nielson Perturbation • Note: Perturbation means data changing
II. Related Work –Output Perturbation • Output perturbation works by changing the output of the query not the physical data.
II. Related Work –Data Perturbation • Data perturbation works by changing the physical data. • Two common methods: • To add a random value to each value • To multiple each value by a random value
II. Related Work –Liew Perturbation • Liew perturbation steps: • Calculate the average, standard deviation, and count of the data • Generate a new data set with the same average, standard deviation and count • Sort both data sets in ascending order • Swap the perturbed values with each other.
III Hypothesis and Proof • Prove: • H1: Nielson perturbation is better than No Perturbation • H2: Nielson perturbation is better than data perturbation (20%) • H3: Nielson perturbation is better than Liew perturbation (20%)
III Hypothesis and Proof • Disprove: • H1: Nielson perturbation is not better than No Perturbation • H2: Nielson perturbation is not better than data perturbation (20%) • H3: Nielson perturbation is not better than Liew perturbation (20%)
IV. Methodology • What is Nielson Perturbation? • Calculating the absolute error . . . • Finding optimal values for Nielson perturbation . . . • Experimental design . . . • Conducting the experiment . . .
IV. Methodology-Nielson Perturbation • Nielson Perturbation is a form of data perturbation. • Each value is multiplied by a random value between alpha and beta for the first gamma records in the data set. • This value is randomly negated.
IV. Methodology-Alpha/Beta/Gamma • What are the best values? • An evolutionary algorithm was deployed. • The results after several days of computation were: • Alpha = 2.09 • Beta = 1.18 • Gamma = 66.87
IV. Methodology- The Method • Calculate the average error of each method. • Use the law of large numbers: An average of averages approaches a normal distribution as the sample size grows.
IV. Methodology- The Method • Use a t-test to calculate whether two sample means are statistically different from each other with a significance of 95%
IV. Methodology- Monte Carlo Simulation • Randomly generate 100,000 databases and execute 100’s of queries. • I will use arrays to test the accuracy. Speed is of major importance here. • Arrays vs. databases do not matter for calculating the accuracy of query outputs
IV. Methodology- Calculating the average error • The error should be bigger with smaller query sizes. • The error should be smaller with larger query sizes.
Methodology-The Fitness Function e=|x-x’| If q < n/2 fitness=100-e Else fitness=e Smaller fitness scores are better
V. Results and ConclusionsSignificance • There is a real need for partial disclosure of a field in a table. • My method insures a higher degree of security. • My method still allows for release of averages and totals.
VI. Further Studies • Transformation Times • On the fly perturbing