OptRR: Optimizing Randomized Response Schemes For Privacy-Preserving Data Mining

Zhengli Huang and Wenliang (Kevin) Du, Department of EECS, Syracuse University

Data Mining/Analysis: Data cannot be published directly because of privacy concerns.

Background: Randomized Response
[Figure: a respondent whose true answer to “Do you smoke?” is “Yes” flips a biased coin. Head: answer “Yes”. Tail: answer “No”.]
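The coin-flip scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: it assumes the coin makes the respondent tell the truth with probability p and lie otherwise (Warner's model), and the names `randomized_response`, `estimate_true_rate`, the value p = 0.7, and the simulated 30% smoking rate are all hypothetical.

```python
import random

def randomized_response(truth: bool, p: float, rng: random.Random) -> bool:
    """Report the true answer with probability p, the opposite answer otherwise."""
    return truth if rng.random() < p else not truth

def estimate_true_rate(reports, p: float) -> float:
    """Estimate the true 'yes' rate from disguised reports (requires p != 0.5)."""
    observed = sum(reports) / len(reports)
    # P(report yes) = p * pi + (1 - p) * (1 - pi), solved for pi:
    return (observed - (1 - p)) / (2 * p - 1)

rng = random.Random(0)   # fixed seed so the sketch is reproducible
p = 0.7                  # assumed bias of the coin
true_answers = [rng.random() < 0.3 for _ in range(100_000)]  # simulated population
reports = [randomized_response(t, p, rng) for t in true_answers]
estimate = estimate_true_rate(reports, p)
print(f"estimated 'yes' rate: {estimate:.3f}")
```

No single report reveals the respondent's true answer with certainty, yet the aggregate rate can still be recovered approximately from many reports.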

RR for Categorical Data
[Figure: under the RR matrix M, a true value Si is reported as Si, Si+1, Si+2, or Si+3 with probabilities q1, q2, q3, and q4, respectively.]
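One way to read the categorical scheme is as a column-stochastic matrix M, where M[j][i] is the probability of reporting category j when the true category is i. The sketch below assumes the cyclic-shift structure suggested by the figure (true value Si reported as Si+k with probability q(k+1), indices wrapping around); the function names, the vector q, and the true distribution are illustrative assumptions. The true distribution is recovered by solving M x = observed frequencies.

```python
import random

def circulant_rr_matrix(q):
    """M[j][i] = P(report category j | true category i) = q[(j - i) mod n]."""
    n = len(q)
    return [[q[(j - i) % n] for i in range(n)] for j in range(n)]

def disguise(true_index, M, rng):
    """Sample a reported category from column `true_index` of M."""
    column = [row[true_index] for row in M]
    return rng.choices(range(len(M)), weights=column)[0]

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(A[r]) + [b[r]] for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [aug[r][n] / aug[r][r] for r in range(n)]

rng = random.Random(1)
q = [0.7, 0.1, 0.1, 0.1]            # hypothetical q1..q4
M = circulant_rr_matrix(q)
true_dist = [0.4, 0.3, 0.2, 0.1]    # hypothetical true distribution
true_values = rng.choices(range(4), weights=true_dist, k=50_000)
reports = [disguise(v, M, rng) for v in true_values]
observed = [reports.count(j) / len(reports) for j in range(4)]
estimated = solve(M, observed)      # estimate of the true distribution
print([round(x, 3) for x in estimated])
```

The closer M is to the identity, the more accurate the estimate but the less each report hides; that tension is what the search for an optimal matrix has to balance.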

A Generalization
• Several RR matrices have been proposed:
  • [Warner 65]
  • [R. Agrawal et al. 05], [S. Agrawal et al. 05]
• The RR matrix can be arbitrary.
• Can we find optimal RR matrices?

What is an optimal matrix?
• Which of the following is better?
[Figure: two candidate RR matrices, M1 and M2.]
• Privacy: M2 is better.
• Utility: M1 is better.
• So, what is an optimal matrix?

Optimal RR Matrix
• An RR matrix M is optimal if no other RR matrix is better than M in both privacy and utility (i.e., no other matrix dominates M).
• Privacy Quantification
• Utility Quantification
• A number of privacy and utility metrics have been proposed. We use the following:
  • Privacy: how accurately one can estimate an individual's information.
  • Utility: how accurately we can estimate aggregate information.
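The dominance test behind this definition is easy to state in code. The sketch below is only an illustration: it assumes each matrix has already been scored as a (privacy, utility) pair in which higher is better for both, it uses the standard Pareto convention (at least as good on every objective and strictly better on one), and the matrix names and scores are made up.

```python
def dominates(a, b):
    """True if a is at least as good as b on every objective and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def non_dominated(scores):
    """Names of the candidates that no other candidate dominates."""
    return sorted(
        name for name, s in scores.items()
        if not any(dominates(t, s) for t in scores.values())
    )

# Hypothetical (privacy, utility) scores, higher is better for both.
scores = {"M1": (0.2, 0.9), "M2": (0.8, 0.3), "M3": (0.1, 0.2)}
optimal = non_dominated(scores)
print(optimal)
```

Here M1 and M2 are both optimal in the sense defined above because neither beats the other on both objectives, while M3 is dominated by each of them.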

Optimization Methods
• Approach 1: Weighted sum: w1 Privacy + w2 Utility
• Approach 2:
  • Fix Privacy, find M with the optimal Utility.
  • Fix Utility, find M with the optimal Privacy.
  • Challenge: it is difficult to generate M with a fixed privacy or utility.
• Our Approach: Multi-Objective Optimization

Evolutionary Multi-Objective Optimization (EMOO)
• Genetic algorithms have difficulty dealing with multiple objectives.
• We use an EMOO algorithm.
  • We use SPEA2.

Our SPEA2-based algorithm

EMOO
• Evolution
  • Crossover
  • Mutation
• Fitness Assignment (SPEA2)
  • Strength value S(M): the number of matrices dominated by M.
  • Raw fitness F’(M): the sum of the strengths of the RR matrices that dominate M. The lower the better.
  • Density d(M): discriminates between matrices with the same raw fitness.
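The three fitness quantities can be sketched as follows. This is an illustrative reading of SPEA2's fitness assignment rather than the authors' implementation: it assumes each matrix is represented by a (privacy, utility) score pair with higher being better, and it uses the density form from the SPEA2 description, 1 / (distance to the k-th nearest neighbor + 2) with k = floor(sqrt(N)). The function name `spea2_fitness` and the sample points are hypothetical.

```python
import math

def dominates(a, b):
    """Pareto dominance for objectives where higher is better."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def spea2_fitness(points):
    """Return (strength, raw fitness, density) lists for at least two score tuples."""
    n = len(points)
    # S(M): how many other points M dominates.
    strength = [sum(dominates(p, other) for other in points) for p in points]
    # F'(M): sum of the strengths of the points that dominate M (0 = non-dominated).
    raw = [
        sum(strength[j] for j in range(n) if dominates(points[j], points[i]))
        for i in range(n)
    ]
    # d(M): based on the distance to the k-th nearest neighbor in objective space.
    k = max(1, int(math.sqrt(n)))
    density = []
    for i in range(n):
        dists = sorted(math.dist(points[i], points[j]) for j in range(n) if j != i)
        sigma_k = dists[min(k, len(dists)) - 1]
        density.append(1.0 / (sigma_k + 2.0))
    return strength, raw, density

# Hypothetical (privacy, utility) scores for five candidate matrices.
points = [(0.2, 0.9), (0.8, 0.3), (0.1, 0.2), (0.5, 0.5), (0.4, 0.4)]
strength, raw, density = spea2_fitness(points)
print(strength, raw)
```

Points with raw fitness 0 are the non-dominated ones; because the density term is always below 1, it only breaks ties among points whose raw fitness is equal and favors those in sparsely populated regions.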

Diversity
[Figure: matrices M1 to M5 plotted by Privacy and Utility, with the axes labeled from Worse to Better.]

The Output of Optimization
• Pareto Fronts
• The optimal set is often plotted in the objective space; this plot is called the Pareto front.
[Figure: a Pareto front, with axes Privacy and Utility (error).]

Experiments
[Figure: results for normal distributions with different δ.]

[Figure: results for the first attribute of the Adult data set.]

[Figure: results for the normal distribution (δ=0.75).]

Summary
• We use an evolutionary multi-objective optimization technique to search for optimal RR matrices.
• The evaluation shows that our scheme achieves better performance than the existing RR schemes.
