A Privacy Preserving Index for Range Queries Bijit Hore, Sharad Mehrotra, Gene Tsudik
Database as a Service (DAS) [Hacigumus et al., SIGMOD 2002] • A client wants to store data on a remote server and run queries on it • BUT the client does not trust the server • Solution: encrypt the data and store it on the server • Question: how do you query the encrypted data? [Architecture figure: the user's original query is rewritten by a query translator at the trusted client into a query over encrypted data; the untrusted server (service provider) stores the encrypted & indexed client data and returns encrypted results; a client-side query post-processor produces the true results.]
Data storage in DAS [Figure: the original plain-text table R is bucketized; the client keeps only the metadata, i.e. the bucket tags Z0-Z4 with boundaries 0, 200, 450, 600, 650, 700, while the server stores the encrypted and indexed table RA labeled with the bucket tags.]
Querying in DAS • Client-side query: SELECT * FROM R WHERE R.sal ∈ [400K, 600K] • Server-side query: SELECT etuple FROM RA WHERE RA.salA = z1 ∨ z2 [Figure: the range predicate over the plain-text table R is translated, using the client-side bucket tags, into an equality predicate over tags on the encrypted and indexed server-side table RA.]
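To make the translation step concrete, here is a minimal Python sketch of the client-side query rewriting. The bucket boundaries, tags, and the half-open boundary convention are assumptions taken from the previous slide's example, not details fixed by the paper.

```python
def tags_for_range(q_lo, q_hi, partition):
    """Return the bucket tags the server must match for a client range query.

    partition: list of (lo, hi, tag) triples kept in the client-side metadata.
    Each bucket is treated here as the half-open interval (lo, hi]; the exact
    boundary convention is an assumption of this sketch.
    """
    return [tag for lo, hi, tag in partition if q_lo <= hi and q_hi > lo]

# Hypothetical client-side metadata matching the boundaries on the previous slide:
meta = [(0, 200, "Z0"), (200, 450, "Z1"), (450, 600, "Z2"),
        (600, 650, "Z3"), (650, 700, "Z4")]

tags = tags_for_range(400, 600, meta)      # -> ['Z1', 'Z2']
server_query = ("SELECT etuple FROM RA WHERE "
                + " OR ".join(f"RA.salA = '{t}'" for t in tags))
```

The server evaluates only the tag-equality predicate; the client decrypts the returned etuples and filters out false positives in the post-processing step.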
Issues in partitioning • How many buckets should one use? • How to partition the data?
Data Privacy in DAS • Adversary (A): access to server-side data + malicious intentions • Privacy issue in partitioned data: a small value range for a bucket B + one sample value from B ⇒ "almost total" disclosure of all elements in B • Privacy goal of the client: to hide all useful information from A ⇒ put all values of an attribute in a single bucket!
Research challenges & our contributions • Precision: how to partition the data • Definition • Optimal partitioning to maximize precision • Privacy: quantifying disclosure • Adversary's goals • Measures of information disclosure • Privacy-Precision trade-off • Controlled diffusion algorithm • Experiments & Conclusion
Precision of range queries • Given a partition of the data into M buckets • Precision(q) = 1 − (# false positives / # tuples returned for q) • Recall = 1 • Workload: all O(N²) range queries are equiprobable (uniform) • # false positives ∝ Σ_B N_B·F_B; for the M = 2 example this is 5·32 + 5·18 = 250 (one bucket is labeled N_B = 5, F_B = 18) • For the marked query q: Precision = 1 − 20/50 = 0.6 [Figure: frequency histogram over the salary domain (in 100K's), N = 10 (domain size), split into two equi-width buckets.]
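As an illustration of the precision formula, the sketch below computes Precision(q) for a single range query over a given bucketization. The per-bucket histogram representation and the example frequencies (reconstructed from the figure) are assumptions of this sketch.

```python
def precision_for_query(q_lo, q_hi, buckets):
    """Precision(q) = 1 - (# false positives / # tuples returned for q).

    buckets: list of per-bucket histograms {value: frequency}. A bucket is
    returned in full as soon as any of its values falls inside the query
    range, which is what produces false positives.
    """
    returned = false_pos = 0
    for hist in buckets:
        if any(q_lo <= v <= q_hi for v in hist):       # bucket overlaps q
            for value, freq in hist.items():
                returned += freq
                if not (q_lo <= value <= q_hi):
                    false_pos += freq
    return 1.0 if returned == 0 else 1 - false_pos / returned

# Two equi-width buckets over a 10-value salary domain (frequencies assumed):
b1 = {1: 4, 2: 4, 3: 4, 4: 10, 5: 10}
b2 = {6: 6, 7: 4, 8: 4, 9: 2, 10: 2}
precision_for_query(4, 7, [b1, b2])        # -> 0.6, i.e. 1 - 20/50
```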
Query optimal buckets (QOB) • Optimization problem: for the uniform workload, find a partition of the data into M buckets that minimizes the total # false positives, i.e. minimize Σ_{B=1}^{M} N_B·F_B (here M = 4) • The problem decomposes into the optimal solution to a sub-problem plus the cost of the rightmost bucket, e.g. QOB(1, 10, 4) = QOB(1, 7, 3) + Cost(8, 10), where the rightmost bucket has N_B·F_B = 24 [Figure: the salary histogram (100K's), N = 10 (domain size), with the rightmost bucket covering values 8-10.]
QOB (cont.) • Optimal cost = Σ_{B=1}^{4} N_B·F_B = 12·3 + 20·2 + 10·2 + 8·3 = 120 for the optimal buckets B1-B4 • Time complexity = O(n²M), space = O(nM), where n = # distinct values in the dataset and M = # buckets [Figure: the optimal buckets B1-B4 overlaid on the salary histogram (100K's).]
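The recurrence above translates directly into a dynamic program. The sketch below is one possible implementation, assuming a per-bucket cost of N_B·F_B = (# distinct values in the bucket) × (# tuples in the bucket), which is how the slide's numbers work out; the example histogram is reconstructed from the figure and should also be treated as an assumption. It runs in the stated O(n²M) time.

```python
from functools import lru_cache

def optimal_buckets(freq, M):
    """Query-optimal bucketization (QOB) via dynamic programming -- a sketch.

    freq[i] is the frequency of the i-th domain value (0-indexed); a bucket
    covering values lo..hi is charged (# distinct values) * (# tuples), i.e.
    N_B * F_B.  Returns (optimal cost, list of (lo, hi) bucket bounds).
    Assumes M <= len(freq).  O(n*M) states with O(n) transitions each.
    """
    n = len(freq)
    prefix = [0]
    for f in freq:
        prefix.append(prefix[-1] + f)

    def cost(lo, hi):                 # cost of one bucket over values lo..hi
        return (hi - lo + 1) * (prefix[hi + 1] - prefix[lo])

    @lru_cache(maxsize=None)
    def qob(hi, m):                   # best (cost, bucket starts) for 0..hi, m buckets
        if m == 1:
            return cost(0, hi), (0,)
        best = None
        for split in range(m - 1, hi + 1):        # last bucket is [split, hi]
            sub_cost, sub_starts = qob(split - 1, m - 1)
            total = sub_cost + cost(split, hi)
            if best is None or total < best[0]:
                best = (total, sub_starts + (split,))
        return best

    total, starts = qob(n - 1, M)
    ends = list(starts[1:]) + [n]
    return total, [(s, e - 1) for s, e in zip(starts, ends)]

# Salary histogram reconstructed from the slide's figure (an assumption):
# optimal_buckets([4, 4, 4, 10, 10, 6, 4, 4, 2, 2], 4)
# -> (120, [(0, 2), (3, 4), (5, 6), (7, 9)]), i.e. the buckets B1..B4 above.
```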
Outline • Optimal data partitioning for range queries • Adversarial goals & privacy measures • Balancing privacy and precision • Experiments & conclusion
Adversary's learning model • A needs to learn bucket properties in order to estimate sensitive values • Model: A's domain knowledge + sample values from the buckets ⇒ A learns the distribution of values in the buckets • Worst-case assumption for the privacy analysis: A knows the exact value distribution of every bucket
Adversarial Goal (I) • Individual-centric information, e.g. "What is the salary of individual I?" • Value Estimation Power (VEP) of A: the average error of value estimation by the adversary; the variance of the bucket distribution is an inverse measure of VEP • Preferred: large variance [Figure: two bucket distributions, one with a large bucket range and large variance (preferred), one with a small bucket range and small variance.]
Adversarial Goal (II) • Query-centric information, e.g. "Which individuals have a salary in [100K, 150K]?" • Set Estimation Power (SEP) of A: the average error of query-set estimation by the adversary; the entropy of the bucket distribution, H(X) = −Σ_i p_i log p_i, is an inverse measure of SEP • Best case: high entropy + large variance [Figure: two bucket distributions around the range [100K, 150K], one with low entropy + large variance, one with high entropy + large variance (preferred).]
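Both measures are straightforward to compute per bucket. A minimal sketch, representing a bucket simply as the multiset of its values (the log base for the entropy is an arbitrary choice here):

```python
import math
from collections import Counter

def bucket_variance(values):
    """Variance of the value distribution inside one bucket.

    Higher variance means lower Value Estimation Power for the adversary,
    i.e. more individual-centric privacy."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def bucket_entropy(values):
    """Shannon entropy of the value distribution inside one bucket.

    Higher entropy means lower Set Estimation Power for the adversary,
    i.e. more query-centric privacy."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())
```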
Outline • Optimal data partitioning for range queries • Adversarial goals & privacy measures • Balancing privacy and precision • Experiments & conclusion
Privacy-Precision Trade-off • Optimal buckets might offer less privacy than desired • Small variance ⇒ partial disclosure of numeric values • Small entropy ⇒ total disclosure with high probability (e.g. for categorical data) and partial detection of query-sets (in all cases) • Objective: an algorithm that trades off a bounded amount of query precision for greater variance and entropy
The controlled diffusion algorithm: a simple observation • Let a query Q overlap only with bucket B0 • If the elements of B0 are distributed randomly into composite buckets CB1, CB2 & CB3, then Q now overlaps with CB1, CB2 & CB3 • With the new buckets, the precision for Q drops by a factor of (|CB1| + |CB2| + |CB3|) / |B0| • Hence, in any re-distribution scheme where this ratio is ≤ K for every Bi, the precision degradation is bounded above by K [Figure: bucket B0 diffused into composite buckets CB1, CB2, CB3.]
Controlled diffusion algorithm • Compute the optimal buckets B1 … BM on the data set D • Fix a maximum degradation factor K • Initialize M empty composite buckets CB1 … CBM • Set the target size of each CB to f_CB = |D|/M (equi-depth) • For each Bi: select d_i CB's at random, where d_i = K·|Bi|/f_CB, and diffuse the elements of Bi into them uniformly at random (see the sketch below)
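A minimal Python sketch of these steps follows. The data representation, the rounding of d_i, and the cap d_i ≤ M are assumptions of the sketch, not details fixed by the slides.

```python
import random
from math import ceil

def controlled_diffusion(buckets, K, seed=None):
    """Diffuse the query-optimal buckets B_1..B_M into M composite buckets.

    buckets: list of lists, the optimal buckets (each a list of values/tuples).
    K:       maximum allowed precision-degradation factor.
    Returns the list of composite buckets CB_1..CB_M.
    """
    rng = random.Random(seed)
    M = len(buckets)
    D = sum(len(b) for b in buckets)
    f_cb = D / M                                      # equi-depth target size
    composite = [[] for _ in range(M)]

    for b in buckets:
        d_i = min(M, max(1, ceil(K * len(b) / f_cb)))  # how many CBs B_i may touch
        targets = rng.sample(range(M), d_i)            # chosen at random
        for element in b:
            composite[rng.choice(targets)].append(element)  # uniform diffusion
    return composite
```

The client-side metadata must then record, for each original bucket range, which composite buckets received its elements; since each B_i touches at most about K·|B_i|/f_CB of them, this is what grows the metadata from O(M) to O(KM) (see the backup slide at the end).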
Controlled Diffusion (Example) • Degradation factor K = 2 • Metadata size increases from O(M) to O(KM) [Figure: the query-optimal buckets B1-B4 over the salary histogram are diffused into composite buckets CB1-CB4, which form the final set of buckets stored on the server.]
Some features of the diffusion algorithm • Many consecutive optimal buckets may get diffused into a common set of CB's, so the observed precision degradation is < K • Elements with the same value can go to multiple buckets, giving the scheme an extra degree of freedom compared to hashing (though this makes it not the best choice for point queries) • Because of the random choices in the algorithm, each bucket's distribution approaches the overall data distribution as K increases, reducing the information the adversary gains by learning the buckets
Outline • Optimal data partitioning for range queries • Adversarial goals & privacy measures • Balancing privacy and precision • Experiments & conclusion
Experiments • Data sets • Synthetic data: 10^5 integers drawn uniformly at random from [0, 999] • Real data: 10^4 real values in [-0.8, 8.0] from the "Corel Image" dataset (UCI KDD archive) • Query workloads (2, of size 10^4 each) • End points chosen uniformly at random from the respective domains
[Result plots: relative decrease in precision of the composite buckets; relative increase in standard deviation in the composite buckets; relative increase in entropy in the composite buckets.]
Composite buckets (sample) [Plots of sample composite-bucket value distributions for K = 6, M = 350 and for K = 10, M = 250.]
Visualizing trade-offs for various bucketization parameters • E.g., the marked points show the average entropy & precision obtained for 100 buckets and a degradation factor of 2, and the same point in the precision vs. standard-deviation trade-off space • This provides an easy way to visualize the design space and choose parameters of interest
Summary • An optimal algorithm for partitioning data for range queries • Statistical measures of data privacy • Variance • Entropy • A fast & simple algorithm for re-bucketizing data • Bounded amount of precision degradation • Substantial increase in privacy level
Related work • Hacigumus et al., SIGMOD 2002, "Executing SQL over Encrypted Data in the Database Service Provider Model". • Damiani et al., ACM CCS 2003, "Balancing Confidentiality and Efficiency in Untrusted Relational DBMSs". • Bouganim et al., VLDB 2002, "Chip-Secured Data Access: Confidential Data on Untrusted Servers".
THANK YOU! Questions?
Privacy in DAS • Here the goal of "data privacy" is not just ensuring non-disclosure of identity; it is more general! • DAS: privacy criterion = hide as much information as possible (even at the aggregate level); utility criterion = maintain only the information required for server-side query evaluation (at the desired degree of accuracy) • Privacy-preserving DM & statistical DBs: privacy criterion = protect against disclosure of identity; utility criterion = minimize information loss, i.e. maximize utility for data miners by retaining as much aggregate-level information as possible
Individual Privacy Measure: Average Squared Error of Estimation (ASEE) • The error in approximating the true value of a r.v. X_B by another r.v. X_B' (learned by A) • ASEE(X_B, X_B') = Var(X_B) + Var(X_B') + (E(X_B) − E(X_B'))² • The variance of the bucket distribution, Var(X_B), is our measure of individual privacy (a lower bound)
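For completeness, the expression above follows by expanding the expected squared error E[(X_B − X_B')²] when the adversary's estimate X_B' is drawn independently of X_B (so that E[X_B X_B'] = E[X_B]·E[X_B']); treating ASEE as exactly this quantity is an assumption of the derivation sketched here.

```latex
\begin{aligned}
\mathrm{ASEE}(X_B, X_B')
  &= E\!\left[(X_B - X_B')^2\right]
   = E[X_B^2] - 2\,E[X_B]\,E[X_B'] + E[X_B'^2] \\
  &= \bigl(\mathrm{Var}(X_B) + E[X_B]^2\bigr) - 2\,E[X_B]\,E[X_B']
     + \bigl(\mathrm{Var}(X_B') + E[X_B']^2\bigr) \\
  &= \mathrm{Var}(X_B) + \mathrm{Var}(X_B') + \bigl(E[X_B] - E[X_B']\bigr)^2 .
\end{aligned}
```

Since the last two terms are non-negative, ASEE ≥ Var(X_B), which is why the bucket variance serves as a lower bound on the adversary's estimation error.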
Set-oriented Privacy Measure • The entropy of the bucket distribution, H(X) = −Σ_i p_i log p_i, is our measure of query-centric privacy • It measures the uncertainty associated with a r.v. (e.g. the true class of an element, for categorical data) • It is an inverse measure of the quality of the partial solution sets that A can derive for a query
Metadata size increase in diffusion • The metadata grows from O(M) to K·|B1|/f_CB + K·|B2|/f_CB + … + K·|BM|/f_CB = (K/f_CB)·(|B1| + |B2| + … + |BM|) = (KM/|D|)·|D| = O(KM)