A Privacy Preserving Index for Range Queries Bijit Hore, Sharad Mehrotra, Gene Tsudik
Database as a Service (DAS) [Hacigumus et al., SIGMOD 2002] • A client wants to store data on a remote server and run queries on it • BUT the client does not trust the server • Solution: encrypt the data and store it on the server • Question: how do you query the encrypted data? [Architecture figure: the user's original query is rewritten by a query translator at the trusted client into a query over encrypted data; the untrusted server (service provider) stores the encrypted & indexed client data and returns encrypted results; a client-side query post-processor produces the true results.]
Data storage in DAS [Figure: the original plain-text table R is bucketized; the client keeps only the metadata, i.e. the bucket tags Z0-Z4 with boundaries 0, 200, 450, 600, 650, 700, while the server stores the encrypted and indexed table RA labeled with the bucket tags.]
Querying in DAS • Client-side query: SELECT * FROM R WHERE R.sal ∈ [400K, 600K] • Server-side query: SELECT etuple FROM RA WHERE RA.salA = z1 ∨ z2 [Figure: the range predicate over the plain-text table R is translated, using the client-side bucket tags, into an equality predicate over tags on the encrypted and indexed server-side table RA.]
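To make the translation step concrete, here is a minimal Python sketch of the client-side query rewriting. The bucket boundaries, tags, and the half-open boundary convention are assumptions taken from the previous slide's example, not details fixed by the paper.

```python
def tags_for_range(q_lo, q_hi, partition):
    """Return the bucket tags the server must match for a client range query.

    partition: list of (lo, hi, tag) triples kept in the client-side metadata.
    Each bucket is treated here as the half-open interval (lo, hi]; the exact
    boundary convention is an assumption of this sketch.
    """
    return [tag for lo, hi, tag in partition if q_lo <= hi and q_hi > lo]

# Hypothetical client-side metadata matching the boundaries on the previous slide:
meta = [(0, 200, "Z0"), (200, 450, "Z1"), (450, 600, "Z2"),
        (600, 650, "Z3"), (650, 700, "Z4")]

tags = tags_for_range(400, 600, meta)      # -> ['Z1', 'Z2']
server_query = ("SELECT etuple FROM RA WHERE "
                + " OR ".join(f"RA.salA = '{t}'" for t in tags))
```

The server evaluates only the tag-equality predicate; the client decrypts the returned etuples and filters out false positives in the post-processing step.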
Issues in partitioning • How many buckets should one use? • How to partition the data?
Data Privacy in DAS • Adversary (A): access to server-side data + malicious intentions • Privacy issue in partitioned data: a small value range for a bucket B + one sample value from B ⇒ "almost total" disclosure of all elements in B • Privacy goal of the client: to hide all useful information from A ⇒ put all values of an attribute in a single bucket!
Research challenges & our contributions • Precision: how to partition the data • Definition • Optimal partitioning to maximize precision • Privacy: quantifying disclosure • Adversary's goals • Measures of information disclosure • Privacy-Precision trade-off • Controlled diffusion algorithm • Experiments & Conclusion
Precision of range queries • Given a partition of the data into M buckets • Precision(q) = 1 − (# false positives / # tuples returned for q) • Recall = 1 • Workload: all O(N²) range queries are equiprobable (uniform) • # false positives ∝ Σ_B N_B·F_B; for the M = 2 example this is 5·32 + 5·18 = 250 (one bucket is labeled N_B = 5, F_B = 18) • For the marked query q: Precision = 1 − 20/50 = 0.6 [Figure: frequency histogram over the salary domain (in 100K's), N = 10 (domain size), split into two equi-width buckets.]
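As an illustration of the precision formula, the sketch below computes Precision(q) for a single range query over a given bucketization. The per-bucket histogram representation and the example frequencies (reconstructed from the figure) are assumptions of this sketch.

```python
def precision_for_query(q_lo, q_hi, buckets):
    """Precision(q) = 1 - (# false positives / # tuples returned for q).

    buckets: list of per-bucket histograms {value: frequency}. A bucket is
    returned in full as soon as any of its values falls inside the query
    range, which is what produces false positives.
    """
    returned = false_pos = 0
    for hist in buckets:
        if any(q_lo <= v <= q_hi for v in hist):       # bucket overlaps q
            for value, freq in hist.items():
                returned += freq
                if not (q_lo <= value <= q_hi):
                    false_pos += freq
    return 1.0 if returned == 0 else 1 - false_pos / returned

# Two equi-width buckets over a 10-value salary domain (frequencies assumed):
b1 = {1: 4, 2: 4, 3: 4, 4: 10, 5: 10}
b2 = {6: 6, 7: 4, 8: 4, 9: 2, 10: 2}
precision_for_query(4, 7, [b1, b2])        # -> 0.6, i.e. 1 - 20/50
```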
Query optimal buckets (QOB) • Optimization problem: for the uniform workload, find a partition of the data into M buckets that minimizes the total # false positives, i.e. minimize Σ_{B=1}^{M} N_B·F_B (here M = 4) • The problem decomposes into the optimal solution to a sub-problem plus the cost of the rightmost bucket, e.g. QOB(1, 10, 4) = QOB(1, 7, 3) + Cost(8, 10), where the rightmost bucket has N_B·F_B = 24 [Figure: the salary histogram (100K's), N = 10 (domain size), with the rightmost bucket covering values 8-10.]
QOB (cont.) • Optimal cost = Σ_{B=1}^{4} N_B·F_B = 12·3 + 20·2 + 10·2 + 8·3 = 120 for the optimal buckets B1-B4 • Time complexity = O(n²M), space = O(nM), where n = # distinct values in the dataset and M = # buckets [Figure: the optimal buckets B1-B4 overlaid on the salary histogram (100K's).]
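The recurrence above translates directly into a dynamic program. The sketch below is one possible implementation, assuming a per-bucket cost of N_B·F_B = (# distinct values in the bucket) × (# tuples in the bucket), which is how the slide's numbers work out; the example histogram is reconstructed from the figure and should also be treated as an assumption. It runs in the stated O(n²M) time.

```python
from functools import lru_cache

def optimal_buckets(freq, M):
    """Query-optimal bucketization (QOB) via dynamic programming -- a sketch.

    freq[i] is the frequency of the i-th domain value (0-indexed); a bucket
    covering values lo..hi is charged (# distinct values) * (# tuples), i.e.
    N_B * F_B.  Returns (optimal cost, list of (lo, hi) bucket bounds).
    Assumes M <= len(freq).  O(n*M) states with O(n) transitions each.
    """
    n = len(freq)
    prefix = [0]
    for f in freq:
        prefix.append(prefix[-1] + f)

    def cost(lo, hi):                 # cost of one bucket over values lo..hi
        return (hi - lo + 1) * (prefix[hi + 1] - prefix[lo])

    @lru_cache(maxsize=None)
    def qob(hi, m):                   # best (cost, bucket starts) for 0..hi, m buckets
        if m == 1:
            return cost(0, hi), (0,)
        best = None
        for split in range(m - 1, hi + 1):        # last bucket is [split, hi]
            sub_cost, sub_starts = qob(split - 1, m - 1)
            total = sub_cost + cost(split, hi)
            if best is None or total < best[0]:
                best = (total, sub_starts + (split,))
        return best

    total, starts = qob(n - 1, M)
    ends = list(starts[1:]) + [n]
    return total, [(s, e - 1) for s, e in zip(starts, ends)]

# Salary histogram reconstructed from the slide's figure (an assumption):
# optimal_buckets([4, 4, 4, 10, 10, 6, 4, 4, 2, 2], 4)
# -> (120, [(0, 2), (3, 4), (5, 6), (7, 9)]), i.e. the buckets B1..B4 above.
```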
Outline • Optimal data partitioning for range queries • Adversarial goals & privacy measures • Balancing privacy and precision • Experiments & conclusion
Adversary's learning model • A needs to learn bucket properties in order to estimate sensitive values • Model: A's domain knowledge + sample values from the buckets ⇒ A learns the distribution of values in the buckets • Worst-case assumption for the privacy analysis: A knows the exact value distribution of every bucket
Adversarial Goal (I) • Individual-centric information, e.g. "What is the salary of individual I?" • Value Estimation Power (VEP) of A: the average error of value estimation by the adversary; the variance of the bucket distribution is an inverse measure of VEP • Preferred: large variance [Figure: two bucket distributions, one with a large bucket range and large variance (preferred), one with a small bucket range and small variance.]
Adversarial Goal (II) • Query-centric information, e.g. "Which individuals have a salary in [100K, 150K]?" • Set Estimation Power (SEP) of A: the average error of query-set estimation by the adversary; the entropy of the bucket distribution, H(X) = −Σ_i p_i log p_i, is an inverse measure of SEP • Best case: high entropy + large variance [Figure: two bucket distributions around the range [100K, 150K], one with low entropy + large variance, one with high entropy + large variance (preferred).]
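Both measures are straightforward to compute per bucket. A minimal sketch, representing a bucket simply as the multiset of its values (the log base for the entropy is an arbitrary choice here):

```python
import math
from collections import Counter

def bucket_variance(values):
    """Variance of the value distribution inside one bucket.

    Higher variance means lower Value Estimation Power for the adversary,
    i.e. more individual-centric privacy."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def bucket_entropy(values):
    """Shannon entropy of the value distribution inside one bucket.

    Higher entropy means lower Set Estimation Power for the adversary,
    i.e. more query-centric privacy."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())
```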
Outline • Optimal data partitioning for range queries • Adversarial goals & privacy measures • Balancing privacy and precision • Experiments & conclusion
Privacy-Precision Trade-off • Optimal buckets might offer less privacy than desired • Small variance ⇒ partial disclosure of numeric values • Small entropy ⇒ total disclosure with high probability (e.g. for categorical data) and partial detection of query-sets (in all cases) • Objective: an algorithm that trades off a bounded amount of query precision for greater variance and entropy
The controlled diffusion algorithm: a simple observation • Let a query Q overlap only with bucket B0 • If the elements of B0 are distributed randomly into composite buckets CB1, CB2 & CB3, then Q now overlaps with CB1, CB2 & CB3 • With the new buckets, the precision for Q drops by a factor of (|CB1| + |CB2| + |CB3|) / |B0| • Hence, in any re-distribution scheme where this ratio is ≤ K for every Bi, the precision degradation is bounded above by K [Figure: bucket B0 diffused into composite buckets CB1, CB2, CB3.]
Controlled diffusion algorithm • Compute the optimal buckets B1 … BM on the data set D • Fix a maximum degradation factor K • Initialize M empty composite buckets CB1 … CBM • Set the target size of each CB to f_CB = |D|/M (equi-depth) • For each Bi: select d_i CB's at random, where d_i = K·|Bi|/f_CB, and diffuse the elements of Bi into them uniformly at random (see the sketch below)
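A minimal Python sketch of these steps follows. The data representation, the rounding of d_i, and the cap d_i ≤ M are assumptions of the sketch, not details fixed by the slides.

```python
import random
from math import ceil

def controlled_diffusion(buckets, K, seed=None):
    """Diffuse the query-optimal buckets B_1..B_M into M composite buckets.

    buckets: list of lists, the optimal buckets (each a list of values/tuples).
    K:       maximum allowed precision-degradation factor.
    Returns the list of composite buckets CB_1..CB_M.
    """
    rng = random.Random(seed)
    M = len(buckets)
    D = sum(len(b) for b in buckets)
    f_cb = D / M                                      # equi-depth target size
    composite = [[] for _ in range(M)]

    for b in buckets:
        d_i = min(M, max(1, ceil(K * len(b) / f_cb)))  # how many CBs B_i may touch
        targets = rng.sample(range(M), d_i)            # chosen at random
        for element in b:
            composite[rng.choice(targets)].append(element)  # uniform diffusion
    return composite
```

The client-side metadata must then record, for each original bucket range, which composite buckets received its elements; since each B_i touches at most about K·|B_i|/f_CB of them, this is what grows the metadata from O(M) to O(KM) (see the backup slide at the end).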
Controlled Diffusion (Example) • Degradation factor K = 2 • Metadata size increases from O(M) to O(KM) [Figure: the query-optimal buckets B1-B4 over the salary histogram are diffused into composite buckets CB1-CB4, which form the final set of buckets stored on the server.]
Some features of the diffusion algorithm • Many consecutive optimal buckets may get diffused into a common set of CB's, so the observed precision degradation is < K • Elements with the same value can go to multiple buckets, giving the scheme an extra degree of freedom compared to hashing (though this makes it not the best choice for point queries) • Because of the random choices in the algorithm, each bucket's distribution approaches the overall data distribution as K increases, reducing the information the adversary gains by learning the buckets
Outline • Optimal data partitioning for range queries • Adversarial goals & privacy measures • Balancing privacy and precision • Experiments & conclusion
Experiments • Data sets • Synthetic data: 10^5 integers drawn uniformly at random from [0, 999] • Real data: 10^4 real values in [-0.8, 8.0] from the "Corel Image" dataset (UCI KDD archive) • Query workloads (2, of size 10^4 each) • End points chosen uniformly at random from the respective domains
[Result plots: relative decrease in precision of the composite buckets; relative increase in standard deviation in the composite buckets; relative increase in entropy in the composite buckets.]
Composite buckets (sample) [Plots of sample composite-bucket value distributions for K = 6, M = 350 and for K = 10, M = 250.]
Visualizing trade-offs for various bucketization parameters • E.g., the marked points show the average entropy & precision obtained for 100 buckets and a degradation factor of 2, and the same point in the precision vs. standard-deviation trade-off space • This provides an easy way to visualize the design space and choose parameters of interest
Summary • An optimal algorithm for partitioning data for range queries • Statistical measures of data privacy • Variance • Entropy • A fast & simple algorithm for re-bucketizing data • Bounded amount of precision degradation • Substantial increase in privacy level
Related work • Hacigumus et al., SIGMOD 2002, "Executing SQL over Encrypted Data in the Database Service Provider Model". • Damiani et al., ACM CCS 2003, "Balancing Confidentiality and Efficiency in Untrusted Relational DBMSs". • Bouganim et al., VLDB 2002, "Chip-Secured Data Access: Confidential Data on Untrusted Servers".
THANK YOU! Questions?
Privacy in DAS • Here the goal of "data privacy" is not just ensuring non-disclosure of identity; it is more general! • DAS: privacy criterion = hide as much information as possible (even at the aggregate level); utility criterion = maintain only the information required for server-side query evaluation (at the desired degree of accuracy) • Privacy-preserving DM & statistical DBs: privacy criterion = protect against disclosure of identity; utility criterion = minimize information loss, i.e. maximize utility for data miners by retaining as much aggregate-level information as possible
Individual Privacy Measure: Average Squared Error of Estimation (ASEE) • The error in approximating the true value of a r.v. X_B by another r.v. X_B' (learned by A) • ASEE(X_B, X_B') = Var(X_B) + Var(X_B') + (E(X_B) − E(X_B'))² • The variance of the bucket distribution, Var(X_B), is our measure of individual privacy (a lower bound)
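For completeness, the expression above follows by expanding the expected squared error E[(X_B − X_B')²] when the adversary's estimate X_B' is drawn independently of X_B (so that E[X_B X_B'] = E[X_B]·E[X_B']); treating ASEE as exactly this quantity is an assumption of the derivation sketched here.

```latex
\begin{aligned}
\mathrm{ASEE}(X_B, X_B')
  &= E\!\left[(X_B - X_B')^2\right]
   = E[X_B^2] - 2\,E[X_B]\,E[X_B'] + E[X_B'^2] \\
  &= \bigl(\mathrm{Var}(X_B) + E[X_B]^2\bigr) - 2\,E[X_B]\,E[X_B']
     + \bigl(\mathrm{Var}(X_B') + E[X_B']^2\bigr) \\
  &= \mathrm{Var}(X_B) + \mathrm{Var}(X_B') + \bigl(E[X_B] - E[X_B']\bigr)^2 .
\end{aligned}
```

Since the last two terms are non-negative, ASEE ≥ Var(X_B), which is why the bucket variance serves as a lower bound on the adversary's estimation error.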
Set-oriented Privacy Measure • The entropy of the bucket distribution, H(X) = −Σ_i p_i log p_i, is our measure of query-centric privacy • It measures the uncertainty associated with a r.v. (e.g. the true class of an element, for categorical data) • It is an inverse measure of the quality of the partial solution sets that A can derive for a query
Metadata size increase in diffusion • The metadata grows from O(M) to K·|B1|/f_CB + K·|B2|/f_CB + … + K·|BM|/f_CB = (K/f_CB)·(|B1| + |B2| + … + |BM|) = (KM/|D|)·|D| = O(KM)