540 likes | 678 Views
Secure Data Outsourcing. Outline. Motivation Background Research issues Summary. Motivation. Cost of maintaining/mining large data 4-5 times of the cost of data acquisition DBAs are paid well More and more data service providers Low cost – cloud computing
E N D
Outline • Motivation • Background • Research issues • Summary
Motivation • Cost of maintaining/mining large data • 4-5 times of the cost of data acquisition • DBAs are paid well • More and more data service providers • Low cost – cloud computing • Maintain one database for one user multiple users • Examples: • Alentus.com • Datapipe.com • Discountasp.net • … • Concerns about data security and privacy • Untrusted service provider
Un-trusted service provider • Lazy: incentives to perform less • Curious: incentives to acquire information • Malicious: • Denial of service • Incorrect results • Possibly compromised
Challenges • Data confidentiality • Data need to be encrypted (?) • Utility of protected data? • Query utility • Mining utility • Access pattern privacy • Integrity • Data integrity • Query integrity • Correct • Complete • Fresh
Why is it hard for query services? • Arbitrary expressivity • SQL statements • Often, restricted for certain type of query for simplicity (e.g. range query, knn query) • Cost • Communication • Computation (server side vs client side)
Why it is hard for mining services? • Many data mining models • Different utilities to preserve • No one-size-for-all solutions
Data confidentiality • Bucketization method (crypto-index) • Order preserving encryption • Perturbations
Bucketization method • Hacigumus (SIGMOD02)
Main steps • Partition sensitive attributes • Order preserving: supports comparison • Random: query rewriting becomes hard • Build index on the partitions • Rewrite queries to target partitions • ‘john doe’ 105 • Select * from T’ where name=105 • Execute queries and return results • Prune/post-process results on client
Trade off between confidentiality and overhead • Larger partition increased privacy increased overheads
Order preserving encryption • Agrawal2004, Boldyreva2009 • The set of data is securely transformed so that the order is preserved but the distribution and domain are changed • Benefits: indexing/searching on OPE encrypted data • Weakness: once the original distribution is known, OPE is broken
Not attribute-wise order preserving • Order preserving encryption (OPE, Agrawal et al 2004) is not resilient to distribution-based attacks Bucket based Estimation OPE Original Xi distribution is known Transformed Xi’ distribution
Data perturbation • Definition 1. randomly change the original data 2. the attacker cannot effectively recover the original data 3. the desired properties are preserved • Techniques • Single dimension: noise addition • Multidimensional • Geometric perturbation • Random projection • RASP random space perturbation
Noise addition • Y = X+ R • X: original data column, R: random noise (distribution published), Y: published data • Applications in data mining • Reconstructing column distribution • Rakesh Agrawal SIGMOD 2000 • Applied to privacy-preserving decision tree, naïve bayes classifier • Attacks • Spectral filtering (Kargupta ICDM 2004) • PCA reconstruction (Huang SIGMOD2005)
Multiplicative perturbations • Geometric data perturbation for outsourced data mining • Random Projection • RASP perturbation for query services (range query, kNN query).
Perturbation-based framework Mining service
Geometric data perturbation • Y=RX+T+D • R: secret rotation matrix (preserve Euclidean distances) • T: secret random translation matrix, D: secret random noise matrix • Distances are approximately preserved (D) • Resilient to most attacks to rotation perturbation • Applications • Outsourced privacy preserving data mining, applicable for many classification and clustering algorithms • Attacks • Population based attacks (when covariance matrix is revealed)
Random Projection • Y=AX+D • A: random projection, e.g., entries from N(0,1) • Distances are approximately preserved • Applications • Many classification and clustering algorithms • Worse accuracy than geometric perturbation • Good for sparse high-dimensional data (text data), i.e., sketch methods (A is randomly generated for EACH record) • Attacks • Possibly more resilient than other two perturbation methods • But utility (distance) is not well preserved
RASP perturbation k-dimensional numeric data, n records, represented as a k x n matrix, x: a record (1) Extend x to k+2 dimensions • (K+1) th dimension is always 1 – homogeneous dimension • (K+2) th dimension v is a real random number drawn from (2) Encryption - A is a (k+2)x(k+2) invertible real value matrix, with at least two non-zero values for each row and the last column of A has all non-zero values - A is shared by all records
Properties • Not an OPE • Preserves convexity of the dataset • Convex dataset in Rk another convex dataset in Rk+2. • Good for range query • Each range query in Rk hyperplane based query range query in Rk+2.
RASP properties • Convexity preserving • Queried range (hypercube) is convex • RASP transforms the range to another convex (polyhedron) half space: wTx<=a wTx=a The intersection of convex sets is also convex.
illustration of convexity preserving Encrypted space Original space
Secure query transformation • A naïve solution • Based on the convexity preserving property Problems: (1) A-1 can be probed (2) is . . If a is known, the whole dimension i is breached.
Secure query transformation • Enhanced solution • Xk+2 is always positive • (Xi-a) 0 (Xi-a)Xk+2 0 • Correspondingly, in the encrypted space yTy 0, Problems addressed: (1) A-1 cannot be derived from (2) (Xi-a)Xk+2 0 contains the random component Xk+2 that protects the condition (Xi-a) 0
Efficient two-stage query processing • illustrated Stage2: Filter out the junk records Stage1: Querying this bounding box Original space Transformed space A multidimensional tree index is been built on the encrypted data (in the transformed space) in the server.
Stage 1: The client calculates the large bounding box; The server uses the index to find the results. Stage 2: filter the initial results with the conditions yTiy 0 for 1…2k Note: the two-stage strategy works, if the output of stage 1 is significantly smaller than the original database and can be fit into the memory. Otherwise, use linear scan with stage 2 filtering.
RASP-based data mining • Preserving range query linear classifier • Use the boosting framework to get strong classifiers (PerturBoost, in ICDM 2013)
Access pattern privacy • On database queries • Problem is the same as PIR • Attackers may use the access pattern to breach data confidentiality • Each of previous approaches should handle this problem!
PIR is impractical • Solutions based on private Information retrieval (PIR) • PIR is still impractical
For Bucktization approach • Based on the architecture of Hacigumus (SIGMOD02) • Hore VLDB04 • For range query • Privacy concern: reveal the distribution of value in each bucket • “Diffusion”: split buckets and combine parts of different buckets • Trade off: now the server needs to return more noisy results larger size
For OPE • Use queries to find out the distributions, then break the encryption
For RASP • Secure query transformation • Attacks to transformed queries
Oblivious RAM • Access pattern: read/write data items • Setting: • Client has a small secure memory • Server has large insecure storage, semi-honest • Data items are encrypted • Client cannot hide the accessed locations • An active area
Existing Approaches • Inside a level • Some real blocks • Useful data • Some dummy blocks • Random data • Randomly permuted • Only the client knows the permutation
Existing Approaches • Reading • Read a block from each level • One realblock. • Remaining are dummy blocks dummy real dummy dummy dummy dummy Client Server
Existing Approaches • Writing • Shuffle consecutively filled levels. • Write into next unfilled level. • Clear the source levels Server (after) Client Server (before) shuffle blocks
Continuous Shuffling … To write:
Integrity guarantee • Merkle hash tree H(H(x1)+H(x2)) , + is string concatenation Can be stored with tree like structure : index, xml
Using merkle tree Example: 5<=q<=10 LUB(q) = 4 GLB(q) = 11
Operations: • Selections, projections, equijoins, set ops • Issues • Works only on data with verification objects • Query expressiveness • Expensive • Related work • Pang et. al (ICDE04, SIGMOD05), using ElGamal function • Sion VLDB05: challenge token • F.Li SIGMOD06: freshness
Secure keyword search • Simple information retrieval • For a keyword, find the documents containing the keyword • What if the documents are encrypted word by word • and if the keyword is also encrypted
Secure keyword search • Song 2000 • Seed is random, different for • each Wi • Key idea: Li and Ri are self- • verifiable • Advantage of XOR
Setting of ki • Ki = Fk’(Wi), k’ is secret • User publishes W and k = Fk’(W) • Server checks CiW whether <Li, Fk(Li)> == CiW It reveals nothing if Ci is not the ciphertext for W. And Li is random for different Wi – server cannot find any information from Li.
Hidden search • In previous schemes, W is revealed • Weakness: each search will have to release k for W • Easy to collect information • Solution: encrypt Wi with an private key, then xor with <Li, Fk(Li)>