
Secure Data Outsourcing


maegan

Presentation Transcript


  1. Secure Data Outsourcing

  2. Outline • Motivation • Background • Research issues • Summary

  3. Motivation • Cost of maintaining/mining large data • 4-5 times the cost of data acquisition • DBAs are paid well • More and more data service providers • Low cost – cloud computing • Maintain one database for one user → multiple users • Examples: • Alentus.com • Datapipe.com • Discountasp.net • … • Concerns about data security and privacy • Untrusted service provider

  4. Untrusted service provider • Lazy: incentives to perform less • Curious: incentives to acquire information • Malicious: • Denial of service • Incorrect results • Possibly compromised

  5. Challenges • Data confidentiality • Data need to be encrypted (?) • Utility of protected data? • Query utility • Mining utility • Access pattern privacy • Integrity • Data integrity • Query integrity • Correct • Complete • Fresh

  6. Why is it hard for query services? • Arbitrary expressivity • SQL statements • Often restricted to certain query types for simplicity (e.g., range query, kNN query) • Cost • Communication • Computation (server side vs. client side)

  7. Why is it hard for mining services? • Many data mining models • Different utilities to preserve • No one-size-fits-all solution

  8. Data confidentiality • Bucketization method (crypto-index) • Order preserving encryption • Perturbations

  9. Bucketization method • Hacigumus (SIGMOD02)

  10. Main steps • Partition sensitive attributes • Order preserving: supports comparison • Random: query rewriting becomes hard • Build index on the partitions • Rewrite queries to target partitions • 'john doe' → 105 • SELECT * FROM T' WHERE name = 105 • Execute queries and return results • Prune/post-process results on client
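The steps above can be sketched end to end for a range query. This is a minimal illustration, not the Hacigumus et al. implementation: the bucket boundaries, the bucket ids, the `salary` attribute, and all function names are hypothetical, and the XOR keystream is a toy stand-in for a real cipher.

```python
import bisect
import hashlib
import json

# Hypothetical client-side secrets: partition boundaries and random bucket ids.
BOUNDS = [0, 20_000, 40_000, 60_000, 80_000, 100_000]
BUCKET_IDS = [105, 17, 342, 88, 9]  # random ids, so ids do not reveal bucket order


def _keystream(key, nonce, n):
    # Toy keystream from SHA-256 in counter mode; NOT a vetted cipher.
    out, counter = b"", 0
    while len(out) < n:
        block = key + nonce.to_bytes(8, "big") + counter.to_bytes(8, "big")
        out += hashlib.sha256(block).digest()
        counter += 1
    return out[:n]


def toy_cipher(key, nonce, data):
    # XOR with the keystream; applying it twice restores the input.
    return bytes(a ^ b for a, b in zip(data, _keystream(key, nonce, len(data))))


def bucket_index(value):
    i = bisect.bisect_right(BOUNDS, value) - 1
    return min(max(i, 0), len(BUCKET_IDS) - 1)


def outsource(rows, key):
    # The server stores (nonce, encrypted row, bucket id) and nothing else.
    table = []
    for nonce, row in enumerate(rows):
        blob = toy_cipher(key, nonce, json.dumps(row).encode())
        table.append((nonce, blob, BUCKET_IDS[bucket_index(row["salary"])]))
    return table


def rewrite_range(lo, hi):
    # Client rewrites "lo <= salary <= hi" into a set of bucket ids.
    return {BUCKET_IDS[i] for i in range(bucket_index(lo), bucket_index(hi) + 1)}


def server_select(table, bucket_ids):
    # Server-side filter works on bucket ids only.
    return [(nonce, blob) for nonce, blob, b in table if b in bucket_ids]


def client_postprocess(key, answers, lo, hi):
    # Client decrypts the superset and prunes the false positives.
    rows = [json.loads(toy_cipher(key, nonce, blob)) for nonce, blob in answers]
    return [r for r in rows if lo <= r["salary"] <= hi]
```

The server returns every row in the touched buckets, so the client always receives a superset of the true answer; this is the source of the overhead that grows with partition size.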

  11. Trade-off between confidentiality and overhead • Larger partitions → increased privacy → increased overhead

  12. Order preserving encryption • Agrawal2004, Boldyreva2009 • The set of data is securely transformed so that the order is preserved but the distribution and domain are changed • Benefits: indexing/searching on OPE encrypted data • Weakness: once the original distribution is known, OPE is broken

  13. Not attribute-wise order preserving • Order preserving encryption (OPE, Agrawal et al. 2004) is not resilient to distribution-based attacks • [Figure: bucket-based estimation on OPE; the known distribution of the original Xi is matched against the distribution of the transformed Xi']
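Why a known distribution breaks OPE can be seen with a small experiment. The sketch below is a simplified rank-matching estimate, not the specific bucket-based attack from the slide; the toy cipher (a random strictly increasing table) and the function names are assumptions made for illustration.

```python
import random


def make_toy_ope(domain_size, rng):
    # Toy OPE: a random strictly increasing table; table[x] encrypts x.
    table, c = [], 0
    for _ in range(domain_size):
        c += rng.randint(1, 1000)  # positive gaps keep the order
        table.append(c)
    return table


def rank_matching_attack(ciphertexts, known_plain_sample):
    # The attacker never sees the key. Because order is preserved, the
    # ciphertext at rank r is guessed to be the value at the same quantile
    # of a sample drawn from the known plaintext distribution.
    sorted_known = sorted(known_plain_sample)
    order = sorted(range(len(ciphertexts)), key=lambda i: ciphertexts[i])
    n, m = len(ciphertexts), len(sorted_known)
    guess = [0] * n
    for rank, i in enumerate(order):
        guess[i] = sorted_known[rank * m // n]
    return guess


def mean_abs_error(guess, truth):
    return sum(abs(g - t) for g, t in zip(guess, truth)) / len(truth)
```

On a skewed sample of a few thousand values the guesses typically land within a few units of the true plaintexts, even though the ciphertext domain and distribution look unrelated to the original.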

  14. Data perturbation • Definition: (1) the original data is randomly changed; (2) the attacker cannot effectively recover the original data; (3) the desired properties are preserved • Techniques • Single dimension: noise addition • Multidimensional • Geometric perturbation • Random projection • RASP (random space perturbation)

  15. Noise addition • Y = X + R • X: original data column, R: random noise (distribution published), Y: published data • Applications in data mining • Reconstructing column distribution • Rakesh Agrawal SIGMOD 2000 • Applied to privacy-preserving decision tree, naïve Bayes classifier • Attacks • Spectral filtering (Kargupta ICDM 2004) • PCA reconstruction (Huang SIGMOD 2005)
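A small sketch of Y = X + R shows why publishing the noise distribution still lets a miner recover column statistics. It only recovers the mean and variance (E[X] = E[Y] - E[R], Var[X] = Var[Y] - Var[R] for independent noise); the full distribution reconstruction cited on the slide is a more involved iterative procedure that is not reproduced here. Gaussian noise and the function names are assumptions.

```python
import random
import statistics


def perturb(column, noise_sigma, rng):
    # Y = X + R with R ~ N(0, noise_sigma^2); the noise distribution is published.
    return [x + rng.gauss(0.0, noise_sigma) for x in column]


def estimate_moments(published, noise_mean, noise_sigma):
    # Independence of X and R gives:
    #   E[X]   = E[Y]   - E[R]
    #   Var[X] = Var[Y] - Var[R]
    mean_x = statistics.fmean(published) - noise_mean
    var_x = statistics.pvariance(published) - noise_sigma ** 2
    return mean_x, max(var_x, 0.0)
```

Individual values in Y are off by roughly one noise standard deviation, while the aggregate estimates converge as the column grows; the attacks listed on the slide exploit the same aggregate structure to filter the noise out.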

  16. Multiplicative perturbations • Geometric data perturbation for outsourced data mining • Random Projection • RASP perturbation for query services (range query, kNN query).

  17. Perturbation-based framework • [Figure: mining service]

  18. Geometric data perturbation • Y = RX + T + D • R: secret rotation matrix (preserves Euclidean distances) • T: secret random translation matrix, D: secret random noise matrix • Distances are only approximately preserved (because of D) • Resilient to most attacks on rotation perturbation • Applications • Outsourced privacy-preserving data mining, applicable to many classification and clustering algorithms • Attacks • Population-based attacks (when the covariance matrix is revealed)
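The distance-preservation claim can be checked numerically in two dimensions. This is a sketch under stated assumptions: records are 2-D points, R is a plane rotation by a secret angle, T applies the same translation vector to every record, and D is Gaussian noise; the function names are illustrative.

```python
import math
import random


def rotation(theta):
    # 2-D rotation matrix R; rotations preserve Euclidean distances.
    return [[math.cos(theta), -math.sin(theta)],
            [math.sin(theta), math.cos(theta)]]


def perturb(points, theta, t, noise_sigma, rng):
    # y = R x + t + d for every record x, with d ~ N(0, noise_sigma^2) per coordinate.
    r = rotation(theta)
    out = []
    for x0, x1 in points:
        y0 = r[0][0] * x0 + r[0][1] * x1 + t[0] + rng.gauss(0.0, noise_sigma)
        y1 = r[1][0] * x0 + r[1][1] * x1 + t[1] + rng.gauss(0.0, noise_sigma)
        out.append((y0, y1))
    return out


def distance_change(points, perturbed, i, j):
    # Absolute change of the distance between records i and j.
    return abs(math.dist(points[i], points[j]) - math.dist(perturbed[i], perturbed[j]))
```

With `noise_sigma = 0` the change is zero up to floating-point error; with a small positive `noise_sigma` it stays on the order of the noise, which is what lets distance-based classifiers and clustering run on Y.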

  19. Random Projection • Y = AX + D • A: random projection, e.g., entries from N(0,1) • Distances are approximately preserved • Applications • Many classification and clustering algorithms • Worse accuracy than geometric perturbation • Good for sparse high-dimensional data (text data), i.e., sketch methods (A is randomly generated for EACH record) • Attacks • Possibly more resilient than the other two perturbation methods • But utility (distance) is not well preserved
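A quick way to see "approximately preserved" is to project two vectors and compare distances. The sketch assumes a single shared matrix A with N(0,1) entries scaled by 1/sqrt(m) (the scaling is an assumption, chosen so that expected squared norms are unchanged) and omits the noise term D.

```python
import math
import random


def random_projection_matrix(m, d, rng):
    # m x d matrix with entries N(0, 1) / sqrt(m).
    s = 1.0 / math.sqrt(m)
    return [[rng.gauss(0.0, 1.0) * s for _ in range(d)] for _ in range(m)]


def project(a, x):
    # y = A x
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def distance_ratio(a, x1, x2):
    # Ratio of projected distance to original distance; close to 1 is good.
    return math.dist(project(a, x1), project(a, x2)) / math.dist(x1, x2)
```

Unlike a rotation, the ratio is never exactly 1 and its spread shrinks only as m grows, which matches the slide's remark that utility is preserved less well than with geometric perturbation.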

  20. RASP perturbation • k-dimensional numeric data, n records, represented as a k x n matrix; x: a record • (1) Extend x to k+2 dimensions • The (k+1)th dimension is always 1 (homogeneous dimension) • The (k+2)th dimension v is a real random number drawn from … • (2) Encryption • A is a (k+2) x (k+2) invertible real-valued matrix, with at least two non-zero values in each row and all non-zero values in the last column • A is shared by all records

  21. Properties • Not an OPE • Preserves convexity of the dataset • Convex dataset in R^k → another convex dataset in R^(k+2) • Good for range query • Each range query in R^k → hyperplane-based query → range query in R^(k+2)

  22. RASP properties • Convexity preserving • Queried range (hypercube) is convex • RASP transforms the range to another convex set (a polyhedron) • Half space: w^T x <= a, bounded by the hyperplane w^T x = a • The intersection of convex sets is also convex

  23. Illustration of convexity preserving • [Figure: original space vs. encrypted space]

  24. Secure query transformation • A naïve solution • Based on the convexity preserving property • Problems: (1) A^-1 can be probed (2) … If a is known, the whole dimension i is breached.

  25. Secure query transformation • Enhanced solution • X_(k+2) is always positive • (X_i - a) ≤ 0 ⇔ (X_i - a) X_(k+2) ≤ 0 • Correspondingly, in the encrypted space: y^T Θ y ≤ 0 • Problems addressed: (1) A^-1 cannot be derived from Θ (2) (X_i - a) X_(k+2) ≤ 0 contains the random component X_(k+2), which protects the condition (X_i - a) ≤ 0
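The enhanced condition can be worked through numerically. The sketch below is an illustration under several assumptions, not the published RASP algorithm: a record is encrypted as y = A z with z = (x, 1, v), v is drawn uniformly from [1, 2] (the slides do not give the distribution, only that it is positive), A is made diagonally dominant so it is certainly invertible, and the query matrix is built as (A^-1)^T u w^T A^-1 with u selecting X_i - a and w selecting v. All function names are hypothetical.

```python
import random


def invert(m):
    # Gauss-Jordan inversion with partial pivoting for a small square matrix.
    n = len(m)
    aug = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def make_key(k, rng):
    # Secret (k+2) x (k+2) matrix, all entries non-zero, diagonally dominant.
    n = k + 2
    return [[rng.uniform(0.1, 1.0) + (5.0 if r == c else 0.0) for c in range(n)]
            for r in range(n)]


def encrypt(a, x, rng):
    # z = (x_1..x_k, 1, v) with v > 0, then y = A z.
    z = list(x) + [1.0, rng.uniform(1.0, 2.0)]
    return [sum(aij * zj for aij, zj in zip(row, z)) for row in a]


def query_matrix(a_inv, i, bound):
    # Encodes (x_i - bound) * v <= 0 as y^T Theta y <= 0 (i is 0-based).
    n = len(a_inv)
    u = [0.0] * n
    u[i], u[n - 2] = 1.0, -bound          # u . z = x_i - bound
    w = [0.0] * n
    w[n - 1] = 1.0                        # w . z = v
    alpha = [sum(a_inv[j][r] * u[j] for j in range(n)) for r in range(n)]
    beta = [sum(a_inv[j][r] * w[j] for j in range(n)) for r in range(n)]
    return [[alpha[r] * beta[c] for c in range(n)] for r in range(n)]


def server_test(theta, y):
    # The server evaluates the quadratic form on the ciphertext only.
    n = len(y)
    return sum(y[r] * theta[r][c] * y[c] for r in range(n) for c in range(n)) <= 0.0
```

Because v is positive, the sign of the quadratic form equals the sign of X_i - a, so the server's answer matches the plaintext condition while the value it computes is scaled by a per-record random factor.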

  26. Efficient two-stage query processing • [Figure: original space vs. transformed space; Stage 1 queries the bounding box, Stage 2 filters out the junk records] • A multidimensional tree index is built on the encrypted data (in the transformed space) on the server.

  27. Stage 1: the client calculates the large bounding box; the server uses the index to find the results. • Stage 2: filter the initial results with the conditions y^T Θ_i y ≤ 0 for i = 1…2k • Note: the two-stage strategy works if the output of stage 1 is significantly smaller than the original database and fits into memory. Otherwise, use a linear scan with stage 2 filtering.

  28. RASP-based data mining • Preserving range query → linear classifier • Use the boosting framework to get strong classifiers (PerturBoost, in ICDM 2013)

  29. Access pattern privacy • On database queries • Problem is the same as PIR • Attackers may use the access pattern to breach data confidentiality • Each of the previous approaches should handle this problem!

  30. PIR is impractical • Solutions based on private information retrieval (PIR) • PIR is still impractical

  31. For the bucketization approach • Based on the architecture of Hacigumus (SIGMOD02) • Hore VLDB04 • For range query • Privacy concern: reveals the distribution of values in each bucket • “Diffusion”: split buckets and combine parts of different buckets • Trade-off: the server now needs to return more noisy results → larger result size

  32. For OPE • Use queries to find out the distributions, then break the encryption

  33. For RASP • Secure query transformation • Attacks on transformed queries

  34. Oblivious RAM • Access pattern: read/write data items • Setting: • Client has a small secure memory • Server has large insecure storage, semi-honest • Data items are encrypted • Client cannot hide the accessed locations • An active area

  35. Existing Approaches • Inside a level • Some real blocks • Useful data • Some dummy blocks • Random data • Randomly permuted • Only the client knows the permutation

  36. Existing Approaches • Reading • Read a block from each level • One real block • Remaining are dummy blocks • [Figure: the client reads one block per level from the server; one is real, the rest are dummy]

  37. Existing Approaches • Writing • Shuffle consecutively filled levels • Write into the next unfilled level • Clear the source levels • [Figure: the client shuffles blocks; server state before and after]

  38. Continuous Shuffling … To write:

  39. The Problem with Existing Approaches

  40. Integrity guarantee • Merkle hash tree • Parent hash: H(H(x1)+H(x2)), where + is string concatenation • Can be stored with tree-like structures: index, XML
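The parent rule H(H(x1)+H(x2)) extends to a full tree with a single trusted root. The following is a minimal sketch using SHA-256; duplicating the last hash on odd-sized levels is one common convention and is an assumption here, as are the function names.

```python
import hashlib


def H(data):
    return hashlib.sha256(data).digest()


def build_levels(items):
    # levels[0] holds leaf hashes; each parent is H(left + right).
    levels = [[H(x) for x in items]]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        if len(cur) % 2:
            cur = cur + [cur[-1]]      # pad odd levels by repeating the last hash
            levels[-1] = cur           # keep the padded level so proofs can index it
        levels.append([H(cur[i] + cur[i + 1]) for i in range(0, len(cur), 2)])
    return levels


def prove(levels, index):
    # Sibling hashes from leaf to root, each flagged as left or right sibling.
    path = []
    for level in levels[:-1]:
        sibling = index ^ 1
        path.append((level[sibling], sibling < index))
        index //= 2
    return path


def verify(item, path, root):
    # Recompute the root from the item and its sibling path.
    h = H(item)
    for sibling_hash, sibling_is_left in path:
        h = H(sibling_hash + h) if sibling_is_left else H(h + sibling_hash)
    return h == root
```

The client keeps only the root (for example, signed by the data owner); the server returns each item together with its sibling path, whose length is logarithmic in the number of items.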

  41. Hash chains

  42. Query correctness with Merkle trees, by Devanbu et al.

  43. Using a Merkle tree • Example: 5<=q<=10 • LUB(q) = 4 • GLB(q) = 11

  44. Operations: • Selections, projections, equijoins, set ops • Issues • Works only on data with verification objects • Query expressiveness • Expensive • Related work • Pang et al. (ICDE04, SIGMOD05), using ElGamal function • Sion VLDB05: challenge token • F. Li SIGMOD06: freshness

  45. Secure keyword search • Simple information retrieval • For a keyword, find the documents containing the keyword • What if the documents are encrypted word by word, and the keyword is also encrypted?

  46. Secure keyword search • Song 2000 • Seed is random, different for each Wi • Key idea: Li and Ri are self-verifiable • Advantage of XOR

  47. How to set K?

  48. Setting of ki • ki = Fk'(Wi), k' is secret • User publishes W and k = Fk'(W) • Server checks Ci ⊕ W: whether <Li, Fk(Li)> == Ci ⊕ W • It reveals nothing if Ci is not the ciphertext for W • Li is random for different Wi, so the server cannot find any information from Li
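The check on this slide can be sketched with HMAC standing in for the pseudorandom function F. This is an illustration of the idea only, not the full construction from Song et al.: the 16-byte word size, the 8-byte split between Li and Fk(Li), the use of HMAC-SHA256, and the function names are all assumptions.

```python
import hashlib
import hmac
import secrets

WORD_LEN = 16  # fixed word size in bytes (assumed)
HALF = 8       # length of L_i and of the check part F_k(L_i)


def F(key, msg, n=HALF):
    # Pseudorandom function, instantiated here with truncated HMAC-SHA256.
    return hmac.new(key, msg, hashlib.sha256).digest()[:n]


def pad(word):
    raw = word.encode()
    if len(raw) > WORD_LEN:
        raise ValueError("word too long for the fixed block size")
    return raw.ljust(WORD_LEN, b"\0")


def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))


def encrypt_word(master_key, word):
    # C_i = W_i xor <L_i, F_{k_i}(L_i)> with k_i = F_{k'}(W_i) and random L_i.
    w = pad(word)
    k_i = F(master_key, w, 32)
    l_i = secrets.token_bytes(HALF)
    return xor(w, l_i + F(k_i, l_i))


def trapdoor(master_key, word):
    # The user reveals W and k = F_{k'}(W) for the searched word only.
    w = pad(word)
    return w, F(master_key, w, 32)


def server_match(cipher, w, k):
    # C_i xor W must have the self-verifying form <L, F_k(L)>.
    t = xor(cipher, w)
    return hmac.compare_digest(t[HALF:], F(k, t[:HALF]))
```

A typical use is to encrypt each word of a document with `encrypt_word`, hand the server the pair from `trapdoor`, and let it run `server_match` over the stored ciphertexts; a non-matching position passes the check only with probability about 2^-64 under these sizes.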

  49. Hidden search • In previous schemes, W is revealed • Weakness: each search has to release k for W • Easy to collect information • Solution: encrypt Wi with a private key, then XOR with <Li, Fk(Li)>
