Towards Achieving Anonymity


Presentation Transcript


  1. Towards Achieving Anonymity An Zhu

  2. Introduction • Collect and analyze personal data • Infer trends and patterns • Making the personal data “public” • Joining multiple sources • Third party involvement • Privacy concerns • Q: How to share such data?

  3. Example: Medical Records

  4. De-identified Records

  5. Not Sufficient! [Sweeney '00] Joining the de-identified records with a public database turns the quasi-identifiers into unique identifiers.

  7. Anonymize the Quasi-Identifiers! Once the quasi-identifiers are anonymized, joining with the public database no longer singles out individuals.

  8. Q: How to share such data? • Anonymize the quasi-identifiers • Suppress information • Privacy guarantee: anonymity • Quality: the amount of suppressed information • Clustering • Privacy guarantee: cluster size • Quality: various clustering measures

  10. k-anonymized Table [Samarati '01]

  11. k-anonymized Table [Samarati '01] Each row is identical to at least k-1 other rows

  12. Definition: k-anonymity • Input: a table consisting of n rows, each with m attributes (quasi-identifiers) • Output: suppress some entries such that each row is identical to at least k-1 other rows • Objective: minimize the number of suppressed entries
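
To make the definition concrete, here is a minimal Python sketch (the "*" suppression marker and the function names are illustrative, not from the talk) that checks the k-anonymity guarantee and computes the suppression objective for a table.

```python
from collections import Counter

STAR = "*"  # assumed marker for a suppressed entry

def suppression_cost(table):
    """Objective: total number of suppressed entries."""
    return sum(1 for row in table for entry in row if entry == STAR)

def is_k_anonymous(table, k):
    """Privacy guarantee: every row is identical (after suppression)
    to at least k-1 other rows, i.e. appears at least k times."""
    counts = Counter(tuple(row) for row in table)
    return all(c >= k for c in counts.values())

# Toy table: 4 rows, 3 quasi-identifier attributes, k = 2.
suppressed = [
    ["02139", STAR, "M"],
    ["02139", STAR, "M"],
    ["94142", "1970", STAR],
    ["94142", "1970", STAR],
]
print(is_k_anonymous(suppressed, k=2))  # True
print(suppression_cost(suppressed))     # 4 suppressed entries
```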

  13. Past Work and New Results • [MW '04] • NP-hardness for a large alphabet • O(k log k)-approximation • [AFKMPTZ '05] • NP-hardness even for a ternary alphabet • O(k)-approximation • 1.5-approximation for 2-anonymity • 2-approximation for 3-anonymity

  15. Graph Representation [figure: complete graph on rows A-F with edge weights] W(e) = Hamming distance between the two rows
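
A small sketch of the graph representation on this slide, assuming rows are given as equal-length attribute lists; the helper names are illustrative.

```python
from itertools import combinations

def hamming(r1, r2):
    """W(e): number of attributes on which the two rows differ."""
    return sum(a != b for a, b in zip(r1, r2))

def build_graph(rows):
    """Complete weighted graph on the rows, keyed by index pairs (i, j), i < j."""
    return {(i, j): hamming(rows[i], rows[j])
            for i, j in combinations(range(len(rows)), 2)}

rows = [list("0110"), list("0111"), list("1001"), list("1011")]
print(build_graph(rows))   # e.g. {(0, 1): 1, (0, 2): 4, ...}
```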

  16. Edge Selection I [figure: each row's selected edge on the example graph, k = 3] Each node selects its lightest-weight edge

  17. Edge Selection II [figure: extra edges added on the example graph, k = 3] For components with < k vertices, add more edges
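
A rough sketch of the two edge-selection steps (slides 16-17). The rule used here for "add more edges" — attach each too-small component by its cheapest outgoing edge — is one plausible reading of the slide, not necessarily the paper's exact rule.

```python
def lightest_edge_forest(graph, n, k):
    """Step 1: every vertex selects its lightest incident edge.
    Step 2: any component with fewer than k vertices is attached to the
    rest by its cheapest outgoing edge (an assumed reading of the slide)."""
    def w(a, b):
        return graph[(min(a, b), max(a, b))]

    chosen = set()
    for v in range(n):
        u = min((u for u in range(n) if u != v), key=lambda u: w(u, v))
        chosen.add((min(u, v), max(u, v)))

    # Union-find to track components of the chosen edges.
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b in chosen:
        parent[find(a)] = find(b)

    while True:
        comps = {}
        for v in range(n):
            comps.setdefault(find(v), []).append(v)
        small = [c for c in comps.values() if len(c) < k]
        if not small:
            return chosen
        a, b = min(((a, b) for a in small[0] for b in range(n)
                    if find(a) != find(b)), key=lambda e: w(*e))
        chosen.add((min(a, b), max(a, b)))
        parent[find(a)] = find(b)

# Example graph on 4 rows (keyed by (i, j), i < j), k = 2.
g = {(0, 1): 1, (0, 2): 4, (0, 3): 3, (1, 2): 3, (1, 3): 2, (2, 3): 1}
print(lightest_edge_forest(g, n=4, k=2))   # {(0, 1), (2, 3)}
```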

  18. Lemma • Total weight of the selected edges is no more than OPT • In the optimal solution, each vertex pays at least the weight of its (k-1)st lightest incident edge • Forest: at most one selected edge per vertex • By construction, each vertex's selected edge weighs no more than its (k-1)st lightest incident edge

  19. Grouping • Ideally, each connected component forms a group • Anonymize vertices within a group • Total cost of a group: (total edge weights) × (number of nodes) • Example: (2+2+3+3) × 6 [figure: the six-node component from the previous slides] • Small groups: size O(k)
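
The group cost on this slide can be checked directly: every attribute on which the group's rows disagree must be suppressed in every row, and the slide's bound replaces that count by the total weight of the group's tree edges. A small illustrative sketch (names are mine):

```python
def group_cost(rows):
    """Exact cost of anonymizing one group: every attribute on which the
    rows disagree is suppressed in all of them."""
    disagreeing = sum(len(set(col)) > 1 for col in zip(*rows))
    return disagreeing * len(rows)

def group_cost_bound(tree_edge_weights, group_size):
    """The slide's upper bound: (total edge weights) x (number of nodes).
    The disagreeing attributes lie in the union of the differing positions
    along the group's tree edges, so their count is at most the weight sum."""
    return sum(tree_edge_weights) * group_size

group = [list("0110"), list("0111"), list("1110")]
print(group_cost(group))            # 2 disagreeing attributes x 3 rows = 6
print(group_cost_bound([1, 1], 3))  # (1 + 1) x 3 = 6
```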

  20. Dividing a Component • Root the tree arbitrarily • Divide if a sub-tree and the rest both have ≥ k vertices • Aim: every remaining sub-tree has < k vertices [figure: a rooted tree with sub-trees of size < k and parts of size ≥ k]

  21. Dividing a Component • Root the tree arbitrarily • Divide if a sub-tree and the rest both have ≥ k vertices • Rotate (re-root) the tree if necessary [figure: the same tree after re-rooting]

  22. Dividing a Component • Root the tree arbitrarily • Divide if a sub-tree and the rest both have ≥ k vertices • Terminating condition: component size ≤ max(2k-1, 3k-5) [figure: final components, each remaining sub-tree of size < k]
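
A sketch of the cutting rule from slides 20-22: cut off a sub-tree whenever both it and the remainder keep at least k vertices. The re-rooting step and the max(2k-1, 3k-5) size guarantee are not reproduced here; `children` and the function name are illustrative.

```python
def split_component(children, root, k):
    """Cut off a sub-tree whenever both it and the remainder keep >= k
    vertices; `children[v]` lists v's children in the rooted tree.
    Returns the vertex groups (sketch only: no re-rooting, and the
    max(2k-1, 3k-5) size guarantee is not enforced here)."""
    groups = []

    def size(v):
        return 1 + sum(size(c) for c in children[v])

    def collect(v):
        return [v] + [u for c in children[v] for u in collect(c)]

    def walk(v, total):
        for c in list(children[v]):
            s = size(c)
            if s >= k and total - s >= k:
                children[v].remove(c)
                walk(c, s)                  # split the cut-off part further
                groups.append(collect(c))
                walk(v, total - s)
                return
        for c in children[v]:
            walk(c, total)

    walk(root, size(root))
    groups.append(collect(root))
    return groups

# Toy example: the path 0-1-2-3-4-5 rooted at 0, with k = 3.
children = {0: [1], 1: [2], 2: [3], 3: [4], 4: [5], 5: []}
print(split_component(children, root=0, k=3))   # [[3, 4, 5], [0, 1, 2]]
```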

  23. An Example [figure: the six-node example graph with its selected edges]

  24. An Example [figure: the same component drawn as a tree rooted at C]

  25. An Example [figure: the tree split into two groups of three] Estimated cost: 4×3 + 3×3 = 21. Optimal cost: 3×3 + 3×3 = 18.

  26. Past Work and New Results • [MW '04] • NP-hardness for a large alphabet • O(k log k)-approximation • [AFKMPTZ '05] • NP-hardness even for a ternary alphabet • O(k)-approximation • 1.5-approximation for 2-anonymity • 2-approximation for 3-anonymity

  27. 1.5-approximation [figure: complete graph on rows A-F with edge weights] W(e) = Hamming distance between the two rows

  28. Minimum {1,2}-matching [figure: the matching edges selected on the example graph] Each vertex is matched to 1 or 2 other vertices
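
A small sketch that checks the defining property of a {1,2}-matching and lists its components (the next slide reasons about component sizes). Computing the minimum-cost {1,2}-matching itself needs a matching algorithm that is not shown here; the names are illustrative.

```python
from collections import defaultdict

def check_12_matching(n, edges):
    """Check the {1,2}-matching property (every vertex has degree 1 or 2)
    and return the connected components, whose sizes the approximation
    argument is about."""
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    assert all(1 <= len(adj[v]) <= 2 for v in range(n)), "degree must be 1 or 2"

    seen, components = set(), []
    for v in range(n):
        if v in seen:
            continue
        comp, stack = [], [v]
        while stack:
            u = stack.pop()
            if u not in seen:
                seen.add(u)
                comp.append(u)
                stack.extend(adj[u])
        components.append(comp)
    return components

# Five rows matched as one pair and one triple (a two-edge path).
print(check_12_matching(5, [(0, 1), (2, 3), (3, 4)]))   # [[0, 1], [2, 3, 4]]
```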

  29. Properties • Each component has ≤ 3 nodes • A component with > 3 nodes is either not possible (every degree is ≤ 2) or not optimal

  30. Qualities • Cost ≤ 2·OPT • For a binary alphabet: ≤ 1.5·OPT [figure: a matched pair with edge weight a; a matched triple with matched edges p, q and unmatched edge r ≥ p, q] • Pair: OPT pays 2a, we pay 2a • Triple: OPT pays p+q+r, we pay 3(p+q) ≤ 2(p+q+r), since r ≥ p, q implies p+q ≤ 2r

  31. Past Work and New Results • [MW '04] • NP-hardness for a large alphabet • O(k log k)-approximation • [AFKMPTZ '05] • NP-hardness even for a ternary alphabet • O(k)-approximation • 1.5-approximation for 2-anonymity • 2-approximation for 3-anonymity

  32. Open Problems • Can we improve O(k)? • Ω(k) lower bound for the graph representation

  33. Open Problems • Can we improve O(k)? • Ω(k) lower bound for the graph representation
1111111100000000000000000000000000000000
0000000011111111000000000000000000000000
0000000000000000111111110000000000000000
0000000000000000000000001111111100000000
0000000000000000000000000000000011111111
k = 5, d = 16, c = k × d / 2

  35. Open Problems • Can we improve O(k)? • Ω(k) lower bound for the graph representation
10101010101010101010101010101010
11001100110011001100110011001100
11110000111100001111000011110000
11111111000000001111111100000000
11111111111111110000000000000000
k = 5, d = 16, c = 2 × d

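The two families above can be generated and compared mechanically: all pairwise Hamming distances come out identical in both families, so the pairwise-distance graph alone cannot tell them apart, even though the costs quoted on the slides (k × d / 2 versus 2 × d) differ. A hypothetical generator (parameter choices and names are mine):

```python
from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def family_disjoint_blocks(k, block=8):
    """First family: row i has its own block of `block` ones, zeros elsewhere
    (k = 5, block = 8 gives the length-40 rows shown above)."""
    return ["0" * (block * i) + "1" * block + "0" * (block * (k - 1 - i))
            for i in range(k)]

def family_bit_patterns(k):
    """Second family: row i alternates runs of 2**i ones and zeros over
    length 2**k (k = 5 gives the length-32 rows shown above)."""
    length = 2 ** k
    return ["".join("1" if (pos // (2 ** i)) % 2 == 0 else "0"
                    for pos in range(length))
            for i in range(k)]

for rows in (family_disjoint_blocks(5), family_bit_patterns(5)):
    dists = {hamming(a, b) for a, b in combinations(rows, 2)}
    print(dists)   # {16} for both families: their distance graphs coincide
```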

  37. Q: How to share such data? • Anonymize the quasi-identifiers • Suppress information • Privacy guarantee: anonymity • Quality: the amount of suppressed information • Clustering • Privacy guarantee: cluster size • Quality: various clustering measures

  38. Clustering Approach [AFKKPTZ '06]

  39. Transfer into a Metric…

  40. Clusters and Centers

  42. Measure • How good are the clusters? • “Tight” clusters are better • Minimize max radius: Gather-k • Minimize max distortion error: Cellular-k • Σ (radius × num_nodes) • Cost of the example: Gather-k: 10, Cellular-k: 624

  43. Measure • How good are the clusters? • “Tight” clusters are better • Minimize max radius: Gather-k • Minimize max distortion error: Cellular-k • Σ (radius × num_nodes) • Handle outliers • Constant approximations!
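
A minimal sketch of the two objectives for a given set of clusters (each given as a center plus its members); the names and data layout are illustrative.

```python
def radius(center, members, dist):
    """Radius of a cluster: distance from its center to the farthest member."""
    return max(dist(center, p) for p in members)

def gather_cost(clusters, dist):
    """Gather-k objective: maximum cluster radius."""
    return max(radius(c, members, dist) for c, members in clusters)

def cellular_cost(clusters, dist):
    """Cellular-k objective: sum of radius x num_nodes over the clusters."""
    return sum(radius(c, members, dist) * len(members) for c, members in clusters)

# Toy example: points on a line with two clusters, L1 distance.
dist = lambda a, b: abs(a - b)
clusters = [(0, [0, 1, 3]), (10, [8, 10, 12, 13])]
print(gather_cost(clusters, dist))    # max(3, 3) = 3
print(cellular_cost(clusters, dist))  # 3*3 + 3*4 = 21
```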

  44. Comparison • k = 5 • 5-anonymity • Suppress all entries • More distortion • Clustering • Can pick R5 as the center • Less distortion • Distortion is directly related to the pair-wise distances

  45. Results [AFKKPTZ '06] • Gather-k • Tight 2-approximation • Extension to outliers: 4-approximation • Cellular-k • Primal-dual constant-factor approximation • Extensions as well

  47. 2-approximation • Assume an optimal value R • Make sure each node has at least k – 1 neighbors within distance 2R [figure: a node A with balls of radius R and 2R]

  48. 2-approximation • Assume an optimal value R • Make sure each node has at least k – 1 neighbors within distance 2R. • Pick an arbitrary node as a center and remove all remaining nodes within distance 2R. Repeat until all nodes are gone. • Make sure we can reassign nodes to the selected centers.
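
A sketch of the selection step on slides 47-48, assuming a guessed radius R and a distance function; the final reassignment that balances cluster sizes is only hinted at on the slide and is not reproduced here.

```python
def greedy_centers(points, dist, R, k):
    """Selection step of the 2-approximation: with a guessed optimal radius R,
    every point must have at least k-1 others within 2R (otherwise R is too
    small); then repeatedly take an arbitrary remaining point as a center
    and remove everything within distance 2R of it."""
    for i, p in enumerate(points):
        near = sum(1 for j, q in enumerate(points)
                   if j != i and dist(p, q) <= 2 * R)
        if near < k - 1:
            return None   # the guessed R is infeasible

    centers, remaining = [], list(points)
    while remaining:
        c = remaining[0]                     # arbitrary choice
        centers.append(c)
        remaining = [q for q in remaining if dist(c, q) > 2 * R]
    return centers

# Toy example: points on a line, k = 3, guessed R = 1.
pts = [0, 1, 2, 10, 11, 12]
dist = lambda a, b: abs(a - b)
print(greedy_centers(pts, dist, R=1, k=3))   # [0, 10]
```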

  49. Example: k = 5

  50. Optimal Solution [figure: the optimal clusters with radius R]
