Anonymizing Tables for Privacy Protection

Gagan Aggarwal, Tomás Feder, Krishnaram Kenthapadi, Rajeev Motwani, Rina Panigrahy, Dilys Thomas, An Zhu

An example: Medical Records

Medical Records: De-identify & Release

Medical Records: De-identification is not sufficient! [Swe02, SS98]
• Quasi-identifiers, matched against a public database, can uniquely identify you!
• Remedy: reveal less information (the k-anonymity model)

k-anonymity – Problem Definition
• Input: A database of n rows, each with m attributes drawn from a finite alphabet.
• Goal: Suppress some entries in the table so that each modified row becomes identical to at least k-1 other rows.
• The more entries are suppressed, the lower the utility of the modified table.
• Objective: Minimize the number of suppressed entries.
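The definition above can be made concrete with a small sketch. It assumes the rows have already been grouped (how to choose the groups is the hard part, and is not shown here), suppresses every column on which a group disagrees, and counts the suppressed entries. The names `suppress_groups`, `is_k_anonymous`, and `STAR`, and the toy table, are illustrative assumptions and not part of the original slides.

```python
from collections import Counter

STAR = "*"  # marker for a suppressed entry (illustrative choice)


def suppress_groups(rows, groups):
    """Suppress every column on which the rows of a group disagree.

    rows   -- list of equal-length tuples
    groups -- lists of row indices that partition the table
    Returns (anonymized_rows, number_of_suppressed_entries).
    """
    out = [None] * len(rows)
    cost = 0
    for group in groups:
        m = len(rows[group[0]])
        # A column survives only if all rows in the group agree on it.
        keep = [len({rows[i][j] for i in group}) == 1 for j in range(m)]
        for i in group:
            out[i] = tuple(v if keep[j] else STAR for j, v in enumerate(rows[i]))
        cost += len(group) * keep.count(False)
    return out, cost


def is_k_anonymous(rows, k):
    """True if every row is identical to at least k-1 other rows."""
    return all(count >= k for count in Counter(rows).values())


# Toy table with made-up values (not the medical-records example).
rows = [("a", "x", "1"), ("a", "y", "1"), ("b", "z", "2"), ("b", "z", "3")]
anonymized, cost = suppress_groups(rows, [[0, 1], [2, 3]])
print(anonymized, cost)
```

With this grouping, one column is suppressed in each pair of rows, so the cost is 4 and the result is 2-anonymous.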

Medical Records: 2-anonymized table
• Suppress entries
• Cost = 10

k-anonymity – Results
• [MW04]:
  • NP-hardness for a linear-size alphabet
  • O(k log k)-approximation algorithm
• This work:
  • NP-hardness (even for a ternary alphabet)
  • O(k)-approximation for k-anonymity
  • 1.5-approximation for 2-anonymity
  • 2-approximation for 3-anonymity

O(k)-approximation algorithm (for k=3)
• Create a complete graph such that:
  • Each row vector in the table is a vertex.
  • The weight of an edge is the number of attributes on which the two rows differ (Hamming distance).
[Figure: example complete graph with Hamming-distance edge weights]
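A minimal sketch of this graph construction, assuming rows are equal-length tuples. The function names and the three-row table are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def hamming(u, v):
    """Number of attributes on which rows u and v differ."""
    return sum(a != b for a, b in zip(u, v))


def complete_graph(rows):
    """Edge weights of the complete graph, keyed by (i, j) with i < j."""
    return {
        (i, j): hamming(rows[i], rows[j])
        for i, j in combinations(range(len(rows)), 2)
    }


# Made-up rows over a binary alphabet.
rows = [(0, 0, 0, 0), (0, 0, 0, 1), (1, 1, 0, 0)]
graph = complete_graph(rows)
print(graph)  # {(0, 1): 1, (0, 2): 2, (1, 2): 3}
```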

O(k)-approximation algorithm (for k=3)
• We create a forest as follows:
  • Each node picks its nearest neighbor and connects to it.
  • If the resulting graph has a component with only two nodes, connect this component to the second-nearest neighbor of one of the two nodes.
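The two steps above can be sketched as follows for k=3. This is an illustration under stated assumptions, not the authors' implementation: ties are broken by the lower row index, an edge is skipped if it would close a cycle, and a two-node component is attached through its cheapest outgoing edge (the slides do not say which of the two nodes to use). The table must have at least three rows; all names and the example rows are hypothetical.

```python
def hamming(u, v):
    """Number of attributes on which rows u and v differ."""
    return sum(a != b for a, b in zip(u, v))


def build_forest(rows):
    """Return (edges, components) of the nearest-neighbor forest for k=3."""
    n = len(rows)
    parent = list(range(n))  # union-find structure over row indices

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def members(x):
        root = find(x)
        return [y for y in range(n) if find(y) == root]

    edges = []

    def connect(u, v):
        ru, rv = find(u), find(v)
        if ru != rv:  # skip edges that would close a cycle
            parent[ru] = rv
            edges.append((u, v, hamming(rows[u], rows[v])))

    # Step 1: every node connects to its nearest neighbor.
    for u in range(n):
        v = min((w for w in range(n) if w != u),
                key=lambda w: (hamming(rows[u], rows[w]), w))
        connect(u, v)

    # Step 2: attach each two-node component to a node outside it.
    # Assumption: use the cheaper of the two nodes' outgoing edges.
    for u in range(n):
        comp = members(u)
        if len(comp) == 2:
            _, x, y = min((hamming(rows[x], rows[y]), x, y)
                          for x in comp for y in range(n) if y not in comp)
            connect(x, y)

    components = sorted({tuple(members(u)) for u in range(n)})
    return edges, components


# Made-up rows: rows 0 and 1 form a pair, rows 2 to 4 form a chain.
rows = [
    (0, 0, 0, 0, 0, 0),
    (0, 0, 0, 0, 0, 1),
    (1, 1, 1, 0, 0, 0),
    (1, 1, 1, 1, 0, 0),
    (1, 1, 1, 1, 1, 0),
]
edges, components = build_forest(rows)
print(edges)
print(components)
```

In this example the pair {0, 1} is a two-node component after step 1, so step 2 attaches it to row 2 with an edge of weight 3.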

An example graph
[Figure: weighted example graph; legend: nearest-neighbor edges, other edges]

The forest obtained
[Figure: the forest obtained from the example graph]

O(k)-approximation algorithm (for k=3)
• The forest has:
  • Components of size at least 3.
  • Total edge cost no more than the cost of the optimal solution.
• In the optimal solution, each node has at least as many *s (suppressed entries) as its Hamming distance to its second-nearest neighbor.
• Each node has at most as many *s as the cost of the tree containing the node.
• If any component has size greater than 5, break it into components of size at least 3 (resp. k).
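The lower bound on the optimal solution can be checked numerically. In any k-anonymous solution a row shares a group with at least k-1 other rows, so it needs at least as many *s as its Hamming distance to its (k-1)-th nearest neighbor (the second-nearest for k=3). The sketch below sums that quantity over all rows; the function name `knn_lower_bound` and the example rows are hypothetical.

```python
def hamming(u, v):
    """Number of attributes on which rows u and v differ."""
    return sum(a != b for a, b in zip(u, v))


def knn_lower_bound(rows, k):
    """Sum over rows of the distance to the (k-1)-th nearest neighbor.

    This is a lower bound on the number of suppressed entries in any
    k-anonymous version of the table (requires len(rows) >= k >= 2).
    """
    total = 0
    for i, u in enumerate(rows):
        dists = sorted(hamming(u, v) for j, v in enumerate(rows) if j != i)
        total += dists[k - 2]  # (k-1)-th nearest neighbor, 0-indexed
    return total


# Made-up rows over a binary alphabet.
rows = [
    (0, 0, 0, 0, 0, 0),
    (0, 0, 0, 0, 0, 1),
    (1, 1, 1, 0, 0, 0),
    (1, 1, 1, 1, 0, 0),
    (1, 1, 1, 1, 1, 0),
]
print(knn_lower_bound(rows, 3))  # 3 + 4 + 2 + 1 + 2 = 12
```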

The final partition
[Figure: final partition of the example graph]

Analysis of the algorithm
• Cluster the row vectors according to this partition.
• Cost incurred ≤ OPT × (size of the largest component of the partition) = 5 × OPT.
• For general k, the cost of this solution is within a factor of max{3k-5, 2k-1} of the cost of the optimal solution.
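The bound can be spelled out as a short chain of inequalities. The notation is introduced here for illustration and is not from the slides: C ranges over the final components, T_C is the part of the forest spanning C (assuming each final component is spanned by a subtree of the forest), and w(T_C) is its total edge weight.

```latex
\begin{align*}
\text{cost}
  &\le \sum_{C} |C| \cdot w(T_C)
   && \text{each row in } C \text{ gets at most } w(T_C) \text{ stars} \\
  &\le \Big(\max_{C} |C|\Big) \sum_{C} w(T_C) \\
  &\le \Big(\max_{C} |C|\Big) \cdot \mathrm{OPT}
   && \text{total forest cost} \le \mathrm{OPT} \\
  &\le 5 \cdot \mathrm{OPT}
   && \text{for } k = 3:\ \max\{3k-5,\ 2k-1\} = \max\{4,\ 5\} = 5
\end{align*}
```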

Better than O(k)-approximation?
• Not possible using only the graph representation:
  • The graph loses information about the structure of the problem.
  • There exist two instances with:
    • The same underlying graph
    • k-anonymity costs differing by a factor of Ω(k)

Open problems
• Lower bounds on the approximation factor (without assuming the graph representation)
• Extend the k-anonymity model to account for changes in the database:
  • Handle inserts, deletes, and updates
