Anonymization of Set-Valued Data via Top-Down, Local Generalization • Yeye He, Jeffrey F. Naughton • University of Wisconsin-Madison
Overview • The problem: • Anonymizing set-valued data presents challenges not seen in relational data • Previous solutions explored parts but not all of the problem space • Our goals: • Develop a scalable algorithm for the new variant of the problem • Perform experiments to explore strengths and weaknesses of the approach
What is set-valued data? • "Relational data" • One sensitive attribute per tuple • "Set-valued data" • Logically: (person-id, {item_1, item_2, …, item_n}) • A single record may contain multiple sensitive values
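A minimal sketch of this data model, assuming a simple dictionary-of-sets representation; the item names are illustrative, not taken from the paper's datasets:

```python
# A transactional (set-valued) database: each record is a person ID
# paired with an unordered set of items, any of which may be sensitive.
db = {
    "p1": {"beer", "milk", "diapers"},
    "p2": {"beer", "milk", "diapers", "pregnancy test", "diabetes medicine"},
}
```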
An attack scenario • A retailer publishes market basket data • The adversary knows Alice has bought milk, beer, and diapers • The adversary infers Alice has also bought a pregnancy test and diabetes medicine • Adversary's background knowledge: beer, milk, diapers • Alice's published transaction: beer, milk, diapers, pregnancy test, diabetes medicine
Existing work: a priori QI/SI partition • Scenarios where set elements can be partitioned a priori into quasi-identifier (QI) items and sensitive (SI) items • {beer, milk, diapers, pregnancy test, diabetes medicine} • Substantial existing work & good algorithms • [Ghinita+08] [Xu+08a] [Xu+08b] [Nergiz+07] • But what if a priori partitioning is not possible? • Individuals may have different privacy requirements • The adversary may see sensitive items and use them as QI
Existing work: no QI/SI partition • Prior work [Terrovitis+08] proposed the k^m-anonymity model • k^m-anonymity: for any transaction (data record) T and any subset of m items in T, there are at least k-1 other transactions containing the same m items
The m in k^m-anonymity [Terrovitis+08] • Attack revisited • The data is 10^3-anonymized and the adversary sees {beer, milk, diapers} • The adversary cannot tell Alice's transaction from the 9 others containing those items • Effective, assuming the adversary never sees more than m = 3 items • k^m-anonymity requires some identified m s.t. no adversary will ever see more than m items • What about the case where there is no such m? • That is the case we consider
Our model: k-anonymity for set-valued data • A transactional database D is k-anonymous if every transaction (data record) occurs at least k times • Different from k^m-anonymity [Terrovitis+08]: there is no limit on m, i.e., it holds for any m • Thus a stronger privacy model
k-anonymity subsumes k^m-anonymity [Terrovitis+08] • Every database D that satisfies k-anonymity also satisfies k^m-anonymity • There exists a database D that satisfies k^m-anonymity for all m but not k-anonymity • Example: 2^3-anonymous but not 2-anonymous: T1 = {A, B, C}, T2 = {A, B, C}, T3 = {A, B}
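A minimal sketch of both definitions, assuming a list-of-sets database; the function names are illustrative, not from the paper. It reproduces the example above: the three-transaction database passes the 2^3-anonymity check but fails plain 2-anonymity.

```python
from itertools import combinations

def is_k_anonymous(db, k):
    """k-anonymity: every transaction (as a set) occurs at least k times."""
    keys = [frozenset(t) for t in db]
    return all(keys.count(t) >= k for t in keys)

def is_km_anonymous(db, k, m):
    """k^m-anonymity: any subset of up to m items of any transaction is
    contained in at least k transactions (itself plus k-1 others)."""
    sets = [frozenset(t) for t in db]
    for t in sets:
        for size in range(1, min(m, len(t)) + 1):
            for combo in combinations(t, size):
                if sum(1 for s in sets if set(combo) <= s) < k:
                    return False
    return True

db = [{"A", "B", "C"}, {"A", "B", "C"}, {"A", "B"}]
print(is_km_anonymous(db, k=2, m=3))  # True:  the database is 2^3-anonymous
print(is_k_anonymous(db, k=2))        # False: T3 = {A, B} occurs only once
```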
Problem statement • Given a transactional database D, find a transformation D' of D s.t.: • D' satisfies k-anonymity • The transformation minimizes the information loss between D and D'
Hierarchical generalization • Generalization hierarchy: All → {Alcohol, Health care}; Alcohol → {Beer, Wine}; Health care → {Diaper, Pregnancy test} • Transaction generalization • Ti: {"Beer", "Wine", "Diaper"} → {"Alcohol", "Health care"} • Duplicates are removed after generalization
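A minimal sketch of generalizing a transaction against the hierarchy above; the parent map and helper name are my own, used only to illustrate how duplicates collapse:

```python
# Parent of each node in the example hierarchy from the slide.
parent = {
    "Beer": "Alcohol", "Wine": "Alcohol",
    "Diaper": "Health care", "Pregnancy test": "Health care",
    "Alcohol": "All", "Health care": "All",
}

def generalize(transaction, target_nodes):
    """Replace every item by its ancestor in target_nodes; because the
    result is a set, items mapping to the same ancestor collapse."""
    out = set()
    for item in transaction:
        node = item
        while node not in target_nodes and node in parent:
            node = parent[node]
        out.add(node)
    return out

print(generalize({"Beer", "Wine", "Diaper"}, {"Alcohol", "Health care"}))
# {'Alcohol', 'Health care'}
```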
Information loss metric • Normalized Certainty Penalty (NCP) [Xu+06], also used in previous work [Terrovitis+08] • Per-item NCP: the fraction of all leaf items that fall under the generalized node • Example, with 4 leaf items (Beer, Wine, Diaper, Pregnancy test): • Generalizing "Beer" to "Alcohol": 2/4 = 0.5 information loss • Generalizing "Beer" to "All": 4/4 = 1 information loss
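A minimal sketch of the per-item penalty implied by the example; the leaf counts are hard-coded for the small hierarchy above and the helper name is mine:

```python
# Number of leaf items under each node of the example hierarchy (4 leaves total).
leaf_count = {
    "Beer": 1, "Wine": 1, "Diaper": 1, "Pregnancy test": 1,
    "Alcohol": 2, "Health care": 2, "All": 4,
}
TOTAL_LEAVES = 4

def item_ncp(generalized_node):
    """Information loss of reporting an item at `generalized_node`:
    the fraction of all leaf items covered by that node."""
    return leaf_count[generalized_node] / TOTAL_LEAVES

print(item_ncp("Alcohol"))  # 0.5  ("Beer" generalized to "Alcohol")
print(item_ncp("All"))      # 1.0  ("Beer" generalized to "All")
```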
Our algorithm: Partition-based anonymization • Top-down • Generalize everything to the root representation, resulting in one initial partition • Divide and conquer • For each partition, choose a node to specialize, based on an information-gain heuristic • Recurse on the resulting sub-partitions
Generalize all data to the root • Original transactions: {a1}, {a1,a2}, {b1,b2}, {b1,b2}, {a1,a2,b2}, {a1,a2,b2}, {a1,a2,b1,b2} • Every transaction is generalized to {ALL} • One initial partition
Initial partition: specialize using ALL • ALL is specialized into its children {A, B} • The generalized transactions become {A}, {A}, {B}, {B}, {A, B}, {A, B}, {A, B} • Produces three sub-partitions: {A} (2 transactions), {B} (2 transactions), {A, B} (3 transactions)
Green partition ({A}): specialize using A • A is specialized into {a1, a2} • The two transactions become {a1} and {a1, a2}, each occurring only once • Specialization violates 2-anonymity, so it is rolled back
Blue partition ({B}): specialize using B • B is specialized into {b1, b2} • Both transactions become {b1, b2} • Specialization is ok; the leaf level is reached, so this partition stops
Red partition ({A, B}): specialize using A • A is specialized into {a1, a2} • All three transactions become {a1, a2, B} • A is chosen over B based on the max-information-gain heuristic
Red partition: specialize using B • B is specialized into {b1, b2} • The transactions become {a1, a2, b2}, {a1, a2, b2}, {a1, a2, b1, b2}; the last occurs only once • Specializing B violates 2-anonymity, so it is rolled back
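A minimal sketch of the top-down recursion just walked through, run on the same seven-transaction example. It is my simplification: specialization candidates are tried in a fixed order rather than ranked by the paper's information-gain heuristic, but the k-anonymity roll-back behaves as in the walkthrough.

```python
from collections import Counter

# Hierarchy from the walkthrough: ALL -> {A, B}, A -> {a1, a2}, B -> {b1, b2}.
children = {"ALL": ["A", "B"], "A": ["a1", "a2"], "B": ["b1", "b2"]}
parent = {c: p for p, cs in children.items() for c in cs}

def represent(transaction, allowed):
    """Generalize each item up to its ancestor in the allowed node set."""
    rep = set()
    for item in transaction:
        while item not in allowed and item in parent:
            item = parent[item]
        rep.add(item)
    return frozenset(rep)

def anonymize(partition, allowed, k):
    """Top-down: try to specialize one node; roll back if k-anonymity breaks,
    otherwise split into sub-partitions and recurse on each of them."""
    for node in sorted(allowed):
        if node not in children:                      # leaf nodes cannot be specialized
            continue
        new_allowed = (allowed - {node}) | set(children[node])
        reps = Counter(represent(t, new_allowed) for t in partition)
        if any(count < k for count in reps.values()):
            continue                                  # violates k-anonymity: roll back
        result = []
        for rep in reps:
            sub = [t for t in partition if represent(t, new_allowed) == rep]
            result.extend(anonymize(sub, new_allowed, k))
        return result
    # No further safe specialization: publish the current representation.
    return [represent(t, allowed) for t in partition]

db = [{"a1"}, {"a1", "a2"}, {"b1", "b2"}, {"b1", "b2"},
      {"a1", "a2", "b2"}, {"a1", "a2", "b2"}, {"a1", "a2", "b1", "b2"}]
for rep in anonymize(db, {"ALL"}, k=2):
    print(set(rep))   # {A} x2, {b1, b2} x2, {a1, a2, B} x3, as in the walkthrough
```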
Main advantages • Effective (less information loss), even though we impose a stronger privacy criterion • Local recoding vs. global recoding • Efficient (less execution time) • Divide and conquer vs. bottom-up (exhaustive) enumeration • Linear in the size of the input data and the number of hierarchy levels, vs. worst-case exponential in previous work
Experimental setup: market basket data • Real-world benchmark data: BMS-WebView-1, BMS-WebView-2, BMS-POS • No accompanying hierarchy data, so a synthetic hierarchy is used (as in the previous work) • We compare our partition-based algorithm (Partition) with the previous Apriori-Anonymization (AA) algorithm [Terrovitis+08]
Less information loss on market basket data • Why? Local recoding
Sensitivity analysis: consistently faster with varied parameters
Experimental setup: AOL query log • Treated from a set-valued perspective • Again, no accompanying hierarchy data • Use an alphabetical hierarchy • Use the WordNet hierarchy • Compare with earlier work [Adar07]
Reasonably efficient on AOL query log • Efficient given the size of the query log (2.2GB) • Information loss not as satisfactory as in market basket data • Words generalized to “event”, “process”, “thing”…
Conclusion • Developed a faster, better information-preserving anonymization algorithm for set-valued data with no QI/SI distinction • Performed well on market basket data, less satisfying on search log data • Open and important question: stronger privacy models • What is a good privacy model stronger than k-anonymity for set-valued data with no QI/SI distinction?