Semantic privacy protection using ontologies
Sergio Martínez, Aïda Valls, David Sánchez
iTAKA group, as part of the IF-PAD group
Data privacy protection for unbounded categorical attributes
• Values are textual: words or noun phrases.
• The set of possible values is not fixed a priori.
• Semantic interpretation of the values using ontologies.
• Example: textual answers to the question “What has been the main reason to visit Delta del Ebre?”
  • Diversion, recreation, adventure, sport, scuba diving, swimming, beach, take photos, wildlife observation, bird watching, ...
(1) Hierarchy-based anonymization of categorical data
• Existing work:
  • Based on a small, ad hoc hierarchy built as a function of the input data.
  • Exhaustive generalization methods (too expensive with real ontologies such as WordNet).
  • Values are substituted only by more general ones (generalization).
• Our method:
  • Substitution of sensitive values with the most semantically similar one, using the WordNet ontology.
  • The substitute can be a generalization, a sibling or a specialization.
  • Each substitution increases the level of k-anonymity.
  • The utility of the data, from a semantic point of view, is preserved during the anonymization.
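The substitution idea in the bullets above can be illustrated with a minimal sketch. It is not the authors' algorithm: the tiny `PARENT` taxonomy is a hypothetical stand-in for WordNet, Wu & Palmer similarity is one assumed choice of measure, and the greedy loop in `anonymize` (merge the rarest value into its most similar neighbour until every value occurs at least k times) is just one plausible reading of "each substitution increases the level of k-anonymity".

```python
from collections import Counter

# Hypothetical toy taxonomy (child -> parent); a stand-in for WordNet.
PARENT = {
    "activity": None,
    "recreation": "activity",
    "sport": "recreation",
    "swimming": "sport",
    "scuba diving": "sport",
    "observation": "activity",
    "bird watching": "observation",
}

def ancestors(concept):
    """Return the chain [concept, parent, ..., root]."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain

def wu_palmer(a, b):
    """Wu & Palmer similarity, with the root at depth 1."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    # First ancestor of b that also subsumes a: the least common subsumer.
    lcs = next(c for c in anc_b if c in anc_a)
    return 2 * len(ancestors(lcs)) / (len(anc_a) + len(anc_b))

def anonymize(values, k):
    """Greedy sketch: merge rare values into their most similar neighbour."""
    values = list(values)
    while True:
        counts = Counter(values)
        rare = [v for v, n in counts.items() if n < k]
        if not rare or len(counts) == 1:
            return values
        # Rarest value first; ties broken alphabetically for determinism.
        target = min(rare, key=lambda v: (counts[v], v))
        candidates = [v for v in counts if v != target]
        # The substitute may be a generalization, a sibling or a specialization.
        best = max(candidates, key=lambda v: (wu_palmer(target, v), v))
        values = [best if v == target else v for v in values]

records = ["swimming", "swimming", "scuba diving",
           "bird watching", "bird watching", "sport"]
masked = anonymize(records, k=2)
print(masked)
```

In this toy run the single "scuba diving" answer is replaced by "sport", its closest value in the dataset, after which every value appears at least twice. Candidates are restricted here to values already present in the data; that restriction is an assumption of the sketch.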
Evaluation
• Data mining with semantic clustering
[Diagram labels: Original data; Anonymizations (based on ontologies, based on VGHs, based on discernibility); Anonymized data; Semantic clustering; Clusters; Comparison]
(2) Record linkage of categorical data
• Existing work:
  • Semantic approach to categorical data anonymization:
    • Generalization: values are substituted by more general ones.
  • Disclosure risk estimation based on direct matching between values (MRL).
• Our method:
  • Linkage of values with the most semantically similar one, using the WordNet ontology (SRL).
  • Semantic similarity measures studied:
    • Path length
    • Wu & Palmer
    • Super-concept distance
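Two of the measures listed above, path length and Wu & Palmer, can be sketched on a small hand-made taxonomy. The `PARENT` hierarchy is hypothetical and only stands in for WordNet; depth is counted with the root at depth 1, which is a common convention but an assumption here. Super-concept distance is left out of the sketch.

```python
# Hypothetical toy taxonomy (child -> parent); a stand-in for WordNet.
PARENT = {
    "activity": None,
    "recreation": "activity",
    "sport": "recreation",
    "swimming": "sport",
    "scuba diving": "sport",
    "observation": "activity",
    "bird watching": "observation",
}

def ancestors(concept):
    """Return the chain [concept, parent, ..., root]."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain

def depth(concept):
    """Number of nodes from the concept up to the root (root has depth 1)."""
    return len(ancestors(concept))

def least_common_subsumer(a, b):
    """Deepest concept that is an ancestor of (or equal to) both a and b."""
    anc_a = set(ancestors(a))
    return next(c for c in ancestors(b) if c in anc_a)

def path_length(a, b):
    """Number of is-a edges on the shortest path between a and b."""
    lcs = least_common_subsumer(a, b)
    return (depth(a) - depth(lcs)) + (depth(b) - depth(lcs))

def wu_palmer(a, b):
    """Wu & Palmer similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    lcs = least_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(path_length("swimming", "scuba diving"))   # siblings under "sport"
print(wu_palmer("swimming", "scuba diving"))
print(path_length("swimming", "bird watching"))  # related only via the root
print(wu_palmer("swimming", "bird watching"))
```

Path length is a distance (smaller means more similar), while Wu & Palmer is a similarity in (0, 1] that also rewards concepts sharing a deep common ancestor.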
Evaluation: MRL vs SRL
• Real data with 975 records and 2 textual attributes.
• The dataset is anonymized using a generalization scheme based on VGH.
• We compared record linkage based on semantics (SRL) with record linkage based on direct matching (MRL).
[Figure label: VGH3]
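To make the MRL vs SRL contrast concrete, here is a minimal sketch on made-up data, not the 975-record dataset. It assumes that `masked[i]` is the anonymized version of `original[i]`, that a link is drawn to the original value with the highest score, and that ties are resolved uniformly at random (so a tie among n candidates counts as 1/n of a correct link). The taxonomy, the helper names and the scoring rule are all illustrative assumptions.

```python
# Hypothetical toy taxonomy (child -> parent); a stand-in for WordNet.
PARENT = {
    "activity": None,
    "recreation": "activity",
    "sport": "recreation",
    "swimming": "sport",
    "scuba diving": "sport",
    "observation": "activity",
    "bird watching": "observation",
}

def ancestors(concept):
    """Return the chain [concept, parent, ..., root]."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain

def wu_palmer(a, b):
    """Semantic score: Wu & Palmer similarity, root at depth 1."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    lcs = next(c for c in anc_b if c in anc_a)
    return 2 * len(ancestors(lcs)) / (len(anc_a) + len(anc_b))

def exact_match(a, b):
    """Matching score: 1 for identical values, 0 otherwise."""
    return 1.0 if a == b else 0.0

def expected_correct_links(original, masked, similarity):
    """Expected number of masked records linked back to their own original."""
    total = 0.0
    for i, value in enumerate(masked):
        scores = [similarity(value, o) for o in original]
        best = max(scores)
        if best <= 0:
            continue  # no candidate at all: nothing is linked
        ties = [j for j, s in enumerate(scores) if s == best]
        if i in ties:
            total += 1 / len(ties)
    return total

original = ["swimming", "scuba diving", "bird watching"]
masked = ["sport", "sport", "observation"]  # generalized values

mrl = expected_correct_links(original, masked, exact_match)
srl = expected_correct_links(original, masked, wu_palmer)
print(mrl, srl)
```

Because generalization replaces every value with a different label, direct matching finds no links in this toy case, whereas the semantic score still points each generalized value back toward its likely origin. This illustrates why matching alone may underestimate disclosure risk for generalized textual data.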
Future work
• Combine several ontologies as background knowledge, so that the knowledge modelled in each of them complements the others.
• Propose other anonymization methods for textual attributes (noise addition, micro-aggregation, ...).
• Team (ITAKA group at URV):
  • Aïda Valls (aida.valls@urv.cat)
  • David Sánchez
  • Sergio Martínez
Publications
• Conferences:
  • IPMU 2010. Dortmund, Germany, June 2010
    • Anonymizing Categorical Data with a Recoding Method based on Semantic Similarity. Sergio Martínez, Aida Valls, David Sánchez
  • MDAI 2010. Perpignan, France, October 2010
    • Ontology-based anonymization of categorical values. Sergio Martínez, Aida Valls, David Sánchez
  • CCIA 2010. L’Espluga de Francolí, Spain, October 2010
    • The role of ontologies in the anonymization of textual variables. Sergio Martínez, David Sánchez, Aida Valls, Montserrat Batet
• Journals:
  • Special issue of “Information Fusion”, Elsevier. (DOI: 10.1016/j.inffus.2011.03.004)
    • Privacy protection of textual attributes through a semantic-based masking method. Sergio Martínez, Aida Valls, David Sánchez, Montserrat Batet
  • International Journal of Innovative Computing, Information and Control. (Submitted)
    • Towards the evaluation of the disclosure risk of masking methods dealing with textual attributes. Sergio Martínez, Aida Valls, David Sánchez