ESTEEM: Quality- and Privacy-Aware Data Integration
Monica Scannapieco, Carola Aiello, Tiziana Catarci, Diego Milano
Dipartimento di Informatica e Sistemistica, Università di Roma “La Sapienza”
Outline
• Privacy-aware integration
  • Privacy risk assessment
  • Private record linkage
• Quality-aware integration
  • Flexible and fully automatic record linkage
• Summary
Monica Scannapieco, ESTEEM Meeting, 7 May 2007, Ischia
Linkage of Anonymous Data
[Figure: tables T1 and T2 linked through a QUASI-IDENTIFIER]
Our Proposal
• A framework for assessing privacy risk that takes into account both facets of privacy
  • based on statistical decision theory
• Definition and analysis of:
  • disclosure policies modelled by disclosure rules
  • several privacy risk functions
• Estimated risk as an upper bound of the true risk, with the related complexity analysis
• Algorithm for finding the disclosure rule that minimizes the privacy risk
The Formal Framework
• Disclosure rule δ
• Risk R(δ, θ) = f(l(δ, θ))
  • identification
  • sensitivity
• Loss function l(δ, θ), with θ representing the attacker’s knowledge
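The idea of choosing the disclosure rule with the lowest risk can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not the loss or risk functions of the talk: a disclosure rule is taken to be the set of attributes released, the loss of a record is the summed sensitivity weight of the released attributes divided by the size of the group of records sharing the same released values, and the risk is the mean loss over the table. The names `risk`, `best_rule`, and the sample weights are hypothetical.

```python
from collections import Counter
from itertools import combinations

def risk(table, disclosed, sensitivity):
    """Toy risk: mean over records of (sensitivity of disclosed attributes / group size)."""
    groups = Counter(tuple(r[a] for a in disclosed) for r in table)
    sens = sum(sensitivity[a] for a in disclosed)
    losses = [sens / groups[tuple(r[a] for a in disclosed)] for r in table]
    return sum(losses) / len(losses)

def best_rule(table, attributes, sensitivity, size):
    """Exhaustively pick the rule (set of `size` attributes) with the lowest toy risk."""
    return min(combinations(attributes, size),
               key=lambda d: risk(table, d, sensitivity))

# Hypothetical data and sensitivity weights (assumptions, not from the talk).
table = [
    {"zip": "00185", "birth_year": 1970, "diagnosis": "flu"},
    {"zip": "00185", "birth_year": 1970, "diagnosis": "HIV"},
    {"zip": "00186", "birth_year": 1982, "diagnosis": "flu"},
    {"zip": "00186", "birth_year": 1982, "diagnosis": "flu"},
]
sensitivity = {"zip": 1, "birth_year": 1, "diagnosis": 5}

rule = best_rule(table, ["zip", "birth_year", "diagnosis"], sensitivity, size=2)
```

Because the loss depends on which attributes are released, rules that disclose the diagnosis are penalised more heavily than rules that disclose only demographic attributes.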
K-anonymity
• K-anonymity is simply a special case of our framework in which:
  • θtrue = relation T: a stricter assumption on the attacker’s knowledge. We proved that, under some assumptions, the true risk can be bounded by our “more general” risk
  • the loss is a constant, which is questionable: it does not depend on the type of disclosed attributes (an HIV result incurs the same loss as a last doctor visit)
  • the set of disclosure rules is underspecified: it can be specified in several ways
• Our framework highlights some questionable hypotheses of k-anonymity!
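For reference, the standard k-anonymity condition requires every combination of quasi-identifier values to occur in at least k rows. A minimal sketch of that check follows; the function name `k_of` and the sample table are hypothetical.

```python
from collections import Counter

def k_of(table, quasi_identifier):
    """Largest k such that `table` is k-anonymous with respect to `quasi_identifier`."""
    groups = Counter(tuple(row[a] for a in quasi_identifier) for row in table)
    return min(groups.values()) if groups else 0

# Hypothetical sample data.
table = [
    {"zip": "00185", "birth_year": 1970, "diagnosis": "flu"},
    {"zip": "00185", "birth_year": 1970, "diagnosis": "HIV"},
    {"zip": "00186", "birth_year": 1982, "diagnosis": "asthma"},
    {"zip": "00186", "birth_year": 1982, "diagnosis": "flu"},
    {"zip": "00186", "birth_year": 1982, "diagnosis": "flu"},
]

k = k_of(table, ["zip", "birth_year"])
```

The check only counts group sizes, so it treats a released HIV result exactly like any other attribute, which is the constant-loss assumption questioned on this slide.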
Private Record Linkage
• Let P and Q be two peers owning the relations RP(A1,…,An) and RQ(B1,…,Bn), respectively. The privacy-preserving record matching problem is to perform record matching between RP and RQ such that, at the end of the process:
  • P knows only a set PMatch, consisting of the records in RP that match records in RQ; similarly, Q knows only the set QMatch
  • no information is revealed to P and Q concerning records that do not match each other
• Published at SIGMOD 07
Key Ideas and Solutions (1)
• We cannot just encrypt the data and then compute distances among the encrypted values
  • by definition, encryption functions do not preserve distances
• Let’s work on numbers instead of records!
• Records are mapped into a vector space, and record matching is performed in that space
Key Ideas and Solutions (2)
• Third-party based protocol in which:
  • the two parties jointly build the embedding space using a method (SparseMap) with “secure” features
  • each of the two parties embeds its own dataset and sends the embedded data to the third party
  • the third party W performs the intersection and sends the result back to the parties
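The embed-then-match idea can be sketched as follows. This is a simplified stand-in and not SparseMap itself: each coordinate of a record's vector is its minimum edit distance to a reference set of strings (a basic Lipschitz-style embedding), and the third party compares only vectors. The reference sets are hard-coded here for illustration, whereas in the protocol the parties build the embedding space together; `embed`, `third_party_match`, and the threshold are hypothetical names and values.

```python
import math

def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def embed(record, reference_sets):
    """One coordinate per reference set: the minimum distance to any string in it."""
    return [min(edit_distance(record, r) for r in refs) for refs in reference_sets]

def third_party_match(vectors_p, vectors_q, threshold):
    """Third party W sees only vectors and returns index pairs that are close enough."""
    return [(i, j)
            for i, vp in enumerate(vectors_p)
            for j, vq in enumerate(vectors_q)
            if math.dist(vp, vq) <= threshold]

# Reference sets agreed between P and Q (hard-coded assumption).
reference_sets = [["smith"], ["rossi", "bianchi"], ["maria"]]

records_p = ["john smith", "mario rossi"]
records_q = ["jon smith", "anna verdi"]

vectors_p = [embed(r, reference_sets) for r in records_p]
vectors_q = [embed(r, reference_sets) for r in records_q]
matches = third_party_match(vectors_p, vectors_q, threshold=2.0)
```

Each coordinate changes by at most the edit distance between two records, so records that differ by a single typo end up within the threshold, while W never sees the original strings.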
Key Ideas and Solutions (3)
• Th1: Given the two relations RP(D1,…,Ds) and RQ(D1,…,Dx), the set of matching records RecMatch, and the database size DBSize, the record matching protocol finds the matched records between the two relations with the following assurances:
  • RecMatch is not disclosed to W
  • RP - RecMatch is not disclosed to Q
  • RQ - RecMatch is not disclosed to P
  • DBSize is disclosed to W and bounded by P and Q
Schema Matching Features
• Th2: Given the schemas RP and RQ, owned by parties P and Q respectively, and the set of matching attributes AttrMatch, the schema matching protocol finds the attributes common to the two schemas with the following assurances:
  • AttrMatch is not disclosed to W
  • AttrMatch is not disclosed to P and Q
  • AttrMatchSize is not disclosed to P and Q
How good are we?
• Time: better than record linkage without privacy preservation
• Effectiveness: comparable with respect to recall and precision
Flexible and Automatic RL
• P2P systems are loosely coupled, dynamic, and open
• Manual phases of record linkage can be problematic:
  • time-consuming vs. dynamic and open systems
  • synchronous interactions vs. loosely coupled systems
• Need for flexible and automatic RL
Background: Record Linkage Techniques
• Comparison functions:
  • Edit distance
  • Smith-Waterman
  • Q-grams
  • Jaro string comparator
  • Soundex code
  • TF-IDF
  • …
• Search space reduction:
  • Sorted Neighborhood Method
  • Blocking
  • Hierarchical grouping
  • …
• Decision rules:
  • Probabilistic: Fellegi & Sunter
  • Empirical
  • Knowledge-based
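Two of the techniques listed above can be sketched briefly: a q-gram comparison function (here a Jaccard similarity over padded bigrams, one common variant among several) and blocking as a search space reduction step. The function names and the blocking key are illustrative assumptions.

```python
from collections import defaultdict

def qgrams(s, q=2):
    """Set of q-grams of `s`, padded with '#' so that prefixes and suffixes count."""
    padded = "#" * (q - 1) + s.lower() + "#" * (q - 1)
    return {padded[i:i + q] for i in range(len(padded) - q + 1)}

def qgram_similarity(a, b, q=2):
    """Jaccard similarity between the q-gram sets of two strings."""
    ga, gb = qgrams(a, q), qgrams(b, q)
    union = ga | gb
    return len(ga & gb) / len(union) if union else 1.0

def blocked_pairs(records, key):
    """Yield index pairs only for records that share the same blocking key."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[key(rec)].append(idx)
    for ids in blocks.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                yield ids[i], ids[j]

# Hypothetical records; the blocking key is the first letter (an assumption).
records = ["Rossi Mario", "Rossi Maria", "Bianchi Luca", "Bianci Luca"]
candidates = sorted(blocked_pairs(records, key=lambda r: r[0]))
scores = {pair: qgram_similarity(records[pair[0]], records[pair[1]])
          for pair in candidates}
```

Blocking cuts the six possible pairs down to two candidates, and only those are scored by the comparison function.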
Key Idea
• Record linkage is a complex process and should be decomposed as much as possible into its constituent phases
• For each phase, the most appropriate technique should be chosen depending on application and data requirements
• The goal is to dynamically build ad hoc record linkage workflows
• RELAIS: a toolkit serving this purpose
  • developed at Istat
  • UNIROMA contribution on data profiling (wait a couple of slides)
RELAIS Toolkit
[Diagram: database features and application constraints feed RELAIS, which yields a record linkage workflow]
• Database features:
  • Size
  • Quality
  • Domain features
  • …
• Application constraints:
  • Admissible error rates
  • Privacy issues
  • Cost
  • …
RL Workflows
[Diagram: two workflows, RecLink WF Appl1 and RecLink WF Appl2, each selecting techniques for every phase]
• Preprocessing: UpperLowerCase, Normalization, Schema reconciliation
• Search space reduction: SNM, Blocking
• Comparison function: Equality, Jaro, Edit Distance
• Decision model: Probabilistic, Empirical
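Composing a workflow from one interchangeable technique per phase can be sketched as below. This is not the RELAIS API: `run_workflow` and the phase functions are hypothetical, and only the simplest option for each phase is shown.

```python
def normalize(record):
    """Preprocessing: lower-case and collapse whitespace."""
    return " ".join(record.lower().split())

def first_letter_block(record):
    """Search space reduction: block on the first character."""
    return record[:1]

def equality(a, b):
    """Comparison function: exact equality as a 0/1 score."""
    return 1.0 if a == b else 0.0

def empirical_decision(score, threshold=1.0):
    """Decision model: a fixed, empirically chosen threshold."""
    return score >= threshold

def run_workflow(left, right, preprocess, block_key, compare, decide):
    """Run the phases in order; each argument is a pluggable technique."""
    left = [preprocess(r) for r in left]
    right = [preprocess(r) for r in right]
    matches = []
    for i, a in enumerate(left):
        for j, b in enumerate(right):
            if block_key(a) == block_key(b) and decide(compare(a, b)):
                matches.append((i, j))
    return matches

# Hypothetical input data.
left = ["Mario  Rossi", "Luca Bianchi"]
right = ["mario rossi", "Anna Verdi"]
matches = run_workflow(left, right, normalize, first_letter_block,
                       equality, empirical_decision)
```

A second application could reuse `run_workflow` with a different comparison function or decision model without touching the other phases.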
Making Some Phases Automatic
• Data profiling for choosing matching keys
• Automatic extraction of:
  • Completeness
  • Consistency
  • Identification power
• Ongoing
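A rough sketch of profiling attributes to pick a matching key follows. The metric definitions are assumptions chosen for illustration and are not taken from RELAIS: completeness is the fraction of non-missing values, identification power is the fraction of distinct values among the non-missing ones, and attributes are ranked by their product. Consistency is left out because it requires domain rules.

```python
def is_present(value):
    """A value counts as present unless it is None or the empty string."""
    return value not in (None, "")

def completeness(records, attr):
    """Fraction of records with a non-missing value for `attr`."""
    return sum(is_present(r.get(attr)) for r in records) / len(records)

def identification_power(records, attr):
    """Fraction of distinct values among the non-missing values of `attr`."""
    values = [r[attr] for r in records if is_present(r.get(attr))]
    return len(set(values)) / len(values) if values else 0.0

def best_matching_key(records, attrs):
    """Rank candidate keys by completeness times identification power (an assumption)."""
    return max(attrs, key=lambda a: completeness(records, a)
                                    * identification_power(records, a))

# Hypothetical records.
records = [
    {"tax_code": "A1", "city": "Roma"},
    {"tax_code": "B2", "city": "Roma"},
    {"tax_code": "C3", "city": "Roma"},
    {"tax_code": None, "city": "Ischia"},
]
key = best_matching_key(records, ["tax_code", "city"])
```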
Status of RELAIS
• Currently: guided execution of RL workflows, with all phases automatic
• Future:
  • Definition of the RELAIS architecture as a service-oriented, web-accessible architecture, with formal specification of (i) the input/output of services and (ii) pre/post conditions by means of Semantic Web Services technologies
  • Automatic generation of RL workflows by reasoning on service specifications, using either automatic [Berardi et al. VLDB 2005] or semi-automatic [Bouguettaya et al. VLDBJ 2003] service composition techniques
Implementation View
[Diagram: record linkage workflow split into Q-RELAIS, P-RELAIS, and PQ-RELAIS]
• Q-RELAIS:
  • Data source profiling (quality metadata)
  • Quality-based trust evaluation
  • Automatic and flexible RL
• P-RELAIS:
  • Privacy risk assessment
  • Private RL
• PQ-RELAIS