KD2R: a Key Discovery method for semantic Reference Reconciliation
Danai Symeonidou, Nathalie Pernelle and Fatiha Saïs
LRI (University Paris-Sud)
WOD’2013, June 3rd
Data Linking
• More and more heterogeneous RDF sources
• Links can be asserted between them
• sameAs is one of the most important types of links: it combines information given in different data sources
• LOD: the number of already existing links is very small
• How to create links automatically?
(Figure: Linked Open Data cloud)
Data Linking Problem
Dataset1
• P1: FirstName: George, LastName: Thomson, SSN: 011223456, Job: Artist
Dataset2
• P3: FirstName: George, LastName: Thomson, SSN: 011223456, Age: 45
• P2: FirstName: George, LastName: Thomson, SSN: 444223456, Job: Professor
SameAs links drawn on the slide, one per animation step: P1 and P3, then P3 and P2
Data Linking with or without key constraints
• No knowledge given about the properties:
  • All the properties have the same importance.
• Knowledge given by an expert:
  • Specific expert rules [Arasu et al.’09, Low et al.’01, Volz et al.’09 (Silk)]
    Example: max(jaro(phone-number; phone-number); jaro-winkler(SSN; SSN)) > 0.88
  • Key constraints [Saïs, Pernelle and Rousset’09]
    Example: hasKey(Museum (museumName) (museumAddress))
• OWL2 key for a class expression: a combination of (inverse) properties that uniquely identifies an entity
  • hasKey( CE ( OPE1 ... OPEm ) ( DPE1 ... DPEn ) )
    Example: hasKey(Museum (museumName) (museumAddress)) expresses:
    Museum(x1) ∧ Museum(x2) ∧ museumName(x1, y) ∧ museumName(x2, y) ∧ museumAddress(x1, w) ∧ museumAddress(x2, w) → sameAs(x1, x2)
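How a key constraint turns into sameAs links can be sketched in a few lines of Python. This is a minimal illustration, not the linking tool used by the authors: the function name `same_as_links`, the URIs, and the museum descriptions are all hypothetical, and every property is assumed to be single-valued.

```python
from collections import defaultdict

def same_as_links(source1, source2, key):
    """Return (uri1, uri2) pairs whose descriptions agree on every key property.

    Assumption: each property holds one value; instances missing a key
    property are skipped rather than guessed.
    """
    index = defaultdict(list)
    for uri, desc in source1.items():
        if all(p in desc for p in key):
            index[tuple(desc[p] for p in key)].append(uri)
    links = []
    for uri, desc in source2.items():
        if all(p in desc for p in key):
            for match in index.get(tuple(desc[p] for p in key), []):
                links.append((match, uri))
    return links

# Hypothetical museum descriptions, made up for this example.
source1 = {
    "s1:m1": {"museumName": "Museum A", "museumAddress": "1 Main Street"},
    "s1:m2": {"museumName": "Museum B", "museumAddress": "2 Main Street"},
}
source2 = {
    "s2:n1": {"museumName": "Museum A", "museumAddress": "1 Main Street"},
    "s2:n2": {"museumName": "Museum A", "museumAddress": "9 Other Road"},
}

links = same_as_links(source1, source2, ("museumName", "museumAddress"))
print(links)
```

Only the pair that agrees on both museumName and museumAddress is linked; sharing the name alone is not enough under this key.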
Key Discovery Problem
• Problem: when data sources contain numerous data and/or complex ontologies
  • Some keys are not obvious to find.
  • Erroneous keys can be given by the expert.
• Aim: automatic discovery of a complete set of keys from data
• Naïve automatic way to discover keys: examine all the possible combinations of properties
  • Example: given an instance described by 15 properties, the number of candidate keys is 2^15 - 1 = 32767
  • For each candidate key, all the instances of the data have to be scanned
• Objective: find keys efficiently by:
  • Reducing the combinations
  • Partially scanning the data
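The cost of the naïve approach can be made concrete with a short Python sketch. The helper names `candidate_keys` and `is_key` and the two-person dataset are hypothetical; the check assumes complete, single-valued descriptions, which is a simplification of the RDF setting discussed in the talk.

```python
from itertools import combinations

def candidate_keys(properties):
    """Yield every non-empty combination of properties (2^n - 1 of them)."""
    props = sorted(properties)
    for size in range(1, len(props) + 1):
        for combo in combinations(props, size):
            yield combo

def is_key(candidate, instances):
    """Naive check: scan all instances and look for a repeated value tuple."""
    seen = set()
    for description in instances:
        signature = tuple(description[p] for p in candidate)
        if signature in seen:
            return False
        seen.add(signature)
    return True

# 15 properties give 2^15 - 1 candidate keys.
count = sum(1 for _ in candidate_keys([f"p{i}" for i in range(1, 16)]))
print(count)

# Hypothetical data: two people sharing a last name.
people = [
    {"firstName": "Elodie", "lastName": "Dupont"},
    {"firstName": "Marie", "lastName": "Dupont"},
]
keys = [c for c in candidate_keys(["firstName", "lastName"]) if is_key(c, people)]
print(keys)
```

Every candidate triggers a full scan of the instances, which is the behaviour the objective above aims to avoid.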
Key Discovery Problem
• RDF data sources (conforming to an OWL 2 ontology)
• Mappings between classes and properties of the different ontologies
• Open world assumption (incomplete data) and multivalued properties may exist
• How to discover keys when we do not know whether:
  • i1 =?= i2 =?= i3 =?= i4
  • hasFriend(i1, i4), hasFriend(i2, i3) … ??
  • firstName(i1, Elodie) … ?
Key Discovery Problem: Assumptions
• Unique Name Assumption (UNA): two different URIs refer to distinct entities (data sources generated from relational databases, Yago)
  • i1 <> i2 <> i3 <> i4
• Two literals that are syntactically different are semantically different
  • e.g. “Napoleon Bonaparte” <> “Napoleon”
Key Discovery: Heuristics
Heuristic 1 - Pessimistic:
• Non-instantiated property → all the values are possible
  • Example: hasFriend(i2, i3), hasFriend(i4, i2) are possible.
• Instantiated property → only the given values are considered
  • Example: not hasFriend(i1, i4)
Non keys: {lastName}, {hasFriend}
Keys: {firstName}, {lastName, firstName}, {firstName, hasFriend}
Undetermined keys: {hasFriend, lastName}
Key Discovery: Heuristics
Heuristic 2 - Optimistic:
• Non-instantiated property → the value is not one of the already existing ones
  • Example: not hasFriend(i2, i3), not hasFriend(i2, i1), not hasFriend(i2, i4).
• Instantiated property → only the given values are considered
  • Example: not hasFriend(i1, i4)
Non keys: {lastName}, {hasFriend}
Keys: {firstName}, {lastName, firstName}, {firstName, hasFriend}, {hasFriend, lastName}
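The difference between the pessimistic and optimistic heuristics can be sketched as a pairwise check in Python. This is an illustrative reading of the two slides, not the KD2R implementation: the function `classify` and the four instances are hypothetical (the figure with the original data is not reproduced here), chosen so that the outcome matches the key sets listed above.

```python
from itertools import combinations

# Hypothetical instances; a missing property means "not instantiated".
DATA = {
    "i1": {"firstName": {"Elodie"}, "lastName": {"Dupont"}, "hasFriend": {"i2"}},
    "i2": {"firstName": {"Marie"}, "lastName": {"Dupont"}},
    "i3": {"firstName": {"Jean"}, "lastName": {"Martin"}, "hasFriend": {"i2"}},
    "i4": {"firstName": {"Paul"}, "lastName": {"Durand"}},
}

def classify(props, data, heuristic):
    """Classify a property set as 'non key', 'undetermined key' or 'key'."""
    undetermined = False
    for a, b in combinations(data.values(), 2):
        known_shared = True  # every property has a known common value
        may_share = True     # no property is known (or assumed) to differ
        for p in props:
            va, vb = a.get(p), b.get(p)
            if va is None or vb is None:
                known_shared = False
                if heuristic == "optimistic":
                    # Optimistic: a missing value differs from existing ones.
                    may_share = False
            elif not (va & vb):
                known_shared = False
                may_share = False
        if known_shared:
            return "non key"
        if may_share:
            # Pessimistic: a missing value could coincide with a known one.
            undetermined = True
    return "undetermined key" if undetermined else "key"

for props in [("lastName",), ("hasFriend",), ("firstName",), ("hasFriend", "lastName")]:
    print(props, classify(props, DATA, "pessimistic"), classify(props, DATA, "optimistic"))
```

With this data, {hasFriend, lastName} is undetermined under the pessimistic heuristic because i1 and i2 share a last name while hasFriend is not instantiated for i2; the optimistic heuristic treats the missing value as different and accepts the set as a key.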
KD2R approach
• Topological sort of the classes (subsumption)
• Key Finder
  • Discover non keys
    Ex: {lastName}, {hasFriend}
  • Derive keys using non keys
    Ex: {firstName}, {lastName, firstName}, {firstName, hasFriend}, {hasFriend, lastName}
• Key Merge
  • Cartesian product of minimal key sets in S1, S2
    Ex: Ks1 = {firstName}, Ks2 = {hasFriend}, Ks1-s2 = {firstName, hasFriend}
Technical report available: https://www.lri.fr/~bibli/Rapports-internes/2013/RR1559.pdf
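One way to read the "derive keys using non keys" and "Key Merge" steps is as Cartesian products over sets of properties. The Python sketch below is an assumption-laden illustration rather than the published algorithm: the names `derive_keys`, `minimal` and `merge_keys` are hypothetical, and the derivation assumes that a key must contain at least one property outside each non key.

```python
from itertools import product

def derive_keys(properties, non_keys):
    """Pick one property from the complement of each non key."""
    complements = [set(properties) - set(nk) for nk in non_keys]
    return {frozenset(choice) for choice in product(*complements)}

def minimal(sets):
    """Keep only the sets that have no proper subset in the collection."""
    return {s for s in sets if not any(other < s for other in sets)}

def merge_keys(keys_s1, keys_s2):
    """Cartesian product of the minimal keys of two sources (union per pair)."""
    return minimal({a | b for a in keys_s1 for b in keys_s2})

properties = {"firstName", "lastName", "hasFriend"}
non_keys = [{"lastName"}, {"hasFriend"}]

keys = derive_keys(properties, non_keys)
print(sorted(sorted(k) for k in keys))
print(sorted(sorted(k) for k in minimal(keys)))

merged = merge_keys({frozenset({"firstName"})}, {frozenset({"hasFriend"})})
print(sorted(sorted(k) for k in merged))
```

On the example from the slide, the product yields the four listed sets, of which {firstName} and {hasFriend, lastName} are minimal, and merging Ks1 = {firstName} with Ks2 = {hasFriend} gives {firstName, hasFriend}.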
KD2R approach: Key Finder
• Computation of maximal non keys and undetermined keys
• Represent data in a prefix-tree (a compact representation of the data of one class)
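The idea of a prefix-tree as a compact representation can be illustrated with nested dictionaries in Python. This is a simplified sketch under stated assumptions, not the structure described in the technical report: the function names, the `__instances__` marker, and the sample data are hypothetical, and each property is treated as single-valued.

```python
def build_prefix_tree(instances, property_order):
    """Insert each instance as a path of (property, value) nodes.

    Instances that agree on the first properties share the same branch,
    which is what makes the representation compact.
    """
    root = {}
    for uri, description in instances.items():
        node = root
        for prop in property_order:
            node = node.setdefault((prop, description.get(prop)), {})
        node.setdefault("__instances__", []).append(uri)
    return root

def count_nodes(node):
    """Count (property, value) nodes, ignoring the instance lists."""
    return sum(1 + count_nodes(child)
               for label, child in node.items() if label != "__instances__")

# Hypothetical instances of one class.
instances = {
    "i1": {"lastName": "Dupont", "firstName": "Elodie"},
    "i2": {"lastName": "Dupont", "firstName": "Marie"},
    "i3": {"lastName": "Martin", "firstName": "Jean"},
}

tree = build_prefix_tree(instances, ["lastName", "firstName"])
print(count_nodes(tree))
```

Here i1 and i2 share the ("lastName", "Dupont") node, so the tree holds five nodes instead of the six cells of a flat table.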
Validation of approach
Datasets where KD2R has been tested:
Demo
• Ontologies
  • Data conforming to one ontology
• RDF data
  • DBpedia NaturalPlace dataset (78400 instances)
  • OAEI Person dataset (2000 instances)
• Data linking
  • Link data using LN2R
  • Measure quality of linking using:
    • recall
    • precision
    • f-measure
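The three quality measures named for the demo follow their standard definitions, sketched here in Python. The function name `linking_quality` and the example link sets are hypothetical and do not come from the reported experiments.

```python
def linking_quality(found_links, reference_links):
    """Return (precision, recall, f_measure) of found links against a reference."""
    found, reference = set(found_links), set(reference_links)
    correct = len(found & reference)
    precision = correct / len(found) if found else 0.0
    recall = correct / len(reference) if reference else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Hypothetical links: 3 found, 4 expected, 2 in common.
found = {("a", "x"), ("b", "y"), ("c", "z")}
reference = {("a", "x"), ("b", "y"), ("d", "w"), ("e", "v")}

precision, recall, f_measure = linking_quality(found, reference)
print(precision, recall, f_measure)
```

Precision is the share of found links that are correct, recall the share of expected links that were found, and the f-measure is their harmonic mean.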
Questions?
Thank you!