ICDE 06, 04.05.06. Probabilistic Message Passing in Peer Data Management Systems. Philippe Cudré-Mauroux, EPFL. Joint work with: Karl Aberer (EPFL), Andras Feher (T.U. Darmstadt)
Overview of the talk • Data Integration in Large-Scale Information Systems • Peer Data Management Systems (PDMS) • Query Routing in PDMS • Precision / Recall tradeoff • Probabilistic Message Passing • Deriving quality measures for the mappings • Conclusions
Classical Data Integration: LAV/GAV • Traditional database techniques (e.g., LAV/GAV) rely on centralized schemas to integrate data sources • Not applicable to large-scale, decentralized contexts • Scale (upper ontologies?) • Churn • Autonomy • How can we foster semantic interoperability in decentralized settings? [Figure: two local attributes, myDate and yourDate, are each mapped to a central Date concept: m(myDate) = Date, m(yourDate) = Date]
Peer Data Management Systems (1)
Extending data integration techniques to decentralized settings

Photoshop (own schema):
    <Photoshop_Image>
      <GUID>178A8CD8865</GUID>
      <Creator>Robinson</Creator>
      <Subject>
        <Bag>
          <Item>Tunbridge Wells</Item>
          <Item>Royal Council</Item>
        </Bag>
      </Subject>
      …
    </Photoshop_Image>

WinFS (known schema):
    <WinFSImage>
      <GUID>178A8CD8866</GUID>
      <Author>
        <DisplayName>Henry Peach Robinson</DisplayName>
        <Role>Photographer</Role>
      </Author>
      <Keyword>Tunbridge</Keyword>
      <Keyword>Council</Keyword>
      …
    </WinFSImage>

Mapping from WinFS to Photoshop:
    T12 = <Photoshop_Image>
            <GUID>$fs/GUID</GUID>
            <Creator>$fs/Author/DisplayName</Creator>
          </Photoshop_Image>
          FOR $fs IN /WinFSImage

Query against the local schema, and its reformulation through T12:
    Q1 = <GUID>$p/GUID</GUID> FOR $p IN /Photoshop_Image WHERE $p/Creator LIKE "%Robi%"
    Q2 = <GUID>$p/GUID</GUID> FOR $p IN T12 WHERE $p/Creator LIKE "%Robi%"
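The reformulation of Q1 into Q2 can be mimicked in a few lines of Python. This is only a sketch: the dictionaries stand in for the two XML documents on the slide, and the function names `t12` and `creator_query` are hypothetical helpers, not part of any system mentioned in the talk.

```python
# Hypothetical in-memory stand-ins for the two XML documents on the slide.
photoshop_images = [{"GUID": "178A8CD8865", "Creator": "Robinson"}]
winfs_images = [
    {
        "GUID": "178A8CD8866",
        "Author": {"DisplayName": "Henry Peach Robinson", "Role": "Photographer"},
    }
]


def t12(winfs_records):
    """Present WinFS records in the Photoshop schema, mirroring mapping T12."""
    return [
        {"GUID": r["GUID"], "Creator": r["Author"]["DisplayName"]}
        for r in winfs_records
    ]


def creator_query(records, needle="Robi"):
    """Return the GUIDs of records whose Creator contains `needle` (LIKE "%Robi%")."""
    return [r["GUID"] for r in records if needle in r["Creator"]]


q1 = creator_query(photoshop_images)   # Q1: evaluated on local Photoshop data
q2 = creator_query(t12(winfs_images))  # Q2: the same query evaluated through T12
print(q1, q2)
```

The query itself is unchanged; only its input is swapped for the view produced by the mapping, which is what lets a peer answer queries posed against a schema it does not use.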
Peer Data Management Systems (2) • Pairwise mappings • Local mappings overcome global heterogeneity • Iterative query reformulation
[Figure: a weather article is dated under three schemas: <xap:CreateDate>2001-12-19T18:49:03Z</xap:CreateDate> and <xap:ModifyDate>2001-12-19T20:09:28Z</xap:ModifyDate>; <es:cDate> 05/08/2004 </es:cDate>; <myRDF:Date> Jan 1, 2005 </myRDF:Date>. Mapping labels pair myRDF:Date with xap:ModifyDate, es:cDate with myRDF:Date, and es:cDate with xap:CreateDate. Which date is meant?]
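Iterative query reformulation can be sketched as a breadth-first traversal of the pairwise mappings. The attribute names come from the slide, but the direction of each mapping is an assumption made for illustration, and `reformulate` is a hypothetical helper.

```python
from collections import deque

# Assumed pairwise attribute mappings; the directions are illustrative only.
mappings = {
    "myRDF:Date": ["xap:ModifyDate", "es:cDate"],
    "es:cDate": ["xap:CreateDate"],
}


def reformulate(attribute, mappings):
    """Return every attribute reachable from `attribute`, with the chain of hops used."""
    reached = {attribute: [attribute]}
    queue = deque([attribute])
    while queue:
        current = queue.popleft()
        for target in mappings.get(current, []):
            if target not in reached:  # do not revisit, so cycles terminate
                reached[target] = reached[current] + [target]
                queue.append(target)
    return reached


paths = reformulate("myRDF:Date", mappings)
for target, path in paths.items():
    print(target, "via", " -> ".join(path))
```

Under these assumed mappings a query on myRDF:Date reaches both xap:ModifyDate and xap:CreateDate. Purely local mappings can therefore propagate a query to attributes with different meanings.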
PDMS Examples • Some academic systems • Piazza • Hyperion • BestPeer • GridVine • … • Out there on the Internet • The Sequence Retrieval System (SRS) • 388 schemata (May 05, EBI repository) • 518 mappings (ID <-> ID) • Power-law distribution of node degrees • Clustering coefficient = 0.32 • Diameter = 9 • Semantic Overlay Networks • P2P + semi-structured data • The Semantic Web
Data in large-scale PDMS: Distributed Databases VS Large-Scale PDMS

Distributed Databases                            Large-Scale PDMS
Number of sources < 100                          Number of sources > 100
Consistent data                                  Unreliable data
Coordination                                     Autonomy
Structured data (e.g., relational data model)    Semi-structured data (e.g., XML/RDF)
Integrity constraints                            No integrity constraints
Transactions                                     No transactions
Powerful queries (e.g., SQL, aggregation)        Simple SP queries (e.g., triple patterns, ranking)
Schemas created by administrators                Schemata created by end users
Relatively fixed topology                        Network churn
Problem: Precision/Recall Tradeoff (1) • Semantic query routing • To whom shall I forward a query posed against my local schema? • Some (most) mappings will be (partially) faulty • Low expressive power of mapping languages • samePropertyAs / sameClassAs / subclassOf • … or even worse (Microformats) • Automatic schema alignment techniques • Different views on conceptualizations • Local query resolution • Low recall • Flooding (PDMS so far) • Low precision
Problem: Precision/Recall Tradeoff (2) • Standard deductive integration is not sufficient • Uncertainty on mappings and conceptualizations • Probabilistic Message Passing • Deriving quality measures for the mappings • Reduces uncertainty • Used to route query / optimize mappings • Based on a notion of agreement on conceptualizations • Decentralized decision making, Emergent Semantics • From Schema Matching to Probabilistic Message Passing • Automatic Schema Matching • INPUT: 2 schemas + data • OUTPUT: 1 mapping • Probabilistic Message Passing • INPUT: n schemas and m mappings • OUTPUT: quality measures for the mappings
Probabilistic Message Passing • Link-based analysis of the PDMS • Automatically deriving quality measures for the mappings • Transitive closures of mapping operations • Mapping Cycles • Parallel Paths [Figure: a query q: art/Creator? is forwarded around a cycle through mappings m0, m4 and m3; the feedback f0 compares q with m3(m4(m0(q))), e.g., art/Creator? VS art/creatDate?]
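Cycle feedback can be illustrated by composing attribute mappings around a cycle and comparing the result with the original attribute. This is a sketch only: the mapping tables and the intermediate attribute names (Author, Maker) are hypothetical, and only Creator and creatDate come from the slide.

```python
def follow_cycle(attribute, cycle):
    """Apply each mapping in turn; None means the attribute has no image."""
    for mapping in cycle:
        if attribute is None:
            return None
        attribute = mapping.get(attribute)
    return attribute


def feedback(attribute, cycle):
    """Return '+' if the cycle maps the attribute back to itself, '-' if it maps it
    to something else, and None if the attribute is lost along the way."""
    result = follow_cycle(attribute, cycle)
    if result is None:
        return None
    return "+" if result == attribute else "-"


# Hypothetical mappings m0, m4, m3 forming a cycle back to the first schema.
m0 = {"Creator": "Author"}
m4 = {"Author": "Maker"}
m3_consistent = {"Maker": "Creator"}
m3_faulty = {"Maker": "creatDate"}

print(feedback("Creator", [m0, m4, m3_consistent]))  # q compared with m3(m4(m0(q)))
print(feedback("Creator", [m0, m4, m3_faulty]))
```

A negative feedback shows that at least one mapping on the cycle is wrong but not which one, which is why a probabilistic treatment is needed.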
On Cycles / parallel paths [Figure: mapping graph with mappings m0, m1, m2, m3, m4, m5 and feedback f0]
Computing a Marginal for one cycle (mappings m0, m3, m4: unknown; feedback f0: observed)
• $P(m_0, m_3, m_4, f_0) = P(m_0)\,P(m_3)\,P(m_4)\,P(f_0 \mid m_0, m_3, m_4)$
• $P(m_0 \mid f_0) = \sum_{m_3, m_4} P(m_0, m_3, m_4, f_0)\,P(f_0)^{-1}$
• But: feedbacks on different cycles are correlated
• One wrong mapping will affect several cycles/paths
• Need to express a global probabilistic model for the mapping graph
A Brief Intro to Factor-Graphs • g(x1, x2, x3, x4) = fA(x1, x2)fB(x2, x3, x4)
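The factorization above is what makes marginals cheap to compute: sums can be pushed inside the product. The sketch below checks this on binary variables, with two made-up local functions standing in for fA and fB (their values are arbitrary assumptions).

```python
from itertools import product

VALUES = (0, 1)  # binary variables, chosen only for illustration


def f_a(x1, x2):
    """Hypothetical local function fA(x1, x2)."""
    return ((0.9, 0.1), (0.4, 0.6))[x1][x2]


def f_b(x2, x3, x4):
    """Hypothetical local function fB(x2, x3, x4); any non-negative table works."""
    return 1.0 + x2 + 2.0 * x3 * x4


def marginal_brute_force():
    """Sum g(x1, x2, x3, x4) over x2, x3, x4 for each value of x1."""
    totals = [0.0, 0.0]
    for x1, x2, x3, x4 in product(VALUES, repeat=4):
        totals[x1] += f_a(x1, x2) * f_b(x2, x3, x4)
    return totals


def marginal_sum_product():
    """Same marginal via one message from factor fB to variable x2."""
    msg_b_to_x2 = [
        sum(f_b(x2, x3, x4) for x3, x4 in product(VALUES, repeat=2))
        for x2 in VALUES
    ]
    return [sum(f_a(x1, x2) * msg_b_to_x2[x2] for x2 in VALUES) for x1 in VALUES]


brute = marginal_brute_force()
passed = marginal_sum_product()
print(brute, passed)
```

Both functions return the same unnormalized marginal of x1; the second touches each local function only once per configuration of its own arguments.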
Deriving PDMS Factor-Graphs • Abductive reasoning on transitive closures of mappings • A priori information on mapping
PDMS Factor-Graphs • Cyclic graph • Junction Tree? Clustering / Stretching of variables? • Centralization • Computational + communicational overhead • Iterative Sum-Product • Approximate results • How to perform iterative sum-product by message passing on the mapping graph? • Message passing in factor graph does not correspond to connectivity of mapping graph • We want to rely on decentralized computations only • Locality VS Globality of nodes in the factor graph • Mappings: local • Feedback factor: common, global knowledge • Observed feedback variables: neighborhood
Message Passing • Decentralized computations • Computationally inexpensive • Sums and Products • Message-Passing Schedules • Periodic • Lazy (piggybacking on query forwarding) • No message overhead
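A compact sketch of iterative sum-product over mapping variables and cycle-feedback factors follows. It is not the talk's implementation: it runs synchronous rounds in one process (closest to the periodic schedule), reuses an assumed feedback model (positive feedback certain with no error, impossible with one error, probability `delta` with two or more), and the cycle layout in the example is hypothetical.

```python
from itertools import product


def likelihood(states, positive, delta):
    """Assumed P(observed feedback | mapping states); True means correct."""
    wrong = sum(1 for s in states if not s)
    p_pos = 1.0 if wrong == 0 else (0.0 if wrong == 1 else delta)
    return p_pos if positive else 1.0 - p_pos


def normalize(pair):
    total = pair[0] + pair[1]
    return (pair[0] / total, pair[1] / total)


def sum_product(cycles, prior, delta, rounds=10):
    """cycles: list of (mapping names, feedback_is_positive).
    Returns the belief P(mapping = correct | all feedback) per mapping."""
    # Messages are (weight for correct, weight for incorrect).
    to_var = {(c, m): (0.5, 0.5) for c, (ms, _) in enumerate(cycles) for m in ms}
    for _ in range(rounds):
        # Variable -> factor: prior times messages from the other cycles.
        to_fac = {}
        for (c, m) in to_var:
            w = [prior, 1.0 - prior]
            for (c2, m2), msg in to_var.items():
                if m2 == m and c2 != c:
                    w[0] *= msg[0]
                    w[1] *= msg[1]
            to_fac[(c, m)] = normalize(w)
        # Factor -> variable: sum over the states of the other mappings.
        new_to_var = {}
        for c, (ms, positive) in enumerate(cycles):
            for i, m in enumerate(ms):
                out = [0.0, 0.0]
                for states in product((True, False), repeat=len(ms)):
                    weight = likelihood(states, positive, delta)
                    for j, other in enumerate(ms):
                        if j != i:
                            weight *= to_fac[(c, other)][0 if states[j] else 1]
                    out[0 if states[i] else 1] += weight
                new_to_var[(c, m)] = normalize(out)
        to_var = new_to_var
    beliefs = {}
    for m in sorted({m for ms, _ in cycles for m in ms}):
        w = [prior, 1.0 - prior]
        for (c, m2), msg in to_var.items():
            if m2 == m:
                w[0] *= msg[0]
                w[1] *= msg[1]
        beliefs[m] = normalize(w)[0]
    return beliefs


# Hypothetical example: one cycle with positive and one with negative feedback,
# sharing mapping m2.
cycles = [(("m0", "m1", "m2"), True), (("m2", "m3", "m4"), False)]
beliefs = sum_product(cycles, prior=0.7, delta=0.1)
print(beliefs)
```

In this small example the factor graph has no loops, so the beliefs are exact; on mapping graphs with overlapping cycles the same updates yield approximate results. Each update uses only sums and products over one cycle's mappings, which is why it can be piggybacked on query forwarding.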
Implemented System • Schemas • Import from OWL (Web Ontology Language) • Mappings • KnowledgeWeb Ontology Alignment API • Import from RDF/XML • Automated on-the-fly creation • Comparison to standard alignments • Automatic derivation of quality measures P(m=correct | {F}) for the mappings using iterative message-passing • Query routing based on the quality measures • Precision / recall tradeoff
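Routing on the quality measures can be pictured as a threshold test on P(m=correct | {F}). The sketch below is an assumption about how such a rule might look, not the implemented system; `route`, the peer names and the probabilities are all hypothetical.

```python
def route(query_attribute, outgoing, threshold):
    """Forward the query through every mapping that covers the attribute and
    whose estimated probability of being correct reaches the threshold.

    outgoing: list of (neighbor, attribute mapping, P(mapping correct)).
    Returns a list of (neighbor, reformulated attribute).
    """
    forwards = []
    for neighbor, mapping, quality in outgoing:
        if query_attribute in mapping and quality >= threshold:
            forwards.append((neighbor, mapping[query_attribute]))
    return forwards


# Hypothetical neighbors with made-up quality measures.
outgoing = [
    ("peerA", {"Creator": "Author"}, 0.95),
    ("peerB", {"Creator": "creatDate"}, 0.35),
    ("peerC", {"Title": "Name"}, 0.90),
]

strict = route("Creator", outgoing, threshold=0.5)   # favors precision
flooded = route("Creator", outgoing, threshold=0.0)  # favors recall
print(strict, flooded)
```

Raising the threshold drops doubtful mappings and improves precision; lowering it toward zero approaches flooding and improves recall.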
Some (Preliminary) Results: Convergence (undirected example graph, prior 0.7, delta 0.1)
Fault-tolerance (faulty links) (undirected example graph, prior 0.8, delta 0.1)
Detecting Erroneous Mappings (random network of 50 schemas and 200 mappings, no prior information)
Conclusions • Deriving quality measures for PDMS mappings • Automated process • Decentralized computations • Based on agreements on conceptualizations • Emergent Semantics • Current work • More expressive mappings • E.g., subsumption • Integration in the GridVine semantic overlay network • Application to other domains • Web Services composition?
Thank you for your attention Web page: lsirpeople.epfl.ch/cudre • Questions?