Explore the challenges and techniques for data integration and sharing in decentralized systems, including picture sharing and semantic interoperability. Discover how peer data management systems and self-organizing networks can enhance data management in large-scale networks.
IBM T.J. Watson 02.06.06 Data Management in Large-Scale Decentralized Information Systems Philippe Cudré-Mauroux, EPFL Joint work with: Karl Aberer (advisor @ EPFL)
Overview • Motivation • Picture Sharing in Decentralized Settings • Data Integration in large-scale networks • Peer Data Management Systems • Semantic Gossiping • Probabilistic Message-Passing • Aspects of self-organization • Self-Healing semantic networks • Analyzing semantic interoperability in the large • Applications • P-Grid • GridVine • PicShark • Conclusions & Future Work
1. Motivation: Picture Sharing • Profusion of Digital Images • Variety of powerful devices • Gigabytes of pictures are the new norm • Most of the images are kept local • Some are shared • Mostly point-to-point • Primitive search capabilities [Figure labels: MMS, HTTP, SMTP]
Opportunity • More and more software uses metadata to organize images locally • (Semi) Structured metadata (e.g., XML, PSA) • Ontological metadata (e.g., RDF, XMP) • Type-based metadata (e.g., WinFS) <?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?> <x:xapmeta xmlns:x='adobe:ns:meta/'> <rdf:RDF xmlns:rdf= 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'> <rdf:Description about='' xmlns:xap='http://ns.adobe.com/xap/1.0/'> <xap:CreateDate>2001-12-19T18:49:03Z</xap:CreateDate> <xap:ModifyDate>2001-12-19T20:09:28Z</xap:ModifyDate> <xap:Creator> John Doe </xap:Creator> </rdf:Description> …
Hurdle: Metadata Heterogeneity • Why not take advantage of this metadata in a distributed setting? X Syntactic discrepancies X Semantic heterogeneity • All the aforementioned standards are extensible • Shared representation is not enough <xap:CreateDate>2001-12-19T18:49:03Z</xap:CreateDate> VS <es:cDate> 05/08/2004 </es:cDate> <rdf:Property rdf:ID="Length-Y"> <rdfs:label>Length-Y</rdfs:label> <rdfs:subPropertyOf rdf:resource="#length"/> </rdf:Property> VS <rdf:Property rdf:ID="width"> <rdfs:label>Width</rdfs:label> <rdfs:subPropertyOf rdf:resource="#length"/> </rdf:Property>
Beyond Keyword Search • Searching semantically richer objects in large-scale heterogeneous networks • date? <xap:CreateDate>2001-12-19T18:49:03Z</xap:CreateDate> <xap:ModifyDate>2001-12-19T20:09:28Z</xap:ModifyDate> <es:DofCreation> 05/08/2004 </es:DofCreation> <myRDF:Date> Jan 1, 2005 </myRDF:Date>
2. Data Integration in large-scale systems • Distributed Databases • Number of sources < 100 • Consistent data • Coordination • Structured data • E.g., Relational data model • Integrity constraints • Transactions • Powerful queries • E.g., SQL, aggregation • Schemas created by administrators • Relatively fixed topology VS • Large Scale Information Systems (e.g., WWW) • Number of sources > 1000 • Unreliable data • Autonomy • Semi-structured data • E.g., XML/RDF • No integrity constraints • No transactions • Simple SP queries • E.g., triple patterns, ranking • Schemas created by end users • Network churn
Data Integration: LAV/GAV • Traditional database techniques (e.g., LAV/GAV) rely on centralized schemas to integrate data sources • Not applicable to our context • Scale (upper ontologies?) • Churn • Autonomy • How can we foster semantic interoperability in decentralized settings? [Figure: a central schema attribute Date, with mappings m(myDate) = Date and m(yourDate) = Date from the local attributes myDate and yourDate]
Semantic Interoperability • Extending semantic interoperability techniques to decentralized settings • Photoshop (own schema): <Photoshop_Image> <GUID>178A8CD8865</GUID> <Creator>Robinson</Creator> <Subject> <Bag> <Item> Tunbridge Wells</Item> <Item>Royal Council</Item> </Bag> </Subject> … </Photoshop_Image> • WinFS (known schema): <WinFSImage> <GUID>178A8CD8866</GUID> <Author> <DisplayName> Henry Peach Robinson </DisplayName> <Role>Photographer</Role> </Author> <Keyword> Tunbridge </Keyword> <Keyword>Council</Keyword> … </WinFSImage> • Q1 = <GUID>$p/GUID</GUID> FOR $p IN /Photoshop_Image WHERE $p/Creator LIKE "%Robi%" • T12 = <Photoshop_Image> <GUID>$fs/GUID</GUID> <Creator> $fs/Author/DisplayName </Creator> </Photoshop_Image> FOR $fs IN /WinFSImage • Q2 = <GUID>$p/GUID</GUID> FOR $p IN T12 WHERE $p/Creator LIKE "%Robi%"
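The rewriting of Q1 into Q2 through the view T12 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the XML fragment is a trimmed copy of the WinFS example from the slide, and the function names `t12`, `q1` and `q2` are hypothetical helpers, not part of any system described in the talk.

```python
import xml.etree.ElementTree as ET

# Trimmed WinFS record from the slide (elided parts left out).
WINFS = """<WinFSImage><GUID>178A8CD8866</GUID>
<Author><DisplayName>Henry Peach Robinson</DisplayName>
<Role>Photographer</Role></Author>
<Keyword>Tunbridge</Keyword><Keyword>Council</Keyword></WinFSImage>"""


def t12(winfs_image):
    """View T12: expose one WinFS image under the Photoshop schema."""
    out = ET.Element("Photoshop_Image")
    ET.SubElement(out, "GUID").text = winfs_image.findtext("GUID")
    ET.SubElement(out, "Creator").text = winfs_image.findtext("Author/DisplayName")
    return out


def q1(photoshop_images, needle="Robi"):
    """Q1: GUIDs of images whose Creator contains the needle (LIKE "%Robi%")."""
    return [p.findtext("GUID") for p in photoshop_images
            if needle in (p.findtext("Creator") or "")]


def q2(winfs_images, needle="Robi"):
    """Q2: the same query, answered over WinFS data seen through T12."""
    return q1([t12(image) for image in winfs_images], needle)


print(q2([ET.fromstring(WINFS)]))
```

The point of the sketch is that the query itself never changes; only the data source is wrapped by the mapping.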
2.1. Peer Data Management Systems • Pairwise mappings • Peer Data Management Systems (PDMS) • Local mappings overcome global heterogeneity • Iterative query rewriting [Figure: peers holding heterogeneous date elements, <xap:CreateDate>2001-12-19T18:49:03Z</xap:CreateDate>, <xap:ModifyDate>2001-12-19T20:09:28Z</xap:ModifyDate>, <es:cDate> 05/08/2004 </es:cDate> and <myRDF:Date> Jan 1, 2005 </myRDF:Date>, linked by pairwise mappings among xap:CreateDate, xap:ModifyDate, es:cDate and myRDF:Date; further labels: date?, weather, article]
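Iterative query rewriting over pairwise mappings can be illustrated with a small breadth-first traversal. The sketch below assumes a hypothetical mapping table between the three namespaces used on the slides (`xap`, `es`, `myRDF`); the particular links and the function name `rewrite_everywhere` are illustrative assumptions only.

```python
from collections import deque

# Hypothetical pairwise mappings: (source schema, target schema) -> attribute map.
MAPPINGS = {
    ("xap", "es"): {"CreateDate": "cDate"},
    ("es", "myRDF"): {"cDate": "Date"},
}


def rewrite_everywhere(schema, attribute):
    """Propagate a one-attribute query through the mapping graph.

    Returns, for every schema reached, the attribute name the query
    uses once it has been rewritten for that schema.
    """
    reached = {schema: attribute}
    queue = deque([schema])
    while queue:
        current = queue.popleft()
        for (src, dst), attr_map in MAPPINGS.items():
            translatable = reached[current] in attr_map
            if src == current and dst not in reached and translatable:
                reached[dst] = attr_map[reached[current]]
                queue.append(dst)
    return reached


print(rewrite_everywhere("xap", "CreateDate"))
```

No peer needs a global schema: `xap` only knows its mapping to `es`, yet the query still reaches `myRDF`.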
Problem: Precision/Recall Tradeoff • Semantic Query routing • To whom shall I forward a query posed against my local schema? • Some (most) mappings will be (partially) faulty • Low expressive power of mappings • samePropertyAs / sameClassAs / subclassOf • … or even worse (Microformats) • Automatic schema alignment techniques • Different views on conceptualizations • Local query resolution • Low recall • Flooding (PDMS so far) • Low precision • Standard deductive integration is not sufficient • Uncertainty on mappings and conceptualizations
2.2. Semantic Gossiping • Local translations enabling global agreements • Selective query forwarding paradigm • Syntactic distances • Lost predicates • Semantic distances • Results analysis • Cycles analysis Precision/Recall tradeoff
Semantic Gossiping • Selectively reformulate queries through mapping links πTitleCreature=Joe (R5) X πTitleCreator=Joe (R3) πTitleAuthor=Joe (R2) πTitleCreator=Joe (R4) πTitreAuteur=Joe (R1) X Author=Joe (R4)
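Selective forwarding based on a syntactic distance (lost predicates) might look like the sketch below. It assumes a deliberately simple measure, the fraction of query attributes that survive a translation, multiplied along the path; the peer names reuse R1 to R5 from the figure, but the link table and the threshold are hypothetical.

```python
def reformulate(query_attrs, mapping):
    """Translate attributes; attributes without a mapping are lost."""
    return [mapping[a] for a in query_attrs if a in mapping]


def syntactic_similarity(query_attrs, mapping):
    """Fraction of the query's attributes that survive the translation."""
    if not query_attrs:
        return 0.0
    return len(reformulate(query_attrs, mapping)) / len(query_attrs)


def gossip(query_attrs, peer, links, threshold, similarity=1.0, visited=None):
    """Forward a query along mapping links while enough of it is preserved."""
    visited = {} if visited is None else visited
    visited[peer] = (query_attrs, similarity)
    for neighbor, mapping in links.get(peer, []):
        sim = similarity * syntactic_similarity(query_attrs, mapping)
        if neighbor not in visited and sim >= threshold:
            gossip(reformulate(query_attrs, mapping), neighbor, links,
                   threshold, sim, visited)
    return visited


# Hypothetical mapping links between peers.
LINKS = {
    "R2": [("R1", {"Title": "Titre", "Author": "Auteur"}),
           ("R4", {"Title": "Title", "Author": "Creator"})],
    "R4": [("R5", {"Title": "Title"})],  # loses the Creator predicate
}

print(gossip(["Title", "Author"], "R2", LINKS, threshold=0.8))
```

Raising the threshold favors precision (fewer, better-preserved reformulations); lowering it favors recall.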
2.3. Probabilistic Message Passing • Where do the mapping quality measures come from? • Link-based analysis of the PDMS • Automatically deriving quality measures for the mappings • Transitive Closures on mapping operations • Mapping Cycles • Parallel Paths [Figure: mapping graph with mappings m0, m1, m2, m3, m4, m5 and feedback f0]
On Cycles / parallel paths • q: art/Creator? • q VS m3(m4(m0(q))) • art/Creator? VS art/creatDate? [Figure: cycle of mappings m0, m4, m3 with feedback f0]
Computing a Marginal for one cycle [Figure legend: unknown, observed] • $P(m_0, m_3, m_4, f_0) = P(m_0)\,P(m_3)\,P(m_4)\,P(f_0 \mid m_0, m_3, m_4)$ • $P(m_0 \mid f_0) = \sum_{m_3, m_4} P(m_0, m_3, m_4, f_0)\, P(f_0)^{-1}$ • But: feedbacks on different cycles are correlated • One wrong mapping will affect several cycles/paths • Need to express a global probabilistic model for the mapping graph
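The marginal for a single cycle can be computed by brute-force enumeration. The slide does not spell out the conditional P(f0 | m0, m3, m4), so the sketch below assumes one plausible model: positive feedback is certain when all mappings are correct, impossible when exactly one is wrong, and occurs with a small probability `delta` when two or more errors happen to compensate. The priors and `delta` are illustrative values, not figures from the talk.

```python
from itertools import product


def p_feedback_positive(states, delta):
    """Assumed conditional P(f0 = positive | mapping states)."""
    wrong = states.count(False)
    if wrong == 0:
        return 1.0   # all mappings correct: the cycle returns the query intact
    if wrong == 1:
        return 0.0   # a single error cannot be compensated
    return delta     # several errors may cancel out by chance


def posterior_first_correct(priors, delta):
    """P(m0 = correct | f0 = positive), summing out the other mappings."""
    joint_first_correct = 0.0
    evidence = 0.0
    for states in product([True, False], repeat=len(priors)):
        weight = p_feedback_positive(states, delta)
        for correct, prior in zip(states, priors):
            weight *= prior if correct else 1.0 - prior
        evidence += weight
        if states[0]:
            joint_first_correct += weight
    return joint_first_correct / evidence


# Uninformative priors for m0, m3, m4.
print(posterior_first_correct([0.5, 0.5, 0.5], delta=0.1))
```

With uninformative priors, a single positive cycle already raises the belief in m0 from 0.5 to 11/14 under this model.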
A Brief Intro to Factor-Graphs • $g(x_1, x_2, x_3, x_4) = f_A(x_1, x_2)\, f_B(x_2, x_3, x_4)$
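For the factorization g(x1, x2, x3, x4) = fA(x1, x2) fB(x2, x3, x4), the factor graph is a tree, so sum-product yields exact marginals. The sketch below checks this against brute-force summation for binary variables; the factor tables `f_a` and `f_b` are arbitrary illustrative choices.

```python
from itertools import product

VALUES = (0, 1)  # binary variables


def f_a(x1, x2):
    """Illustrative factor over (x1, x2)."""
    return 0.9 if x1 == x2 else 0.1


def f_b(x2, x3, x4):
    """Illustrative factor over (x2, x3, x4)."""
    return 1.0 + x2 + 2 * x3 * x4


def marginal_x1_brute_force():
    """Sum g over x2, x3, x4 for each value of x1, then normalize."""
    totals = [0.0, 0.0]
    for x1, x2, x3, x4 in product(VALUES, repeat=4):
        totals[x1] += f_a(x1, x2) * f_b(x2, x3, x4)
    z = sum(totals)
    return [t / z for t in totals]


def marginal_x1_sum_product():
    """Same marginal via messages fB -> x2 -> fA -> x1.

    The leaf variables x3 and x4 send unit messages, so the message
    from fB to x2 is simply fB summed over x3 and x4.
    """
    msg_b_to_x2 = [sum(f_b(x2, x3, x4) for x3, x4 in product(VALUES, repeat=2))
                   for x2 in VALUES]
    msg_a_to_x1 = [sum(f_a(x1, x2) * msg_b_to_x2[x2] for x2 in VALUES)
                   for x1 in VALUES]
    z = sum(msg_a_to_x1)
    return [m / z for m in msg_a_to_x1]


print(marginal_x1_brute_force(), marginal_x1_sum_product())
```

The message-passing version touches each factor once, whereas the brute-force sum grows exponentially with the number of variables.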
Deriving PDMS Factor-Graphs Abductive reasoning on transitive closures of mappings a priori information on mapping
PDMS Factor-Graphs • Cyclic graph • Junction Tree? Clustering / Stretching of variables? • Not applicable (decentralization) • Iterative Sum-Product • Approximate results • How to perform iterative sum-product by message passing on the mapping graph? • Message passing in factor graph does not correspond to connectivity of mapping graph • We want to rely on decentralized computations only • Locality VS Globality of nodes in the factor graph • Mappings: local • Feedback factor: common, global knowledge • Observed feedback variables: neighborhood
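For reference, the standard sum-product update rules that the iterative scheme repeats until the messages stabilize are given below. The notation is the usual textbook one and is an assumption here, not taken from the slides: $n(\cdot)$ denotes the neighbors of a node in the factor graph, and $X_f$ the variables attached to factor $f$.

```latex
% Variable-to-factor message
\mu_{x \to f}(x) = \prod_{h \in n(x) \setminus \{f\}} \mu_{h \to x}(x)

% Factor-to-variable message
\mu_{f \to x}(x) = \sum_{X_f \setminus \{x\}} f(X_f)
                   \prod_{y \in n(f) \setminus \{x\}} \mu_{y \to f}(y)

% (Approximate, on cyclic graphs) marginal of a variable
P(x) \propto \prod_{h \in n(x)} \mu_{h \to x}(x)
```

On a cyclic graph such as a PDMS factor graph these updates are iterated from some initialization, and the resulting marginals are approximations.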
Message Passing • Decentralized computations • Computationally inexpensive • Sums and Products • Message-Passing Schedules • Periodic • Lazy (piggybacking on query forwarding) • No message overhead
Implemented System • Schemas • Import from OWL (Web Ontology Language) • Mappings • KnowledgeWeb Ontology Alignment API • Import from RDF/XML • Automated on-the-fly creation • Comparison to standard alignments Automatic derivation of quality measures P(m=correct | {F}) for the mappings using iterative message-passing
2.4. Aspects of Self-Organization • What can we do once we have gathered quality measures about the mappings? • Routing queries • Correcting mappings • Analyzing global properties of the graph
2.4.1. Self-Healing Semantic Networks • Two types of self-organization • Static network • Self-organizing dissemination of queries • Dynamic network • Self-organizing network of mappings • Idea: • Quality evaluation of mappings through Semantic Gossiping • Modify low quality links • Reorganized network leads to different quality evaluation • Dynamic network changes lead to a self-organizing, self-referential semantic network
Some Results (1) • Sensitivity to TTL • (cycle analysis only, 25 schemas, 4 concepts)
Some Results (2) • Scalability • (results analysis only, 4 concepts, TTL=3, misclassification rate=0.1, 2 documents/peer on avg.)
2.4.2. Analyzing Semantic Interoperability in the Large • What about interoperability at a global scale? • Modeling semantic interoperability: • Schema-to-Schema Graph • Logical model • Directed • Weighted • Redundant • The semantic connectivity graph • Idea: as for physical network analyses, define a connectivity layer • Unweighted, non-redundant version of the Schema-to-Schema graph
Semantic Interoperability in the Large • Definition: Peers in a set $P_s$ are semantically interoperable iff $S_s$ is strongly connected, with $S_s = \{s \mid \exists p \in P_s,\ p \leftrightarrow s\}$ • Observation 1: A set of peers $P_s$ cannot be semantically interoperable if $|E_s| < |V_s|$ • Observation 2: A set of peers $P_s$ is semantically interoperable if $|E_s| > |V_s|(|V_s|-1) - (|V_s|-1)$ • What happens between these two bounds?
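The definition and the two bounds can be checked on small graphs with the sketch below. It assumes the schema graph is given as a vertex list plus directed edges without self-loops or duplicates, and that there are at least two schemas; the function names are hypothetical.

```python
def reachable(start, edges):
    """Set of vertices reachable from start by following directed edges."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for src, dst in edges:
            if src == node and dst not in seen:
                seen.add(dst)
                stack.append(dst)
    return seen


def strongly_connected(vertices, edges):
    """True iff every vertex reaches, and is reached from, an arbitrary root."""
    vertices = set(vertices)
    if not vertices:
        return True
    root = next(iter(vertices))
    reversed_edges = [(dst, src) for src, dst in edges]
    return (reachable(root, edges) == vertices
            and reachable(root, reversed_edges) == vertices)


def bound_verdict(n_vertices, n_edges):
    """Apply Observations 1 and 2 using only the vertex and edge counts."""
    if n_edges < n_vertices:
        return "cannot be interoperable"
    if n_edges > n_vertices * (n_vertices - 1) - (n_vertices - 1):
        return "must be interoperable"
    return "undetermined"


ring = [("a", "b"), ("b", "c"), ("c", "a")]
print(strongly_connected("abc", ring), bound_verdict(3, len(ring)))
```

The three-schema ring is interoperable even though its edge count falls in the undetermined zone, which is exactly the region the question on the slide is about.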
A Necessary Condition for interoperability in the large • Analyzing semantic interoperability in large-scale, decentralized networks • Percolation theory for directed graphs • Based on a recent graph-theoretic framework • Random graphs with specific degree distributions $p_{jk}$, clustering coefficient $cc$ and bidirectionality coefficient $bc$ • Based on generating functions • Distribution of edges from first- to second-order neighbors • Necessary condition for semantic interoperability in the large: $\sum_{j,k} (jk - j(bc + cc) - k)\, p_{jk} \ge 0$ • Also: approximations of the size of semantically interoperable clusters
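The necessary condition, the sum over j, k of (jk - j(bc + cc) - k) p_jk being non-negative, is cheap to evaluate once the joint degree distribution is known. The sketch below assumes that j is a vertex's in-degree and k its out-degree, and that bc and cc are supplied by the caller; both helper names are hypothetical.

```python
from collections import Counter


def degree_distribution(vertices, edges):
    """Empirical joint distribution p_jk of (in-degree j, out-degree k)."""
    in_degree = Counter(dst for _, dst in edges)
    out_degree = Counter(src for src, _ in edges)
    counts = Counter((in_degree[v], out_degree[v]) for v in vertices)
    return {jk: count / len(vertices) for jk, count in counts.items()}


def connectivity_indicator(p_jk, bc, cc):
    """Left-hand side of the necessary condition; non-negative values pass."""
    return sum((j * k - j * (bc + cc) - k) * p for (j, k), p in p_jk.items())


# A directed ring sits exactly at the threshold: every vertex has j = k = 1.
ring_vertices = ["a", "b", "c", "d"]
ring_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
p_ring = degree_distribution(ring_vertices, ring_edges)
print(p_ring, connectivity_indicator(p_ring, bc=0.0, cc=0.0))
```

Because the indicator only needs degree statistics, it can be estimated from samples of the network rather than from the full graph.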
Some results (Poisson-distributed random graph, 10 000 vertices)
Analysis of a real system • Analysis of the Sequence Retrieval System (SRS) • Commercial information indexing and retrieval system for bioinformatic libraries • Schemas described in a custom language (Icarus) • Mappings (foreign keys) from one database to others • Crawling the EBI repository • 388 databanks • 518 (undirected) links • Power-law distribution of node degrees • Clustering coefficient = 0.32 • Diameter = 9 • Connectivity indicator ci = 25.4 • Super-critical state • Size of the giant component • 0.47 (derived) VS 0.48 (observed)
Query dissemination in weighted networks • Per-hop forwarding behaviors • Only forward if $w_i \ge \tau$ • $\tau = 0$: flooding • $\tau = 1$: exact answers • Degree distribution similar to SRS • Uniformly distributed weights between 0 and 1
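The per-hop forwarding rule can be sketched as a traversal that follows a link only when its weight reaches a threshold, called `tau` here as an assumed name. The tiny weighted graph is made up for illustration.

```python
def disseminate(start, weighted_edges, tau):
    """Peers reached when a query is forwarded only over links with w >= tau."""
    reached = {start}
    frontier = [start]
    while frontier:
        node = frontier.pop()
        for src, dst, weight in weighted_edges:
            if src == node and weight >= tau and dst not in reached:
                reached.add(dst)
                frontier.append(dst)
    return reached


# Hypothetical mapping links with quality weights in [0, 1].
EDGES = [("a", "b", 1.0), ("b", "c", 0.4), ("a", "d", 0.7)]

for tau in (0.0, 0.5, 1.0):
    print(tau, sorted(disseminate("a", EDGES, tau)))
```

With a threshold of 0 the query floods the whole network; with a threshold of 1 it only crosses links that are fully trusted.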
Local view on global properties? SRS-like distribution, 1000 vertices, 4000 edges
3. Applications • P-Grid • A peer-to-peer access structure • GridVine • Self-organizing semantic overlay network • PicShark • Self-organizing middleware to export pictures and create mappings
Standard data management over overlay networks • Hard problem? • Strictly speaking impossible • CAP theorem: pick at most two of the following: • Consistency • Availability • Tolerance to network Partitions • Practical compromises: • E.g., Relaxing ACID properties S. Gilbert and N. Lynch: Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services, ACM SIGACT News, 33(2), 2002.
3.0. The P-Grid Access Structure • Distributed, virtual binary search tree • complete decentralization • self-organization • efficient search • Gridella: a DHT based on P-Grid • decentralized load balancing • updates • replication • management of dynamic IP addresses and identities • Used in several large-scale research projects • MICS • Alvis • Bricks • Evergrow
[Figure: example P-Grid trie over the key prefixes 0/1, 00/01/10/11 and 000 to 101, with 14 peers. Each peer is annotated with its peer identifier, its ID prefix, its routing table entries (e.g., "1 : 12, 13") and the data keys it stores (e.g., 2 = 0010). The example traces query(101) issued at peer 7.]
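Prefix routing in a virtual binary search tree can be sketched as follows. This is a simplified illustration under assumptions, not the P-Grid implementation: the four peers and their paths are hypothetical, routing tables are recomputed from global knowledge for brevity, and the first known reference is always chosen where a real system would pick among several replicas.

```python
# Hypothetical peers and the binary path (key-space prefix) each one covers.
PEERS = {"A": "00", "B": "01", "C": "10", "D": "11"}


def build_routing_table(peer):
    """For each level, peers whose path agrees up to that level, then differs."""
    path = PEERS[peer]
    table = {}
    for level in range(len(path)):
        flipped = "1" if path[level] == "0" else "0"
        complement = path[:level] + flipped
        table[level] = [p for p, q in PEERS.items() if q.startswith(complement)]
    return table


def route(key, peer):
    """Return the list of peers visited while resolving a binary key.

    Assumes the key is at least as long as every peer path.
    """
    hops = [peer]
    while True:
        path = PEERS[peer]
        mismatch = next((i for i in range(len(path)) if key[i] != path[i]), None)
        if mismatch is None:
            return hops  # this peer's path is a prefix of the key
        peer = build_routing_table(peer)[mismatch][0]
        hops.append(peer)


print(route("101", "A"))
print(route("110", "A"))
```

Each hop extends the common prefix between the key and the current peer's path by at least one bit, which is why lookups take a logarithmic number of hops in a balanced tree.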
3.1. GridVine • Building large-scale semantic systems • Self-organizing semantic overlay network • Principle of data independence • Scalable physical layer • Semantic logical layer
[Figure: layered architecture, from top to bottom: Semantic Mediation Layer, Overlay Layer, "Physical" layer; each pair of adjacent layers is labeled Correlated / Uncorrelated]
Features • Self-organized, scalable, decentralized • Resolves key-based searches in O(log(n)) even for unbalanced trees • Semantic Web compliant • RDF triples, RDFS schemas, OWL mappings • Structured searches • Simple Triple Patterns, RDQL queries • Query resolution: iterative, distributed table lookup • Semantic Gossiping • Tradeoff precision / recall • Automatic reformulation of queries • One of the building-blocks of the Social Semantic Desktop • Nepomuk, large-scale EU research project
Indexing semi-structure in GridVine • Soft-states • Each triple has an expiration time (cf. CAP theorem) • Locality-preserving hash-function • Range searches Triple t = <lsir:GridVine> <dc:creator> <lsir:pcm> Put(Hash(lsir:GridVine), t) Put(Hash(dc:creator), t) Put(Hash(lsir:pcm), t)
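The three Put operations, and how a triple pattern is then resolved, can be sketched with an in-memory stand-in for the overlay. Several simplifications are assumed: `ToyDHT` is a hypothetical dictionary-backed class, the SHA-1 based `key_for` is not locality-preserving (so range searches are not modeled), and soft-state expiration is left out.

```python
import hashlib


class ToyDHT:
    """In-memory stand-in for a distributed hash table."""

    def __init__(self):
        self.store = {}

    def put(self, key, value):
        self.store.setdefault(key, []).append(value)

    def get(self, key):
        return list(self.store.get(key, []))


def key_for(term):
    """Assumed hash function; a real deployment would preserve locality."""
    return hashlib.sha1(term.encode("utf-8")).hexdigest()[:8]


def insert_triple(dht, triple):
    """Index the triple three times: by subject, predicate and object."""
    for term in triple:
        dht.put(key_for(term), triple)


def match(dht, pattern):
    """Resolve a triple pattern; None marks a variable.

    Looks up one constant term, then filters the candidates locally.
    Requires at least one constant in the pattern.
    """
    constant = next(term for term in pattern if term is not None)
    return [t for t in dht.get(key_for(constant))
            if all(p is None or p == v for p, v in zip(pattern, t))]


dht = ToyDHT()
insert_triple(dht, ("lsir:GridVine", "dc:creator", "lsir:pcm"))
print(match(dht, (None, "dc:creator", None)))
```

Indexing each triple under all three terms triples the storage cost but lets any pattern with at least one constant be answered by a single key lookup.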
Semantic Integration in GridVine (1) • Vertical integration: hierarchy of classes • Fostering semantic interoperability through reuse of conceptualizations • Simple, user-oriented constructs (cf. GUI) • Few, popular base classes bootstrapping interoperability through properties • RDFS entailment can be materialized for the instances
Semantic integration in GridVine (2) • Horizontal integration: mappings • Simple links relating properties • Cycle + feedback analysis to get probabilistic guarantees • Automatic Semantic Gossiping
Traversals of the Semantic Overlay Network • GridVine: structured P2P network • No more constraints on gossiping • Different query forwarding paradigms • Iterative forwarding • Recursive forwarding
3.2. PicShark • Where do the mappings come from? • Middleware for sharing semi-structured metadata attached to pictures and creating mappings [Figure: PicShark architecture. Components: Features Extractor (60 moments), Metadata Extractor (XMP, PSP, WinFS), Information Tracker; Insert and Retrieve operations against a (Distributed) Hashtable (e.g., GridVine)]
Features • Self-Organization of mappings • Based on low-level features extracted from • Picture (color moment, textures) • Structured Metadata (lexicographical analysis) • Self-Organization of annotations • Probabilistic propagation of annotations between similar individuals • Self-Organization of query dissemination • Schema distance based on probabilistic subsumption • Propagation within a certain diameter • Driven by user interaction • Scalable • Computationally expensive operations are local at the peers • Only simple in-network operations (look-ups) • (on-going) collaborative effort with Microsoft Research + MICS
4. Conclusions • Fundamental issue: Data interoperability and management in large scale decentralized environments • Content Sharing • Information search • Semantic Web? • Traditional techniques are not sufficient • Scale • Autonomy • Uncertainty • Self-organizing, decentralized stochastic processes • Data Indexing • Data Integration • Semantics as agreement • Abductive reasoning (on transitive closures of mappings) • Query dissemination