
Data Management in Large-Scale Decentralized Information Systems

Explore the challenges and techniques for data integration and sharing in decentralized systems, including picture sharing and semantic interoperability. Discover how peer data management systems and self-organizing networks can enhance data management in large-scale networks.

aayala

Presentation Transcript


  1. IBM T.J. Watson 02.06.06 Data Management in Large-Scale Decentralized Information Systems Philippe Cudré-Mauroux, EPFL Joint work with: Karl Aberer (advisor @ EPFL)

  2. Overview • Motivation • Picture Sharing in Decentralized Settings • Data Integration in large-scale networks • Peer Data Management Systems • Semantic Gossiping • Probabilistic Message-Passing • Aspects of self-organization • Self-Healing semantic networks • Analyzing semantic interoperability in the large • Applications 0. P-Grid • GridVine • PicShark • Conclusions & Future Work

  3. 1. Motivation: Picture Sharing • Profusion of Digital Images • Variety of powerful devices • Gigabytes of pictures are the new norm • Most of the images are kept local • Some are shared • Mostly point-to-point • Primitive search capabilities [Figure labels: MMS, HTTP, SMTP]

  4. Opportunity • More and more software uses metadata to organize images locally • (Semi) Structured metadata (e.g., XML, PSA) • Ontological metadata (e.g., RDF, XMP) • Type-based metadata (e.g., WinFS) <?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?> <x:xapmeta xmlns:x='adobe:ns:meta/'> <rdf:RDF xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'> <rdf:Description about='' xmlns:xap='http://ns.adobe.com/xap/1.0/'> <xap:CreateDate>2001-12-19T18:49:03Z</xap:CreateDate> <xap:ModifyDate>2001-12-19T20:09:28Z</xap:ModifyDate> <xap:Creator> John Doe </xap:Creator> </rdf:Description> …

  5. Hurdle: Metadata Heterogeneity • Why not take advantage of this metadata in a distributed setting? X Syntactic discrepancies X Semantic heterogeneity • All the aforementioned standards are extensible • Shared representation is not enough VS <es:cDate> 05/08/2004 </es:cDate> <rdf:Property rdf:ID="Length-Y"> <rdfs:label>Length-Y</rdfs:label> <rdfs:subPropertyOf rdf:resource="#length"/> </rdf:Property> VS <rdf:Property rdf:ID="width"> <rdfs:label>Width</rdfs:label> <rdfs:subPropertyOf rdf:resource="#length"/> </rdf:Property>

  6. Beyond Keyword Search • Searching semantically richer objects in large-scale heterogeneous networks • date? <xap:CreateDate>2001-12-19T18:49:03Z</xap:CreateDate> <xap:ModifyDate>2001-12-19T20:09:28Z</xap:ModifyDate> <es:DofCreation> 05/08/2004 </es:DofCreation> <myRDF:Date> Jan 1, 2005 </myRDF:Date>

  7. 2. Data Integration in large-scale systems • Distributed Databases • Number of sources < 100 • Consistent data • Coordination • Structured data • E.g., Relational data model • Integrity constraints • Transactions • Powerful queries • E.g., SQL, aggregation • Schemas created by administrators • Relatively fixed topology VS • Large-Scale Information Systems (e.g., WWW) • Number of sources > 1000 • Unreliable data • Autonomy • Semi-structured data • E.g., XML/RDF • No integrity constraints • No transactions • Simple SP queries • E.g., triple patterns, ranking • Schemas created by end users • Network churn

  8. Data Integration: LAV/GAV • Traditional database techniques (e.g., LAV/GAV) rely on centralized schemas to integrate data sources • Not applicable to our context • Scale (upper ontologies?) • Churn • Autonomy • How can we foster semantic interoperability in decentralized settings? [Figure: local attributes myDate and yourDate mapped to a central attribute Date, with m(myDate) = Date and m(yourDate) = Date]

  9. Semantic Interoperability • Extending semantic interoperability techniques to decentralized settings • Photoshop (own schema): <Photoshop_Image> <GUID>178A8CD8865</GUID> <Creator>Robinson</Creator> <Subject> <Bag> <Item>Tunbridge Wells</Item> <Item>Royal Council</Item> </Bag> </Subject> … </Photoshop_Image> • WinFS (known schema): <WinFSImage> <GUID>178A8CD8866</GUID> <Author> <DisplayName>Henry Peach Robinson</DisplayName> <Role>Photographer</Role> </Author> <Keyword>Tunbridge</Keyword> <Keyword>Council</Keyword> … </WinFSImage> • Mapping: T12 = <Photoshop_Image> <GUID>$fs/GUID</GUID> <Creator>$fs/Author/DisplayName</Creator> </Photoshop_Image> FOR $fs IN /WinFSImage • Q1 = <GUID>$p/GUID</GUID> FOR $p IN /Photoshop_Image WHERE $p/Creator LIKE "%Robi%" • Q2 = <GUID>$p/GUID</GUID> FOR $p IN T12 WHERE $p/Creator LIKE "%Robi%"
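
The rewriting of Q1 into Q2 on the slide above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `t12` and `q1` are hypothetical, the XQuery-like view is approximated with the standard-library `xml.etree.ElementTree`, and `LIKE "%Robi%"` is approximated by a substring test.

```python
import xml.etree.ElementTree as ET

# The WinFS record from the slide, with the elided parts ("…") left out.
WINFS_DOC = """
<WinFSImage>
  <GUID>178A8CD8866</GUID>
  <Author>
    <DisplayName>Henry Peach Robinson</DisplayName>
    <Role>Photographer</Role>
  </Author>
  <Keyword>Tunbridge</Keyword>
  <Keyword>Council</Keyword>
</WinFSImage>
"""


def t12(winfs_image):
    """View T12 (hypothetical helper): expose a WinFSImage as a Photoshop_Image."""
    out = ET.Element("Photoshop_Image")
    ET.SubElement(out, "GUID").text = winfs_image.findtext("GUID")
    ET.SubElement(out, "Creator").text = winfs_image.findtext("Author/DisplayName")
    return out


def q1(photoshop_images, needle="Robi"):
    """Q1: GUIDs of the images whose Creator contains the given substring."""
    return [img.findtext("GUID") for img in photoshop_images
            if needle in (img.findtext("Creator") or "")]


winfs_images = [ET.fromstring(WINFS_DOC)]
# Q2 is Q1 evaluated over the view T12 instead of over native Photoshop data.
q2_result = q1([t12(img) for img in winfs_images])
print(q2_result)
```

The query itself is left untouched; only its input changes, which is what lets a peer answer queries posed against a schema it does not store.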

  10. 2.1. Peer Data Management Systems • Pairwise mappings • Peer Data Management Systems (PDMS) • Local mappings overcome global heterogeneity • Iterative query rewriting [Figure: pairwise mappings linking the date attributes es:cDate, xap:CreateDate, xap:ModifyDate and myRDF:Date of three schemas, e.g. <xap:CreateDate>2001-12-19T18:49:03Z</xap:CreateDate>, <es:cDate> 05/08/2004 </es:cDate>, <myRDF:Date> Jan 1, 2005 </myRDF:Date>; other labels: weather, article]

  11. Problem: Precision/Recall Tradeoff • Semantic Query routing • To whom shall I forward a query posed against my local schema? • Some (most) mappings will be (partially) faulty • Low expressive power of mappings • samePropertyAs / sameClassAs / subclassOf • … or even worse (Microformats) • Automatic schema alignment techniques • Different views on conceptualizations • Local query resolution • Low recall • Flooding (PDMS so far) • Low precision • Standard deductive integration is not sufficient • Uncertainty on mappings and conceptualizations

  12. 2.2. Semantic Gossiping • Local translations enabling global agreements • Selective query forwarding paradigm • Syntactic distances • Lost predicates • Semantic distances • Results analysis • Cycles analysis → Precision/Recall tradeoff

  13. Semantic Gossiping • Selectively reformulate queries through mapping links [Figure: reformulations of one query across schemas R1 to R5, some marked X: πTitle σCreature=Joe (R5) X πTitle σCreator=Joe (R3) πTitle σAuthor=Joe (R2) πTitle σCreator=Joe (R4) πTitre σAuteur=Joe (R1) X σAuthor=Joe (R4)]
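
A minimal sketch of selective reformulation, assuming attribute-level mappings and only the syntactic criterion (lost predicates) from the previous slide. The mapping table, the similarity measure (fraction of query attributes that survive translation) and the names `reformulate` and `gossip` are illustrative assumptions, not taken from the talk.

```python
from collections import deque

# Hypothetical attribute mappings between schemas; (source, target) -> renaming.
MAPPINGS = {
    ("R1", "R2"): {"Titre": "Title", "Auteur": "Author"},
    ("R2", "R3"): {"Title": "Title", "Author": "Creator"},
    ("R3", "R4"): {"Title": "Title", "Creator": "Creator"},
    ("R2", "R5"): {"Title": "Title"},  # Author has no counterpart: lost predicate
}


def reformulate(query, mapping):
    """Translate the query attributes; return (new query, syntactic similarity)."""
    translated = {role: mapping[attr] for role, attr in query.items() if attr in mapping}
    return translated, len(translated) / len(query)


def gossip(start, query, ttl, min_similarity):
    """Forward the query hop by hop, only over mappings that preserve enough of it."""
    reached = {start: query}
    queue = deque([(start, query, ttl)])
    while queue:
        schema, q, hops_left = queue.popleft()
        if hops_left == 0:
            continue
        for (src, dst), mapping in MAPPINGS.items():
            if src != schema or dst in reached:
                continue
            translated, similarity = reformulate(q, mapping)
            if similarity >= min_similarity:
                reached[dst] = translated
                queue.append((dst, translated, hops_left - 1))
    return reached


query = {"project": "Titre", "select": "Auteur"}
strict = gossip("R1", query, ttl=3, min_similarity=1.0)
lenient = gossip("R1", query, ttl=3, min_similarity=0.5)
print(sorted(strict), sorted(lenient))
```

Raising `min_similarity` favors precision (R5 never sees a query that lost its selection), while lowering it favors recall.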

  14. 2.3. Probabilistic Message Passing • Where do the mapping quality measures come from? • Link-based analysis of the PDMS • Automatically deriving quality measures for the mappings • Transitive closures on mapping operations • Mapping Cycles • Parallel Paths [Figure: mapping graph with mappings m0 to m5 and feedback f0]

  15. On Cycles / parallel paths • q: art/Creator? • q VS m3(m4(m0(q))) • art/Creator? VS art/creatDate? [Figure: cycle of mappings m0, m4, m3 with feedback f0]

  16. Computing a Marginal for one cycle (legend: unknown / observed variables) • $P(m_0, m_3, m_4, f_0) = P(m_0)\,P(m_3)\,P(m_4)\,P(f_0 \mid m_0, m_3, m_4)$ • $P(m_0 \mid f_0) = \sum_{m_3, m_4} P(m_0, m_3, m_4, f_0)\,P(f_0)^{-1}$ • But: feedbacks on different cycles are correlated • One wrong mapping will affect several cycles/paths • Need to express a global probabilistic model for the mapping graph
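
The marginal for a single cycle can be computed by brute-force enumeration, as in the sketch below. The slide does not give the conditional P(f0 | m0, m3, m4); the model used here is an assumption made for illustration: positive feedback is certain if all mappings are correct, impossible if exactly one is wrong, and occurs with probability `delta` if two or more errors happen to compensate. The priors and the value of `delta` are likewise illustrative.

```python
from itertools import product


def feedback_likelihood(states, delta):
    """Assumed P(f = positive | mapping states); True means the mapping is correct."""
    wrong = sum(1 for ok in states if not ok)
    if wrong == 0:
        return 1.0
    if wrong == 1:
        return 0.0
    return delta  # two or more errors may compensate each other


def posterior_first_mapping(priors, delta, positive=True):
    """P(m0 = correct | feedback) by summing the joint over the other mappings."""
    evidence = 0.0
    joint_m0_correct = 0.0
    for states in product([True, False], repeat=len(priors)):
        p = 1.0
        for ok, prior in zip(states, priors):
            p *= prior if ok else 1.0 - prior
        likelihood = feedback_likelihood(states, delta)
        p *= likelihood if positive else 1.0 - likelihood
        evidence += p
        if states[0]:
            joint_m0_correct += p
    return joint_m0_correct / evidence


priors = [0.5, 0.5, 0.5]  # P(m0), P(m3), P(m4) being correct, assumed uninformative
after_positive = posterior_first_mapping(priors, delta=0.1, positive=True)
after_negative = posterior_first_mapping(priors, delta=0.1, positive=False)
print(after_positive, after_negative)
```

Under these assumptions a positive cycle raises the belief in m0 above its prior and a negative cycle lowers it. The enumeration is exponential in the cycle length and treats the cycle in isolation, which is why a global model is needed.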

  17. A Brief Intro to Factor-Graphs • g(x1, x2, x3, x4) = fA(x1, x2)fB(x2, x3, x4)
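
As a sketch of what the factorization buys, the marginal of x2 under g can be obtained from two local messages, one from each factor, instead of a sum over all four variables. The factor tables `f_a` and `f_b` below are arbitrary positive functions over binary variables, chosen only for illustration.

```python
from itertools import product

VALUES = (0, 1)  # all variables are assumed binary


def f_a(x1, x2):
    return 0.9 if x1 == x2 else 0.1


def f_b(x2, x3, x4):
    return 1.0 + x2 + 2 * x3 * x4


def normalize(weights):
    total = sum(weights)
    return [w / total for w in weights]


# Sum-product: each factor sends x2 a message that sums out its other variables.
msg_a = [sum(f_a(x1, x2) for x1 in VALUES) for x2 in VALUES]
msg_b = [sum(f_b(x2, x3, x4) for x3, x4 in product(VALUES, repeat=2)) for x2 in VALUES]
marginal_sp = normalize([a * b for a, b in zip(msg_a, msg_b)])

# Brute force: sum g(x1, x2, x3, x4) over everything except x2.
marginal_bf = normalize([
    sum(f_a(x1, x2) * f_b(x2, x3, x4) for x1, x3, x4 in product(VALUES, repeat=3))
    for x2 in VALUES
])
print(marginal_sp, marginal_bf)
```

Because this factor graph is a tree, the two results agree exactly; on cyclic graphs the same message updates are iterated and give approximate marginals.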

  18. Deriving PDMS Factor-Graphs • Abductive reasoning on transitive closures of mappings • A priori information on mappings

  19. PDMS Factor-Graphs • Cyclic graph • Junction Tree? Clustering / Stretching of variables? • Not applicable (decentralization) • Iterative Sum-Product • Approximate results • How to perform iterative sum-product by message passing on the mapping graph? • Message passing in factor graph does not correspond to connectivity of mapping graph • We want to rely on decentralized computations only • Locality VS Globality of nodes in the factor graph • Mappings: local • Feedback factor: common, global knowledge • Observed feedback variables: neighborhood

  20. Embedded Message-Passing (1)

  21. Embedded Message-Passing (2)

  22. Message Passing • Decentralized computations • Computationally inexpensive • Sums and Products • Message-Passing Schedules • Periodic • Lazy (piggybacking on query forwarding) • No message overhead

  23. Implemented System • Schemas • Import from OWL (Web Ontology Language) • Mappings • KnowledgeWeb Ontology Alignment API • Import from RDF/XML • Automated on-the-fly creation • Comparison to standard alignments → Automatic derivation of quality measures P(m=correct | {F}) for the mappings using iterative message-passing

  24. 2.4. Aspects of Self-Organization • What can we do once we have gathered quality measures about the mappings? • Routing queries () • Correcting mappings • Analyzing global properties of the graph

  25. 2.4.1. Self-Healing Semantic Networks • Two types of self-organization • Static network • Self-organizing dissemination of queries • Dynamic network • Self-organizing network of mappings • Idea: • Quality evaluation of mappings through Semantic Gossiping • Modify low quality links • Reorganized network leads to different quality evaluation • Dynamic network changes → self-organizing, self-referential semantic network

  26. Some Results (1) • Sensitivity to TTL • (cycle analysis only, 25 schemas, 4 concepts)

  27. Some Results (2) • Scalability • (results analysis only, 4 concepts, TTL=3, misclassification rate=0.1, 2 documents/peer on avg.)

  28. 2.4.2. Analyzing Semantic Inter. in the Large • What about interoperability at a global scale? • Modeling semantic interoperability: • The semantic connectivity graph • Idea: as for physical network analyses, define a connectivity layer • Unweighted, non-redundant version of the Schema-to-Schema graph • Schema-to-Schema Graph • Logical model • Directed • Weighted • Redundant

  29. Semantic Interoperability in the Large • Definition: Peers in a set $P_s$ are semantically interoperable iff $S_s$ is strongly connected, with $S_s = \{s \mid \exists p \in P_s,\ p \leftrightarrow s\}$ • Observation 1: A set of peers $P_s$ cannot be semantically interoperable if $|E_s| < |V_s|$ • Observation 2: A set of peers $P_s$ is semantically interoperable if $|E_s| > |V_s|(|V_s|-1) - (|V_s|-1)$ • What happens between these two bounds?
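
The definition and the two observations can be checked on small graphs with the sketch below. It assumes a simple directed graph (no self-loops, no parallel edges) over at least two schemas; the function names are hypothetical.

```python
def reachable(adjacency, start):
    """Set of vertices reachable from start by depth-first search."""
    seen = {start}
    stack = [start]
    while stack:
        vertex = stack.pop()
        for neighbor in adjacency.get(vertex, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return seen


def is_strongly_connected(vertices, edges):
    """True iff every vertex reaches, and is reached from, an arbitrary root."""
    vertices = set(vertices)
    forward, backward = {}, {}
    for u, v in edges:
        forward.setdefault(u, set()).add(v)
        backward.setdefault(v, set()).add(u)
    root = next(iter(vertices))
    return reachable(forward, root) >= vertices and reachable(backward, root) >= vertices


def edge_count_verdict(n_vertices, n_edges):
    """Apply Observations 1 and 2, which only look at |Vs| and |Es|."""
    if n_edges < n_vertices:
        return "impossible"
    if n_edges > n_vertices * (n_vertices - 1) - (n_vertices - 1):
        return "guaranteed"
    return "undetermined"


cycle = [("a", "b"), ("b", "c"), ("c", "a")]
chain = [("a", "b"), ("b", "c")]
print(is_strongly_connected("abc", cycle), edge_count_verdict(3, len(cycle)))
print(is_strongly_connected("abc", chain), edge_count_verdict(3, len(chain)))
```

The three-schema cycle is interoperable even though its edge count falls in the undetermined range, which is the gap the percolation analysis addresses.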

  30. A Necessary Condition for interoperability in the large • Analyzing semantic interoperability in large-scale, decentralized networks • Percolation theory for directed graphs • Based on a recent graph-theoretic framework • Random graphs with specific degree distributions $p_{jk}$, clustering coefficient $cc$ and bidirectionality coefficient $bc$ • Based on generating functions • Distribution of edges from first to second-order neighbors: • Necessary condition for semantic interoperability in the large: $\sum_{j,k} \left(jk - j(bc + cc) - k\right) p_{jk} \geq 0$ • Also: approximations of the size of semantically interoperable clusters
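
A minimal sketch of evaluating the left-hand side of the necessary condition on a concrete graph. It assumes that j is the in-degree and k the out-degree of a vertex, that p_jk is estimated empirically from an edge list, and that bc and cc are supplied by the caller; none of these conventions is spelled out on the slide.

```python
from collections import Counter


def joint_degree_distribution(vertices, edges):
    """Empirical p_jk: fraction of vertices with in-degree j and out-degree k."""
    in_degree, out_degree = Counter(), Counter()
    for source, target in edges:
        out_degree[source] += 1
        in_degree[target] += 1
    counts = Counter((in_degree[v], out_degree[v]) for v in vertices)
    return {jk: count / len(vertices) for jk, count in counts.items()}


def connectivity_sum(p_jk, bc=0.0, cc=0.0):
    """Sum over (j, k) of (jk - j(bc + cc) - k) * p_jk."""
    return sum((j * k - j * (bc + cc) - k) * p for (j, k), p in p_jk.items())


ring_vertices = ["a", "b", "c", "d"]
ring_edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")]
chain_vertices = ["a", "b", "c"]
chain_edges = [("a", "b"), ("b", "c")]

ring_value = connectivity_sum(joint_degree_distribution(ring_vertices, ring_edges))
chain_value = connectivity_sum(joint_degree_distribution(chain_vertices, chain_edges))
print(ring_value, chain_value)
```

With bc = cc = 0, the directed ring sits exactly on the boundary (value 0) and the chain falls below it (value -1/3), so only the ring satisfies the necessary condition.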

  31. Some results (Poisson-distributed random graph, 10 000 vertices)

  32. Analysis of a real system • Analysis of the Sequence Retrieval System (SRS) • Commercial information indexing and retrieval system for bioinformatic libraries • Schemas described in a custom language (Icarus) • Mappings (foreign keys) from one database to others • Crawling the EBI repository • 388 databanks • 518 (undirected) links • Power-law distribution of node degrees • Clustering coefficient = 0.32 • Diameter = 9 • Connectivity indicator ci = 25.4 • Super-critical state • Size of the giant component • 0.47 (derived) VS 0.48 (observed)

  33. Query dissemination in weighted networks • Per-hop forwarding behaviors • Only forward if $w_i \geq \tau$ • $\tau = 0$: flooding • $\tau = 1$: exact answers • Degree distribution similar to SRS • Uniformly distributed weights between 0 and 1
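
The per-hop forwarding rule can be sketched as a breadth-first traversal that only follows links whose weight reaches a threshold. The example graph, its weights and the parameter name `tau` are illustrative assumptions.

```python
from collections import deque


def disseminate(weighted_edges, start, tau):
    """Return the peers reached when a link is followed only if its weight >= tau."""
    adjacency = {}
    for (source, target), weight in weighted_edges.items():
        adjacency.setdefault(source, []).append((target, weight))
    reached = {start}
    queue = deque([start])
    while queue:
        peer = queue.popleft()
        for neighbor, weight in adjacency.get(peer, []):
            if weight >= tau and neighbor not in reached:
                reached.add(neighbor)
                queue.append(neighbor)
    return reached


# Hypothetical mapping graph; each weight is the estimated quality of a mapping.
EDGES = {("A", "B"): 1.0, ("B", "C"): 0.6, ("A", "D"): 0.2, ("D", "E"): 1.0}

flooding = disseminate(EDGES, "A", tau=0.0)
selective = disseminate(EDGES, "A", tau=0.5)
exact = disseminate(EDGES, "A", tau=1.0)
print(sorted(flooding), sorted(selective), sorted(exact))
```

Note that E is lost at tau = 0.5 even though the link D to E is perfect, because the only path to it crosses a low-quality mapping.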

  34. Local view on global properties? SRS-like distribution, 1000 vertices, 4000 edges

  35. 3. Applications 0. P-Grid • A peer-to-peer access structure • GridVine • Self-organizing semantic overlay network • PicShark • Self-organizing middleware to export pictures and create mappings

  36. Standard data management over overlay networks • Hard problem? • Strictly speaking impossible • CAP theorem: pick at most two of the following: • Consistency • Availability • Tolerance to network Partitions • Practical compromises: • E.g., Relaxing ACID properties S. Gilbert and N. Lynch: Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services, ACM SIGACT News, 33(2), 2002.

  37. 3.0. The P-Grid Access Structure • Distributed, virtual binary search tree • complete decentralization • self-organization • efficient search • Gridella: a DHT based on P-Grid • decentralized load balancing • updates • replication • management of dynamic IP addresses and identities • Used in several large-scale research projects • MICS • Alvis • Bricks • Evergrow

  38. [Figure: example P-Grid virtual binary search tree over 14 peers, with leaf prefixes 000 to 101. Legend: peer identifier, ID, prefix, routing table entry (e.g., 1 : 12, 13), data keys (2 = 0010 etc.). The example shows the routing of query(101) issued at peer 7.]
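
Prefix routing over a virtual binary tree can be sketched as follows. This is a simplified toy, not the P-Grid implementation: the four peers and their paths are made up, each routing table keeps every matching peer rather than a few random references, and the first reference is always chosen so that the example is deterministic.

```python
# Hypothetical peers and the binary path (key-space prefix) each one covers.
PATHS = {"A": "00", "B": "01", "C": "10", "D": "11"}


def build_routing_table(peer):
    """Level i lists peers that share the first i bits and differ at bit i."""
    path = PATHS[peer]
    table = []
    for level, bit in enumerate(path):
        other_side = path[:level] + ("1" if bit == "0" else "0")
        table.append(sorted(p for p, q in PATHS.items() if q.startswith(other_side)))
    return table


ROUTING = {peer: build_routing_table(peer) for peer in PATHS}


def route(key, peer):
    """Forward the query until a peer whose path is a prefix of the key is reached."""
    hops = [peer]
    while not key.startswith(PATHS[peer]):
        path = PATHS[peer]
        level = next(i for i in range(len(path)) if key[i] != path[i])
        peer = ROUTING[peer][level][0]  # a real system would pick a random reference
        hops.append(peer)
    return hops


print(route("101", "A"))
print(route("110", "A"))
```

Each hop extends the prefix shared between the current peer and the key by at least one bit, so the number of hops is bounded by the path length.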

  39. 3.1. GridVine • Building large-scale semantic systems • Self-organizing semantic overlay network • Principle of data independence • Scalable physical layer • Semantic logical layer

  40. [Figure: layered architecture with a Semantic Mediation Layer, an Overlay Layer and a "Physical" layer; adjacent layers are labeled Correlated / Uncorrelated]

  41. GridVine: Annotating Content

  42. Features • Self-organized, scalable, decentralized • Resolves key-based searches in O(log(n)) even for unbalanced trees • Semantic Web compliant • RDF triples, RDFS schemas, OWL mappings • Structured searches • Simple Triple Patterns, RDQL queries • Query resolution: iterative, distributed table lookup • Semantic Gossiping • Tradeoff precision / recall • Automatic reformulation of queries • One of the building-blocks of the Social Semantic Desktop • Nepomuk, large-scale EU research project

  43. Indexing semi-structure in GridVine • Soft-states • Each triple has an expiration time (cf. CAP theorem) • Locality-preserving hash-function • Range searches Triple t = <lsir:GridVine> <dc:creator> <lsir:pcm> Put(Hash(lsir:GridVine), t) Put(Hash(dc:creator), t) Put(Hash(lsir:pcm), t)
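
The three `Put` operations can be sketched with a dictionary standing in for the overlay network. `ToyDHT`, `hash_key`, `insert_triple` and `lookup` are hypothetical names, and SHA-1 is used only as a placeholder hash: it is not locality-preserving, so this sketch does not support the range searches mentioned on the slide, and soft-state expiration is left out.

```python
import hashlib


class ToyDHT:
    """In-memory stand-in for the distributed hash table."""

    def __init__(self):
        self.store = {}

    def put(self, key, value):
        self.store.setdefault(key, set()).add(value)

    def get(self, key):
        return self.store.get(key, set())


def hash_key(term):
    # Placeholder hash; the real system is described as locality-preserving.
    return hashlib.sha1(term.encode("utf-8")).hexdigest()


def insert_triple(dht, triple):
    """Index the triple three times: by subject, by predicate and by object."""
    for term in triple:
        dht.put(hash_key(term), triple)


def lookup(dht, pattern):
    """Resolve a triple pattern; None is a wildcard, one term must be bound."""
    bound = next(term for term in pattern if term is not None)
    candidates = dht.get(hash_key(bound))
    return {t for t in candidates
            if all(p is None or p == v for p, v in zip(pattern, t))}


dht = ToyDHT()
t = ("lsir:GridVine", "dc:creator", "lsir:pcm")
insert_triple(dht, t)
by_object = lookup(dht, (None, "dc:creator", "lsir:pcm"))
no_match = lookup(dht, ("lsir:GridVine", "dc:title", None))
print(by_object, no_match)
```

Indexing each triple under all three terms lets any pattern with at least one bound term be answered with a single key lookup followed by local filtering.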

  44. Semantic Integration in GridVine (1) • Vertical integration: hierarchy of classes • Fostering semantic interoperability through reuse of conceptualizations • Simple, user-oriented constructs (cf. GUI) • Few, popular base classes bootstrapping interoperability through properties • RDFS entailment can be materialized for the instances

  45. Semantic integration in GridVine (2) • Horizontal integration: mappings • Simple links relating properties • Cycle + feedback analysis to get probabilistic guarantees → Automatic Semantic Gossiping

  46. Traversals of the Semantic Overlay Network • GridVine: structured P2P network • No more constraints on gossiping • Different query forwarding paradigms • Iterative forwarding • Recursive forwarding

  47. 3.2. PicShark • Where do the mappings come from? • Middleware for sharing semi-structured metadata attached to pictures and creating mappings [Architecture figure labels: PicShark; (Distributed) Hashtable (e.g., GridVine); Features Extractor (60 moments); Metadata Extractor; Information Tracker; Insert; Retrieve; PSP; XMP; WinFS]

  48. Features • Self-Organization of mappings • Based on low-level features extracted from • Picture (color moment, textures) • Structured Metadata (lexicographical analysis) • Self-Organization of annotations • Probabilistic propagation of annotations between similar individuals • Self-Organization of query dissemination • Schema distance based on probabilistic subsumption • Propagation within a certain diameter • Driven by user interaction • Scalable • Computationally expensive operations are local at the peers • Only simple in-network operations (look-ups) • (on-going) collaborative effort with Microsoft Research + MICS

  49. PicShark Prototype

  50. 4. Conclusions • Fundamental issue: Data interoperability and management in large-scale decentralized environments • Content Sharing • Information search • Semantic Web? • Traditional techniques are not sufficient • Scale • Autonomy • Uncertainty • Self-organizing, decentralized stochastic processes • Data Indexing • Data Integration • Semantics as agreement • Abductive reasoning (on transitive closures of mappings) • Query dissemination
