Data Management in Large-Scale Decentralized Information Systems

IBM T.J. Watson 02.06.06 Data Management in Large-Scale Decentralized Information Systems Philippe Cudré-Mauroux, EPFL Joint work with: Karl Aberer (advisor @ EPFL)

Overview • Motivation • Picture Sharing in Decentralized Settings • Data Integration in large-scale networks • Peer Data Management Systems • Semantic Gossiping • Probabilistic Message-Passing • Aspects of self-organization • Self-Healing semantic networks • Analyzing semantic interoperability in the large • Applications 0. P-Grid • GridVine • PicShark • Conclusions & Future Work

MMS HTTP SMTP 1. Motivation: Picture Sharing • Profusion of Digital Images • Variety of powerful devices • gigabytes of pictures is the new norm • Most of the images are kept local • Some are shared • Mostly point-to-point • Primitive search capabilities

Opportunity • More and more software use metadata to organize images locally • (Semi) Structured metadata (e.g., XML, PSA) • Ontological metadata (e.g., RDF, XMP) • Type-based metadata (e.g., WinFS) <?xpacket begin='' id='W5M0MpCehiHzreSzNTczkc9d'?> <x:xapmeta xmlns:x='adobe:ns:meta/'> <rdf:RDF xmlns:rdf= 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'> <rdf:Description about='' xmlns:xap='http://ns.adobe.com/xap/1.0/'> <xap:CreateDate>2001-12-19T18:49:03Z</xap:CreateDate> <xap:ModifyDate>2001-12-19T20:09:28Z</xap:ModifyDate> <xap:Creator> John Doe </xap:Creator> </rdf:Description> …

Hurdle: Metadata Heterogeneity • Why not taking advantage of those metadata in a distributed setting? X Syntactic discrepancies X Semantic heterogeneity • All the aforementioned standards are extensible • Shared representation is not enough VS <es:cDate> 05/08/2004 </es:cDate> <rdf:Property rdf:ID=“Length-Y"> <rdfs:label>Length-Y</rdfs:label> <rdfs:subPropertyOf rdf:resource="#length"/> </rdf:Property> <rdf:Property rdf:ID="width"> <rdfs:label>Width</rdfs:label> <rdfs:subPropertyOf rdf:resource="#length"/> </rdf:Property> VS

Beyond Keyword Search searching semantically richer objects in large scale heterogeneous networks <xap:CreateDate>2001-12-19T18:49:03Z</xap:CreateDate> <xap:ModifyDate>2001-12-19T20:09:28Z</xap:ModifyDate> date? <es:DofCreation> 05/08/2004 </es:DofCreation> ? ? ? ? ? <myRDF:Date> Jan 1, 2005 </myRDF:Date>

VS 2. Data Integration in large-scale systems • Distributed Databases • Number of sources < 100 • Consistent data • Coordination • Structured data • E.g., Relational data model • Integrity constraints • Transactions • Powerful queries • E.g., SQL, aggregation • Schemas created by administrators • Relatively Fixed topology • Large Scale Information Systems (e.g., WWW) • Number of sources > 1000 • Unreliable data • Autonomy • Semi-structured data • E.g., XML/RDF • No integrity constraints • No transactions • Simple SP queries • E.g., triple patterns, ranking • Schemata created by end users • Network churn

Data Integration: LAV/GAV • Traditional database techniques (e.g., LAV/GAV) rely on centralizedschemas to integrate data sources • Not applicable to our context • Scale (upper ontologies?) • Churn • Autonomy • How can we foster semantic interoperability in decentralized settings? Date m(myDate) = Date m(yourDate) = Date myDate yourDate

Semantic Interoperability Q2=<GUID>$p/GUID</GUID> FOR $p IN T12WHERE $p/Creator LIKE "%Robi%" Q1=<GUID>$p/GUID</GUID> FOR $p IN /Photoshop_Image WHERE $p/Creator LIKE "%Robi%" Extending semantic interoperability techniques to decentralized settings Photoshop (own schema) WinFS (known schema) <Photoshop_Image> <GUID>178A8CD8865</GUID> <Creator>Robinson</Creator> <Subject> <Bag> <Item> Tunbridge Wells</Item> <Item>Royal Council</Item> </Bag> </Subject> … </Photoshop_Image> <WinFSImage> <GUID>178A8CD8866</GUID> <Author> <DisplayName> Henry Peach Robinson <DisplayName> <Role>Photographer</Role> <Author> <Keyword> Tunbridge </Keyword> <Keyword>Council</Keyword> … </WinFSImage> T12 = <Photoshop_Image> <GUID>$fs/GUID</GUID> <Creator> $fs/Author/DisplayName </Creator></Photoshop_Image>FOR $fs IN /WinFSImage

<xap:CreateDate>2001-12-19T18:49:03Z</xap:CreateDate> <xap:ModifyDate>2001-12-19T20:09:28Z</xap:ModifyDate> date? <es:cDate> 05/08/2004 </es:cDate> myRDF:Date xap:ModifyDate es:cDate  myRDF:Date <myRDF:Date> Jan 1, 2005 </myRDF:Date> 2.1. Peer Data Management Systems • Pairwise mappings • Peer Data Management Systems (PDMS) • Local mappings overcome global heterogeneity • Iterative query rewriting es:cDate  xap:CreateDate weather article

Problem: Precision/Recall Tradeoff • Semantic Query routing • To whom shall I forward a query posed against my local schema? • Some (most) mappings will be (partially) faulty • Low expressive power of mappings • samePropertyAs / sameClassAs / subclassOf • … or event worse (Microformats) • Automatic schema alignment techniques • Different views on conceptualizations • Local query resolution • Low recall • Flooding (PDMS so far) • Low precision • Standard deductive integration is not sufficient • Uncertainty on mappings and conceptualizations

2.2. Semantic Gossiping • Local translations enabling global agreements • Selective query forwarding paradigm • Syntactic distances • Lost predicates • Semantic distances • Results analysis • Cycles analysis  Precision/Recall tradeoff

Semantic Gossiping • Selectively reformulate queries through mapping links πTitleCreature=Joe (R5) X πTitleCreator=Joe (R3) πTitleAuthor=Joe (R2) πTitleCreator=Joe (R4) πTitreAuteur=Joe (R1) X  Author=Joe (R4)

2.3. Probabilistic Message Passing • Where do the mapping quality measures come from? • Link-based analysis of the PDMS • -Automatically deriving quality measures for the mappings • Transitive Closures on mapping operations • Mapping Cycles • Parallel Paths m0 m1 m4 m5 f0 m2 m3

On Cycles / parallel paths m0 q:art/Creator? m4 f0 m3 qVSm3(m4(m0(q))) art/Creator? VS art/creatDate?

Computing a Marginal for one cycle unknown observed • P(m0, m3, m4, f0) = P(m0) P(m3) P(m4) P(f0 | m0, m3, m4,) • P(m0| f0)= m3, m4, P(m0, m3, m4, f0) P(f0)-1 • But: feedbacks on different cycles are correlated • One wrong mapping will affect several cycles/paths • Need to express a global probabilistic model for the mapping graph

A Brief Intro to Factor-Graphs • g(x1, x2, x3, x4) = fA(x1, x2)fB(x2, x3, x4)

Deriving PDMS Factor-Graphs Abductive reasoning on transitive closures of mappings a priori information on mapping

PDMS Factor-Graphs • Cyclic graph • Junction Tree? Clustering / Stretching of variables? • Not applicable (decentralization) • Iterative Sum-Product • Approximate results • How to perform iterative sum-product by message passing on the mapping graph? • Message passing in factor graph does not correspond to connectivity of mapping graph • We want to rely on decentralized computations only • Locality VS Globality of nodes in the factor graph • Mappings: local • Feedback factor: common, global knowledge • Observed feedback variables: neighborhood

Embedded Message-Passing (1)

Embedded Message-Passing (2)

Message Passing • Decentralized computations • Computationally inexpensive • Sums and Products • Message-Passing Schedules • Periodic • Lazy (piggybacking on query forwarding) • No message overhead

Implemented System • Schemas • Import from OWL (Web Ontology Language) • Mappings • KnowledgeWeb Ontology Alignment API • Import from RDF/XML • Automated on-the-fly creation • Comparison to standard alignments  Automatic derivation of quality measures P(m=correct | {F}) for the mappings using iterative message-passing

2.4. Aspects of Self-Organization • What can we do once we have gathered quality measures about the mappings? • Routing queries () • Correcting mappings • Analyzing global properties of the graph

2.4.1. Self-Healing Semantic Networks • Two types of self-organization • Static network • Self-organizing dissemination of queries • Dynamic network • Self-organizing network of mappings • Idea: • Quality evaluation of mappings through Semantic Gossiping • Modify low quality links • Reorganized network leads to different quality evaluation • Dynamic network changes  self-organizing, self-referential semantic network

Some Results (1) • Sensitivity to TTL • (cycle analysis only, 25 schemas, 4 concepts)

Some Results (2) • Scalability • (results analysis only, 4 concepts, TTL=3, misclassification rate=0.1, 2 documents/peer on avg.)

2.4.2. Analyzing Semantic Inter. in the Large • What about interoperability at a global scale? • Modeling semantic interoperability: • The semantic connectivity graph • Idea: as for physical network analyses, define a connectivity layer • Unweighted, non-redundant version of the Schema-to-schema graph Schema-to-Schema Graph • Logical model • Directed • Weighted • Redundant

Semantic Interoperability in the Large • Definition Peers in a set Ps are semantically interoperable iff Ss is strongly connected, with Ss {s | p Ps, ps} • Observation 1 A set of peers Pscannot be semantically interoperable if |Es| <|Vs| • Observation 2 A set of peers Psis semantically interoperable if |Es| >|Vs| (|Vs|-1) - (|Vs|-1) • What happens between theses two bounds?

A Necessary Condition for interoperability in the large • Analyzing semantic interoperability in large-scale, decentralized networks • Percolation theory for directed graphs • Based on a recent graph-theoretic framework • Random graphs with specific degree distributions pjk, clustering coefficients cc and bidirectionality coefficient bc • Based on generatingfunctionality • Distribution of edges from first to second-order neighbors: • Necessary condition for semantic interoperability in the large: j,k (jk-j(bc+cc)-k)pjk ≥ 0 • Also: approximations of the size of semantically interoperable clusters

Some results (Poisson-distributed random graph, 10 000 vertices)

Analysis of a real system • Analysis of the Sequence Retrieval System (SRS) • Commercial information indexing and retrieval system for bioinformatic libraries • Schemas described in a custom language (Icarus) • Mappings (foreign keys) from one database to others • Crawling the EBI repository • 388 databanks • 518 (undirected) links • Power-law distribution of node degrees • Clustering coefficient = 0.32 • Diameter = 9 • Connectivity indicator ci = 25.4 • Super-critical state • Size of the giant component • 0.47 (derived) VS 0.48 (observed)

Query dissemination in weighted networks • Per-hop forwarding behaviors • Only forward if wi >=  •  = 0 : flooding •  = 1 : exact answers • Degree distribution similar to SRS • Uniformly distributed weights between 0 and 1

Local view on global properties? SRS-like distribution, 1000 vertices, 4000 edges

3. Applications 0. P-Grid • A peer-to-peer access structure • GridVine • Self-organizing semantic overlay network • PicShark • Self-organizing middleware to export pictures and create mappings

Standard data management over overlay networks • Hard problem? • Strictly speaking impossible • CAP theorem: pick at most two of the following: • Consistency • Availability • Tolerance to network Partitions • Practical compromises: • E.g., Relaxing ACID properties S. Gilbert and N. Lynch: Brewer's conjecture and the feasibility of consistent, available, partition-tolerant web services, ACM SIGACT News, 33(2), 2002.

3.0. The P-Grid Access Structure • Distributed, virtual binary search tree • complete decentralization • self-organization • efficient search • Gridella: a DHT based on P-Grid • decentralized load balancing • updates • replication • management of dynamic IP addresses and identities • Used in several large-scale research projects • MICS • Alvis • Bricks • Evergrow

0 1 00 01 10 0 : 1,1410 : 11,13 2 12,13,14 1 : 2,1200 : 9,4011: 3,10 14 4,5 1 : 12, 1301 : 5,14001: 9,4 7 0,1 1 : 8, 1300 : 7,9011: 3,10 5 4,5 10 1 : 6,800 : 1,7010: 5,14 6,7 1 : 1,311 : 2,12101: 8,13 6 8,9 4 1 : 6,1301 :10,14000: 1,7 2,3 peer identifier ID Prefix Routing data keys (2=0010 etc.) 2,3 1 : 12, 13 routing table entry query(101) @ 7 11 000 001 010 011 100 101 0 : 5,710 : 6,13 12 12,13,14 3 1 : 11,1200 : 1,9010: 5,14 6,7 11 0 : 4,711 : 2,12101: 8,13 8,9 0 : 5,911 : 2,12100: 6,11 13 10,11 1 1 : 12, 1301 : 5, 10001: 9,4 0,1 9 1 : 8,201 : 3, 10000: 1,7 2,3 8 10,11 0 : 4,911 : 2,12100: 6,11

3.1. GridVine • Building large-scale semantic systems • Self-organizing semantic overlay network • Principle of data independence • Scalable physical layer • Semantic logical layer

Semantic Mediation Layer Semantic Mediation Layer Correlated / Uncorrelated Overlay Layer Correlated / Uncorrelated “Physical” layer

GridVine: Annotating Content

Features • Self-organized, scalable, decentralized • Resolves key-based searches in O(log(n)) even for unbalanced trees • Semantic Web compliant • RDF triples, RDFS schemas, OWL mappings • Structured searches • Simple Triple Patterns, RDQL queries • Query resolution: iterative, distributed table lookup • Semantic Gossiping • Tradeoff precision / recall • Automatic reformulation of queries • One of the building-blocks of the Social Semantic Desktop • Nepomuk, large-scale EU research project

Indexing semi-structure in GridVine • Soft-states • Each triple has an expiration time (cf. CAP theorem) • Locality-preserving hash-function • Range searches Triple t = <lsir:GridVine> <dc:creator> <lsir:pcm> Put(Hash(lsir:GridVine), t) Put(Hash(dc:creator), t) Put(Hash(lsir:pcm), t)

Semantic Integration in GridVine (1) • Vertical integration: hierarchy of classes • Fostering semantic interoperability through reuse of conceptualizations • Simple, user-oriented constructs (cf. GUI) • Few, popular base classes bootstrapping interoperability through properties • RDFS entailment can be materialized for the instances

Semantic integration in GridVine (2) • Horizontal integration: mappings • Simple links relating properties • Cycle + feedback analysis to get probabilistic guarantees  AutomaticSemantic Gossiping

Traversals of the Semantic Overlay Network • GridVine: structured P2P network • No more constraints on gossiping • Different query forwarding paradigms • Iterative forwarding • Recursive forwarding

3.2. PicShark • Where do the mappings come from? • Middleware for sharing semi-structured metadata attached to pictures and creating mappings 60 moments PicShark (Distributed) Hashtable (e.g., GridVine) Features Extractor Insert PSP Retrieve Metadata Extractor XMP Information Tracker WinFS

Features • Self-Organization of mappings • Based on low-level features extracted from • Picture (color moment, textures) • Structured Metadata (lexicographical analysis) • Self-Organization of annotations • Probabilistic propagation of annotations between similar individuals • Self-Organization of query dissemination • Schema distance based on probabilistic subsumption • Propagation within a certain diameter • Driven by user interaction • Scalable • Computationally expensive operations are local at the peers • Only simple in-network operations (look-ups) • (on-going) collaborative effort with Microsoft Research + MICS

PicShark Prototype

4. Conclusions • Fundamental issue: Data interoperability and management in large scale decentralized environments • Content Sharing • Information search • Semantic Web? • Traditional techniques are not sufficient • Scale • Autonomy • Uncertainty • Self-organizing, decentralized stochastic processes • Data Indexing • Data Integration • Semantics as agreement • Abductivereasoning (on transitive closures of mappings) • Query dissemination

Data Management in Large-Scale Decentralized Information Systems

Data Management in Large-Scale Decentralized Information Systems

Presentation Transcript

T.J. Writing Company

Bluetooth Architecture Overview Dr. Chatschik Bisdikian IBM Research T.J. Watson Research Center Hawthorne, NY 10532, US

Putting IBM Watson to Work In Healthcare

Dr. Dong Chen IBM T.J. Watson Research Center Yorktown Heights, NY

David F. Bacon T.J. Watson Research Center

David F. Bacon T.J. Watson Research Center

Supercomputer Platforms and Its Applications Dr. George Chiu IBM T.J. Watson Research Center

Data Management Community at IBM Watson

T.J. and Rafael

T.J. Brown

Wendy A. Kellogg, Jason B. Ellis, John C. Thomas IBM T.J. Watson Research Center

Dr. George Chiu IEEE Fellow IBM T.J. Watson Research Center Yorktown Heights, NY

Marco Pistoia, Ted Habeck, Larry Koved IBM T.J. Watson Research Center New York, USA

IBM Watson Analytics Online Training With Live Project

Fred Douglis IBM T.J. Watson Research Center

Scenes from HCI Research at the IBM T.J. Watson Research Lab

Dr. George Chiu IEEE Fellow IBM T.J. Watson Research Center Yorktown Heights, NY

Supercomputer Platforms and Its Applications Dr. George Chiu IBM T.J. Watson Research Center

IBM Watson in Health Care Joel Farrell, IBM

Course in AI | IBM Watson Application Development | GKT