Piazza: Data Management Infrastructure for the Semantic Web

Piazza: Data Management Infrastructurefor the Semantic Web Joint work with Alon Halevy, Peter Mork, Dan Suciu,Igor Tatarinov, University of Washington Zachary G. Ives University of Pennsylvania CIS 700 – Internet-Scale Distributed Computing February 3, 2004

The Big Question in P2P Why use a P2P system vs. a centralized one? • PRO: P2P offers greater flexibility and resource utilization • CON: P2P often sacrifices reliability guarantees, accountability, and sometimes even performance There are a few simple cases where P2P wins: • Avoiding the law/RIAA/MPAA: copying music, videos, etc. • Anonymity (FreeNet, etc.) • Exploiting idle cycles But are there applications that are inherently P2P?

Most P2P Work is “Bottom-up” • The basis of P2P: Algorithms/data structures papers • Chord, CAN, Pastry • Focus on providing a robust DHT – not what to do with it • Several systems build functionality over the DHT: • Tang et al. information retrieval paper • Maps LSI space into CAN multidimensional space • Interesting but uncertain benefits • Berkeley PIER • DB query engine: uses distributed hash table to do distributed joins • Sophia • Prolog rules in a distributed environment for network monitoring • None of these apps are inherently (or perhaps even best) based on P2P architectures

Thinking Top-Down • Find an application that has needs matching the properties of P2P: • No central authority (and no logical owner of central server) • Loose, relatively ad hoc membership • Capabilities of a system grow as new members join • Participants are generally cooperative

One Possible Answer: Data Integration/Interchange Applications • Multiple parties have proprietary data + sources • Not willing to relinquish control or change their data representation, but are willing to share • Examples: • UPenn hospital system is looking to modernize information sharing among departments (trauma, neurology, etc.) • Many bioinformatics warehouses (e.g., Penn’s GUS, NCBI’s GeneBank) have related info they would like to share • The W3C’s vision of the “Semantic Web”: a web where all pages are annotated with meaning, meanings are well-defined, and complex questions can be answered

The “Old” Model: Centralization • Get all parties to hash out a standard, global schema or ontology • Different classes of objects to be represented • Constraints + relationships between them • Relate all of the data sources to that schema • Relationships are specified as named queries – views • Efficient techniques exist for using these views to answer future queries posed over the mediated schema

Centralized Data Integration Architecture Query Results Data Integration System / Mediator Mediated Schema Source Catalog Query-basedSchema Mappings in Catalog Wrapper Wrapper Wrapper Source Data

Centralization Doesn’t Scale • Difficult to arrive at one standard schema • … When we do, it’s slow to evolve to new needs • This is a human factor, but it is also a scalability issue • Hard to leverage mappings well: • If we map source A  mediated schema, does this help us map source B, even if source B is “almost” like source A? • Can we prevent mappings from “breaking” when we update the central schema? • Users often prefer familiar schema, not central one • More schemas  more users forced to change schemas

The Piazza System: Infrastructure for Relating & Querying Structured Data • Recasts data integration as a decentralized confederation of peers and mappings • Our initial focus is on the logical aspects: • Mediating between different types of XML-encoded data • Based on extensions of formalisms & techniques from data integration • Schemas are related via directional pairwise mappings • Making maximal use of a limited number of mappings • Translates queries over transitive closure of mappings • Uses mappings “in reverse”

Q’’ Q’ Q’’ Q’’ Q Q’’ Q’ Mediated Query Answering in the Piazza System CiteSeer UW Stanford Mappings typically directional, pairwise DBLP Leipzig Oxford Penn

Data in Piazza Root • Each participant may have its own schema + data • Unordered XML, with pre-specified schemas • In general, we’ll identify it with XPath expressions: • Similar syntax to Unix paths, but over threes • e.g., /rootelement/subelement/* db ?xml book mdate pub key author title year 2002… 1992 Brown92 PRPL… MKP Kurt Brown

Mappings in Data Integration • Express value or class equivalence: DollarCost = EuroToDollar(EuroCost) “ID#0123456”  “Catalog#98324” S1/book/author = S2/author • Also containment: S2/book  S3/publication • Ability to use value(s) as IDs Collect all entries related to the ID into one object • Convert between edge labels, values <authors><book>1</book></authors>  <action type=“authorship”><object>book</object></action> • Concatenation: S2/author/fullname = S3/author/first + S3/author/last

Piazza’s Mapping Language Goals: • Build on XQuery and XML • Remain computationally inexpensive • Capture the common mapping types Directional XML mapping language based on templates <output> {: $var IN document(“doc”)/path WHERE condition :} <tag>$var</tag></output> • Translates between parts of data instances • Restricted subset of XQuery that’s decidable to reason about • Supports special annotations and object fusion

Mapping Example between XML Schemas Source: authors author* full-name publication* title pub-type Target: pubs book* title author* name writtenBy author publication title pub-type name

Example Piazza Mapping <pubs> <book piazza:id={$t}>{:$aIN document(“…”)/authors/author,$anIN$a/full-name,$tIN$a/publication/title,$typIN$a/publication/pub-typeWHERE$typ = “book”PROPERTY$t >= ‘A’ AND$t < ‘B’ :}<title>{$t}</title> <author><name>{$an}</name></author> </book></pubs>

Query Answering in Piazza Given an XQuery over a schema, iteratively expand and translate it to capture neighbors at distance i • Requires sophisticated reasoning to avoid cycles, redundant expansions • See paper for details How does this work? • Mapping defines constraints on pairs of source & target instances Constrains possible pairs of matched interpretations • Easy to use mapping in “forward direction”: query composition with a view (or chain of views) • Also have algorithms to rewrite query over source in terms of target Need to invert mapping and compose that with query • Answer set is defined by “certain” answers • May lose some information in inversion

Piazza Is One of Several Similar Efforts • Peer-to-peer databases: PIER, PeerDB, Hyperion, [Bernstein et al. WebDB02], [Aberer et al. WWW03] • RDF engines and mediators for the Semantic Web: EDUTELLA, Sesame • Makes use of semi-automated mapping construction techniques from the database/machine learning communities: • Clio, LSD, GLUE, Cupid, many others

Summary: Infrastructure for Decentralized Mediation • Powerful XML mappings and transformations • Extensible, scalable architecture, thanks to sophisticated reasoning techniques for mappings The model itself is peer-to-peer at a logical level – functionality that is best suited to a P2P architecture

Where from Here? Ongoing Work • Piazza effort at U. Wash. continues to focus on problems relating to mappings • Orchestra at Penn follows up with a focus on two questions: • What does a true DHT-based P2P integration system look like? • Covers a variety of query processing stages, including mapping reformulation and query optimization, not just execution (as in PIER) • Where should we materialize or replicate data • The “data placement” problem • What issues arise when we want to consider updates and synchronization at web-scale?

Data Management and P2P • We’ve now seen a number of approaches • Information retrieval • Network monitoring • Query execution • Decentralized data integration • Common themes: • Declarative query languages separate logical + physical levels • Large amount of data with semantic info, distributed in many sites • Which ideas hold the most promise? • Is data management well-suited to P2P and DHTs?Does data management need P2P?

Backup slides…

Challenges with Mappings • Information may be lost in one direction of a mapping: • Name := concat(FirstName, LastName) • Faculty := Professors  Lecturers • Correspondences may be hard to specify precisely: • Bug ≈ Insect • Data may be dirty or incomplete • Exact mappings may be computationally expensive

RDF vs. XML • RDF explicitly names relationships: (book, title, “ABC”) (book, writtenBy, author) (author, name, “John Smith”) • XML does not always: • <book> <title>ABC</title> <writtenBy> <author><name>John Smith</name></author> </writtenBy></book> • <book> <title>ABC</title> <author>John Smith</author></book> writtenBy author book title name

RDF vs. XML 2 • RDF is subject-neutral (a graph) • XML centers around a subject (a tree): • <book> <title>ABC</title> <author>John Smith</author></book> • <author> <name>John Smith</name> <book>ABC</book></book> • This may result in duplication of contained objects

Mapping XML to OWL • We can map from XML to XML; thus we can go from XML to an XML serialization of RDF • Caveat: this doesn’t give us the full power of the KR-based Semantic Web! • We can only create OWL individuals that can be expressed in an XQuery-style view definition • To go any further, we may need to supplement these with additional OWL class definitions • But it gets us 80% there and makes the rest much easier – and it supplies mapping capabilities missing from OWL itself

Implementing the Semantic Web Early emphasis on languages, tools for one (or a few) ontologies • Very powerful solutions in OWL and tools! • Initial assumption: data will have to be created in RDF Important problems remain: sharing at scale and legacy data • Global representations/ontologies hard to agree on! • Not just due to preference: different representations better suited to certain usage models – differences are inevitable • Need infrastructure that allows users to choose & query in their ontology, get results from all related (mapped) data • Must be able to import relevant structured data • Most data is in existing, non-RDF formats (XML, relations, legacy sources, etc.)

Impossible to Capture & NormalizeAll Semantics (1/2) Even RDF/OWL regularity can’t enforce a single conceptual model: • May use different names for same items • May use different levels of granularity: book vs. publication • Metadata + data may be interchanged: (Car4, hasWheel, Wheel1) vs. (Car5, contains, Obj2), (Obj2, hasPurpose, wheel)

Impossible to Capture & NormalizeAll Semantics (2/2) • Even collections may be described differently: • (Person, eatsForBreakfast, Meal1) (Person, eatsForLunch, Meal2)(Person, eatsForDinner, Meal3) • (Person, eatsMeals, TodaysMeals)(TodaysMeals, breakfast, Meal1)(TodaysMeals, lunch, Meal2)(TodaysMeals, dinner, Meal3) • (Person, eatsMeals, list of Meal)(list of Meal := {Meal1, Meal2, Meal3})

Observations • Even formalisms like RDF, OWL capture only a part of the semantics • Still need some interpretation • (This shouldn’t be surprising, but it’s important!) • Very hard to get many contributors to agree on the same representation or ontology • Simple equivalences (owl:equivalentProperty, owl:equivalentClass) aren’t enough to map between different ontologies • Need infrastructure for relating data in many different representations, at different levels of granularity! • This is the core strength of database techniques

Benefits of Piazza’s DB Heritage Terabytes of existing data that’s in XML (or easily translatable to XML) • Hierarchical and relational data, spreadsheets, Java objects, … • XML files, RDF itself! Sophisticated reasoning about mappings is possible by extending existing data integration work • Achieves schema/concept mapping at different granularities • Chaining of mappings, using mappings in reverse direction, … Can map between data in different structures (including RDF serializations, XML)

Key Problem: Coordinating Efforts between Collaborators • Today, to collaboratively edit structured data, we centralize • For many applications, this isn’t a good model, e.g.: • Bioinformatics groups have multiple standard schemas and warehouses for genomic information – each group wants to incorporate the info of the others, but have it in their format, with their own unique information preserved, and the ability to override info from elsewhere • Different neuroscientists have may data from measuring electrical activity in the same part of the brain – they may want to share common information but maintain their specific local information; each scientist wants the ability to control when their updates are propagated Work-in-progress with Nitin Khandelwal; other contributors: Murat Cakir, Charuta Joshi, Ivan Terziev

The Orchestra System: Infrastructure for Collaborative Data Sharing • Each participant is a logical peer, with some XML schema that is mapped to at least one other peer’s schema • Schemas’ contents are logically synchronized initially and then on demand Translated updates from 3: + XML tree A’ - XML tree B’ Part2 mappings between XML schemas Schema 2 mappings Translated updates from 3: + XML tree A’’ - XML tree B’’ Updates: + XML tree A - XML tree B Part3 Part1 Schema 1 Schema 3

Some Challenges in Orchestra • Mappings • How to express them • Using them to translate updates, queries • Inconsistency • How to represent conflicts • How to resolve them • Update propagation • Consistency with intermittent connectivity • Scaling • To many updates • To many queries Logical & semantics- level Implementation- level (P2P-based)

Mappings • Some peers may be replicas • Others need mappings, expressed as “views” • Views: functions from one schema to another • Can be inverted (may lose some information) • Can be “chained” when there is no direct connection • (Much research in generating these automatically [DDH00][MB01], …) • Prior work on propagating updates through relational views [BD82][K85][C+96]… • Ensuring the mapping specifies a deterministic, side-effect-free translation • Algorithmically applying the translation • Ongoing work with Nitin Khandelwal: • Extending the model to handle (unordered) XML • Challenge: dealing with XML’s nesting and its repercussions

A Globally Consistent Model that Encodes Conflicts • Even in the presence of conflicts, want a “global state” (from perspective of some schema) when we synchronize • Allows us to determine what’s agreed-upon, what’s conflicting • Can define conflict resolution strategies • Goal: “union of all states” with a way of specifying conflicts • Define conditional XML tree based on a subset of c-tables [IM84] • Each peer pi has a boolean flag Pi representing “perspective i” root If P2 auth If P1 auth Lee Smith

Propagating Updates with Intermittent Connectivity • How to synchronize among n peers (even assuming the same schema)? • Not all are connected simultaneously • Usual approaches: • Locking (doesn’t scale) • Epidemic algorithms (only eventually consistent) • Approach: • “Shadow instance” of the schema, replicated within the other peers of the network • Everyone syncs with the shadow instance • Benefits: state is deterministic after each sync

Scaling, Using P2P Techniques • Update synchronization • Key problem: find values conflicting with “shadow instance” • Partition the “shadow instance” across the network • Query execution • Partition computation across multiple peers (PIER does this) • Query optimization • Optimization breaks the query into sub-problems, uses dynamic programming to build up estimates of the costs of applying operators • Can recast as recursion + memoization • Use P2P overlay to distribute each recursive step • Memoize results at every node • Why is this useful? Suppose 2 peers ask the same query!

Current Status • Have a basic strategy for addressing many of the problems in collaborative data sharing • Initial sketches of the core algorithms • Need to develop them further • … And to implement (and validate) them in a real system!

Piazza: Data Management Infrastructure for the Semantic Web