
Piazza: Data Management Infrastructure for the Semantic Web



Presentation Transcript


  1. Piazza: Data Management Infrastructure for the Semantic Web Joint work with Alon Halevy, Peter Mork, Dan Suciu, Igor Tatarinov, University of Washington Zachary G. Ives University of Pennsylvania CIS 700 – Internet-Scale Distributed Computing February 3, 2004

  2. The Big Question in P2P Why use a P2P system vs. a centralized one? • PRO: P2P offers greater flexibility and resource utilization • CON: P2P often sacrifices reliability guarantees, accountability, and sometimes even performance There are a few simple cases where P2P wins: • Avoiding the law/RIAA/MPAA: copying music, videos, etc. • Anonymity (FreeNet, etc.) • Exploiting idle cycles But are there applications that are inherently P2P?

  3. Most P2P Work is “Bottom-up” • The basis of P2P: Algorithms/data structures papers • Chord, CAN, Pastry • Focus on providing a robust DHT – not what to do with it • Several systems build functionality over the DHT: • Tang et al. information retrieval paper • Maps LSI space into CAN multidimensional space • Interesting but uncertain benefits • Berkeley PIER • DB query engine: uses distributed hash table to do distributed joins • Sophia • Prolog rules in a distributed environment for network monitoring • None of these apps are inherently (or perhaps even best) based on P2P architectures

  4. Thinking Top-Down • Find an application that has needs matching the properties of P2P: • No central authority (and no logical owner of central server) • Loose, relatively ad hoc membership • Capabilities of a system grow as new members join • Participants are generally cooperative

  5. One Possible Answer: Data Integration/Interchange Applications • Multiple parties have proprietary data + sources • Not willing to relinquish control or change their data representation, but are willing to share • Examples: • UPenn hospital system is looking to modernize information sharing among departments (trauma, neurology, etc.) • Many bioinformatics warehouses (e.g., Penn’s GUS, NCBI’s GenBank) have related info they would like to share • The W3C’s vision of the “Semantic Web”: a web where all pages are annotated with meaning, meanings are well-defined, and complex questions can be answered

  6. The “Old” Model: Centralization • Get all parties to hash out a standard, global schema or ontology • Different classes of objects to be represented • Constraints + relationships between them • Relate all of the data sources to that schema • Relationships are specified as named queries – views • Efficient techniques exist for using these views to answer future queries posed over the mediated schema

  7. Centralized Data Integration Architecture [Diagram: queries and results flow through a Data Integration System / Mediator, which holds the Mediated Schema and a Source Catalog containing the query-based schema mappings; a Wrapper sits in front of each source’s data.]

  8. Centralization Doesn’t Scale • Difficult to arrive at one standard schema • … When we do, it’s slow to evolve to new needs • This is a human factor, but it is also a scalability issue • Hard to leverage mappings well: • If we map source A → mediated schema, does this help us map source B, even if source B is “almost” like source A? • Can we prevent mappings from “breaking” when we update the central schema? • Users often prefer familiar schema, not central one • More schemas → more users forced to change schemas

  9. The Piazza System: Infrastructure for Relating & Querying Structured Data • Recasts data integration as a decentralized confederation of peers and mappings • Our initial focus is on the logical aspects: • Mediating between different types of XML-encoded data • Based on extensions of formalisms & techniques from data integration • Schemas are related via directional pairwise mappings • Making maximal use of a limited number of mappings • Translates queries over transitive closure of mappings • Uses mappings “in reverse”

  10. Mediated Query Answering in the Piazza System [Diagram: peers Penn, DBLP, CiteSeer, UW, Stanford, Leipzig, and Oxford connected by mappings; a query Q posed at one peer is successively rewritten (Q’, Q’’) as it propagates along the mappings. Mappings are typically directional, pairwise.]

  11. Data in Piazza • Each participant may have its own schema + data • Unordered XML, with pre-specified schemas • In general, we’ll identify it with XPath expressions: • Similar syntax to Unix paths, but over trees • e.g., /rootelement/subelement/* [Figure: example XML tree for a DBLP-style entry: root db, a book element (key “Brown92”, mdate “2002…”) whose children include author “Kurt Brown”, title “PRPL…”, pub “MKP”, and year “1992”.]
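The XPath-style addressing above can be sketched with Python’s xml.etree.ElementTree, which supports a limited XPath subset. The document content below is an illustrative DBLP-style entry, not taken from a real dataset:

```python
import xml.etree.ElementTree as ET

# Illustrative DBLP-style entry, echoing the slide's book example.
doc = ET.fromstring(
    "<db>"
    "  <book key='Brown92'>"
    "    <author>Kurt Brown</author>"
    "    <title>PRPL…</title>"
    "    <pub>MKP</pub>"
    "    <year>1992</year>"
    "  </book>"
    "</db>")

# XPath-like selection of every child of every book under the root,
# analogous to /rootelement/subelement/* in the slide's notation.
children = [el.tag for el in doc.findall("./book/*")]
print(children)  # ['author', 'title', 'pub', 'year']
```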

  12. Mappings in Data Integration • Express value or class equivalence: DollarCost = EuroToDollar(EuroCost) “ID#0123456” ≡ “Catalog#98324” S1/book/author = S2/author • Also containment: S2/book ⊆ S3/publication • Ability to use value(s) as IDs: collect all entries related to the ID into one object • Convert between edge labels, values: <authors><book>1</book></authors> → <action type=“authorship”><object>book</object></action> • Concatenation: S2/author/fullname = S3/author/first + S3/author/last
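The concatenation mapping in the last bullet can be sketched at the value level; the dictionary field names and the choice of a space separator are assumptions for illustration, not part of either real schema:

```python
# Forward direction of the slide's concatenation mapping:
# S2/author/fullname = S3/author/first + S3/author/last.
def s3_author_to_s2(author: dict) -> dict:
    # The separator (a single space) is itself a mapping choice.
    return {"fullname": author["first"] + " " + author["last"]}

print(s3_author_to_s2({"first": "Kurt", "last": "Brown"}))
# {'fullname': 'Kurt Brown'}
```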

  13. Piazza’s Mapping Language Goals: • Build on XQuery and XML • Remain computationally inexpensive • Capture the common mapping types Directional XML mapping language based on templates <output> {: $var IN document(“doc”)/path WHERE condition :} <tag>$var</tag></output> • Translates between parts of data instances • Restricted subset of XQuery that’s decidable to reason about • Supports special annotations and object fusion

  14. Mapping Example between XML Schemas [Schema trees. Source: authors with author* children, each having a full-name and publication* elements (title, pub-type). Target: pubs with book* children, each having a title and author* (name); authors and publications (title, pub-type, name) are related via a writtenBy edge.]

  15. Example Piazza Mapping
<pubs>
  <book piazza:id={$t}>
    {: $a IN document(“…”)/authors/author,
       $an IN $a/full-name,
       $t IN $a/publication/title,
       $typ IN $a/publication/pub-type
       WHERE $typ = “book”
       PROPERTY $t >= ‘A’ AND $t < ‘B’ :}
    <title>{$t}</title>
    <author><name>{$an}</name></author>
  </book>
</pubs>
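To make the mapping’s semantics concrete, here is a hand-written Python evaluation of this particular mapping over a small invented source instance. Piazza itself reasons about such mappings rather than interpreting them this naively, and all data values here are made up:

```python
import xml.etree.ElementTree as ET

source = ET.fromstring(
    "<authors>"
    "  <author>"
    "    <full-name>Ann Author</full-name>"
    "    <publication><title>A Sample Book</title><pub-type>book</pub-type></publication>"
    "    <publication><title>Some Paper</title><pub-type>article</pub-type></publication>"
    "  </author>"
    "</authors>")

# One target <book> per (author, publication) binding that satisfies the
# WHERE clause, restricted by the PROPERTY annotation to titles in ['A','B').
pubs = ET.Element("pubs")
for a in source.findall("author"):
    an = a.findtext("full-name")
    for p in a.findall("publication"):
        t, typ = p.findtext("title"), p.findtext("pub-type")
        if typ == "book" and "A" <= t < "B":
            book = ET.SubElement(pubs, "book")
            ET.SubElement(book, "title").text = t
            author = ET.SubElement(book, "author")
            ET.SubElement(author, "name").text = an

print(ET.tostring(pubs, encoding="unicode"))
```

Only the first publication survives: the second fails the WHERE condition on pub-type.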

  16. Query Answering in Piazza Given an XQuery over a schema, iteratively expand and translate it to capture neighbors at distance i • Requires sophisticated reasoning to avoid cycles, redundant expansions • See paper for details How does this work? • Mapping defines constraints on pairs of source & target instances Constrains possible pairs of matched interpretations • Easy to use mapping in “forward direction”: query composition with a view (or chain of views) • Also have algorithms to rewrite query over source in terms of target Need to invert mapping and compose that with query • Answer set is defined by “certain” answers • May lose some information in inversion
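The iterative expansion over the transitive closure of mappings can be sketched as a breadth-first traversal of the mapping graph. The peer names echo slide 10, but the edges are invented, and this captures only the cycle-avoidance aspect, not the semantic rewriting:

```python
from collections import deque

# Hypothetical directional mapping graph among the peers of slide 10.
MAPPINGS = {
    "Penn": ["DBLP"],
    "DBLP": ["CiteSeer", "UW"],
    "CiteSeer": ["DBLP"],   # a cycle back to DBLP
    "UW": ["Stanford"],
    "Stanford": [],
}

def expansion_order(start):
    """Expand the query frontier breadth-first; the visited set keeps
    cycles in the mapping graph from causing redundant re-expansion."""
    seen, frontier, order = {start}, deque([start]), []
    while frontier:
        peer = frontier.popleft()
        order.append(peer)
        for nxt in MAPPINGS.get(peer, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return order

print(expansion_order("Penn"))  # ['Penn', 'DBLP', 'CiteSeer', 'UW', 'Stanford']
```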

  17. Piazza Is One of Several Similar Efforts • Peer-to-peer databases: PIER, PeerDB, Hyperion, [Bernstein et al. WebDB02], [Aberer et al. WWW03] • RDF engines and mediators for the Semantic Web: EDUTELLA, Sesame • Makes use of semi-automated mapping construction techniques from the database/machine learning communities: • Clio, LSD, GLUE, Cupid, many others

  18. Summary: Infrastructure for Decentralized Mediation • Powerful XML mappings and transformations • Extensible, scalable architecture, thanks to sophisticated reasoning techniques for mappings The model itself is peer-to-peer at a logical level – functionality that is best suited to a P2P architecture

  19. Where from Here? Ongoing Work • Piazza effort at U. Wash. continues to focus on problems relating to mappings • Orchestra at Penn follows up with a focus on several questions: • What does a true DHT-based P2P integration system look like? • Covers a variety of query processing stages, including mapping reformulation and query optimization, not just execution (as in PIER) • Where should we materialize or replicate data? • The “data placement” problem • What issues arise when we want to consider updates and synchronization at web-scale?

  20. Data Management and P2P • We’ve now seen a number of approaches • Information retrieval • Network monitoring • Query execution • Decentralized data integration • Common themes: • Declarative query languages separate logical + physical levels • Large amount of data with semantic info, distributed in many sites • Which ideas hold the most promise? • Is data management well-suited to P2P and DHTs? Does data management need P2P?

  21. Backup slides…

  22. Challenges with Mappings • Information may be lost in one direction of a mapping: • Name := concat(FirstName, LastName) • Faculty := Professors ∪ Lecturers • Correspondences may be hard to specify precisely: • Bug ≈ Insect • Data may be dirty or incomplete • Exact mappings may be computationally expensive
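The first loss-of-information example can be demonstrated directly: concatenation is not uniquely invertible, because distinct source states collapse to one target value (the names are illustrative):

```python
# Name := concat(FirstName, LastName) loses the split point, so the
# mapping cannot be inverted uniquely.
def to_name(first, last):
    return first + last

a = to_name("Ann", "Lee")   # "AnnLee"
b = to_name("An", "nLee")   # also "AnnLee"
print(a == b)  # True: two different sources, one target value
```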

  23. RDF vs. XML • RDF explicitly names relationships: (book, title, “ABC”) (book, writtenBy, author) (author, name, “John Smith”) • XML does not always: • <book> <title>ABC</title> <writtenBy> <author><name>John Smith</name></author> </writtenBy> </book> • <book> <title>ABC</title> <author>John Smith</author> </book> [Graph: book -title-> “ABC”; book -writtenBy-> author; author -name-> “John Smith”.]
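The correspondence between the first XML encoding and the RDF triples can be sketched by reading the triples back out of the tree; using the bare tag names as node identifiers is a simplification for illustration:

```python
import xml.etree.ElementTree as ET

book = ET.fromstring(
    "<book><title>ABC</title>"
    "<writtenBy><author><name>John Smith</name></author></writtenBy></book>")

# Recover the slide's three triples from the nested XML.
triples = [
    ("book", "title", book.findtext("title")),
    ("book", "writtenBy", "author"),
    ("author", "name", book.findtext("writtenBy/author/name")),
]
print(triples)
```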

  24. RDF vs. XML 2 • RDF is subject-neutral (a graph) • XML centers around a subject (a tree): • <book> <title>ABC</title> <author>John Smith</author> </book> • <author> <name>John Smith</name> <book>ABC</book> </author> • This may result in duplication of contained objects

  25. Mapping XML to OWL • We can map from XML to XML; thus we can go from XML to an XML serialization of RDF • Caveat: this doesn’t give us the full power of the KR-based Semantic Web! • We can only create OWL individuals that can be expressed in an XQuery-style view definition • To go any further, we may need to supplement these with additional OWL class definitions • But it gets us 80% there and makes the rest much easier – and it supplies mapping capabilities missing from OWL itself

  26. Implementing the Semantic Web Early emphasis on languages, tools for one (or a few) ontologies • Very powerful solutions in OWL and tools! • Initial assumption: data will have to be created in RDF Important problems remain: sharing at scale and legacy data • Global representations/ontologies hard to agree on! • Not just due to preference: different representations better suited to certain usage models – differences are inevitable • Need infrastructure that allows users to choose & query in their ontology, get results from all related (mapped) data • Must be able to import relevant structured data • Most data is in existing, non-RDF formats (XML, relations, legacy sources, etc.)

  27. Impossible to Capture & Normalize All Semantics (1/2) Even RDF/OWL regularity can’t enforce a single conceptual model: • May use different names for same items • May use different levels of granularity: book vs. publication • Metadata + data may be interchanged: (Car4, hasWheel, Wheel1) vs. (Car5, contains, Obj2), (Obj2, hasPurpose, wheel)

  28. Impossible to Capture & Normalize All Semantics (2/2) • Even collections may be described differently: • (Person, eatsForBreakfast, Meal1) (Person, eatsForLunch, Meal2) (Person, eatsForDinner, Meal3) • (Person, eatsMeals, TodaysMeals) (TodaysMeals, breakfast, Meal1) (TodaysMeals, lunch, Meal2) (TodaysMeals, dinner, Meal3) • (Person, eatsMeals, list of Meal) (list of Meal := {Meal1, Meal2, Meal3})

  29. Observations • Even formalisms like RDF, OWL capture only a part of the semantics • Still need some interpretation • (This shouldn’t be surprising, but it’s important!) • Very hard to get many contributors to agree on the same representation or ontology • Simple equivalences (owl:equivalentProperty, owl:equivalentClass) aren’t enough to map between different ontologies • Need infrastructure for relating data in many different representations, at different levels of granularity! • This is the core strength of database techniques

  30. Benefits of Piazza’s DB Heritage Terabytes of existing data that’s in XML (or easily translatable to XML) • Hierarchical and relational data, spreadsheets, Java objects, … • XML files, RDF itself! Sophisticated reasoning about mappings is possible by extending existing data integration work • Achieves schema/concept mapping at different granularities • Chaining of mappings, using mappings in reverse direction, … Can map between data in different structures (including RDF serializations, XML)

  31. Key Problem: Coordinating Efforts between Collaborators • Today, to collaboratively edit structured data, we centralize • For many applications, this isn’t a good model, e.g.: • Bioinformatics groups have multiple standard schemas and warehouses for genomic information – each group wants to incorporate the info of the others, but have it in their format, with their own unique information preserved, and the ability to override info from elsewhere • Different neuroscientists may have data from measuring electrical activity in the same part of the brain – they may want to share common information but maintain their specific local information; each scientist wants the ability to control when their updates are propagated Work-in-progress with Nitin Khandelwal; other contributors: Murat Cakir, Charuta Joshi, Ivan Terziev

  32. The Orchestra System: Infrastructure for Collaborative Data Sharing • Each participant is a logical peer, with some XML schema that is mapped to at least one other peer’s schema • Schemas’ contents are logically synchronized initially and then on demand [Diagram: peers Part1 (Schema 1), Part2 (Schema 2), and Part3 (Schema 3), with mappings between the XML schemas. Part3 publishes updates (+ XML tree A, − XML tree B); the updates are translated along the mappings and arrive at the other peers as (+ XML tree A’, − XML tree B’) and (+ XML tree A’’, − XML tree B’’).]

  33. Some Challenges in Orchestra • Logical & semantics-level: • Mappings • How to express them • Using them to translate updates, queries • Inconsistency • How to represent conflicts • How to resolve them • Implementation-level (P2P-based): • Update propagation • Consistency with intermittent connectivity • Scaling • To many updates • To many queries

  34. Mappings • Some peers may be replicas • Others need mappings, expressed as “views” • Views: functions from one schema to another • Can be inverted (may lose some information) • Can be “chained” when there is no direct connection • (Much research in generating these automatically [DDH00][MB01], …) • Prior work on propagating updates through relational views [BD82][K85][C+96]… • Ensuring the mapping specifies a deterministic, side-effect-free translation • Algorithmically applying the translation • Ongoing work with Nitin Khandelwal: • Extending the model to handle (unordered) XML • Challenge: dealing with XML’s nesting and its repercussions
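Chaining views when there is no direct connection amounts to function composition; the two record-level views below are invented stand-ins for real schema mappings:

```python
def s1_to_s2(rec):
    # Illustrative renaming view from schema 1 to schema 2.
    return {"full_name": rec["name"]}

def s2_to_s3(rec):
    # Illustrative restructuring view from schema 2 to schema 3.
    return {"author": {"full_name": rec["full_name"]}}

def chain(*views):
    """Compose a path of views into a single end-to-end view."""
    def composed(rec):
        for view in views:
            rec = view(rec)
        return rec
    return composed

s1_to_s3 = chain(s1_to_s2, s2_to_s3)
print(s1_to_s3({"name": "Kurt Brown"}))
# {'author': {'full_name': 'Kurt Brown'}}
```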

  35. A Globally Consistent Model that Encodes Conflicts • Even in the presence of conflicts, want a “global state” (from perspective of some schema) when we synchronize • Allows us to determine what’s agreed-upon, what’s conflicting • Can define conflict resolution strategies • Goal: “union of all states” with a way of specifying conflicts • Define conditional XML tree based on a subset of c-tables [IM84] • Each peer pi has a boolean flag Pi representing “perspective i” [Figure: conditional tree with a root and two auth children, one guarded by “If P1” and one by “If P2”, with values “Smith” and “Lee”.]
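A conditional tree in the c-table style can be sketched as an ordinary tree whose children carry peer flags; instantiating the flags yields the state seen from one perspective. Which value pairs with which flag is an assumption here:

```python
# Children of the conditional tree carry a guard naming a peer flag
# (pairing of flags to values is illustrative).
CTREE = ("root", [("auth", "P1", "Smith"), ("auth", "P2", "Lee")])

def instantiate(tree, flags):
    """Keep exactly the children whose guard flag is true, yielding an
    ordinary (unconditional) tree for that choice of perspectives."""
    tag, children = tree
    return (tag, [(ctag, val) for ctag, cond, val in children if flags[cond]])

print(instantiate(CTREE, {"P1": True, "P2": False}))
# ('root', [('auth', 'Smith')])
```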

  36. Propagating Updates with Intermittent Connectivity • How to synchronize among n peers (even assuming the same schema)? • Not all are connected simultaneously • Usual approaches: • Locking (doesn’t scale) • Epidemic algorithms (only eventually consistent) • Approach: • “Shadow instance” of the schema, replicated within the other peers of the network • Everyone syncs with the shadow instance • Benefits: state is deterministic after each sync

  37. Scaling, Using P2P Techniques • Update synchronization • Key problem: find values conflicting with “shadow instance” • Partition the “shadow instance” across the network • Query execution • Partition computation across multiple peers (PIER does this) • Query optimization • Optimization breaks the query into sub-problems, uses dynamic programming to build up estimates of the costs of applying operators • Can recast as recursion + memoization • Use P2P overlay to distribute each recursive step • Memoize results at every node • Why is this useful? Suppose 2 peers ask the same query!
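The recursion-plus-memoization view of dynamic-programming optimization can be sketched with a toy cost model. The scan costs and the flat join cost of 5 are invented; in the P2P setting, each memo entry would live at the overlay node responsible for that sub-query:

```python
from functools import lru_cache

SCAN_COST = {"R": 10, "S": 20, "T": 15}  # illustrative per-relation costs

@lru_cache(maxsize=None)          # the memo table; in P2P, distributed
def best_cost(rels: frozenset) -> int:
    """Best estimated cost for the sub-query over `rels`, built from
    the best costs of its sub-problems."""
    if len(rels) == 1:
        return SCAN_COST[next(iter(rels))]
    # Left-deep splits only, with a flat stand-in join cost of 5.
    return min(best_cost(frozenset({r})) + best_cost(rels - {r}) + 5
               for r in rels)

print(best_cost(frozenset({"R", "S", "T"})))  # 55
```

Because results are memoized, a second peer asking the same (sub-)query reuses the cached estimate instead of recomputing it.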

  38. Current Status • Have a basic strategy for addressing many of the problems in collaborative data sharing • Initial sketches of the core algorithms • Need to develop them further • … And to implement (and validate) them in a real system!
