Solving the Challenges of Data Integration and Semantics in RDF and SOA

RDF and SOA David Booth, Ph.D. <dbooth@hp.com>HP Software 2007-02-27 Latest version: http://dbooth.org/2007/rdf-and-soa/rdf-and-soa-slides.ppt These slides are based on the paper at http://dbooth.org/2007/rdf-and-soa/rdf-and-soa-paper.htm

Some issues mentioned yesterday • Ken: "Someone walks in with a 40 page schema, and someone else walks in with a 60 page schema" • Namespace/vocabulary/schema management problem • Chris: "that concept is almost what I need, but not quite, so I need to define my own" • Paul: "Super billing system, with everyone's features" • Jonathan: "Top down policy enforcement is bucking the tide" • Skip: "too difficult to reuse [schemas]" • Nick: "I see customers spending 80% of their time on getting their documents right, not on the interfaces or how to exchange them" • Ben: "WS interop doesn't solve my problem. I need my services to interoperate"

Outline • The problem with XML in SOA • Babelization • Versioning • What's wrong with "THE model" • Lack of consistent semantics across models • The case for RDF in SOA • Data integration • Consistent semantics • Versioning • Validation • Bridging between XML and RDF • XML as an RDF serialization

Problems

Problem 1: Babelization • Success of Web services causes new challenges • Each Web service defines its own interface: • XML messages in/out • WSDL doc describes a “language” for interacting with that service • As services multiply, these languages multiply • What do the messages mean? • Result: “Babelization” (http://www.w3.org/2002/Talks/1218-semweb-dbooth/slide43-0.html )

Babelization impedes integration • Each service speaks its own language • Cannot easily connect them in new ways • Each message type has its own data model • Cannot easily integrate data

Problem 2: Versioning • Need to be able to independently version both clients and services • XML is brittle • Versioning is a constant problem

History of data model design in XML • Version 1: This is the model. • Version 2: Oops! No, *this* is the model. • Version 3: This is the model today, but here's an extensibility point for tomorrow. . . . Then, after integrating with another model: • Version 4: This is the *super* model (with extensibility for tomorrow, of course). . . . Eventually, after integrating with more models: • Version 5: This is the super-duper-*ultra* model (with extensibility, of course) • etc.

Schema Hell • Model gets very complex • Each app/component only uses one part of it • Lots of optionality: hard to know what is really used • Example: RosettaNet Purchase Order • Over 500 elements, 75% optional • Trading partners take 2-3months to negotiate through optionality

Moral: There is no such thing as THE model. • There are many models. There always will be. • Standard model is always beneficial when practical • But only possible in the micro -- not the macro • Why? • Different apps need different models • Hard to get agreement as org/committee grows • App needs change over time • Org changes over time

Problem 3: No consistent semantics • 1000s of applications • 1000s of schemas • How do they relate to each other? • <foo:CustAddress> == <bar:ShippingAddr>

The Case for RDF in SOA

RDF • Relational data model framework • W3C standard >6 years • Language for making statements about things • Used to express both: • Ontologies (with OWL), and • Instance data

Key features of RDF • Syntax independent (specifies model) • Some existing serializations: RDF/XML, N3, Turtle • Consistent semantics • Based on URIs • Great for data integration problems • Data "mashups"

Why RDF excels at data integration • New data models can be easily added • Old and new data models co-exist in merged model • Relationships between the old and new models are expressed explicitly • Both old and new can be used simultaneously

Example: Blue App has model

Red App has model • Need to integrate Red & Blue models into new Green model. How?

Model integration in XML • Option 1: Red and Blue models become subtrees of Green model • Very little gained • Relationships are implicit in the processing code • Option 2: Design a new model • How is it different in RDF?

Step 1: Merge RDF • Same nodes (URIs) join automatically

Step 2: Add relationships between Red & Blue models • (Relationships are also RDF)

Step 3: Define Green model • (Making use of Red& Blue models)

What the Blue app sees • No difference!

What the Red app sees • No difference!

What the Green app sees

Consistent semantics • RDF facilitates consistent semantics • Terms have the same meaning across apps • Based on URIs • Semantics of different terms can be related declaratively • ESB can convert formats, but give no assurance of consistent semantics • Example: Security entitlements across applications

RDF and versioning of message models

Versioning of message models • RDF makes message model versioning easier: • Syntax independence • Open World Assumption (OWA) • RDF does not address process flow versioning • REST addresses that • RDF : message models :: REST : process flow

Document Validation • XML is closed world (normally) • RDF uses open world assumption (normally) • Validation requires different approaches • There are pros and cons

Document validation

Example: Missing data in XML vs. RDF • Suppose: • Address requires a city name, but • City name is missing • Q: CityName == "New York"? • In XML (closed world assumption): • Error. • In RDF (open world assumption): • Sure, why not? (No evidence to the contrary)

Validation in RDF • Techniques are available • World can be temporarily closed • Sample SPARQL query can check data • If query succeeds, data is complete • Tools can also help detect errors (e.g., Eyeball, by Chris Dollin, HP Labs Bristol)

Validating Messages • WSDL use is lopsided: • Written solely from the perspective of the service • Specifies input and output schemas • Service may not know how its clients use its data!

Kinds of Data Validation • Model integrity • Same for producer and consumer • Suitability for a particular use • Differs for producer and consumer

Producer Versus Consumer Validation • Data producer should provide validator for data it creates (model integrity) • E.g., sample SPARQL query • Data consumer should provide validator for data it expects (suitability for use) • E.g., sample SPARQL query

RDF and efficiency • RDF is not inherently inefficient • If you need the work done, it must happen somewhere • Jena, Arq, etc., are pretty good • There *is* a learning curve for RDF • Must learn what is efficient and inefficient, just as in programming and RDBMS queries • Can process directly in XML if RDF power is not needed

Fundamental Trends • Need for easier service and data integration is increasing • Flooded with data • Continually integrating new models & services • Need for consistent semantics is increasing

Bridging XML and RDF

But Web services already use XML! • XML is well known and used • Legacy apps may require specific XML or other formats that cannot be changed • Standard RDF/XML serialization is verbose & ugly • How can we gain the benefits of RDF while still accommodating XML?

Recall: RDF is syntax independent • Specifies info model -- not syntax! • Can be serialized in any agreed-upon way • Can define specialized RDF serializations!

Defining new RDF serializations • New XML (or other) formats can be defined as specialized serializations of RDF • Mapping converts XML/other to RDF • XSLT/other can define mapping • Namespace or GRDDL can specify transformation • Analogous to microformats or micromodels, but: • not restricted to HTML/xhtml • not necessarily using standards-based ontologies • Can use app-specific ontologies

Mapping existing formats to RDF • Existing XML/other formats also can be mapped to RDF • Treat as specialized RDF serializations • XSLT/other can define mapping • Allows service to treat XML/other input as RDF • Both old & new formats

Documents on the wire can be XML, other or RDF Input can be normalized to RDF App processing can use RDF engine/store Normalizing to RDF Service RDF Normalizeto RDF XML/other Core appProcessing RDF Engine / Store Client

Different clients may require different formats: versions, etc. Can even change dynamically Can be normalized to RDF for common processing Core app is unaffected Supporting multiple client formats Service RDF Normalizeto RDF XML/other Core appProcessing RDF Engine / Store Client

Serialize to whatever formats are required Generate XML/other directly (or even RDF!), or SPARQL query can generate specific view virst Service Output Service Normalizeto RDF Core appProcessing RDF Engine / Store Client Serialize asXML/other/RDF

Granularity • Mapping from XML to RDF can be done with any level of granularity • Fine grained: • Every element, attribute, etc., in XML maps to 1+ RDF assertion • Permits more detailed inferences • Adds more complexity & processing up front • Coarse grained: • Entire chunk of XML maps into RDF • XML chunk is retained • RDF metadata can annotate XML chunk • Simpler, less processing up front • Information inside chunk is less accessible

Choosing appropriate granularity Consider: • What is significant from a distributed systems perspective? • What may be interpreted differently by someone else if its semantics are not pinned down? • When finer granularity is added later, differences may show up

Summary of Principles for RDF in SOA • Define interface contracts as though message content is RDF • Permit custom XML/other serializations as needed • Provide machine-processable mappings to RDF • Treat the RDF version as authoritative • Client and service should each provide both: • A model integrity validator for data it creates; and • A suitability for use validator for data it expects. • Choose RDF granularity that makes sense

Conclusions • Value of RDF in data integration is well proven • We have some evidence of its value in SOA, but need: • More exploration of graceful adoption paths • More work on transformation techniques • More work on validating RDF models in SOA context • Best practices for the above

Solving the Challenges of Data Integration and Semantics in RDF and SOA

Solving the Challenges of Data Integration and Semantics in RDF and SOA

Presentation Transcript

dbMotion and SOA

RDF, RDF, RDF….

RDF

RDF

Practical RDF Ch.10 Querying RDF: RDF as Data

3. RDF and XML

SOA and Collaboration

RDF and RDA

RDF

XML, RDF, and OWL

Practical RDF Chapter 2. RDF: Heart and Soul

XML and RDF

RDF

RDF and SPARQL

RDF

W3C and RDF

SOA