Enterprise Information Integration – XML to the Rescue!

Enterprise Information Integration – XML to the Rescue! Michael Carey, BEA Systems, Inc. ER2003 Conference, October 2003

Roadmap for Today’s Talk • Twenty years of data integration • In the beginning… • Variations on a theme • Enterprise information integration • XML – evolution, revolution, or convolution? • XML Schema • XQuery • Web Services • XML-based enterprise information integration • Viewing the enterprise through XML glasses • An example: BEA’s Liquid Data for WebLogic • Scaling XML enterprise information models • Conclusions, challenges, and Q&A

Let’s Get Started…

The Relational Revolution • Codd introduced the relational model in 1970 • Logical (vs. physical) model of data • Set-oriented programming model • The relational model is simple • Database = set of relations (or tables) • Algebra of set-oriented operations (select, join, …) • Declarative interfaces (what, not how) • SQL, QBE, et al. • Embedded languages, 4GLs • Three schema levels (data abstraction) • Views, base tables, physical files and indexes

We’re Still in the Relational Era • The relational model was a huge success • Simplicity made research tractable • Data independence → major productivity gains • Now a large/established software market • Relational DBMS “goodies” include • Query optimization and efficient execution • Well-defined transaction semantics and support • Views for data independence and access control • Constraints, triggers, and stored procedures for capturing and enforcing business rules • Moreover, relational technology scales well • Distributed queries and transactions • Large-scale parallel database systems

So, Aren’t We Done? • Sadly, the world is not that centralized • Departmental vs. inter-galactic databases • Nor is the world that agreeable or flexible • Oracle, DB2(s), Informix, SQL Server, Sybase, … • Hangers-on: IMS, IDMS, Model 204, Pick, … • XML databases (depending on who you believe) • Nor is all the world’s data SQL-accessible • Packaged applications (SAP, PeopleSoft, Siebel, Oracle) • Custom “homegrown” applications • CICS transactions • Document management systems • Files of various shapes and sizes • Other exotic subsystems and/or data sources

Enter Federated Databases • The basic idea seems simple enough • Pick your favorite data model and query language • Logically map the schemas of your different databases into one (logically) centralized schema • Federated DBMS will convert queries into: • Sub-queries against the individual databases, plus • Middleware coordination + computations to finish the job • Federated DBMS will convert transactions into: • Sub-transactions against the individual databases, plus • Appropriate distributed transaction coordination + recovery • Many technical challenges identified and studied • Schema matching and integration • Translation of data models, translation of queries • Global transactions in an uncooperative world
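The query-decomposition idea on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two in-memory SQLite databases, their table names (`customer`, `po`), and the sample rows are all hypothetical stand-ins for independent member databases, and the "middleware" is a single function rather than a real federated DBMS.

```python
import sqlite3

# Two independent "member" databases (hypothetical schemas and data).
crm = sqlite3.connect(":memory:")
crm.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, state TEXT)")
crm.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                [(1, "Alice Smith", "CA"), (2, "Bob Jones", "NY")])

orders = sqlite3.connect(":memory:")
orders.execute("CREATE TABLE po (id INTEGER PRIMARY KEY, cust_id INTEGER, total REAL)")
orders.executemany("INSERT INTO po VALUES (?, ?, ?)",
                   [(10, 1, 148.95), (11, 1, 23.96), (12, 2, 75.00)])


def orders_for_state(state):
    """Answer a 'global' query by decomposition: one sub-query per source,
    then a join performed in the middleware."""
    # Sub-query 1: push the selection on state down to the customer source.
    custs = crm.execute(
        "SELECT id, name FROM customer WHERE state = ?", (state,)).fetchall()
    if not custs:
        return []
    names = dict(custs)  # customer id -> name
    # Sub-query 2: fetch only the orders of the qualifying customers.
    marks = ",".join("?" * len(names))
    rows = orders.execute(
        f"SELECT cust_id, id, total FROM po WHERE cust_id IN ({marks}) ORDER BY id",
        list(names)).fetchall()
    # Middleware step: combine the partial results into the federated answer.
    return [(names[cid], oid, total) for cid, oid, total in rows]


result = orders_for_state("CA")
```

Pushing the `state` filter to the first source and shipping only the matching ids to the second is one simple strategy; a real federated optimizer would weigh several such plans.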

Ex: MultiBase (Early 1980s) • One of the first distributed DBMS projects to relax the homogeneity assumption (vs. Ingres*, R*, et al.) • Computer Corporation of America (CCA) • Funded by DoD, as the military saw this problem early • Approach very relevant in today’s web services world • Interesting foundation and technical contributions • Functional data model • Prehistoric objects with identity • Functions model attribute access, relationship navigation • Model realized via the DAPLEX query language • Important technical achievements • Abstract model to normalize relational, network, and others • Results on federated query processing (and sets/multisets) • Results on resolving semantic (data/schema) inconsistencies

The Next Twenty Years • Many “favorite data model and query language” combinations were explored • Functional • Relational • Logical (a.k.a. Datalog) • Object-oriented • Object-relational • Semi-structured • A number of products (and startups) launched • Relational (IBM DataJoiner, Cohera, …) • Object-oriented (Ontologic) • Object-relational (UniSQL) • Etc! (I’ve barely scratched the surface with my list)

Fast Forward to the 21st Century • No home runs, startups have come and gone, and the problem is becoming ever more pressing… • Most “plumbing” challenges have been worked out • Distributed infrastructure (RPC, messaging) • Distributed query processing / optimization • Distributed transaction management • Two big adoption impediments have lingered on • Mapping real data into any one particular data model • No community model consensus (other than relational) • Other (hard!) problems linger on as well • Semantic heterogeneity (schema, data values, …) • Data cleanliness (or lack thereof)

Enter XML…

What’s XML? • eXtensible Markup Language • Derived from document markup language SGML • Serving two masters: text and data • W3C standard (XML 1.0 – 1998, 2000) • Think HTML, but for data markup • Data is tagged – has elements, attributes, text, … • Unlike HTML, tags in XML are user-defined • Like HTML, XML documents are text files • As a result, data becomes “self-describing” • May conform to an XML Schema (or DTD) • The XML vision • XML can do for the “machine web” what HTML did for the “human web”

XML Example
<purchaseOrder orderDate="1999-10-20">
  <customer>
    <name>Alice Smith</name>
    <city>Mill Valley</city>
    <state>CA</state>
  </customer>
  <items>
    <item partNum="872-AA">
      <productName>Lawnmower</productName>
      <quantity>1</quantity>
      <USPrice>148.95</USPrice>
    </item>
    <item partNum="926-AA">
      <productName>Garden Hose</productName>
      <quantity>2</quantity>
      <USPrice>11.98</USPrice>
      <comment>Without these, I’m hosed!</comment>
    </item>
    ...
  </items>
</purchaseOrder>

“Data XML” • Enjoying rapid adoption for data exchange • Easily produced, consumed, and debugged • Central to B2B standardization work • Provides for loose coupling of applications • Provides a “semi-structured” data model • Blurs the schema / data boundary • Naturally handles complex, nested data • Flexible and extensible (over space and time) • Extra or missing data handled gracefully • Element and/or attribute names can vary • Data types can vary as well • Even allows for mixed content
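To make "extra or missing data handled gracefully" concrete, here is a minimal Python sketch that reads an abridged copy of the purchase order with the standard library's `xml.etree.ElementTree`. The dictionary layout is an assumption chosen for illustration; the point is only that the optional `<comment>` element can be absent without breaking the consumer.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the purchase order (the elided items are simply left out).
PO = """<purchaseOrder orderDate="1999-10-20">
  <customer><name>Alice Smith</name><city>Mill Valley</city><state>CA</state></customer>
  <items>
    <item partNum="872-AA">
      <productName>Lawnmower</productName><quantity>1</quantity><USPrice>148.95</USPrice>
    </item>
    <item partNum="926-AA">
      <productName>Garden Hose</productName><quantity>2</quantity><USPrice>11.98</USPrice>
      <comment>Without these, I'm hosed!</comment>
    </item>
  </items>
</purchaseOrder>"""

root = ET.fromstring(PO)
items = []
for item in root.iter("item"):
    items.append({
        "partNum": item.get("partNum"),          # attribute access
        "name": item.findtext("productName"),    # child element text
        # <comment> is optional: findtext returns the default when it is absent.
        "comment": item.findtext("comment", default=None),
    })
```

The first item has no comment and the second does, yet both pass through the same code path.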

XML Schema • How can I express the “type” of my document? • Programs must know something for things to (really) work! • First attempt: document type definitions (DTDs) • Today’s answer: XML Schema • Think class definition for XML documents • Typed elements and attributes • Simple types and complex types • Derived types (extension, restriction) • Integrity constraints (values, occurrences, uniqueness) • Referential integrity (keys and key references) • Sequence vs. all, choice, substitution groups, and more

Simple Schema Example
<person>
  <name>John Jones</name>
  <birthday>1970-01-05</birthday>
</person>
<xs:schema . . .>
  <xs:element name="person" type="person-info"/>
  <xs:complexType name="person-info">
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
      <xs:element name="birthday" type="xs:date"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
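What the `person-info` type demands of an instance can be approximated by hand. The sketch below is not an XML Schema validator (Python's standard library does not include one); it is a hypothetical `check_person` function that hard-codes the two rules visible in the schema: a `name` followed by a `birthday`, with the birthday in the `YYYY-MM-DD` lexical form that `xs:date` uses. `date.fromisoformat` is used as an approximation of that check and ignores details such as optional time zones.

```python
import xml.etree.ElementTree as ET
from datetime import date


def check_person(xml_text):
    """Return a list of problems found; an empty list means the checks passed."""
    person = ET.fromstring(xml_text)
    problems = []
    if person.tag != "person":
        problems.append(f"unexpected root element <{person.tag}>")
    # xs:sequence: the children must appear exactly in this order.
    if [child.tag for child in person] != ["name", "birthday"]:
        problems.append("expected <name> followed by <birthday>")
        return problems
    try:
        # xs:date uses the ISO 8601 lexical form YYYY-MM-DD.
        date.fromisoformat(person.findtext("birthday").strip())
    except ValueError:
        problems.append("birthday is not a valid xs:date")
    return problems


ok = check_person(
    "<person><name>John Jones</name><birthday>1970-01-05</birthday></person>")
bad = check_person(
    "<person><name>John Jones</name><birthday>01-05-1970</birthday></person>")
```

A value such as `01-05-1970` is rejected because it does not follow the year-first form, which is exactly the kind of error a typed schema lets programs catch early.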

XQuery • Why not just use or extend SQL…? • Flat tables vs. hierarchical XML • Uniform tables vs. highly variable XML • Unordered rows vs. ordered XML content • Table schema vs. flexible or absent XML schema • Relational data vs. mixed XML content • (Not to mention the fact that SQL is already “full”!) • XQuery is to XML as SQL is to tables • W3C XML Query WG • Both “data XML” and “text XML” use cases • Hoping for a Recommendation in Spring 2004

XQuery Basics • XQuery data model (shared with XPath 2.0) • Query consumes and produces sequences of either/both: • Atomic values (based on XML Schema’s primitive types) • XML nodes (element, attribute, text, document, etc.) • Functional, side-effect-free language • Query = prolog (environment) + body expression (result) • Basic expressions can be constants, variables, arithmetic expressions, function calls, or path expressions • Expressions combinable via operators, functions, node constructors, if/then/else, typeswitch, FLWOR expressions • Very rich, powerful language • Navigate within an input document • Combine data from multiple input documents • Generate new XML structures (w/JSP-like { } syntax)

FLWOR Expressions • for clause – very similar to FROM in SQL • Generates one or more value sequences and binds the values to query variables • let clause – similar to temporary views in SQL • Binds a temporary variable to the result of a query expression • where clause – much like SQL’s WHERE clause • Contains Boolean predicates that restrict the for clause’s variable bindings • order by clause – like ORDER BY in SQL • Contains a list of expressions that dictate the order of the FLWOR expression’s XML output • return clause – like SELECT clause in SQL (on steroids) • Specifies the query’s desired XML output, using JSP-like approach to switching between literal XML and XQuery expressions
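For readers who think in a general-purpose language, the five clauses map loosely onto ordinary loop constructs. The Python sketch below is an analogy only: the `items` list is hypothetical sample data standing in for an XML input sequence, and the function name `flwor` is made up for illustration.

```python
# Hypothetical in-memory "items" standing in for an XML input sequence.
items = [
    {"part": "872-AA", "qty": 1, "price": 148.95},
    {"part": "926-AA", "qty": 2, "price": 11.98},
    {"part": "455-BX", "qty": 3, "price": 39.98},
]


def flwor(seq):
    bound = []
    for i in seq:                          # for: bind a variable to each value in turn
        total = i["qty"] * i["price"]      # let: name an intermediate result
        if total > 50.0:                   # where: filter the bindings
            bound.append((i, total))
    # order by: sort the surviving bindings, here by total, descending
    bound.sort(key=lambda b: b[1], reverse=True)
    # return: construct one output value per surviving binding
    return [{"pno": i["part"], "total": round(t, 2)} for i, t in bound]


result = flwor(items)
```

The analogy breaks down in one important way: XQuery is declarative, so an engine is free to reorder or push down these steps, whereas the Python loop fixes an evaluation order.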

XQuery Example
for $p in doc("polist.xml")/purchases/purchaseOrder
where $p/customer/state = 'CA'
return
  <CAOrder>
    {$p/customer}
    {
      for $i in $p/items/item
      let $itot := $i/quantity * $i/USPrice
      where $itot > 50.0
      return
        <item>
          <pno>{data($i/@partNum)}</pno>
          <name>{data($i/productName)}</name>
          <total>{$itot}</total>
        </item>
    }
  </CAOrder>

XML Input (Multiple POs)
<purchases>
  <purchaseOrder orderDate="1999-10-20">
    <customer>
      <name>Alice Smith</name>
      <city>Mill Valley</city>
      <state>CA</state>
    </customer>
    <items>
      <item partNum="872-AA">
        <productName>Lawnmower</productName>
        <quantity>1</quantity>
        <USPrice>148.95</USPrice>
      </item>
      <item partNum="926-AA">
        <productName>Garden Hose</productName>
        <quantity>2</quantity>
        <USPrice>11.98</USPrice>
        <comment>Without these, I’m hosed!</comment>
      </item>
      ...
    </items>
  </purchaseOrder>
  . . .
</purchases>

XQuery Output
<CAOrder>
  <customer>
    <name>Alice Smith</name>
    <city>Mill Valley</city>
    <state>CA</state>
  </customer>
  <item>
    <pno>872-AA</pno>
    <name>Lawnmower</name>
    <total>148.95</total>
  </item>
  ...
</CAOrder>
...
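To see why the Garden Hose line (2 × 11.98 = 23.96) is filtered out while the Lawnmower survives, the query's logic can be approximated in Python. This is a hand-written equivalent, not how an XQuery engine works: the function name `ca_orders`, the `SAMPLE` document (including its second, non-California order), and the use of `Decimal` for the arithmetic are all assumptions made for the sketch.

```python
import copy
import xml.etree.ElementTree as ET
from decimal import Decimal

# Hypothetical input: the slide's order plus one made-up order from another state.
SAMPLE = """<purchases>
  <purchaseOrder orderDate="1999-10-20">
    <customer><name>Alice Smith</name><city>Mill Valley</city><state>CA</state></customer>
    <items>
      <item partNum="872-AA"><productName>Lawnmower</productName>
        <quantity>1</quantity><USPrice>148.95</USPrice></item>
      <item partNum="926-AA"><productName>Garden Hose</productName>
        <quantity>2</quantity><USPrice>11.98</USPrice></item>
    </items>
  </purchaseOrder>
  <purchaseOrder orderDate="1999-10-21">
    <customer><name>Bob Jones</name><city>Albany</city><state>NY</state></customer>
    <items>
      <item partNum="872-AA"><productName>Lawnmower</productName>
        <quantity>1</quantity><USPrice>148.95</USPrice></item>
    </items>
  </purchaseOrder>
</purchases>"""


def ca_orders(purchases_xml):
    """Build one <CAOrder> per California purchase order, keeping items over 50.0."""
    results = []
    for po in ET.fromstring(purchases_xml).findall("purchaseOrder"):   # for $p
        if po.findtext("customer/state") != "CA":                       # where
            continue
        out = ET.Element("CAOrder")
        out.append(copy.deepcopy(po.find("customer")))                  # {$p/customer}
        for item in po.findall("items/item"):                           # for $i
            itot = (Decimal(item.findtext("quantity"))
                    * Decimal(item.findtext("USPrice")))                # let $itot
            if itot > Decimal("50.0"):                                  # where
                el = ET.SubElement(out, "item")                         # return <item>
                ET.SubElement(el, "pno").text = item.get("partNum")
                ET.SubElement(el, "name").text = item.findtext("productName")
                ET.SubElement(el, "total").text = str(itot)
        results.append(out)
    return results


results = ca_orders(SAMPLE)
```

The Python version needs roughly three times the lines of the XQuery and hard-wires the evaluation order, which is a fair summary of the case for a declarative XML query language.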

Web Services (in 1 slide) • Think XML RPC (but not strictly RPC!) • Synchronous or asynchronous • Loose coupling (“document style”) • XML web services • Expose callable application logic • Consume and produce XML documents • Described using WSDL (XML, XML Schema) • Talk via messaging (standard Internet protocols) • Format messages according to SOAP (XML) • Discovery supported via UDDI (less important) • Primary use cases • Application integration: add employee to PeopleSoft • Internet commerce: get price quote, track package
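As a rough illustration of "format messages according to SOAP", the sketch below builds a document-style request inside a SOAP 1.1 envelope using Python's standard library. The envelope namespace is the standard SOAP 1.1 one; everything application-specific (the `http://example.com/quotes` namespace, the `getPriceQuote` operation, the `partNum` parameter) is hypothetical and chosen only to echo the slide's "get price quote" use case.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"   # SOAP 1.1 envelope namespace
APP_NS = "http://example.com/quotes"                     # hypothetical service namespace


def build_quote_request(part_num):
    """Wrap a document-style request in a SOAP 1.1 envelope."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    # The payload is an ordinary XML document; the operation name is hypothetical.
    request = ET.SubElement(body, f"{{{APP_NS}}}getPriceQuote")
    ET.SubElement(request, f"{{{APP_NS}}}partNum").text = part_num
    return ET.tostring(envelope, encoding="unicode")


message = build_quote_request("872-AA")
```

In practice the shape of the body would be dictated by the service's WSDL and its XML Schema types rather than written by hand.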

On To The Enterprise…

The Modern Enterprise [Figure: internal systems (CRM and Support, ERP, Employee Systems, Partner & Channel Management, Supply Chain Management), a firewall, and the parties that use those systems (Customers, Remote Divisions & Employees, Employees, Strategic Partners, Distributors & Channel Partners, Suppliers)]

Modern Enterprise Challenges • Wide variety of physical information source types • Highly diverse data sources, as mentioned earlier • Boils down to a mix of query-able “data” sources and API-based “functional” sources • At next level of detail, disparate data representations / APIs • Heterogeneous information representations • Lack of any canonical model for a given entity – requiring schema and data mappings • Ever-growing need for information integration • More and more applications being built by composition • More and more companies being built through M&A’s • More and more channels need to access same information • More and more need for “real-time visibility” to compete

XML Is No Different – Or Is It? • XML was designed for data diversity • The modern enterprise is a case study in data diversity • Databases (relational and other) • Applications (packaged, homegrown, web services) • Other information sources (files, messages, …) • Very real problems facing the IT world today • Single view of X (ex: X = customer) for multiple channels • Virtual data access layer for new applications • XML is changing the game • Community consensus is happening around XML • XML is being pulled, not being pushed! • XML Schema, XQuery also happening on their own • Schema has real traction, XQuery has much interest

<shamelessPlug>

BEA Liquid Data for WebLogic [Architecture diagram; labels, roughly top to bottom. Consumers: Employees, Customer Service, Order Management, Business Process, Portal / Dashboard, Web Application. Access: query request in, XML result out, via Client API (EJB, JSP TagLib, Web Services, Workshop Control). Liquid Data: Data Views (Support, Sales, Customer), Administration Console (caching, security, management), Data View Builder (query design tool), optimized distributed queries. Connectors: JDBC, Web Services, XML Files, Inflight XML, Custom Functions, BEA WebLogic Adapters. Sources: RDBMS, DW/DM, Custom Apps, Legacy Apps, Packaged Apps, File System, Business Partner, Messages, Other]

Role of XQuery in Liquid Data • Each data source surfaced to XQuery as either • Virtual document described by an XML Schema • Relational database (default schema) • XML file or delimited file (specify schema) • XQuery functions with XML Schema inputs and outputs • Web service (WSDL) • Packaged application (wrapped by BEA XML adapter) • Custom functions (specify signatures) • Stored procedures, custom SQL (specify signatures) • XQuery is then available to define • Single view of X for various X’s • Views tailored to various departmental applications • Stored (“canned”) queries to be invoked by applications • Ad hoc queries, invoked by reporting tools at runtime

Ex: Single View of Customer

Graphical Query / View Builder

Consuming Stored Queries

</shamelessPlug>

Enterprise Modeling Revisited • Q: So, do you really expect me to squeeze my entire enterprise into one big XML Schema? • No, that would probably yield an unnatural model • Also, it wouldn’t scale to 10’s or 100’s (or 1000’s) of X’s • Claim: The world needs a logical “XML data model” • One with entities, relationships, views, constraints • Respectful extension of XML Schema’s facilities • Data views, constraints, and other expressions in XQuery • Queries could be written against this logical model • Don’t re-implement relationships in each query • Don’t re-implement data mappings in each query • Graphical tools add them or offer them up auto-magically

One Possibility: XML Meets ER • XML Entities • Business objects: Customer, Order, Employee, Department • Data content naturally describable using XML Schema • Content computable using XQuery on underlying sources • XML Relationships • Relate two (or more) business entities: HasOrders, WorksIn • Model and abstract out the details of a relationship • Again, XQuery is a fine tool for the job (as I will try to show) • Views in an XML/ER world • A view (or model) is a set of Entities and Relationships, taken together, along with their metadata

Ex: Customers & Orders
[Diagram: Customer and Order entities linked by the HasOrder relationship, with cardinalities (0,n) and (0,1), roles Orders and Orderer, and join predicate ($customer/Id eq $order/CustId)]
<Customer>
  <Id/> <Name/>
  <Address> <Street/> <City/> <State/> <Zip/> </Address>
  <Phone/>
  <CreditCards>
    <Card>* <Type/> <Number/> <Expiration/> </Card>
  </CreditCards>
</Customer>
<Order>
  <Id/> <CustId/> <TotalAmt/>
  <Items>
    <Item>* <ProdId/> <Price/> <Quantity/> </Item>
  </Items>
</Order>
for $c in doc("Customer")/Customer
where $c/Id eq 12345
return … for $o in Orders($c) return …
define function Orders($cust) as Order* {
  for $o in doc("Order")/Order
  where $cust/Id eq $o/CustId
  return $o
}
define function Orderer($order) as Customer { … }

Contrast With One Nested View • Customer and Order both available as first-class (i.e., top-level) business entities in queries and subsequent views:
[Diagram labels: Customer, Order; Customer, Order*]
for $c in doc("Customer")/Customer
where $c/Id eq 12345
return … for $o in Orders($c) return …
for $o in doc("Order")/Order
where $o/Id eq 98765
return … let $c := Orderer($o) return …
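The idea of a relationship as a pair of traversal functions over two top-level entities can be mimicked in Python. This is a sketch under assumptions: the `CUSTOMERS` and `ORDERS` documents and their contents are invented sample data, and `orderer` is one plausible reading of the inverse direction based on the stated join predicate (`$customer/Id eq $order/CustId`); the talk itself leaves the body of `Orderer` elided.

```python
import xml.etree.ElementTree as ET

# Hypothetical contents for the two top-level entity collections.
CUSTOMERS = ET.fromstring(
    "<Customers>"
    "<Customer><Id>12345</Id><Name>Alice Smith</Name></Customer>"
    "<Customer><Id>67890</Id><Name>Bob Jones</Name></Customer>"
    "</Customers>")
ORDERS = ET.fromstring(
    "<Orders>"
    "<Order><Id>98765</Id><CustId>12345</CustId><TotalAmt>172.91</TotalAmt></Order>"
    "<Order><Id>98766</Id><CustId>12345</CustId><TotalAmt>20.00</TotalAmt></Order>"
    "</Orders>")


def orders(cust):
    """Customer -> Order*: the HasOrder relationship, traversed forwards."""
    return [o for o in ORDERS.findall("Order")
            if o.findtext("CustId") == cust.findtext("Id")]


def orderer(order):
    """Order -> Customer (or None): the same predicate, traversed backwards.
    Assumed implementation; the source elides the body of Orderer."""
    for c in CUSTOMERS.findall("Customer"):
        if c.findtext("Id") == order.findtext("CustId"):
            return c
    return None


alice = CUSTOMERS.find("Customer[Id='12345']")
alice_orders = orders(alice)
back = orderer(alice_orders[0])
```

Because the join predicate lives in one place, queries that start from either entity reuse it instead of restating the key mapping, which is the point the slide makes against a single nested view.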

To Nest, Or Not To Nest…? • What do I use a top-level XML/ER entity for? • Any X for which a single view of X is desired • Any Y for which “for $y in doc("Y")” is desirable • When should I use a nested XML/ER entity instead? • “Weak” entity in ER-speak (ex: credit card) • “ID dependent” entity in ER-speak (ex: line item) • Entity that is already nested in a web service call result • When should I create an XML/ER relationship? • To relate something to a top-level entity (no physical nesting) • When traversed, still provides “virtual nesting” • Might extend XQuery with relationship traversal (or with function application, e.g.: for $o in $c/Orders( ) …)

We’re Almost There…

Challenge #1: Enterprise Data Services • IT wants to create an “employee service” that can update as well as access employee information drawn from disparate sources. • Updating enterprise entities isn’t like SQL UPDATE • Insert employee into company is a workflow • Making some changes will require specific API calls • Must be based on a coarse-grain, loosely-coupled model • Information needed to process an integration query doesn’t suffice for updates or full query optimization • A given piece of information may live in multiple places • Mappings of keys and other values must be two-way • Need a friendly XML programming model on top • “Integrate once”, declaratively, then system takes it from there • Also need hooks for custom code (like “do instead” triggers)

Challenge #2: Information Asset Management • IT wants to enable its employees to find, reuse, and safely change components of any given enterprise data service. • Lots of information must be captured and managed • Data sources and their schemas / signatures • Canonical schemas for business entities • Relationships (including keys, foreign keys, etc.) • Two-way mappings and view definitions • Data redundancy (ex: copies of employee address) • Need a robust repository of enterprise XML assets • General catalog of what’s known / available • Metadata for distributed query planning • Metadata to drive update propagation • Dependency information for change management

Where We’ve Just Been • Brief data integration history lesson • Relational revolution • Federated database history • XML technology overview, focusing on • XML Schema • XQuery • Web services • XML for enterprise information integration • EII challenges • Why XML can (and will!) hit a home run • It’s real: BEA’s Liquid Data product • Enterprise modeling in the XML age • A few XML-based-EII research challenges

You Made It…!
