Information Integration

Dagstuhl Workshop on Information Integration, April/May 2002 • Information Integration: (My Personal) Workshop Summary

Overview • Supported by • Focus Group Results (esp. Christoph Freytag, Rakesh Agrawal, Gerd Stumme) • Panels (we all) • Starting Point: A Look at the Original Seminar Proposal • The topics raised • The solutions given • Hot research topics • Closing Point: The (foreseeable) truth

Old Keywords

New Keywords • Information Integration • Application Integration • Business Integration • Data Integration • Process Integration

Information Integration: Topic Description • II subsumes all technologies needed to manipulate information scattered over many data stores while supporting a single system image. • The data stores to be integrated are inherently heterogeneous in nature, owned by different organizations, and distributed over the whole world. • Data can be structured, semi-structured, or unstructured. • Data access can be based on standardized interfaces or on proprietary APIs. • II is expected to become a key technology in many application areas like • product data management • business process management • enterprise application integration • life science • entertainment.

Discussion Areas • How to get access to the various data stores? Different technologies like SQL/MED wrappers, J2EE connectors, EAI adapters, and Web Services can be used for this purpose. • When should each of these technologies be used? • Can they be unified? • What are possible system structures? • Which roles will database systems, application servers, workflow systems, messaging systems, portal servers, etc. play? How do they relate and cooperate? • Does “Web Database Technology” suffice? • Can XML be used as the language for describing the integrated information base? How to capture the “navigational access” based on hyper-linked HTML pages that is performed today in many application areas? • How to combine search and query functionality? • How is XML stored: sliced/diced, as a whole document in its own file in the file system, or as a whole document combined with other documents in the file system? How do you index these effectively? How do you combine SQL and an XML-based query over the same data (i.e., XML query against SQL data and SQL against XML)? Is a pure XML database the way to go, or will an extended relational engine be the right solution?
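One way to picture the "XML query against SQL data and SQL against XML" question is the small sketch below. It is only an illustration under stated assumptions, not a technique proposed at the workshop: it uses Python's standard `sqlite3` and `xml.etree.ElementTree` modules, and the helper names `rows_to_xml` and `shred_xml`, as well as the `part` and `supplier` tables, are hypothetical.

```python
import sqlite3
import xml.etree.ElementTree as ET


def rows_to_xml(conn, table, columns):
    """Publish a SQL table as an XML tree so path queries can run over it.

    Table and column names are assumed to be trusted identifiers.
    """
    root = ET.Element(table + "s")
    query = "SELECT {} FROM {}".format(", ".join(columns), table)
    for row in conn.execute(query):
        item = ET.SubElement(root, table)
        for name, value in zip(columns, row):
            ET.SubElement(item, name).text = str(value)
    return root


def shred_xml(conn, root, table, columns):
    """Shred ("slice and dice") flat XML elements into rows of a new SQL table."""
    conn.execute("CREATE TABLE {} ({})".format(table, ", ".join(columns)))
    placeholders = ", ".join("?" for _ in columns)
    insert = "INSERT INTO {} VALUES ({})".format(table, placeholders)
    for item in root.findall(table):
        conn.execute(insert, [item.findtext(c) for c in columns])


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE part (id, name)")
conn.executemany("INSERT INTO part VALUES (?, ?)", [(1, "bolt"), (2, "nut")])

# XML-style path query against data that lives in SQL.
xml_view = rows_to_xml(conn, "part", ["id", "name"])
nuts = [p.findtext("name") for p in xml_view.findall("part[id='2']")]

# SQL query against data that arrived as XML.
doc = ET.fromstring(
    "<suppliers><supplier><sid>7</sid><city>Wadern</city></supplier></suppliers>"
)
shred_xml(conn, doc, "supplier", ["sid", "city"])
cities = conn.execute("SELECT city FROM supplier WHERE sid = '7'").fetchall()
```

Both directions work here only because the XML is flat and regular; deeply nested or mixed-content documents are where the storage and indexing questions above start to matter.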

Discussion Areas • How is information described? Which information qualities are needed? How can qualities be compared, assessed, measured, …? Which metadata is relevant (schema, ontologies, …)? • Which Federated Database Technologies can be used? What is a federated schema if structured and unstructured data are brought together? Which schema integration techniques, federated query and search technologies are applicable? • Not Discussed • Which Transaction Model is appropriate? Which transactional guarantees are needed? Which concurrency models, recovery models are applicable? • Not Discussed

Some Results/Agreements: ... Definitions • Data Integration • integration schema: an “image” presenting all “relevant” facts as one data source • generic functions for access and change • Integration is different from cooperation and interaction • Integration properties are impacted by • Experimental/Exploratory vs. Production • Exploratory • loosely coupled • fluid integration • Data centric • Production • Function integration • less flexible and often fixed
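The data integration definition above (one "image" over all relevant facts, plus generic functions for access and change) can be sketched roughly as follows. This is a minimal illustration, assuming each source is just a list of dictionaries and that a per-source mapping from global to local attribute names is given; the class `IntegratedView` and its methods are hypothetical names, not something defined at the workshop.

```python
class IntegratedView:
    """Presents several record sources as one data source."""

    def __init__(self):
        self._sources = []  # list of (records, mapping) pairs

    def add_source(self, records, mapping):
        """records: list of dicts; mapping: global attribute -> local attribute."""
        self._sources.append((records, mapping))

    def select(self, **criteria):
        """Generic access: return matching records in global attribute names."""
        result = []
        for records, mapping in self._sources:
            for rec in records:
                image = {g: rec.get(l) for g, l in mapping.items()}
                if all(image.get(k) == v for k, v in criteria.items()):
                    result.append(image)
        return result

    def update(self, key, key_value, **changes):
        """Generic change: write through to every source record whose key matches."""
        changed = 0
        for records, mapping in self._sources:
            if key not in mapping:
                continue
            for rec in records:
                if rec.get(mapping[key]) == key_value:
                    for g, v in changes.items():
                        if g in mapping:
                            rec[mapping[g]] = v
                    changed += 1
        return changed


# Two made-up sources with different local attribute names.
crm = [{"cust_no": 1, "nm": "Ada"}]
shop = [{"id": 2, "full_name": "Bob"}]

view = IntegratedView()
view.add_source(crm, {"id": "cust_no", "name": "nm"})
view.add_source(shop, {"id": "id", "name": "full_name"})
everyone = view.select()
```

The sketch sits on the exploratory, data-centric side of the distinction above: sources stay loosely coupled and can be added at any time, at the cost of scanning every source on each call.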

Some Results/Agreements: ... Definitions • Structured object: • <oid, {<name, value>}> • Unstructured object: • <oid, {word}> • <oid, unknown/complex structure> • Semi-structured object • <oid, {<name, value>}, {word}> • <name, value> pairs may be • Given (e.g. author, title, etc.) • Extracted (e.g. Date, Zipcode, etc.) • Inferred (e.g. Topic)
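The three object definitions above can be written down as one small data type. The sketch below is an illustration under assumptions, not the workshop's notation: the class name `InfoObject`, the function `extract_zipcode`, and the five-digit zip code pattern are all hypothetical choices, and the word set `{word}` is kept as a list to preserve order.

```python
import re
from dataclasses import dataclass, field


@dataclass
class InfoObject:
    oid: str
    pairs: dict = field(default_factory=dict)  # {<name, value>}
    words: list = field(default_factory=list)  # {word}

    def kind(self) -> str:
        """Classify the object according to the three definitions."""
        if self.pairs and self.words:
            return "semi-structured"
        if self.pairs:
            return "structured"
        return "unstructured"


# Assumption: a zip code is any word made of exactly five digits.
ZIP_RE = re.compile(r"\d{5}")


def extract_zipcode(obj: InfoObject) -> InfoObject:
    """Return a copy of obj with an *extracted* <zipcode, value> pair, if any."""
    pairs = dict(obj.pairs)
    for word in obj.words:
        if ZIP_RE.fullmatch(word):
            pairs.setdefault("zipcode", word)
            break
    return InfoObject(obj.oid, pairs, list(obj.words))


# A structured object with *given* pairs, and an unstructured bag of words.
book = InfoObject("o1", pairs={"author": "N.N.", "title": "Untitled"})
note = InfoObject("o2", words=["ship", "to", "12345", "Anytown"])
enriched = extract_zipcode(note)
```

Extraction moves an object from unstructured towards semi-structured without touching its words; an inferred pair such as a topic would be added the same way, only by a classifier instead of a pattern.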

Some Results/Agreements: ... Definitions • Metadata can be anything between natural language text and formalisms with formal semantics (e.g. ER models, (first order) logics, description logics, ontologies), including intermediate degrees of formality (e.g. XML, RDF) • For supporting II, we need more formal models which allow for machine manipulation • Ontologies are • data about metadata (schema for metadata) • Force people to use them! • (at least) 2 Secrets/Rumors: • Late night tutorial for thirsty people • (late) night tête-à-tête (maybe séparé) tutorial

Some Results/Agreements: ... Web Services • Web Services is a new model for using the Web: “An interface that describes a collection of operations that are network accessible through standardized XML messaging.” • transactions initiated automatically by a program, not necessarily using a browser • can be described, published, discovered, and invoked dynamically in a distributed computing environment

Some Results/Agreements: ... Information Integration: the database reaches out • Unstructured Data Support • OLAP, Mining, rich Search • Federated database extensions • Metadata management • SQL and XML • Pure XML • ... • SQL and XML and NF2-like technology • DBMS should reflect some semantics of the applications

Some Results: Confluence of Multiple Disciplines • Application Integration • Business Integration • Information Integration • Communication Integration • Policy Integration • Data Integration • Process Integration • Meta Data/Ontologies

... Other Results • More focus on Process Models and Process Specifications • DIFF operator • Formal theory for process models etc. • More focus on Semantics • Much more focus on Semantics • ... • Performance • Much more Performance • ... • Information systems should also project into the future ...

The (foreseeable) Truth • Heterogeneity is Fact • Integration is Fact • Make Life in Heterogeneity possible ... Easy! • Data Integration per se is not beneficial! • There has to be life beyond angular brackets!
