560 likes | 820 Views
Technologie van Informatiesystemen TIS college 3 Inhoud Inleiding, 30/11 Web engineering & Web information systems, 7/12 Data transformatie & Data integratie, 14/12 ERP, Smulders (Deloitte), 21/12 + 11/1 Flower, Berens (Pallas Athena), 25/1 + 1/2 Biztalk, van den Boom (Microsoft), 15+22/2
E N D
Technologie van InformatiesystemenTIS college 3
Inhoud • Inleiding, 30/11 • Web engineering & Web information systems, 7/12 • Data transformatie & Data integratie, 14/12 • ERP, Smulders (Deloitte), 21/12 + 11/1 • Flower, Berens (Pallas Athena), 25/1 + 1/2 • Biztalk, van den Boom (Microsoft), 15+22/2
Inhoud • Inleiding, 30/11 • Web engineering & Web information systems, 7/12 • Data transformatie & Data integratie, 14/12 • ERP, Smulders (Deloitte), 21/12 + 11/1 • Flower, Berens (Pallas Athena), 25/1 + 1/2 • Biztalk, van den Boom (Microsoft), 15+22/2 Philippe Thiran
Philippe Thiran Computer Science Department Technische Universiteit Eindhoven The Netherlands Data TransformationData Integration
Data Transformation & Integration • Agenda • Problem Statement • Existing database systems • Heterogeneity, distribution, autonomy • Data Transformation • Schema conversion • Query conversion: Wrapper • Data Integration • Schema integration • Query processing: Multidatabase and Federation
Problem Statement Existing database systems Heterogeneity, distribution, autonomy
Problem StatementExisting Database Systems • Existing Database Systems • Data are recorded in existing database systems • Existing database systems are: • Mission critical (essential to the organization business) • To be operational at all times • Inflexible • Typically, existing database systems are: • Very large (millions of lines of code) • Old (often more than 10 years old) • Written in old programming language like COBOL, PL/1, SQL! • Built around an old DBMS
Problem StatementExisting Database Systems • Existing Database Systems • Data are recorded in existing database systems • Answer of old requirements • New functions and services • New user requirements • New technology (Web) • Communication among them?
Problem StatementExisting Database Systems • Existing Systems: New Services • How to deal with existing database systems ? • Abandon the existing systems: migration to a new system • Keep and modify the existing systems • Keep the existing systems and wrap them: autonomy • Existing Systems: Communication • How to integrate existing database systems?
Distribution Distributed databases Autonomy Centralized DBMS Heterogeneity Problem StatementData Integration • Data Integration Problems • Integrating database systems is very hard and costly • Three main dimension of the problem: • Distribution • Autonomy • Heterogeneity
Distribution Autonomy Heterogeneity Data IntegrationProblem Statement • Autonomy • Autonomy refers to the distribution of control • Four dimensions of autonomy: • Design: own data models and own transaction management technique • Communication: nor knowledge of the existence of other system nor how to communicate with them • Execution: independently of the other systems • Association: each system decides how much of its data and processing capabilities it will share with the other system
Distribution Autonomy Heterogeneity Data IntegrationProblem Statement • Heterogeneity • Heterogeneity may exist at three basic levels: • DBMS level. Data is managed by a variety of DBMS based on different data models and data languages • Data models : relational model, hierarchical model and file model • Data languages : SQL, DL/1, COBOL programs • Platform level. Different hardwares, different network protocols • Semantic level. Different designer viewpoints in modelling the same objects of the application domain. Incompatible design specifications which lead to different naming, types or integrity constraints
Data IntegrationGeneric Integration Architecture • Schema Hierarchy Homogenizes and unions import schemas Integrated Schema Common Model Import Schema 1 Import Schema 2 Import Schema 3 View on export schema available for non-local access Export Schema 1 Export Schema 2 Export Schema 3 Unifies data models Database Schema 1 Database Schema 2 Data Schema 3 Local Models DB1 DB2 File System OO DBMS Relational DBMS
Data IntegrationGeneric Integration Architecture • Schema Hierarchy Integrated Schema Data and Schema Integration Common Model Import Schema 1 Import Schema 2 Import Schema 3 Export Schema 1 Export Schema 2 Export Schema 3 Data and Schema Transformation Database Schema 1 Database Schema 2 Data Schema 3 Local Models DB1 DB2 File System OO DBMS Relational DBMS
Data Transformation Schema Conversion Query Conversion: Wrapper
Data TransformationSchema Conversion • Introduction • Schema conversion • Query/Data conversion Common Data Model Export Schema 1 Export Schema 2 Query1 Data1 Data2 Query2 Local Data Models Database Schema 1 Database Schema 2 Query1’ Data1’ Data2’ Query2’ Data Source 1 Data Source 2
Data TransformationSchema Conversion • Schema Conversion • Schema transformation • Transformation of a schema expressed in a data model (Ms) into an equivalent schema expressed in another data model (Mt) • Examples • ER model Relational model (lecture ISO) • Relational model XML Schema (see later) • Schema transformation operators • Schema conversion consists in applying the relevant transformations on the relevant constructs of the schema expressed in Ms in such a way that the final result complies with Mt
B A B1 A1 A B B2 A1 B1 id: B1 B1 B2 ref: B1 id: B1 R 1-1 0-N Data TransformationSchema Conversion • Schema Conversion • Schema transformation • A (schema) transformation basically is an operator by which a source data structure C is replaced with a target structure C'. • Example of a semantics-preserving transformation: transforming a relationship type into an attribute
B A B1 A1 A B B2 A1 B1 id: B1 B1 B2 ref: B1 id: B1 R 1-1 0-N Data TransformationSchema Conversion • Schema Conversion • 2 main schema transformations for ER model Relational model
Data TransformationSchema Conversion • Schema Conversion • Exercice: FromER model Relational model
Data TransformationSchema Conversion • Schema Conversion • Exercice: FromER model Relational model
Data TransformationSchema Conversion • Schema Conversion • Exercice: FromER model Relational model
Common Data Model Common Data Model Common Query Language Export Schema Data TransformationWrappers Wrapper Local Data Models Database Schema Data Source • Definition • A wrapper controls a (legacy) data source • Basically a wrapper is a software component that offers an homogeneous query interface based on a common data model (XML for the Web) • It converts data and queries from the common data model to a local data model It offers an adequate way for solving the DBMS heterogeneity that appears when one wants to integrate existing and heterogeneous data systems
Data TransformationWrappers • Definition (ctd) • A data wrapper is basically defined as a converter of data and queries • That is, a wrapper: • Offers an export schema in the common data model • Accepts queries against the export schema • Translates them into queries understandable by the data system • Transforms the results of the local queries into a format understood by the application Query Data Common Data Model Common Data Model Common Query Language Export Schema Wrapper Local Data Models Local Data Model Local Query Language Database Schema Data Source
Data TransformationWrappers • Categories of Wrappers • There exists no standard approach to build wrappers • Functionality • One-way: only transformation of data (e.g., for data warehouses) • Two-way: transformation of requests and data • Development • Hard-wired wrappers, for specific data sources • Semi-automated generation: wrapper development tools • Automatically generated wrappers • Availability • Standalone programs (data conversion, data migration) • Components of a federation (see later) • Database interface for foreign data
Data TransformationWrappers • Wrappers and the Web • Wrapper interface • Data format: XML • Common data model: XML DTD and Schema • Common query language: XPath, XQuery, none • Wrapper mapping • Generally between relational data and XML • Two translation types • Automated • Defined by the user • XML- or SQL-oriented query language
Data TransformationWrappers • XML Views of Relational Databases • Automated translation Order Item Payement <db> <order> <row><id>10</id><custname>Philips</custname><custum>7734</custnum></row> <row><id>9</id><custname>Unilever</custname><custum>7725</custnum></row> </order> <item> <row><oid>10</oid><desc>Ship</desc><cost>24000</cost></row> <row><oid>10</oid><desc>Generator</desc><cost>8000</cost></row> </item> <payement> similar to <order> and <item> </payement> </db>
Data TransformationWrappers • XML Views of Relational Databases • User-defined Translation Order <order id=’10’> <custname> Philips </custname> <items> <item description=“Ship”> <cost> 24000 </cost> </item> <item description=“Generator”> <cost> 800 <cost> </item> </items> </payments> <payement due=’1/10/01’> <amount> 20000 </amount> </payement> </payements> </order> <order id =‘9’> … </order> Item Payement
Data TransformationWrappers • XML Views of Relational Databases • Exercises • What is the XML Document of this relational database?
Data TransformationWrappers • XML Views of Relational Databases • Exercises • What is the XML Document of this relational database? <!ELEMENT Catalog (Order*, Product*)> <!ELEMENT Order (Customer, Date, Total?, Detail+)> <!ATTLIST Order OrderID ID #REQUIRED> <!ELEMENT Customer ANY> <!ELEMENT Date (#PCDATA)> <!ELEMENT Total (#PCDATA)> <!ELEMENT Detail (Quantity, Amount)> <!ATTLIST Detail Product IDREF #REQUIRED> <!ELEMENT Quantity (#PCDATA)> <!ELEMENT Amount (#PCDATA)> <!ELEMENT Product (Supplier+)> <!ATTLIST Product Reference ID #REQUIRED Label CDATA #IMPLIED UnitPrice CDATA #REQUIRED> <!ELEMENT Supplier ANY>
Data TransformationWrappers • XML Views of Existing Relational Databases • Mapping definition • SQL-oriented query language For $b in SQL(select * from Order where Custname=“’ +$x + ‘””) return <order> {$b/Id} <Custname>{$x}</Custname></order> Order Order Id Custname
Data TransformationWrappers • XML Views of Existing Relational Databases • XML View definition • Bottom-up (from the relational schema) • Top-Down (from a given XML schema) • Mappings between XML views and relational schemas • Automated (algorithm) • Manual (defined by the user)
Data TransformationWrappers • XML Views of Existing Relational Databases • Examples
Data Integration Generic Integration Architecture Schema Integration Query Processing: multidatabase and federation
Data Integration Generic Integration Architecture Schema Integration
Data IntegrationGeneric Integration Architecture • Schema Hierarchy Homogenizes and unions import schemas Integrated Schema Common Model Import Schema 1 Import Schema 2 Import Schema 3 View on export schema available for non-local access Export Schema 1 Export Schema 2 Export Schema 3 Unifies data models Database Schema 1 Database Schema 2 Data Schema 3 Local Models DB1 DB2 File System OO DBMS Relational DBMS
Data IntegrationGeneric Integration Architecture • Component Architecture Application 1 Application 2 Application 3 Integrated Schema Common DDL/DML Meditor Offers an abstract integrated view of sources Reconciles independent data structures to yield a unique, coherent, view of the data Import Schema 1 Export Schema 1 Wrapper Wrapper Wrapper Local DDL/DML Database Schema 1 DBMS 1 DBMS 2 DBMS 3 Controls a local data source Offers an homogeneous query interface based on a common data model DB1 DB2 DB3
Data IntegrationGeneric Integration Architecture • Aspects to Consider for Integration • General Issues • Bottom-up vs. top-down engineering • From existing schema to integrated or vice-versa • Schema integration vs. schema matching • Virtual vs. materialized integration • Read-only vs. read-write access • Transparency • Language, schema, location • Data Model related issues • Types of sources • Structured, semi-structured, unstructured • Common data model of integrated system • Tight vs. loose integration • Use of a global schema • Query model
Data IntegrationSchema Integration • Methodology • Bottom-up process • Four main steps • Preparing the local schemas • Detecting what is common between the components of local schemas • Correspondence (what is common) • Solving the conflicts • Conflict (what is incompatible) • Integrating the different schemas according to the correspondences and conflicts detected in the previous steps
Data IntegrationSchema Integration • Concept of Correspondence • Two complementary views of correspondence: • Structural correspondence (schema level: concepts) • Instance correspondence (instance level: data) • Structural correspondence • Five types of structural correspondence: • Identity • Independence • Complementarity • Subtyping • Common supertype
Data IntegrationSchema Integration • Concept of Correspondence • Instance correspondence • Four types of instance correspondence: • Disjointed: the instances classes are disjointed • Inclusion: the set of one class is included to another class • Equivalence: the classes contain the same instances • Overlapping: the classes share some instances but not all
Data IntegrationSchema Integration • Concept of Conflict • Conflicts occur in three possible ways : syntactic (naming conflicts), structural, semantic or instance • Syntactic conflicts (resolution: use of an ontology) • Synonyms. Two identical objects (entities, attributes, relationships) that have different names are synonyms • Homonyms. Two different objects that have identical names are homonyms • Structural conflicts (resolution: mapping function or transformation) • Domain. Two identical objects have different domains (Differences in dimension, units and scales) • Structure. The same concept is presented by different data structures (e.g., different attributes)
Data IntegrationSchema Integration • Concept of Conflict • Structural conflict • In the left-hand schema, Address is an compound attribute, whereas in the right-hand one, Address is represented by an entity type • Resolution: transformation Site 2 Site 1
Data IntegrationSchema Integration • Concept of Conflict • Semantic conflicts • A semantic conflict appears when a contradiction appears between two representations A and B of the same application domain concept or between two integrity constraints (resolution?) • Example • In the left-hand schema, Customer is identified by CustId, whereas in the right-hand one, it is identified by Name Site 1 Site 2
Data IntegrationSchema Integration • Concept of Conflict • Instance conflicts • Instance conflicts are specific to existing data • Modelling constructs A and B that are recognized as corresponding can cover sets with different scopes • Examples • ZIP codes of addresses can be written like “NL-5600 MB” or “56oo MB” or “5600” • Different ZIP codes can be recorded for the same address (encoding errors) • Resolution: Data transforming… cleaning?
Data Integration Query Processing: multidatabase and federation
Data IntegrationIntegration Architecture • Three Classical Architectures • Multidatabases • No integrated schema • Integrated access to different relational DBMS • Federated Databases • Integrated schema • Integrated access to different DBMS • Integrated access to different data sources (on the Web) • Data Warehouses • Materialized integrated data sources • Not here
Data IntegrationQuery Processing • Classical Architecture: Multidatabase • Enable transparent access to multiple (relational) databases • Hides distribution, different SQL variants • Processes queries and updates against multiple databases (2-phase commit) • Does not provide any type of global schema (does not hide the different database schemas) • Example: IBM DataJoiner DataJoiner Sybase Open Client Oracle SQL*Net TCP/IP Network Sybase Server Oracle Server
Data IntegrationQuery Processing • Classical Architecture: Multidatabase • Multidatabase schema Multidatabase Schema Sybase Oracle Source 1 Source 2
Data IntegrationQuery Processing • Classical Architecture: Multidatabase • Query processing Multidatabase Schema SELECT p2.title FROM Sybase.PUBLICATIONS p1, Oracle.PAPERS p2 WHERE p1.title = p2.title SELECT title FROM PUBLICATIONS SELECT title FROM PAPERS Sybase Oracle Source 1 Source 2 Sybase Data Oracle Data