340 likes | 355 Views
This paper explores a comprehensive integration architecture for resolving conflicts in relational databases. Learn about the integration process, schema and data conflicts, motivation for integration, and more.
E N D
Automatic Conflict Resolution to Integrate Relational Databases Ramon Lawrence University of Manitoba umlawren@cs.umanitoba.ca
Outline • Introduction, Motivation, and Background • The integration architecture • Standard dictionary, X-Specs, query processor • Example integration • Northwind, Southstorm databases • Querying the integrated databases • Generating SQL queries from semantic queries • Unity implementation • Contributions, Conclusions, and Future Work
What is Integration? • Two levels of integration: • Schema integration -the description of the data • Data integration -the individual data instances • Integration problems include: • Different data models and conflicts within a model • Incompatible concept representations • Different user or view perspectives • Naming conflicts (homonym, synonym) • Integration handles the different mechanisms for storing data (structural conflicts), for referencing data (naming conflicts), and for attributing meaning to the data (semantic conflicts).
Why is Integration Required? • There are many integration environments: • Operational systems within an organization • System integration during company merger • Data warehouses, Intranets, and the WWW • Users require information from many data sources which often do not work together. • Companies require a global view of their entire operations which may be present in numerous operational databases for different departments and distributed geographically. • E-commerce demands integration of web databases with production systems.
Previous Work • Research systems: • integrating systems by logical rules (Sheth) • defining global dictionaries (Castano) • Carnot Project using the Cyc knowledge base • wrapper and mediator systems: • Information Manifold, TSIMMIS, Infomaster • Industrial systems and standards: • Metadata Interchange Specification (MDIS) • XML, BizTalk, E-commerce portals • Query Languages: • SQL, MSQL, IDL, DIRECT, SchemaSQL
Previous Work Summary • Current techniques for database integration have some of these problems: • Require integrator to understand all databases • Integration process is manual • Do not hide system complexity from the user • Force changes on the existing database systems • Construct global view manually • Suffer from query imprecision (query containment)
Our Approach • Our approach combines standardization and query mapping algorithms. • The major idea is that schema conflicts can be resolved if we: • Eliminate all naming conflicts • Define a language capable of determining schema equivalence and performing transformations • Naming conflicts are eliminated by accepting a standard term dictionary. • Not a knowledge base or set of mediated views • Leverages semantic information in English words
Integration Architecture Integrated Context View X-Spec Editor Standard Dictionary Integration Algorithm Query Processor and ODBC Manager Client Client • Architecture Components: • 1) Integrated Context View • user’s view of integration • 2) X-Spec Editor • stores schema & metadata • uses XML • 3) Standard Dictionary • terms to express semantics • 4) Integration Algorithm • combines X-Specs into integrated context view • 5) Query Processor • accepts query on view • determines data source mappings and joins • executes queries and formats results Multidatabase Layer Subtransactions X-Spec X-Spec Database Database Local Transactions
Architecture Components • The architecture consists of four components: • A standard dictionary (SD) to capture data semantics • SD terms are used to build semantic names describing semantics of schema elements. • X-Specs for storing data semantics • Database metadata and semantic names stored using XML • Integration Algorithm • Matches concepts in different databases by semantic names. • Produces an integrated view of all database concepts. • Query Processor • Allows the user to formulate queries on the view. • Translates from semantic names in integrated view to SQL queries and integrates and formats results. • Involves determining correct field and table mappings and discovery of join conditions and join paths
Integration Processes • The integration architecture consists of three separate processes: • Capture process: independently extracts database schema information and metadata into a XML document called a X-Spec. • Integration process: combines X-Specs into a structurally-neutral hierarchy of database concepts called an integrated context view. • Query process: allows the user to formulate queries on the integrated view that are mapped by the query processor to structural queries (SQL) and the results are integrated and formatted.
Integration Architecture:The Capture Process Relational Schema Automatic Extraction X-Spec Specification Editor DBA Lookup of terms Standard Dictionary • Capture process involves: • Automatically extracting the schema information and metadata using a specification editor • Assigning semantic names to each schema element (tables and fields) to capture their semantics
Architecture Components: The Standard Dictionary • A standard dictionary (SD) provides standardized terms to capture data semantics. • Hierarchy of terms related by IS-A or Has-A links • Contains base set of common database concepts, but new concepts can be added • A SD term is a single, unambiguous semantic definition. • Several SD entries for a single English word are required if the word has multiple definitions. • The top-level dictionary terms are those proposed by Sowa.
Architecture Components: Dictionary vs. Knowledge Base • The standard dictionary differs from a knowledge base such as Cyc because: • Not intended to be a general English dictionary or contain knowledge facts about the world • Dictionary is evolved as new terms are required • Not all English words are used • Dictionary provides the systems with no “knowledge” • Since no facts are stored, system cannot deduce new facts • Dictionary terms are just semantic place holders, integrators determine the semantics of the database not the system • Simplified organization • Dictionary is organized as a tree for efficiency and simplicity in determining related concepts • Re-use of terms • Terms are re-used in semantic names
Architecture Components:Using the Standard Dictionary • SD terms are used to build semantic names describing semantics of schema elements. • Semantic names have the form: • semantic name := [CT_Type] | [CT_Type] CN • CT_Type := CT | CT {; CT} | CT {,CT} • CT := context term, CN := concept name • each CT and CN is a single term from the SD • Semantic names are included in specifications describing a database.
Integration Example (3) Page 17
What is a semantic name? • A semantic name is a universal, semantic identifier in a domain. • Similar to a field name in the Universal Relation. • Semantics are guaranteed unique by construction. • System has mechanism for comparing semantics across domains even though it does not understand them. (Exploiting semantics in English words.) • Important definitions: • context - a semantic name is a context if it maps to a table • concept-a semantic name is a concept if it maps to a field • context closure - of semantic name Si denoted Si* is the set of semantic names produced by taking ordered subsets of the terms of Si = {T1, T2 , … TN} starting with T1.
Architecture Components:X-Specs • Database metadata and semantic names are combined into specifications called X-Specs: • Stored and transmitted using XML • Contains information on a relational schema • Organized into database, table, and field levels • Stores semantic names to describe and integrate schema elements
Southstorm X-Spec <?xml version="1.0" ?> <Schema name = "Southstorm_xspec.xml” xmlns="urn:schemas-microsoft-com:xml-data" xmlns:dt="urn:schemas-microsoft-com:datatypes"> <ElementType name="[Order]" sys_name = "Orders_tb" sys_type="Table"> <element type = "[Order] Id" sys_name = "Order_num" sys_type = "Field"/> <element type = "[Order] Total Amount" sys_name = "Order_total" sys_type = "Field"/> <element type = "[Order;Customer] Name" sys_name = "Cust_name" sys_type = "Field"/> <element type = "[Order;Customer;Address] Address Line 1" sys_name="Cust_address" sys_type="Field"/> <element type = "[Order;Customer;Address] City" sys_name = "Cust_city" sys_type = "Field"/> <element type = "[Order;Customer;Address] Postal Code" sys_name="Cust_pc" sys_type="Field"/> <element type = "[Order;Customer;Address] Country" sys_name="Cust_country" sys_type="Field"/> <element type = "[Order;Product] Id" sys_name = "Item1_id" sys_type = "Field"/> <element type = "[Order;Product] Quantity" sys_name = "Item1_quantity" sys_type = "Field"/> <element type = "[Order;Product] Price" sys_name = "Item1_price" sys_type = "Field"/> <element type = "[Order;Product] Id" sys_name = "Item2_id" sys_type = "Field"/> <element type = "[Order;Product] Quantity" sys_name = "Item2_quantity" sys_type = "Field"/> <element type = "[Order;Product] Price" sys_name = "Item2_price" sys_type = "Field"/> </ElementType> </Schema> Page 21
Integration Product:The Integrated Context View • The product of the integration is a structurally-neutral hierarchy of concepts called an integrated context view. • Define a context view (CV) as follows: • If a semantic name Si is in CV, then for any Sj in Si*, Sj is also in CV. • For each semantic name Si in CV, there exists a set of zero or more mappings Mi that associate a schema element Ej with Si. • A semantic name Si can only occur once in the CV. • A context view (CV) is a valid Universal Relation. • Each field is assigned a semantic name which uniquely identifies its semantic connotation.
Architecture Components: The Query Processor • The query processor: • Allows the user to formulate queries on the view. • Translates from semantic names in the context view to structural queries (SQL) on databases. • Involves determining correct field and table mappings and discovery of join conditions and join paths • Retrieves query results and formats them for display to the user. • Client-side query processing: • Perform joins between databases using common keys. • Data value formatting and transformation
Advanced Query Processing • Advanced query processor features include: • global keys and joins - a mechanism for specifying when a field stores a global key such as a social security number. • result normalization - a procedure for normalizing query results returned from each individual database. (e.g. Southstorm) • data integration - transforming data representational conflicts at the global level. • For example, “M” and “F” may represent “Male” and “Female” in one database, and another may represent these concepts using “0” and “1”.
Northwind & Southstorm Query Examples • Example 1: Retrieve all order ids ([Order] Id) and customers ([Customer] Name): • SS: SELECT Order_num, Cust_name FROM Orders_tb • NW: SELECT OrderID, CompanyName FROM Orders, Customers WHERE Orders.CustomerID = Customers.CustomerID • Example 2: Retrieve all ordered products ([Order;Product] Id) and their order ids. • SS: SELECT Order_num, Item1_id, Item2_id FROM Orders_tb • NW: SELECT OrderID, ProductID FROM OrderDetails • Note: In NW, selects from two different order id mappings. In SS, result normalization is required.
Integration Example:Discussion • Important points: • System table and field names are not presented to the user who queries based on semantic names. • Database structure is not shown to the user. • Field and table mappings are automatically determined based on X-Spec information. • Join conditions are inserted as needed when available to join tables. • Different physical representations for the same concept are combined. • Hierarchically related concepts are combined based on their IS-A relationship in the standard dictionary.
Unity Overview • Unity is a software package that implements the integration architecture with a GUI. • Developed using Microsoft Visual C++ 6 and Microsoft Foundation Classes (MFC). • Unity allows the user to: • Construct and modify standard dictionaries • Build X-Specs to describe data sources • Integrate X-Specs into an integrated view • Transparently query integrated systems using ODBC and automatically generate SQL transactions • Unity is available for demonstration and distribution.
Architecture Discussion • The architecture automatically integrates relational schemas into a multidatabase. • Desirable properties: • Individual mappings - information sources integrated one-at-a-time and independently • Integrated view constructed for query transparency - user queries system by semantics instead of structure • Handles schema conflicts - including semantic, structural, and naming conflicts • Automated integration - integrated view constructed efficiently and automatically • No wrapper or mediator software is required • Transparent querying - users issue semantic queries which are translated to SQL by the query processor
Contributions • Architecture contributions: • Has an unique application of a standard dictionary which is not a knowledge base • Separates the capture and integration processes • Produces an integrated, high-level view of all concepts in the underlying databases • Allows transparent querying without structure • Provides algorithms for dynamically extracting database data (creating relevant views) and for mediation of global level conflicts • Arguably simpler method for capturing data semantics than using description logic • An implementation, Unity, which demonstrates the practical benefits of the architecture
Conclusions & Future Work • Automatic database integration is possible by using a standard term dictionary and defining semantic names for schema elements. • Users are able to transparently query integrated systems by concept instead of structure. • We are constantly refining Unity. • Develop an integration component for a web browser • Test the system in large industrial projects. • Allow distributed updates and global updates on all databases.
References • Publications: • Unity - A Database Integration Tool, R. Lawrence and K. Barker, TRLabs Emerging Technology Bulletin, January 2000. • Multidatabase Querying by Context, R. Lawrence and K. Barker, DataSem2000, pages 127-136, Oct. 2000. • Integrating Relational Database Schemas using a Standardized Dictionary, SAC’2001 - ACM Symposium on Applied Computing, March, 2001. • Sponsors: • NSERC, TRLabs • Further Information: • http://www.cs.umanitoba.ca/~umlawren/