290 likes | 413 Views
Multidatabase Querying by Context. Outline. Introduction, Motivation, and Background Integration architecture Integration architecture components standard dictionary, X-Specs, query processor Context View as an Universal Relation Query Processor Algorithms
E N D
Outline • Introduction, Motivation, and Background • Integration architecture • Integration architecture components • standard dictionary, X-Specs, query processor • Context View as an Universal Relation • Query Processor Algorithms • field/table mapping discovery, join selection • model extensions to simplify query construction • Future work and conclusions
Introduction and Motivation • Database integration requires conflict resolution during schema integration. • There are many integration environments: • operational systems within an organization • system integration during company merger • data warehouses, Intranets, and the WWW • Our system automatically integrates relational database schema and allows the user to transparently query the system. • This presentation is on the query system.
Background • Current techniques for database interoperability have some of these problems: • require integrator to understand all databases • integration process is manual • do not hide system complexity from the user • SQL and multidatabase query languages such as MSQL require user to query integrated system by structure • not desirable when accessing numerous databases
Previous Work • Research systems: • integrating systems by logical rules (Sheth) • defining global dictionaries (Castano) • Carnot Project using the Cyc knowledge base • wrapper and mediator systems: • Information Manifold, TSIMMIS, Infomaster • Industrial systems and standards: • Metadata Interchange Specification (MDIS) • XML, BizTalk, E-commerce portals • Query Languages: • SQL, MSQL, IDL, DIRECT, SchemaSQL
Integrated Context View X-Spec Editor Standard Dictionary Integration Algorithm Query Processor and ODBC Manager Integration Architecture Client Client • Architecture Components: • 1) Integrated Context View • user’s view of integration • 2) X-Spec Editor • stores schema & metadata • 3) Standard Dictionary • terms to express semantics • 4) Integration Algorithm • combines X-Specs into integrated context view • 5) Query Processor • accepts query on view • determines data source mappings and joins • executes queries and formats results Multidatabase Layer Subtransactions X-Spec X-Spec Database Database Local Transactions
Integration Architecture • The integration architecture consists of three separate processes: • Capture process: independently extracts database schema information and metadata into a XML document called a X-Spec. • Integration process: combines X-Specs into a structurally-neutral hierarchy of database concepts called an integrated context view. • Query process: allows the user to formulate queries on the integrated view that are mapped by the query processor to structural queries and the results are integrated and formatted.
Integration Architecture:The Capture Process Relational Schema Automatic Extraction X-Spec Specification Editor DBA Lookup of terms Standard Dictionary • Capture process involves: • automatically extracting the schema information and metadata using a specification editor • assigning semantic names to each schema element (tables and fields) to capture their semantics
Architecture Discussion • The architecture automatically integrates relational schemas into a multidatabase • Desirable properties: • individual mappings - information sources integrated one-at-a-time and independently • integrated view constructed for query transparency - user queries system by semantics instead of structure • handles schema conflicts - including semantic, structural, and naming conflicts • automated integration - integrated view constructed efficiently and automatically • no wrapper or mediator software is required
Architecture Components: The Standard Dictionary • A standard dictionary (SD) provides standardized terms to capture data semantics. • Hierarchy of terms related by IS-A or Has-A links • Contains base set of common database concepts, but new concepts can be added • A SD term is a single, unambiguous semantic definition. • Several SD entries for a single English word are required if the word has multiple definitions.
Architecture Components:Using the Standard Dictionary • SD terms are used to build semantic names describing semantics of schema elements. • Semantic names have the form: • semantic name = “[“CT [[;CT] | [,CT]] “]” CN • CT = context term, CN = concept name • each CT and CN is a single term from the SD • Semantic names are included in specifications describing a database.
Architecture Components:X-Specs • Database metadata and semantic names are combined into specifications called X-Specs: • stored and transmitted using XML • contains information on a relational schema • organized into database, table, and field levels • stores semantic names to describe and integrate schema elements
Architecture Components:Integrating X-Specs • Each database to be integrated is described using a X-Spec. • Identical concepts in different databases are identified by similar semantic names. • Concepts with identical (or hierarchially related) semantic names are combined regardless of their physical representation in the individual databases. • Product of the integration algorithm is a structurally-neutral integrated view of all database concepts.
Integration Architecture:The Integration Process • Integration process involves: • automatically identifying identical concepts by matching semantic names • constructing a structurally-neutral integrated view of database concepts • resolving structural differences during query generation and submission (e.g. a concept may be represented as a table in one database and a field (attribute) in another)
Context View as a Universal Relation • Definitions: • dictionary term -unambiguous word phrase in dictionary • semantic name- combination of dictionary terms to represent schema element semantics • context - a semantic name is a context if it maps to a table • concept-a semantic name is a concept if it maps to a field • context closure - of semantic name Si denoted Si* is the set of semantic names produced by taking ordered subsets of the terms of Si = {T1, T2 , … TN} starting with T1. • Example: If Si = [A;B;C] D then Si* ={[A], [A;B], [A;B;C], [A;B;C] D}.
Context View as a Universal Relation (cont.) • Define a context view (CV) as follows: • If a semantic name Si is in CV, then for any Sj in Si*, Sj is also in CV. • For each semantic name Si in CV, there exists a set of zero or more mappings Mi that associate a schema element Ej with Si. • A semantic name Si can only occur once in the CV. • A context view (CV) is a valid Universal Relation. • Each field is assigned a semantic name which uniquely identifies its semantic connotation. • Illustrates that context view provides structural transparency similar to Universal Relation
Architecture Components: The Query Processor • The query processor: • allows the user to formulate queries on the view • translates from semantic names in the context view to structural queries (SQL) on databases • involves determining correct field and table mappings • and discovery of join conditions and join paths • retrieves query results and formats them for display to the user
The Query Processor:Determining field/table mappings • For each database (D) in the context view • For each semantic name (S) in query • If S has only one semantic name mapping in D Then • Add field mapping to query and its parent table • Else If S has multiple mappings but all in one table Then • Add each field mapping to query and the parent table • Else S has multiple mappings in more than one table Then • If any field mapping has a table already in query take that one • Else take field mapping with best semantic name match • Else take first mapping found • End If • Next • Next
The Query Processor:Constructing Join Graphs • Given a set of fields (F) and tables (T) to access, joins are applied to connect the tables. • A join graph is an undirected graph where: • each node Ni is a table in the database • there is a link from node Ni to node Nj if there is a join between the two tables • A join path is a sequence of joins connecting two nodes in the graph. • A join tree is a set of joins connecting two or more nodes. • A join matrix M stores the shortest join paths between any two nodes (tables).
The Query Processor:Join Discovery Results • Join Discovery in a database with a connected, acyclic join graph and a join matrix M: • there exists only one join tree for any set of tables • the joins required to connect a table set T is found by taking any Ti of Tand unioning the join paths in M[Ni,N1], M[Ni,N2], ... M[Ni,Nn] where N1,N2,..Nn are the nodes corresponding to the set of tables T. • For a cyclic join graph: • there may exist more than one join tree for a set of tables and each tree may have different semantics • can allow the user to uniquely determine join tree by graphically displaying join conditions to the user as they browse the context view
Northwind & Southstorm Query Examples • Example 1: Retrieve all order ids ([Order] Id) and customers ([Customer] Name): • SS: SELECT Order_num, Cust_name FROM Orders_tb • NW: SELECT OrderID, CompanyName FROM Orders, Customers WHERE Orders.CustomerID = Customers.CustomerID • Example 2: Retrieve all ordered products ([Order;Product] Id) and their order ids. • SS: SELECT Order_num, Item1_id, Item2_id FROM Orders_tb • NW: SELECT OrderID, ProductID FROM OrderDetails • Note: In NW, selects from two different order id mappings. In SS, result normalization is required.
Integration Example:Discussion • Important points: • system table and field names are not presented to the user who queries based on semantic names • database structure is not shown to the user • field and table mappings are automatically determined based on X-Spec information • join conditions are inserted as needed when available to join tables • different physical representations for the same concept are combined • hierarchically related concepts are combined based on their IS-A relationship in the standard dictionary
Conclusions • Automatic integration of database schema is possible by using a standard dictionary of terms and constructing semantic names for schema elements. • Integration of data sources has applications to the WWW and construction of data warehouses. • Users are able to transparently query integrated systems by concept instead of structure.
Future Work • The integration architecture is evolving with standards on XML and captures metadata information in XML documents. • The system is being tested on sample problems • We are refining a prototype of the system called Unity. • The query processor is being extended to resolve more complex queries and conflicts.