Multidatabase Querying by Context

Multidatabase Querying by Context

Outline • Introduction, Motivation, and Background • Integration architecture • Integration architecture components • standard dictionary, X-Specs, query processor • Context View as an Universal Relation • Query Processor Algorithms • field/table mapping discovery, join selection • model extensions to simplify query construction • Future work and conclusions

Introduction and Motivation • Database integration requires conflict resolution during schema integration. • There are many integration environments: • operational systems within an organization • system integration during company merger • data warehouses, Intranets, and the WWW • Our system automatically integrates relational database schema and allows the user to transparently query the system. • This presentation is on the query system.

Background • Current techniques for database interoperability have some of these problems: • require integrator to understand all databases • integration process is manual • do not hide system complexity from the user • SQL and multidatabase query languages such as MSQL require user to query integrated system by structure • not desirable when accessing numerous databases

Previous Work • Research systems: • integrating systems by logical rules (Sheth) • defining global dictionaries (Castano) • Carnot Project using the Cyc knowledge base • wrapper and mediator systems: • Information Manifold, TSIMMIS, Infomaster • Industrial systems and standards: • Metadata Interchange Specification (MDIS) • XML, BizTalk, E-commerce portals • Query Languages: • SQL, MSQL, IDL, DIRECT, SchemaSQL

Integrated Context View X-Spec Editor Standard Dictionary Integration Algorithm Query Processor and ODBC Manager Integration Architecture Client Client • Architecture Components: • 1) Integrated Context View • user’s view of integration • 2) X-Spec Editor • stores schema & metadata • 3) Standard Dictionary • terms to express semantics • 4) Integration Algorithm • combines X-Specs into integrated context view • 5) Query Processor • accepts query on view • determines data source mappings and joins • executes queries and formats results Multidatabase Layer Subtransactions X-Spec X-Spec Database Database Local Transactions

Integration Architecture • The integration architecture consists of three separate processes: • Capture process: independently extracts database schema information and metadata into a XML document called a X-Spec. • Integration process: combines X-Specs into a structurally-neutral hierarchy of database concepts called an integrated context view. • Query process: allows the user to formulate queries on the integrated view that are mapped by the query processor to structural queries and the results are integrated and formatted.

Integration Architecture:The Capture Process Relational Schema Automatic Extraction X-Spec Specification Editor DBA Lookup of terms Standard Dictionary • Capture process involves: • automatically extracting the schema information and metadata using a specification editor • assigning semantic names to each schema element (tables and fields) to capture their semantics

Architecture Discussion • The architecture automatically integrates relational schemas into a multidatabase • Desirable properties: • individual mappings - information sources integrated one-at-a-time and independently • integrated view constructed for query transparency - user queries system by semantics instead of structure • handles schema conflicts - including semantic, structural, and naming conflicts • automated integration - integrated view constructed efficiently and automatically • no wrapper or mediator software is required

Architecture Components: The Standard Dictionary • A standard dictionary (SD) provides standardized terms to capture data semantics. • Hierarchy of terms related by IS-A or Has-A links • Contains base set of common database concepts, but new concepts can be added • A SD term is a single, unambiguous semantic definition. • Several SD entries for a single English word are required if the word has multiple definitions.

Architecture Components:Using the Standard Dictionary • SD terms are used to build semantic names describing semantics of schema elements. • Semantic names have the form: • semantic name = “[“CT [[;CT] | [,CT]] “]” CN • CT = context term, CN = concept name • each CT and CN is a single term from the SD • Semantic names are included in specifications describing a database.

Architecture Components:X-Specs • Database metadata and semantic names are combined into specifications called X-Specs: • stored and transmitted using XML • contains information on a relational schema • organized into database, table, and field levels • stores semantic names to describe and integrate schema elements

Architecture Components:Integrating X-Specs • Each database to be integrated is described using a X-Spec. • Identical concepts in different databases are identified by similar semantic names. • Concepts with identical (or hierarchially related) semantic names are combined regardless of their physical representation in the individual databases. • Product of the integration algorithm is a structurally-neutral integrated view of all database concepts.

Integration Architecture:The Integration Process • Integration process involves: • automatically identifying identical concepts by matching semantic names • constructing a structurally-neutral integrated view of database concepts • resolving structural differences during query generation and submission (e.g. a concept may be represented as a table in one database and a field (attribute) in another)

Context View as a Universal Relation • Definitions: • dictionary term -unambiguous word phrase in dictionary • semantic name- combination of dictionary terms to represent schema element semantics • context - a semantic name is a context if it maps to a table • concept-a semantic name is a concept if it maps to a field • context closure - of semantic name Si denoted Si* is the set of semantic names produced by taking ordered subsets of the terms of Si = {T1, T2 , … TN} starting with T1. • Example: If Si = [A;B;C] D then Si* ={[A], [A;B], [A;B;C], [A;B;C] D}.

Context View as a Universal Relation (cont.) • Define a context view (CV) as follows: • If a semantic name Si is in CV, then for any Sj in Si*, Sj is also in CV. • For each semantic name Si in CV, there exists a set of zero or more mappings Mi that associate a schema element Ej with Si. • A semantic name Si can only occur once in the CV. • A context view (CV) is a valid Universal Relation. • Each field is assigned a semantic name which uniquely identifies its semantic connotation. • Illustrates that context view provides structural transparency similar to Universal Relation

Architecture Components: The Query Processor • The query processor: • allows the user to formulate queries on the view • translates from semantic names in the context view to structural queries (SQL) on databases • involves determining correct field and table mappings • and discovery of join conditions and join paths • retrieves query results and formats them for display to the user

The Query Processor:Determining field/table mappings • For each database (D) in the context view • For each semantic name (S) in query • If S has only one semantic name mapping in D Then • Add field mapping to query and its parent table • Else If S has multiple mappings but all in one table Then • Add each field mapping to query and the parent table • Else S has multiple mappings in more than one table Then • If any field mapping has a table already in query take that one • Else take field mapping with best semantic name match • Else take first mapping found • End If • Next • Next

The Query Processor:Constructing Join Graphs • Given a set of fields (F) and tables (T) to access, joins are applied to connect the tables. • A join graph is an undirected graph where: • each node Ni is a table in the database • there is a link from node Ni to node Nj if there is a join between the two tables • A join path is a sequence of joins connecting two nodes in the graph. • A join tree is a set of joins connecting two or more nodes. • A join matrix M stores the shortest join paths between any two nodes (tables).

The Query Processor:Join Graph for Northwind

The Query Processor:Join Discovery Results • Join Discovery in a database with a connected, acyclic join graph and a join matrix M: • there exists only one join tree for any set of tables • the joins required to connect a table set T is found by taking any Ti of Tand unioning the join paths in M[Ni,N1], M[Ni,N2], ... M[Ni,Nn] where N1,N2,..Nn are the nodes corresponding to the set of tables T. • For a cyclic join graph: • there may exist more than one join tree for a set of tables and each tree may have different semantics • can allow the user to uniquely determine join tree by graphically displaying join conditions to the user as they browse the context view

Northwind & Southstorm Integration Example

Northwind & Southstorm Integration Example (page 2)

Northwind & Southstorm Query Examples • Example 1: Retrieve all order ids ([Order] Id) and customers ([Customer] Name): • SS: SELECT Order_num, Cust_name FROM Orders_tb • NW: SELECT OrderID, CompanyName FROM Orders, Customers WHERE Orders.CustomerID = Customers.CustomerID • Example 2: Retrieve all ordered products ([Order;Product] Id) and their order ids. • SS: SELECT Order_num, Item1_id, Item2_id FROM Orders_tb • NW: SELECT OrderID, ProductID FROM OrderDetails • Note: In NW, selects from two different order id mappings. In SS, result normalization is required.

Integration Example:Discussion • Important points: • system table and field names are not presented to the user who queries based on semantic names • database structure is not shown to the user • field and table mappings are automatically determined based on X-Spec information • join conditions are inserted as needed when available to join tables • different physical representations for the same concept are combined • hierarchically related concepts are combined based on their IS-A relationship in the standard dictionary

Conclusions • Automatic integration of database schema is possible by using a standard dictionary of terms and constructing semantic names for schema elements. • Integration of data sources has applications to the WWW and construction of data warehouses. • Users are able to transparently query integrated systems by concept instead of structure.

Future Work • The integration architecture is evolving with standards on XML and captures metadata information in XML documents. • The system is being tested on sample problems • We are refining a prototype of the system called Unity. • The query processor is being extended to resolve more complex queries and conflicts.

Multidatabase Querying by Context

Multidatabase Querying by Context

Presentation Transcript

Advanced Querying

Multidatabase Transaction Management

Querying XML

Querying XML

Querying

Querying XML

Querying XML

Querying Issues

In-Network Querying

Multidatabase and Distributed Manipulations Part 1

Location Querying

Multidatabase manipulations Part 2

Data analysis by querying

Querying Dynamic and Context-Sensitive Metadata in Semantic Web

Trigger Querying

Advanced Querying

Querying XML Documents

Visual Querying By Color Perceptive Regions