240 likes | 389 Views
Integrating data sources on the World-Wide Web. Ramon Lawrence and Ken Barker U. of Manitoba, U. of Calgary umlawren,barker@cs.umanitoba.ca. Introduction. Integration of data is required when accessing multiple databases within an organization or on the WWW.
E N D
Integrating data sources on the World-Wide Web Ramon Lawrence and Ken Barker U. of Manitoba, U. of Calgary umlawren,barker@cs.umanitoba.ca
Introduction • Integration of data is required when accessing multiple databases within an organization or on the WWW. • Our focus is automatically combining database schema using schema integration. • Schema integration requires knowledge of data semantics and use of metadata.
Motivation • Organizations have several database systems which must interoperate. • Users often access multiple Web databases whose knowledge must be integrated and presented in a useful form. • Data warehouses and OLAP systems require data semantics to be understood and data to be cleansed and summarized.
Background • Schema integration involves combining diverse database schema into an integrated view by resolving conflicts. • Schema conflicts include naming, structural, and semantic conflicts. • Schema integration is required for database interoperability, but it is currently a manual process.
Previous Work • Research systems: • integrating systems by logical rules (Sheth) • defining global dictionaries (Castano) • Carnot Project using the Cyc knowledge base • Industrial systems and standards: • Metadata Interchange Specification (MDIS) • XML, BizTalk, E-commerce portals
Architecture Components: The Global Dictionary • A global dictionary (GD) provides standardized terms to capture data semantics. • Hierarchy of terms related by IS-A or Has-A links • Contains base set of common database concepts, but new concepts can be added • A GD term is a single, unambiguous semantic definition. • Several GD entries for a single English word are required if the word has multiple definitions.
Architecture Components:Using the Global Dictionary • GD terms are used to build semantic names to describe the semantics of schema elements. • Semantic names have the form: • semantic name = “[“CT [[;CT] | [,CT]] “]” CN • CT = context term, CN = concept name • each CT and CN is a single term from the GD • Semantic names are included in RIM specifications describing a data source.
Architecture Components:The Relational Integration Model • Database metadata and semantic names are combined into Relational Integration Model (RIM) Specifications (RIM Specs) • contains information on a relational schema • organized into database, table, and field levels • stores semantic names to describe and integrate schema elements
Architecture Components:Integrating RIM Specs • Each database to be integrated is described using a RIM specification. • Identical concepts in different databases are identified by similar semantic names. • Concepts with identical (or hierarchially related) semantic names are combined regardless of their physical representation in the individual databases.
Integration Architecture • Our integration architecture consists of two separate phases: • capture process: RIM specs are constructed for each data source independently • integration process: RIM specs are combined using the integration algorithm which matches semantic names using the global dictionary
Integration Architecture:The Capture Process • Capture process involves: • automatically extracting the schema information and metadata using a specification editor • assigning semantic names to each schema element (tables and fields) to capture their semantics
Integration Architecture:The Capture Process Relational Schema Automatic Extraction RIM Spec Specification Editor DBA Lookup of terms Global Dictionary
Integration Architecture:The Integration Process • Integration process involves: • automatically identifying identical concepts by matching semantic names • constructing a global view of database concepts consisting of a hierarchy of concept terms • resolving structural differences during query generation and submission (e.g. a concept may be represented as a table in one database and a field (attribute) in another)
Integration Architecture:The Integration Process …………. Client Client Integration Site Subtransactions RIM spec RIM spec …….. RDBMS RDBMS
Integration Architecture Benefits • The benefits of the two phase architecture are: • Dynamic integration: schemas integrated as needed • RIM Specs are constructed only once and independent of each other • Automatic conflict resolution by integrating based on semantic name rather than physical structure • Users are isolated from system names and organization by querying through a global view using semantic names for concepts
Integration Example • Two claims databases to be integrated: • ABC Company: Claims_tb(claim_id, claimant, net_amount, paid_amount) • XYZ Company: T_claims(id, customer, claim_amt), T_payments(cid, pid, amount) • First step is to construct RIM specs for each database.
Integration Example:Integrated View • Global view after integration: • [Claim] • Id • Net amount • [Customer] • name • [Payment] • id • amount
Integration Example:Discussion • Important points: • system and field names are not presented to the user who queries based on semantic names • database structure is not shown to the user • different physical representations for the same concept are combined (e.g. payment (attribute) in ABC with payment table in XYZ database) • hierarchially related concepts (customer vs. claimant) are combined based on their IS-A relationship in the global dictionary
Applications to the WWW • Integrating diverse data sources is involved in constructing a data warehouse and other operational systems. • The WWW is a diverse organizations of databases which users access. • Automatically integrating web data sources by a browser or portal reduces query complexity and integration of results for the user.
Conclusions • Automatic integration of database schema is possible by using a global dictionary of terms and constructing semantic names for schema elements. • Integration of data sources has applications to the WWW and construction of data warehouses.
Important Changes • The integration architecture is constantly being refined. Some notable differences in this presentation versus the paper: • Our integration system uses XML to represent a RIM spec which is renamed as a X-Spec. • An integration site is used as a central portal for integration and management. • No longer using semantic distance calculations between terms. • Format of semantic name has been simplified.
Future Work • The integration architecture is involving with standards on XML and now captures metadata information in XML documents. • The system is being tested on sample problems, and a query mechanism is work-in-progress. • We are refining a prototype of the system called Unity.