400 likes | 530 Views
Discovering, Maintaining, and Using Semantics for Database Schemas. Yuan An, Ph.D. iSchool at Drexel February 23, 2009 CS Department at Villanova Univ. Background. Information integration is the problem of sharing and using data across disparate information sources.
E N D
Discovering, Maintaining, and Using Semantics for Database Schemas Yuan An, Ph.D. iSchool at Drexel February 23, 2009 CS Department at Villanova Univ.
Background • Information integration is the problem of sharing and using data across disparate information sources. • What challenges information integration is that information sources are often distributed, autonomous, and heterogeneous.
Example of Information Integration • Patient healthcare and medical data usually resides in multiple sources such as different units of hospitals, labs, clinics, personal data management devices, and even drugstores. • Example tasks for Information integration: • obtaining a holistic view of patient health status • merging data for multiple healthcare providers
A Central Issue • A key component of any solutions for information integration is the definitions of mappings between different data sources/schemas. • Despite a decade’s effort, building schema mapping remains a very difficult problem. • The difficulty lies in the requirement of understanding the meaning of the schemas being mapped.
An Example Boston Mass General Hospital DB Patient ID MedCr# Name Diagnosis Treatment Admission ID ProgID Doc Date ID Admission ID PatRef AdmDate DisDate Progress ID PatRef Symptom Coronary Treatment Pulmonary ID Policy# Enter Leave Patient ID PatRef Doc Date Desc ID Policy# Enter Leave Patient Philadelphia General Hospital DB Transfer patient medical information from Philadelphia General Hospital to Boston Mass General Hospital.
Schema Semantics Patient relate ID Treatment MedCr# Name Diagnosis Progress Doctor Patient * 1 hasID hasDate hasID hasSymptom hasPhyID hasName hasRefN hasName Treatment ID ProgID Doc Date Progress ID PatRef Symptom Admission apply * 1 ID PatRef AdmDate DisDate Progress Progress prescribe * 1 ID PatRef Symptom hasID hasSymptom Treatment ID ProgID Doc Date Coronary Treatment ID Policy# Enter Leave Patient ID PatRef Doc Date Desc
We aim at developing an automatic tool for discovering semantic mappings from database schemas to conceptual models (CM). Discovering Semantics DB conceptual model
Benefits of Discovering Semantics for Schemas Boston Mass Hospital DB Conceptual Model apply * 1 Treatment Treatment Progress Progress Doctor Doctor Patient Patient hasID hasDate hasID hasDate hasID hasSymptom hasID hasSymptom prescribe hasPhyID hasName hasPhyID hasName hasRefN hasName hasRefN hasName * 1 monitor relate recommend 1 * * 1 * * Treatment ID PatRef Doc Date Desc Philadelphia General Hospital DB Schema
We aim to develop a round-trip engineering solution for maintaining semantics under CM/schema evolution. Maintaining Semantics conceptual model DB DB’ conceptual model’
Using Semantics for Discovering Schema Mapping conceptual model 2 conceptual model 1 DB2 DB1
Roadmap Background Contributions Discovering Semantics for Schemas Maintaining Semantics for Schemas Using the Semantics for Schema Mapping Conclusions
apply * 1 prescribe * 1 monitor relate Treatment Treatment Progress Progress recommend Doctor Doctor Patient Patient hasID hasDate hasID hasDate 1 hasID hasSymptom hasID hasSymptom * * 1 hasPhyID hasName hasPhyID hasName hasRefN hasName hasRefN hasName * * Treatment ID PatRef Doc Date Desc Challenges • Much more semantics in conceptual models, e.g., • weak entities, partOf, n-ary relationships, • ISA relationships… • Need to distinguish them all from schema structures. Discover all and only the “reasonable” trees we call semantic trees that are plausible semantics of the table.
Treatment ID ProgID Doc Date Treatment ID PatRef Doc Date Desc Existing Mapping Tools • Schema matching tools: associate atomic elements in different schemas using syntactic links. • Schema mapping tools: infer query expressions for translating/exchanging data. • unable to discover expected semantics of a schema in terms of a conceptual model.
apply * 1 prescribe * 1 monitor relate recommend Treatment Treatment Progress Progress Doctor Doctor Patient Patient 1 * * 1 hasID hasDate hasID hasDate hasID hasSymptom hasID hasSymptom * * hasPhyID hasName hasPhyID hasName hasRefN hasName hasRefN hasName Treatment ID PatRef Doc Date Desc Our Solution for Discovering Semantics • The key is to discover “reasonable” links based on • analysis key and foreign key constraints in • schemas. • 2. a careful study of standard database design • princiles. We focus on deriving semantic trees connecting the individual concepts using “reasonable” links. Simple correspondences can be specified manually or by using a schema matching tool.
apply * 1 prescribe * 1 monitor relate recommend Treatment Treatment Progress Progress Doctor Doctor Patient Patient 1 * * 1 hasID hasDate hasID hasDate hasID hasSymptom hasID hasSymptom * * hasPhyID hasName hasPhyID hasName hasRefN hasName hasRefN hasName Treatment ID PatRef Doc Date Desc Discovering Semantic Trees Step1: determine a skeleton tree and its anchor by key columns. Step2: determine skeleton trees the their anchors corresponded to by f.k. columns. Step3: link the skeleton trees using shortest functional paths. Step4: link any concepts corresponding to unaccounted-for columns.
“Divide and Conquer” • A gradual manner: • ER0 – an initial subset with binary relationships. • ER1 – adding n-ary relationships • ER2 – adding ISA relationships.
“Good” Properties of the Algorithm • Guarantees only for “standard” relational schemas. • A sense of “completeness”: the algorithm finds all the “correct” semantics. • A sense of “soundness”: for multiple candidates, each one would result in an “indistinguishable” table by the standard database design methodology.
The MAPONTO Tool the mapping formulas
Evaluation Results • correct semantics for 85% of the tested tables. • maximum number of semantics candidates is 4. • Average execution time less than 1 second.
Roadmap • Background • Contributions • Discovering Semantics for Schemas • Maintaining Semantics for Schemas • Using the Semantics for Schema Mapping • Conclusions
We aim to develop a round-trip engineering solution for maintaining semantics under CM/schema evolution. Maintaining Semantics conceptual model DB DB’ conceptual model’
Challenges in Maintenance • What to maintain: how to define the property for maintenance and how to detect violation on the property. • How to capture changes to CMs and relational schemas. • How to reconcile CMs and schemas according to the intent of users.
Our Goals of Mapping Maintenance • To keep the mapping consistent: a consistent conceptual-relational mapping allows two-way legal instances translation. • To reconcile the conceptual model when the associated schema evolve. • To update the mapping when associated conceptual model evolve.
Capturing CM/Schema Changes • A user can change CM/schema in different ways: • Modifying the original model. • Generating a new model. • It is difficult to ask the user to provide a sequence of primitive actions. • It would be easier to ask the user to draw correspondences. Biosample(bsid,species,organ,…,donor_disease) Biosample(bsid,species,organ,…) tissue(bsid,donor_disease)
Reconciling CM and Schema • Analyzing the existing semantics in the original mappings in terms of skeleton trees and connections between anchors. • Discovering changes through correspondences between old and new models. • Synchronizing models and adapting the mapping accordingly.
Evaluation Methodology and Results • The same data sets for discovering conceptual-relational mappings. • Measuring efficiency and benefits in comparison to mapping reconstructing approach. • Comparing the number of mapping candidates generated by maintaining and reconstructing approaches. • The maintenance approach can save at least 80% of user effort for reaching consistent mappings. Execution time is insignificant: avg. < 1 sec.
Roadmap • Background • Contributions • Discovering Semantics for Schemas • Maintaining Semantics for Schemas • Using the Semantics for Schema Mapping • Conclusions
Using CM-Relational Mappings for Discovering Schema Mapping conceptual model 2 conceptual model 1 DB2 DB1
Progress ID PatRef Symptom Treatment ID ProgID Doc Date Treatment ID PatRef Doc Date Desc Current Solutions for Schema Mapping TARGET: SOURCE: compose Progress(ID,PatRef,Symptom) with Treatment(ID’,ProgID,Doc,Date) where Progress.ID=Treatment.ProgID → Treatment(ID’,PatRef,Doc,Date,Symptom).
Employee Employee ssn name ssn name Doctor Doctor Scientist Scientist ssn clinic ssn clinic ssn lab ssn lab eid name clinic lab Doctor employee ssn name clinic Scientist ssn name lab Using the Semantics X X • load Doctor.name and Doctor.clinic into employee as employee.name and employee.clinic in the target. • 2. load Scientist.name and Scientist.lab into employee as employee.name and employee.lab in the target. • 3. compose Doctor(ssn,name’,clinic) with • Scientist(ssn,name,lab) where they have the same ssn • → employee(z,name,clinic,lab).
Principles of the Semantic Approach • Discovering two conceptual subgraphs (CSG) that are “semantically similar” (≠ “structurally match”) and then translating the CSGs into algebraic expressions • connections between corresponding pairs of nodes are semantically similar or compatible, e.g., ISA, partOf… • maintaining desirable properties in database queries. • the principle of parsimony: smallest trees.
Evaluation Methodology • Comparison between the semantic approach and traditional approachs based on referential integrity constraint. • Manually specified mapping expressions as a “gold standard”. • Traditional “precision” and “recall” as evaluation criteria. • Data collection from a variety of domains.
Summary of the Evaluation Results • Found all the expected mappings as found by the traditional approach. • Improved precision (70% of the test cases) by eliminating suspicious pairings. • Improved recall (40% of the test cases) by considering ISA as functional relationship. • No much complicated semantics, no improvements.
Roadmap • Background • Contributions • Discovering Semantics for Schemas • Maintaining Semantics for Schemas • Using the Semantics for Schema Mapping • Conclusions
Conclusions • A novel and effective tool for discovering semantics for schemas in terms of conceptual models. • A round-trip engineering process for maintaining semantic mappings. • A semantic approach for improving schema mappings using the semantics. • A suite of tools for assisting users to discover and maintain mappings between different data representations in a variety of information integration situations.