360 likes | 469 Views
The MOMIS approach for Information Integration. Domenico Beneventano. Integration : motivation and problems. Internet has as side effect: the “fragmentation” of information in data sources of small size
E N D
The MOMIS approach for Information Integration Domenico Beneventano
Integration : motivation and problems • Internet has as side effect: the “fragmentation” of information in data sources of small size • Companies ask for an increased capability of analyzing data from heterogeneous data sources coming from legacy systems • Efficient Network technologies make it convenient to develop distributed information systems • An integration process has to take into account the problems deriving from the heterogeneity of data sources : • Platforms Heterogeneity, it involves hardware/software/DBMSs Solution: different standards, as SQL, ODBC, XML, … • Structural Heterogeneity, it involvesthe different conceptual, logic data models and the different representationsof the same concept • Semantic Heterogeneity, it involvesthe different meaning of terms in data sources (homonymy and synomymy)
Approaches for interoperability and integration • Different approaches have been proposed to guarantee interoperability and integration of data sources: • Read-only integrated Views : Mediation. The goal of the Mediator based architectures is to obtain an integrated view, read only, of the data stored in different sources.Essentially, it is necessary to build a component, the Mediator, able to build an integrated schema and to allow users to pose queries with respect to this schema, receive a unified answer, thus disregarding the heterogeneity of the different sources. • Databases which share information: Federation. This architecture implement a framework that allows database to behave as in a federation. Each source may extend its schema by including subschemas (and data) of other federated databases (Information Integrator IBM).
Wrapper/Mediator Architecture • Wrapper • extracting the schema of the data sources : local schema(there is one wrapper for each different kind of data source). • converting the local schema into the common format of the Mediator • sending a query to the data sources • exporting the local queries results to the Mediator. MEDIATOR User Wrapper Wrapper Wrapper ... File System DB1
Semantic Integration of Heterogeneous Data Global Virtual View(GVV) Query Mapping LocalSchema LocalSchema LocalSchema Source Source Source • The Local Schema of each source is available (source wrapping) • Data integration provides a Global Virtual View (GVV) that • is a conceptualization describing the involved sources. • allows a user to raise a query and to receive a single unified answer
Main problems in data integration [Lenzerini 2003] • (Automatic) source wrapping • How to construct the Global Virtual View • How to discover interschema properties among the sources and mappings between the sources and the Global Virtual View • How to model the mappings between the sources and the Global Virtual View • How to process updates expressed on the Global Virtual View, and updates expressed on the sources (Schema Evolution) • How to answer queries expressed on the Global Virtual View (Global Query Management) • Query optimization • Data extraction, cleaning and reconciliation (Extensional Integration)
MOMIS • MOMIS (Mediator envirOnment for Multiple Information Sources) is a framework to perform information extraction and integration of heterogeneous data sources • Semantic Integration of Information • A common data model ODLI3 derived from ODL-ODMG : standard languages for Object DataBasesI3 (Intelligent Information Integration) is a project of the ARPA (Advanced Research Project Agency) with the objective to define the reference architecture for automatic integration of heterogeneous sources • Tool-supported techniques to construct the Global Virtual View • Local Schema Annotation w.r.t. a common lexical ontology (WordNet) • Semi-automatic generation of relationships between local schemata • Clustering techniques • Semi-automatic generation of mappings between the GVV and sources
MOMIS: Overview of the GVV-generation process COMMON THESAURUSGENERATION GVV GENERATION GLOBAL CLASSES SYNSET# SCHEMA DERIVEDRELATIONSHIPS SYNSET4 LEXICON DERIVEDRELATIONSHIPS SYNSET1 SYNSET2 Common Thesaurus clustersgeneration USER SUPPLIEDRELATIONSHIPS MAPPING TABLES INFERRED RELATIONSHIPS WRAPPING ODLI3LOCAL SCHEMA 1 <XML> <DATA> Semi-Structured Source … ODLI3 LOCAL SCHEMA N RDB Structured source AUTOMATIC/ MANUALANNOTATION SEMI-AUTOMATIC ANNOTATION
Wrapping: from relational to object UNI (University Source) : relational source (Relational DB) ODLI3 schema Person(Name, address) Professor(Name, belongs_to,rank) FK: Name REF. Person FK: belongs_to REF. Division Student (Name, year,e-mail) FK: Name REF. Person Takes(Name, CourseName) FK: Name REF. Student FK: CourseName REF. Course class Person(key Name) { attribute string Name; attribute string Name; } class Professor : Person{ attribute Division belongs_to; attribute string rank; } class Student: Person { attribute set <Course> takes; attribute string rank; } class Course { … } • Professor IS-A Person • Properties inheritance
Example of object source CS (Computer Science Source) : object source (Object DB) Interface Researcher { attribute string name; attribute string e-mail; attribute set <Article> articles; } Interface Teaching { attribute string denomination; attribute string description; } Interface Class : Teaching { attribute string name; attribute integer year; attribute string specification; attribute Researcher professor; } … • complex object attributes: articles is a set of Article • object attributes: professor is a researcher
Wrapping of Web: from pages to data Wrapper • Road Runner Project (Università di Roma Tre & della Basilicata)http://www.dia.uniroma3.it/db/roadRunner/ • A large number of Web sites contain now highly structured regions • Large web sites are generated automatically (statically or dynamically) • Regularities occur at the page level, and at the site level as well • We aim at leveraging these regularities in order to face the issues of extracting fine grain data at a large scale xml documents or relational DB • Writing and maintaining wrappers is a labor intensive (and expensive) task
Wrapping of Web pages • Roadrunner: a tool to automatically generate wrappers for web pages • Target: Data-intensive Web sites • HTML pages generated by scripts that extract data from a large database • Page class: collection of pages generated by the same script => similar structure • Main Idea: exploit similarities and differences to infer a data extraction program (wrapper) Input sample pages Pages similar in structure to those of the sample set Roadrunner Wrapper generator Output data Output wrapper
Common Thesaurus • Set of terminological (intensional) relationships between terms (classes and attributes names) to express both intra-schema and inter-schema knowledge: • < Ti SYN Tj > Synonymy • < Ti BT Tj > Broader Terms (BT) : hypernymy • < Ti NT Tj > (Narrower Terms - NT) hyponymy • < Ti RT Tj > (Related Terms - RT) positive association or holonomy • Common Thesaurus Generation: • lexicon derived relationships • schema derived relationships • designer supplied relationships • inferred relationships
Schema-derived Relationships • automatic extraction of intra schema relationships from each schema separately • BT from IS-A Relationships < CS.TEACHING BT CS.CLASS> • BT from relational foreign keys among primary keys < UNI.PERSON BT UNI.PROFESSOR> • RT from object attributes < CS.CLASS RT CS.RESEARCH> • RT from relational foreign keys < UNI.PROFESSOR RT UNI.DIVISION>
Annotation and Lexicon-derived Relationships NT Hyponymy <UNI.COURSE SYN CS.CLASS> <CS.COURSENT UNI.TEACHING> Lexicon-derived relationships: • LOCAL SOURCE ANNOTATION : To assign meanings of class and attribute names w.r.t. a common lexical ontology (WordNet) • Lexicon relationships among terms of the sources can be derived on the basis of the semantic relationships between synset in WordNet (Hyponymy / Hypernymy, Meronymy, Correlation (between synsets having the same Hypernym), …) :
Extending WordNet: WordNet Editor • If a class or attribute name has no correspondent in WordNet, the designer may add a new meaning and proper relationships to the existing meanings. • The designer may add a new meaning (for an existing word-form or for a new one) by: • writing the gloss explicitly, or • using an existing synset chosen among a list of candidates obtained by an explicit search (using one or more keywords) or by exploiting similarity search techniques. • The designer may add relationships for the new synset • Related synsets are obtained by an explicit search (using one or more keywords) or by exploiting similarity search techniques.
Inferred Relationships • Relationships derived as a logical consequence from a set of other Relationships. As a simple example, from <Researcher RT Department> and < Researcher BT Professor> • we can derive that : <Professor RT Department> • The inferences are based on Description Logics techniques • Description Logics (DLs), describe knowledge in terms of concepts and their roles. • DLs provide a means of describing structured knowledge in a way that one can reason about it efficiently (i.e., infer new knowledge). • DLs are subsets of the First-Order Logic, providing decidability and computationally feasible reasoning at the cost of expressiveness. • ODB-Tools : DLs techniques for Object DataBase(www.dbgroup.unimo.it/ODB-Tools.html):
Global Virtual View Generation Affinity evaluation and Clusters Generation Clusters generation Common Thesaurus Clusters of Classes Affinity evaluation • Affinity evaluation among classes: • Name Affinity : computed by exploiting relationships in the Common Thesaurus; a weight (W) is assigned to each kind of relationships, with WSYN >= WBT >= WRT • Structural Affinity: measures the affinity of class attributes • Clusters generation : Classes with affinity are grouped together in clusters by employing a hierarchical clustering technique. • ARTEMIS tools (Università di Milano) https://islab.dico.unimi.it/artemis/.
Affinity Matrix School_Member Research_Staff Univ_Student Professor CS_Person Student Course Section ... Research_Staff - 0.54 0.45 0.5 0.42 0.43 0.36 0.39 CS_Person 0.68 0.56 0.2 0.2 - - 0.6 0.68 - School_Member - - 0.45 0.65 0.6 0.2 0.2 - - Professor - - 0.62 0.43 0.25 0.25 - - - Student - 0.51 0.25 0.25 - - - - Univ_Student - - - 0.2 0.2 ...
Cluster Generation 0.25 0.375 CL5 Location 0.39 0.6 0.39 CL3 Division Department 0.54 0.66 CL4 Room 0.6 Research_Staff Section Course 0.65 Univ_Student CL2 0.68 School_Member 0.68 Student CS_Person Professor CL1 Affinity Matrix Clustering Algorithm Clustering Algorithm: (Hierarchical clustering techniques - Heveritt 74)
Global Virtual View Generation Global Classes and Mapping Table Generation Global Classes and Attributes generation Mapping Table generation Global Virtual View Common Thesaurus & Clusters • Global Classes generation :A global class C=(L,GA) is generated for each cluster : • L are the local classes of the cluster • GA are the global attributes of C • Union of the local attributes • Fusion of “similar attributes” (by using the Common Thesaurus) • Mapping generation :For each global class C=(L,GA), a Mapping Table (MT) is generated, to represent the mappings between global and local attributes • MT is a table GAXL : An element MT[GA][L] represents the attributes of the local class L mapped into the global attribute GA.
GVV and MT generation : example • Cluster C= {UNI.Course, CS.Teaching, CS.Class} • Mapping Table of C • MT generation : Since CS.denomination NT UNI.Name, these local attributes correspond to the same global attribute GA • GVV annotation : • the name and the meaning of the class C corresponds to the name and the meaning of CS.Teaching (the broadest class) • the name and the meaning of GA corresponds to the name and the meaning of CS.description
How to model the mappings between the sources and the GVV? GVV : Teaching(Name,Description,Year,Period,Professor) Sources :CS.Teaching(Denomination,Description)UNI.Course(Name,Year,Period)CS.Class(Name,Specification,Year,Professor) • Is the GVV defined in terms of the sources? • Approach global-schema-centric, or global-as-view (GAV) • Are the sources defined in terms of the GVV? • Approach source-centric, or local-as-view (LAV) • A mixed approach? Approach called GLAV
Local-as-View • Local-as-View (LAV) mappings: • associated to source classes we have views over the GVV GVV : Teaching(Name,Description,Year,Period,Professor) Sources :CS.Teaching(Denomination,Description)UNI.Course(Name,Year,Period)CS.Class(Name,Specification,Year,Professor) LAV mappings :CS.Teaching --> SELECT Name,Description FROM Teaching UNI.Course --> SELECT Name,Year,Period FROM Teaching CS.Class --> SELECT Name,Description Year,Professor FROM Teaching
Global-as-View • Global-as-View (LAV) mappings: • associated to classes in the GVV we have views over the sources GVV : Teaching(Name,Description,Year,Period,Professor) Sources :CS.Teaching(Denomination,Description)UNI.Course(Name,Year,Period)CS.Class(Name,Specification,Year,Professor) GAV mappings :Teaching --> SELECT f1(CS.Teaching.Denomination,…) AS Name, f2(…) As Description, …FROM CS. Teaching FULL JOIN UNI.Course ON (…)FULL JOIN CS.Class ON (…)
Global-as-View: the MOMIS approach • Global-as-View (GAV) mappings: for each global class C of the GVV we must define a conjunctive queryQC over the local classes of C. • GAV mappings: the MOMIS approachStarting from the Mapping Table of C, the integration designer, supported by the Ontology Builder graphical interface, can implicitly defineQC by: • using and extending the Mapping Table with • Data Conversion Functions from local to global attributes • Join Conditions among pairs of local classes belonging to C • Resolution Functions for global attributes to solve data conflicts of local attribute values. • using and extending the Full Disjunction operator
Building the Mappings: an example Data Conversion Functions Join Conditions StringConc (REGION,STATE) Join Attribute Resolution Functions Precedence(L1,L2) Select COMPANY_ID, precedence(T_L1.ADDRESS, T_L2.ADRESS) as Address, T_L2.SUBCONTRACTOR, … Full Disjunction from T_L1 full join T_L2 on (T_L1.COMPANY_ID = T_L2.COMPANY_ID)
Join Conditions and Full Disjunction • Object Identification: Merging data from different sources requires different representations of the same real world object to be identified. • Join Conditions: To identify instances of the same object and fuse them among pairs of local classes belonging to the same global class. As an example, for the global class Company the designer can define JC(L1.Company,L2.Company) :L1.COMPANY_ID = L2.COMPANY_ID • Full Disjunction (FD) “computing the natural outer-join of many relations preserving all possible connections among facts” • Given a global class C composed of L1,L2, ..., Ln we consider FD(T(L1), T(L2), . . . , T(Ln)) computed on the basis of the Join Conditions where T(L) denotes L transformed by the Data Conversion Function, i.e., the full disjunction operator is applied after data conversion.
Full Disjunction: an example L1=L1.Company L2=L2.Company FD(L1,L2) : select S(L1) union S(L2) from L1 outer join L2 on JC(L1,L2) whereJC(L1,L2) = L1.COMPANY_ID = L2.COMPANY_ID
Resolution Functions • The fusion of data coming from different sources has to take into account the problem of inconsistent information among sources. • In the context of MOMIS we adopt the Resolution Function. • A Resolution Function may be defined for each global attribute mapping onto local attributes coming from several local sources, in order to solve data conflicts due to different local attribute values. • Homogeneous Attributes : If there are no data conflicts for a global attribute mapped onto more than one source • As an example, in GVV.Company, we define all the global attributes as Homogeneous Attributes except for Address where we used a precedence function: L1.ADDRESS has a higher precedence than L2.ADDRESS
Resolution Functions Application of the resolution functions
FD Computation with more that 2 classes L1 JC(L1,L3) JC(L1,L2) JC(L2,L3) L3 L2 • [Rajarama, Ullman - PODS 1996] : There is a outerjoin sequence producing FD if and only if the set of relation forms a connected, acyclic hypergraph. • A Global Class C with more than 2 local classes is a cyclic hypergraph we need a new method Example with n = 3 : • The computation of FD is performed assuming: • each L contains a key, • all the join conditions are on key attributes, • all the join attributes are mapped into the same set of global attributes (K). Then, FD can be computed as:L1 full join L2 on JC(L1,L2)) full join L3 on (JC(L1,L3) OR JC(L2,L3)
Global Query Management • The querying problem: How to answer queries expressedon the GVV (global queries)? • Query rewriting : to rewrite a global query as an equivalent set of queries expressed on the local schemata (local queries). • GAV approach: the query is processed by means of unfolding :by expanding a query on a global class C of the GVV according to the definition of QC • Fusion and Reconcilation of the local answers into the global answer • Object Fusion & Resolution Functions • Inconsistencies between sources • Query Optimization
Query unfolding example Global Class: Company = { L1.Company, L2.Company} Global Query Q: SELECT COMPANY_ID, REGION, ADDRESS FROM company WHERE CAPITAL_STOCK > 50 AND REGION LIKE '*VENETO*' AND SUBCONTR LIKE ’yes’ LQ1 SELECT COMPANY_ID, REGION,ADDRESS, FROM L1.company WHERE (REGION like '*VENETO*' and CAPITAL_STOCK > 50) Local queries LQ2 SELECT COMPANY_ID, REGION,STATE,ADDRESS, FROM L2.company WHERE ( strconc(REGION,STATE) like '*VENETO*' and SUBCONTR like 'yes')
Query unfolding example Q: SELECT REGION, ADDRESS FROM company WHERE CAPITAL_STOCK > 50 AND REGION LIKE '*VENETO*' AND SUBCONTR LIKE ’yes’ Full disjunction computation: FD : Select * from T_LQ1 full outer join T_LQ2 on (T_LQ1.COMPANY_ID = T_LQ2.COMPANY_ID) Residual predicate: RP: SELECT * FROM FD WHERE CAPITAL_STOCK > 50 AND SUBCONTR LIKE ’yes’ Query Answer Select COMPANY_ID, REGION, precedence(T_LQ1.ADDRESS, T_LQ2.ADDRESS) AS ADDRESS from RP