Ontology-based Conceptual Design of ETL Processes for both Structured and Semi-structured Data
Outline • Introduction • Graph-based Datastore Representation • Application Ontology Construction and Representation • Datastore Annotation • ETL Transformations • Conclusions
Problem description • Conceptual design of ETL processes • is a critical task performed at the early stages of a DW project • describes the integration of data from heterogeneous sources into the Data Warehouse • Two main goals • specify inter-schema mappings • identify appropriate transformations
Motivation • The problem of heterogeneity in data sources • structural heterogeneity • data stored under different schemata • semantic heterogeneity • different naming conventions • e.g., homonyms, synonyms • different representation formats • e.g., units of measurement, currencies, encodings • different ranges of values
Overview of our approach • Key idea • an ontology-based approach to facilitate the conceptual design of an ETL scenario • An ontology • is a “formal, explicit specification of a shared conceptualization” • describes the knowledge in a domain in terms of classes, properties, and relationships between them • machine processable • formal semantics • reasoning mechanisms • The Web Ontology Language (OWL) is used as the language for the ontology • W3C recommendation • based on Description Logics
Overview of our approach • Method • Construct a graph representation for each datastore • datastore graph • Construct a suitable application ontology • ontology graph • Annotate the datastores • Establish mappings between the datastore graph and the ontology graph • Apply reasoning techniques to • select relevant sources • identify required transformations
Datastore schema • The schema SD of a datastore comprises • elements containing the actual data • elements containing or referring to other elements
Datastore graph • Each element e defined in the schema SD is represented by a node ve ∈ VD. • Each containment relationship between elements e1, e2 is represented by an edge (v1, v2). • Each reference from element e1 to element e2 is represented by an edge (v1, v2). • Each edge is assigned a label of the form [min, max] denoting the corresponding cardinality. • Elements containing the actual data are represented by leaf nodes.
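The construction rules above can be sketched in a few lines of Python. This is only an illustrative sketch: the class name `DatastoreGraph`, the element names (`Customer`, `Order`, ...), and the use of `None` for an unbounded maximum cardinality are assumptions made for the example, not part of the original slides.

```python
from dataclasses import dataclass, field


@dataclass
class DatastoreGraph:
    """Hypothetical container: nodes are schema elements, edges carry [min, max]."""
    nodes: set = field(default_factory=set)
    # edges[(source, target)] = (kind, min, max); max=None stands for "unbounded"
    edges: dict = field(default_factory=dict)

    def add_edge(self, src, dst, kind, min_card, max_card):
        """Add a containment or reference edge labeled with its cardinality."""
        if kind not in ("containment", "reference"):
            raise ValueError(f"unknown edge kind: {kind}")
        self.nodes.update((src, dst))
        self.edges[(src, dst)] = (kind, min_card, max_card)

    def leaves(self):
        """Nodes without outgoing edges, i.e. elements holding the actual data."""
        sources = {src for (src, _dst) in self.edges}
        return self.nodes - sources


# Invented example schema, for illustration only.
g = DatastoreGraph()
g.add_edge("Customer", "name", "containment", 1, 1)
g.add_edge("Customer", "Order", "containment", 0, None)
g.add_edge("Order", "amount", "containment", 1, 1)
g.add_edge("Order", "Product", "reference", 1, 1)
g.add_edge("Product", "price", "containment", 1, 1)

print(sorted(g.leaves()))  # leaf nodes of the example graph
```

Keeping the edge kind next to the cardinality label is a design choice of this sketch; the slides only require that both containment and reference relationships become edges.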
Reference example (cont’d) • Datastore graphs
Application Ontology • A suitable application ontology is constructed to model • the concepts of the domain • the relationships between those concepts • the attributes characterizing each concept • the different representation formats and (ranges of) values for each attribute
Application Ontology • The application ontology comprises • a set of classes C = CC ∪ CT ∪ CG • CC : classes representing domain concepts • CT : classes representing value types • CG : classes representing aggregate functions • a set of properties P containing • PP : properties representing attributes of concepts or relationships between concepts • property: convertsTo • property: aggregates • property: groups
Ontology Graph • A graph representation specified for the ontology • Graph nodes represent classes in the ontology • Graph edges represent properties in the ontology • Different symbols are used for the different types of classes and properties
Reference example (cont’d) • The application ontology graph
Datastore annotation • The semantic annotation of each datastore consists in establishing the appropriate mappings between the datastore graph GS and the ontology graph GO. • Each internal node of GS may be mapped to one concept-node of GO. • A leaf node of GS may be mapped to one or more nodes of GO of the following types: • type-node • format-node • range-node • aggregated-node • A node may have zero or more mappings. • Mappings are represented as node labels.
Datastore annotation • A defined class is created in the ontology for each internal labeled node of the datastore graph. • The definition for a node is constructed based on its neighbor labeled nodes. • A neighbor labeled node of n is any node n′ such that: • n′ is labeled • there is a path p in the datastore graph from node n to node n′ • p contains no other labeled nodes, except n and n′
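The neighbor-labeled-node condition can be read as a graph traversal that stops as soon as it reaches a labeled node. The following is a minimal sketch under that reading; the function name, the adjacency-list encoding, and the example graph are assumptions for illustration.

```python
from collections import deque


def neighbor_labeled_nodes(adjacency, labels, start):
    """Labeled nodes reachable from `start` along a directed path
    that contains no other labeled node in between."""
    found = set()
    visited = {start}
    queue = deque(adjacency.get(start, []))
    while queue:
        node = queue.popleft()
        if node in visited:
            continue
        visited.add(node)
        if labels.get(node):
            # Labeled node: record it and do not traverse past it.
            found.add(node)
        else:
            # Unlabeled node: keep following its outgoing edges.
            queue.extend(adjacency.get(node, []))
    return found


# Invented example: "Address" carries no mapping, every other node does.
adjacency = {
    "Customer": ["name", "Address", "Order"],
    "Address": ["city"],
    "Order": ["amount"],
}
labels = {
    "Customer": ["Client"],
    "name": ["Name"],
    "city": ["City"],
    "Order": ["Purchase"],
    "amount": ["Amount"],
}

print(sorted(neighbor_labeled_nodes(adjacency, labels, "Customer")))
```

In this example `city` is a neighbor labeled node of `Customer` because the only intermediate node, `Address`, is unlabeled, while `amount` is not, because the path to it passes through the labeled node `Order`.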
Reference example (cont’d) • Datastore mappings
Reference example (cont’d) • Datastore definitions
ETL Transformations • Generic types of ETL transformations
Generating ETL transformations • Two main steps • select relevant sources to populate each DW element • identify required data transformations between the sources and the DW
Generating ETL transformations • Selecting relevant sources • a source node nS, mapped to class cS • a target node nT, mapped to class cT • nS is provider for nT, if • cS and cT have a common superclass • ensures that the integrated data records have the same semantics • cS and cT are not disjoint • prevents data integration between datastores with conflicting constraints
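The two provider conditions can be illustrated with a small sketch over an explicitly stated class hierarchy. Several simplifications are assumed here: a real setup would delegate subsumption and disjointness checks to an OWL reasoner, the hierarchy is given as a plain dictionary, a class counts as a superclass of itself, and no universal top class is included (otherwise the first condition would hold trivially). All names are hypothetical.

```python
def ancestors(superclasses, cls):
    """All classes subsuming `cls`, including `cls` itself."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(superclasses.get(current, ()))
    return seen


def is_provider(superclasses, disjoint_pairs, source_cls, target_cls):
    """True if the two classes share a superclass and are not known to be disjoint."""
    anc_s = ancestors(superclasses, source_cls)
    anc_t = ancestors(superclasses, target_cls)
    has_common_superclass = bool(anc_s & anc_t)
    # Declared disjointness is inherited by subclasses, so check all ancestor pairs.
    known_disjoint = any(
        frozenset((a, b)) in disjoint_pairs for a in anc_s for b in anc_t
    )
    return has_common_superclass and not known_disjoint


# Invented example ontology: class -> direct superclasses.
superclasses = {
    "Order": set(),
    "Customer": set(),
    "DomesticOrder": {"Order"},
    "ForeignOrder": {"Order"},
    "LargeOrder": {"Order"},
}
disjoint_pairs = {frozenset(("DomesticOrder", "ForeignOrder"))}

print(is_provider(superclasses, disjoint_pairs, "LargeOrder", "DomesticOrder"))
print(is_provider(superclasses, disjoint_pairs, "ForeignOrder", "DomesticOrder"))
print(is_provider(superclasses, disjoint_pairs, "Customer", "Order"))
```

Here a node mapped to `LargeOrder` qualifies as a provider for a node mapped to `DomesticOrder`, whereas `ForeignOrder` is rejected by the disjointness check and `Customer` by the missing common superclass.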
Generating ETL transformations • Identifying data transformations (I) • a RETRIEVE operation for each provider node n • a MERGE operation to combine data from several provider nodes • an EXTRACT operation to extract a portion of data from a provider node
Generating ETL transformations • Identifying data transformations (II) • if cS ≡ cT or cS ⊏ cT, no transformations are required • if cT ⊏ cS, AGGREGATE, FILTER and/or MINCARD/MAXCARD operations are required • otherwise, the operations of the previous case plus CONVERT operations are required
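The three cases above amount to a decision on how the source class and the target class relate in the class hierarchy. The sketch below returns the candidate operation types for each case. It assumes an explicitly stated hierarchy instead of a reasoner and treats equivalence as identity of class names; which of the candidate operations is actually needed depends on the class definitions, which this sketch does not inspect. All names are hypothetical.

```python
def ancestors(superclasses, cls):
    """All classes subsuming `cls`, including `cls` itself."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(superclasses.get(current, ()))
    return seen


def candidate_transformations(superclasses, source_cls, target_cls):
    """Candidate operation types between a provider class and a target class."""
    narrowing = ["AGGREGATE", "FILTER", "MINCARD", "MAXCARD"]
    if target_cls in ancestors(superclasses, source_cls):
        # Source is equivalent to or more specific than the target: nothing to do.
        return []
    if source_cls in ancestors(superclasses, target_cls):
        # Target is more specific than the source: narrow the source data.
        return narrowing
    # Neither class subsumes the other: narrowing plus conversion.
    return narrowing + ["CONVERT"]


# Invented example hierarchy: class -> direct superclasses.
superclasses = {
    "Order": set(),
    "ForeignOrder": {"Order"},
    "LargeOrder": {"Order"},
}

print(candidate_transformations(superclasses, "ForeignOrder", "Order"))
print(candidate_transformations(superclasses, "Order", "LargeOrder"))
print(candidate_transformations(superclasses, "ForeignOrder", "LargeOrder"))
```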
Generating ETL transformations • Identifying data transformations (III) • a JOIN operation to combine recordsets from nodes whose corresponding classes are related by a property. • a UNION operation, followed by a DD operation, to combine recordsets from nodes whose corresponding classes have a common superclass. • a STORE operation to denote loading of data into the target datastore.
Reference example (cont’d) • Provider nodes and transformations for DS2
Conclusions • A graph-based representation, datastore graph, as a common model for the datastores. • A suitable application ontology and a corresponding graph representation, ontology graph. • Datastore annotation through mappings from the datastore graph to the ontology graph. • Reasoning on the mappings to identify relevant sources and required transformations.
Current and Future Work • Semi-automatic construction of the application ontology • Semi-automatic annotation of the datastores • Executable workflow • Evaluation on real-world ETL scenarios • Maintenance/adaptation of the ETL workflow