WG 3: Data Integration
Louiqa, Alejandro, Vincent, Monica, Stefan, Frank, Gerd, Felix
Data Integration
• Architectures and application domains
• Quality criteria and issues specific to integration
• Compared 2 data sources based on quality criteria
• Characterize data integration (2 sources) based on quality criteria
Architectures and Domains
• Architectures
  • Materialized integration (data warehouse)
  • Virtual integration (mediator/wrapper)
  • The materialized solution provides more opportunities for improving quality offline.
• Application domains
  • Bibliographic
Quality Criteria
• Duplicates
  • Inherited (already in the sources)
  • Introduced through integration
  • Synonyms and homonyms
• Granularity
  • Higher granularity typically means a higher level of detail and more expressive queries.
  • The result is potentially limited to the lowest common denominator.
• Completeness
  • Object cardinality
  • Attribute cardinality
  • CWA (closed-world assumption) not applicable (open to discussion)
  • Content description of and knowledge about a source
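The two completeness measures can be sketched in code. The slides do not define them formally, so the definitions here are assumptions: object cardinality is taken as the share of known objects a source covers, and attribute cardinality as the share of attribute slots holding a non-null value. Because the closed-world assumption is not applicable, the reference set is simply the union of the sources at hand, not the true universe. All names and sample records are hypothetical.

```python
from typing import Optional

Record = dict[str, Optional[str]]


def object_cardinality(source: dict[str, Record], reference_ids: set[str]) -> float:
    """Fraction of the reference objects that the source covers (assumed definition)."""
    if not reference_ids:
        return 0.0
    return len(source.keys() & reference_ids) / len(reference_ids)


def attribute_cardinality(source: dict[str, Record], attributes: list[str]) -> float:
    """Fraction of attribute slots with a non-null value (assumed definition)."""
    slots = len(source) * len(attributes)
    if slots == 0:
        return 0.0
    filled = sum(
        1 for rec in source.values() for a in attributes if rec.get(a) is not None
    )
    return filled / slots


# Hypothetical sample data: two small bibliographic sources keyed by object id.
source_a = {
    "p1": {"title": "T1", "year": "2001", "venue": None},
    "p2": {"title": "T2", "year": None, "venue": None},
}
source_b = {
    "p2": {"title": "T2", "year": "2002", "venue": "V"},
    "p3": {"title": "T3", "year": "2003", "venue": "V"},
}

# Open world: the union of the sources is only the best available reference.
reference = source_a.keys() | source_b.keys()
attrs = ["title", "year", "venue"]

print(object_cardinality(source_a, reference))
print(attribute_cardinality(source_a, attrs))
```

Both functions return values in [0, 1], which makes two sources directly comparable on the same attribute list.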
Quality Criteria 2
• Ontologies
  • Conceptual schema is a sub-concept of ontology
  • Databases are described by schemata
  • Integration is possible without ontologies, but easier and better with them.
  • Quality of ontologies versus quality of data
    • High quality of schema => high quality of data
    • High quality of schema => ease of integration
Quality Criteria 3
• Currency
  • Timeliness, freshness, up-to-dateness, …
  • Materialization => reduced currency
  • Virtual integration => increased currency
  • Data warehouse: out-of-date data can be interpreted as missing data.
• Response time (time, cost, delay) (later)
  • Greater impact on virtual integration
  • Trade-offs and user interests
• Availability
  • DW increases availability
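A minimal sketch of the currency point, under assumed names and dates: a warehouse copy is at least as old as its last load, whereas a mediator fetches from the source at query time. The `max_age` threshold and the rule of returning `None` for stale values are illustrative assumptions, one possible reading of "out-of-date data can be interpreted as missing data".

```python
from datetime import datetime, timedelta
from typing import Optional


def staleness(last_refresh: datetime, query_time: datetime) -> timedelta:
    """Age of a value at query time, measured from its last refresh."""
    return query_time - last_refresh


def value_or_missing(
    value: str, last_refresh: datetime, query_time: datetime, max_age: timedelta
) -> Optional[str]:
    """Treat a value older than max_age as missing (assumed policy)."""
    if staleness(last_refresh, query_time) > max_age:
        return None
    return value


# Hypothetical timestamps, chosen only for illustration.
query_time = datetime(2024, 1, 10, 12, 0)
warehouse_refresh = datetime(2024, 1, 3, 0, 0)  # materialized: last load a week ago
virtual_fetch = query_time                      # virtual: fetched at query time
max_age = timedelta(days=1)

print(value_or_missing("123 Main St", warehouse_refresh, query_time, max_age))
print(value_or_missing("123 Main St", virtual_fetch, query_time, max_age))
```

A suitable `max_age` depends on the application; an address tolerates far more staleness than a stock quote.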
Compare 2 sources based on DQ
• Citeseer
  • Domain: All online CS papers
• DBLP
  • Domain: All DB, LP, and algorithmic papers
• Duplicates
  • DBLP eliminated more duplicates
  • DBLP had higher granularity, e.g., pubtype
  • Citeseer is automated => more duplicates
• Completeness
  • Citeseer has higher object and attribute cardinality
  • Citeseer: 600,000 publications
  • DBLP: 420,000 publications
• Ontologies
  • Both have schemata
• Currency …
  • DBLP: Manual updates
Improved quality through integration?
• Duplicates
  + Potential to improve (reduce) duplicates using combined data sources.
  - Potential of introducing new duplicates.
• Granularity
  - Structured model may force choice of lowest common denominator (month vs. day). Depends on integrating operator and model.
• Completeness
  + Typically improves
  • Degree depends on level of object / attribute overlap
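The three effects on this slide can be illustrated together in a small sketch. It is not the group's method: matching on a normalized title is an assumed, deliberately naive duplicate test, and `coarsen_date` is a hypothetical helper that reduces every date to month precision, the lowest common denominator of the two sample sources. The records are invented.

```python
import re
from typing import Optional

Record = dict[str, Optional[str]]


def normalize_title(title: str) -> str:
    """Lowercase and collapse punctuation so near-identical titles match."""
    return re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()


def coarsen_date(date: str) -> str:
    """Reduce 'YYYY-MM-DD' or 'YYYY-MM' to month precision 'YYYY-MM'."""
    return date[:7]


def merge_sources(*sources: list[Record]) -> list[Record]:
    """Merge records with the same normalized title, filling missing attributes."""
    merged: dict[str, Record] = {}
    for source in sources:
        for rec in source:
            entry = merged.setdefault(normalize_title(rec["title"]), {})
            for field, value in rec.items():
                if value is None or field in entry:
                    continue  # keep the first non-null value seen
                entry[field] = coarsen_date(value) if field == "date" else value
    return list(merged.values())


# Hypothetical records: source_a stores days, source_b only months.
source_a = [{"title": "Data Integration: A Survey", "date": "2003-06-14", "pages": None}]
source_b = [
    {"title": "data integration - a survey", "date": "2003-06", "pages": "1-10"},
    {"title": "Query Rewriting", "date": "2001-02", "pages": None},
]

for record in merge_sources(source_a, source_b):
    print(record)
```

The merged survey record gains the `pages` value from the second source (completeness improves) but loses the day of publication (granularity drops). A key this simple can also merge distinct papers that share a title, which is one way integration degrades quality instead of improving it.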
Improved quality through integration?
• Ontologies
  - Comparing/merging ontologies typically requires a trade-off that reduces quality.
  + With increased effort, quality of combined ontology (integration) may improve.
• Currency
  • An old value is not necessarily an out-of-date value.
  • Application specific
    • Stock quotes with different up-to-date values: retain all values?
    • Address information: Delete out-of-date information.
  + Integrated result has a guaranteed currency (lower bound).
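One reading of the currency bullets, sketched under stated assumptions: the guaranteed currency of an integrated result is taken to be the oldest last-update time among the contributing sources, and the application-specific handling is modeled as two hypothetical policies, `retain_all` (stock quotes) and `latest_only` (addresses). The slides name neither the policies nor the formula; timestamps and values are invented.

```python
from datetime import datetime
from typing import Any


def currency_lower_bound(last_updates: list[datetime]) -> datetime:
    """The integrated result is at least as current as its least current source."""
    return min(last_updates)


def resolve(values: list[tuple[datetime, Any]], policy: str) -> list[Any]:
    """Resolve conflicting timestamped values according to an application policy."""
    ordered = sorted(values, key=lambda pair: pair[0])
    if policy == "retain_all":   # e.g. stock quotes: older values are still valid history
        return [value for _, value in ordered]
    if policy == "latest_only":  # e.g. addresses: older values are out of date
        return [ordered[-1][1]]
    raise ValueError(f"unknown policy: {policy}")


# Hypothetical timestamped values from two sources.
quotes = [(datetime(2024, 1, 9, 16, 0), 102.0), (datetime(2024, 1, 5, 16, 0), 101.5)]
addresses = [(datetime(2020, 3, 1), "1 Old Road"), (datetime(2023, 8, 1), "2 New Street")]

print(currency_lower_bound([timestamp for timestamp, _ in quotes]))
print(resolve(quotes, "retain_all"))
print(resolve(addresses, "latest_only"))
```

The sketch keeps the policy separate from the data because the slide stresses that whether an old value is out of date depends on the application.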
Improved quality through integration?
• Response time (time, cost, delay)
  - Greater impact on virtual integration
  • Trade-offs and user interests
• Availability
  + DW increases availability
  • Depends on degree of overlap
  • Depends on specific query
  • Type of operator: Union (partial) or Join (fail)
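The last bullet can be made concrete with a small sketch of a virtual integration query in which one source is unreachable. Modeling an unavailable source as `None` is an assumption made for illustration, as are the function names and sample rows: a union can still return a partial answer, while a join has nothing to combine and fails.

```python
from typing import Any, Optional

Rows = list[dict[str, Any]]


def union_query(sources: list[Optional[Rows]]) -> Rows:
    """Union over whatever sources are reachable; the answer may be partial."""
    result: Rows = []
    for rows in sources:
        if rows is not None:
            result.extend(rows)
    return result


def join_query(left: Optional[Rows], right: Optional[Rows], key: str) -> Rows:
    """Equi-join on `key`; fails if either source is unavailable."""
    if left is None or right is None:
        raise RuntimeError("join needs both sources to be available")
    index: dict[Any, Rows] = {}
    for row in right:
        index.setdefault(row[key], []).append(row)
    return [{**l, **r} for l in left for r in index.get(l[key], [])]


# Hypothetical rows; source_b is unreachable in this scenario.
source_a: Optional[Rows] = [{"id": 1, "title": "T1"}, {"id": 2, "title": "T2"}]
source_b: Optional[Rows] = None

print(union_query([source_a, source_b]))  # partial answer from source_a only

try:
    join_query(source_a, source_b, "id")
except RuntimeError as error:
    print("join failed:", error)
```

With a warehouse both inputs are local copies, so the join would succeed regardless of whether the original source is reachable; this is the sense in which a DW increases availability.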