Data Fusion Jens Bleiholder and Felix Naumann Presented by Aaron Stewart
Data Integration • Schema mapping • Duplicate detection • Data fusion
Complete / Concise • Analogous to recall / precision, respectively • Complete: coverage of real-world objects • Concise: avoid duplicates
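The two goals can be made concrete with a small sketch. It assumes every record already carries an identifier for the real-world object it describes (the `object_id` field, the sample records, and the function names are hypothetical), and it uses one common formulation: completeness as the share of real-world objects that are represented, conciseness as the share of records that are not redundant.

```python
def completeness(records, universe_ids):
    """Fraction of the known real-world objects that appear in the records."""
    universe = set(universe_ids)
    represented = {r["object_id"] for r in records}
    return len(represented & universe) / len(universe)


def conciseness(records):
    """Fraction of records that are distinct objects (1.0 means no duplicates)."""
    represented = {r["object_id"] for r in records}
    return len(represented) / len(records)


# Hypothetical integrated result: object 1 is represented twice.
records = [
    {"object_id": 1, "name": "Alice"},
    {"object_id": 1, "name": "Alice M."},
    {"object_id": 2, "name": "Bob"},
]
universe_ids = [1, 2, 3, 4]

print(completeness(records, universe_ids))  # 2 of 4 objects are covered
print(conciseness(records))                 # 2 distinct objects in 3 records
```

Adding more sources tends to raise completeness, while removing duplicates raises conciseness; data fusion aims for both at once.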
Conflicts • Schematic conflicts • Identity conflicts • Data conflicts • Uncertainty • Contradiction
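The two kinds of data conflict can be told apart mechanically. The sketch below assumes the usual reading of the terms: an uncertainty is a known value set against missing values, and a contradiction is two or more different known values for the same attribute of the same object. The function name and the use of `None` for NULL are illustrative choices.

```python
def classify_conflict(values):
    """Classify the values that duplicate records report for one attribute.

    None stands for a missing (NULL) value.
    """
    known = {v for v in values if v is not None}
    if len(known) > 1:
        return "contradiction"   # at least two different non-null values
    if len(known) == 1 and None in values:
        return "uncertainty"     # one known value versus missing values
    return "no conflict"         # all values agree, or all are missing


print(classify_conflict([25, None]))  # uncertainty
print(classify_conflict([25, 26]))    # contradiction
print(classify_conflict([25, 25]))    # no conflict
```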
Uniqueness • Uniqueness-preserving • Uniqueness-enforcing
Value preservation • Value-preserving • Non-value-preserving • Object-preserving
Joins • Equi-join • Natural join • Full outer join • Key join • Left join • Right join
Equi-join SELECT U1.Name, U2.Name, U1.Age, U2.Age, U1.Status, U2.Status, U1.Address, U2.Address, U1.Field, U2.Field, U1.Library, U2.Phone FROM U1 JOIN U2 ON U1.Name=U2.Name
Equi-join Result • [result table omitted]
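To see what the equi-join returns, the query can be run against a small in-memory SQLite database. The rows below are made up for illustration, and the tables are cut down to three columns each (an assumption; the slides' U1 and U2 have more attributes).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE U1 (Name TEXT, Age INTEGER, Library TEXT);
    CREATE TABLE U2 (Name TEXT, Age INTEGER, Phone TEXT);
    INSERT INTO U1 VALUES ('Alice', 25, 'Main'), ('Bob', 31, 'Science');
    INSERT INTO U2 VALUES ('Alice', 26, '555-0100'), ('Carol', 40, '555-0101');
""")

# Same shape as the slide's query, restricted to the reduced column set.
rows = conn.execute("""
    SELECT U1.Name, U2.Name, U1.Age, U2.Age, U1.Library, U2.Phone
    FROM U1 JOIN U2 ON U1.Name = U2.Name
    ORDER BY U1.Name
""").fetchall()

for row in rows:
    print(row)
```

Only Alice appears in the output, because Bob and Carol have no partner in the other table. Her two ages (25 and 26) sit side by side: the join exposes the data conflict but does not resolve it.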
Natural Join SELECT U1.Name, U1.Age, U1.Status, U1.Address, U1.Field, U1.Library, U2.Phone FROM U1 JOIN U2 ON U1.Name=U2.Name AND U1.Age=U2.Age AND U1.Status=U2.Status AND U1.Address=U2.Address AND U1.Field=U2.Field
Natural Join Result • [result table omitted]
Full Outer Join SELECT U1.Name, U2.Name, U1.Age, U2.Age, U1.Status, U2.Status, U1.Address, U2.Address, U1.Field, U2.Field, U1.Library, U2.Phone FROM U1 FULL OUTER JOIN U2 ON U1.Name=U2.Name
Full Outer Join Result • [result table omitted]
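The effect of the full outer join can be sketched in plain Python, which avoids depending on a particular SQL engine's support for `FULL OUTER JOIN`. The helper name, the `U1.`/`U2.` column prefixes, and the sample rows are assumptions made for illustration.

```python
def full_outer_join(left, right, key):
    """Full outer join of two non-empty lists of dicts on one key column.

    Unmatched rows are kept and padded with None, mirroring SQL NULLs.
    """
    left_cols = list(left[0])
    right_cols = list(right[0])

    def combine(lrow, rrow):
        row = {f"U1.{c}": (lrow[c] if lrow is not None else None) for c in left_cols}
        row.update({f"U2.{c}": (rrow[c] if rrow is not None else None) for c in right_cols})
        return row

    result, matched = [], set()
    for lrow in left:
        # As in SQL, a missing key never matches anything.
        hits = [i for i, rrow in enumerate(right)
                if lrow[key] is not None and rrow[key] == lrow[key]]
        matched.update(hits)
        if hits:
            result.extend(combine(lrow, right[i]) for i in hits)
        else:
            result.append(combine(lrow, None))
    # Right-hand rows that never found a partner are kept as well.
    result.extend(combine(None, rrow) for i, rrow in enumerate(right) if i not in matched)
    return result


u1 = [{"Name": "Alice", "Age": 25}, {"Name": "Bob", "Age": 31}]
u2 = [{"Name": "Alice", "Age": 26}, {"Name": "Carol", "Age": 40}]

joined = full_outer_join(u1, u2, "Name")
for row in joined:
    print(row)
```

Unlike the inner equi-join, no object is lost: Bob and Carol survive with None on the side where they have no partner. This keeps the result complete, but conflicting values (Alice's two ages) are still left unresolved.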
Full Disjunction • Generalizes the full outer join to more than two tables
Information Systems for Data Fusion • Conflict resolution • Conflict avoidance • Conflict ignorance • No conflict handling
Architecture • Database management system (DBMS) • Multidatabase management system (MDBMS) • Mediator-wrapper (MW) • Multi-agent system (MAS) • Stand-alone application (APP)
Integration Model • Global-as-view (GaV) • Local-as-view (LaV) • Global-Local-as-view (GLaV)
1. Conflict-Resolving Systems • Multibase • Hermes • Fusionplex • HumMer • Ajax
Multibase • c. 1983 • Solution: outer join followed by aggregation (min, max, sum, choose, etc.)
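The idea of resolving conflicts with aggregation functions can be sketched as follows. This is not Multibase's actual syntax or implementation; the helper names (`choose`, `by_values`, `fuse`), the source tags, and the sample values are all hypothetical. It assumes the outer join has already grouped the candidate values for one real-world object.

```python
def choose(preferred_source):
    """Build a resolver that takes the value reported by one named source."""
    def resolver(candidates):
        for source, value in candidates:
            if source == preferred_source and value is not None:
                return value
        return None
    return resolver


def by_values(func):
    """Adapt a plain aggregate (min, max, sum) to ignore sources and NULLs."""
    def resolver(candidates):
        values = [v for _, v in candidates if v is not None]
        return func(values) if values else None
    return resolver


def fuse(candidates_by_attr, resolvers):
    """Resolve each attribute of one real-world object to a single value."""
    return {attr: resolvers[attr](cands)
            for attr, cands in candidates_by_attr.items()}


# Hypothetical conflicting values for one person, tagged with their source.
alice = {
    "Age": [("U1", 25), ("U2", 26)],
    "Phone": [("U1", None), ("U2", "555-0100")],
    "Address": [("U1", "12 Elm St"), ("U2", "12 Elm Street")],
}
resolvers = {
    "Age": by_values(max),      # contradiction: keep the larger value
    "Phone": by_values(min),    # uncertainty: only one non-null value exists
    "Address": choose("U1"),    # contradiction: trust source U1
}

fused = fuse(alice, resolvers)
print(fused)
```

Choosing a resolver per attribute is the design decision that matters here: numeric aggregates suit some columns, while a source preference is often the only sensible option for free-text values such as addresses.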
Hermes • HEterogeneous Reasoning and MEdiator System • C. 1996 • Mediator-specified conflict resolution • Created by an expert
Fusionplex • Multiplex, Fusionplex, Autoplex • Classifies quality of data • User-prioritized feature “importance” • Able to incorporate new/unknown databases
HumMer • Humboldt-Merger • C. 2006 • Handles conflicts in schema, identity, data • Clusters duplicates • User-defined aggregation functions
Ajax • Format and unit conversion • User-defined cleansing process • Compiled to Java
2. Conflict-Avoiding Systems • TSIMMIS • SIMS and Ariadne • Infomix • HIPPO • ConQuer • Rainbow
3. Conflict-Ignoring Systems • Pegasus • Nimble • Carnot • InfoSleuth • Potter’s Wheel
Other Systems • Research Systems • Trio • Information Manifold • Garlic • Disco (Distributed Information Search Component) • Papyrus, Nomenclature • DIOM, KOMET, Infomaster, Occam, SIMS, Internet Softbot • Singapore, Magic, Observer • Lore, Tukwila • SIRIUS-DELTA, DDTS, Mermaid, UNIBASE • MRDSM, OMNIBASE, CALIDA, DQS
Other Systems • Commercial • IBM, Oracle, Microsoft, others • IBM Information Server (IIS) • Microsoft SQL Server Integration Services (SSIS)
Other Systems • Peer Data Management Systems • Orchestra • Hyper
Analysis • Weaknesses • Difficult to show utility of a tool on paper • Strengths • Covered a lot of theory • Covered a lot of systems