DATA FUSION

DATA FUSION Resolving Inconsistencies at Schema, Tuple and Value Level NaveenRajamoorthy Nachiappan Chidambaram ArunkarthikeyanPalaniswamy SriramakrishnanSoundarrajan

Need for Data Fusion • To compare different Data Sets • Example: • Shopping Agents • Disaster Management System

GOALS OF DATA INTEGRATION • Completeness - amount of data (number of attributes and tuples) - achieved by adding more data sources • Conciseness - number of unique objects - number of unique attributes of the objects - achieved by reducing schematic inconsistencies by schema mapping • Correctness - validity of data - achieved by performing duplicate detection and data fusion Data Sources Schema Mapping Duplicate Detection Data Fusion

Humboldt Merger(HumMer) • Fusing data from heterogeneous sources. • All Steps are performed at run-time. • Data Cleaning • Maximum Flexibility

Components of Data Fusion • Heterogeneous and Dirty data • Three Steps • Schema Matching and Data Transformation • Duplicate Detection • Data Fusion

Application Data Fusion Resolve inconsistencies at value level Step 3 Step 2 Duplicate Detection Resolve inconsistencies at tuple level Step 1 Resolve inconsistencies at schema level Schema Matching Data Sources Three Steps in Data Fusion

Schema Matching and Data Transformation

Schema Matching • Process of resolving schematic heterogeneity. 1. DUMAS Schema Matching Algorithm (Duplicate-based Matching of Schemas ) 2. TF IDF Similarity (term frequency–inverse document frequency)

Example Consider the relation R(A,B,C,D,E) and S(B’,F,E,’G)

ExampleConsider the relation R(A,B,C,D,E) and S(B’,F,E,’G)

Example

Schema Matching • Overlap of R and S schema

Transformation • Preferred schema • Names of attributes are renamed or determined. • sourceID attribute is added to all tables in the schema.

Duplicate Detection

EXAMPLE <pub> <Name> Database Systems: The Complete Book</Name> <Authors> <Author> Hector Garcia-Molina</Author> <Author> Jeffrey D. Ullman</Author> <Author> Jennifer D. Widom</Author> </Authors> </pub> Source A Source B <publication> <title> Database Systems: The Complete Book </title> <author> Molina & Ullman</author> <year> 1990 </year> </publication>

SCHEMA MAPPING <pub> <Name> Database Systems: The Complete Book</Name> <Authors> <Author> Hector Garcia-Molina</Author> <Author> Jeffrey D. Ullman</Author> <Author> Jennifer D. Widom</Author> </Authors> </pub> <pub> <title> </title> <Authors> <author> </author> <author> </author> </Authors> <year> </year> </pub> Source A Source B <publication> <title> Database Systems: The Complete Book </title> <Author> Molina & Ullman</Author> <year> 1990 </year> </publication>

DATA TRANSFORMATION <pub> <Name> Database Systems: The Complete Book</Name> <Authors> <Author> Hector Garcia-Molina</Author> <Author> Jeffrey D. Ullman</Author> <Author> Jennifer D. Widom</Author> </Authors> </pub> Source A <pub> <title> Database Systems: The Complete Book </title> <Authors> <Author> Hector Garcia-Molina</Author> <Author> Jeffrey D. Ullman</Author> <Author> Jennifer D. Widom</Author> </Authors> </pub> <pub> <title> Database Systems: The Complete Book</title> <Author> Hector Garcia-Molina</Author> <Author> Jeffrey D. Ullman</Author> <Author> Jennifer D. Widom</Author> <year> 1990 </year> </pub> XQuery <pub> <title> Database Systems: The Complete Book</title> <Author> Hector Garcia-Molina</Author> <Author> Jeffrey D. Ullman</Author> <Author> Jennifer D. Widom</Author> <year> 1990 </year> </pub> Source B XQuery

DUPLICATE DETECTION AND FUSION <pub> <title> Database Systems: The Complete Book </title> <Authors> <Author> Hector Garcia-Molina</Author> <Author> Jeffrey D. Ullman</Author> <Author> Jennifer D. Widom</Author> </Authors> </pub> <pub> <title> Database Systems: The Complete Book</title> <Author> Hector Garcia-Molina</Author> <Author> Jeffrey D. Ullman</Author> <Author> Jennifer D. Widom</Author> <year> 1990 </year> </pub> <pub> <title> Database Systems: The Complete Book </title> <Authors> <Author> Hector Garcia-Molina</Author> <Author> Jeffrey D. Ullman</Author> <Author> Jennifer D. Widom</Author> </Authors> <year> 1990 </year> </pub>

QUESTION • Give the correct order in which integration needs to be carried out: A) Data Transformation -> Schema Mapping -> Duplicate detection ->Fusion B) Duplicate detection -> Data Transformation -> Schema Mapping -> Fusion C)Schema Mapping -> Data Transformation -> Duplicate detection ->Fusion D) Data Transformation -> Schema Mapping -> Fusion -> Duplicate detection

Duplicate Detection • Problem • Given one or more data sets, find all sets of objects that represent the same real-world entity. • Difficulties • Duplicates are not identical • Similarity measures – Levenshtein, Jaccard, etc. • Large volume, cannot compare all pairs • Partitioning strategies – Sorted neighborhood, Blocking, etc.

PARTITIONING STRATEGIES • General Strategy • Sorted Neighborhood Method

GENERAL STRATEGY Compare each record with every other record and calculate distance measure. Assuming there are n records in database then we need to compute n(n-1)/2 distance measures. Number of records, n = 6 Number of Distance measures to be computed = 10 If there are say, 100000 records, Then, Number of Distance Measures to be computed = 5*10^8 calculations EXPENSIVE

SORTED NEIGHBORHOOD METHOD • Using Sorted Neighborhood method we can reduce the number of potential duplicate pairs. • Different fields are identified as key. • The database is sorted using this key. • After sorting a window of fixed size slides over the sorted database and duplicate records are identified. • The technique generates O(wN) pairs where w is window size and N is the total number of records in database.

DUPLICATE DETECTION WITH DESCRIPTIONS Criteria For Attribute Selection: Attributes that are: related to the currently considered object Child elements having a Foreign key constraint over the attributes of the parent table. (ii) useable by our similarity measure Attribute City corresponding to attribute Zip code cannot be used to calculate similarity measure (iii) likely to distinguish duplicates from non-duplicates. Attribute for Denomination is unlikely to distinguish duplicate records

DUPLICATE DETECTION WITH DESCRIPTIONS Description: • Consider attributes from other tables that have a foreign key relationship with the existing tables. • For efficiency, only direct child attributes are considered, i.e. no descendants reached by following more than one reference are discarded. Let tables T1 and T2 be the two matched tables, and let {T1,1, . . . , T1,k} and {T2,1, . . . , T2,m} be their respective children tables. Then, every pair of tables (T1,i, T2,j), 1<=i<=k, i<=j<=m is matched. Thus Actor(Movie),Actress(Movie) and Actors(Film) can also be used for Duplicate Detection

Example Table 1 Table 2 ID in Table 1 is a foreign key for Country ID in Table 2 From Sim(Country) in Table 1 we understand row 1 and 3 are duplicates (row 1 = row 3) Now on using the attribute City in child table, Table 2 for Duplicate Detection we come to the conclusion that row 1 = row 2 = row 3 in table 1. i.e: USA = United States = US

Detection From Similarity Measure Source 1 Source1 x Source2 Partitioning Similarity measure θ1>sim<θ2 Source 2 Sure Duplicates sim > θ2 Possibile Duplicates sim < θ1 Non-Duplicates

Data Fusion • Objective Given a duplicate, create a single object-representation while resolving conflicting data values. Simple Example: Min(price) Max_length(author) Concat(Month,Year) Source 1 ID Source 2

TYPES OF DATA CONFLICT • Uncertainty • NULL value vs. non-NULL value • “Easy” case • Uncertainty Conflict between a non-null value and one or more null values that are all used to describe the same property of a real-world entity Causes: Missing information, such as null values in a source or a completely missing attribute in a source • Contradiction It is a conflict between two or more different non-null values that are all used to describe the same property of the same entity. Causes: Contradiction is caused by different sources providing different values for the same attribute of a real-world entity. • Contradiction • Non-NULL value vs. (different) non-NULL value

NULL TYPES • unknown • There is a value, but I do not know it. • E.g.: Unknown date-of-birth • notapplicable • There is no meaningful value. • E.g.: Spouse for singles • withheld • There is a value, but we are not authorized to see it. • E.g.: Private phone line

Question • ________ refers to “Conflict between a non-null value and one or more null values that are all used to describe the same property of a real-world entity” • A. Contradiction • B. Uncertainty • C. Resolution • D. Ignorance

Classification of Functions conflict resolution strategies conflict ignorance conflict avoidance conflict resolution Escalate instance based metadata based instance based metadata based Coalesce ChooseDepending Concat Choose deciding mediating deciding mediating CommonAncestor MostRecent MostAbstract MostSpecific MIN, MAX Random Vote AVG, SUM

Conflict Resolution Functions

Data Fusion Goals Assume 2 sources, Source 1(A,B,C) and Source 2(A,B,D) a, b, - a, b, -, - Identical tuples   a, b, -, - a, b, - a, b, -, - a, b, c a, b, c, - Subsumed tuples   a, b, c, - a, b, - a, b, -, - a, b, c a, b, c, -   a, b, c, d Complementing tuples a, b, d a, b, -, d a, b, c a, b, c, -   a, f(b,e), c, d Conflicting tuples a, e, d a, e, -, d

Relational Operators – Overview • Identical tuples (duplicates) • UNION, OUTER UNION • Subsumed tuples(uncertainty) • MINIMUM UNION • Complementing tuples (uncertainty) • COMPLEMENT UNION, MERGE • Conflicting tuples (contradiction) • MATCH, GROUP, FUSE

UNION ( SELECT Title AS Name,Author,ISBN FROM R)UNION( SELECT Name,Author,ID AS ISBNFROM S) UNION

MINIMUM UNION = Select A,B,C,D AS NULL FROM R UNION ALL SELECT A,B, C AS NULL,D FROM S A tuple t1 subsumes a tuple t2, if it has same schema, has less NULL-values, and coincides in all non-NULL-values.  +

FULL DISJUNCTION = |⋈| SELECT * FROM R FULL OUTER JOIN S ON R.A = S.A AND R.B = S.B;

MERGE AND PRIORITIZED MERGE SELECT * FROM R FULL OUTER JOIN S ON R.A = S.A AND R.B = S.B; |⋈ = =  = = ⋈|

FUSE BY SELECT Name, RESOLVE(Age, max), RESOLVE(Address,choose(EE_Students)) FUSE FROM EE_Students,CS_Students FUSE BY (Name) RESULT

FUSE BY IMDB FILMBUFF RESULT SELECT ID, RESOLVE(Title, Choose(IMDB)), RESOLVE(Year, Max), RESOLVE(Director,Concat), RESOLVE(Rating), FUSE FROM IMDB, Filmdienst FUSE BY (ID) ON ORDER Year DESC

Question Match The Following a, b, -, - Complementing tuples a 1 a, b, -, - a, b, c, - Conflicting tuples 2 b a, b, -, - a, b, c, - Subsumed tuples 3 c a, b, -, d a, b, c, - d Identical tuples 4 a, e, -, d

Hummer Screenshot

REFERENCES • http://coitweb.uncc.edu/~wwu18/itcs6010/presentation/fusion_vldb.pdf • http://vldb.idi.ntnu.no/program/slides/demo/s1251-bilke.pdf • http://coitweb.uncc.edu/~wwu18/itcs6010/presentation/fusion-3step.pdf • http://www.hpi.uni-potsdam.de/fileadmin/hpi/FG_Naumann/publications/Modena05.pdf • http://vldb2009.org/files/DataFusionFinal.pdf • http://disi.unitn.it/~p2p/RelatedWork/Matching/dublicatesICDE05.pdf

DATA FUSION

DATA FUSION

Presentation Transcript

Environmental Data Integration (Fusion Synthesis)

Data Fusion

Multi-Sensor Data Fusion

Data Fusion

FATPOT – Virtual Data Fusion ™

Data Fusion

Data Fusion – Resolving Data Conflicts in Integration

Fusion: Big Data, Big Art

Multisensor Data Fusion and Applications

AUTOMATIC TARGET RECOGNITION AND DATA FUSION

Data Fusion

Multisensor Data Fusion

Multisensor Data Fusion

Data Fusion and Data Mining

Data Fusion via SOA Middleware

DATA FUSION & the CAAQS

Post Disaster Data Fusion Visualization

Kernel based data fusion

MISR/MODIS intercomparisons and data fusion

Fast & Automatic Data Fusion

Fusion-Incomplete Fusion

Fusion Neutronics - Methods, Data, Applications -

DATA FUSION

DATA FUSION

Presentation Transcript

Environmental Data Integration (Fusion Synthesis)

Data Fusion

Multi-Sensor Data Fusion

Data Fusion

FATPOT – Virtual Data Fusion ™

Data Fusion

Data Fusion – Resolving Data Conflicts in Integration

Fusion: Big Data, Big Art

Multisensor Data Fusion and Applications

AUTOMATIC TARGET RECOGNITION AND DATA FUSION

Data Fusion

Multisensor Data Fusion

Multisensor Data Fusion

Data Fusion and Data Mining

Data Fusion via SOA Middleware

DATA FUSION &amp; the CAAQS

Post Disaster Data Fusion Visualization

Kernel based data fusion

MISR/MODIS intercomparisons and data fusion

Fast &amp; Automatic Data Fusion

Fusion-Incomplete Fusion

Fusion Neutronics - Methods, Data, Applications -

DATA FUSION & the CAAQS

Fast & Automatic Data Fusion