540 likes | 937 Views
DATA FUSION. Resolving Inconsistencies at Schema, Tuple and Value Level Naveen Rajamoorthy Nachiappan Chidambaram Arun karthikeyan Palaniswamy Sriramakrishnan Soundarrajan. Need for Data Fusion. To compare different Data Sets Example: Shopping Agents Disaster Management System.
E N D
DATA FUSION Resolving Inconsistencies at Schema, Tuple and Value Level NaveenRajamoorthy Nachiappan Chidambaram ArunkarthikeyanPalaniswamy SriramakrishnanSoundarrajan
Need for Data Fusion • To compare different Data Sets • Example: • Shopping Agents • Disaster Management System
GOALS OF DATA INTEGRATION • Completeness - amount of data (number of attributes and tuples) - achieved by adding more data sources • Conciseness - number of unique objects - number of unique attributes of the objects - achieved by reducing schematic inconsistencies by schema mapping • Correctness - validity of data - achieved by performing duplicate detection and data fusion Data Sources Schema Mapping Duplicate Detection Data Fusion
Humboldt Merger(HumMer) • Fusing data from heterogeneous sources. • All Steps are performed at run-time. • Data Cleaning • Maximum Flexibility
Components of Data Fusion • Heterogeneous and Dirty data • Three Steps • Schema Matching and Data Transformation • Duplicate Detection • Data Fusion
Application Data Fusion Resolve inconsistencies at value level Step 3 Step 2 Duplicate Detection Resolve inconsistencies at tuple level Step 1 Resolve inconsistencies at schema level Schema Matching Data Sources Three Steps in Data Fusion
Schema Matching • Process of resolving schematic heterogeneity. 1. DUMAS Schema Matching Algorithm (Duplicate-based Matching of Schemas ) 2. TF IDF Similarity (term frequency–inverse document frequency)
Example Consider the relation R(A,B,C,D,E) and S(B’,F,E,’G)
ExampleConsider the relation R(A,B,C,D,E) and S(B’,F,E,’G)
Schema Matching • Overlap of R and S schema
Transformation • Preferred schema • Names of attributes are renamed or determined. • sourceID attribute is added to all tables in the schema.
EXAMPLE <pub> <Name> Database Systems: The Complete Book</Name> <Authors> <Author> Hector Garcia-Molina</Author> <Author> Jeffrey D. Ullman</Author> <Author> Jennifer D. Widom</Author> </Authors> </pub> Source A Source B <publication> <title> Database Systems: The Complete Book </title> <author> Molina & Ullman</author> <year> 1990 </year> </publication>
SCHEMA MAPPING <pub> <Name> Database Systems: The Complete Book</Name> <Authors> <Author> Hector Garcia-Molina</Author> <Author> Jeffrey D. Ullman</Author> <Author> Jennifer D. Widom</Author> </Authors> </pub> <pub> <title> </title> <Authors> <author> </author> <author> </author> </Authors> <year> </year> </pub> Source A Source B <publication> <title> Database Systems: The Complete Book </title> <Author> Molina & Ullman</Author> <year> 1990 </year> </publication>
DATA TRANSFORMATION <pub> <Name> Database Systems: The Complete Book</Name> <Authors> <Author> Hector Garcia-Molina</Author> <Author> Jeffrey D. Ullman</Author> <Author> Jennifer D. Widom</Author> </Authors> </pub> Source A <pub> <title> Database Systems: The Complete Book </title> <Authors> <Author> Hector Garcia-Molina</Author> <Author> Jeffrey D. Ullman</Author> <Author> Jennifer D. Widom</Author> </Authors> </pub> <pub> <title> Database Systems: The Complete Book</title> <Author> Hector Garcia-Molina</Author> <Author> Jeffrey D. Ullman</Author> <Author> Jennifer D. Widom</Author> <year> 1990 </year> </pub> XQuery <pub> <title> Database Systems: The Complete Book</title> <Author> Hector Garcia-Molina</Author> <Author> Jeffrey D. Ullman</Author> <Author> Jennifer D. Widom</Author> <year> 1990 </year> </pub> Source B XQuery
DUPLICATE DETECTION AND FUSION <pub> <title> Database Systems: The Complete Book </title> <Authors> <Author> Hector Garcia-Molina</Author> <Author> Jeffrey D. Ullman</Author> <Author> Jennifer D. Widom</Author> </Authors> </pub> <pub> <title> Database Systems: The Complete Book</title> <Author> Hector Garcia-Molina</Author> <Author> Jeffrey D. Ullman</Author> <Author> Jennifer D. Widom</Author> <year> 1990 </year> </pub> <pub> <title> Database Systems: The Complete Book </title> <Authors> <Author> Hector Garcia-Molina</Author> <Author> Jeffrey D. Ullman</Author> <Author> Jennifer D. Widom</Author> </Authors> <year> 1990 </year> </pub>
QUESTION • Give the correct order in which integration needs to be carried out: A) Data Transformation -> Schema Mapping -> Duplicate detection ->Fusion B) Duplicate detection -> Data Transformation -> Schema Mapping -> Fusion C)Schema Mapping -> Data Transformation -> Duplicate detection ->Fusion D) Data Transformation -> Schema Mapping -> Fusion -> Duplicate detection
Duplicate Detection • Problem • Given one or more data sets, find all sets of objects that represent the same real-world entity. • Difficulties • Duplicates are not identical • Similarity measures – Levenshtein, Jaccard, etc. • Large volume, cannot compare all pairs • Partitioning strategies – Sorted neighborhood, Blocking, etc.
PARTITIONING STRATEGIES • General Strategy • Sorted Neighborhood Method
GENERAL STRATEGY Compare each record with every other record and calculate distance measure. Assuming there are n records in database then we need to compute n(n-1)/2 distance measures. Number of records, n = 6 Number of Distance measures to be computed = 10 If there are say, 100000 records, Then, Number of Distance Measures to be computed = 5*10^8 calculations EXPENSIVE
SORTED NEIGHBORHOOD METHOD • Using Sorted Neighborhood method we can reduce the number of potential duplicate pairs. • Different fields are identified as key. • The database is sorted using this key. • After sorting a window of fixed size slides over the sorted database and duplicate records are identified. • The technique generates O(wN) pairs where w is window size and N is the total number of records in database.
DUPLICATE DETECTION WITH DESCRIPTIONS Criteria For Attribute Selection: Attributes that are: related to the currently considered object Child elements having a Foreign key constraint over the attributes of the parent table. (ii) useable by our similarity measure Attribute City corresponding to attribute Zip code cannot be used to calculate similarity measure (iii) likely to distinguish duplicates from non-duplicates. Attribute for Denomination is unlikely to distinguish duplicate records
DUPLICATE DETECTION WITH DESCRIPTIONS Description: • Consider attributes from other tables that have a foreign key relationship with the existing tables. • For efficiency, only direct child attributes are considered, i.e. no descendants reached by following more than one reference are discarded. Let tables T1 and T2 be the two matched tables, and let {T1,1, . . . , T1,k} and {T2,1, . . . , T2,m} be their respective children tables. Then, every pair of tables (T1,i, T2,j), 1<=i<=k, i<=j<=m is matched. Thus Actor(Movie),Actress(Movie) and Actors(Film) can also be used for Duplicate Detection
Example Table 1 Table 2 ID in Table 1 is a foreign key for Country ID in Table 2 From Sim(Country) in Table 1 we understand row 1 and 3 are duplicates (row 1 = row 3) Now on using the attribute City in child table, Table 2 for Duplicate Detection we come to the conclusion that row 1 = row 2 = row 3 in table 1. i.e: USA = United States = US
Detection From Similarity Measure Source 1 Source1 x Source2 Partitioning Similarity measure θ1>sim<θ2 Source 2 Sure Duplicates sim > θ2 Possibile Duplicates sim < θ1 Non-Duplicates
Data Fusion • Objective Given a duplicate, create a single object-representation while resolving conflicting data values. Simple Example: Min(price) Max_length(author) Concat(Month,Year) Source 1 ID Source 2
TYPES OF DATA CONFLICT • Uncertainty • NULL value vs. non-NULL value • “Easy” case • Uncertainty Conflict between a non-null value and one or more null values that are all used to describe the same property of a real-world entity Causes: Missing information, such as null values in a source or a completely missing attribute in a source • Contradiction It is a conflict between two or more different non-null values that are all used to describe the same property of the same entity. Causes: Contradiction is caused by different sources providing different values for the same attribute of a real-world entity. • Contradiction • Non-NULL value vs. (different) non-NULL value
NULL TYPES • unknown • There is a value, but I do not know it. • E.g.: Unknown date-of-birth • notapplicable • There is no meaningful value. • E.g.: Spouse for singles • withheld • There is a value, but we are not authorized to see it. • E.g.: Private phone line
Question • ________ refers to “Conflict between a non-null value and one or more null values that are all used to describe the same property of a real-world entity” • A. Contradiction • B. Uncertainty • C. Resolution • D. Ignorance
Classification of Functions conflict resolution strategies conflict ignorance conflict avoidance conflict resolution Escalate instance based metadata based instance based metadata based Coalesce ChooseDepending Concat Choose deciding mediating deciding mediating CommonAncestor MostRecent MostAbstract MostSpecific MIN, MAX Random Vote AVG, SUM
Data Fusion Goals Assume 2 sources, Source 1(A,B,C) and Source 2(A,B,D) a, b, - a, b, -, - Identical tuples a, b, -, - a, b, - a, b, -, - a, b, c a, b, c, - Subsumed tuples a, b, c, - a, b, - a, b, -, - a, b, c a, b, c, - a, b, c, d Complementing tuples a, b, d a, b, -, d a, b, c a, b, c, - a, f(b,e), c, d Conflicting tuples a, e, d a, e, -, d
Relational Operators – Overview • Identical tuples (duplicates) • UNION, OUTER UNION • Subsumed tuples(uncertainty) • MINIMUM UNION • Complementing tuples (uncertainty) • COMPLEMENT UNION, MERGE • Conflicting tuples (contradiction) • MATCH, GROUP, FUSE
UNION ( SELECT Title AS Name,Author,ISBN FROM R)UNION( SELECT Name,Author,ID AS ISBNFROM S) UNION
MINIMUM UNION = Select A,B,C,D AS NULL FROM R UNION ALL SELECT A,B, C AS NULL,D FROM S A tuple t1 subsumes a tuple t2, if it has same schema, has less NULL-values, and coincides in all non-NULL-values. +
FULL DISJUNCTION = |⋈| SELECT * FROM R FULL OUTER JOIN S ON R.A = S.A AND R.B = S.B;
MERGE AND PRIORITIZED MERGE SELECT * FROM R FULL OUTER JOIN S ON R.A = S.A AND R.B = S.B; |⋈ = = = = ⋈|
FUSE BY SELECT Name, RESOLVE(Age, max), RESOLVE(Address,choose(EE_Students)) FUSE FROM EE_Students,CS_Students FUSE BY (Name) RESULT
FUSE BY IMDB FILMBUFF RESULT SELECT ID, RESOLVE(Title, Choose(IMDB)), RESOLVE(Year, Max), RESOLVE(Director,Concat), RESOLVE(Rating), FUSE FROM IMDB, Filmdienst FUSE BY (ID) ON ORDER Year DESC
Question Match The Following a, b, -, - Complementing tuples a 1 a, b, -, - a, b, c, - Conflicting tuples 2 b a, b, -, - a, b, c, - Subsumed tuples 3 c a, b, -, d a, b, c, - d Identical tuples 4 a, e, -, d
REFERENCES • http://coitweb.uncc.edu/~wwu18/itcs6010/presentation/fusion_vldb.pdf • http://vldb.idi.ntnu.no/program/slides/demo/s1251-bilke.pdf • http://coitweb.uncc.edu/~wwu18/itcs6010/presentation/fusion-3step.pdf • http://www.hpi.uni-potsdam.de/fileadmin/hpi/FG_Naumann/publications/Modena05.pdf • http://vldb2009.org/files/DataFusionFinal.pdf • http://disi.unitn.it/~p2p/RelatedWork/Matching/dublicatesICDE05.pdf