Multifaceted Exploitation of Metadata for Attribute Match Discovery in Information Integration. Li Xu, David W. Embley, David Jackman. Background. Problem: attribute matching. Techniques: data values, data-dictionary information, structural properties, ontologies, terminological relationships.
Multifaceted Exploitation of Metadata for Attribute Match Discovery in Information Integration • Li Xu • David W. Embley • David Jackman
Background • Problem: Attribute matching • Techniques • Data values • Data-dictionary information • Structural properties • Ontologies • Terminological relationships
Approach • Target Schema T • Source Schema S • Framework • Individual Facet Matching; • Combining Multiple Facets; • Iteration.
Example • Figure: Target Schema T and Source Schema S, both centered on Car. Object sets appearing across the two schemas include Car, Year, Make, Model, Style, Feature, Cost, Mileage, Miles, and Phone, connected by “has” relationships with participation constraints 0:1 and 0:*.
Individual Facet Matching • Terminological relationships • Data value characteristics • Target-specific, regular-expression matches
Terminological Relationships • Names of Attributes • A: attribute name in T • B: attribute name in S • WordNet • C4.5 Decision Tree • Feature selection • f0: Same word • f1: Synonym • f2: Sum of the distances of A and B to a common hypernym root • f3: Number of different common hypernym roots of A and B • f4: Sum of the number of senses of A and B
WordNet Rule • Figure: decision tree whose tests use the number of different common hypernym roots of A and B, the sum of the number of senses of A and B, and the sum of the distances of A and B to a common hypernym.
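The terminological features can be illustrated with a minimal sketch. It assumes a tiny hand-written hypernym table (`HYPERNYM`) and synonym set (`SYNONYMS`) as hypothetical stand-ins for WordNet, and it takes the "common hypernym" for f2 to be the nearest shared ancestor, which is an assumption. Features f3 and f4 need WordNet's multiple senses per word and are left out of this toy version.

```python
# Hypothetical toy data standing in for WordNet (not real WordNet content).
# Each word maps to its hypernym (parent); None marks a root.
HYPERNYM = {
    "car": "vehicle",
    "truck": "vehicle",
    "vehicle": "artifact",
    "artifact": None,
    "price": "value",
    "cost": "value",
    "value": None,
}
SYNONYMS = {frozenset({"car", "auto"}), frozenset({"mileage", "miles"})}


def path_to_root(word):
    """Return [word, parent, ..., root] by following hypernym links."""
    path = []
    while word is not None:
        path.append(word)
        word = HYPERNYM.get(word)
    return path


def term_features(a, b):
    """Compute f0, f1, f2 for attribute names a (from T) and b (from S)."""
    a, b = a.lower(), b.lower()
    f0 = int(a == b)                              # same word
    f1 = int(frozenset({a, b}) in SYNONYMS)       # listed as synonyms
    pa, pb = path_to_root(a), path_to_root(b)
    common = [w for w in pa if w in pb]           # shared ancestors, nearest first
    # Assumption: distance is measured to the nearest shared ancestor.
    f2 = pa.index(common[0]) + pb.index(common[0]) if common else None
    return {"f0": f0, "f1": f1, "f2": f2}


features = term_features("Cost", "Price")
```

In the slides these features feed a C4.5 decision tree; the sketch stops at feature extraction and does not train a classifier.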
Data-Value Characteristics • C4.5 Decision Tree • Features [LC94] • Numeric data • Mean, variation, coefficient of variation, standard deviation; • Alphanumeric data • String length, numeric ratio, space ratio.
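The listed data-value features can be computed as in the following sketch. The exact definitions are assumptions: population variance stands in for "variation", and the numeric and space ratios are taken as the share of digit and space characters over all characters in the column. The function names are hypothetical.

```python
import statistics


def numeric_features(values):
    """Summary statistics for a column of numeric data values."""
    mean = statistics.fmean(values)
    variance = statistics.pvariance(values)   # assumption: population variance
    std = statistics.pstdev(values)
    # Coefficient of variation is undefined for a zero mean.
    cv = std / mean if mean else float("nan")
    return {"mean": mean, "variance": variance, "std": std, "cv": cv}


def string_features(values):
    """Character-level statistics for a column of alphanumeric data values."""
    total_chars = sum(len(v) for v in values)
    digits = sum(c.isdigit() for v in values for c in v)
    spaces = sum(c == " " for v in values for c in v)
    return {
        "avg_length": total_chars / len(values),
        "numeric_ratio": digits / total_chars if total_chars else 0.0,
        "space_ratio": spaces / total_chars if total_chars else 0.0,
    }


num = numeric_features([2, 4, 4, 4, 5, 5, 7, 9])
txt = string_features(["ab 12", "c3"])
```

Feature vectors like these, computed per attribute, are what a decision-tree learner such as C4.5 would consume; the training step is not shown.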
Expected Data Values • Target Schema T • Data frame • Source Schema S • Data instances • Hit Ratio for (A, B) = N’/N • N’: number of B data instances consistent with the specifications of the A data frame; • N: number of B data instances.
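The hit ratio N’/N can be sketched as below, assuming the data frame for target attribute A is reduced to a single regular expression. The pattern `YEAR_FRAME` and the sample values are hypothetical illustrations, not taken from the slides.

```python
import re


def hit_ratio(frame_pattern, instances):
    """Fraction of source data instances that fully match the target pattern."""
    if not instances:
        return 0.0
    regex = re.compile(frame_pattern)
    hits = sum(1 for value in instances if regex.fullmatch(value.strip()))
    return hits / len(instances)


# Hypothetical data frame for a target attribute "Year": four-digit years.
YEAR_FRAME = r"(19|20)\d{2}"
# Hypothetical data instances for a source attribute B.
source_values = ["1998", "2001", "Accord", "1995"]

ratio = hit_ratio(YEAR_FRAME, source_values)  # N' = 3, N = 4
```

A high ratio suggests that B's values look like values expected for A; how the ratio is turned into a confidence is not specified on the slide.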
Combined Confidences • Threshold: 0.5 • Figure: matrix of 0/1 entries.
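One way to picture the combination step is the sketch below. The slides do not say how facet confidences are combined, so the unweighted average used here is purely an assumption, as is treating a pair as matched when the combined confidence is strictly above the 0.5 threshold. The three confidence matrices are made-up illustrations.

```python
def combine(facets, threshold=0.5):
    """Average per-facet confidence matrices and threshold the result.

    facets: list of equally sized matrices; rows are target attributes,
    columns are source attributes.
    """
    rows, cols = len(facets[0]), len(facets[0][0])
    combined = [
        [sum(f[i][j] for f in facets) / len(facets) for j in range(cols)]
        for i in range(rows)
    ]
    # Assumption: a pair counts as matched when strictly above the threshold.
    matches = [[1 if c > threshold else 0 for c in row] for row in combined]
    return combined, matches


# Hypothetical confidences from three facets for two target and two source attributes.
terminological = [[0.9, 0.1], [0.2, 0.8]]
data_values = [[0.7, 0.3], [0.4, 0.6]]
expected_values = [[1.0, 0.0], [0.0, 0.9]]

combined, matches = combine([terminological, data_values, expected_values])
```

With these inputs the thresholded matrix has ones only on the diagonal, which is the kind of 0/1 pattern the slide displays.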
Experimental Results • Matched Attributes • 100% (32 of 32); • Unmatched Attributes • 99.5% (374 of 376); • “Feature” --- “Color”; • “Feature” --- “Body Type”. • Chart labels: F1 93.75%, F2 84%, F3 92%; F1 98.9%, F2 97.9%, F3 98.4%.
Future Work • Additional facets of metadata • More sophisticated combinations • Additional application domains • Automating feature selection