660 likes | 683 Views
Automating Schema Matching for Data Integration. David W. Embley Brigham Young University. Funded by NSF. Leverage this …. … to do this. Information Exchange. Source. Target. Information Extraction. Schema Matching. Presentation Outline. Information Extraction
E N D
Automating Schema Matchingfor Data Integration David W. Embley Brigham Young University Funded by NSF
Leverage this … … to do this Information Exchange Source Target Information Extraction Schema Matching
Presentation Outline • Information Extraction • Schema Matching for HTML Table • Direct Schema Matching • Indirect Schema Matching • Conclusions and Future Work
Year Price 1..* 1..* 1..* has has Make 1..* Mileage 0..1 0..1 0..1 0..1 has has Car 0..1 0..1 0..* is for PhoneNr has has 1..* Model 0..1 1..* 1..* has Feature 1..* Extension A Conceptual Modeling Solution
Car-Ads Ontology Car [->object]; Car [0..1] has Year [1..*]; Car [0..1] has Make [1..*]; Car [0...1] has Model [1..*]; Car [0..1] has Mileage [1..*]; Car [0..*] has Feature [1..*]; Car [0..1] has Price [1..*]; PhoneNr [1..*] is for Car [0..*]; PhoneNr [0..1] has Extension [1..*]; Year matches [4] constant {extract “\d{2}”; context "([^\$\d]|^)[4-9]\d,[^\d]"; substitute "^" -> "19"; }, … End;
Car Feature 0001 Auto 0001 AC 0002 Black 0002 4 door 0002 tinted windows 0002 Auto 0002 pb 0002 ps 0002 cruise 0002 am/fm 0002 cassette stero 0002 a/c 0003 Auto 0003 jade green 0003 gold Car Year Make Model Mileage Price PhoneNr 0001 1989 Subaru SW $1900 (363)835-8597 0002 1998 Elandra (336)526-5444 0003 1994 HONDA ACCORD EX 100K (336)526-1081 Recognition and Extraction
Table-Schema Matching(Basic Idea) • Many tables on the Web • Ontology-Based Extraction: • Works well for unstructured or semistructured data • What about structured data – tables? • Method: • Form attribute-value pairs • Do extraction • Infer mappings from extraction patterns
? Problem: Different Schemas Target Database Schema {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature} Different Source Table Schemas • {Run #, Yr, Make, Model, Tran, Color, Dr} • {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD} • {Vehicle, Distance, Price, Mileage} • {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}
? ? Problem: Attribute-Value is Value
Table extending over several pages Single-Column Table (formated a list) Problem: Information Behind Links
Solution • Form attribute-value pairs (adjust if necessary) • Do extraction • Infer mappings from extraction patterns
ACURA ACURA Legend Unnest: (Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*Table Solution: Remove Internal Factoring Discover Nesting: Make, (Model, (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*)*
Auto Air Cond. AM/FM CD Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM CDTable AutoAir CondAM/FM Yes, Yes, Yes, Yes, Solution: Replace Boolean Values ACURA ACURA Legend
Auto Air Cond. AM/FM CD Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM Solution: Form Attribute-Value Pairs ACURA ACURA Legend <Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>
Auto Air Cond. AM/FM CD Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM Solution: Adjust Attribute-Value Pairs ACURA ACURA Legend <Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto>, <Air Cond>, <AM/FM>
Auto Air Cond. AM/FM CD Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM Solution: Do Extraction ACURA ACURA Legend <Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto>, <Air Cond>, <AM/FM>
Auto Air Cond. AM/FM CD Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM Make(Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*Table Each row is a car. YearTable Model(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*Table Note: Mappings produce sets for attributes. Joining to form records is trivial because we have OIDs for table rows (e.g. for each Car). Solution: Infer Mappings ACURA ACURA Legend {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Auto Air Cond. AM/FM CD Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM Model(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*Table Solution: Do Extraction ACURA ACURA Legend {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Auto Air Cond. AM/FM CD Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM Solution: Do Extraction ACURA ACURA Legend PriceTable {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Auto Air Cond. AM/FM CD Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM Solution: Do Extraction ACURA ACURA Legend ColourFeatureColourTable AutoFeatureAutoAutoTable Air Cond.FeatureAir Cond. Air Cond.Table AM/FMFeatureAM/FMAM/FMTable CDFeatureCDCDTable Yes, Yes, Yes, Yes, {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Experiment • Tables from 60 sites • 10 “training” tables • 50 test tables • 357 mappings (from all 60 sites) • 172 direct mappings (same attribute and meaning) • 185 indirect mappings (29 attribute synonyms, 5 “Yes/No” columns, 68 unions over columns for Feature, 19 factored values, and 89 columns of merged values that needed to be split)
Results • 10 “training” tables • 100% of the 57 mappings (no false mappings) • 94.6% of the values in linked pages (5.4% false declarations) • 50 test tables • 94.7% of the 300 mappings (no false mappings) • On the bases of sampling 3,000 values in linked pages, we obtained 97% recall and 86% precision • 16 missed mappings • 4 partial (not all unions included) • 6 non-U.S. car-ads (unrecognized makes and models) • 2 U.S. unrecognized makes and models • 3 prices (missing $ or found MSRP instead) • 1 mileage (mileages less than 1,000)
Attribute Matchingfor Populated Schemas • Central Idea: Exploit All Data & Metadata • Matching Possibilities (Facets) • Attribute Names • Data-Value Characteristics • Expected Data Values • Data-Dictionary Information • Structural Properties
Approach • Target Schema T • Source Schema S • Framework • Individual Facet Matching • Combining Facets • Best-First Match Iteration
Year Year Year Year Make Make Make Feature Make has has has has has 0:1 0:1 0:1 0:1 0:* 0:1 0:1 Car Cost Model Model Model Car Model has has 0:1 has 0:1 has Phone Mileage Miles Example Car Car Style 0:1 has 0:* 0:1 0:1 has has has Mileage Miles Cost Target Schema T Source Schema S
Individual Facet Matching • Attribute Names • Data-Value Characteristics • Expected Data Values
Attribute Names • Target and Source Attributes • T : A • S : B • WordNet • C4.5 Decision Tree: feature selection, trained on schemas in DB books • f0: same word • f1: synonym • f2: sum of distances to a common hypernym root • f3: number of different common hypernym roots • f4: sum of the number of senses of A and B
The number of different common hypernym roots of A and B The sum of the number of senses of A and B The sum of distances of A and B to a common hypernym WordNet Rule
Data-Value Characteristics • C4.5 Decision Tree • Features • Numeric data (Mean, variation, standard deviation, …) • Alphanumeric data (String length, numeric ratio, space ratio)
Expected Data Values • Target Schema T and Source Schema S • Regular expression recognizer for attribute A in T • Data instances for attribute B in S • Hit Ratio = N’/N for (A, B) match • N’ : number of B data instances recognized by the regular expressions of A • N: number of B data instances
1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Combined Measures Threshold: 0.5
F1 93.8% F2 84% F3 92% F1 98.9% F2 97.9% F3 98.4% F1: WordNet F2: Value Characteristics F3: Expected Values Experimental Results • This schema, plus 6 other schemas • 32 matched attributes • 376 unmatched attributes • Matched: 100% • Unmatched: 99.5% • “Feature” ---”Color” • “Feature” ---”Body Type”