650 likes | 821 Views
Automating Schema Matching. David W. Embley, Cui Tao, Li Xu Brigham Young University. Funded by NSF. Leverage this …. … to do this. Information Exchange. Source. Target. Information Extraction. Schema Matching. Presentation Outline. Information Extraction
E N D
AutomatingSchema Matching David W. Embley, Cui Tao, Li Xu Brigham Young University Funded by NSF
Leverage this … … to do this Information Exchange Source Target Information Extraction Schema Matching
Presentation Outline • Information Extraction • Schema Matching for Tables • Direct Schema Matching • Indirect Schema Matching • Conclusions and Future Work
Year Price 1..* 1..* 1..* has has Make 1..* Mileage 0..1 0..1 0..1 0..1 has has Car 0..1 0..1 0..* is for PhoneNr has has 1..* Model 0..1 1..* 1..* has Feature 1..* Extension A Conceptual-Modeling Solution
Car-Ads Ontology Car [->object]; Car [0..1] has Year [1..*]; Car [0..1] has Make [1..*]; Car [0...1] has Model [1..*]; Car [0..1] has Mileage [1..*]; Car [0..*] has Feature [1..*]; Car [0..1] has Price [1..*]; PhoneNr [1..*] is for Car [0..*]; PhoneNr [0..1] has Extension [1..*]; Year matches [4] constant {extract “\d{2}”; context "([^\$\d]|^)[4-9]\d[^\d]"; substitute "^" -> "19"; }, … … End;
Car Feature 0001 Auto 0001 AC 0002 Black 0002 4 door 0002 tinted windows 0002 Auto 0002 pb 0002 ps 0002 cruise 0002 am/fm 0002 cassette stereo 0002 a/c 0003 Auto 0003 jade green 0003 gold Car Year Make Model Mileage Price PhoneNr 0001 1989 Subaru SW $1900 (336)835-8597 0002 1998 Elantra (336)526-5444 0003 1994 HONDA ACCORD EX 100K (336)526-1081 Recognition and Extraction
Schema Matching for HTML Tables with Unknown Structure Cui Tao
Table-Schema Matching(Basic Idea) • Many Tables on the Web • Ontology-Based Extraction • Works well for unstructured or semistructured data • What about structured data – tables? • Method • Form attribute-value pairs • Do extraction • Infer mappings from extraction patterns
Problem: Different Schemas Target Database Schema {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature} Different Source Table Schemas • {Run #, Yr, Make, Model, Tran, Color, Dr} • {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD} • {Vehicle, Distance, Price, Mileage} • {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}
? ? Problem: Attribute-Value is Value
`` `` `` Problem: Implied Values
Table extending over several pages Single-Column Table (formatted as list) Problem: Information Behind Links
Solution • Form attribute-value pairs (adjust if necessary) • Do extraction • Infer mappings from extraction patterns
ACURA ACURA Legend Unnest: μ(Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*μ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*Table Solution: Remove Internal Factoring Discover Nesting: Make, (Model, (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*)*
Auto Air Cond. AM/FM CD Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM βCDTable βAutoβAir CondβAM/FM Yes, Yes, Yes, Yes, Solution: Replace Boolean Values ACURA ACURA Legend
Auto Air Cond. AM/FM CD Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM Solution: Form Attribute-Value Pairs ACURA ACURA Legend <Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>, <CD, >
Auto Air Cond. AM/FM CD Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM Solution: Adjust Attribute-Value Pairs ACURA ACURA Legend <Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto>, <Air Cond>, <AM/FM>
Auto Air Cond. AM/FM CD Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM Solution: Do Extraction ACURA ACURA Legend
Auto Air Cond. AM/FM CD Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM πMakeμ(Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*μ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*Table Each row is a car. πYearTable πModelμ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*Table Note: Mappings produce sets for attributes. Joining to form records is trivial because we have OIDs for table rows (e.g. for each Car). Solution: Infer Mappings ACURA ACURA Legend {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Auto Air Cond. AM/FM CD Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM πModelμ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*Table Solution: Do Extraction ACURA ACURA Legend {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Auto Air Cond. AM/FM CD Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM Solution: Do Extraction ACURA ACURA Legend πPriceTable {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Auto Air Cond. AM/FM CD Auto AM/FM Auto Air Cond. AM/FM CD AM/FM Air Cond. AM/FM Auto Air Cond. AM/FM Solution: Do Extraction ACURA ACURA Legend ρColour←Feature πColourTable U ρAuto←Feature πAuto βAutoTable UρAir Cond.←Feature πAir Cond. βAir Cond.Table UρAM/FM←Feature πAM/FM βAM/FMTable UρCD←FeatureπCDβCDTable Yes, Yes, Yes, Yes, {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Experiment • Tables from 60 sites • 10 “training” tables • 50 test tables • 357 mappings (from all 60 sites) • 172 direct mappings (same attribute and meaning) • 185 indirect mappings (29 attribute synonyms, 5 “Yes/No” columns, 68 unions over columns for Feature, 19 factored values, and 89 columns of merged values that needed to be split)
Results • 10 “training” tables • 100% of the 57 mappings (no false mappings) • 94.6% of the values in linked pages (5.4% false declarations) • 50 test tables • 94.7% of the 300 mappings (no false mappings) • On the bases of sampling 3,000 values in linked pages, we obtained 97% recall and 86% precision • 16 missed mappings • 4 partial (not all unions included) • 6 non-U.S. car-ads (unrecognized makes and models) • 2 U.S. unrecognized makes and models • 3 prices (missing $ or found MSRP instead) • 1 mileage (mileages less than 1,000)
Direct Schema Matching Li Xu
Attribute Matchingfor Populated Schemas • Central Idea: Exploit All Data & Metadata • Matching Possibilities (Facets) • Attribute Names • Data-Value Characteristics • Expected Data Values • Data-Dictionary Information • Structural Properties
Approach • Target Schema T • Source Schema S • Framework • Individual Facet Matching • Combining Facets • Best-First Match Iteration
Year Year Year Year Make Make Make Feature Make has has has has has 0:1 0:1 0:1 0:1 0:* 0:1 0:1 Car Cost Model Model Model Car Model has has 0:1 has 0:1 has Phone Mileage Miles Example Car Car Style 0:1 has 0:* 0:1 0:1 has has has Mileage Miles Cost Target Schema T Source Schema S
Individual Facet Matching • Attribute Names • Data-Value Characteristics • Expected Data Values
Attribute Names • Target and Source Attributes • T : A • S : B • WordNet • C4.5 Decision Tree: feature selection, trained on schemas in DB books • f0: same word • f1: synonym • f2: sum of distances to a common hypernym root • f3: number of different common hypernym roots • f4: sum of the number of senses of A and B
The number of different common hypernym roots of A and B The sum of the number of senses of A and B The sum of distances of A and B to a common hypernym WordNet Rule
Data-Value Characteristics • C4.5 Decision Tree • Features • Numeric data (Mean, variation, standard deviation, …) • Alphanumeric data (String length, numeric ratio, space ratio)
Expected Data Values • Target Schema T and Source Schema S • Regular expression recognizer for attribute A in T • Data instances for attribute B in S • Hit Ratio = N'/N for (A, B) match • N' : number of B data instances recognized by the regular expressions of A • N: number of B data instances
1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Combined Measures Threshold: 0.5
Experimental Results • This schema, plus 6 other schemas • 32 matched attributes • 376 unmatched attributes • Measures • Recall: 100% • Precision: 94% • F Measure: 97% • False Positives • “Feature” ---”Color” • “Feature” ---”Body Type”