This presentation shows how to discover direct and indirect matches between schema elements using a conceptual-modeling solution. It exploits attribute names, data-value characteristics, and expected data values to match populated schemas, and applies the technique to HTML tables for information exchange.
Discovering Direct and Indirect Matches for Schema Elements
Li Xu and David W. Embley
Brigham Young University
Funded by NSF
Leverage this … to do this
[Diagram: leverage Information Extraction over a Source to do Schema Matching against a Target, enabling source-to-target Information Exchange.]
Outline • Information Extraction • Direct Schema Matching • Indirect Schema Matching • Schema Matching for HTML Tables • Conclusions
Outline (current section: Information Extraction) • Information Extraction • Direct Schema Matching • Indirect Schema Matching • Schema Matching for HTML Tables • Conclusions
A Conceptual-Modeling Solution
[Diagram: the car-ads conceptual model. Car [0..1] has Year [1..*]; Car [0..1] has Make [1..*]; Car [0..1] has Model [1..*]; Car [0..1] has Mileage [1..*]; Car [0..*] has Feature [1..*]; Car [0..1] has Price [1..*]; PhoneNr [1..*] is for Car [0..*]; PhoneNr [0..1] has Extension [1..*].]
Car-Ads Ontology
Car [->object];
Car [0..1] has Year [1..*];
Car [0..1] has Make [1..*];
Car [0..1] has Model [1..*];
Car [0..1] has Mileage [1..*];
Car [0..*] has Feature [1..*];
Car [0..1] has Price [1..*];
PhoneNr [1..*] is for Car [0..*];
PhoneNr [0..1] has Extension [1..*];
Year matches [4] constant
{ extract "\d{2}";
  context "([^\$\d]|^)[4-9]\d[^\d]";
  substitute "^" -> "19"; },
… …
End;
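As a concrete illustration of how the Year data-frame rule behaves, here is a minimal Python sketch. The regex and the "19" substitution come from the rule above; the helper name and sample text are illustrative, and the ontology itself is written in the data-frame notation, not Python.

```python
import re

# The Year rule: a two-digit year 40-99 whose context rules out prices
# (no adjacent digits or '$'), with "19" substituted as a prefix.
YEAR_IN_CONTEXT = re.compile(r"([^$\d]|^)([4-9]\d)[^\d]")

def extract_years(text: str) -> list:
    """Apply the extract/context/substitute steps of the Year rule."""
    return ["19" + m.group(2) for m in YEAR_IN_CONTEXT.finditer(text)]

print(extract_years("'89 Subaru SW, runs great, $1900 obo"))  # ['1989']
```

Note how the context pattern keeps "19" inside "$1900" from being misread as a year: the two-digit candidate must not be adjacent to a digit or a dollar sign.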
Recognition and Extraction

Car | Year | Make | Model | Mileage | Price | PhoneNr
0001 | 1989 | Subaru | SW | | $1900 | (336)835-8597
0002 | 1998 | | Elantra | | | (336)526-5444
0003 | 1994 | HONDA | ACCORD EX | 100K | | (336)526-1081

Car | Feature
0001 | Auto
0001 | AC
0002 | Black
0002 | 4 door
0002 | tinted windows
0002 | Auto
0002 | pb
0002 | ps
0002 | cruise
0002 | am/fm
0002 | cassette stereo
0002 | a/c
0003 | Auto
0003 | jade green
0003 | gold
Outline (current section: Direct Schema Matching) • Information Extraction • Direct Schema Matching • Indirect Schema Matching • Schema Matching for HTML Tables • Conclusions
Attribute Matching for Populated Schemas • Central Idea: Exploit All Data & Metadata • Matching Possibilities (Facets) • Attribute Names • Data-Value Characteristics • Expected Data Values • Data-Dictionary Information • Structural Properties
Approach • Target Schema T • Source Schema S • Framework • Individual Facet Matching • Combining Facets • Best-First Match Iteration
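A sketch of what the best-first match iteration could look like: greedily commit the highest-confidence (target, source) attribute pair, then discard pairs that reuse either attribute. The greedy strategy and the example scores are assumptions; only the 0.5 decision threshold appears later in the slides.

```python
def best_first_matches(scores: dict) -> list:
    """Best-first iteration over combined (target, source) confidence scores."""
    pairs = sorted(scores, key=scores.get, reverse=True)
    used_t, used_s, matches = set(), set(), []
    for t, s in pairs:
        if t not in used_t and s not in used_s and scores[(t, s)] > 0.5:
            matches.append((t, s))   # commit the best remaining pair
            used_t.add(t)
            used_s.add(s)
    return matches

print(best_first_matches({("Mileage", "Miles"): 0.9,
                          ("Cost", "Miles"): 0.6,
                          ("Cost", "Cost"): 0.95}))
# [('Cost', 'Cost'), ('Mileage', 'Miles')]
```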
Example
[Diagram: a target schema T and a source schema S for car ads. Car object sets connect through has/is-for relationship sets with 0:1 and 0:* participation constraints to attributes such as Year, Make, Model, Mileage/Miles, Cost, Phone, Feature, and Style.]
Individual Facet Matching • Attribute Names • Data-Value Characteristics • Expected Data Values
Attribute Names • Target and Source Attributes • T : A • S : B • WordNet • C4.5 Decision Tree: feature selection, trained on schemas in DB books • f0: same word • f1: synonym • f2: sum of distances to a common hypernym root • f3: number of different common hypernym roots • f4: sum of the number of senses of A and B
WordNet Rule
[Decision tree over three of the features: the sum of distances of A and B to a common hypernym, the number of different common hypernym roots of A and B, and the sum of the number of senses of A and B.]
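A minimal sketch of how the name facet's features f0 through f4 could be computed with NLTK's WordNet interface. The paper trains a C4.5 decision tree over these features; using lowest_common_hypernyms and path distances to operationalize the "common hypernym root" features is an assumption here, not the paper's stated procedure.

```python
from itertools import product
from nltk.corpus import wordnet as wn  # requires the NLTK WordNet corpus

def name_facet_features(a: str, b: str) -> dict:
    """Features f0-f4 over attribute names A and B, per the slide."""
    syn_a, syn_b = wn.synsets(a), wn.synsets(b)
    roots, dist_sum = set(), 0
    for sa, sb in product(syn_a, syn_b):
        for root in sa.lowest_common_hypernyms(sb):
            roots.add(root)
            dist_sum += (sa.shortest_path_distance(root) or 0) + \
                        (sb.shortest_path_distance(root) or 0)
    return {
        "f0_same_word": int(a.lower() == b.lower()),
        "f1_synonym": int(any(sa == sb for sa, sb in product(syn_a, syn_b))),
        "f2_dist_to_common_hypernyms": dist_sum,
        "f3_num_common_hypernym_roots": len(roots),
        "f4_total_senses": len(syn_a) + len(syn_b),
    }

print(name_facet_features("Cost", "Price"))
```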
Data-Value Characteristics • C4.5 Decision Tree • Features • Numeric data (Mean, variation, standard deviation, …) • Alphanumeric data (String length, numeric ratio, space ratio)
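A sketch of the value-characteristics facet over one column of strings. The feature names follow the slide (string length, numeric ratio, space ratio; mean and standard deviation for numeric data), but these particular computations are illustrative assumptions.

```python
import statistics

def value_characteristics(values: list) -> dict:
    """Numeric and alphanumeric features over one column's data values."""
    feats = {
        "avg_str_len": statistics.mean(len(v) for v in values),
        "numeric_ratio": statistics.mean(
            sum(c.isdigit() for c in v) / max(len(v), 1) for v in values),
        "space_ratio": statistics.mean(
            sum(c.isspace() for c in v) / max(len(v), 1) for v in values),
    }
    numbers = []
    for v in values:
        try:
            numbers.append(float(v.replace(",", "")))
        except ValueError:
            pass
    if numbers:  # numeric columns also get mean and standard deviation
        feats["mean"] = statistics.mean(numbers)
        feats["stdev"] = statistics.pstdev(numbers)
    return feats

print(value_characteristics(["1989", "1998", "1994"]))
```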
Expected Data Values • Target Schema T and Source Schema S • Regular expression recognizer for attribute A in T • Data instances for attribute B in S • Hit Ratio = N'/N for (A, B) match • N' : number of B data instances recognized by the regular expressions of A • N: number of B data instances
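The hit ratio itself is just N'/N. A sketch follows; the Mileage recognizer regex is an illustrative assumption, not the ontology's actual rule.

```python
import re

def hit_ratio(recognizer: re.Pattern, source_values: list) -> float:
    """Hit Ratio = N'/N: the fraction of source attribute B's data instances
    recognized by target attribute A's regular-expression recognizer."""
    if not source_values:
        return 0.0
    hits = sum(1 for v in source_values if recognizer.search(v))
    return hits / len(source_values)

# e.g., target attribute Mileage scored against a source column of values
mileage = re.compile(r"\b\d{1,3}[kK]\b|\b\d{2,3},?\d{3}\b")
print(hit_ratio(mileage, ["100K", "85,000", "52k", "Red"]))  # 0.75
```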
Combined Measures (threshold: 0.5)
[Matrix: 0/1 match decisions for target-source attribute pairs, obtained by thresholding the combined facet measures at 0.5.]
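One way the facets might be combined into the thresholded 0/1 decisions above. The weighted average and the equal weights are assumptions (the paper's combination method may differ); only the 0.5 threshold is from the slide.

```python
def combined_measure(facet_scores: dict, weights: dict) -> float:
    """Weighted average of the per-facet confidence scores."""
    total = sum(weights[f] for f in facet_scores)
    return sum(weights[f] * s for f, s in facet_scores.items()) / total

scores = {"names": 0.9, "value_chars": 0.4, "expected_values": 0.8}
weights = {"names": 1.0, "value_chars": 1.0, "expected_values": 1.0}
print(combined_measure(scores, weights) > 0.5)  # True: declare a match
```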
Experimental Results
• This schema, plus 6 other schemas
• 32 matched attributes, 376 unmatched attributes
• Matched: 100%; Unmatched: 99.5%
• The two false positives: "Feature"-"Color" and "Feature"-"Body Type"
Per-facet measures (F1: WordNet; F2: Value Characteristics; F3: Expected Values):
F1 93.8% | F2 84% | F3 92%
F1 98.9% | F2 97.9% | F3 98.4%
Outline (current section: Indirect Schema Matching) • Information Extraction • Direct Schema Matching • Indirect Schema Matching • Schema Matching for HTML Tables • Conclusions
Schema Matching
[Diagram: a target schema (Car with Year, Make, Model, Mileage, Cost, Phone, Feature) matched against a source schema (Car with Year, Make & Model, Color, Body Type, Style, Miles, Cost).]
Mapping Generation • Direct Matches as described earlier: • Attribute Names based on WordNet • Value Characteristics based on value lengths, averages, … • Expected Values based on regular-expression recognizers • Indirect Matches: • Direct matches • Structure Evaluation • Union • Selection • Decomposition • Composition
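A sketch of two of these indirect-match operators, with columns represented as Python lists. The column names, example values, and the make lexicon are illustrative assumptions drawn from the running example (the union/selection and decomposition slides follow).

```python
# Union: the target's Feature attribute gathers several source columns.
def union(*columns):
    return [v for col in columns for v in col if v]

feature = union(["White", "Black"], ["4 door", "SW"])  # Feature <- Color ∪ Body Type

# Decomposition: a merged source column such as "Make & Model" is split
# using a lexicon of known makes (lexicon contents assumed here).
KNOWN_MAKES = ("ACURA", "HONDA", "Subaru")

def decompose_make_model(value: str):
    for make in KNOWN_MAKES:
        if value.upper().startswith(make.upper() + " "):
            return make, value[len(make):].strip()
    return None, value

print(feature)                                 # ['White', 'Black', '4 door', 'SW']
print(decompose_make_model("HONDA Civic EX"))  # ('HONDA', 'Civic EX')
```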
Union and Selection
[Diagram: indirect matches over the same target and source schemas, formed by union (e.g., target Feature derived from source Color ∪ Body Type) and by selection of the qualifying values.]
Decomposition and Composition
[Diagram: indirect matches over the same schemas, formed by decomposition (the source's merged Make & Model is split into the target's Make and Model) and by composition (the inverse, merging attributes).]
Structure Example (taken from [MBR, VLDB'01])
[Diagram: target schema PO with POShipTo (City, Street), POBillTo (City, Street), and POLines (Count, Item with Line, Qty, UoM); source schema PurchaseOrder with DeliverTo and InvoiceTo (each an Address with City, Street) and Items (ItemCount, Item with ItemNumber, Quantity, UnitOfMeasure).]
Structure (Nonlexical Matches)
[Diagram: nonlexical object sets matched: PO ↔ PurchaseOrder, POShipTo ↔ DeliverTo, POBillTo ↔ InvoiceTo, POLines ↔ Items, Item ↔ Item.]
Structure (Join over FD Relationship Sets, …)
[Diagram: the source's Address object sets are eliminated by joining over functional relationship sets, attaching City and Street directly to DeliverTo and InvoiceTo; Count attaches directly to Items.]
Structure (Lexical Matches)
[Diagram: lexical matches: City ↔ City, Street ↔ Street, Count ↔ ItemCount, Line ↔ ItemNumber, Qty ↔ Quantity, UoM ↔ UnitOfMeasure.]
Experimental Results
• Indirect matches: 94% (precision, recall, F-measure)
• Data borrowed from the Univ. of Washington [DDH, SIGMOD'01]
• Rough comparison with U of W results (direct matches only):
• Course Schedule – accuracy: ~71%
• Faculty Members – accuracy: ~92%
• Real Estate (2 tests) – accuracy: ~75%
Outline (current section: Schema Matching for HTML Tables) • Information Extraction • Direct Schema Matching • Indirect Schema Matching • Schema Matching for HTML Tables • Conclusions
Problem: Different Schemas
Target database schema:
{Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Different source table schemas:
• {Run #, Yr, Make, Model, Tran, Color, Dr}
• {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
• {Vehicle, Distance, Price, Mileage}
• {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}
Solution • Form attribute-value pairs • Do extraction • Infer mappings from extraction patterns
Solution: Remove Internal Factoring
Discover nesting: Make, (Model, (Year, Colour, Price, Auto, Air Cond, AM/FM, CD)*)*
Unnest: μ(Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* μ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table
[Table snippet: a factored Make value (e.g., ACURA) spans several Model rows (e.g., Legend).]
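A sketch of the unnesting step. The paper's μ operator works on the discovered nesting; this fill-down over blank Make cells in flat rows is a simplified stand-in, and the row values are illustrative.

```python
def unnest(rows: list, column: str) -> list:
    """Undo internal factoring: a blank cell in `column` inherits the
    factored value from the row above (e.g., a Make spanning several Models)."""
    current, out = None, []
    for row in rows:
        current = row.get(column) or current
        out.append({**row, column: current})
    return out

table = [
    {"Make": "ACURA", "Model": "Legend",  "Year": "1995"},
    {"Make": "",      "Model": "Integra", "Year": "1996"},  # factored under ACURA
]
print(unnest(table, "Make"))
```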
Solution: Replace Boolean Values
βAuto βAir Cond. βAM/FM βCD Table
[Table snippet: affirmative cells ("Yes") under Auto, Air Cond., AM/FM, and CD are replaced by the column name itself.]
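A sketch of the β step. The set of tokens treated as affirmative is an assumption.

```python
def beta(rows: list, column: str) -> list:
    """β: replace an affirmative Boolean cell under `column` with the column
    name itself, so it can later be unioned into the target's Feature."""
    yes = {"yes", "y", "x", "true"}
    return [{**r, column: (column if str(r.get(column, "")).strip().lower() in yes
                           else "")} for r in rows]

rows = [{"Model": "Civic EX", "Auto": "Yes"}, {"Model": "Legend", "Auto": ""}]
print(beta(rows, "Auto"))
# [{'Model': 'Civic EX', 'Auto': 'Auto'}, {'Model': 'Legend', 'Auto': ''}]
```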
Solution: Form Attribute-Value Pairs
<Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>, <CD, >
[Table snippet: each cell is paired with its column label.]
Solution: Do Extraction
[Table snippet: the extraction ontology recognizes the values in the attribute-value pairs.]
Solution: Infer Mappings
Target: {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Each row is a car.
πMake μ(Model, Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* μ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table
πYear Table
πModel μ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table
Note: mappings produce sets for attributes. Joining to form records is trivial because we have OIDs for table rows (e.g., for each Car).
Solution: Do Extraction
πModel μ(Year, Colour, Price, Auto, Air Cond, AM/FM, CD)* Table
Target: {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Solution: Do Extraction
πPrice Table
Target: {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
Solution: Do Extraction
ρColour←Feature πColour Table
∪ ρAuto←Feature πAuto βAuto Table
∪ ρAir Cond.←Feature πAir Cond. βAir Cond. Table
∪ ρAM/FM←Feature πAM/FM βAM/FM Table
∪ ρCD←Feature πCD βCD Table
Target: {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {PhoneNr, Extension}, {Car, Feature}
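Putting the renames (ρ), projections (π), Boolean replacements (β), and unions together, a sketch of how the Feature mapping above could be evaluated over extracted rows. Row indices stand in for Car OIDs; the column names follow the example table, and everything else is illustrative.

```python
def feature_mapping(rows: list) -> list:
    """Assemble Car-Feature pairs: Colour values unioned with each Boolean
    column whose cell is affirmative, each renamed to the target's Feature."""
    bool_columns = ("Auto", "Air Cond.", "AM/FM", "CD")
    facts = []
    for oid, row in enumerate(rows):            # oid plays the Car OID role
        if row.get("Colour"):
            facts.append((oid, row["Colour"]))  # ρ: Colour -> Feature
        facts += [(oid, c) for c in bool_columns
                  if str(row.get(c, "")).strip().lower() == "yes"]  # β then ρ
    return facts

rows = [{"Colour": "White", "Auto": "Yes", "Air Cond.": "Yes", "CD": ""}]
print(feature_mapping(rows))  # [(0, 'White'), (0, 'Auto'), (0, 'Air Cond.')]
```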
Experiment • Tables from 60 sites • 10 “training” tables • 50 test tables • 357 mappings (from all 60 sites) • 172 direct mappings (same attribute and meaning) • 185 indirect mappings (29 attribute synonyms, 5 “Yes/No” columns, 68 unions over columns for Feature, 19 factored values, and 89 columns of merged values that needed to be split)
Results • 10 “training” tables • 100% of the 57 mappings • No false mappings • 50 test tables • 94.7% of the 300 mappings • No false mappings • 16 missed mappings • 4 partial (not all unions included) • 6 non-U.S. car-ads (unrecognized makes and models) • 2 U.S. unrecognized makes and models • 3 prices (missing $ or found MSRP instead) • 1 mileage (mileages less than 1,000)
Conclusions • Direct Attribute Matching • Matched 32 of 32: 100% Recall • 2 False Positives: 94% Precision • Direct and Indirect Attribute Matching • Matched 494 of 513: 96% Recall • 22 False Positives: 96% Precision • Table Mappings • Matched 284 of 300: 94.7% Recall • No False Positives: 100% Precision www.deg.byu.edu