Explore how to integrate data from diverse tables on the web, detect tables of interest, extract attribute-value pairs, infer mappings, and address challenges like factored, split, or merged values. The solution involves table detection, pair formation, extraction, and mapping inference. Experimental results show high precision and recall rates. This work contributes to automating information extraction from HTML tables and offers a unique solution to schema matching problems.
Schema Matching and Data Extraction over HTML Tables • Cui Tao • Data Extraction Research Group, Department of Computer Science, Brigham Young University • Supported by NSF
Introduction • Many tables on the Web • How to integrate data stored in different tables? • Detect the table of interest • Form attribute-value pairs (adjust if necessary) • Do extraction • Infer mappings from extraction patterns
Problem Detecting the Table of Interest
Problem Different schemas • Different source table schemas • {Run #, Yr, Make, Model, Tran, Color, Dr} • {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD} • {Vehicle, Distance, Price, Mileage} • {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy} • Target database schema {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
Problem Attribute-Value is Value
Problem Information Behind Links • Table extending over several pages • Single-column table (formatted as list)
Solution • Detect the table of interest • Form attribute-value pairs (adjust if necessary) • Do extraction • Infer mappings from extraction patterns
Solution Detect the Table of Interest • ‘Real’ table test • Same number of values • Table size • Attribute test • Density measure test: density = (# of ontology-extracted values) / (total # of values in the table)
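The density measure can be sketched as below. This is only an illustration: the two regular-expression recognizers stand in for the extraction ontology, and the names `density_measure`, `RECOGNIZERS`, `YEAR`, and `PRICE` are assumptions, not part of the original work.

```python
import re

# Hypothetical stand-ins for ontology recognizers (assumption: the real
# ontology has many more, and richer, recognizers than these two).
YEAR = re.compile(r"^(19|20)\d{2}$")
PRICE = re.compile(r"^\$\d[\d,]*$")
RECOGNIZERS = [
    lambda v: bool(YEAR.match(v)),
    lambda v: bool(PRICE.match(v)),
]


def density_measure(table_values, recognizers):
    """Return (# of recognized values) / (total # of values in the table)."""
    if not table_values:
        return 0.0
    recognized = sum(
        1 for value in table_values if any(r(value) for r in recognizers)
    )
    return recognized / len(table_values)


# Three of the four values are recognized, so the density is 0.75.
print(density_measure(["1995", "$6300", "White", "2001"], RECOGNIZERS))
```

A table whose density falls below some cutoff would presumably be rejected as not being the table of interest; the slides do not state the cutoff, so none is assumed here.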
Solution Remove Factoring • [Figure: Year column of the example table, one value per row: 2001 (3 rows), 2000 (6 rows), 1999 (2 rows)]
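Removing factoring amounts to copying a factored value down to every row it covers. A minimal sketch, assuming each row is a dictionary and that a blank cell means "same as the row above"; the function name `remove_factoring` and the sample rows are hypothetical.

```python
def remove_factoring(rows, column):
    """Fill blank cells in `column` with the most recent non-blank value above."""
    filled = []
    last_value = None
    for row in rows:
        row = dict(row)  # copy so the caller's rows are not modified
        value = row.get(column)
        if value is None or str(value).strip() == "":
            row[column] = last_value  # inherit the factored value
        else:
            last_value = value
        filled.append(row)
    return filled


# Illustrative rows (made up): the year appears once per group of cars.
rows = [
    {"Year": "2001", "Make": "Honda"},
    {"Year": "", "Make": "Honda"},
    {"Year": "2000", "Make": "Honda"},
    {"Year": "", "Make": "Honda"},
]
print([r["Year"] for r in remove_factoring(rows, "Year")])
# ['2001', '2001', '2000', '2000']
```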
Solution Form Attribute-Value Pairs • <Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>
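Forming the pairs is essentially zipping the column headers with the cells of one row. A minimal sketch, assuming the table has already been parsed into a header list and row lists; `form_pairs` is a hypothetical name.

```python
def form_pairs(attributes, row):
    """Pair each column attribute with the cell value in the same position."""
    if len(attributes) != len(row):
        raise ValueError("row and header must have the same number of cells")
    return list(zip(attributes, row))


attributes = ["Make", "Model", "Year", "Colour", "Price"]
row = ["Honda", "Civic EX", "1995", "White", "$6300"]
print(form_pairs(attributes, row))
# [('Make', 'Honda'), ('Model', 'Civic EX'), ('Year', '1995'),
#  ('Colour', 'White'), ('Price', '$6300')]
```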
Solution Adjust Attribute-Value Pairs • <Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>
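One plausible reading of the adjustment step is that, for columns where the attribute itself is the value (Auto, Air Cond., AM/FM), an affirmative cell is rewritten to the attribute name, giving pairs such as <Auto, Auto>. The sketch below assumes exactly that; the set of affirmative markers, the dropping of negative cells, and the name `adjust_pairs` are assumptions, not taken from the slides.

```python
# Assumed markers for "feature present"; real tables may use others.
AFFIRMATIVE = {"yes", "y", "x", "true"}


def adjust_pairs(pairs, flag_attributes):
    """Rewrite <attribute, marker> to <attribute, attribute> for flag columns."""
    adjusted = []
    for attribute, value in pairs:
        if attribute in flag_attributes:
            if value.strip().lower() in AFFIRMATIVE:
                adjusted.append((attribute, attribute))
            # Assumption: a non-affirmative cell means the feature is
            # absent, so no pair is produced for it.
        else:
            adjusted.append((attribute, value))
    return adjusted


pairs = [("Make", "Honda"), ("Auto", "Yes"), ("CD", "No")]
print(adjust_pairs(pairs, {"Auto", "CD"}))
# [('Make', 'Honda'), ('Auto', 'Auto')]
```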
Solution Add Information Hidden Behind Links • Unstructured and semi-structured: concatenate • Single attribute-value pairs: pair them together: <Price, $7,988>, <Mileage, 63,168 miles>, <Body Type, Car>, <Body Style, 4 DR Sedan>, <Transmission, Automatic>, <Engine, 3.0 L V-6>, <Doors, 4>, <Fuel Type, Gas>, <Stock Number, 22764>, <VIN, 1FAFP52U2WA139879> • List: mark the beginning and the end
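The three cases for linked pages can be sketched as a single dispatch. This assumes the linked page has already been classified as free text, attribute-value pairs, or a list; the `kind` labels, the `<list>` / `</list>` markers, and the name `add_linked_info` are all hypothetical.

```python
def add_linked_info(row_text, linked):
    """Append linked-page content to the text of the row that links to it."""
    kind, content = linked
    if kind == "pairs":
        # Single attribute-value pairs: pair them together.
        extra = ", ".join(f"<{attr}, {value}>" for attr, value in content)
    elif kind == "list":
        # List: mark the beginning and the end (marker syntax is assumed).
        extra = "<list> " + "; ".join(content) + " </list>"
    else:
        # Unstructured or semi-structured text: concatenate as-is.
        extra = content
    return f"{row_text} {extra}"


print(add_linked_info("<Make, Honda>", ("pairs", [("Price", "$7,988"), ("Doors", "4")])))
# <Make, Honda> <Price, $7,988>, <Doors, 4>
```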
Solution Inferred Mapping Creation • {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
Solution Inferred Mapping Creation • Each row is a car. • {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
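Inferring mappings from extraction patterns can be illustrated by recording, for every extracted value, which source column it came from and which target attribute recognized it, then mapping each source column to its most frequent target. The majority vote and the name `infer_mappings` are assumptions made for this sketch; the slides do not say how ties or conflicts are resolved.

```python
from collections import Counter, defaultdict


def infer_mappings(extractions):
    """Map each source attribute to the target attribute it most often fed.

    `extractions` is an iterable of (source_attribute, target_attribute)
    pairs, one per value the ontology extracted from the table.
    """
    votes = defaultdict(Counter)
    for source_attr, target_attr in extractions:
        votes[source_attr][target_attr] += 1
    return {
        source_attr: counts.most_common(1)[0][0]
        for source_attr, counts in votes.items()
    }


# Illustrative extraction log (made up) for two source columns.
log = [("Yr", "Year"), ("Yr", "Year"), ("Yr", "Mileage"), ("Colour", "Feature")]
print(infer_mappings(log))
# {'Yr': 'Year', 'Colour': 'Feature'}
```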
Experimental Results • Car Advertisement application domain • 10 “training” tables • 100% of the 57 mappings (no false mappings) • 94.6% precision of the values in linked pages (5.4% false declarations) • 50 test tables • 94.7% of the 300 mappings (no false mappings) • On the basis of sampling 3,000 values in linked pages, we obtained 97% recall and 86% precision
Other Applications • Cell Phone Plan Application domain • Soccer Player Application domain
Contribution • Provides an approach to extract information automatically from HTML tables • Suggests a different way to solve the problem of schema matching