320 likes | 335 Views
Learn how to automatically extract structured data from HTML tables, solve schema matching issues, infer mappings, and improve accuracy. Includes solutions and experimental results.
E N D
Schema Matching and Data Extraction over HTML Tables Cui Tao Data Extraction Research Group Department of Computer Science Brigham Young University supported by NSF
Introduction • Many tables on the Web • Ontology-based extraction: • Works well for unstructured or semi-structured data • What about structured data – tables? • How to integrate data stored in different tables? • Detect the table of interest • Form attribute-value pairs (adjust if necessary) • Do extraction • Infer mappings from extraction patterns
? ProblemDetecting The Table of Interest
Problem Different schemas • Different source table schemas • {Run #, Yr, Make, Model, Tran, Color, Dr} • {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD} • {Vehicle, Distance, Price, Mileage} • {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy} • Target database schema {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
? ? Problem Attribute-Value is Value
Table extending over several pages List ProblemInformation Behind Links
Solution • Detect the table of interest • Form attribute-value pairs (adjust if necessary) • Do extraction • Infer mappings from extraction patterns
SolutionDetect The Table of Interest • Top-level tables • Table size: at least 3 rows and columns • Grid layout: same # of values • Attributes • Value density: # of ontology extracted values total # of values in the table
SolutionDetect The Table of Interest • Linked-page tables • Table size: at least 2 rows and columns • Attributes • Attribute-value-pair pattern • Page-spanning tables
2001 2001 2001 2000 2000 2000 2000 2000 2000 1999 1999 Solution Remove Factoring
SolutionForm Attribute-Value Pairs <Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>
SolutionAdjust Attribute-Value Pairs <Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>
Unstructured and semi-structured: concatenate < Single attribute value pairs: Pair them together <Price, $7,988>, <Mileage, 63,168 miles>, <Body Type, Car>, <Body Style, 4 DR Sedan>, <Transmission, Automatic>, <Engine, 3.0 L V-6>, <Doors, 4>, <Fuel Type, Gas>, <Stock Number, 22764>, <VIN, 1FAFP52U2WA139879> List: Mark the beginning and the end > SolutionAdd Information Hidden Behind Links
SolutionInferred Mapping Creation {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
Each row is a car. SolutionInferred Mapping Creation {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
SolutionInferred Mapping Creation {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
SolutionInferred Mapping Creation {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
SolutionInferred Mapping Creation {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
SolutionInferred Mapping Creation {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
SolutionInferred Mapping Creation {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
SolutionInferred Mapping Creation {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
Precision:100% Recall:87% 15 46 Testing Set 53 87%(46) 2 28 12 13 100%(7) Training Set 7 100%(7) Top Table Location Structured Linked Page Location Linked Pages Experimental Results − Table Location Car advertisement application domain Precision:86% Recall: 92%
Experimental Results − Mapping Car advertisement application domain • 46 recognized tables in the testing set • Total 319 mappings • Precision: 95.8% Recall: 92.8% • Top-level tables: 77% of the 296 correct mappings • Linked tables: 19.6% • Both: 3.4%
Precision:100% Recall:92% 11 Testing Set 12 92%(11) 3 100%(5) 100%(5) Training Set 5 Top Table Location Linked Pages Experimental Results − Table Location Cell-phone sales application domain
Experimental Results − Mapping Cell-phone sales application domain • 11 recognized tables in the testing Set • Total 97 mappings • Precision: 90.1% Recall: 85.4% • Top-level tables: 85.4% of the 88 correct mappings • Linked tables: 50.5% • Both: 35.9%
Contribution • Provides an approach to extract information automatically from HTML tables • Suggests a different way to solve the problem of schema matching