Scheme Matching and Data Extraction over HTML Tables
Cui Tao
June 2002
Supported by NSF
Introduction
• Many tables on the Web
• Ontology-based extraction:
  • Works for unstructured or semi-structured data
  • Does not work well for structured data such as tables
• Only tables that present information are considered, not tables used for layout
Problems: Different Schemas
• Different source table schemas
  • {Run #, Yr, Make, Model, Tran, Color, Dr}
  • {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD}
  • {Vehicle, Distance, Price, Mileage}
  • {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy}
• Target database schema
  • {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
Problems: Attribute-Value Pairs?
Problems: Attribute-Value Switch
Problems: Attribute/Value Combinations
[Example table, column headers: Year/sty Cyl. # Dr Tran Color]
Problems: Attribute/Value Split
[Example table, column header: Model]
Problems
• Information in linked pages
  • Tables
  • Lists
  • Unstructured data
  • …
• Header information
Thesis Statement
[Diagram labels: HTML table with unknown structure; Extraction Ontology; Mapping Rules; Extracted Data]
Methods
• Understand Table
  • Recognize Attributes and Values
  • Form Attribute-Value Pairs
  • Adjust Attribute-Value Pairs
  • Add Information Hidden Behind Links
• Infer Mapping
• Extract Data
Understand Table
• Recognize the table and its elements
  • <TABLE>, </TABLE>
  • <TR>: row; <TD>: data entry; <TH>: header
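Recognizing the table elements listed above can be sketched with Python's standard `html.parser` module. This is only an illustration, not the implementation behind these slides: the class name `TableParser`, the helper `parse_table`, and the sample markup are assumptions made for the example.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect the cells of every <tr> as a list of rows (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.rows = []     # each row is a list of (tag, text) tuples
        self._row = None   # row currently being filled, or None
        self._cell = None  # [tag, text pieces] of the open cell, or None

    def handle_starttag(self, tag, attrs):
        # HTMLParser reports tag names in lower case.
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = [tag, []]

    def handle_data(self, data):
        # Text outside a cell (for example whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell[1].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            text = " ".join("".join(self._cell[1]).split())
            self._row.append((self._cell[0], text))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_table(html):
    """Return the rows found in `html`; <th> cells mark header entries."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


# Hypothetical sample markup, using values that appear later in the slides.
SAMPLE = """
<table>
  <tr><th>Make</th><th>Model</th><th>Year</th></tr>
  <tr><td>Honda</td><td>Civic EX</td><td>1995</td></tr>
</table>
"""
rows = parse_table(SAMPLE)
```

Keeping the tag name next to each cell lets a later step tell `<TH>` header entries from `<TD>` data entries.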
Methods
• Form Attribute-Value Pairs
  • Regular table
  • Table with factors
• Nrcom = most common number of columns in the table
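For a regular table, the Nrcom idea can be sketched as follows. The sketch assumes, beyond what the slide states, that rows whose column count differs from Nrcom (banners, spacer rows) are skipped and that the first remaining row holds the attribute names; the function names and the banner row are hypothetical.

```python
from collections import Counter


def most_common_column_count(rows):
    """Nrcom: the column count shared by the largest number of rows."""
    counts = Counter(len(row) for row in rows)
    return counts.most_common(1)[0][0]


def form_pairs(rows):
    """Pair each value with the attribute in the same column.

    Assumption: rows with a column count other than Nrcom are not data,
    and the first row that has Nrcom columns is the header row.
    """
    nr_com = most_common_column_count(rows)
    regular = [row for row in rows if len(row) == nr_com]
    header, *records = regular
    return [list(zip(header, record)) for record in records]


table = [
    ["Used cars"],  # hypothetical one-cell banner row
    ["Make", "Model", "Year", "Colour", "Price"],
    ["Honda", "Civic EX", "1995", "White", "$6300"],
]
pairs = form_pairs(table)
```

Tables with factors (values shared across several rows) would need extra handling that this sketch does not attempt.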
Methods
• Form Attribute-Value Pairs
  • Regular table
  • Table with factors
  • Table has Boolean values: replace the Boolean values
• Resulting attribute-value pairs:
  <Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>
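The Boolean replacement can be sketched like this. The slides show only the result for true values (the attribute name replaces the value, as in <Auto, Auto>); dropping pairs whose value is false, and the particular spellings accepted as true or false, are assumptions of this sketch.

```python
# Assumed spellings of Boolean cells; a real table may use others.
TRUE_MARKS = {"yes", "y", "x", "true"}
FALSE_MARKS = {"no", "n", "false"}


def replace_boolean_values(pairs):
    """Replace a true value by its attribute name; drop false values."""
    result = []
    for attribute, value in pairs:
        mark = value.strip().lower()
        if mark in TRUE_MARKS:
            result.append((attribute, attribute))
        elif mark in FALSE_MARKS:
            continue  # assumption: a feature the car lacks yields no pair
        else:
            result.append((attribute, value))
    return result


row = [("Make", "Honda"), ("Auto", "Yes"), ("Air Cond.", "Yes"), ("CD", "No")]
replaced = replace_boolean_values(row)
```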
Methods
• Adjust Attribute-Value Pairs
  • Before: <Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>
  • After: <Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto>, <Air Cond.>, <AM/FM>
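One way to read the adjustment shown here is that a pair whose value merely repeats its attribute collapses to a single element. A minimal sketch under that reading (the function name is hypothetical):

```python
def adjust_pairs(pairs):
    """Collapse <A, A> to <A>; leave every other pair unchanged."""
    return [(attr,) if attr == value else (attr, value) for attr, value in pairs]


before = [("Price", "$6300"), ("Auto", "Auto"), ("AM/FM", "AM/FM")]
after = adjust_pairs(before)
```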
Methods
• Add Information Hidden Behind Links
  • Unstructured and semi-structured: concatenate
  • Table: attribute-value pairs
<Manufacturer, Honda>, <Model, Civic EX>, <Door, 4>, <Year, 1995>, <Color, White>, <Engine, 2.0L 4 Cylinders>, <Transmission, Auto>, <Mileage, 82,628>, <Price, $6300>
Methods
• Add Information Hidden Behind Links
  • Unstructured and semi-structured: concatenate
  • Table: attribute-value pairs
  • List: <Features, AIR CONDITIONING, CD, AM/FM, CLOTH UPHOLSTERY, CONSOLE, CRUISE CONTROL, DUAL AIR BAGS, INSIDE HOOD RELEASE, POWER DOOR LOCKS, POWER STEERING, POWER SUNROOF, POWER WINDOWS, RADIAL TIRES, REAR DEFROSTER, REAR SPOILER, RECLINING SEATS>
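The three cases for linked pages can be sketched as a single dispatch. The record layout (a dict with `pairs` and `text`), the `kind` labels, and the function name are assumptions for illustration; how a linked page is classified as text, table, or list is outside this sketch.

```python
def add_linked_information(record, kind, content):
    """Return a copy of `record` extended with content from a linked page.

    record  -- {"pairs": [tuple, ...], "text": str}  (hypothetical layout)
    kind    -- "text", "table", or "list"
    content -- a string, a list of pairs, or (heading, items) respectively
    """
    pairs = list(record["pairs"])
    text = record["text"]
    if kind == "text":
        # Unstructured or semi-structured page: concatenate its text.
        text = " ".join((text + " " + content).split())
    elif kind == "table":
        # Linked table: its attribute-value pairs join the record.
        pairs.extend(content)
    elif kind == "list":
        # Linked list: one tuple holding the heading and all items.
        heading, items = content
        pairs.append((heading, *items))
    else:
        raise ValueError(f"unknown linked-page kind: {kind!r}")
    return {"pairs": pairs, "text": text}


car = {"pairs": [("Make", "Honda")], "text": ""}
car = add_linked_information(car, "table", [("Door", "4"), ("Mileage", "82,628")])
car = add_linked_information(car, "list", ("Features", ["CD", "AM/FM"]))
car = add_linked_information(car, "text", "Call  today")  # hypothetical text
```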
Methods
• Infer Mapping
  • Each row is a car.
  • Inferred mapping creation: {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
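The slides do not spell out how a mapping is inferred. As a rough sketch only, one could assume that value recognizers for target attributes are run over each source column and that a column maps to the target whose recognizer matches most of its values. The regular expressions, the threshold, and the function name below are all hypothetical stand-ins for what an extraction ontology would supply.

```python
import re

# Hypothetical recognizers for three target attributes.
RECOGNIZERS = {
    "Year": re.compile(r"^(19|20)\d{2}$"),
    "Price": re.compile(r"^\$\s?\d[\d,]*$"),
    "Mileage": re.compile(r"^\d{1,3},\d{3}$"),
}


def infer_mapping(columns, threshold=0.5):
    """Map each source attribute to the best-matching target attribute.

    columns -- {source attribute: list of values found in that column}
    A source attribute stays unmapped when no recognizer matches at
    least `threshold` of its values.
    """
    mapping = {}
    for source_attr, values in columns.items():
        best_target, best_share = None, 0.0
        for target_attr, pattern in RECOGNIZERS.items():
            hits = sum(1 for value in values if pattern.match(value.strip()))
            share = hits / len(values) if values else 0.0
            if share > best_share:
                best_target, best_share = target_attr, share
        if best_target is not None and best_share >= threshold:
            mapping[source_attr] = best_target
    return mapping


columns = {
    "Yr": ["1995"],
    "Colour": ["White"],
    "Price": ["$6300"],
    "Mileage": ["82,628"],
}
mapping = infer_mapping(columns)
```

In this run `Colour` stays unmapped because none of the three sample recognizers matches its values.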
Evaluation
• Measure percentage of correct mappings:
  • Correct mapping
  • Partially correct mapping
  • Incorrect mapping
• Measure precision and recall:
  • Data in the table
  • Data in linked pages
• Compare the results for extracted data before mapping and after mapping
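Precision and recall over extracted attribute-value pairs can be computed with the standard definitions: precision is the share of extracted pairs that are correct, and recall is the share of expected pairs that were extracted. The sample sets below are made up for illustration and are not results from this work.

```python
def precision_recall(extracted, expected):
    """Return (precision, recall) for two collections of extracted facts."""
    extracted, expected = set(extracted), set(expected)
    correct = len(extracted & expected)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


# Illustrative data only: one extracted value is wrong, one fact is missed.
extracted = {("Year", "1995"), ("Make", "Honda"), ("Price", "$6300"), ("Model", "Civic")}
expected = {("Year", "1995"), ("Make", "Honda"), ("Price", "$6300"),
            ("Model", "Civic EX"), ("Mileage", "82,628")}
precision, recall = precision_recall(extracted, expected)
```

Here three of four extracted pairs are correct (precision 0.75) and three of five expected pairs are found (recall 0.6).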
Contribution
• Provides an approach to extract information automatically from HTML tables
• Suggests a different way to solve the problem of schema matching