
Schema Matching and Data Extraction over HTML Tables

This study explores ontology-based extraction methods for structured data in HTML tables, identifying attribute-value pairs and overcoming schema mismatches by inferring mappings to extract data accurately.


Presentation Transcript


  1. Schema Matching and Data Extraction over HTML Tables • Cui Tao • June 2002 • Supported by NSF

  2. Introduction • Many tables on the Web • Ontology-based extraction: • Works for unstructured or semi-structured data • Does not work well for structured data -- tables • Only tables for information, not for layout

  3. Problems: Different schemas • Source table schemas differ: • {Run #, Yr, Make, Model, Tran, Color, Dr} • {Make, Model, Year, Colour, Price, Auto, Air Cond., AM/FM, CD} • {Vehicle, Distance, Price, Mileage} • {Year, Make, Model, Trim, Invoice/Retail, Engine, Fuel Economy} • Target database schema: {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}

  4. Problems: Attribute-value pairs?

  5. Problems: Attribute-value switch

  6. Problems: Attribute/value combinations • Year/sty Cyl. # Dr Tran Color

  7. Problems: Attribute/value split • Model

  8. Problems • Information in linked pages • Tables • Lists • Unstructured data • … • Header information

  9. Thesis Statement • HTML table with unknown structure • Extraction ontology • Mapping rules • Extracted data

  10. Methods: Understand Table • Recognize the table and its elements • <TABLE>, </TABLE> • <TR>: row; <TD>: data entry; <TH>: header • Steps: • Understand Table • Recognize Attributes and Values • Form Attribute-Value Pairs • Adjust Attribute-Value Pairs • Add Information Hidden Behind Links • Infer Mapping • Extract Data
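A minimal sketch of the table-recognition step, assuming well-formed markup with no nested tables. The class name `TableParser` and the helper `parse_table` are hypothetical names chosen for illustration, not taken from the slides; only Python's standard `html.parser` module is used.

```python
from html.parser import HTMLParser


class TableParser(HTMLParser):
    """Collect every <tr> as a list of (tag, text) cells.

    Assumes well-formed markup (each <td>/<th> is closed) and no nested tables.
    """

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = None   # row under construction, or None
        self._cell = None  # [tag, text_parts] for the open cell, or None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TR> arrives as "tr".
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = [tag, []]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[1].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            # Collapse runs of whitespace inside the cell text.
            text = " ".join("".join(self._cell[1]).split())
            self._row.append((self._cell[0], text))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def parse_table(html):
    """Return the table rows found in an HTML string."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    return parser.rows


SAMPLE = """
<TABLE>
  <TR><TH>Make</TH><TH>Model</TH><TH>Year</TH></TR>
  <TR><TD>Honda</TD><TD>Civic EX</TD><TD>1995</TD></TR>
</TABLE>
"""
rows = parse_table(SAMPLE)
```

Keeping the tag with each cell lets a later step treat `th` cells as candidate attributes and `td` cells as candidate values, although real pages often put headers in `td` cells as well.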

  11. Methods: Form Attribute-Value Pairs • Regular table • Table with factors • Nrcom = most common number of columns in the table
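The Nrcom definition can be sketched in a few lines. The transcript does not say how Nrcom is used afterwards, so the filtering step below (keep only rows with exactly Nrcom cells and treat the first of them as the header) is an assumption, and the function names and the one-cell caption row are made up for illustration.

```python
from collections import Counter


def most_common_column_count(rows):
    """Nrcom: the most common number of cells per row."""
    counts = Counter(len(row) for row in rows)
    if not counts:
        raise ValueError("table has no rows")
    return counts.most_common(1)[0][0]


def form_pairs_regular(rows):
    """Pair header cells with data cells for a regular table.

    Assumption: rows whose length differs from Nrcom (captions, spacer rows)
    are ignored, and the first remaining row holds the attribute names.
    """
    nrcom = most_common_column_count(rows)
    regular = [row for row in rows if len(row) == nrcom]
    header, data = regular[0], regular[1:]
    return [list(zip(header, row)) for row in data]


table = [
    ["Cars for sale"],  # hypothetical caption row with a single cell
    ["Make", "Model", "Year", "Colour", "Price"],
    ["Honda", "Civic EX", "1995", "White", "$6300"],
]
nrcom = most_common_column_count(table)
pairs = form_pairs_regular(table)
```

Here two of the three rows have five cells, so `nrcom` is 5 and the caption row is dropped before pairing.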

  12. Methods: Form Attribute-Value Pairs • Regular table • Table with factors • Table with Boolean values • Replace Boolean values

  13. Methods: Form Attribute-Value Pairs • Regular table • Table with factors • Table with Boolean values • Resulting attribute-value pairs: <Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM>

  14. Methods: Adjust Attribute-Value Pairs • Before: <Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto, Auto>, <Air Cond., Air Cond.>, <AM/FM, AM/FM> • After: <Make, Honda>, <Model, Civic EX>, <Year, 1995>, <Colour, White>, <Price, $6300>, <Auto>, <Air Cond>, <AM/FM>
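The Boolean replacement and the adjustment step can be sketched as follows. The slides show only the resulting pairs, so several details here are assumptions: which cell tokens count as true or false, the `Yes`/`No` cell values in the sample row, and the choice to drop attributes whose value is false (suggested by the CD column being absent from the pairs above). All names are hypothetical.

```python
TRUE_TOKENS = {"yes", "y", "x", "true"}    # assumed spellings of "true"
FALSE_TOKENS = {"no", "n", "false", ""}    # assumed spellings of "false"


def form_pairs(header, row):
    """Form <attribute, value> pairs, replacing true Booleans with the attribute name."""
    pairs = []
    for attribute, value in zip(header, row):
        token = value.strip().lower()
        if token in TRUE_TOKENS:
            pairs.append((attribute, attribute))   # e.g. <Auto, Auto>
        elif token in FALSE_TOKENS:
            continue                               # assumption: absent feature is dropped
        else:
            pairs.append((attribute, value))
    return pairs


def adjust_pairs(pairs):
    """Collapse <A, A> into the single-element <A>; leave other pairs unchanged."""
    return [(a,) if a == v else (a, v) for a, v in pairs]


header = ["Make", "Model", "Year", "Colour", "Price", "Auto", "Air Cond.", "AM/FM", "CD"]
row = ["Honda", "Civic EX", "1995", "White", "$6300", "Yes", "Yes", "Yes", "No"]

formed = form_pairs(header, row)
adjusted = adjust_pairs(formed)
```

With these inputs `formed` ends in `("Auto", "Auto"), ("Air Cond.", "Air Cond."), ("AM/FM", "AM/FM")` and `adjusted` ends in `("Auto",), ("Air Cond.",), ("AM/FM",)`.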

  15. Methods: Add Information Hidden Behind Links • Unstructured and semi-structured data: concatenate • <Manufacturer, Honda>, <Model, Civic EX>, <Door, 4>, <Year, 1995>, <Color, White>, <Engine, 2.0L 4 Cylinders>, <Transmission, Auto>, <Mileage, 82,628>, <Price, $6300> • Table: attribute-value pairs

  16. Methods: Add Information Hidden Behind Links • Unstructured and semi-structured data: concatenate • Table: attribute-value pairs

  17. Methods: Add Information Hidden Behind Links • Unstructured and semi-structured data: concatenate • Table: attribute-value pairs • List: <Features, AIR CONDITIONING, CD, AM/FM, CLOTH UPHOLSTERY, CONSOLE, CRUISE CONTROL, DUAL AIR BAGS, INSIDE HOOD RELEASE, POWER DOOR LOCKS, POWER STEERING, POWER SUNROOF, POWER WINDOWS, RADIAL TIRES, REAR DEFROSTER, REAR SPOILER, RECLINING SEATS>
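One way a feature list found on a linked page might feed the {Car, Feature} relation of the target schema is sketched below. The transcript does not describe this step, so the whole function is an assumption, and `feature_rows` and the identifier `car-1` are hypothetical.

```python
def feature_rows(car_id, feature_list):
    """Split a comma-separated feature list into (car, feature) rows.

    Whitespace is normalized and repeated features are kept only once.
    """
    rows = []
    seen = set()
    for raw in feature_list.split(","):
        feature = " ".join(raw.split())
        key = feature.upper()
        if feature and key not in seen:
            seen.add(key)
            rows.append((car_id, feature))
    return rows


linked_list = "AIR CONDITIONING, CD, AM/FM, CLOTH UPHOLSTERY"
car_features = feature_rows("car-1", linked_list)
```

Each resulting tuple corresponds to one row of {Car, Feature}, which is why a single table row can yield many feature rows.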

  18. Method: Inferred Mapping Creation • Each row is a car • {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}

  19. Method • {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}

  20. Method • {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}

  21. Method • Table Understanding • Recognize Attributes and Values • Form Attribute-Value Pairs • Adjust Attribute-Value Pairs • Add Information Hidden Behind Links • Inferred Mapping Creation • Data Extraction • {Car, Year, Make, Model, Mileage, Price, PhoneNr}, {Car, Feature}
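The slides name the mapping and extraction steps but the transcript does not show how the mapping is inferred. The sketch below is therefore a stand-in, not the author's method: it assumes each target attribute has a value recognizer (plain regular expressions here, in place of an extraction ontology) and maps a source column to the target attribute whose recognizer matches most of its values. The patterns, the small make lexicon, the threshold, and all function names are hypothetical.

```python
import re

# Assumed recognizers for a few target attributes; a real extraction
# ontology would be far richer than these illustrative patterns.
RECOGNIZERS = {
    "Year": re.compile(r"^(19|20)\d{2}$"),
    "Price": re.compile(r"^\$\d[\d,]*$"),
    "Mileage": re.compile(r"^\d{1,3}(,\d{3})+$"),
    "Make": re.compile(r"^(Honda|Toyota|Ford)$"),  # hypothetical lexicon
}


def infer_mapping(header, rows, threshold=0.8):
    """Map each source attribute to the target whose recognizer fits best."""
    mapping = {}
    for col, source_attr in enumerate(header):
        values = [row[col].strip() for row in rows if col < len(row)]
        if not values:
            continue
        best_target, best_score = None, 0.0
        for target_attr, pattern in RECOGNIZERS.items():
            score = sum(bool(pattern.match(v)) for v in values) / len(values)
            if score > best_score:
                best_target, best_score = target_attr, score
        if best_target is not None and best_score >= threshold:
            mapping[source_attr] = best_target
    return mapping


def extract(header, rows, mapping):
    """Rewrite each source row as a record keyed by target attributes."""
    records = []
    for row in rows:
        record = {}
        for source_attr, value in zip(header, row):
            target_attr = mapping.get(source_attr)
            if target_attr is not None:
                record[target_attr] = value
        records.append(record)
    return records


header = ["Manufacturer", "Model", "Door", "Year", "Color", "Mileage", "Price"]
rows = [["Honda", "Civic EX", "4", "1995", "White", "82,628", "$6300"]]

mapping = infer_mapping(header, rows)
records = extract(header, rows, mapping)
```

Matching on values rather than attribute names is what lets `Manufacturer` map to `Make` even though the names differ. This sketch does not stop two source columns from claiming the same target attribute, which a fuller version would need to resolve.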

  22. Evaluation • Measure percentage of correct mappings: • Correct mapping • Partially correct mapping • Incorrect mapping • Measure precision and recall: • Data in the table • Data in linked pages • Compare the results for extracted data before mapping and after mapping
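Precision and recall for extracted data can be computed as in the sketch below, assuming each extracted fact is compared as an (attribute, value) tuple against a hand-built answer set. The function name and both sample sets are made up for illustration and are not results from the study.

```python
def precision_recall(extracted, expected):
    """Return (precision, recall) for two collections of extracted facts."""
    extracted, expected = set(extracted), set(expected)
    correct = len(extracted & expected)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(expected) if expected else 0.0
    return precision, recall


# Hypothetical answer key and extractor output for a single table row.
expected = {("Year", "1995"), ("Make", "Honda"), ("Model", "Civic EX"), ("Price", "$6300")}
extracted = {("Year", "1995"), ("Make", "Honda"), ("Price", "$6300"), ("Mileage", "1995")}

precision, recall = precision_recall(extracted, expected)
```

In this made-up case three of the four extracted facts are correct and three of the four expected facts are found, so both measures are 0.75. Running the same comparison before and after mapping gives the two sets of figures the slide proposes to compare.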

  23. Contribution • Provides an approach to extract information automatically from HTML tables • Suggests a different way to solve the problem of schema matching
