
From HTML Documents to Web Tables and Rules


Presentation Transcript


    1. From HTML Documents to Web Tables and Rules. ICEC '06. Kai Simon, Georg Lausen, Harold Boley

    2. Motivation: Most information available on the Web is accessible only to humans, through presentation-oriented HTML pages. We still lack techniques that enable machines (agents) to extract and understand such information resources and to act on behalf of humans.

    3. Reverse Engineering

    4. Overview

    5. Automatic Data Extraction System: ViPER (Visual Perception-based Extraction of Records) [CIKM'05]. Features of the data extraction process: operates on a single Web page; needs at least two similar consecutive data records; finds data record segmentations visually; visually weights data regions. Data item alignment: uses global alignment techniques (as in gene sequence alignment); incorporates string similarity and tree information.
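
The global alignment idea mentioned on this slide can be illustrated with the textbook Needleman-Wunsch algorithm. The sketch below is not ViPER's implementation: the function name `global_align`, the token lists, and the match/mismatch/gap scores are assumptions chosen for illustration, and exact token equality stands in for the string-similarity and tree information that the slide says ViPER incorporates.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment of two token lists (illustrative scores)."""
    n, m = len(a), len(b)

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align both items
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner; None marks a gap.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append(None); i -= 1
        else:
            out_a.append(None); out_b.append(b[j - 1]); j -= 1
    return out_a[::-1], out_b[::-1], score[n][m]


# Two hypothetical data records, one of which lacks a "price" item.
rec1 = ["name", "price", "rating"]
rec2 = ["name", "rating"]
print(global_align(rec1, rec2))
# (['name', 'price', 'rating'], ['name', None, 'rating'], 1)
```

The `None` entries mark where a record has no counterpart for an item, which is what lets aligned records be laid out as table rows with empty cells.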

    6. Automatic Data Extraction (speaker note: mention towel rings)

    7. Post-Processing

    8. Data Representation

    9. Record Alignment

    10. Column Splitting: brackets, dash, comma-separated list
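
A minimal sketch of splitting a column on the separators this slide names (brackets, dashes, commas). The helper names `split_cell` and `split_column`, the sample cells, and the rule that a column is split only when every cell yields the same number of parts are assumptions made for illustration; the slide does not state the actual criterion.

```python
import re

# Brackets and commas split anywhere; a dash splits only when surrounded
# by whitespace, so hyphenated words such as "stainless-steel" stay intact.
SEPARATORS = re.compile(r"\s*[()\[\],]\s*|\s+-\s+")


def split_cell(cell):
    """Split one cell on brackets, commas, and spaced dashes."""
    return [part for part in SEPARATORS.split(cell) if part]


def split_column(cells):
    """Return the sub-columns, or None if the cells do not split uniformly."""
    parts = [split_cell(c) for c in cells]
    widths = {len(p) for p in parts}
    if len(widths) != 1 or widths == {1}:
        return None  # inconsistent split, or nothing to split
    return [list(col) for col in zip(*parts)]


# Hypothetical cell values.
print(split_column(["Towel Ring (Chrome)", "Towel Bar (Nickel)"]))
# [['Towel Ring', 'Towel Bar'], ['Chrome', 'Nickel']]
```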

    11. Label Assignment: column headings == label assignment; slot names in POSL

    12. Label Assignment

    13. Splitting and Labeling Results

    14. Functional Dependencies: Compute statistical dependencies between columns Cx and Cy (chi-square test / Cramér's V). Simple functional dependency heuristics. Numeric and boolean values.
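
As a rough illustration of the statistics named on this slide, the sketch below computes Cramér's V from the chi-square statistic of two categorical columns, plus one simple functional-dependency check. The function names and sample columns are assumptions, and the check shown (every value of Cx maps to exactly one value of Cy) is only one plausible reading of "simple functional dependency heuristics".

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V in [0, 1] for two equally long categorical columns."""
    n = len(xs)
    joint = Counter(zip(xs, ys))   # observed cell counts
    row, col = Counter(xs), Counter(ys)
    k = min(len(row), len(col)) - 1
    if n == 0 or k == 0:
        return 0.0                 # a constant column carries no association
    chi2 = 0.0
    for x in row:
        for y in col:
            expected = row[x] * col[y] / n
            chi2 += (joint[(x, y)] - expected) ** 2 / expected
    return sqrt(chi2 / (n * k))


def functionally_determines(xs, ys):
    """True if every value in xs is paired with exactly one value in ys."""
    seen = {}
    for x, y in zip(xs, ys):
        if seen.setdefault(x, y) != y:
            return False
    return True


# Hypothetical columns.
cx = ["a", "a", "b", "b"]
print(cramers_v(cx, ["x", "x", "y", "y"]))  # 1.0 (fully dependent)
print(cramers_v(cx, ["x", "y", "x", "y"]))  # 0.0 (independent)
```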

    15. Arithmetic Dependencies: Find arithmetic dependencies between numeric columns by checking the homogeneous system of linear equations for non-trivial solutions. (Speaker note: change lambda; anonymous functions in POSL.)
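
One way to read this slide: treat each table row as an equation over coefficients λ, one per numeric column, and look for λ ≠ 0 such that every row satisfies Σ λ_i · C_i = 0. The sketch below finds such solutions with exact Gauss-Jordan elimination; the function name `null_space` and the sample price/shipping/total rows are assumptions, and the authors' actual procedure may differ.

```python
from fractions import Fraction


def null_space(rows):
    """Basis of all lam with A @ lam = 0, where A's rows are the table rows."""
    a = [[Fraction(v) for v in r] for r in rows]
    ncols = len(a[0])
    pivots = []  # pivots[k] is the pivot column of row k
    r = 0
    for c in range(ncols):
        p = next((i for i in range(r, len(a)) if a[i][c] != 0), None)
        if p is None:
            continue  # no pivot here: c is a free column
        a[r], a[p] = a[p], a[r]
        pivot = a[r][c]
        a[r] = [v / pivot for v in a[r]]
        for i in range(len(a)):
            if i != r and a[i][c] != 0:
                f = a[i][c]
                a[i] = [vi - f * vr for vi, vr in zip(a[i], a[r])]
        pivots.append(c)
        r += 1
        if r == len(a):
            break
    # Each free column yields one basis vector of the solution space.
    basis = []
    for free in (c for c in range(ncols) if c not in pivots):
        vec = [Fraction(0)] * ncols
        vec[free] = Fraction(1)
        for k, pc in enumerate(pivots):
            vec[pc] = -a[k][free]
        basis.append(vec)
    return basis


# Hypothetical columns: price, shipping, total.
rows = [[10, 2, 12], [20, 5, 25], [7, 3, 10]]
print([[int(v) for v in vec] for vec in null_space(rows)])
# [[-1, -1, 1]]  i.e. total = price + shipping
```

An empty result means only the trivial solution exists, so no linear arithmetic dependency holds among the columns. Exact fractions are used here to avoid floating-point noise; decimal cell values could be passed as strings such as `"19.99"`.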

    16. POSL - Rules

    17. POSL - Syntax: Variables are prefixed with "?"; anonymous variables are written as a stand-alone "?". Positional arguments are separated by ",". In the F-logic-inspired slot syntax, "name->filler" pairs are separated by ";" and are unordered; names represent table labels and fillers represent table cells. Unmentioned slots are made explicit by "!" rests. Facts and rule heads can be anchored by an OID (usually a URI) as a special "zeroth" argument separated by "^". Rules are written with Prolog's IF infix ":-". (Speaker note: semicolon, question mark, colon-dash ":-".)
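
To make the mapping from table rows to slotted facts concrete, here is a small sketch that renders one row as a POSL-style fact, with column labels as slot names and cells as fillers. It assumes the slot arrow is written `->`; the helper `posl_fact`, the relation name `product`, the label normalisation, and the double-quoting of string fillers are all assumptions for illustration and not taken from the slides or a POSL tool.

```python
import re


def posl_fact(relation, row):
    """Render a {label: cell} row as a slotted fact string (hypothetical helper)."""
    slots = []
    for label, cell in row.items():
        # Assumed normalisation: lower-case, non-word runs become "_".
        name = re.sub(r"\W+", "_", label.strip()).strip("_").lower()
        if isinstance(cell, (int, float)) and not isinstance(cell, bool):
            filler = str(cell)
        else:
            # Quoting convention is an assumption.
            filler = '"' + str(cell).replace('"', "'") + '"'
        slots.append(f"{name}->{filler}")
    # Slots are unordered in POSL; insertion order is kept for readability.
    return f"{relation}({'; '.join(slots)})."


print(posl_fact("product", {"Product Name": "Towel Ring", "Price": 12}))
# product(product_name->"Towel Ring"; price->12).
```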

    18. Row Compactification (speaker note: highlight)

    19. Column Compactification (speaker note: result ?w; ?v = a lambda (head), an anonymous head)

    20. Column Compactification: remove the dependent column
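
Removing a dependent column can be sketched as follows: once a rule is known to reproduce a column from the others, the column is dropped and the rule is kept so the values can be derived again later. The function names `compactify` and `enrich`, the dict-per-row table layout, and the sample data are assumptions made for illustration.

```python
def compactify(table, dependent, rule):
    """Drop column `dependent` if rule(row) reproduces it in every row."""
    if any(rule(row) != row[dependent] for row in table):
        return table, None  # the dependency does not hold; keep the table
    compact = [{k: v for k, v in row.items() if k != dependent} for row in table]
    return compact, rule


def enrich(compact, dependent, rule):
    """Re-derive the removed column from the stored rule."""
    return [{**row, dependent: rule(row)} for row in compact]


# Hypothetical table with a derivable "total" column.
table = [
    {"price": 10, "shipping": 2, "total": 12},
    {"price": 20, "shipping": 5, "total": 25},
]
compact, rule = compactify(table, "total", lambda r: r["price"] + r["shipping"])
print(compact)  # [{'price': 10, 'shipping': 2}, {'price': 20, 'shipping': 5}]
print(enrich(compact, "total", rule) == table)  # True
```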

    21. Column Compactification

    22. Enriched and Personalized Web Data Information: invariant, familiar view of the information

    23. Summary and Future Work: Formal description of structured Web content. Focused on information integration (constraints). Quality of websites (data quality, redundancy, normalized information, data consistency). Compactify and later enrich tables via rules. Personalized table projections describable as rules.
