1. From HTML Documents to Web Tables and Rules ICEC '06, Kai Simon, Georg Lausen, Harold Boley
2. Motivation Most information available on the Web is accessible only to humans, through presentation-oriented HTML pages.
We still lack techniques that enable machines (agents) to extract and understand such information resources and to act on behalf of humans.
3. Reverse Engineering
4. Overview
5. Automatic Data Extraction System ViPER [Visual Perception-based Extraction of Records] [CIKM'05]
Features
Data extraction process
operates on a single Web page
needs at least two similar consecutive data records
finds data record segmentations visually
visually weights data regions
Data item alignment
global alignment techniques (gene sequence alignment)
incorporate string similarity and tree information
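The global alignment idea borrowed from gene sequence alignment can be sketched with the classic Needleman-Wunsch algorithm. This is a minimal illustration, not ViPER's implementation: it compares items by exact equality only, whereas the slide says ViPER also incorporates string similarity and tree information. The function name `global_align`, the scoring values, and the sample item sequences are assumptions made for the example.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment of two item sequences.

    Returns (score, pairs); a gap is represented by None in a pair.
    Scoring values are illustrative assumptions.
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner to recover one optimal alignment.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                pairs.append((a[i - 1], b[j - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    pairs.reverse()
    return score[n][m], pairs


# Two hypothetical data records, reduced to their item types.
record_a = ["img", "name", "price", "rating"]
record_b = ["img", "name", "rating"]
align_score, aligned = global_align(record_a, record_b)
```

In this example the second record has no price item, so the alignment pairs `price` with a gap, which is what later becomes an empty table cell.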
6. Automatic Data Extraction (speaker note: mention towel rings)
7. Post-Processing
8. Data Representation
9. Record Alignment
10. Column Splitting Brackets, dash, comma-separated list
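A rough sketch of splitting a column at the delimiters named on this slide (brackets, dashes, commas). The helper names `split_cell` and `split_column`, the rule that a column is split only when every cell yields the same number of parts, and the sample cells are assumptions for illustration; the slides do not specify the actual procedure.

```python
import re

# Brackets and commas always delimit; a dash only when surrounded by whitespace,
# so that negative numbers and hyphenated words are left intact.
_DELIMITERS = re.compile(r"[()\[\],]|\s-\s")


def split_cell(cell):
    """Split one cell into trimmed, non-empty parts."""
    return [part.strip() for part in _DELIMITERS.split(cell) if part.strip()]


def split_column(cells):
    """Split a column into sub-columns only if all cells split consistently."""
    rows = [split_cell(cell) for cell in cells]
    if len({len(row) for row in rows}) == 1:
        return rows
    return [[cell] for cell in cells]  # inconsistent: keep the column as it is


column = ["Towel Ring (chrome) - 12.99", "Soap Dish (brass) - 8.50"]
sub_columns = split_column(column)
```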
11. Label Assignment Column headings == label assignment
assigned labels serve as slot names in POSL
12. Label Assignment
13. Splitting and Labeling Results
14. Functional Dependencies Compute statistical dependencies between columns Cx and Cy
Chi-square test / Cramér's V
Simple functional dependency heuristics
Numeric and boolean values
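The two kinds of check listed above can be sketched as follows: Cramér's V, computed from the chi-square statistic of the two columns' contingency table, and a simple test of whether each value of Cx maps to exactly one value of Cy. The function names and the sample columns are assumptions for illustration; the slides do not say which thresholds or heuristics the authors use.

```python
from collections import Counter
from math import sqrt


def cramers_v(xs, ys):
    """Cramér's V in [0, 1] for two equally long categorical columns."""
    n = len(xs)
    count_x, count_y = Counter(xs), Counter(ys)
    k = min(len(count_x), len(count_y))
    if n == 0 or k < 2:
        return 0.0  # association is undefined for a constant column
    joint = Counter(zip(xs, ys))
    chi2 = 0.0
    for x, nx in count_x.items():
        for y, ny in count_y.items():
            expected = nx * ny / n
            observed = joint.get((x, y), 0)
            chi2 += (observed - expected) ** 2 / expected
    return sqrt(chi2 / (n * (k - 1)))


def functionally_determines(xs, ys):
    """True if every value in xs is paired with exactly one value in ys."""
    seen = {}
    for x, y in zip(xs, ys):
        if seen.setdefault(x, y) != y:
            return False
    return True


brand = ["Acme", "Acme", "Bolt", "Bolt"]
country = ["US", "US", "DE", "DE"]
colour = ["red", "blue", "red", "blue"]

v_dependent = cramers_v(brand, country)
v_independent = cramers_v(brand, colour)
```

A value of V close to 1 suggests a strong association worth testing as a functional dependency; a value close to 0 suggests the columns are unrelated.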
15. Arithmetic Dependencies Find arithmetic dependencies between numeric columns by checking the homogeneous system of linear equations for non-trivial solutions.
(speaker note: change lambda; anonymous functions in POSL)
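One way to check a homogeneous system for non-trivial solutions is to compute a basis of the null space of the matrix whose rows are records and whose columns are the numeric table columns: any non-zero vector c with A c = 0 is a linear dependency among the columns. The sketch below uses exact Gauss-Jordan elimination over fractions. The function name `null_space` and the sample price, shipping, and total values are assumptions for illustration, not the authors' procedure or data.

```python
from fractions import Fraction


def null_space(rows):
    """Return a basis of {c : A c = 0} for the matrix A given as a list of rows."""
    # Convert via str so that decimal literals such as 12.99 stay exact.
    a = [[Fraction(str(value)) for value in row] for row in rows]
    n_cols = len(a[0]) if a else 0
    pivots = []  # pivots[k] is the pivot column of reduced row k
    r = 0
    for c in range(n_cols):
        pivot_row = next((i for i in range(r, len(a)) if a[i][c] != 0), None)
        if pivot_row is None:
            continue  # no pivot here: column c is a free column
        a[r], a[pivot_row] = a[pivot_row], a[r]
        pivot = a[r][c]
        a[r] = [value / pivot for value in a[r]]
        for i in range(len(a)):
            if i != r and a[i][c] != 0:
                factor = a[i][c]
                a[i] = [vi - factor * vr for vi, vr in zip(a[i], a[r])]
        pivots.append(c)
        r += 1

    basis = []
    for free in range(n_cols):
        if free in pivots:
            continue
        vector = [Fraction(0)] * n_cols
        vector[free] = Fraction(1)
        for row_index, pivot_col in enumerate(pivots):
            vector[pivot_col] = -a[row_index][free]
        basis.append(vector)
    return basis


# Hypothetical columns: price, shipping, total (total = price + shipping).
table = [
    [10, 5, 15],
    [20, 5, 25],
    [30, 8, 38],
]
dependencies = null_space(table)
```

Here the single basis vector (-1, -1, 1) reads as -price - shipping + total = 0. An empty basis means only the trivial solution exists, so the columns are linearly independent.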
16. POSL - Rules
17. POSL POSL - Syntax
variables are prefixed with "?"
anonymous variables are noted by stand-alone "?"
positional arguments are separated by ","
F-logic-inspired slots "name->filler" are separated by ";" and unordered
names represent table labels
and fillers represent table cells
unmentioned slots are made explicit by "!" rests
Facts and rule heads can be anchored by an OID (usually URI) as a special "zeroth" argument separated by "^"
Rules are written with Prolog's IF infix ":-"
(speaker notes: ";" is the semicolon, "?" the question mark, ":-" the colon dash)
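To make the slotted syntax concrete, here is a small sketch that renders one labelled table row as a POSL-style fact, with labels as slot names and cells as fillers. The function name `to_posl_fact`, the relation name `product`, and the sample row are hypothetical, and the output follows only the syntax points listed on this slide; it has not been checked against a POSL parser.

```python
def to_posl_fact(relation, row):
    """Render a row (label -> cell value) as a POSL-style slotted fact string."""
    def term(value):
        # Numbers are written bare; everything else becomes a quoted string.
        if isinstance(value, (int, float)):
            return str(value)
        return '"' + str(value).replace('"', '\\"') + '"'

    slots = "; ".join(f"{label}->{term(cell)}" for label, cell in row.items())
    return f"{relation}({slots})."


fact = to_posl_fact("product", {"name": "Towel Ring", "price": 12.99})
```

With this input, `fact` holds the string `product(name->"Towel Ring"; price->12.99).`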
18. Row compactification (speaker note: highlight)
19. Column compactification result ?w
?v=a
lambda (head): anonymous head
20. Column compactification remove the dependent column
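Removing a dependent column can be sketched as follows: once one column is known to determine another, the dependent column is dropped from the table and its values are kept as a separate mapping, which plays the role the rule plays in the slides. The function name `compactify_column`, the dictionary-based row format, and the sample data are assumptions for illustration.

```python
def compactify_column(rows, determinant, dependent):
    """Drop `dependent` from every row and return (compact_rows, mapping).

    `mapping` records determinant value -> dependent value, so the removed
    column can be recomputed. Raises ValueError if the dependency fails.
    """
    mapping = {}
    for row in rows:
        key, value = row[determinant], row[dependent]
        if mapping.setdefault(key, value) != value:
            raise ValueError(f"{determinant} does not determine {dependent}")
    compact = [{k: v for k, v in row.items() if k != dependent} for row in rows]
    return compact, mapping


table = [
    {"item": "Towel Ring", "brand": "Acme", "country": "US"},
    {"item": "Soap Dish", "brand": "Acme", "country": "US"},
    {"item": "Hook", "brand": "Bolt", "country": "DE"},
]
compact_table, brand_to_country = compactify_column(table, "brand", "country")

# The removed column can be restored from the mapping when needed.
restored = [dict(row, country=brand_to_country[row["brand"]]) for row in compact_table]
```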
21. Column Compactification
22. Enriched and Personalized Web Data Information invariant familiar view of the information
23. Summary and Future Work Formal description of structured Web content
Focused on information integration (constraints)
Quality of websites (data quality, redundancy, normalized information, data consistency)
Compactify and later enrich tables via rules
Personalized table projections describable as rules