Data Extraction from Web Tables: the Devil is in the Details

DATA FLOW

1. Web page (HTML)
   ↓ Excel import
2. CSV table (text file)
   ↓ Python critical cell location
3. List of critical cells (CSV)
   ↓ VeriClick confirmation/correction
4. Corrected lists of critical cells (CSV)
   ↓ Python path extraction
5. Header paths (text file)
   ↓ Sis factoring
6. Canonical expression (text file)
   ↓ Java constructor
7. Relational tables and RDF triples
   ↓ SQL or OWL

TABLE NOTATION

Figure labels:
• Stub
• Row header
• Column header
• Data (delta) cells
• Virtual header (“CH1”) needed for category A
• Wang categories

WFT: Every delta cell of a well-formed table is indexed completely and uniquely by its row and column headers. The headers form trees.

A table with only one row or column of delta cells is degenerate.

A structure missing any row or column headers is a list.

Other semi-structured data: forms. Tables are meant to disseminate information. Forms are meant to collect information.

1. Web table

Table 1.9 Renewable Energy Resources

2. CSV intermediate format

Segmentation and path extraction are programmed from CSV because cell-level operations are easy in that format.

    ,,B,,,,,
    ,,B1,,,B2,,
    ,,A1,A2,A3,A1,A2,A3
    C,C1,D11,D12,D13,D14,D15,D16
    ,C2,D21,D22,D23,D24,D25,D26

3. Critical cells are verified or corrected:

4. Critical cells are: a1, b3, c4, h5

5. Header paths are extracted:

    rowpaths = (("<0,3>C"*"<1,3>C1")
               +("<0,4>C"*"<1,4>C2"));
    colpaths = (("<2,0>B"*"<2,1>B1"*"<2,2>A1")
               +("<3,0>B"*"<3,1>B1"*"<3,2>A2")
               +("<4,0>B"*"<4,1>B1"*"<4,2>A3")
               +("<5,0>B"*"<5,1>B2"*"<5,2>A1")
               +("<6,0>B"*"<6,1>B2"*"<6,2>A2")
               +("<7,0>B"*"<7,1>B2"*"<7,2>A3"));

6. Canonical expression using Sis:

    C*(C1+C2)+B*(B1+B2)+CH1*(A1+A2+A3)

7a. MySQL relational table generation

    CREATE TABLE Fig_1(C varchar(2), B varchar(2),
        CH1_A1 varchar(3), CH1_A2 varchar(3), CH1_A3 varchar(3),
        PRIMARY KEY (C, B));
    INSERT INTO Fig_1 VALUES("C1", "B1", "D11", "D12", "D13");
    INSERT INTO Fig_1 VALUES("C1", "B2", "D14", "D15", "D16");
    INSERT INTO Fig_1 VALUES("C2", "B1", "D21", "D22", "D23");
    INSERT INTO Fig_1 VALUES("C2", "B2", "D24", "D25", "D26");

7b. RDF triple generation

    <rdf:RDF
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:Fig_1="mysql://localhost:3306/Fig_1#">
      <rdf:Description
          rdf:about="mysql://localhost:3306/Fig_1/C-B_0"
          Fig_1:C="C1"
          Fig_1:B="B1"
          Fig_1:CH1_A1="D11"
          Fig_1:CH1_A2="D12"
          Fig_1:CH1_A3="D13"
      />
      ...
    </rdf:RDF>

DETAILS

• Missing header roots
• Ambiguous roots in stub
• Missing headers
• Dedented headers
• Unit rows
• Blank rows
• Duplicate header cells
• Duplicate header paths
• Aggregates
• Table titles
• Notes and footnotes
• Missing data
• Special symbols
• Nested tables
• Concatenated tables
• Incorrect tables

EXPERIMENTAL RESULTS

• 200 web tables
• 197 segmented (26 errors corrected)
• 196 canonical expressions
• 376 relational tables
• 34,110 subject-predicate-object tuples

This work was supported by NSF Grants # 044114854 (at RPI) and # 0414644 (at BYU) and by the Rensselaer Center for Open Software. Mangesh Tamhankar (RPI) developed VeriClick.
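
The path-extraction step (critical cells in, header paths out) can be illustrated with a short Python sketch. This is not the authors' program; it is a minimal illustration under stated assumptions: the four critical cells a1, b3, c4, h5 are read as the stub's top-left and bottom-right corners and the data region's top-left and bottom-right corners, blank header cells inherit the nearest label to their left (column headers) or above them (row headers), and the second CSV row places B2 in column 5 so that the grid agrees with the listed column paths. The function names (parse_ref, forward_fill, extract_paths, format_paths, index_data) are hypothetical. None of the complications listed under "Details" are handled.

```python
import csv
import io

# Example table from the CSV step. Assumption: B2 sits in column 5 of the
# second row so that the grid agrees with the listed column paths.
CSV_TEXT = """\
,,B,,,,,
,,B1,,,B2,,
,,A1,A2,A3,A1,A2,A3
C,C1,D11,D12,D13,D14,D15,D16
,C2,D21,D22,D23,D24,D25,D26
"""


def parse_ref(ref):
    """Turn a spreadsheet-style reference such as 'c4' into 0-based (col, row).

    Only single-letter columns are handled, which is enough for this example.
    """
    return ord(ref[0].lower()) - ord("a"), int(ref[1:]) - 1


def forward_fill(labels):
    """Replace each blank label with the nearest non-blank label before it."""
    filled, last = [], ""
    for label in labels:
        last = label or last
        filled.append(last)
    return filled


def extract_paths(grid, critical):
    """Return (rowpaths, colpaths); each path is a list of (col, row, label)."""
    (sc0, sr0), (sc1, sr1), (dc0, dr0), (dc1, dr1) = (
        parse_ref(ref) for ref in critical
    )
    hdr_rows = range(sr0, sr1 + 1)    # rows that hold column headers
    hdr_cols = range(sc0, sc1 + 1)    # columns that hold row headers
    data_cols = range(dc0, dc1 + 1)
    data_rows = range(dr0, dr1 + 1)
    # Spanning column headers: fill blanks from the left within the data columns.
    filled_rows = {r: forward_fill(grid[r][dc0:dc1 + 1]) for r in hdr_rows}
    colpaths = [[(c, r, filled_rows[r][c - dc0]) for r in hdr_rows]
                for c in data_cols]
    # Spanning row headers: fill blanks from above within the data rows.
    filled_cols = {c: forward_fill([grid[r][c] for r in data_rows])
                   for c in hdr_cols}
    rowpaths = [[(c, r, filled_cols[c][r - dr0]) for c in hdr_cols]
                for r in data_rows]
    return rowpaths, colpaths


def format_paths(paths):
    """Render paths as a sum of products, e.g. (("<0,3>C"*"<1,3>C1") +(...))."""
    terms = ["(" + "*".join(f'"<{c},{r}>{label}"' for c, r, label in path) + ")"
             for path in paths]
    return "(" + " +".join(terms) + ")"


def index_data(grid, rowpaths, colpaths):
    """Map each (row-path labels, column-path labels) pair to its delta cell."""
    index = {}
    for rp in rowpaths:
        for cp in colpaths:
            key = (tuple(label for _, _, label in rp),
                   tuple(label for _, _, label in cp))
            index[key] = grid[rp[0][1]][cp[0][0]]  # grid[row][col]
    return index


grid = list(csv.reader(io.StringIO(CSV_TEXT)))
rowpaths, colpaths = extract_paths(grid, ["a1", "b3", "c4", "h5"])
index = index_data(grid, rowpaths, colpaths)

print("rowpaths =", format_paths(rowpaths))
print("colpaths =", format_paths(colpaths))
print(index[(("C", "C2"), ("B", "B2", "A3"))])  # D26
```

The final lookup illustrates the well-formed-table property: each pair of a row path and a column path selects exactly one delta cell, so the twelve data cells yield twelve distinct index keys.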

