Transforming Arbitrary Tables into F-Logic Frames with TARTAR • Aleksander Pivk, York Sure, Philipp Cimiano, Matjaz Gams, Vladislav Rajkovic, Rudi Studer • Presented by Stephen Lynn
Information Extraction • Free-form Text • Linguistic/NLP approaches • Tabular Structures • Table comprehension task • HTML, Excel, PDF, text, etc. • Semantic interpretation task • More effort?
Semantic Representation • Frame Logic (F-Logic) • Model-theoretic semantics • Complete resolution-based proof theory • Expressive power of logic • Availability of efficient reasoning tools
Table Comprehension • Dimensions – a grouping of cells representing similar entities
Table Comprehension • Stub – dimension with headers used to index elements in body
Table Comprehension • Box head – column headers (often nested)
Table Comprehension • Body – data values
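The stub, box head, and body defined above can be illustrated on a tiny table. This is a minimal sketch, not TARTAR's code: the sample table, the function name `split_table`, and the parameters `header_rows` and `stub_cols` are assumptions, and the positions of the header row and stub column are given by hand rather than detected.

```python
# Hypothetical 2D table: rows are indexed by a stub, columns by a box head.
table = [
    ["City",  "Price", "Nights"],
    ["Paris", "450",   "3"],
    ["Rome",  "390",   "4"],
]

def split_table(rows, header_rows=1, stub_cols=1):
    """Split a rectangular table into box head, stub, and body.

    header_rows and stub_cols are assumed to be known here; finding
    them automatically is the job of the structure-detection step.
    """
    box_head = [row[stub_cols:] for row in rows[:header_rows]]
    stub = [row[:stub_cols] for row in rows[header_rows:]]
    body = [row[stub_cols:] for row in rows[header_rows:]]
    return box_head, stub, body

box_head, stub, body = split_table(table)
```

Here `box_head` holds the column headers, `stub` holds the row headers that index the body, and `body` holds the data values.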
Table Classes • 1D, 2D, Complex
Cleaning & Canonicalization • Clean DOM tree • CyberNeko HTML Parser • Rowspan/Colspan expansion
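Rowspan/colspan expansion can be sketched as follows. This is an illustrative version, not the TARTAR implementation: it assumes the cleaned DOM has already been reduced to rows of `(text, rowspan, colspan)` tuples, and the name `expand_spans` and the sample data are hypothetical.

```python
def expand_spans(rows):
    """Expand rowspan/colspan so every logical cell has its own grid slot.

    rows: list of rows; each cell is a (text, rowspan, colspan) tuple.
    Returns a rectangular list of lists, with spanned text duplicated.
    """
    grid = {}  # (row, col) -> text
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip slots already filled by a rowspan from an earlier row.
            while (r, c) in grid:
                c += 1
            # Copy the cell text into every slot the span covers.
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]

# Hypothetical table with a nested header: "Price" spans two sub-columns.
rows = [
    [("Tour", 2, 1), ("Price", 1, 2)],
    [("Single", 1, 1), ("Double", 1, 1)],
    [("Paris", 1, 1), ("450", 1, 1), ("390", 1, 1)],
]
matrix = expand_spans(rows)
```

After expansion every row has the same number of cells, so later steps can treat the table as a plain matrix.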
Structure Detection • Token Type Hierarchy • Assign Functional Types and Probabilities
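A rough sketch of the idea, under strong assumptions: the token types below are a flattened, simplified stand-in for a token type hierarchy, the labels "attribute" and "instance" are assumed names for the functional types, and the probabilities are made-up placeholders rather than values from the slides.

```python
import re

# Assumed, simplified token types, ordered from most specific to most general.
TOKEN_TYPES = [
    ("date", re.compile(r"^\d{1,2}[./-]\d{1,2}[./-]\d{2,4}$")),
    ("currency", re.compile(r"^[$€£]\s?\d+(\.\d+)?$")),
    ("number", re.compile(r"^\d+(\.\d+)?$")),
    ("alpha", re.compile(r"^[A-Za-z ]+$")),
]

def token_type(cell):
    """Return the first (most specific) token type whose pattern matches."""
    text = cell.strip()
    for name, pattern in TOKEN_TYPES:
        if pattern.match(text):
            return name
    return "other"

def functional_type(cell):
    """Return (label, probability); the probabilities are placeholders."""
    kind = token_type(cell)
    if kind in ("date", "currency", "number"):
        return ("instance", 0.9)   # data-like content
    if kind == "alpha":
        return ("attribute", 0.6)  # header-like content
    return ("instance", 0.5)
```

For example, `functional_type("Price")` yields an attribute guess, while `functional_type("450")` yields an instance guess with higher confidence.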
Structure Detection • Detect Logical Table Orientation
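One way to picture orientation detection is to ask whether cells are more alike along columns or along rows. The sketch below is an assumption-laden illustration, not the method from the slides: it uses only a coarse number/text distinction, and the names `cell_kind`, `disagreement`, and `orientation` and the labels "vertical" and "horizontal" are chosen here for illustration.

```python
def cell_kind(cell):
    """Very coarse token type: 'number' or 'text' (an assumed simplification)."""
    return "number" if cell.replace(".", "", 1).isdigit() else "text"

def disagreement(cells):
    """Fraction of adjacent cell pairs whose coarse types differ."""
    pairs = list(zip(cells, cells[1:]))
    if not pairs:
        return 0.0
    return sum(cell_kind(a) != cell_kind(b) for a, b in pairs) / len(pairs)

def orientation(grid):
    """Guess 'vertical' if columns are at least as homogeneous as rows."""
    row_score = sum(disagreement(row) for row in grid) / len(grid)
    columns = list(zip(*grid))
    col_score = sum(disagreement(col) for col in columns) / len(columns)
    return "vertical" if col_score <= row_score else "horizontal"

# Hypothetical table whose attributes run across the top row.
grid = [
    ["City",  "Country", "Price"],
    ["Paris", "France",  "450"],
    ["Rome",  "Italy",   "390"],
    ["Oslo",  "Norway",  "520"],
]
transposed = [list(col) for col in zip(*grid)]
```

In this sample the columns are more homogeneous than the rows, so `orientation(grid)` reads the table top to bottom, and the transposed copy is classified the other way.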
Structure Detection • Discover and Level Regions • Logical Units
FTM Building • Functional Table Model (FTM) • Arrange regions into a tree • Leaf nodes are data
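The tree shape described above can be sketched with a small data structure. This is a simplified illustration, assuming a flat table with one header row; the names `Node`, `build_ftm`, and `leaves` are hypothetical, and a real FTM would also carry nested headers and region information.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)

    def is_leaf(self):
        return not self.children

def build_ftm(header, rows):
    """Build a tree: root -> one node per attribute -> data leaves."""
    root = Node("table")
    for col, name in enumerate(header):
        attr = Node(name)
        attr.children = [Node(row[col]) for row in rows]
        root.children.append(attr)
    return root

def leaves(node):
    """Collect leaf labels left to right; in this sketch they are the data values."""
    if node.is_leaf():
        return [node.label]
    return [leaf for child in node.children for leaf in leaves(child)]

ftm = build_ftm(["City", "Price"], [["Paris", "450"], ["Rome", "390"]])
```

Walking the tree with `leaves(ftm)` returns only data values, which matches the rule that leaf nodes are data.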
Semantic Enriching of FTM • Labeling • WordNet and GoogleSets • Map FTM to a frame
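The final mapping step can be pictured as rendering labeled attributes into a frame signature. The sketch below assumes the labeling step has already produced a concept name and attribute names; the function `to_frame`, the concept `Tour`, and the range types `STRING` and `NUMBER` are illustrative placeholders, not TARTAR's actual output.

```python
def to_frame(concept, attributes):
    """Render a frame signature in F-Logic-style notation.

    attributes: list of (name, range_type) pairs. The names and range
    types used below are illustrative placeholders.
    """
    body = "; ".join(f"{name} => {rng}" for name, rng in attributes)
    return f"{concept}[{body}]."

frame = to_frame("Tour", [("City", "STRING"), ("Price", "NUMBER")])
print(frame)
```

Each `name => type` pair declares an attribute of the concept together with the type of its values.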
Evaluation • Crawl, extract, and filter web tables • 135 tables • 85.4% success rate • Mostly problems with complex tables • Compare auto-generated frames with human-generated frames • 14 people transformed 3 tables each • 21 total tables (each done twice) • Syntactic/Semantic correctness (Strict and Soft)
Results • Inter-annotator agreement • System-annotator agreement
Benefits • Fully automated knowledge formalization • Arbitrary tables • Independent of domain knowledge • Independent of document type • Explicit semantics of generated frames • Query answering over heterogeneous tables