From Tessellations to Table Interpretation. R. C. Jandhyala¹, M. Krishnamoorthy¹, G. Nagy¹, R. Padmanabhan¹, S. Seth², W. Silversmith¹. ¹DocLab, Rensselaer Polytechnic Institute; ²Computer Science and Engineering, University of Nebraska-Lincoln
From Tessellations to Table Interpretation. R. C. Jandhyala¹, M. Krishnamoorthy¹, G. Nagy¹, R. Padmanabhan¹, S. Seth², W. Silversmith¹. ¹DocLab, Rensselaer Polytechnic Institute; ²Computer Science and Engineering, University of Nebraska-Lincoln (Supported by NSF Grants # 044114854 and 0414644, and Rensselaer Center for Open Source Software)
Goal: Construction of a narrow-domain ontology from semi-structured web data (“table understanding”)
Outline • Tilings (rectangular tessellations) • X-Y trees (1984) • Grammars • Tables • Wang Categories (1996)
Web tables • Cannot precisely define human-understandable tables. • Convert to smaller set of admissible tables. • Why? Algorithmic ease.
Admissible Tables • Have stub, headings and data cells.
Rectangular Tessellations • Partition of an isothetic rectangle into rectangles. • Uniquely defined by junction points (location and type). • Number of tessellations increases rapidly with table size.
XY Tessellations • Special case of rectangular tessellations. • Successive horizontal and vertical cuts. • Easily represented by trees.
A tiling and its X-Y tree (aka slicing structure, puzzle tree, tree map)
Non-slicing structures – No XY tree In fact, X-Y tilings are an infinitesimal fraction of all tilings. This helps, because tables never contain this “spiral” structure.
Fundamental Idea Use XY trees to automate table processing and understanding.
Table to XY tree – EX2XY • Applicable to any XY tessellation. • Input – Excel Table • Copy and paste or Import. • Edit to make admissible. • Output – XY tree • as XML for portability. • as parenthesized string for grammars.
Example (http://www40.statcan.ca/l01/cst01/econ50-eng.htm)
Output - XML … <block id='1.1.2.1' range='17,2:30,2'> <content> Real gross domestic product, expenditure-based, by province and territory (millions of chained (2002) dollars) </content> </block> …
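The XML fragment above can be read back with a standard XML parser. The following is a minimal sketch, not the EX2XY tool itself. It assumes that the `range` attribute holds two comma-separated integer pairs joined by a colon and that the dotted `id` encodes the block's path in the XY tree; neither is stated on the slide. The function name `parse_block` is hypothetical.

```python
import xml.etree.ElementTree as ET

# The single <block> element shown on the slide (surrounding XML is elided there).
SNIPPET = """<block id='1.1.2.1' range='17,2:30,2'>
<content> Real gross domestic product, expenditure-based, by province and
territory (millions of chained (2002) dollars) </content>
</block>"""


def parse_block(xml_text):
    """Parse one <block> element into a plain dictionary.

    Assumptions (not confirmed by the source): 'range' is 'a,b:c,d' with
    integer fields, and the dotted id is a path of child positions.
    """
    elem = ET.fromstring(xml_text)
    start_text, end_text = elem.get("range").split(":")
    start = tuple(int(n) for n in start_text.split(","))
    end = tuple(int(n) for n in end_text.split(","))
    path = [int(n) for n in elem.get("id").split(".")]
    content = elem.findtext("content", default="")
    # Collapse the line breaks and padding inside <content>.
    content = " ".join(content.split())
    return {"path": path, "start": start, "end": end, "content": content}


block = parse_block(SNIPPET)
print(block["path"], block["start"], block["end"])
print(block["content"])
```

Keeping the two halves of `range` as unlabeled tuples avoids guessing which number is the row and which is the column.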
Table Grammars • Can characterize entire families of tables. • Developed grammar for one family. • Input – Nested parenthesized notation. • Output – Accept/Reject as example of family.
Grammar • For parsing column headers:
S := A (Rule 1)
A := {B} (Rule 2)
B := c [X] B | c [X] (Rules 3 and 4)
X := c X | A X | A | c (Rules 5, 6, 7 and 8)
• S is the start symbol. • A generates all admissible column headers. • B generates category trees. • c is a root category. • X generates sub-categories.
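The eight rules above can be checked with a small recursive-descent recognizer. This is a sketch under stated assumptions, not the authors' parser: it treats B as one or more `c [X]` groups and X as one or more items, each either `c` or a nested A, which is equivalent to Rules 3 to 8. Whitespace is ignored, and the function names are hypothetical.

```python
def _parse_a(toks, i):
    """A := {B}. Return the index after the match, or None on failure."""
    if i >= len(toks) or toks[i] != "{":
        return None
    j = _parse_b(toks, i + 1)
    if j is None or j >= len(toks) or toks[j] != "}":
        return None
    return j + 1


def _parse_b(toks, i):
    """B := c [X] B | c [X], i.e. one or more 'c [X]' groups."""
    j, groups = i, 0
    while j < len(toks) and toks[j] == "c":
        if j + 1 >= len(toks) or toks[j + 1] != "[":
            return None
        k = _parse_x(toks, j + 2)
        if k is None or k >= len(toks) or toks[k] != "]":
            return None
        j, groups = k + 1, groups + 1
    return j if groups else None


def _parse_x(toks, i):
    """X := c X | A X | A | c, i.e. one or more items, each 'c' or an A."""
    j, items = i, 0
    while j < len(toks):
        if toks[j] == "c":
            j += 1
        elif toks[j] == "{":
            k = _parse_a(toks, j)
            if k is None:
                return None
            j = k
        else:
            break
        items += 1
    return j if items else None


def accepts(sentence):
    """Return True if the sentence is derivable from S (S := A)."""
    toks = [ch for ch in sentence if not ch.isspace()]
    if any(t not in "c{}[]" for t in toks):
        return False
    return _parse_a(toks, 0) == len(toks)


print(accepts("{c [c c] c [c {c [c c]} c {c [c c]}]}"))
print(accepts("{c}"))
```

Greedy matching is safe here because every B is followed by `}` and every X by `]`, so a group or item that could be consumed never needs to be left for a later rule.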
Table Grammars • Cannot check if table is consistent. • Need further geometric alignment and lexical checks.
Logical Structure of Tables • How to interpret a table? • Describe relationship between header cells and content cells [Wang, U. Waterloo, 1996]. • Wang notation • Elegant description. • Dimensionality: Number of category trees. • Cartesian product maps categories to data.
Layout-independent Wang Notation • Tables with different layouts but the same information have the same Wang Notation.
Wang Category Trees for either table • characteristic: gonsity, hepth • fleck: burlam, falder, multon • Any data cell can be designated by a path through each category tree. • Leaves correspond to row or column headings.
“Real” Table Understanding • Analyzing logical structure not sufficient. • Need additional information from title, footnotes, captions, etc. • Semantic analysis of the labels also important – need external knowledge.
Does Wang Notation always exist? • Not always! • Inconsistent tables do not have Wang Notation. • Others can be edited using virtual headers.
XY tree to Wang Notation Algorithm • Input – XY trees. • Output – XML version of Wang Notation. • Checks for table consistency.
Algorithm • Locate principal regions - stub, headers and content cells. • Extract Wang categories. • Compute Cartesian product of category paths. • Match each key to the content of a delta cell.
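The last two steps above can be sketched in a few lines. The example uses the two category trees from the nonsense-word table (characteristic and fleck), encoded here as nested dictionaries; that encoding and the function names are assumptions for illustration, not the XY2WANG implementation. Instead of inventing cell values, the sketch maps each key to the tuple of leaf positions that would locate its delta cell.

```python
from itertools import product

# Category trees as nested dicts: label -> sub-tree, {} marks a leaf.
CATEGORY_TREES = [
    {"characteristic": {"gonsity": {}, "hepth": {}}},
    {"fleck": {"burlam": {}, "falder": {}, "multon": {}}},
]


def leaf_paths(tree, prefix=()):
    """Yield every root-to-leaf path of one category tree, in order."""
    for label, children in tree.items():
        path = prefix + (label,)
        if children:
            yield from leaf_paths(children, path)
        else:
            yield path


def wang_keys(category_trees):
    """Cartesian product of the leaf paths, one path per category tree."""
    per_tree = [list(leaf_paths(t)) for t in category_trees]
    return list(product(*per_tree))


def key_to_indices(category_trees):
    """Map each key to the leaf positions that locate its delta cell."""
    per_tree = [list(leaf_paths(t)) for t in category_trees]
    index_ranges = [range(len(paths)) for paths in per_tree]
    return dict(zip(product(*per_tree), product(*index_ranges)))


keys = wang_keys(CATEGORY_TREES)
lookup = key_to_indices(CATEGORY_TREES)
print(len(keys))
print(lookup[(("characteristic", "hepth"), ("fleck", "multon"))])
```

Because the keys depend only on the category trees, swapping rows and columns in the layout changes which index addresses a row and which a column, but not the set of keys.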
Conclusions • Admissible layouts identified for ease of processing. • Algorithms developed for • extracting XY trees from tables. • extracting Wang notation from XY trees. • Family of tables identified using a grammar.
Future work • Augmentations - captions, aggregates, units, etc. • Expand the grammar. • Automate conversion of table to admissible formats. (http://www40.statcan.ca/l01/cst01/agri111a-eng.htm)
Goal: construction of a narrow-domain ontology from semi-structured web data (“table understanding”) • Currently multon is the best choice for rapitting velters. It is about 25% better than burlam or falder, which have the same girby (hepth/gonsity ratio). • Check another table to see whether elmer is even better. • NOT TODAY!
H-first tree can be transformed into V-first tree (and vice versa)
EX2XY: Algorithm • Two workhorses: • Vertical_cut – returns leftmost sub-rectangle of a given rectangle. • Horizontal_cut – returns topmost sub-rectangle of a given rectangle.
EX2XY: Algorithm (contd.) • Used in a pair of procedures P1 and P2. • P1 cuts vertically and submits first sub-rectangle to P2 for horizontal cuts. • Similarly with P2.
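The mutual recursion of P1 and P2 can be sketched as follows. This is an illustration of the idea, not the EX2XY code: the table is assumed to be a grid in which every unit cell carries the id of the tile (possibly merged cell) covering it, and distinct tiles have distinct ids. The names `is_single_tile`, `xy_tree`, and the `("V", ...)` / `("H", ...)` node encoding are hypothetical.

```python
def is_single_tile(grid, r0, r1, c0, c1):
    """True if the half-open rectangle [r0,r1) x [c0,c1) is one tile."""
    first = grid[r0][c0]
    return all(grid[r][c] == first
               for r in range(r0, r1) for c in range(c0, c1))


def vertical_cut(grid, r0, r1, c0, c1):
    """Right edge of the leftmost sub-rectangle (c1 if no cut exists)."""
    for k in range(c0 + 1, c1):
        if all(grid[r][k - 1] != grid[r][k] for r in range(r0, r1)):
            return k
    return c1


def horizontal_cut(grid, r0, r1, c0, c1):
    """Bottom edge of the topmost sub-rectangle (r1 if no cut exists)."""
    for k in range(r0 + 1, r1):
        if all(grid[k - 1][c] != grid[k][c] for c in range(c0, c1)):
            return k
    return r1


def p1(grid, r0, r1, c0, c1, tried_other=False):
    """Cut vertically and hand each piece to p2 for horizontal cuts."""
    if is_single_tile(grid, r0, r1, c0, c1):
        return grid[r0][c0]
    pieces, left = [], c0
    while left < c1:
        right = vertical_cut(grid, r0, r1, left, c1)
        pieces.append((left, right))
        left = right
    if len(pieces) == 1:  # no vertical cut: try the other direction once
        if tried_other:
            raise ValueError("not an XY tessellation")
        return p2(grid, r0, r1, c0, c1, tried_other=True)
    return ("V", [p2(grid, r0, r1, a, b) for a, b in pieces])


def p2(grid, r0, r1, c0, c1, tried_other=False):
    """Cut horizontally and hand each piece to p1 for vertical cuts."""
    if is_single_tile(grid, r0, r1, c0, c1):
        return grid[r0][c0]
    pieces, top = [], r0
    while top < r1:
        bottom = horizontal_cut(grid, top, r1, c0, c1)
        pieces.append((top, bottom))
        top = bottom
    if len(pieces) == 1:  # no horizontal cut: try the other direction once
        if tried_other:
            raise ValueError("not an XY tessellation")
        return p1(grid, r0, r1, c0, c1, tried_other=True)
    return ("H", [p1(grid, a, b, c0, c1) for a, b in pieces])


def xy_tree(grid):
    """Build the XY tree of a whole grid, starting with vertical cuts."""
    return p1(grid, 0, len(grid), 0, len(grid[0]))


# A title row spanning two columns above two separate cells.
print(xy_tree([["t", "t"], ["a", "b"]]))
```

A pinwheel layout such as `[["a","a","b"],["d","e","b"],["d","c","c"]]` admits no full cut in either direction, so the sketch raises `ValueError`, matching the non-slicing case.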
Parenthesized notation • P-notation has 1:1 correspondence with general trees. • For above table, the XY tree sentence is: Sxy = {c [c c] c [c {c [c c]} c {c [c c]}]}.
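The 1:1 correspondence can be illustrated by converting a P-notation sentence to a tree and back. This is a minimal sketch: internal nodes are stored as (opening bracket, children) pairs and leaves as the string "c", an encoding chosen here for illustration. Which bracket type denotes horizontal and which vertical cuts is not stated on the slide, so the sketch only preserves the bracket characters.

```python
CLOSERS = {"{": "}", "[": "]"}


def parse_p_notation(sentence):
    """Turn a P-notation string into a nested (bracket, children) tree."""
    root = []
    stack = [(None, root)]  # sentinel frame collects the top-level node
    for ch in sentence:
        if ch.isspace():
            continue
        if ch == "c":
            stack[-1][1].append("c")
        elif ch in CLOSERS:
            node = (ch, [])
            stack[-1][1].append(node)
            stack.append(node)
        elif ch in "}]":
            opener = stack[-1][0]
            if opener is None or CLOSERS[opener] != ch:
                raise ValueError("mismatched bracket: " + ch)
            stack.pop()
        else:
            raise ValueError("unexpected symbol: " + ch)
    if len(stack) != 1 or len(root) != 1:
        raise ValueError("sentence must contain exactly one complete tree")
    return root[0]


def to_p_notation(node):
    """Inverse of parse_p_notation, with single spaces between siblings."""
    if node == "c":
        return "c"
    opener, children = node
    inner = " ".join(to_p_notation(child) for child in children)
    return opener + inner + CLOSERS[opener]


def count_leaves(node):
    """Number of leaf cells (c symbols) under a node."""
    if node == "c":
        return 1
    return sum(count_leaves(child) for child in node[1])


SXY = "{c [c c] c [c {c [c c]} c {c [c c]}]}"
tree = parse_p_notation(SXY)
print(to_p_notation(tree) == SXY)
print(count_leaves(tree))
```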
XY2WANG: Other features • Handles more complex scenarios: • Higher dimensionality. • Deeper nesting of headers. • Repetitive headers.
Conclusion • Average total time to process a table – 231 seconds. • Average table size – 587 cells before preprocessing. • Average preprocessing time – 104 seconds. • Three-category tables took approximately 27 seconds longer than two-category tables.
Conclusion (Contd.) • Tables with aggregates and footnotes take more time to process. • Strong correlation between processing time and table size. • For future: automatically segmenting augmentations, categories and delta cells using visual cues.