From Tessellations to Table Interpretation

Ramana C. Jandhyala, DocLab, RPI

Introduction • Novel aspects of our work • Focus on computer-constructed web tables • Using commercial software • Describing tables using XY trees • Extracting the relationship of headers to content cells • Formalizes the 200-table experiment conducted by Raghav. These tables were imported from 10 websites into Excel and manually edited into a form that can be processed algorithmically. • Average editing time – 104 sec. • Average table size – 587 cells. • Augmentations are not considered!

Rectangular Tessellations • Rectangular Tiling/Discrete Rectangular Tessellation • Partition of an isothetic rectangle into rectangles • Geometry uniquely defined by locations and types of junction points • Number Nall(m) increases exponentially with table size. • XY Tessellations • Special case of rectangular tessellations • Obtained by successive horizontal and vertical cuts • The fraction of tilings that are XY tilings decreases rapidly with size (Klarner-Magliveras), i.e. lim_{m→∞} Nxy(m) / Nall(m) = 0

Taxonomy of web tables • All tables have a stub, row headings, column headings and data cells. • Some common layouts – admissible tessellations

Taxonomy of web tables (contd.) • Human-understandable tables – NT,S,xy(m): mathematically indefinable, and their number is unknown • Convert them to a smaller set of admissible tables – NA,S,xy(m) • Layout-equivalent tables are enough for algorithmic analysis.

Taxonomy of web tables (contd.) • Number of different layout-equivalent admissible candidates – NL,S,xy(m) • For now, NL,S,xy(m) < NA,S,xy(m) • Context-free grammars – characterize entire families of layout-equivalent tables

Logical Structure of Tables • XY trees capture only the physical layout • To understand a table – need to analyse its logical structure, i.e. the relationship between header cells and content cells [Wang]. • Wang notation – consists of category trees (headings) and delta cells (content). • Number of category trees – dimensionality of the table • The Cartesian product of the category trees leads to the delta cells. • Size of table – product of the number of rows and columns of delta cells

Logical Structure of Tables (contd.) • Well-formed tables – Labeled table candidates for which Wang Notation exists • Most tables not well-formed, but easily convertible into well-formed format using virtual headers. • Analyzing logical structure not sufficient for table understanding!

Our project – front end for creating narrow-domain ontologies by combining information from web tables • Our work is based on the following inequalities: NL,S,xy(m) < NA,S,xy(m) < NT,S,xy(m) << NS,xy(m) << Nxy(m) << Nall(m) • Examples of each class shown in next slide.

Tessellations to XY trees • Horizontally and vertically ordered lists of junction points – not sufficient for reconstructing XY tree! • Do not capture the adjacency topology. • Need coordinates and junction types (NE-corner, T-junction, crossing etc.)

Table to XY tree – EX2XY • Applicable to any tessellation for which an XY tree exists. • Input – Excel table • Output – XY tree (parenthesized notation) • Algorithm: • CutV(R) – cuts a rectangle R vertically and returns the leftmost sub-rectangle. • CutH(R) – cuts R horizontally and returns the topmost sub-rectangle. • Both are used in a pair of procedures, P1 and P2, which call each other recursively. • P1 cuts the given rectangle vertically and submits the first sub-rectangle to P2 for horizontal cuts. • P2 does the converse: it cuts horizontally and submits the first sub-rectangle to P1 for vertical cuts. • The main procedure calls P1 for vertical cuts and P2 for horizontal cuts.
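The mutual recursion of P1 and P2 can be sketched in Python as follows. This is only an illustration under several assumptions that are not in the slides: the table is given as a grid of cell identifiers (a merged cell repeats its identifier over every position it covers), all cuts of a rectangle are found at once rather than one sub-rectangle at a time as CutV and CutH do, and the bracket convention ([ ] for side-by-side pieces, { } for stacked pieces) is inferred from the sample output. The names `single_cell`, `p1` and `p2` are hypothetical.

```python
def single_cell(grid, top, left, bottom, right):
    """Return the cell id if the rectangle is covered by one cell, else None."""
    ids = {grid[r][c] for r in range(top, bottom + 1) for c in range(left, right + 1)}
    return ids.pop() if len(ids) == 1 else None


def p1(grid, top, left, bottom, right, stuck=False):
    """Cut the rectangle vertically; hand every piece to p2 (assumed convention: [ ])."""
    leaf = single_cell(grid, top, left, bottom, right)
    if leaf is not None:
        return leaf
    # A vertical cut before column c is legal if no cell spans columns c-1 and c.
    cuts = [c for c in range(left + 1, right + 1)
            if all(grid[r][c - 1] != grid[r][c] for r in range(top, bottom + 1))]
    if not cuts:
        if stuck:  # p2 could not cut this rectangle either
            raise ValueError("tessellation has no XY tree")
        return p2(grid, top, left, bottom, right, stuck=True)
    bounds = [left, *cuts, right + 1]
    parts = [p2(grid, top, a, bottom, b - 1) for a, b in zip(bounds, bounds[1:])]
    return "[" + " ".join(parts) + "]"


def p2(grid, top, left, bottom, right, stuck=False):
    """Cut the rectangle horizontally; hand every piece to p1 (assumed convention: { })."""
    leaf = single_cell(grid, top, left, bottom, right)
    if leaf is not None:
        return leaf
    # A horizontal cut before row r is legal if no cell spans rows r-1 and r.
    cuts = [r for r in range(top + 1, bottom + 1)
            if all(grid[r - 1][c] != grid[r][c] for c in range(left, right + 1))]
    if not cuts:
        if stuck:  # p1 could not cut this rectangle either
            raise ValueError("tessellation has no XY tree")
        return p1(grid, top, left, bottom, right, stuck=True)
    bounds = [top, *cuts, bottom + 1]
    parts = [p1(grid, a, left, b - 1, right) for a, b in zip(bounds, bounds[1:])]
    return "{" + " ".join(parts) + "}"


# A spanning header over two columns, and a pinwheel that admits no XY tree.
header = [["Year", "Year"],
          ["2004", "2005"]]
pinwheel = [["a", "a", "b"],
            ["d", "e", "b"],
            ["d", "c", "c"]]
print(p1(header, 0, 0, 1, 1))
```

Calling `p1(pinwheel, 0, 0, 2, 2)` raises `ValueError`, since neither a vertical nor a horizontal cut crosses the whole pinwheel.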

Example – Original HTML table

Example (contd.) – After import into Excel

Example – After Editing

Parenthetical version of the output

    ( [
      {
        ::15,2:15,2
        ::16,2:16,2
        Real gross domestic product, expenditure-based, by province and territory (millions of chained (2002) dollars)::17,2:30,2
      }
      {
        ::15,3:15,3
        ::16,3:16,3
        Canada::17,3:17,3
        Newfoundland and Labrador::18,3:18,3
        Prince Edward Island::19,3:19,3
        Nova Scotia::20,3:20,3
        New Brunswick::21,3:21,3
        Quebec::22,3:22,3
        Ontario::23,3:23,3
        Manitoba::24,3:24,3
        Saskatchewan::25,3:25,3
        Alberta::26,3:26,3
        British Columbia::27,3:27,3
        Yukon::28,3:28,3
        Northwest Territories::29,3:29,3
        Nunavut::30,3:30,3
      }
      {
        Year::15,4:15,8
        [ 2004::16,4:16,4 2005::16,5:16,5 2006::16,6:16,6 2007::16,7:16,7 2008::16,8:16,8 ]
        . . .

XML version of the output

    . . .
    <block id='1.1.2.1' range='17,2:30,2'>
      <content>Real gross domestic product, expenditure-based, by province and territory (millions of chained (2002) dollars)</content>
    </block>
    <block id='1.1.2.2' range='17,3:30,3'>
      <content></content>
    </block>
    <block id='1.2.2.1' range='16,4:16,4'>
      <content>2004</content>
    </block>
    <block id='1.2.2.2' range='16,5:16,5'>
      <content>2005</content>
    </block>
    <block id='1.2.2.3' range='16,6:16,6'>
      <content>2006</content>
    </block>
    <block id='1.2.2.4' range='16,7:16,7'>
      <content>2007</content>
    </block>
    . . .

A snippet of the output (both parenthetical and XML outputs)

Grammar for tables • The grammar uses nested parenthetical notation (P-notation). • P-notation has 1:1 correspondence with general trees. • For above table, the XY tree sentence is: Sxy = {c [c c] c [c {c [c c]} c {c [c c]}]} (neglecting the textual labels)
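The 1:1 correspondence between P-notation and trees can be illustrated with a small round trip in Python. This is a sketch under assumptions: textual labels are dropped so every leaf is a single character, an internal node is represented as a hypothetical `(bracket, children)` pair, and the function names `parse_p_notation` and `to_p_notation` are made up for this example.

```python
CLOSER = {"{": "}", "[": "]"}


def parse_p_notation(sentence):
    """Parse a P-notation string into nested (bracket, children) tuples."""
    stack = [[]]   # child lists of the nodes that are still open
    kinds = []     # opening brackets of the nodes that are still open
    for ch in sentence:
        if ch.isspace():
            continue
        if ch in CLOSER:
            stack.append([])
            kinds.append(ch)
        elif ch in CLOSER.values():
            if not kinds or CLOSER[kinds[-1]] != ch:
                raise ValueError("unbalanced brackets")
            children = stack.pop()
            stack[-1].append((kinds.pop(), children))
        else:
            stack[-1].append(ch)  # a leaf such as 'c'
    if kinds or len(stack[0]) != 1:
        raise ValueError("expected exactly one complete tree")
    return stack[0][0]


def to_p_notation(node):
    """Serialize the nested tuples back into a P-notation string."""
    if isinstance(node, str):
        return node
    kind, children = node
    return kind + " ".join(to_p_notation(child) for child in children) + CLOSER[kind]


sxy = "{c [c c] c [c {c [c c]} c {c [c c]}]}"
tree = parse_p_notation(sxy)
print(to_p_notation(tree) == sxy)
```

Parsing and then serializing returns the original sentence, which is what the correspondence requires for this example.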

Grammar • Grammar for parsing the column headers of all such layout-equivalent tessellations: • S := A (Rule 1) • A := {B} (Rule 2) • B := c [X] B | c [X] (Rules 3 and 4) • X := c X | A X | A | c (Rules 5, 6, 7 and 8) • where • S – start symbol • A – nonterminal that generates all admissible strings for column headers • B – generates one or more instances of categories in the form c [X] • Each c becomes a root category and X generates its subcategory tree • X generates strings of length one or more with arbitrary occurrences of c and A. • The derivation for the previous example using an LALR parser is shown on the next slide
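The eight rules can be checked against the example sentence with a short recognizer. The sketch below is a hand-written recursive-descent recognizer in Python, not the LALR parser mentioned on the slide; it assumes single-character tokens with whitespace ignored, and the names `accepts`, `ParseError` and the `_parse_*` helpers are hypothetical.

```python
class ParseError(ValueError):
    pass


def _expect(tokens, i, symbol):
    if i >= len(tokens) or tokens[i] != symbol:
        raise ParseError(f"expected {symbol!r} at position {i}")
    return i + 1


def _parse_a(tokens, i):
    # A := {B}
    i = _expect(tokens, i, "{")
    i = _parse_b(tokens, i)
    return _expect(tokens, i, "}")


def _parse_category(tokens, i):
    # One instance of c [X]
    i = _expect(tokens, i, "c")
    i = _expect(tokens, i, "[")
    i = _parse_x(tokens, i)
    return _expect(tokens, i, "]")


def _parse_b(tokens, i):
    # B := c [X] B | c [X]   (one or more categories)
    i = _parse_category(tokens, i)
    while i < len(tokens) and tokens[i] == "c":
        i = _parse_category(tokens, i)
    return i


def _parse_x(tokens, i):
    # X := c X | A X | A | c   (one or more of c or A)
    count = 0
    while i < len(tokens) and tokens[i] in ("c", "{"):
        i = i + 1 if tokens[i] == "c" else _parse_a(tokens, i)
        count += 1
    if count == 0:
        raise ParseError(f"empty subcategory list at position {i}")
    return i


def accepts(sentence):
    """True if the sentence is derivable from S := A."""
    tokens = [ch for ch in sentence if not ch.isspace()]
    try:
        return _parse_a(tokens, 0) == len(tokens)
    except ParseError:
        return False


print(accepts("{c [c c] c [c {c [c c]} c {c [c c]}]}"))
```

The grammar is simple enough that one token of lookahead decides every rule, which is why a recursive-descent sketch suffices here.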

The example demonstrates both the power and the limitation of grammars. • A grammar can recognize broad classes. • But a grammar cannot check that headings are proper labels for well-formed tables. • Even if a table is accepted by the grammar, additional geometric alignment and lexical checks are needed to verify the Wang notation.

XY tree to Wang Notation • XY2WANG converts an XY tree generated from a restricted family of admissible tables to Wang Notation. • Example: • Uses an indented table-of-contents format as a data structure.

XY2WANG • Input – XY trees with arbitrary number of categories and arbitrary nesting. • Output – XML version of Wang Notation • For a table T = (C, δ), • Category Notation: C = { (A, {(A1, phi), (A2, phi)}), (B, {(B1, phi), (B2, phi), (B3, phi)}) } • Delta mappings: δ({A.A1, B.B1}) = d11, δ({A.A1, B.B2}) = d12, …
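The category notation and the keys of the delta mapping can be written out in a few lines of Python. This is a sketch, assuming the category trees are stored as nested dictionaries with an empty dictionary standing in for phi; the function names `leaf_paths` and `delta_keys` are hypothetical and the XML output of XY2WANG is not reproduced.

```python
from itertools import product


def leaf_paths(name, subtree):
    """Dotted paths from a category root to each of its leaves."""
    if not subtree:  # phi: no subcategories
        return [name]
    paths = []
    for child, grandchildren in subtree.items():
        paths.extend(f"{name}.{path}" for path in leaf_paths(child, grandchildren))
    return paths


def delta_keys(categories):
    """Cartesian product of the leaf paths of all category trees."""
    per_category = [leaf_paths(name, sub) for name, sub in categories.items()]
    return list(product(*per_category))


# C = { (A, {(A1, phi), (A2, phi)}), (B, {(B1, phi), (B2, phi), (B3, phi)}) }
C = {
    "A": {"A1": {}, "A2": {}},
    "B": {"B1": {}, "B2": {}, "B3": {}},
}
keys = delta_keys(C)
print(len(keys), keys[0], keys[1])
```

For this two-dimensional table the product has 2 × 3 = 6 keys, one per delta cell, beginning with ('A.A1', 'B.B1') and ('A.A1', 'B.B2').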

XY2WANG: Algorithm • Algorithm: • First locate the 4 principal regions – stub, row headers, column headers and content cells. • Extract Wang labeled domains under the assumption that each spanning cell is the header of the smaller cells either to its right (row headers) or below it (column headers). • Compute the Cartesian product of category paths and match each key to the content of a delta cell.
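The spanning-cell assumption for column headers can be sketched as follows: every header cell that lies above a leaf header and covers its column range is treated as an ancestor of that leaf. The Python below is an illustration only; the `Cell` record, the reading of ranges as row,column pairs, and the function `column_paths` are assumptions made for this example, and row headers (spanning cells to the left) would be handled symmetrically.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    label: str
    top: int      # first row
    left: int     # first column
    bottom: int   # last row
    right: int    # last column


def column_paths(header_cells):
    """Category path for each leaf column header, ordered left to right."""
    leaf_row = max(cell.bottom for cell in header_cells)
    leaves = sorted((c for c in header_cells if c.bottom == leaf_row),
                    key=lambda c: c.left)
    paths = []
    for leaf in leaves:
        # Ancestors: cells strictly above the leaf that span its columns.
        chain = [c for c in header_cells
                 if c.bottom < leaf.top and c.left <= leaf.left and leaf.right <= c.right]
        chain.sort(key=lambda c: c.top)
        paths.append(".".join([c.label for c in chain] + [leaf.label]))
    return paths


# Column header region of the example: Year::15,4:15,8 over 2004 ... 2008.
cells = [Cell("Year", 15, 4, 15, 8)] + [
    Cell(str(2004 + k), 16, 4 + k, 16, 4 + k) for k in range(5)
]
print(column_paths(cells))
```

For the example header region this yields the paths Year.2004 through Year.2008, which are the column-side keys that the Cartesian product step combines with the row-side keys.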

XY2WANG: Table-of-contents data structure • Example of a table and its corresponding table-of-contents data structure is shown

XY2WANG also handles more complex scenarios like: • Higher Wang dimensionality • Deeper nesting of headers • Repetitive headers • Detection of not well-formed tables • These are included in the following pseudocode

Conclusion • The hierarchical structure of categories and the flat structure of data cells are recovered from XY trees. • Geometric and topological equivalence classes on tessellations and their XY trees are defined. • Commonly encountered tables are examples of such classes. • These tables are identified by parsing XY trees with a grammar. • Assuming the header labels are consistent, Wang category notation is extracted.

Future work • Account for aggregates – major component of web tables. • Need to integrate other augmentations (footnotes, units, captions etc.) • Expand on the grammar: current version accounts only for column headers. • Automate the conversion from imported web tables to standard formats. • Semantic interpretation of groups of conceptually overlapping tables based on precise representation of layout-invariant syntax.

Current Work • Converting web tables to standard formats for ease of processing. • Internal conventions: A’, A’’, hybrids • Learning from XY trees using tree edit distance • Learning from existing manipulations. • Ex: The user modifies table T1 to a standard format T1’. The steps are all recorded. Now use this information to predict the standard format of a new table T2.

Current work (contd.) • Relation of tree-edit distance to pre-order and post-order string edit distance • Some interesting results and conjectures, but still half-baked! • (Result) Pre- and post-order traversals are enough for reconstructing a general tree. • (Conjecture) For 2 XY trees, the distances between corresponding pre- and post-order strings are equal, but not for general trees! • (Conjecture) For 2 XY trees, the tree-edit distance is equal to the pre/post-order distances • Are tables with the same content, but different layouts, collinear (in terms of string/tree edit distance)? • Developing software to calculate tree edit distances should clear up many things. (Any suggestions?)
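The reconstruction result can be illustrated with a short Python sketch. It assumes an ordered tree whose node labels are all distinct, which the slide does not state; with repeated labels the lookup below is not guaranteed to find the right node. The function name `rebuild` and the `(label, children)` representation are hypothetical.

```python
def rebuild(pre, post):
    """Rebuild an ordered tree from its pre-order and post-order label lists.

    Assumes distinct labels. Returns (label, [child subtrees]).
    """
    if not pre or len(pre) != len(post) or pre[0] != post[-1]:
        raise ValueError("traversals do not describe the same tree")
    children = []
    i = 1  # position in pre of the next child's root
    j = 0  # position in post where the next child's subtree starts
    while i < len(pre):
        # The child's root closes its subtree in post-order.
        size = post.index(pre[i], j) - j + 1
        children.append(rebuild(pre[i:i + size], post[j:j + size]))
        i += size
        j += size
    return (pre[0], children)


# Tree A(B(D, E), C)
pre_order = ["A", "B", "D", "E", "C"]
post_order = ["D", "E", "B", "C", "A"]
print(rebuild(pre_order, post_order))
```

The first label after the root in pre-order is the first child, and its position in post-order gives the size of that child's subtree; repeating this peels off the children one by one.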
