Table Understanding in DIADEM. Giorgio Orsi (1,2) and Ben Watson (2). (1) Institute for the Future of Computing, University of Oxford; (2) Department of Computer Science, University of Oxford. DIADEM 1.0.
Table Understanding • A process that locates (or recognizes), analyses, and interprets a tabular structure, with the goal of • classifying it (layout vs. data table), • extracting its data, • translating it, or • other tasks.
What is a Table? • Penn et al. ’01: a 2D assembly of cells, where • each cell is short in length and • contains no complex structures, and • there is semantic and syntactic coherence within the rows and columns.
What Information do we Have?
• HTML
• CSS Boxes
    <table border="1">
      <tbody>
        <tr> <th colspan="2">NAME</th> <th rowspan="2">D.O.B.</th> </tr>
        <tr> <th>FIRST NAME</th> <th>SURNAME</th> </tr>
        <tr> <td>Sue</td> <td>Adams</td> <td>12th June 1980</td> </tr>
        <tr> <td>Jim</td> <td>Wright</td> <td>19th May 2000</td> </tr>
      </tbody>
    </table>
• Domain: ox:person with properties ox:firstName (xsd:string), ox:surname (xsd:string) and ox:dob (xsd:date)
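To make the role of colspan and rowspan in the example concrete, here is a minimal sketch that expands the table into a rectangular grid. It assumes well-formed markup that Python's `xml.etree.ElementTree` can parse (real pages would need a tolerant HTML parser), and the name `to_grid` is hypothetical, not part of DIADEM.

```python
import xml.etree.ElementTree as ET

HTML = """<table border="1"><tbody>
<tr><th colspan="2">NAME</th><th rowspan="2">D.O.B.</th></tr>
<tr><th>FIRST NAME</th><th>SURNAME</th></tr>
<tr><td>Sue</td><td>Adams</td><td>12th June 1980</td></tr>
<tr><td>Jim</td><td>Wright</td><td>19th May 2000</td></tr>
</tbody></table>"""

def to_grid(table):
    """Expand colspan/rowspan so every grid position holds a cell text."""
    grid = {}
    # Rows directly under <table> or under one section element (e.g. <tbody>).
    rows = table.findall("./tr") + table.findall("./*/tr")
    for r, tr in enumerate(rows):
        c = 0
        for cell in tr:
            if cell.tag not in ("td", "th"):
                continue
            # Skip positions already filled by a rowspan from a row above.
            while (r, c) in grid:
                c += 1
            text = "".join(cell.itertext()).strip()
            rowspan = int(cell.get("rowspan", 1))
            colspan = int(cell.get("colspan", 1))
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]

grid = to_grid(ET.fromstring(HTML))
for row in grid:
    print(row)
```

In the resulting grid the spanning headers NAME and D.O.B. are repeated in every position they cover, so each data column can be read together with both header levels.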
Why Table Understanding in DIADEM? • recognize and extract data in tabular format • layout tables • data tables • understand forms and result pages • labelling • segmentation • let us focus first on HTML tables (e.g., <table>)
Leaf Tables • Goal: determine whether a table contains any inner table (recursive check) • if T1 contains T2 (i.e., there is a <table> element in the subtree rooted at T1), then T1 is a layout table.
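The leaf-table check can be sketched in a few lines. This is an illustration only: it assumes well-formed markup parsed with `xml.etree.ElementTree`, and the function name `is_leaf_table` is hypothetical.

```python
import xml.etree.ElementTree as ET

def is_leaf_table(table):
    """True if no other <table> occurs in the subtree rooted at `table`."""
    # Element.iter() also yields the starting element itself, so a leaf
    # table is one where the only <table> found is `table`.
    return all(el is table for el in table.iter("table"))

outer = ET.fromstring(
    "<table><tr><td>"
    "<table><tr><td>x</td></tr></table>"
    "</td></tr></table>"
)
inner = outer.find(".//table")

print(is_leaf_table(outer))  # outer contains inner, so it is treated as layout
print(is_leaf_table(inner))
```

Under the rule on the slide, a table for which `is_leaf_table` returns False is classified as a layout table.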
Row and Column Count • Goal: identify “sane” tables • at least two coherent adjacent cells (TD, DIV, TH) • e.g., two data cells, two header cells, or one header cell and one data cell • allow 1D tables (i.e., vectors) • allow empty tables
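One possible reading of this rule, sketched under stated assumptions: two cells count as adjacent if they sit side by side in a row or in consecutive rows, so row and column vectors pass. The sketch only looks at TD and TH (not DIV-based cells), does not test "coherence", and counts cells whether or not they contain text. The helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

CELL_TAGS = ("td", "th")

def cell_rows(table):
    """Cells of each row owned by this table (rows of nested tables excluded)."""
    rows = table.findall("./tr") + table.findall("./*/tr")
    return [[c for c in tr if c.tag in CELL_TAGS] for tr in rows]

def has_adjacent_cells(table):
    rows = [r for r in cell_rows(table) if r]
    side_by_side = any(len(r) >= 2 for r in rows)  # horizontal neighbours
    stacked = len(rows) >= 2                       # vertical neighbours
    return side_by_side or stacked

row_vector = ET.fromstring("<table><tr><th>Name</th><td>Sue</td></tr></table>")
col_vector = ET.fromstring("<table><tr><td>a</td></tr><tr><td>b</td></tr></table>")
single = ET.fromstring("<table><tr><td>only</td></tr></table>")

print(has_adjacent_cells(row_vector), has_adjacent_cells(col_vector),
      has_adjacent_cells(single))
```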
Longest String • Goal: identify “sane” cells • find the longest string w in every cell; T is a data table if |w| < δ • layout tables are likely to contain a large amount of text • ignore text nodes associated with <SELECT>, <FORM> and <TABLE> elements, both in their subtree and among their siblings
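A minimal sketch of the longest-string rule follows. The slide does not give a value for δ, so the default `delta=80` characters is purely a placeholder. The sketch skips text inside `<select>`, `<form>` and `<table>` subtrees only (sibling text is not treated specially) and assumes well-formed, lowercase markup; all function names are hypothetical.

```python
import xml.etree.ElementTree as ET

SKIP = {"select", "form", "table"}

def text_nodes(el):
    """Yield stripped text nodes below `el`, skipping SKIP subtrees."""
    if el.text and el.text.strip():
        yield el.text.strip()
    for child in el:
        if child.tag not in SKIP:
            yield from text_nodes(child)
        if child.tail and child.tail.strip():
            yield child.tail.strip()

def longest_string(table):
    """Length of the longest text node found in any cell of the table."""
    rows = table.findall("./tr") + table.findall("./*/tr")
    cells = [c for tr in rows for c in tr if c.tag in ("td", "th")]
    return max((len(t) for c in cells for t in text_nodes(c)), default=0)

def passes_longest_string(table, delta=80):  # delta: placeholder threshold
    return longest_string(table) < delta

data = ET.fromstring(
    "<table><tr><td>Sue</td><td>pick <select><option>"
    + "x" * 200 +
    "</option></select></td></tr></table>"
)
layout = ET.fromstring("<table><tr><td>" + "lorem " * 50 + "</td></tr></table>")

print(longest_string(data), passes_longest_string(data))
print(longest_string(layout), passes_longest_string(layout))
```

The 200-character option inside the `<select>` is ignored, so the first table is judged on "pick" and "Sue" alone.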
Empty Cell • Goal: identify “sane” cells • find empty cells; T is a data table if it contains no empty cells • layout tables are likely to contain empty cells
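The empty-cell rule might be sketched as below. The slide does not define "empty"; this sketch assumes a cell is empty when it has no non-whitespace text and no child elements, so a cell holding only an image is not counted as empty. Function names are hypothetical.

```python
import xml.etree.ElementTree as ET

def is_empty_cell(cell):
    """Assumed definition: no visible text and no child elements."""
    has_text = any(t.strip() for t in cell.itertext())
    return not has_text and len(cell) == 0

def has_no_empty_cells(table):
    cells = [el for el in table.iter() if el.tag in ("td", "th")]
    return not any(is_empty_cell(c) for c in cells)

full = ET.fromstring("<table><tr><td>Sue</td><td>Adams</td></tr></table>")
spacer = ET.fromstring("<table><tr><td>Menu</td><td>  </td></tr></table>")

print(has_no_empty_cells(full), has_no_empty_cells(spacer))
```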
TH Check • Goal: identify “sane” tables • find <TH> elements in a table • layout tables are not likely to contain <TH> elements
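The TH check reduces to a single lookup. A minimal sketch, assuming well-formed lowercase markup and a hypothetical function name:

```python
import xml.etree.ElementTree as ET

def has_header_cells(table):
    """True if at least one <th> element occurs anywhere in the table."""
    return table.find(".//th") is not None

with_th = ET.fromstring("<table><tr><th>NAME</th></tr><tr><td>Sue</td></tr></table>")
without_th = ET.fromstring("<table><tr><td>logo</td><td>menu</td></tr></table>")

print(has_header_cells(with_th), has_header_cells(without_th))
```

Note that this also finds `<th>` elements of nested tables; combined with the leaf-table check, that case does not arise for tables that pass it.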
Picture • Goal: identify “sane” cells • check the size of pictures in a cell • T is a data table if p-area<δ • layout tables are likely to contain large pictures • e.g., ads and logos
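As an illustration of the picture rule, the sketch below estimates picture area from the `width` and `height` attributes of `<img>` elements. This is an assumption made for the example: the rendered CSS boxes would give more reliable sizes. The threshold `delta=10000` square pixels is a placeholder, since the slide does not state δ, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

def max_picture_area(table):
    """Largest width*height (in px) over all <img> elements with numeric sizes."""
    areas = []
    for img in table.iter("img"):
        try:
            areas.append(int(img.get("width", 0)) * int(img.get("height", 0)))
        except ValueError:
            continue  # non-numeric sizes such as "50%" are skipped
    return max(areas, default=0)

def passes_picture_rule(table, delta=10000):  # delta: placeholder threshold
    return max_picture_area(table) < delta

banner = ET.fromstring(
    '<table><tr><td><img src="ad.png" width="468" height="60"/></td></tr></table>'
)
icon = ET.fromstring(
    '<table><tr><td><img src="pdf.png" width="16" height="16"/> Report</td></tr></table>'
)

print(max_picture_area(banner), passes_picture_rule(banner))
print(max_picture_area(icon), passes_picture_rule(icon))
```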
Combining Rules • Identify the combination of rules that maximizes the recognition accuracy • cut-off estimation • best-guess estimation: if T passes all the rules → data table • cut-off calculation: cut-off = performance of each rule; if T passes all the rules → data table • machine learning • decision trees → white-box model
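Two ways of combining rule outcomes can be sketched as follows. `all_rules_pass` is the plain AND combination named on the slide. `weighted_vote` is only one possible illustration of letting a rule's measured performance influence the decision; it is not the cut-off calculation from the talk, and the rule names, weights and threshold are invented for the example.

```python
def all_rules_pass(results):
    """AND combination: T is a data table only if every rule passes."""
    return all(results.values())

def weighted_vote(results, weights, threshold=0.5):
    """Hypothetical alternative: weight each rule by its measured performance."""
    total = sum(weights[name] for name in results)
    if total == 0:
        return False
    passed = sum(weights[name] for name, ok in results.items() if ok)
    return passed / total >= threshold

# Outcome of each rule for one table T (illustrative values).
results = {"leaf": True, "th": False, "empty_cell": True}
# Made-up per-rule weights standing in for measured rule performance.
weights = {"leaf": 0.9, "th": 0.6, "empty_cell": 0.5}

print(all_rules_pass(results))
print(weighted_vote(results, weights))
```

With these values the AND combination rejects the table because one rule fails, while the weighted vote accepts it, which shows how sensitive the classification is to the way rules are combined.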
Evaluation: Cut-Off Estimation • First run: all rules combined with AND • Second run: without the empty-cell rule • Third run: without the empty-cell and table-size rules • Fourth run: without the empty-cell, table-size and picture rules
Evaluation: Cut-Off Computation • First run: all rules combined with AND • Second run: without the empty-cell and table-size rules
Evaluation: Decision Tree • Facts: • 65% training • 35% 10-fold validation • precision: 0.807 • recall: 0.836 • F-measure: 0.821 • Comparison: • F-Measure 0.740 (Gatterbauer)
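The reported F-measure is consistent with the reported precision and recall if it is the usual harmonic mean, F = 2PR / (P + R). That reading is an assumption, since the slide does not state the formula; a quick check:

```python
def f_measure(precision, recall):
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

f = f_measure(0.807, 0.836)
print(round(f, 3))
```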
Discussion • Most of the errors are caused by missing information or a bad combination of rules. • use visual and semantic information • combine the heuristics in an “organic” way • PDF-inspired extraction • guided by the HTML and CSS structure • use a reference model, as in form and result-page analysis