Table Understanding in DIADEM

Giorgio Orsi (1,2) and Ben Watson (2). (1) Institute for the Future of Computing, University of Oxford; (2) Department of Computer Science, University of Oxford. DIADEM 1.0.


Presentation Transcript


  1. Table Understanding in DIADEM • Giorgio Orsi (1,2) and Ben Watson (2) • (1) Institute for the Future of Computing, University of Oxford • (2) Department of Computer Science, University of Oxford • DIADEM 1.0

  2. Table Understanding • Process that • locates (or recognizes), • analyses and • interprets • a tabular structure with the goal to • classify it (layout vs data tables), • extract data, • translate it, or • other.

  3. What is a Table? • Penn et al. ’01 • a 2D assembly of cells, where • each cell is short in length and • contains no complex structures, and • there is semantic and syntactic coherence within the rows and columns.

  4. What is a Table?

  5. What Information do we Have? • HTML • CSS Boxes • Domain

     HTML example:

         <table border="1">
           <tbody>
             <tr> <th colspan="2">NAME</th> <th rowspan="2">D.O.B.</th> </tr>
             <tr> <th>FIRST NAME</th> <th>SURNAME</th> </tr>
             <tr> <td>Sue</td> <td>Adams</td> <td>12th June 1980</td> </tr>
             <tr> <td>Jim</td> <td>Wright</td> <td>19th May 2000</td> </tr>
           </tbody>
         </table>

     Domain (diagram labels): ox:person, ox:firstName, ox:surname, ox:dob, xsd:string, xsd:date

  6. Why Table Understanding in DIADEM • recognize and extract data in tabular format • layout tables • data tables • understand forms and result-pages • labelling • segmentation • let us focus first on HTML tables (e.g., <table>)

  7. Why Table Understanding in DIADEM

  8. Why Table Understanding in DIADEM

  9. Leaf Tables • Goal: determine whether a table contains any inner table (recursive check) • if T1 contains T2 (e.g., there is a <table> element in the subtree rooted in T1), then T1 is a layout table.
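
The leaf-table rule can be sketched in a few lines. This is only an illustration, not the DIADEM implementation: it assumes the markup is well-formed enough for Python's `xml.etree.ElementTree` and uses lowercase tag names; the function name `is_leaf_table` is hypothetical.

```python
import xml.etree.ElementTree as ET

def is_leaf_table(table: ET.Element) -> bool:
    """Return True if no <table> element occurs in the subtree below `table`."""
    # Element.iter() also yields the element itself, so it is excluded explicitly.
    return not any(el is not table for el in table.iter("table"))

# Assumed example markup: an outer table wrapping an inner one.
NESTED = (
    "<table><tr><td>"
    "<table><tr><td>a</td><td>b</td></tr></table>"
    "</td></tr></table>"
)
outer = ET.fromstring(NESTED)
inner = outer.find(".//table")

print(is_leaf_table(outer))  # the outer table contains another table
print(is_leaf_table(inner))  # the inner table is a leaf
```

Under this rule the outer table would be treated as a layout table, while the inner one remains a candidate data table.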

  10. Row and Column count • Goal: identify “sane” tables • at least two coherent adjacent cells (TD, DIV, TH) • e.g., two data cells, two header cells, or one header and one data cell • allow 1D tables (i.e., vectors) • allow empty tables
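
One possible reading of this rule, as a rough sketch: a table is "sane" if two cells are adjacent either within a row or across two consecutive rows, which also admits 1D row and column vectors. The function name, the `allow_empty` flag, and the decision to ignore `DIV` cells and `colspan`/`rowspan` are simplifying assumptions, and the input is assumed to be a well-formed leaf table.

```python
import xml.etree.ElementTree as ET

CELL_TAGS = {"td", "th"}

def has_adjacent_cells(table: ET.Element, allow_empty: bool = True) -> bool:
    """Check for two adjacent cells, horizontally or vertically."""
    # One list of cells per <tr>; assumes a leaf table (no nested rows).
    rows = [[c for c in tr if c.tag in CELL_TAGS] for tr in table.iter("tr")]
    if not any(rows):
        # No cells at all: accepted or rejected according to the flag.
        return allow_empty
    horizontal = any(len(row) >= 2 for row in rows)
    # Two consecutive rows that both hold at least one cell.
    vertical = any(a and b for a, b in zip(rows, rows[1:]))
    return horizontal or vertical

single = ET.fromstring("<table><tr><td>x</td></tr></table>")
row_vector = ET.fromstring("<table><tr><th>Name</th><td>Sue</td></tr></table>")
col_vector = ET.fromstring("<table><tr><td>a</td></tr><tr><td>b</td></tr></table>")

print(has_adjacent_cells(single), has_adjacent_cells(row_vector),
      has_adjacent_cells(col_vector))
```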

  11. Longest String • Goal: identify “sane” cells • find the longest string w in every cell; T is a data table if |w|<δ • layout tables are likely to contain a large amount of text • ignore text nodes associated with <SELECT>, <FORM> and <TABLE> • in their subtree • siblings
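
A minimal sketch of the longest-string rule follows. The slide does not give a value for δ, so the default `delta=40` is purely a placeholder; the helper names are hypothetical, and only text inside the subtrees of `select`, `form` and `table` is skipped (text that follows such an element is kept, which is an assumption).

```python
import xml.etree.ElementTree as ET

IGNORED_TAGS = {"select", "form", "table"}  # subtrees skipped by the rule
CELL_TAGS = {"td", "th"}

def text_nodes(el):
    """Yield stripped, non-empty text nodes under `el`, skipping ignored subtrees."""
    if el.text and el.text.strip():
        yield el.text.strip()
    for child in el:
        if child.tag not in IGNORED_TAGS:
            yield from text_nodes(child)
        # Text after a child belongs to `el` itself, so it is kept (assumption).
        if child.tail and child.tail.strip():
            yield child.tail.strip()

def longest_string_length(table):
    """Length of the longest text node found in any cell of `table`."""
    cells = (el for el in table.iter() if el.tag in CELL_TAGS)
    return max((len(t) for cell in cells for t in text_nodes(cell)), default=0)

def passes_longest_string(table, delta=40):
    """Rule: T is a data table if |w| < delta (delta is a placeholder value)."""
    return longest_string_length(table) < delta

data = ET.fromstring(
    "<table><tr><td>Sue</td><td>Adams</td><td>12th June 1980</td></tr></table>")
layout = ET.fromstring(
    "<table><tr><td>Welcome to our site, where you can browse thousands "
    "of offers every day.</td></tr></table>")

print(passes_longest_string(data), passes_longest_string(layout))
```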

  12. Empty Cell • Goal: identify “sane” cells • find empty cells; T is a data table if it contains no empty cells • layout tables are likely to contain empty cells
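
The empty-cell rule might look like the sketch below. What counts as "empty" is not defined on the slide; here a cell is assumed to be empty when it has neither text nor child elements, so a cell holding only an image is not empty. Function names are hypothetical.

```python
import xml.etree.ElementTree as ET

CELL_TAGS = {"td", "th"}

def is_empty_cell(cell: ET.Element) -> bool:
    """Assumed definition: no visible text and no child elements."""
    has_text = "".join(cell.itertext()).strip() != ""
    has_children = len(cell) > 0
    return not has_text and not has_children

def passes_empty_cell(table: ET.Element) -> bool:
    """Rule: T is a data table if it contains no empty cells."""
    cells = [el for el in table.iter() if el.tag in CELL_TAGS]
    return not any(is_empty_cell(c) for c in cells)

full = ET.fromstring("<table><tr><td>Sue</td><td>Adams</td></tr></table>")
spacer = ET.fromstring("<table><tr><td></td><td>Menu</td></tr></table>")

print(passes_empty_cell(full), passes_empty_cell(spacer))
```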

  13. TH Check • Goal: identify “sane” tables • find <TH> elements in a table • layout tables are not likely to contain <TH> elements
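
As a brief sketch, the TH check reduces to testing whether any `th` element occurs in the table. The function name is hypothetical and well-formed, lowercase markup is assumed.

```python
import xml.etree.ElementTree as ET

def has_header_cells(table: ET.Element) -> bool:
    """Return True if the table contains at least one <th> element."""
    # Compare against None rather than relying on Element truthiness,
    # which depends on whether the element has children.
    return next(table.iter("th"), None) is not None

with_th = ET.fromstring(
    "<table><tr><th>NAME</th></tr><tr><td>Sue</td></tr></table>")
without_th = ET.fromstring("<table><tr><td>Sue</td></tr></table>")

print(has_header_cells(with_th), has_header_cells(without_th))
```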

  14. Largest Cell

  15. Picture • Goal: identify “sane” cells • check the size of pictures in a cell • T is a data table if p-area<δ • layout tables are likely to contain large pictures • e.g., ads and logos
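
A rough sketch of the picture rule is given below. It assumes picture sizes can be read from the `width` and `height` attributes of `img` elements; in practice the rendered CSS boxes would be the more reliable source. The threshold `delta=10000` (square pixels) is a placeholder, since the slide does not give a value, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

def largest_picture_area(table: ET.Element) -> int:
    """Largest width*height over all <img> elements with numeric size attributes."""
    areas = []
    for img in table.iter("img"):
        try:
            areas.append(int(img.get("width", "")) * int(img.get("height", "")))
        except ValueError:
            # Size unknown without rendering; such images are skipped in this sketch.
            continue
    return max(areas, default=0)

def passes_picture_rule(table: ET.Element, delta: int = 10000) -> bool:
    """Rule: T is a data table if p-area < delta (delta is a placeholder value)."""
    return largest_picture_area(table) < delta

banner = ET.fromstring(
    '<table><tr><td><img src="ad.png" width="728" height="90"/></td></tr></table>')
icon = ET.fromstring(
    '<table><tr><td><img src="flag.png" width="16" height="16"/>UK</td></tr></table>')

print(passes_picture_rule(banner), passes_picture_rule(icon))
```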

  16. Table Size

  17. Combining Rules • Identify the combination of rules that maximizes the recognition accuracy • cut-offs estimation • best-guess estimation • if T passes all the rules → data table • cut-off calculation • cut-off = performance of each rule • if T passes all the rules → data table • machine learning • decision trees → white box model
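
The "all rules in AND" combination can be sketched as a conjunction of predicates over precomputed table features, with the option of dropping individual rules from the conjunction. The feature names, rule names and thresholds below are all hypothetical placeholders, and the decision-tree variant is not shown.

```python
from typing import Callable, Dict, Iterable

Features = Dict[str, int]
Rule = Callable[[Features], bool]

# Hypothetical rule set; every threshold is a placeholder cut-off.
RULES: Dict[str, Rule] = {
    "leaf": lambda f: f["inner_tables"] == 0,
    "longest_string": lambda f: f["longest_string"] < 40,
    "empty_cell": lambda f: f["empty_cells"] == 0,
    "picture": lambda f: f["largest_picture_area"] < 10000,
}

def is_data_table(features: Features, disabled: Iterable[str] = ()) -> bool:
    """T is classified as a data table if it passes every enabled rule."""
    skip = set(disabled)
    return all(rule(features) for name, rule in RULES.items() if name not in skip)

# Assumed feature values for one table with a single empty cell.
table_features = {
    "inner_tables": 0,
    "longest_string": 14,
    "empty_cells": 1,
    "largest_picture_area": 256,
}

print(is_data_table(table_features))                           # all rules in AND
print(is_data_table(table_features, disabled=["empty_cell"]))  # empty-cell rule dropped
```

Keeping the rules in a named dictionary makes it easy to re-run the classifier with different subsets of rules and compare the outcomes.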

  18. Evaluation: Cut-Off Estimation • First run: all rules in AND • Second run: no empty cell • Third run: no empty cell, no table size • Fourth run: no empty cell, no table size, no picture rule

  19. Evaluation: Cut-Off Computation • First run: all rules in AND • Second run: no empty cell, no table size

  20. Evaluation: Decision Tree • Facts: • 65% training • 35% 10-fold validation • precision: 0.807 • recall: 0.836 • F-measure: 0.821 • Comparison: • F-Measure 0.740 (Gatterbauer)
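
As a consistency check, the reported F-measure can be recomputed from the reported precision and recall, assuming it is the balanced F1 score (the harmonic mean of the two):

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Values reported on the slide for the decision tree.
f1 = f_measure(0.807, 0.836)
print(round(f1, 3))  # 0.821
```

The result agrees with the F-measure of 0.821 stated on the slide.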

  21. Discussion • Most of the errors are caused by missing information or a bad combination of rules. • use visual and semantic information • combine the heuristics in an “organic” way • PDF-inspired extraction • guided by the HTML and CSS structure. • use a reference model as in form and result-page analysis

  22. Thank you!
