Enhancing Table Extraction Using Conditional Random Fields

Table Extraction Using Conditional Random Fields, by D. Pinto, A. McCallum, X. Wei and W. Bruce Croft (SIGIR 2003). Presented by Vitor R. Carvalho, March 15th, 2004

Warm up
• Why table extraction?
  • Applications: question answering, data mining and IR
  • Tables: “textual tokens laid out in tabular form”
  • Tables: “databases designed for human eyes”
• Related work:
  • Pyreddy and Croft, 1997: purely layout-based approach; a Character Alignment Graph (CAG) is used to identify the whole table
  • Ng et al., 1999: machine learning to identify row and column positions; no extraction of content
  • Hurst, 2000: combines layout and language perspectives; text is broken into blocks using spatial and linguistic evidence
  • Pinto et al., 2002: CAG-based heuristic method to extract table cells for a QA system

Objectives
• In this paper:
  • Only text tables are studied, not HTML tables
  • Table extraction can be broken down into 6 subproblems:
    • Locate the table (*)
    • Identify the row positions and types (*)
    • Identify the column positions and types
    • Segment tables into cells
    • Tag cells as data or headers
    • Associate data cells with their corresponding headers
  • Only the (*) tasks are addressed in the paper
  • CRFs are compared to maximum entropy (MaxEnt) models and to HMMs

Example
• From www.FedStats.com, July 2001

12 Line Labels
• Non-extraction labels: { NONTABLE, BLANKLINE, SEPARATOR }
• Header labels: { TITLE, SUPERHEADER, TABLEHEADER, SUBHEADER, SECTIONHEADER }
• Data row labels: { DATAROW, SECTIONDATAROW }
• Caption labels: { TABLEFOOTNOTE, TABLECAPTION }
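As a rough illustration, the label inventory above could be held in a small lookup structure. The names `LINE_LABELS`, `ALL_LABELS` and `LABEL_TO_GROUP` are hypothetical and only the label strings come from the slide.

```python
# Line labels from the slide, grouped by family.
# The dictionary layout and group keys are illustrative assumptions.
LINE_LABELS = {
    "non_extraction": ("NONTABLE", "BLANKLINE", "SEPARATOR"),
    "header": ("TITLE", "SUPERHEADER", "TABLEHEADER", "SUBHEADER", "SECTIONHEADER"),
    "data_row": ("DATAROW", "SECTIONDATAROW"),
    "caption": ("TABLEFOOTNOTE", "TABLECAPTION"),
}

# Flat tuple of all labels, in slide order.
ALL_LABELS = tuple(label for group in LINE_LABELS.values() for label in group)

# Reverse lookup: label -> family name.
LABEL_TO_GROUP = {
    label: group for group, labels in LINE_LABELS.items() for label in labels
}

if __name__ == "__main__":
    print(len(ALL_LABELS), "labels")
    print(LABEL_TO_GROUP["SUBHEADER"])
```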

Feature Set
• White space features
  • Presence of: 4 consecutive white spaces, 4-space indents, 2 consecutive white spaces between non-space characters, a complete white-space line, single-space indent, etc.
  • Percentage of: white space from the first non-white-space character on
• Text features
  • Presence of: 3 cells on a line, etc.
  • Percentage of: digits (0-9) on a line, alphabetic characters (a-z) on a line, header features (year strings, month abbreviations, etc.) on a line
• Separator features
  • Presence of: 4 consecutive periods
  • Percentage of: separator characters (-, +, !, =, :, *) on a line
• Conjunctions of features
  • Conjunctions: current & previous line, current & next line, next & next-next line
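A minimal sketch of how a few of these per-line features might be computed. It is not the paper's implementation, and the slide leaves several details open. The sketch assumes that a "cell" is a run of text separated by two or more spaces, that "2 consecutive white spaces" means at least two, that percentages are taken over the full line length, and that a conjunction pairs every active binary feature at one offset with every active binary feature at the other. All function and feature names are hypothetical.

```python
import re

SEPARATOR_CHARS = set("-+!=:*")  # separator characters listed on the slide


def fraction(count: int, total: int) -> float:
    """Return count / total, or 0.0 for an empty denominator."""
    return count / total if total else 0.0


def line_features(line: str) -> dict:
    """Compute a subset of the white space, text and separator features."""
    line = line.rstrip("\n")
    body = line.lstrip()  # text from the first non-white-space character on
    # Assumption: cells are separated by runs of two or more spaces.
    cells = [c for c in re.split(r" {2,}", line.strip()) if c]
    return {
        "blank_line": body == "",
        "four_spaces": " " * 4 in line,
        "four_space_indent": line.startswith(" " * 4),
        "single_space_indent": re.match(r" \S", line) is not None,
        "two_space_gap": re.search(r"\S {2,}\S", line) is not None,
        "three_cells": len(cells) >= 3,
        "four_periods": "...." in line,
        "pct_white_space": fraction(sum(ch == " " for ch in body), len(body)),
        "pct_digits": fraction(sum(ch.isdigit() for ch in line), len(line)),
        "pct_alpha": fraction(
            sum(ch.isascii() and ch.isalpha() for ch in line), len(line)
        ),
        "pct_separators": fraction(
            sum(ch in SEPARATOR_CHARS for ch in line), len(line)
        ),
    }


def active(feats: dict) -> list:
    """Names of the binary features that fired on a line."""
    return sorted(name for name, value in feats.items() if value is True)


def conjunction_features(lines: list, i: int) -> set:
    """Conjoin active binary features for the offset pairs named on the slide:
    current & previous, current & next, next & next-next."""
    per_line = [active(line_features(text)) for text in lines]

    def at(j: int) -> list:
        return per_line[j] if 0 <= j < len(per_line) else []

    out = set()
    for a_off, b_off in [(0, -1), (0, 1), (1, 2)]:
        for a in at(i + a_off):
            for b in at(i + b_off):
                out.add(f"{a}@{a_off}&{b}@{b_off}")
    return out


if __name__ == "__main__":
    sample = ["Name    Age    City", "....", ""]
    print(line_features(sample[0]))
    print(sorted(conjunction_features(sample, 0)))
```

The binary and percentage features share one dictionary so that a line maps to a single feature vector. Conjunctions are kept separate because they depend on neighbouring lines.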

Task 1: Table Line Location
• A table line is any line whose label is not NONTABLE, BLANKLINE or SEPARATOR
• F-measure = (2 * Precision * Recall) / (Precision + Recall)
• Both CRFs used a Gaussian prior and were trained using L-BFGS
• Training set (52 documents), development set (6 documents), test set (62 documents)
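The F-measure above can be sketched in a few lines. This assumes that, for table line location, a predicted table line counts as correct whenever the gold label is any table label, regardless of which one. The function names and the toy label sequences are hypothetical.

```python
NON_TABLE_LABELS = {"NONTABLE", "BLANKLINE", "SEPARATOR"}


def is_table_line(label: str) -> bool:
    """A table line is any label other than the three non-extraction labels."""
    return label not in NON_TABLE_LABELS


def table_line_scores(gold: list, predicted: list) -> tuple:
    """Return (precision, recall, F-measure) for table line location."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = [(is_table_line(g), is_table_line(p)) for g, p in zip(gold, predicted)]
    tp = sum(g and p for g, p in pairs)        # table line found
    fp = sum(not g and p for g, p in pairs)    # non-table line marked as table
    fn = sum(g and not p for g, p in pairs)    # table line missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    # Toy sequences for illustration only, not data from the paper.
    gold = ["NONTABLE", "TITLE", "TABLEHEADER", "DATAROW", "DATAROW", "BLANKLINE"]
    pred = ["NONTABLE", "NONTABLE", "TABLEHEADER", "DATAROW", "DATAROW", "DATAROW"]
    print(table_line_scores(gold, pred))
```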

Task 2: Line Identification
• How many of these lines were actually table lines?


Additional Results
• Pinto et al. heuristic method
• 4 labels: CAPTIONS, HEADERS, DATA, NON-TABLE

Conclusions
• The table extraction problem has complex linguistic and formatting characteristics. To attack it, a combination of textual and spatial features was used.
• CRFs handle arbitrary, overlapping features well, and offer the combined benefits of conditionally trained models and Markov finite-state context models.
