This paper explores table extraction from text documents, focusing on identifying table components such as row and column positions, and cell tagging for Question-Answering, data mining, and IR applications. It compares Conditional Random Fields (CRFs) with MaxEntropy and HMM. Different labeling and feature sets are discussed, along with tasks and results for table line location and identification. The conclusion highlights the importance of combining textual and spatial features for tackling the complex linguistic and formatting aspects of table extraction.
Table Extraction Using Conditional Random Fields - D. Pinto, A. McCallum, X. Wei and W. Bruce Croft - SIGIR 2003 - Presented by Vitor R. Carvalho, March 15th 2004
Warm up • Why table extraction? • Applications: Question-Answering, data mining and IR • Tables: “textual tokens laid out in tabular form” • Tables: “databases designed for human eyes” • Related Work: • Pyreddy and Croft, 1997: purely layout-based approach; a Character Alignment Graph (CAG) is used to identify the whole table • Ng et al., 1999: machine learning to identify row and column positions; no extraction of content • Hurst, 2000: combination of layout and language perspectives; text is broken into blocks using spatial and linguistic evidence • Pinto et al., 2002: based on CAG; heuristic method to extract table cells for a QA system
Objectives • In this paper: • Only text tables are studied, not HTML tables • Table extraction can be broken down into 6 subproblems: • Locate the table (*) • Identify the row positions and types (*) • Identify the column positions and types • Segment tables into cells • Tag cells as data or headers • Associate data cells with their corresponding headers • Only the (*) tasks are addressed in the paper • CRFs are compared to MaxEntropy and to HMM
Example • From www.FedStats.com, July 2001
12 Line Labels • Non-extraction labels • { NONTABLE, BLANKLINE, SEPARATOR } • Header Labels • { TITLE, SUPERHEADER, TABLEHEADER, SUBHEADER, SECTIONHEADER } • Data Row Labels • { DATAROW, SECTIONDATAROW } • Caption Labels • { TABLEFOOTNOTE, TABLECAPTION }
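The label inventory above can be written down as a small lookup table. This is only a sketch of the grouping shown on the slide: the group keys (`non_extraction`, `header`, `data_row`, `caption`) and the helper `group_of` are names assumed here for illustration, not identifiers from the paper.

```python
# The 12 line labels from the slide, grouped by category.
# Group keys are hypothetical names chosen for this sketch.
LABEL_GROUPS = {
    "non_extraction": ("NONTABLE", "BLANKLINE", "SEPARATOR"),
    "header": ("TITLE", "SUPERHEADER", "TABLEHEADER", "SUBHEADER", "SECTIONHEADER"),
    "data_row": ("DATAROW", "SECTIONDATAROW"),
    "caption": ("TABLEFOOTNOTE", "TABLECAPTION"),
}

# Flat list of every label, in slide order.
ALL_LABELS = [label for group in LABEL_GROUPS.values() for label in group]


def group_of(label: str) -> str:
    """Return the category a line label belongs to."""
    for group, labels in LABEL_GROUPS.items():
        if label in labels:
            return group
    raise ValueError(f"unknown line label: {label}")
```

A sequence labeler would assign exactly one of these labels to every line of a document.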
Feature Set • White Space Features • Presence of: 4 consecutive white spaces, 4-space indents, 2 consecutive white spaces between non-space characters, a complete white space line, single-space indent, etc. • Percentage of: white space from the first non-white-space character on • Text Features • Presence of: 3 cells on a line, etc. • Percentage of: digits (0-9) on a line, alphabetic characters (a-z) on a line, header features (year strings, month abbreviations, etc.) on a line • Separator Features • Presence of: 4 consecutive periods • Percentage of: separator characters (-, +, !, =, :, *) on a line • Conjunction of Features • Conjunctions: current & previous line, current & next line, next & next-next line
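A rough sketch of how a few of these per-line features might be computed is given below. It is not the authors' implementation. Several details are assumptions made here: a "cell" is taken to be a run of text separated by two or more spaces, "3 cells" is read as "at least 3", percentages are fractions of the line length, and the names `line_features`, `conjunction_features` and `OFFSET_PAIRS` are hypothetical. Header features (year strings, month abbreviations) are left out.

```python
import re

SEPARATOR_CHARS = set("-+!=:*")  # separator characters listed on the slide
OFFSET_PAIRS = [(0, -1), (0, 1), (1, 2)]  # current&previous, current&next, next&next-next


def _fraction(count: int, total: int) -> float:
    """Safe ratio that returns 0.0 for an empty line."""
    return count / total if total else 0.0


def line_features(line: str) -> dict:
    """Compute a subset of the white space, text and separator features."""
    line = line.rstrip("\n")
    stripped = line.strip()
    body = line.lstrip()  # text from the first non-white-space character on
    # Assumption: cells are separated by runs of two or more spaces.
    cells = [c for c in re.split(r" {2,}", stripped) if c]
    return {
        "four_consecutive_spaces": "    " in line,
        "four_space_indent": line.startswith("    "),
        "single_space_indent": line.startswith(" ") and not line.startswith("  "),
        "two_spaces_between_text": re.search(r"\S {2,}\S", line) is not None,
        "blank_line": stripped == "",
        "pct_white_space": _fraction(sum(c.isspace() for c in body), len(body)),
        "three_cells": len(cells) >= 3,
        "pct_digits": _fraction(sum(c in "0123456789" for c in line), len(line)),
        "pct_alpha": _fraction(sum(c.isascii() and c.isalpha() for c in line), len(line)),
        "four_periods": "...." in line,
        "pct_separator": _fraction(sum(c in SEPARATOR_CHARS for c in line), len(line)),
    }


def conjunction_features(per_line: list, i: int) -> dict:
    """Conjoin the boolean features that fire on pairs of nearby lines around line i."""
    out = {}
    for a, b in OFFSET_PAIRS:
        ia, ib = i + a, i + b
        if not (0 <= ia < len(per_line) and 0 <= ib < len(per_line)):
            continue  # the pair falls off the start or end of the document
        on_a = [k for k, v in per_line[ia].items() if v is True]
        on_b = [k for k, v in per_line[ib].items() if v is True]
        for fa in on_a:
            for fb in on_b:
                out[f"{fa}@{a:+d}&{fb}@{b:+d}"] = True
    return out
```

Calling `line_features` on every line and then `conjunction_features` on each index gives one feature dictionary per line, which is the kind of overlapping, non-independent input a CRF or MaxEntropy model can consume directly.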
Task 1: Table Line Location • A table line is any line whose label is not NONTABLE, BLANKLINE or SEPARATOR • F-Measure = (2 * Precision * Recall) / (Precision + Recall) • Both CRFs used a Gaussian prior and were trained using L-BFGS • Training set (52 documents), development set (6 documents), test set (62 documents)
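The scoring for this task can be sketched as follows, using the table-line definition and the F-Measure formula from the slide. The function names are hypothetical, and it is an assumption of this sketch that location is scored as a binary decision per line (table line or not), so a header line predicted as a data row still counts as correctly located.

```python
NON_TABLE_LABELS = {"NONTABLE", "BLANKLINE", "SEPARATOR"}


def is_table_line(label: str) -> bool:
    """A table line is any line whose label is not a non-extraction label."""
    return label not in NON_TABLE_LABELS


def table_line_f_measure(gold: list, predicted: list) -> tuple:
    """Return (precision, recall, F-measure) for table line location."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    pairs = list(zip(gold, predicted))
    tp = sum(is_table_line(g) and is_table_line(p) for g, p in pairs)
    fp = sum(not is_table_line(g) and is_table_line(p) for g, p in pairs)
    fn = sum(is_table_line(g) and not is_table_line(p) for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F-Measure = (2 * Precision * Recall) / (Precision + Recall)
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure
```

For example, with five lines where two table lines are found, one is missed and one non-table line is wrongly flagged, precision and recall are both 2/3, and so is the F-Measure.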
Task 2: Line Identification • How many of these lines were actually table lines?
Additional Results • Pinto et al. heuristic method • 4 labels: CAPTIONS, HEADERS, DATA, NON-TABLE
Conclusions • The table extraction problem has complex linguistic and formatting characteristics. To attack it, a combination of textual and spatial features was used. • CRFs handle arbitrary, overlapping features well, and offer the combined benefits of conditionally trained probability models and Markov finite-state context models.
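For reference, the two properties named in the conclusions can be read off the usual linear-chain CRF definition. The formulas below are the standard textbook form, not a transcription of the slides. The notation is assumed here: s is the sequence of line labels, o the observed lines, f_k the feature functions, λ_k their weights, Z_o the normalizer, and σ² the variance of the Gaussian prior mentioned under Task 1.

```latex
% Conditional probability of a label sequence given the observed lines;
% each feature f_k sees the previous label, the current label (Markov context)
% and the whole observation sequence (arbitrary, overlapping features).
P_\Lambda(\mathbf{s} \mid \mathbf{o}) =
  \frac{1}{Z_{\mathbf{o}}}
  \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(s_{t-1}, s_t, \mathbf{o}, t) \right)

% Training objective: conditional log-likelihood with a Gaussian prior on the weights,
% maximized with a quasi-Newton method such as L-BFGS.
\mathcal{L}(\Lambda) =
  \sum_{i} \log P_\Lambda\!\left(\mathbf{s}^{(i)} \mid \mathbf{o}^{(i)}\right)
  - \sum_{k} \frac{\lambda_k^2}{2\sigma^2}
```

The dependence on (s_{t-1}, s_t) is what provides the finite-state context, while conditioning on the full o is what allows features of the previous and next lines to be used freely.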