This article discusses the problem of table extraction and formatting, specifically focusing on HTML tables and plain text tables. It explores how tags can help in understanding HTML tables and proposes a MaxEnt model for table extraction. A data set from the CS department at the University of Massachusetts Amherst is used for training and testing the model. The article also presents an error analysis and suggests future improvements.
Table Extraction Using MaxEnt • Zonghui Lian
Introduction • Table extraction • Table format
Problem • HTML tables • Tags help us understand them • What about plain-text tables?
An Example • Each line of a plain-text table receives a label, in order: title, separator, header, datarow
MaxEnt • How to define features • How to learn model weights
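The two questions on this slide can be illustrated with a minimal sketch. It is not the implementation used in the talk. It assumes the usual MaxEnt form, where the probability of a label is proportional to the exponential of a weighted feature sum, and it learns the weights by plain gradient ascent on the conditional log-likelihood. The function names, the toy examples, and the `bias` and `digit_pct` feature names are hypothetical.

```python
import math
from collections import defaultdict


def predict_proba(weights, features, labels):
    """Return p(label | features) under a MaxEnt (multinomial logistic) model."""
    scores = {
        y: sum(weights.get((y, name), 0.0) * value for name, value in features.items())
        for y in labels
    }
    top = max(scores.values())  # subtract the max score for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}


def train(examples, labels, epochs=200, rate=0.5):
    """Fit one weight per (label, feature) pair by batch gradient ascent."""
    weights = defaultdict(float)
    for _ in range(epochs):
        grad = defaultdict(float)
        for features, gold in examples:
            probs = predict_proba(weights, features, labels)
            for y in labels:
                # gradient = observed feature value minus expected feature value
                indicator = 1.0 if y == gold else 0.0
                for name, value in features.items():
                    grad[(y, name)] += (indicator - probs[y]) * value
        for key, g in grad.items():
            weights[key] += rate * g / len(examples)
    return weights


# Hypothetical toy data: lines described only by a bias term and a digit share.
LABELS = ["DATAROW", "TABLEHEADER"]
toy = [
    ({"bias": 1.0, "digit_pct": 0.6}, "DATAROW"),
    ({"bias": 1.0, "digit_pct": 0.7}, "DATAROW"),
    ({"bias": 1.0, "digit_pct": 0.0}, "TABLEHEADER"),
    ({"bias": 1.0, "digit_pct": 0.1}, "TABLEHEADER"),
]
model = train(toy, LABELS)
print(predict_proba(model, {"bias": 1.0, "digit_pct": 0.65}, LABELS))
```

A practical system would normally add regularization and a faster optimizer; the sketch leaves both out to keep the two steps visible.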
Data Set • CS department, University of Massachusetts Amherst (FedStats.gov) • Training data: 9321 • Test data: 1200 • Format
Features • White space: large gaps / small gaps, four-space indents, space percentage • Text features: digit percentage, month and year
Features • Special characters: -, +, =, :, |, .
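The features on these two slides could be computed per line roughly as follows. This is a sketch only: the gap threshold (`large_gap=4`), the feature names, and the month and year patterns are assumptions, not values given in the slides.

```python
import re

SPECIAL_CHARS = "-+=:|."
# Assumed patterns: English month names or 3-letter abbreviations, years 1900-2099.
MONTH_RE = re.compile(
    r"\b(?:jan(?:uary)?|feb(?:ruary)?|mar(?:ch)?|apr(?:il)?|may|june?|july?|"
    r"aug(?:ust)?|sep(?:tember)?|oct(?:ober)?|nov(?:ember)?|dec(?:ember)?)\b",
    re.IGNORECASE,
)
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def line_features(line, large_gap=4):
    """Map one line of text to a dict of numeric features."""
    text = line.rstrip("\n")
    length = max(len(text), 1)  # avoid division by zero on empty lines
    # Runs of spaces between tokens, ignoring leading and trailing space.
    gaps = [len(m.group()) for m in re.finditer(r" +", text.strip())]
    feats = {
        "large_gaps": float(sum(1 for g in gaps if g >= large_gap)),
        "small_gaps": float(sum(1 for g in gaps if g < large_gap)),
        "four_space_indent": 1.0 if text.startswith("    ") else 0.0,
        "space_pct": text.count(" ") / length,
        "digit_pct": sum(ch.isdigit() for ch in text) / length,
        "has_month": 1.0 if MONTH_RE.search(text) else 0.0,
        "has_year": 1.0 if YEAR_RE.search(text) else 0.0,
    }
    for ch in SPECIAL_CHARS:
        feats["pct_" + ch] = text.count(ch) / length
    return feats


# A data-row-like line and a separator-like line.
row = " " * 4 + "1998" + " " * 6 + "12,345" + " " * 6 + "67.8"
print(line_features(row))
print(line_features("-" * 40))
```

The resulting dict can be passed directly to any classifier that accepts named numeric features.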
Error Analysis • Most errors occurred when recognizing … • TABLEFOOTNOTE: 0.2719665271966527 • DATAROW: 0.12552301255230125 • TABLEHEADER: 0.11715481171548117 • Confusions: TABLEFOOTNOTE -> NONTABLE DATAROW • DATAROW -> SECTIONDATAROW • TABLEHEADER -> SUPERHEADER • Example lines: TABLEFOOTNOTE "1 Includes Hawaii." • TABLEFOOTNOTE "2 Includes processing total for dual usage crops."
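An error breakdown of this kind can be tallied from gold and predicted label sequences. The slide does not say how its figures were computed, so the sketch below is one possible reading: it treats each fraction as that label's share of all errors, which is an assumption. The function names and the toy label lists are hypothetical.

```python
from collections import Counter


def confusion_counts(gold, predicted):
    """Count (gold, predicted) pairs where the two labels disagree."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    return Counter((g, p) for g, p in zip(gold, predicted) if g != p)


def error_share_by_label(gold, predicted):
    """Fraction of all errors whose gold label is each label (assumed metric)."""
    errors = confusion_counts(gold, predicted)
    total = sum(errors.values())
    if total == 0:
        return {}
    by_gold = Counter()
    for (g, _), n in errors.items():
        by_gold[g] += n
    return {label: n / total for label, n in by_gold.items()}


# Hypothetical labels for five lines, not taken from the data set.
gold = ["TABLEFOOTNOTE", "TABLEFOOTNOTE", "DATAROW", "TABLEHEADER", "DATAROW"]
pred = ["NONTABLE", "DATAROW", "SECTIONDATAROW", "TABLEHEADER", "DATAROW"]
print(confusion_counts(gold, pred).most_common())
print(error_share_by_label(gold, pred))
```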
Future Work • Improve performance • Features, for example: alphabetic characters, previous label, next label • Data set size
Future Work • Identify columns • Add tags • Use a table understanding algorithm