Harvesting Relational Tables from Lists on the Web

Hazem Elmeleegy (Purdue University) • Jayant Madhavan and Alon Halevy (Google Inc.)

Outline • Introduction • The ListExtract Approach • Experiments • Conclusion

Lists on the Web

Lists on the Web • Our Goal: • Extract tabular data from all such lists in an unsupervised and domain-independent manner. • Not the typical wrapper generation problem.

Cartoons Example: Easy for Humans, Confusing for Machines • A period (“.”) is used both as a delimiter and to terminate abbreviations • A slash (“/”) is used both as a delimiter and as part of the text • The slash (“/”) delimiter is missing (along with the production year)

Key Contributions • Developed the ListExtract System, which extracts tables from lists in an unsupervised and domain-independent manner • Introduced using external sources of information such as a large collection of tables collected from the web and a language model to help in the splitting decisions • Conducted a large-scale experimental study which suggests that tens of millions of high-quality lists can be exploited on the Web.

ListExtract Approach • Phases: Independent Splitting Phase, Alignment Phase, Refinement Phase • Steps: Splitting Lines into Records • Deciding the Number of Columns • Re-Splitting Long Records • Aligning Short Records (Null Insertion) • Detecting Inconsistent Fields • Re-Splitting Detected Field Streaks • Re-Aligning Detected Field Streaks (Null Insertion)

Intermediate Outputs (Independent Splitting Phase)

Intermediate Outputs (Re-Splitting Long Records) • Number of Columns = 4

Intermediate Outputs (Alignment Phase)

Final Output (Refinement Phase)

Line Splitting Algorithm • Pre-processing: removing delimiters • [Figure: an input line, its candidate subsequences with their FQ scores, and the output fields]

Field Quality (FQ) Score • Linear combination of multiple score components • Each component corresponds to a source of evidence • Score Components • Data Type • Regular expressions capture different data types (e.g. dates, emails, currencies, etc.) • Score: 1 if a match is found, 0 otherwise • Table Corpus • Check whether the candidate sequence exists as a field in the table corpus • Score: 1 if it exists, 0 otherwise • Language Model • Measure the likelihood that the candidate sequence occurs in free text, and the unlikelihood that overlapping sequences occur in free text • Score: a combination of the probabilities capturing both the likelihood and the unlikelihood
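The linear combination above can be sketched as follows. This is a minimal illustration, not the ListExtract implementation: the regular expressions, the weights, and the names `type_score`, `corpus_score` and `fq_score` are all assumptions, and the language-model component is passed in as a precomputed number in [0, 1] because the slide does not give its formula.

```python
import re

# Hypothetical data-type patterns; the slides only name the categories.
TYPE_PATTERNS = [
    re.compile(r"\d{4}-\d{2}-\d{2}"),           # ISO-style date
    re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+"),  # email address
    re.compile(r"[$€£]\d+(\.\d{2})?"),          # currency amount
]

# Hypothetical weights for the linear combination.
WEIGHTS = {"type": 0.3, "corpus": 0.4, "lm": 0.3}


def type_score(candidate):
    """1 if the whole candidate matches a known data type, 0 otherwise."""
    return 1.0 if any(p.fullmatch(candidate) for p in TYPE_PATTERNS) else 0.0


def corpus_score(candidate, corpus_fields):
    """1 if the candidate appears as a field in the table corpus, 0 otherwise."""
    return 1.0 if candidate.lower() in corpus_fields else 0.0


def fq_score(candidate, corpus_fields, lm_score):
    """Weighted sum of the three evidence components."""
    parts = {
        "type": type_score(candidate),
        "corpus": corpus_score(candidate, corpus_fields),
        "lm": lm_score,  # assumed to be computed elsewhere, in [0, 1]
    }
    return sum(WEIGHTS[name] * parts[name] for name in WEIGHTS)
```

Here `corpus_fields` is assumed to be a set of lower-cased field strings; a real table corpus would need an index rather than an in-memory set.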

Deciding the Number of Columns • Majority voting across all records
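Majority voting over the independently split records can be sketched in a few lines. The function name `decide_num_columns` and the tie-breaking rule (the field count seen first wins) are assumptions; the slide only states that a majority vote is taken.

```python
from collections import Counter


def decide_num_columns(records):
    """Return the most common field count among the split records."""
    counts = Counter(len(record) for record in records)
    # most_common keeps first-seen order among equal counts (assumed tie-break).
    return counts.most_common(1)[0][0]
```

Records with more fields than the chosen count are then candidates for re-splitting, and records with fewer fields are candidates for null insertion.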

Re-Splitting Long Records • Pre-processing: removing delimiters • Maximum Number of Output Fields = 3 • [Figure: an input record, its candidate subsequences with their FQ scores, and the output fields]

Aligning Short Records (Null Insertion) • [Figure: independently split records with their average FQ scores: 0.88, 0.79, 0.49, 0.62, 0.73, 0.92, 0.86]

Aligning Short Records (Null Insertion) • Steps: 1. Sorting 2. Iterative Alignment • [Figure: independently split records and the output table, with average FQ scores in the order shown: 0.92, 0.86, 0.79, 0.62, 0.88, 0.73, 0.49]

Aligning Short Records (Null Insertion) • To align each record, we use the classical Needleman-Wunsch sequence alignment algorithm. [NW, J. of Molecular Biology, 1970] • The two sequences: • Sequence #1: Table columns • Sequence #2: Fields of a short record • Design a Field-to-Field Consistency (F2FC) Score. • Use the average F2FC Score as the similarity measure for the alignment algorithm.
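The alignment step can be illustrated with a Needleman-Wunsch-style dynamic program. The sketch below is a restricted variant, not the authors' code: gaps are allowed only on the record side (they become nulls), the gap penalty defaults to 0, and the similarity function is supplied by the caller. The names `align_short_record` and `toy_similarity` are hypothetical, and the toy similarity merely stands in for the average F2FC score.

```python
def align_short_record(columns, record, sim, gap=0.0):
    """Place each field of a short record in a column, in order, padding with None.

    columns: list of columns, each a list of values already in the table
    record:  fields of the short record (no more fields than columns)
    sim:     function (column_values, field) -> similarity score
    """
    n, m = len(columns), len(record)
    if m > n:
        raise ValueError("record has more fields than the table has columns")
    neg_inf = float("-inf")
    # best[i][j]: best score aligning the first i columns with the first j fields
    best = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(1, n + 1):
        best[i][0] = best[i - 1][0] + gap
        for j in range(1, min(i, m) + 1):
            match = best[i - 1][j - 1] + sim(columns[i - 1], record[j - 1])
            skip = best[i - 1][j] + gap  # column i-1 receives a null
            best[i][j] = max(match, skip)
    # Trace back from the last cell to recover which column each field landed in.
    aligned = [None] * n
    i, j = n, m
    while i > 0:
        if j > 0 and best[i][j] == best[i - 1][j - 1] + sim(columns[i - 1], record[j - 1]):
            aligned[i - 1] = record[j - 1]
            j -= 1
        i -= 1
    return aligned


def toy_similarity(column_values, field):
    """Toy stand-in for the average F2FC score: share of values with the same digit-ness."""
    same = [value.isdigit() == field.isdigit() for value in column_values]
    return sum(same) / len(same)


columns = [["Tom and Jerry", "Looney Tunes"], ["1940", "1930"], ["MGM", "Warner Bros."]]
aligned = align_short_record(columns, ["Popeye", "Paramount"], toy_similarity)
```

In this toy table the middle column holds years, so the two text fields are placed in the first and third columns and the year column receives the null.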

Field-to-Field Consistency (F2FC) Score • Linear combination of multiple score components • Each component corresponds to a source of evidence • Score Components • Data Type • Check if data types are consistent • Table Corpus • Check if the two fields co-occur in the same column of a table in the corpus • Syntax • Measure the consistency of the syntax of the two fields (e.g. length, % of upper/lower case letters, digits, spaces, etc.) • Delimiters • Measure the consistency between the delimiters on both sides of the two fields
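As an illustration of the syntax component only, the sketch below compares two fields by their character-class fractions and their lengths. The feature set follows the examples listed on the slide, but the way the features are combined (equal weight on character fractions and on the length ratio) and the names `syntax_features` and `syntax_consistency` are assumptions.

```python
def syntax_features(field):
    """Fractions of upper-case, lower-case, digit and space characters."""
    n = max(len(field), 1)  # avoid division by zero for empty fields
    return [
        sum(c.isupper() for c in field) / n,
        sum(c.islower() for c in field) / n,
        sum(c.isdigit() for c in field) / n,
        sum(c.isspace() for c in field) / n,
    ]


def syntax_consistency(a, b):
    """Score in [0, 1]; 1 means identical character mix and equal length."""
    fa, fb = syntax_features(a), syntax_features(b)
    frac_diff = sum(abs(x - y) for x, y in zip(fa, fb)) / len(fa)
    longer = max(len(a), len(b))
    length_ratio = min(len(a), len(b)) / longer if longer else 1.0
    # Assumed combination: equal weight on character mix and on length.
    return 0.5 * (1.0 - frac_diff) + 0.5 * length_ratio
```

Two four-digit years score 1.0 under this measure, while a year compared with a short upper-case name scores lower, which is the kind of signal the alignment step needs.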

Refinement Phase • Detect inconsistent fields in the output table • Consider streaks only • Re-merge • Re-split (and re-align if needed) • Use extended FQ score
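The streak and re-merge steps can be sketched as below. The slide does not define a streak, so this assumes it means a run of at least two adjacent fields in one record that were flagged as inconsistent; the names `find_streaks` and `remerge_streak`, the `min_len` parameter, and joining with a single space are all hypothetical.

```python
def find_streaks(flags, min_len=2):
    """Return (start, end) index pairs of runs of flagged fields, end inclusive."""
    streaks, start = [], None
    # A trailing False acts as a sentinel that closes a run at the end of the record.
    for i, flagged in enumerate(list(flags) + [False]):
        if flagged and start is None:
            start = i
        elif not flagged and start is not None:
            if i - start >= min_len:
                streaks.append((start, i - 1))
            start = None
    return streaks


def remerge_streak(record, start, end):
    """Join the fields of a streak back into one text span for re-splitting."""
    return " ".join(field for field in record[start:end + 1] if field is not None)
```

The merged span would then be re-split with the extended FQ score and re-aligned if the result has fewer fields than the columns the streak spans.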

Field Quality (FQ) Score [Revisited] • Linear combination of multiple score components • Each component corresponds to a source of evidence • Score Components • Data Type • Table Corpus • Language Model • List Support • Favors candidates that are more consistent with the columns spanned by the streak

Table Extraction (TE) Score • Average FQ Score over all fields in the extracted table • Used to compare and rank the extracted tables by their extraction quality
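A minimal sketch of the TE score and the ranking it enables follows. It assumes the per-field FQ scores are already available as a list of rows, and that null fields (represented as `None`) are skipped when averaging; the slide does not say how nulls are treated, and the names `te_score` and `rank_tables` are hypothetical.

```python
def te_score(fq_scores_by_row):
    """Average FQ score over all non-null fields of one extracted table."""
    scores = [s for row in fq_scores_by_row for s in row if s is not None]
    return sum(scores) / len(scores) if scores else 0.0


def rank_tables(tables):
    """Sort (name, fq_scores_by_row) pairs by TE score, best first."""
    return sorted(tables, key=lambda item: te_score(item[1]), reverse=True)
```

A threshold on this score can then be used to keep only the tables whose extraction quality is likely to be high.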

Overall Performance for WLists and TDLists • WLists: A set of 20 manually collected HTML lists spanning 20 different domains. • TDLists: A set of 100 lists derived from randomly selected HTML tables.

Effect of the Refinement Phase (WLists)

Large-Scale Experiment • A crawl of 100K web pages • 100K extracted lists • 32K lists after filtering • 11K extracted tables with multiple columns • Figure annotations: (0.45, ~10,300 tables) and (0.65, ~1,000 tables)

Conclusion • Our work is a continuation of the efforts to extract structured data from the Web. • Our system, ListExtract, is completely unsupervised and does not assume any domain knowledge. It uses multiple sources of information to make its decisions. • Our results validate the quality of table extraction and suggest that a large number of high-quality lists can be exploited on the Web.

Thank you Questions?
