Harvesting Relational Tables from Lists on the Web • Hazem Elmeleegy, Purdue University • Jayant Madhavan and Alon Halevy, Google Inc.
Outline • Introduction • The ListExtract Approach • Experiments • Conclusion
Lists on the Web • Our Goal: • Extract tabular data from all such lists in an unsupervised and domain-independent manner. • Not the typical wrapper generation problem.
Cartoons Example: Easy for Humans, Confusing for Machines • A period (“.”) is used both as a delimiter and to terminate abbreviations. • A slash (“/”) is used both as a delimiter and as part of the text. • The slash (“/”) delimiter is missing (along with the production year).
Key Contributions • Developed the ListExtract system, which extracts tables from lists in an unsupervised and domain-independent manner. • Introduced the use of external sources of information, such as a large corpus of tables collected from the Web and a language model, to guide the splitting decisions. • Conducted a large-scale experimental study, which suggests that tens of millions of high-quality lists on the Web can be exploited.
ListExtract Approach
• Independent Splitting Phase
  • Splitting Lines into Records
• Alignment Phase
  • Deciding the Number of Columns
  • Re-Splitting Long Records
  • Aligning Short Records (Null Insertion)
• Refinement Phase
  • Detecting Inconsistent Fields
  • Re-Splitting Detected Field Streaks
  • Re-Aligning Detected Field Streaks (Null Insertion)
Intermediate Outputs (Re-Splitting Long Records) Number of Columns = 4
Line Splitting Algorithm (figure labels: Input, pre-processing (removing delimiters), Subsequence, FQ Score, Output)
Field Quality (FQ) Score
• Linear combination of multiple score components
• Each component corresponds to a source of evidence
• Score Components
  • Data Type
    • Regular expressions to capture different data types (e.g., dates, emails, currencies)
    • Score: 1 if match found, 0 otherwise
  • Table Corpus
    • Check if the candidate sequence exists as a field in the table corpus
    • Score: 1 if exists, 0 otherwise
  • Language Model
    • Measure the likelihood that the candidate sequence occurs in free text, and the unlikelihood that overlapping sequences occur in free text
    • Score: a combination of the probabilities capturing both the likelihood and unlikelihood
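The FQ score described above can be sketched as a weighted sum of its three components. This is a minimal illustration, not the authors' implementation: the regular expressions, the weights, the case-folding in the corpus lookup, and the `lm_score` callable (a stand-in for the language-model component) are all assumptions made for the example.

```python
import re
from typing import Callable, Set, Tuple

# Hypothetical patterns; the slides only say that regular expressions
# capture data types such as dates, emails, and currencies.
DATA_TYPE_PATTERNS = [
    re.compile(r"\d{4}-\d{2}-\d{2}"),           # ISO-style date
    re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+"),  # email address
    re.compile(r"[$€£]\d+(\.\d{2})?"),          # currency amount
]

def data_type_score(candidate: str) -> float:
    """1 if the whole candidate matches a known data type, 0 otherwise."""
    return 1.0 if any(p.fullmatch(candidate) for p in DATA_TYPE_PATTERNS) else 0.0

def table_corpus_score(candidate: str, corpus_fields: Set[str]) -> float:
    """1 if the candidate exists as a field in the table corpus, 0 otherwise.

    Lower-casing before the lookup is an assumption of this sketch.
    """
    return 1.0 if candidate.lower() in corpus_fields else 0.0

def fq_score(
    candidate: str,
    corpus_fields: Set[str],
    lm_score: Callable[[str], float],
    weights: Tuple[float, float, float] = (0.3, 0.4, 0.3),  # assumed weights
) -> float:
    """Linear combination of the three evidence components."""
    w_type, w_corpus, w_lm = weights
    return (
        w_type * data_type_score(candidate)
        + w_corpus * table_corpus_score(candidate, corpus_fields)
        + w_lm * lm_score(candidate)
    )
```

Passing the language model in as a callable keeps the sketch independent of any particular model; any function returning a score in [0, 1] would fit.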
Deciding the Number of Columns • Majority voting across all records
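Majority voting over the independently split records can be sketched in a few lines. The function name and the tie-breaking behavior (the field count seen first wins a tie) are assumptions of this example; the slides do not say how ties are resolved.

```python
from collections import Counter
from typing import List

def decide_num_columns(records: List[List[str]]) -> int:
    """Return the most common number of fields across all split records."""
    if not records:
        raise ValueError("need at least one record")
    counts = Counter(len(record) for record in records)
    # most_common(1) yields the (field_count, frequency) pair with the
    # highest frequency.
    return counts.most_common(1)[0][0]
```

Records with more fields than this number would then be re-split, and records with fewer would be aligned with nulls.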
Re-Splitting Long Records (figure labels: Input, pre-processing (removing delimiters), Subsequence, FQ Score, Output) • Maximum Number of Output Fields = 3
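To make the field cap concrete, here is a brute-force sketch that tries every way of cutting a token sequence into at most `max_fields` contiguous fields and keeps the split with the highest average FQ score. The exhaustive search is a stand-in chosen for clarity; the slides do not describe the search strategy ListExtract actually uses, and `fq` is a hypothetical scoring callable.

```python
from itertools import combinations
from typing import Callable, List

def resplit_with_cap(
    tokens: List[str],
    fq: Callable[[str], float],
    max_fields: int,
) -> List[str]:
    """Best split of tokens into at most max_fields contiguous fields."""
    n = len(tokens)
    if n == 0 or max_fields < 1:
        raise ValueError("need at least one token and max_fields >= 1")
    best, best_score = None, float("-inf")
    for k in range(1, min(max_fields, n) + 1):
        # Choose k - 1 cut positions between tokens.
        for cuts in combinations(range(1, n), k - 1):
            bounds = (0, *cuts, n)
            fields = [" ".join(tokens[a:b]) for a, b in zip(bounds, bounds[1:])]
            score = sum(fq(f) for f in fields) / len(fields)
            if score > best_score:
                best, best_score = fields, score
    return best
```

The number of candidate splits grows combinatorially with the token count, so this is only practical for short lines.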
Aligning Short Records (Null Insertion) (figure: Independently Split Records, each with its Avg. FQ Score, and the resulting Output Table) • Step 1: Sorting • Step 2: Iterative Alignment
Aligning Short Records (Null Insertion) • To align each record, we use the classical Needleman-Wunsch sequence alignment algorithm [NW, J. of Molecular Biology, 1970]. • The two sequences: • Sequence #1: Table columns • Sequence #2: Fields of a short record • Design a Field-to-Field Consistency (F2FC) Score. • Use the average F2FC Score as the similarity measure for the alignment algorithm.
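The alignment step can be illustrated with a restricted form of Needleman-Wunsch dynamic programming in which gaps (nulls) may only be inserted into the short record, never into the table columns. That restriction, the `gap_penalty` default, and the `sim(column_index, field)` callable (standing in for the average F2FC score between a field and a column) are assumptions of this sketch.

```python
from typing import Callable, List, Optional

def align_short_record(
    n_columns: int,
    fields: List[str],
    sim: Callable[[int, str], float],
    gap_penalty: float = 0.0,
) -> List[Optional[str]]:
    """Place fields into columns in order, filling unused columns with None."""
    m = len(fields)
    if m > n_columns:
        raise ValueError("record has more fields than the table has columns")
    neg = float("-inf")
    # score[i][j]: best total for placing the first j fields
    # into the first i columns.
    score = [[neg] * (m + 1) for _ in range(n_columns + 1)]
    score[0][0] = 0.0
    for i in range(1, n_columns + 1):
        for j in range(0, min(i, m) + 1):
            skip = score[i - 1][j] - gap_penalty  # column i-1 gets a null
            take = neg
            if j > 0 and score[i - 1][j - 1] > neg:
                take = score[i - 1][j - 1] + sim(i - 1, fields[j - 1])
            score[i][j] = max(skip, take)
    # Trace back from the full problem to recover the placement.
    aligned: List[Optional[str]] = [None] * n_columns
    i, j = n_columns, m
    while i > 0:
        if (
            j > 0
            and score[i - 1][j - 1] > neg
            and score[i][j] == score[i - 1][j - 1] + sim(i - 1, fields[j - 1])
        ):
            aligned[i - 1] = fields[j - 1]
            j -= 1
        i -= 1
    return aligned
```

In an iterative setting, each aligned record would be added to the table before the next short record is aligned, so that later similarity scores can draw on more column evidence.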
Field-to-Field Consistency (F2FC) Score
• Linear combination of multiple score components
• Each component corresponds to a source of evidence
• Score Components
  • Data Type
    • Check if data types are consistent
  • Table Corpus
    • Check if two fields co-occur in the same column in a table in the corpus
  • Syntax
    • Measure the consistency of the syntax of the two fields (e.g., length, % of upper/lower case letters, digits, spaces, etc.)
  • Delimiters
    • Measure the consistency between the delimiters on both sides of the two fields
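As an illustration of one F2FC component, the sketch below compares the character-class profiles of two fields and averages a field-to-field score over a column. It covers only the syntax evidence; the particular profile (fractions of upper-case, lower-case, digit, and space characters) and the distance formula are assumptions, and a full F2FC score would combine this with the data type, table corpus, and delimiter components.

```python
from typing import Callable, List, Optional, Tuple

def syntax_profile(field: str) -> Tuple[float, float, float, float]:
    """Fractions of upper-case, lower-case, digit, and space characters."""
    n = max(len(field), 1)
    return (
        sum(c.isupper() for c in field) / n,
        sum(c.islower() for c in field) / n,
        sum(c.isdigit() for c in field) / n,
        sum(c.isspace() for c in field) / n,
    )

def syntax_consistency(a: str, b: str) -> float:
    """1 minus the mean absolute difference between the two profiles."""
    pa, pb = syntax_profile(a), syntax_profile(b)
    return 1.0 - sum(abs(x - y) for x, y in zip(pa, pb)) / len(pa)

def average_f2fc(
    field: str,
    column_fields: List[Optional[str]],
    f2fc: Callable[[str, str], float],
) -> float:
    """Average field-to-field score against the non-null fields of a column."""
    present = [c for c in column_fields if c is not None]
    if not present:
        return 0.0
    return sum(f2fc(field, c) for c in present) / len(present)
```

The averaged value is the kind of similarity that an alignment step could use when deciding which column a field belongs to.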
Refinement Phase (figure: Output Table) • Detect inconsistent fields • Consider streaks only • Re-merge • Re-split (and re-align if needed) • Use extended FQ score
Field Quality (FQ) Score [Revisited]
• Linear combination of multiple score components
• Each component corresponds to a source of evidence
• Score Components
  • Data Type
  • Table Corpus
  • Language Model
  • List Support
    • Favors candidates that are more consistent with the columns spanned by the streak
Table Extraction (TE) Score • Average FQ Score for all fields in the extracted table • Used to compare and rank the extracted tables by extraction quality
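The TE score and the ranking it enables can be sketched as follows. Skipping null fields when averaging, returning 0 for an empty table, and the `fq` callable itself are assumptions of this example.

```python
from typing import Callable, List, Optional

Table = List[List[Optional[str]]]

def te_score(table: Table, fq: Callable[[str], float]) -> float:
    """Average FQ score over all non-null fields of an extracted table."""
    scores = [fq(field) for row in table for field in row if field is not None]
    return sum(scores) / len(scores) if scores else 0.0

def rank_tables(tables: List[Table], fq: Callable[[str], float]) -> List[Table]:
    """Order extracted tables from highest to lowest TE score."""
    return sorted(tables, key=lambda t: te_score(t, fq), reverse=True)
```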
Overall Performance for WLists and TDLists • WLists: A set of 20 manually-collected HTML lists spanning 20 different domains. • TDLists: A set of 100 lists derived from randomly-selected HTML tables
Large-Scale Experiment • A crawl of 100K web pages • 100K extracted lists • 32K lists after filtering • 11K extracted tables with multiple columns • Figure annotations: (0.45, ~10,300 tables) and (0.65, ~1,000 tables)
Conclusion • Our work is a continuation of the efforts to extract structured data from the Web. • Our system, ListExtract, is completely unsupervised and does not assume any domain knowledge. It uses multiple sources of information to make its decisions. • Our results validate the quality of table extraction and suggest that a large number of high-quality lists can be exploited on the Web.
Thank you Questions?