WebTables: Exploring the Power of Tables on the Web • Michael J. Cafarella, Alon Halevy, Zhe Daisy Wang, Eugene Wu, Yang Zhang • VLDB 2008 • 2009. 01. 08. • Summarized and Presented by Babar Tareen, IDS Lab., Seoul National University
Introduction • Web is a corpus of unstructured data • Some structure is imposed by • Hierarchical URLs • Hyperlink Graph • Web pages generally contain • Text as paragraphs • Tabular data (Relations) • Text and tables have different characteristics • Tables have more structured data than raw text
Introduction (2) • Tables can give some hints about semantics • Headers • Tuples • Regular keyword query techniques are not very effective for tables
Motivation • Enable analysis and integration of data on the web • User demand for structured data • For 30 million queries users clicked on results containing tables • This paper focuses on two fundamental questions • What are effective methods for searching within large collections of tables? • Is there additional power that can be derived by analyzing a large corpus of tables?
WebTables - Data • WebTables system considers HTML tables that are already surfaced and crawlable • Deep Web refers to the content that is made available through filling HTML forms • Corpus • 14.1 Billion raw HTML tables • 154 Million distinct relational databases • Relational databases form about 1.1% of raw HTML tables • 60% of data from non-deep-web sources • 40% of data from parameterized URLs
Extracting Relations • Most HTML tables are used for page layouts • To separate relational from non-relational tables • Handwritten detectors • Statistically trained classifiers • Training & Test data generated by two independent judges • Scale of relational quality 1-5 • Tables that received an average score of 4 or above were considered relational
Attribute Correlation Statistics Database (ACSDb) • For each unique schema Rs, ACSDb contains a frequency count • A = {(Rs1,C1), (Rs2,C2), (Rs3,C3) … } • If a schema appears multiple times under the same domain name it is counted only once • ACSDb contains • 5.4M unique attribute names • 2.6M unique schemas • ACSDb is simple but can be used to compute probabilities • For example, the conditional probability of finding attribute ‘Address’ in a schema given attribute ‘Name’ • P(address | name) = (count of schemas containing both address and name) / (count of schemas containing name)
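The ACSDb counting rule and the conditional probability above can be sketched in a few lines of Python. This is a minimal illustration, not the system's implementation: the function names, the lower-casing of attribute names, and the toy table list are all assumptions made for the example.

```python
from collections import Counter

def build_acsdb(tables):
    """tables: iterable of (domain, attribute_list), one entry per extracted relation."""
    seen = set()
    acsdb = Counter()
    for domain, attrs in tables:
        # Assumption: attribute names are normalized by trimming and lower-casing.
        schema = frozenset(a.strip().lower() for a in attrs)
        if (domain, schema) in seen:
            continue  # a schema repeated under the same domain is counted only once
        seen.add((domain, schema))
        acsdb[schema] += 1
    return acsdb

def count_containing(acsdb, attrs):
    """Total frequency of schemas that contain every attribute in attrs."""
    wanted = frozenset(attrs)
    return sum(c for schema, c in acsdb.items() if wanted <= schema)

def cond_prob(acsdb, target, given):
    """P(target | given) estimated from schema frequency counts."""
    denom = count_containing(acsdb, [given])
    if denom == 0:
        return 0.0
    return count_containing(acsdb, [target, given]) / denom

# Hypothetical toy corpus.
tables = [
    ("a.example", ["Name", "Address"]),
    ("a.example", ["name", "address"]),  # same schema, same domain: ignored
    ("b.example", ["name", "address"]),
    ("c.example", ["name", "phone"]),
    ("d.example", ["make", "model"]),
]
acsdb = build_acsdb(tables)
print(cond_prob(acsdb, "address", "name"))
```

In the toy corpus, two of the three counted schemas containing name also contain address, so the estimate is 2/3.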
Relation Search • WebTables search engine allows users to rank relations by relevance • Query-appropriate visualizations can be created • Columns containing place names can be displayed on a map • Graphs can be generated from table data • Traditional structured operations can be applied over search results • Selection • Projection
Ranking • Keyword ranking for databases is a novel problem • Challenges • Relations do not exist in a domain-specific schema graph • Word frequencies apply ambiguously to tables (Ex: which table in the page is described by which frequent word) • Attribute labels are extremely important • Attributes provide good summaries of the subject matter • Tuples may have a key-like element that summarizes the row • Ranking Functions • naïveRank • filterRank • featureRank • schemaRank
Ranking Function (1) • Naïve Rank • It simply uses the top k search engine result pages to generate relations. • If there are no relations in the top k search results, Naïve Rank will emit no relations. • Roughly simulates a modern search engine user
Ranking Function (2) • Filter Rank • Similar to naïve rank • It will go as far down the search result pages as necessary to find ‘k’ relations
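The difference between Naïve Rank and Filter Rank can be sketched as follows. This is an illustration under assumptions: the search engine is replaced by an already ranked list of pages, and the page dictionaries with a "relations" key are hypothetical.

```python
def naive_rank(ranked_pages, k):
    """Return relations found on the top-k result pages only (possibly none)."""
    relations = []
    for page in ranked_pages[:k]:
        relations.extend(page["relations"])
    return relations[:k]

def filter_rank(ranked_pages, k):
    """Walk down the result list as far as needed to collect k relations."""
    relations = []
    for page in ranked_pages:
        relations.extend(page["relations"])
        if len(relations) >= k:
            break
    return relations[:k]

# Hypothetical search results, best page first.
pages = [
    {"url": "p1", "relations": []},
    {"url": "p2", "relations": ["r1"]},
    {"url": "p3", "relations": ["r2", "r3"]},
]
print(naive_rank(pages, 2))   # only pages p1 and p2 are inspected
print(filter_rank(pages, 2))  # continues to p3 to fill the quota
```

With k = 2, the naive variant inspects only p1 and p2 and finds a single relation, while the filtering variant keeps going until it has two.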
Ranking Function (3) • Feature Rank • Does not rely on an existing search engine • Uses relation-specific features to score each extracted relation in the corpus • Sorts results by score • Different feature scores were combined using a linear regression estimator • Trained on a thousand (q, relation) pairs, each scored by two human judges
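A rough sketch of the scoring step: each relation is turned into a feature vector, and a linear model combines the features into one score. The feature set, the weights in WEIGHTS, and the relation layout are illustrative assumptions only; in the described system the weights come from a regression trained on human-judged pairs, which is not reproduced here.

```python
# Hypothetical weights standing in for a trained linear regression estimator.
WEIGHTS = {"header_hits": 2.0, "leftmost_col_hits": 1.0, "body_hits": 0.5}

def count_hits(terms, cells):
    """Number of (cell, term) pairs where the term occurs in the cell text."""
    return sum(1 for cell in cells for t in terms if t in cell.lower())

def features(query, relation):
    terms = query.lower().split()
    rows = relation["rows"]
    return {
        "header_hits": count_hits(terms, relation["header"]),
        "leftmost_col_hits": count_hits(terms, [r[0] for r in rows if r]),
        "body_hits": count_hits(terms, [c for r in rows for c in r]),
    }

def score(query, relation):
    f = features(query, relation)
    return sum(WEIGHTS[name] * value for name, value in f.items())

def feature_rank(query, relations):
    """Sort relations by descending linear score."""
    return sorted(relations, key=lambda rel: score(query, rel), reverse=True)

cars = {"header": ["make", "model"], "rows": [["toyota", "camry"]]}
cities = {"header": ["city", "population"], "rows": [["seoul", "9700000"]]}
ranked = feature_rank("make model", [cities, cars])
```

Here the query terms hit both header cells of the cars relation and nothing in the cities relation, so cars is ranked first.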
Ranking Function (4) • Schema Rank • Same as Feature Rank • Additionally uses an ACSDb-based schema coherence score • A coherent schema is one where attributes are strongly related • Make, Model (coherent) • Make, Zipcode (less coherent) • PMI - Pointwise Mutual Information • Gives a sense of how strongly two items are related • Coherence score for a schema is the average of all possible attribute-pairwise PMI scores for the schema
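The coherence score can be sketched with the standard definition PMI(a, b) = log(p(a, b) / (p(a) p(b))). The code below is a minimal illustration: the probabilities are estimated from a toy ACSDb weighted by its frequency counts, pairs that never co-occur are given negative infinity, and the counts themselves are made up for the example. All of these are assumptions of the sketch.

```python
import math
from itertools import combinations

def prob(acsdb, attrs):
    """Fraction of schema occurrences containing all attributes in attrs."""
    total = sum(acsdb.values())
    wanted = frozenset(attrs)
    return sum(c for s, c in acsdb.items() if wanted <= s) / total

def pmi(acsdb, a, b):
    p_ab = prob(acsdb, [a, b])
    if p_ab == 0:
        return float("-inf")  # assumption: attributes that never co-occur
    return math.log(p_ab / (prob(acsdb, [a]) * prob(acsdb, [b])))

def coherence(acsdb, schema):
    """Average PMI over all attribute pairs of the schema."""
    pairs = list(combinations(sorted(schema), 2))
    if not pairs:
        return 0.0
    return sum(pmi(acsdb, a, b) for a, b in pairs) / len(pairs)

# Hypothetical toy ACSDb: schema -> frequency count.
acsdb = {
    frozenset({"make", "model"}): 6,
    frozenset({"make", "model", "zipcode"}): 1,
    frozenset({"name", "zipcode"}): 5,
}
print(coherence(acsdb, {"make", "model"}))
print(coherence(acsdb, {"make", "zipcode"}))
```

On this toy data the (make, model) pair gets a positive PMI and (make, zipcode) a negative one, matching the intuition on the slide.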
Indexing • Traditional search engines use an inverted index • A plain inverted index cannot retrieve relational features • Inverted Index • Term -> (docid, offset) • WebTables data exists in two dimensions • Term -> (docid, offset-X, offset-Y)
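A two-dimensional posting list can be sketched as below. The table identifiers stand in for docids, offset-X is taken to be the column and offset-Y the row, and row 0 is assumed to be the header; these conventions and the function names are assumptions of the example.

```python
from collections import defaultdict

def build_index(tables):
    """tables: dict table_id -> list of rows; row 0 is assumed to be the header."""
    index = defaultdict(list)
    for table_id, rows in tables.items():
        for y, row in enumerate(rows):
            for x, cell in enumerate(row):
                for term in cell.lower().split():
                    index[term].append((table_id, x, y))  # Term -> (docid, x, y)
    return index

def header_hits(index, term):
    """Postings where the term occurs in a header cell (y == 0)."""
    return [(t, x) for (t, x, y) in index.get(term.lower(), []) if y == 0]

# Hypothetical toy tables.
tables = {
    "t1": [["Make", "Model"], ["Toyota", "Camry"]],
    "t2": [["City", "Car"], ["Seoul", "Toyota Camry"]],
}
index = build_index(tables)
print(index["toyota"])
print(header_hits(index, "make"))
```

Keeping both coordinates lets a ranker ask relational questions that a (docid, offset) posting cannot answer, such as whether a hit is in the header row or in the leftmost column.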
ACSDb Application (1) • Schema Auto Complete • Designed to assist novice database designers when creating a relational schema • Schemas consisting of single relations • User enters one or more domain-specific attributes and the auto-completer guesses the rest of the attributes
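One way to picture the auto-completer is a greedy loop that keeps adding the attribute most likely to co-occur with what has been chosen so far. The sketch below is a simplified variant under assumptions: it conditions on the whole current schema, stops at a hypothetical threshold of 0.5, and uses made-up counts.

```python
def count_containing(acsdb, attrs):
    wanted = frozenset(attrs)
    return sum(c for s, c in acsdb.items() if wanted <= s)

def autocomplete(acsdb, given, threshold=0.5):
    """Greedily extend `given` with the most probable next attribute."""
    schema = list(given)
    vocabulary = set().union(*acsdb)
    while True:
        base = count_containing(acsdb, schema)
        if base == 0:
            break  # nothing in the ACSDb matches the current schema
        best, best_p = None, 0.0
        for a in sorted(vocabulary - set(schema)):
            p = count_containing(acsdb, schema + [a]) / base  # P(a | schema)
            if p > best_p:
                best, best_p = a, p
        if best is None or best_p < threshold:
            break
        schema.append(best)
    return schema

# Hypothetical toy ACSDb: schema -> frequency count.
acsdb = {
    frozenset({"make", "model", "year"}): 5,
    frozenset({"make", "model", "price"}): 3,
    frozenset({"make", "zipcode"}): 1,
    frozenset({"name", "address"}): 4,
}
print(autocomplete(acsdb, ["make"]))
```

Starting from make, the loop adds model (8 of 9 occurrences) and then year (5 of 8), and stops because no remaining attribute co-occurs with all three.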
ACSDb Application (2) • Attribute Synonym-Finding • Automatically find synonyms between arbitrary attribute strings • Generates attribute pairs based on a set of context attributes • Assumptions • Synonymous attributes will never appear together in the same schema • Odds of synonymity are higher if p(a,b) = 0 despite a large value for p(a)p(b) • Two synonyms will appear in similar contexts
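The three assumptions can be combined into a single score, sketched below. The exact form is an assumption of this example: pairs that co-occur score zero, the numerator rewards a large p(a)p(b), and the denominator penalizes differing contexts via squared differences of p(z|a) and p(z|b). The smoothing constant eps, the omission of the user-supplied context attributes, and the toy counts are also assumptions.

```python
def p(acsdb, attrs):
    """Fraction of schema occurrences containing all attributes in attrs."""
    total = sum(acsdb.values())
    wanted = frozenset(attrs)
    return sum(c for s, c in acsdb.items() if wanted <= s) / total

def p_given(acsdb, z, a):
    pa = p(acsdb, [a])
    return p(acsdb, [z, a]) / pa if pa else 0.0

def synonym_score(acsdb, a, b, eps=0.01):
    if p(acsdb, [a, b]) > 0:
        return 0.0  # synonyms are assumed never to share a schema
    vocabulary = set().union(*acsdb) - {a, b}
    # Context distance: how differently a and b co-occur with other attributes.
    distance = sum((p_given(acsdb, z, a) - p_given(acsdb, z, b)) ** 2
                   for z in vocabulary)
    return p(acsdb, [a]) * p(acsdb, [b]) / (eps + distance)

# Hypothetical toy ACSDb: schema -> frequency count.
acsdb = {
    frozenset({"name", "phone", "e-mail"}): 4,
    frozenset({"name", "phone", "email"}): 3,
    frozenset({"name", "address", "zip"}): 3,
}
print(synonym_score(acsdb, "e-mail", "email"))
print(synonym_score(acsdb, "email", "address"))
```

In the toy data, e-mail and email never co-occur yet share the same neighbors, so they score far higher than email and address, which also never co-occur but appear in different contexts.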
ACSDb Application (3) • Join Graph Traversal • Provides a useful way of navigating the huge graph of 2.6M schemas • Basic join graph • Contains a node ‘N’ for each unique schema • Undirected join link between any two schemas that share an attribute • Every schema that contains a ‘name’ field is linked to every other schema that contains ‘name’ • Cluster together similar schemas to minimize graph clutter • Schema: X,Y • Shared Attribute: D
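The basic join graph, before any clustering, can be sketched as an edge map between schema pairs. The function name and toy schemas are assumptions, and the pairwise loop is only for illustration: it is quadratic in the number of schemas, so at the scale of millions of schemas an attribute-to-schemas lookup would be the more practical starting point.

```python
from itertools import combinations

def build_join_graph(schemas):
    """Map each unordered pair of schemas to the attributes they share."""
    edges = {}
    for s, t in combinations(schemas, 2):
        shared = s & t
        if shared:
            edges[frozenset({s, t})] = shared  # undirected join link
    return edges

# Hypothetical toy schemas (one node per unique schema).
contacts = frozenset({"name", "address"})
phones = frozenset({"name", "phone"})
cars = frozenset({"make", "model"})

graph = build_join_graph([contacts, phones, cars])
for pair, shared in graph.items():
    print(sorted(map(sorted, pair)), "share", sorted(shared))
```

Only the two schemas containing name are linked; the cars schema stays an isolated node.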
Exp. Results – Relation Ranking • Rank-ACSDb beats Naïve (which simulates search engine users) by 78-100% • All of the non-Naïve solutions improve as k (number of results) increases
Exp. Results – Schema Auto Complete • Test Scenario • 6 humans designed schemas using given attributes • Auto-complete tool got three tries • By the 3rd output, auto-complete was able to reproduce a large number of schemas • No test designer recognized ‘ab’ as an abbreviation for ‘at-bats’, baseball terminology
Exp. Results – Synonym Finding • Ranked by quality • An ideal ranking would present a stream of only correct synonyms, followed by only incorrect ones • A poor ranking will mix them together
Conclusion • WebTables is the first large-scale attempt to extract relational information embedded in HTML tables • Relation Ranking • ACSDb uses • Schema auto complete • Attribute Synonym Finding • Join Graph Traversal • Adding a signal for source page quality, like PageRank, will improve overall quality
Discussion • Pros • Handling tables separately for search is a good idea • Cons • Most of the paper is focused on uses of ACSDb