WebTables : Exploring the Power of Tables on the Web

Michael J. Cafarella, Alon Halevy, Zhe Daisy Wang, Eugene Wu, Yang Zhang. VLDB 2008. 2009. 01. 08. Summarized and presented by Babar Tareen, IDS Lab., Seoul National University.


Presentation Transcript


  1. WebTables: Exploring the Power of Tables on the Web • Michael J. Cafarella, Alon Halevy, Zhe Daisy Wang, Eugene Wu, Yang Zhang • VLDB 2008 • 2009. 01. 08. • Summarized and presented by Babar Tareen, IDS Lab., Seoul National University

  2. Introduction • Web is a corpus of unstructured data • Some structure is imposed by • Hierarchical URLs • Hyperlink Graph • Web pages generally contain • Text as paragraphs • Tabular data (Relations) • Text and tables have different characteristics • Tables have more structured data than raw text

  3. Introduction (2) • Tables can give some hints about semantics • Headers • Tuples • Regular keyword query techniques are not very effective for tables

  4. Motivation • Enable analysis and integration of data on the web • User demand for structured data • For 30 million queries, users clicked on results containing tables • This paper focuses on two fundamental questions • What are effective methods for searching within large collections of tables? • Is there additional power that can be derived by analyzing a large corpus of tables?

  5. WebTables - Data • The WebTables system considers HTML tables that are already surfaced and crawlable • Deep Web refers to content that is made available through filling in HTML forms • Corpus • 14.1 billion raw HTML tables • 154 million distinct relational databases • Relational databases form 1.1% of raw HTML tables • 60% of data from non-deep-web sources • 40% of data from parameterized URLs

  6. Extracting Relations • Most HTML tables are used for page layout • To separate relational from non-relational tables • Handwritten detectors • Statistically trained classifiers • Training and test data were generated by two independent judges • Tables were scored for relational quality on a scale of 1-5 • Tables that received an average score of 4 or above were considered relational

  7. Data Model

  8. Attribute Correlation Statistics Database (ACSDb) • For each unique schema Rs, the ACSDb contains a frequency count • A = {(Rs1, C1), (Rs2, C2), (Rs3, C3), …} • If a schema appears multiple times under the same domain name, it is counted only once • ACSDb contains • 5.4M unique attribute names • 2.6M unique schemas • ACSDb is simple but can be used to compute probabilities • For example, the conditional probability of finding attribute ‘Address’ in a schema given attribute ‘Name’: P(address | name) = (count of schemas containing both address and name) / (count of schemas containing name)
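
The counting rule and the conditional probability on this slide can be sketched in a few lines. This is a minimal illustration, not the WebTables implementation: the input format (a list of `(domain, attribute names)` pairs), the lowercase normalization, the treatment of a schema as an unordered set, and the function names are all assumptions made for the example.

```python
from collections import Counter


def build_acsdb(tables):
    """Build schema -> count from (domain, attribute_names) pairs (hypothetical input)."""
    acsdb = Counter()
    seen = set()
    for domain, attrs in tables:
        # Assumption: a schema is the set of lowercased attribute names.
        schema = frozenset(a.strip().lower() for a in attrs)
        if (domain, schema) in seen:
            continue  # the same schema under the same domain is counted once
        seen.add((domain, schema))
        acsdb[schema] += 1
    return acsdb


def count_containing(acsdb, attrs):
    """Sum the counts of all schemas that contain every attribute in attrs."""
    wanted = frozenset(attrs)
    return sum(c for schema, c in acsdb.items() if wanted <= schema)


def conditional_probability(acsdb, target, given):
    """P(target | given) estimated from ACSDb counts."""
    denom = count_containing(acsdb, [given])
    if denom == 0:
        return 0.0
    return count_containing(acsdb, [target, given]) / denom


tables = [
    ("a.example", ["Name", "Address"]),
    ("a.example", ["Name", "Address"]),  # repeat within one domain, ignored
    ("b.example", ["Name", "Address"]),
    ("b.example", ["Name", "Phone"]),
    ("c.example", ["Make", "Model"]),
]
acsdb = build_acsdb(tables)
p_address_given_name = conditional_probability(acsdb, "address", "name")
print(p_address_given_name)
```

In this toy corpus, two of the three counted schemas containing `name` also contain `address`, so the estimate is 2/3.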

  9. ACSDb

  10. Relation Search • The WebTables search engine ranks relations by relevance to a user's keyword query • Query-appropriate visualizations can be created • Columns containing place names can be displayed on a map • Graphs can be generated from table data • Traditional structured operations can be applied over search results • Selection • Projection

  11. Ranking • Keyword ranking for databases is a novel problem • Challenges • Relations do not exist in a domain-specific schema graph • Word frequencies apply ambiguously to tables (e.g., which table on the page is described by which frequent word?) • Attribute labels are extremely important • Attributes provide good summaries of the subject matter • Tuples may have a key-like element that summarizes the row • Ranking Functions • naïveRank • filterRank • featureRank • schemaRank

  12. Ranking Function (1) • Naïve Rank • It simply uses the top k search engine result pages to generate relations. • If there are no relations in the top k search results, naïve Rank will emit no relations. • Roughly simulates modern search engine user

  13. Ranking Function (2) • Filter Rank • Similar to naïve rank • It will go as far down the search result pages as necessary to find ‘k’ relations
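
The difference between naïveRank and filterRank is easiest to see side by side. The sketch below assumes the search engine's output is already available as a ranked list of pages, each carrying the relations extracted from it; that data shape and the function names are hypothetical, and the code is an illustration of the two descriptions above rather than the authors' implementation.

```python
def naive_rank(ranked_pages, k):
    """Return whatever relations appear on the top-k result pages (possibly none)."""
    results = []
    for page in ranked_pages[:k]:
        results.extend(page["relations"])
    return results


def filter_rank(ranked_pages, k):
    """Walk down the result list as far as needed to collect k relations."""
    results = []
    for page in ranked_pages:
        for relation in page["relations"]:
            if len(results) == k:
                return results
            results.append(relation)
    return results


# Hypothetical ranked search results: the first two pages contain no relations.
pages = [
    {"url": "p1", "relations": []},
    {"url": "p2", "relations": []},
    {"url": "p3", "relations": ["r1", "r2"]},
    {"url": "p4", "relations": ["r3"]},
]
print(naive_rank(pages, 2))
print(filter_rank(pages, 2))
```

With `k = 2`, `naive_rank` emits nothing because neither of the top two pages holds a relation, while `filter_rank` keeps going and returns the first two relations it finds.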

  14. Ranking Function (3) • Feature Rank • Does not rely on an existing search engine • Uses relation-specific features to score each extracted relation in the corpus • Sorts results by score • Different feature scores were combined using a linear regression estimator • Trained on a thousand (q, relation) pairs, each scored by two human judges
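
A linear combination of feature scores can be sketched as follows. The feature names and the weights are purely illustrative assumptions; in the setting described on the slide the weights would come from a linear regression fitted to the human-judged (q, relation) pairs, which is not reproduced here.

```python
# Hypothetical weights, standing in for coefficients learned by linear regression.
WEIGHTS = {"header_hits": 2.0, "leftmost_column_hits": 1.0, "body_hits": 0.5}
BIAS = 0.0


def feature_score(features):
    """Weighted sum of a relation's feature values for one query."""
    return BIAS + sum(w * features.get(name, 0.0) for name, w in WEIGHTS.items())


def feature_rank(candidates, k):
    """Score every candidate relation and return the ids of the k best."""
    ranked = sorted(candidates, key=lambda c: feature_score(c["features"]), reverse=True)
    return [c["id"] for c in ranked[:k]]


# Hypothetical per-query feature values for three extracted relations.
candidates = [
    {"id": "A", "features": {"header_hits": 1, "body_hits": 2}},
    {"id": "B", "features": {"leftmost_column_hits": 1, "body_hits": 1}},
    {"id": "C", "features": {"header_hits": 2}},
]
top_two = feature_rank(candidates, 2)
print(top_two)
```

Unlike the two search-engine-based functions, this ranker scores every relation in the corpus directly, so it does not depend on tables happening to sit on highly ranked pages.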

  15. Ranking Function (4) • Schema Rank • Same as featureRank • Additionally uses an ACSDb-based schema coherence score • A coherent schema is one whose attributes are strongly related • (Make, Model) is highly coherent • (Make, Zipcode) is much less so • PMI - Pointwise Mutual Information • Gives a sense of how strongly two items are related • The coherence score for a schema is the average of all possible attribute-pairwise PMI scores for the schema
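
The coherence score can be sketched using the standard definition PMI(a, b) = log(p(a, b) / (p(a) p(b))), with probabilities estimated from ACSDb counts. The toy counts below are invented for illustration, and returning negative infinity for pairs that never co-occur is an assumption of this sketch, not something stated in the slides.

```python
import math
from itertools import combinations

# Toy ACSDb (invented counts): schema -> frequency.
ACSDB = {
    frozenset({"make", "model"}): 8,
    frozenset({"make", "model", "zipcode"}): 1,
    frozenset({"name", "zipcode"}): 6,
    frozenset({"name", "address", "zipcode"}): 5,
}
TOTAL = sum(ACSDB.values())


def p(*attrs):
    """Fraction of counted schemas that contain all the given attributes."""
    wanted = frozenset(attrs)
    return sum(c for schema, c in ACSDB.items() if wanted <= schema) / TOTAL


def pmi(a, b):
    """Pointwise mutual information of two attributes."""
    joint = p(a, b)
    if joint == 0:
        return float("-inf")  # assumption: never co-occurring pairs get -inf
    return math.log(joint / (p(a) * p(b)))


def coherence(schema):
    """Average PMI over all unordered attribute pairs of the schema."""
    pairs = list(combinations(sorted(schema), 2))
    if not pairs:
        return 0.0
    return sum(pmi(a, b) for a, b in pairs) / len(pairs)


print(coherence({"make", "model"}))
print(coherence({"make", "zipcode"}))
```

In this toy data `make` and `model` almost always appear together, so their PMI is positive, whereas `make` and `zipcode` co-occur less often than chance would predict and score below zero.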

  16. Indexing • Traditional search engines use an inverted index • A plain inverted index cannot retrieve relational features • Inverted index • Term -> (docid, offset) • WebTables data exists in two dimensions • Term -> (docid, offset-X, offset-Y)
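
A two-dimensional posting list of this kind can be sketched as below. The choice of X as the column offset and Y as the row offset, the convention that row 0 is the header, and the helper names are assumptions made for the example.

```python
from collections import defaultdict


def build_index(tables):
    """tables: {docid: list of rows}; each row is a list of cell strings."""
    index = defaultdict(list)
    for docid, rows in tables.items():
        for y, row in enumerate(rows):
            for x, cell in enumerate(row):
                for term in cell.lower().split():
                    index[term].append((docid, x, y))  # term -> (docid, offset-X, offset-Y)
    return index


def tables_with_term_in_header(index, term):
    """Docids where the term occurs in row 0 (assumed to be the header row)."""
    return sorted({d for d, x, y in index.get(term.lower(), []) if y == 0})


def shared_columns(index, term_a, term_b):
    """(docid, column) pairs in which both terms occur."""
    cols_a = {(d, x) for d, x, _ in index.get(term_a.lower(), [])}
    cols_b = {(d, x) for d, x, _ in index.get(term_b.lower(), [])}
    return sorted(cols_a & cols_b)


# Toy tables with made-up contents.
tables = {
    "t1": [["City", "Code"], ["Seoul", "A1"]],
    "t2": [["Name", "City"], ["Kim", "Seoul"]],
}
index = build_index(tables)
print(tables_with_term_in_header(index, "city"))
print(shared_columns(index, "city", "seoul"))
```

With only a one-dimensional offset, the second query (do two terms fall in the same column?) could not be answered from the index alone.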

  17. ACSDb Application (1) • Schema Auto-Complete • Designed to assist novice database designers when creating a relational schema • Schemas consisting of single relations • The user enters one or more domain-specific attributes and the auto-completer guesses the rest of the attributes
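
One way to realize such an auto-completer is a greedy loop over ACSDb counts: keep adding the attribute that most often co-occurs with everything chosen so far, and stop when the probability of the whole suggestion given the user's input drops to a threshold. The sketch below is an assumption-laden illustration (toy counts, hypothetical function names, a made-up threshold), not necessarily the procedure used in the paper.

```python
# Toy ACSDb (invented counts): schema -> frequency.
ACSDB = {
    frozenset({"name", "address", "phone"}): 6,
    frozenset({"name", "address"}): 3,
    frozenset({"name", "team", "ab"}): 2,
    frozenset({"make", "model", "year"}): 5,
}


def count_containing(attrs):
    """Sum the counts of all schemas that contain every attribute in attrs."""
    wanted = frozenset(attrs)
    return sum(c for schema, c in ACSDB.items() if wanted <= schema)


def suggest_schema(seed, threshold):
    """Greedily extend the seed attributes while the suggestion stays probable."""
    seed = frozenset(seed)
    base = count_containing(seed)
    if base == 0:
        return sorted(seed)  # nothing is known about this input
    schema = set(seed)
    vocabulary = set().union(*ACSDB)
    while vocabulary - schema:
        # sorted() makes tie-breaking deterministic
        best = max(sorted(vocabulary - schema),
                   key=lambda a: count_containing(schema | {a}))
        # Probability that a schema containing the seed also contains the suggestion
        if count_containing(schema | {best}) / base <= threshold:
            break
        schema.add(best)
    return sorted(schema)


print(suggest_schema({"name"}, 0.5))
print(suggest_schema({"name"}, 0.6))
```

Lowering the threshold yields longer but less certain suggestions; here `phone` is added at 0.5 but rejected at 0.6.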

  18. ACSDb Application (2) • Attribute Synonym-Finding • Automatically find synonyms between arbitrary attribute strings • Generates attribute pairs based on a set of context attributes • Assumptions • Synonymous attributes will never appear together in the same schema • Odds of synonymity are higher if p(a,b) = 0 despite a large value for p(a)p(b) • Two synonyms will appear in similar contexts
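
The three assumptions can be combined into a single score in several ways. The sketch below shows one plausible form: pairs that ever co-occur score zero, and otherwise the score is p(a)p(b) divided by a small constant plus the squared difference between the two attributes' context distributions. The exact formula, the smoothing constant `eps`, and the toy counts are assumptions for illustration.

```python
# Toy ACSDb (invented counts): schema -> frequency.
ACSDB = {
    frozenset({"name", "phone", "address"}): 4,
    frozenset({"name", "tel", "address"}): 3,
    frozenset({"name", "email", "team"}): 3,
}
TOTAL = sum(ACSDB.values())
VOCABULARY = set().union(*ACSDB)


def count(attrs):
    """Sum the counts of all schemas that contain every attribute in attrs."""
    wanted = frozenset(attrs)
    return sum(c for schema, c in ACSDB.items() if wanted <= schema)


def p(*attrs):
    return count(attrs) / TOTAL


def p_given(z, given):
    """P(z | given), or 0.0 when the conditioning attributes never occur."""
    denom = count(given)
    return count(set(given) | {z}) / denom if denom else 0.0


def synonym_score(a, b, context, eps=0.01):
    if count([a, b]) > 0:
        return 0.0  # synonyms are assumed never to share a schema
    # Squared distance between the contexts in which a and b appear
    distance = sum(
        (p_given(z, [a, *context]) - p_given(z, [b, *context])) ** 2
        for z in VOCABULARY - {a, b}
    )
    return p(a) * p(b) / (eps + distance)


phone_tel = synonym_score("phone", "tel", ["name"])
phone_email = synonym_score("phone", "email", ["name"])
print(phone_tel, phone_email)
```

In the toy data `phone` and `tel` never co-occur and both appear next to `address`, so they outscore `phone` and `email`, which also never co-occur but appear in different contexts.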

  19. ACSDb Application (3) • Join Graph Traversal • Provides a useful way of navigating the huge graph of 2.6M schemas • Basic join graph • Contains a node ‘N’ for each unique schema • Undirected join link between any two schemas that share an attribute • Every schema that contains a ‘name’ field is linked to every other schema that contains ‘name’ • Cluster together similar schemas to minimize graph clutter • Schema: X,Y • Shared Attribute: D
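
The basic join graph, before any clustering, can be sketched as follows. Schema identifiers, the edge representation (a dictionary keyed by ordered id pairs), and the function name are assumptions for the example; the clustering step that reduces clutter is not shown.

```python
from collections import defaultdict
from itertools import combinations


def build_join_graph(schemas):
    """schemas: {schema_id: set of attributes} -> {(id_a, id_b): shared attributes}."""
    by_attribute = defaultdict(set)
    for schema_id, attrs in schemas.items():
        for attr in attrs:
            by_attribute[attr].add(schema_id)
    edges = defaultdict(set)
    for attr, schema_ids in by_attribute.items():
        # Every pair of schemas sharing this attribute gets an undirected link.
        for a, b in combinations(sorted(schema_ids), 2):
            edges[(a, b)].add(attr)
    return dict(edges)


# Hypothetical schemas.
schemas = {
    "people": {"name", "address"},
    "players": {"name", "team"},
    "cars": {"make", "model"},
}
graph = build_join_graph(schemas)
print(graph)
```

Because a common attribute such as `name` links every schema that contains it to every other, the number of edges grows quadratically with the number of such schemas, which is why clustering similar schemas is needed at the scale of 2.6M nodes.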

  20. Exp. Results – Relation Ranking • Rank-ACSDb beats Naïve (which simulates search engine users) by 78-100% • All of the non-Naïve solutions improve as k (number of results) increases

  21. Exp. Results – Schema Auto-Complete • Test Scenario • 6 humans designed schemas using given attributes • The auto-complete tool got three tries • By the 3rd output, auto-complete was able to reproduce a large number of schemas • No test designer recognized ‘ab’ as an abbreviation for ‘at-bats’, a baseball term

  22. Exp. Results – Synonym Finding • Results are ranked by quality • An ideal ranking would present a stream of only correct synonyms, followed by only incorrect ones • A poor ranking will mix them together

  23. Exp. Results – Join Graph Traversal

  24. Conclusion • WebTables is the first large-scale attempt to extract relational information embedded in HTML tables • Relation Ranking • ACSDb uses • Schema Auto-Complete • Attribute Synonym Finding • Join Graph Traversal • Adding a signal for source page quality, such as PageRank, will improve overall quality

  25. Discussion • Pros • Handling tables separately for search is a good idea • Cons • Most of the paper is focused on uses of ACSDb
