
Managing The Structured Web

Managing The Structured Web. Michael J. Cafarella, University of Michigan, Michigan CSE, April 23, 2010. The Structured Web: Web pages contain structure that is obvious to humans, though not to machines. Search engines are largely blind to it. Databases need data that is perfectly structured.


Presentation Transcript


  1. Managing The Structured Web • Michael J. Cafarella • University of Michigan • Michigan CSE • April 23, 2010

  2. The Structured Web
     • Web pages contain structure that is obvious to humans, though not to machines
     • Search engines are largely blind to it
     • Databases need data that is perfectly structured

  3. Different Approaches
     • Extraction Techniques
       • Tables: WebTables [WebDB’08, VLDB’08]
       • Large-scale entity extraction: Structurepedia [ongoing]
     • Applications
       • Web data integration: Octopus [VLDB’09]
       • Structure-aware Web search: Meez [ongoing]
     • Tools
       • MapReduce Optimizer: Manimal [ongoing]
     • Progress in one reinforces others

  4. Different Approaches
     • Extraction Techniques
       • Tables: WebTables [WebDB’08, VLDB’08] (with Alon Halevy, Yang Zhang, Daisy Wang, Eugene Wu)
       • Large-scale entity extraction: Structurepedia [ongoing]
     • Applications
       • Web data integration: Octopus [VLDB’09]
       • Structure-aware Web search: Meez [ongoing]
     • Tools
       • MapReduce Optimizer: Manimal [ongoing] (with Chris Re)

  5. WebTables
     • The WebTables system automatically extracts databases from a web crawl [WebDB08, “Uncovering…”, Cafarella et al] [VLDB08, “WebTables: Exploring…”, Cafarella et al]
     • An extracted relation is one table plus labeled columns
     • We estimate that our crawl of 14.1B raw HTML tables contains ~154M good relational databases
     • [Figure labels: Raw crawled pages, Raw HTML Tables, Recovered Relations, Schema Statistics, Applications]
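To make "one table plus labeled columns" concrete, here is a minimal sketch in Python using only the standard library. It is not the WebTables implementation: the `Relation` class, the `TableParser` class, and the `extract_relations` function are hypothetical names, and the rule "the first row must consist entirely of `<th>` cells" is a simplifying assumption standing in for whatever filtering the real system applies. Nested tables are not handled.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Relation:
    """One extracted relation: column labels plus data rows (hypothetical type)."""
    columns: list
    rows: list = field(default_factory=list)


class TableParser(HTMLParser):
    """Collects the rows of every <table> as lists of (tag, text) cells."""

    def __init__(self):
        super().__init__()
        self.tables = []   # each table is a list of rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # [tag, text fragments] of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("th", "td") and self._row is not None:
            self._cell = [tag, []]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[1].append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            text = "".join(self._cell[1]).strip()
            self._row.append((self._cell[0], text))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None


def extract_relations(html):
    """Keep only tables that look relational under the assumed header rule."""
    parser = TableParser()
    parser.feed(html)
    relations = []
    for table in parser.tables:
        # Assumption: a relational table has a first row made only of <th> cells.
        if len(table) < 2 or not table[0]:
            continue
        if any(tag != "th" for tag, _ in table[0]):
            continue
        columns = [text for _, text in table[0]]
        # Drop rows whose width does not match the header.
        rows = [[text for _, text in row]
                for row in table[1:] if len(row) == len(columns)]
        if rows:
            relations.append(Relation(columns, rows))
    return relations
```

Layout tables, such as a single-cell navigation table, are discarded by the header rule, which mirrors in a very crude way the gap between raw HTML tables and recovered relations.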

  6. Schema Statistics
     • Schema statistics are useful for computing attribute probabilities
       • p(“make”), p(“model”), p(“zipcode”)
       • p(“make” | “model”), p(“make” | “zipcode”)
     • Allows many applications
       • Schema “tab-complete”
       • Synonym discovery
       • Others
     • Progress in extraction technique enables new data applications
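The attribute probabilities on this slide can be estimated by counting over a corpus of extracted schemas. The Python sketch below assumes each schema is simply a list of attribute labels; the `SchemaStats` class and its methods are hypothetical names, and the ranking rule in `complete` (score each candidate by its lowest conditional probability over the attributes already given) is an illustrative heuristic, not necessarily the one used for the "tab-complete" application.

```python
from collections import Counter
from itertools import combinations


class SchemaStats:
    """Attribute occurrence and co-occurrence counts over a schema corpus."""

    def __init__(self, schemas):
        self.n = 0
        self.single = Counter()  # attribute -> number of schemas containing it
        self.pair = Counter()    # sorted (a, b) -> number of schemas with both
        for schema in schemas:
            attrs = sorted(set(schema))
            self.n += 1
            self.single.update(attrs)
            self.pair.update(combinations(attrs, 2))

    def p(self, a):
        """Estimate p(a): fraction of schemas that contain attribute a."""
        return self.single[a] / self.n if self.n else 0.0

    def p_given(self, a, b):
        """Estimate p(a | b): fraction of schemas with b that also contain a."""
        if self.single[b] == 0:
            return 0.0
        if a == b:
            return 1.0
        return self.pair[tuple(sorted((a, b)))] / self.single[b]

    def complete(self, given, k=3):
        """Suggest up to k attributes to add to a partial schema (heuristic)."""
        scored = []
        for a in self.single:
            if a in given:
                continue
            score = min(self.p_given(a, g) for g in given)
            if score > 0:
                scored.append((-score, a))
        return [a for _, a in sorted(scored)[:k]]


# Toy corpus, invented for illustration only.
corpus = [
    ["make", "model", "year"],
    ["make", "model", "price"],
    ["name", "zipcode"],
    ["make", "zipcode"],
]
stats = SchemaStats(corpus)
print(stats.p("make"), stats.p_given("make", "model"), stats.complete(["make"], k=1))
```

On this toy corpus "make" appears with "model" every time "model" occurs, so p("make" | "model") is much higher than p("make" | "zipcode"), which is the kind of signal the listed applications rely on.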

  7. Manimal (ongoing)
     • MapReduce very popular for “big data”
       • Easy for non-database programmers
       • Parallelizable, but inefficient
     • RDBMSes challenging for “big data”
       • Programming and admin relatively difficult
       • When well-used, very efficient
     • Manimal is a hybrid MapReduce/RDBMS execution system
       • Static analysis to extract code semantics
       • if(score > 5)… corresponds to a database selection
       • Extractions enable RDBMS-style optimizations
     • Progress in extraction enables new data tools
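The idea of recognizing a selection predicate by static analysis can be illustrated on a toy map function. The sketch below uses Python's standard `ast` module purely as an analogy; it says nothing about how Manimal itself analyzes MapReduce programs. The names `find_selections`, `MAP_SOURCE`, `map_fn`, and `emit` are hypothetical, and only the simplest pattern (`if <variable> <op> <constant>`) is detected.

```python
import ast

# Comparison operators mapped to SQL-style spellings.
OPS = {
    ast.Gt: ">", ast.GtE: ">=", ast.Lt: "<",
    ast.LtE: "<=", ast.Eq: "=", ast.NotEq: "<>",
}


def find_selections(source):
    """Return (field, operator, constant) for each simple `if field OP const`."""
    found = []
    # The source is only parsed, never executed.
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.If):
            continue
        test = node.test
        if (isinstance(test, ast.Compare)
                and len(test.ops) == 1
                and isinstance(test.left, ast.Name)
                and isinstance(test.comparators[0], ast.Constant)):
            op = OPS.get(type(test.ops[0]))
            if op is not None:
                found.append((test.left.id, op, test.comparators[0].value))
    return found


# A toy map function; `emit` is a placeholder and is never called here.
MAP_SOURCE = '''
def map_fn(key, score):
    if score > 5:
        emit(key, score)
'''

for field_name, op, value in find_selections(MAP_SOURCE):
    print(f"WHERE {field_name} {op} {value}")
```

A detected predicate like this is what would let an optimizer apply a database-style selection, for example by skipping records that cannot pass the test. A sound analysis would also need to show that the function produces no output when the test fails, which this sketch does not check.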
