Managing The Structured Web. Michael J. Cafarella, University of Michigan. Michigan CSE, April 23, 2010.
The Structured Web • Web pages contain structure that is obvious to humans, though not to machines • Search engines are largely blind to it • Databases need data that is perfectly structured
Different Approaches • Extraction Techniques • Tables: WebTables [WebDB’08, VLDB’08] • (w/ Alon Halevy, Yang Zhang, Daisy Wang, Eugene Wu) • Large-scale entity extraction: Structurepedia [ongoing] • Applications • Web data integration: Octopus [VLDB’09] • Structure-aware Web search: Meez [ongoing] • Tools • MapReduce Optimizer: Manimal [ongoing] • (w/ Chris Re) • Progress in one reinforces others
WebTables • WebTables system automatically extracts dbs from web crawl [WebDB08, “Uncovering…”, Cafarella et al][VLDB08, “WebTables: Exploring…”, Cafarella et al] • An extracted relation is one table plus labeled columns • Estimate that our crawl of 14.1B raw HTML tables contains ~154M good relational dbs
[Figure labels: Raw crawled pages, Raw HTML Tables, Recovered Relations, Schema Statistics, Applications]
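To make "one table plus labeled columns" concrete, here is a minimal sketch of pulling candidate relations out of raw HTML. It is not the WebTables extractor: the `TableGrabber` class, the `recover_relation` function, and the filtering heuristic (first row taken as labels, at least two columns, at least two data rows, consistent row widths) are all assumptions made for illustration. The sketch also assumes well-formed, non-nested tables.

```python
from html.parser import HTMLParser


class TableGrabber(HTMLParser):
    """Collects the cell text of every <table> in a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tables = []   # finished tables, each a list of rows
        self._rows = None  # rows of the table currently being read
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None and self._rows is not None:
            self._rows.append(self._row)
            self._row = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None


def recover_relation(rows):
    """Return (labels, tuples) if the table looks relational, else None.

    Assumed heuristic only: first row holds the column labels, there are
    at least two columns and two data rows, and all rows have equal width.
    """
    if len(rows) < 3:
        return None
    header, body = rows[0], rows[1:]
    if len(header) < 2 or any(len(r) != len(header) for r in body):
        return None
    return header, body


page = (
    "<table><tr><th>make</th><th>model</th></tr>"
    "<tr><td>Toyota</td><td>Camry</td></tr>"
    "<tr><td>Honda</td><td>Civic</td></tr></table>"
    "<table><tr><td>Home | About | Contact</td></tr></table>"
)
grabber = TableGrabber()
grabber.feed(page)
relations = [r for r in map(recover_relation, grabber.tables) if r is not None]
```

On this toy page the layout table is discarded and only the make/model table survives, which mirrors in miniature how a small fraction of raw HTML tables turns out to be relational.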
Schema Statistics • Schema stats useful for computing attribute probabilities • p(“make”), p(“model”), p(“zipcode”) • p(“make” | “model”), p(“make” | “zipcode”) • Allows many applications • Schema “tab-complete” • Synonym discovery • Others • Progress in extraction technique enables new data applications
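The attribute probabilities above can be sketched as simple counts over a corpus of extracted schemas. The toy corpus, the function names (`schema_stats`, `p`, `p_given`, `tab_complete`), and the ranking rule used for tab-complete (average conditional probability given the attributes typed so far) are assumptions for illustration, not the method from the WebTables papers.

```python
from collections import Counter
from itertools import combinations


def schema_stats(schemas):
    """Count how many schemas contain each attribute and each attribute pair."""
    single, pair = Counter(), Counter()
    for schema in schemas:
        attrs = set(schema)
        single.update(attrs)
        pair.update(frozenset(c) for c in combinations(sorted(attrs), 2))
    return len(schemas), single, pair


def p(a, stats):
    """Fraction of schemas that contain attribute a."""
    n, single, _ = stats
    return single[a] / n


def p_given(a, b, stats):
    """Fraction of schemas containing b that also contain a."""
    _, single, pair = stats
    if single[b] == 0:
        return 0.0
    if a == b:
        return 1.0
    return pair[frozenset((a, b))] / single[b]


def tab_complete(prefix, stats, k=3):
    """Suggest up to k attributes likely to co-occur with those in prefix.

    Assumed scoring: average of p(candidate | b) over each b in prefix,
    with ties broken alphabetically so the result is deterministic.
    """
    _, single, _ = stats
    candidates = [a for a in single if a not in prefix]
    score = lambda a: sum(p_given(a, b, stats) for b in prefix) / len(prefix)
    return sorted(candidates, key=lambda a: (-score(a), a))[:k]


# Toy corpus of recovered schemas (made up for this example).
corpus = [
    ["make", "model", "year"],
    ["make", "model", "price"],
    ["make", "model", "zipcode"],
    ["name", "zipcode"],
]
stats = schema_stats(corpus)
suggestions = tab_complete(["make"], stats)
```

In this corpus "model" appears in every schema that contains "make", so it is the top suggestion after typing "make"; "zipcode" co-occurs with "make" only half the time it appears.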
Manimal (ongoing) • MapReduce is very popular for “big data” • Easy for non-database programmers • Parallelizable, but inefficient • RDBMSes are challenging for “big data” • Programming and admin relatively difficult • When well-used, very efficient • Manimal is a hybrid MapReduce/RDBMS execution system • Static analysis to extract code semantics • if(score > 5)… maps to a database selection • Extractions enable RDBMS-style optimizations • Progress in extraction enables new data tools
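The idea of recognizing a selection inside map code can be sketched with Python's standard `ast` module. This is only an analogy under stated assumptions: it does not reproduce Manimal's analysis, and `MAP_SOURCE`, `emit`, and `find_selection` are hypothetical names. The sketch treats a map function whose entire body is a single `if` with no `else` as a selection, since nothing is emitted when the test fails; it does not check that the test is free of side effects, which a real analyzer would need to do.

```python
import ast

# A hypothetical map function, kept as source text because it is only
# analyzed, never executed (emit is therefore left undefined).
MAP_SOURCE = '''
def map_fn(key, record):
    if record.score > 5:
        emit(key, record.score)
'''


def find_selection(source):
    """Return the predicate guarding the whole map body, or None.

    Assumed rule: the body must be exactly one `if` statement without an
    `else` branch, so records failing the test produce no output.
    """
    func = ast.parse(source).body[0]
    if not isinstance(func, ast.FunctionDef):
        return None
    body = func.body
    if len(body) == 1 and isinstance(body[0], ast.If) and not body[0].orelse:
        return ast.unparse(body[0].test)  # requires Python 3.9+
    return None


predicate = find_selection(MAP_SOURCE)
```

Once such a predicate is known, an execution system could in principle evaluate it with an index instead of scanning every record, which is the kind of RDBMS-style optimization the slide refers to.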