Explore various approaches, tools, and applications for managing structured web data, including extraction techniques, large-scale entity extraction, schema statistics, and more, with a focus on improving data integration and search efficiency.
Managing The Structured Web Michael J. Cafarella University of Michigan Michigan CSE April 23, 2010
The Structured Web • Web pages contain structure that is obvious to humans, but not to machines • Search engines are largely blind to it • Databases need data that is perfectly structured
Different Approaches • Extraction Techniques • Tables: WebTables [WebDB’08, VLDB’08] • Large-scale entity extraction: Structurepedia [ongoing] • Applications • Web data integration: Octopus [VLDB’09] • Structure-aware Web search: Meez [ongoing] • Tools • MapReduce Optimizer: Manimal [ongoing] • Progress in one reinforces others
Different Approaches • Extraction Techniques • Tables: WebTables [WebDB’08, VLDB’08] • (w/ Alon Halevy, Yang Zhang, Daisy Wang, Eugene Wu) • Large-scale entity extraction: Structurepedia [ongoing] • Applications • Web data integration: Octopus [VLDB’09] • Structure-aware Web search: Meez [ongoing] • Tools • MapReduce Optimizer: Manimal [ongoing] • (w/ Chris Re)
WebTables • WebTables system automatically extracts dbs from web crawl [WebDB08, “Uncovering…”, Cafarella et al][VLDB08, “WebTables: Exploring…”, Cafarella et al] • An extracted relation is one table plus labeled columns • Estimate that our crawl of 14.1B raw HTML tables contains ~154M good relational dbs [Figure labels: Raw crawled pages, Raw HTML Tables, Recovered Relations, Schema Statistics, Applications]
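To make "one table plus labeled columns" concrete, here is a minimal sketch of recovering relations from raw HTML. It is not the WebTables method: the slide does not say how good relations are identified, so the filter used here (first row made of `<th>` cells, every data row the same width) is an assumption, as are the names `TableExtractor` and `extract_relations`. Nested tables are not handled.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect each <table> as (labels, rows). Hypothetical helper."""

    def __init__(self):
        super().__init__()
        self.relations = []       # accepted (labels, rows) pairs
        self._rows = None         # rows of the current table: (is_header, cells)
        self._row = None          # cells of the current row
        self._cell = None         # text fragments of the current cell
        self._row_has_th = False  # True if the current row contains a <th>

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []
        elif tag == "tr" and self._rows is not None:
            self._row, self._row_has_th = [], False
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []
            if tag == "th":
                self._row_has_th = True

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._rows.append((self._row_has_th, self._row))
            self._row = None
        elif tag == "table" and self._rows is not None:
            self._accept_if_relational(self._rows)
            self._rows = None

    def _accept_if_relational(self, rows):
        # Assumed filter: a header row of <th> cells plus at least one
        # data row, with every data row as wide as the header.
        if len(rows) < 2 or not rows[0][0]:
            return
        labels = rows[0][1]
        data = [cells for _, cells in rows[1:]]
        if all(len(cells) == len(labels) for cells in data):
            self.relations.append((labels, data))


def extract_relations(html):
    """Return the list of (labels, rows) relations found in an HTML string."""
    parser = TableExtractor()
    parser.feed(html)
    return parser.relations
```

On a page with one data table and one single-cell layout table, only the data table would be returned, which mirrors the gap the slide reports between raw HTML tables and good relations.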
Schema Statistics • Schema stats useful for computing attribute probabilities • p(“make”), p(“model”), p(“zipcode”) • p(“make” | “model”), p(“make” | “zipcode”) • Allows many applications • Schema “tab-complete” • Synonym discovery • Others • Progress in extraction technique enables new data applications
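The attribute probabilities above can be sketched as simple counts over a corpus of schemas. This is a minimal illustration, assuming that p(a) is the fraction of schemas containing attribute a and that p(a | b) is the fraction of schemas containing b that also contain a; the class name `SchemaStats`, its methods, and the toy corpus are hypothetical and not taken from the talk.

```python
from collections import Counter
from itertools import permutations


class SchemaStats:
    """Attribute statistics over a corpus of schemas. Hypothetical helper."""

    def __init__(self, schemas):
        self.n = len(schemas)
        self.single = Counter()  # number of schemas containing a
        self.pair = Counter()    # number of schemas containing both a and b
        for schema in schemas:
            attrs = set(schema)  # count each attribute once per schema
            self.single.update(attrs)
            self.pair.update(permutations(attrs, 2))

    def p(self, a):
        """Estimate p(a): fraction of schemas that contain attribute a."""
        return self.single[a] / self.n if self.n else 0.0

    def p_given(self, a, b):
        """Estimate p(a | b): fraction of schemas with b that also have a."""
        return self.pair[(a, b)] / self.single[b] if self.single[b] else 0.0

    def suggest(self, given, k=3):
        """Toy schema "tab-complete": rank attributes by p(a | given)."""
        candidates = [a for a in self.single
                      if a != given and self.pair[(a, given)]]
        ranked = sorted(candidates, key=lambda a: (-self.p_given(a, given), a))
        return ranked[:k]


# Toy corpus, made up for illustration only.
stats = SchemaStats([
    ["make", "model", "year"],
    ["make", "model", "price"],
    ["name", "zipcode"],
    ["make", "zipcode"],
])
```

With this corpus, `stats.p_given("make", "model")` is higher than `stats.p_given("make", "zipcode")`, which is the kind of signal a schema auto-complete could use. Conditioning on a single attribute keeps the sketch short; a real tool would likely condition on all attributes typed so far.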
Manimal (ongoing) • MapReduce very popular for “big data” • Easy for non-database programmers • Parallelizable, but inefficient • RDBMSes challenging for “big data” • Programming and admin relatively difficult • When well-used, very efficient • Manimal is a hybrid MapReduce/RDBMS execution system • Static analysis to extract code semantics • if(score > 5)… → database selection • Extractions enable RDBMS-style optimizations • Progress in extraction enables new data tools
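The idea of reading a selection out of user code can be illustrated with a toy static analysis. This is only a sketch, not Manimal's analyzer: the slide does not describe how Manimal inspects programs, so the example assumes the map function is written in Python and uses the standard `ast` module (with `ast.unparse`, available from Python 3.9). The function name `find_selection_predicates` and the sample `map_fn` are hypothetical.

```python
import ast


def find_selection_predicates(source):
    """Return the text of simple `if` tests of the form <field> <op> <constant>.

    Only tests with a single comparison between a name (or attribute)
    and a literal constant are reported; everything else is ignored.
    """
    predicates = []
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.If) and isinstance(node.test, ast.Compare)):
            continue
        test = node.test
        if (len(test.ops) == 1
                and isinstance(test.left, (ast.Name, ast.Attribute))
                and isinstance(test.comparators[0], ast.Constant)):
            predicates.append(ast.unparse(test))
    return predicates


# A made-up map function; the source is parsed, never executed.
MAP_SOURCE = """
def map_fn(key, record):
    if record.score > 5:
        emit(key, record)
"""

predicates = find_selection_predicates(MAP_SOURCE)
```

A predicate found this way is what an RDBMS would treat as a selection, so a hybrid system could in principle use an index or a pre-filter instead of scanning every record. Proving that skipping records is safe (for example, that the map function has no side effects outside the `if`) is the hard part and is not attempted here.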