Managing The Structured Web

Michael J. Cafarella
University of Michigan
Michigan CSE
April 23, 2010

The Structured Web
• Web pages contain structure that is obvious to humans, though not machines
• Search engines are largely blind to it
• Databases need data that is perfectly structured

Different Approaches
• Extraction Techniques
  • Tables: WebTables [WebDB’08, VLDB’08]
  • Large-scale entity extraction: Structurepedia [ongoing]
• Applications
  • Web data integration: Octopus [VLDB’09]
  • Structure-aware Web search: Meez [ongoing]
• Tools
  • MapReduce Optimizer: Manimal [ongoing]
• Progress in one reinforces others

Different Approaches
• Extraction Techniques
  • Tables: WebTables [WebDB’08, VLDB’08] (w/ Alon Halevy, Yang Zhang, Daisy Wang, Eugene Wu)
  • Large-scale entity extraction: Structurepedia [ongoing]
• Applications
  • Web data integration: Octopus [VLDB’09]
  • Structure-aware Web search: Meez [ongoing]
• Tools
  • MapReduce Optimizer: Manimal [ongoing] (w/ Chris Re)

WebTables
• WebTables system automatically extracts dbs from web crawl [WebDB08, “Uncovering…”, Cafarella et al] [VLDB08, “WebTables: Exploring…”, Cafarella et al]
• An extracted relation is one table plus labeled columns
• Estimate that our crawl of 14.1B raw HTML tables contains ~154M good relational dbs
[Figure labels: Raw crawled pages, Raw HTML Tables, Recovered Relations, Schema Statistics, Applications]

Schema Statistics
• Schema stats useful for computing attribute probabilities
  • p(“make”), p(“model”), p(“zipcode”)
  • p(“make” | “model”), p(“make” | “zipcode”)
• Allows many applications
  • Schema “tab-complete”
  • Synonym discovery
  • Others
• Progress in extraction technique enables new data applications
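
The attribute probabilities above can be illustrated with a small sketch. It assumes a toy corpus of four schemas (invented here, not taken from the WebTables crawl) and estimates p(a) and p(a | b) by simple counting over schemas. The function names `schema_stats`, `p`, `p_given`, and `tab_complete` are hypothetical, and the ranking used for "tab-complete" is one plausible reading of the idea, not necessarily the method used in WebTables.

```python
from collections import Counter
from itertools import permutations

# Hypothetical toy corpus: each entry is the set of column labels
# of one extracted relation. Real statistics would come from the crawl.
SCHEMAS = [
    {"make", "model", "year"},
    {"make", "model", "price"},
    {"name", "zipcode"},
    {"make", "zipcode"},
]

def schema_stats(schemas):
    """Count how many schemas contain each attribute and each ordered attribute pair."""
    attr_counts = Counter()
    pair_counts = Counter()
    for schema in schemas:
        attr_counts.update(schema)
        pair_counts.update(permutations(schema, 2))
    return attr_counts, pair_counts, len(schemas)

def p(attr, stats):
    """Estimate p(attr): fraction of schemas that contain attr."""
    attr_counts, _, total = stats
    return attr_counts[attr] / total

def p_given(a, b, stats):
    """Estimate p(a | b): fraction of schemas containing b that also contain a."""
    attr_counts, pair_counts, _ = stats
    if attr_counts[b] == 0:
        return 0.0
    return pair_counts[(a, b)] / attr_counts[b]

def tab_complete(seen, stats):
    """Suggest attributes that co-occur with `seen`, ranked by p(candidate | seen)."""
    attr_counts, _, _ = stats
    scored = [(p_given(c, seen, stats), c) for c in attr_counts if c != seen]
    ranked = sorted((s, c) for s, c in scored if s > 0)
    # Highest probability first; ties broken alphabetically.
    return [c for s, c in sorted(ranked, key=lambda sc: (-sc[0], sc[1]))]

stats = schema_stats(SCHEMAS)
print(p("make", stats))
print(p_given("make", "model", stats))
print(tab_complete("model", stats))
```

On this toy corpus, "make" always appears alongside "model", so it ranks first among the suggestions for a schema that already has a "model" column.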

Manimal (ongoing)
• MapReduce very popular for “big data”
  • Easy for non-database programmers
  • Parallelizable, but inefficient
• RDBMSes challenging for “big data”
  • Programming and admin relatively difficult
  • When well-used, very efficient
• Manimal is hybrid MapReduce/RDBMS execution system
  • Static analysis to extract code semantics
  • e.g., if(score > 5)… corresponds to a database selection
  • Extractions enable RDBMS-style optimizations
• Progress in extraction enables new data tools
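
The static-analysis idea can be sketched in a few lines. The slides do not describe Manimal's analyzer, so what follows is only an illustration of the general technique: it parses a hypothetical map function with Python's standard `ast` module and reports `if` tests that compare a value against a constant, which are the kind of predicate a database could evaluate as a selection. `MAP_SOURCE`, `map_fn`, `emit`, and `find_selection_predicates` are all invented for this example.

```python
import ast

# Hypothetical user-written map function, held as source text so it can be
# analyzed without being executed.
MAP_SOURCE = '''
def map_fn(key, score):
    if score > 5:
        emit(key, score)
'''

def find_selection_predicates(source):
    """Return the text of each `if` test that compares against constants only.

    Such tests behave like a WHERE clause: records that fail them produce no
    output, so an optimizer could filter them out before the map step runs.
    """
    tree = ast.parse(source)
    predicates = []
    for node in ast.walk(tree):
        if isinstance(node, ast.If) and isinstance(node.test, ast.Compare):
            if all(isinstance(c, ast.Constant) for c in node.test.comparators):
                predicates.append(ast.unparse(node.test))  # needs Python 3.9+
    return predicates

print(find_selection_predicates(MAP_SOURCE))
```

A real optimizer would also have to confirm that the guarded branch is the only path that emits output before treating the test as a selection; this sketch skips that check.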
