The Web’s Many Models?
Michael J. Cafarella, University of Michigan
AKBC, May 19, 2010
Web Information Extraction
• Much recent research in information extractors that operate over Web pages
  • Snowball (Agichtein and Gravano, 2001)
  • TextRunner (Banko et al., 2007)
  • Yago (Suchanek et al., 2007)
  • WebTables (Cafarella et al., 2008)
  • DBPedia, ExDB, Freebase (make use of IE data)
• Web crawl + domain-independent IE should allow comprehensive Web KBs with:
  • Very high, “web-style” recall
  • “More-expressive-than-search” query processing
• But where is it?
Web Information Extraction
• Omnivore: “Extracting and Querying a Comprehensive Web Database.” Michael Cafarella. CIDR 2009. Asilomar, CA.
• Suggested remedies for data ingestion and user interaction
• This talk explains why the ideas in that paper may already be out of date, and offers alternatives
• If there are mistakes here, you have a chance to save me years of work!
Outline
• Introduction
• Data Ingestion
  • Previously: Parallel Extraction
  • Alternative: The Data-Centric Web
• User Interaction
  • Previously: Model Generation for Output
  • Alternative: Data Integration as UI
• Conclusion
Parallel Extraction
• Previous hypothesis:
  • Many data models for interesting data, e.g., relational tables, E/R graphs, etc.
  • Should build large integration infrastructure to consume many extraction streams
Database Construction (1) • Start with a single large Web crawl
Database Construction (2)
• Each of k extractors emits output that:
  • Has an extractor-dependent model
  • Has an extractor-and-Web-page-dependent schema
Database Construction (3) • For each extractor output, unfold into common entity-relation model
Database Construction (4) • Unify results
Database Construction (5) • Emit final database
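The five construction steps above can be read as a single pipeline. Below is a minimal sketch, assuming a hypothetical extractor interface (extract, to_entity_relation) and a deliberately naive unification step; the slides describe the stages but not the implementation.

```python
# Hypothetical sketch of the five-step construction pipeline above.
# The extractor interface and the name-based unify heuristic are
# illustrative assumptions, not the actual Omnivore implementation.
from collections import defaultdict

def build_database(crawl_pages, extractors):
    """crawl_pages: list of (url, html); extractors: the k extractors."""
    er_facts = []                                     # (subject, predicate, object) triples
    for url, html in crawl_pages:                     # (1) single large Web crawl
        for ex in extractors:                         # (2) each extractor emits output with its
            for record in ex.extract(url, html):      #     own model and page-dependent schema
                er_facts.extend(ex.to_entity_relation(record))  # (3) unfold to common E/R model

    unified = defaultdict(set)                        # (4) unify: naively merge facts about
    for subj, pred, obj in er_facts:                  #     the same normalized entity name
        unified[subj.strip().lower()].add((pred, obj))

    return dict(unified)                              # (5) emit final database
```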
Potential Problems
• Pressing problems:
  • Recall
  • Simple intra-source reconciliation
  • Time
• Tables, entities probably OK for now
  • Many data sources (DBPedia, Facebook, IMDB) already match one of these two pretty well
• One possible different direction: the Data-Centric Web
  • Addresses recall only
Data-Centric Lists
• Lists of Data-Centric Entities give hints:
  • About what the target entity contains
  • That all members of the set are DCEs, or not
  • That members of the set belong to a class or type (e.g., program committee members)
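One way such hints could be used, sketched below: if most members of a detected list already look like Data-Centric Entities, relabel the remaining members as DCEs too. The majority-vote rule and the 0.7 threshold are assumptions for illustration; the slides do not specify the mechanism.

```python
# Illustrative list-based fix-up: a majority vote within a detected
# Data-Centric List overrides low-confidence per-page decisions.
# The 0.7 threshold is an arbitrary assumption.
def fix_labels_with_lists(page_is_dce, lists):
    """page_is_dce: dict url -> bool from the per-page classifier.
       lists: iterable of member-URL lists (detected Data-Centric Lists)."""
    fixed = dict(page_is_dce)
    for members in lists:
        votes = [page_is_dce.get(u, False) for u in members]
        if votes and sum(votes) / len(votes) >= 0.7:   # most members look like DCEs...
            for u in members:
                fixed[u] = True                        # ...so relabel the stragglers
    return fixed
```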
Build the Data-Centric Web
• Download the Web
• Train classifiers to detect DCEs, DCLs
• Filter out all pages that fail both tests
• Use lists to fix up incorrect Data-Centric Entity classifications
• Run attr/val extractors on DCEs
• Yields E/R dataset, for insertion into DBPedia, YAGO, etc.
• In progress now… with student Ashwin Balakrishnan; entity detector >95% accuracy
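A hedged end-to-end sketch of the pipeline above, reusing fix_labels_with_lists from the previous sketch. The classifier objects, the attr/val extractor interface, and the extract_member_urls helper are hypothetical placeholders; the real system is not described beyond these bullets.

```python
# Hypothetical outline of the Data-Centric Web pipeline; classifier and
# extractor internals, and extract_member_urls, are placeholder assumptions.
import re

def extract_member_urls(html):
    # Naive placeholder: pull hrefs out of a list page's HTML
    return re.findall(r'href="([^"]+)"', html)

def build_data_centric_web(pages, dce_classifier, dcl_classifier, attr_val_extractor):
    """pages: list of (url, html) from the crawl."""
    is_dce = {url: dce_classifier.predict(html) for url, html in pages}
    is_dcl = {url: dcl_classifier.predict(html) for url, html in pages}

    # Filter out all pages that fail both tests
    kept = {url: html for url, html in pages if is_dce[url] or is_dcl[url]}

    # Use detected lists to repair entity classifications (previous sketch)
    detected_lists = [extract_member_urls(html)
                      for url, html in kept.items() if is_dcl[url]]
    is_dce = fix_labels_with_lists(is_dce, detected_lists)

    # Run attribute/value extractors on the Data-Centric Entity pages
    er_dataset = []
    for url, html in kept.items():
        if is_dce.get(url):
            for attr, val in attr_val_extractor.extract(html):
                er_dataset.append((url, attr, val))   # (entity, attribute, value)
    return er_dataset
```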
Research Question 1
• How many useful entities…
  • Lack a page in the Data-Centric Web? (That means no homepage, no Amazon page, no public Facebook page, etc.)
  • AND are otherwise well-described enough online that IE can recover an entity-centric view?
• Put differently: does every entity worth extracting already have a homepage on the Web?
Research Question 2
• Does a single real-world entity have more than one “authoritative” URL?
• Note that Wikipedia provides pretty minimal assistance in choosing the right entity, but does a good job
Outline
• Introduction
• Data Ingestion
  • Previously: Parallel Extraction
  • Alternative: The Data-Centric Web
• User Interaction
  • Previously: Model Generation for Output
  • Alternative: Data Integration as UI
• Conclusion
Model Generation for Output
• Previous hypothesis:
  • Many different user applications built against a single back-end database
  • The difficult task is translating from the back-end data model to the application’s data model
Query Processing (1) • Query arrives at system
Query Processing (2) • Entity-relation database processor yields entity results
Query Processing (3) • Query Renderer chooses appropriate output schema
Query Processing (4) • User corrections are logged and fed into later iterations of db construction
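A minimal sketch of the four query-processing stages above. The database, renderer, and correction-log interfaces are assumptions made for illustration; the slides only name the stages.

```python
# Hypothetical query-processing loop for the four stages above; the
# component interfaces are illustrative assumptions.
def answer_query(query, er_database, renderer):
    entities = er_database.lookup(query)               # (2) E/R processor yields entity results
    schema = renderer.choose_schema(query, entities)   # (3) renderer picks an output schema
    return renderer.render(entities, schema)

def record_correction(correction_log, query, correction):
    # (4) user corrections are logged and fed into later db-construction runs
    correction_log.append({"query": query, "correction": correction})
```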
Potential Problems
• Many plausible front-end applications, none yet totally compelling and novel
  • Ad- and search-driven ones not novel
  • Freebase, Wolfram Alpha not compelling
  • Raw input to learners: useful, not an end-user application
• Need to explore possible applications rather than build multi-app infrastructure
• One possible different direction: data integration as user primitive
Data Integration as UI
• Can we combine tables to create new data sources?
• Many existing “mashup” tools, which ignore realities of Web data:
  • A lot of useful data is not in XML
  • User cannot know all sources in advance
  • Transient integrations
  • Dirty data
Interaction Challenge • Try to create a database of all “VLDB program committee members”
Octopus
• Provides “workbench” of data integration operators to build target database
• Most operators are not correct/incorrect, but high/low quality (like search)
• Also, prosaic traditional operators
• Originally ran on WebTable data
• [VLDB 2009, Cafarella, Khoussainova, Halevy]
Walkthrough - Operator #1 • SEARCH(“VLDB program committee members”)
Walkthrough - Operator #2 • Recover relevant data with CONTEXT()
Walkthrough - Union • Combine datasets with Union()
Walkthrough - Operator #3
• Add column to data: EXTEND(“publications”, col=0)
• Similar to “join”, but the join target is a topic (“publications”)
• User has integrated data sources with little effort
• No wrappers; data was never intended for reuse
• (Full operator sequence sketched below)
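Putting the walkthrough together: a sketch of the operator sequence as a short script. The octopus module and these exact signatures are hypothetical stand-ins; the real operators are defined in the VLDB 2009 Octopus paper.

```python
# Illustrative Octopus-style session for the walkthrough above; the
# `octopus` module and its call signatures are hypothetical.
import octopus

# 1. SEARCH: rank extracted Web tables relevant to the query
candidates = octopus.search("VLDB program committee members")

# 2. CONTEXT: recover values from each table's source page (e.g., the year)
#    and attach them as an extra column
contextualized = [octopus.context(t) for t in candidates[:2]]

# 3. UNION: combine the per-source member tables into one dataset
members = octopus.union(contextualized)

# 4. EXTEND: add a "publications" column, joining on the name column (col=0)
members_with_pubs = octopus.extend(members, topic="publications", col=0)

for row in members_with_pubs:
    print(row)
```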
CONTEXT Algorithms
• Input: table and source page
• Output: data values to add to table
• SignificantTerms sorts terms in source page by “importance” (tf-idf)
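A minimal sketch of the SignificantTerms idea: score each term on the source page by tf-idf and return the top-scoring ones as candidate values for the new column. The tokenizer, the document-frequency table, and the top-k cutoff are assumptions; only the tf-idf ranking comes from the slide.

```python
# Sketch of SignificantTerms: rank source-page terms by tf-idf.
# The document-frequency input and the top-k cutoff are assumptions.
import math
import re
from collections import Counter

def significant_terms(source_page_text, doc_freq, num_docs, k=5):
    """doc_freq: term -> number of corpus documents containing the term."""
    terms = re.findall(r"[a-z0-9]+", source_page_text.lower())
    tf = Counter(terms)

    def tfidf(term):
        idf = math.log(num_docs / (1 + doc_freq.get(term, 0)))
        return tf[term] * idf

    return sorted(tf, key=tfidf, reverse=True)[:k]    # most "important" terms first
```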
Related View Partners • Looks for different “views” of same data
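One plausible reading of this step, sketched below: tables drawn from related pages that overlap heavily with the input table's cell values are treated as alternate views, and the values they add become candidate context. The overlap measure and threshold are assumptions; the precise algorithm is given in the Octopus paper.

```python
# Hypothetical related-view matcher: candidate tables that share most of
# their cells with the input table are treated as other "views" of it.
def related_view_partners(table, candidate_tables, min_overlap=0.5):
    """Tables are lists of rows; each row is a list of string cells."""
    base = {cell for row in table for cell in row}
    partners = []
    for cand in candidate_tables:
        cells = {cell for row in cand for cell in row}
        overlap = len(base & cells) / max(1, len(base))
        if overlap >= min_overlap:
            extra = cells - base           # values the partner adds, e.g., a year
            partners.append((cand, extra))
    return partners
```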
Data Integration as UI • Compelling for db researchers, but will large numbers of people use it?
Conclusion
• Automatic Web KBs rapidly progressing
  • Recall still not good enough for many tasks, but progress is rapid
• Not clear what those tasks should be, and progress there is much slower
  • Difficult to predict what’s useful
  • Sometimes difficult to write a “new app” paper
• Omnivore’s approach not wrong, but did not directly address these problems