The Web’s Many Models • Michael J. Cafarella, University of Michigan • AKBC, May 19, 2010
Web Information Extraction • Much recent research on information extractors that operate over Web pages • Snowball (Agichtein and Gravano, 2001) • TextRunner (Banko et al., 2007) • Yago (Suchanek et al., 2007) • WebTables (Cafarella et al., 2008) • DBPedia, ExDB, and Freebase also make use of IE data • A Web crawl plus domain-independent IE should allow comprehensive Web KBs with: • Very high, “web-style” recall • “More-expressive-than-search” query processing • But where is it?
Web Information Extraction • Omnivore • “Extracting and Querying a Comprehensive Web Database.” Michael Cafarella. CIDR 2009. Asilomar, CA. • Suggested remedies for data ingestion and user interaction • This talk explains why the ideas in that paper might already be out of date, and gives alternative ideas • If there are mistakes here, then you have a chance to save me years of work!
Outline • Introduction • Data Ingestion • Previously: Parallel Extraction • Alternative: The Data-Centric Web • User Interaction • Previously: Model Generation for Output • Alternative: Data Integration as UI • Conclusion
Parallel Extraction • Previous hypothesis • Interesting data comes in many data models, e.g., relational tables and E/R graphs • Should build a large integration infrastructure to consume many extraction streams
Database Construction (1) • Start with a single large Web crawl
Database Construction (2) • Each of k extractors emits output that: • Has an extractor-dependent model • Has an extractor-and-Web-page-dependent schema
Database Construction (3) • For each extractor output, unfold into common entity-relation model
Database Construction (4) • Unify results
Database Construction (5) • Emit final database
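The five construction steps above amount to a simple pipeline: one crawl fans out to k extractors, each extractor-specific output is unfolded into a common entity-relation model, and the results are unified into one database. The sketch below is a minimal illustration of that flow; the Fact triple, EntityRelationDB, and build_database names are hypothetical stand-ins, not Omnivore’s actual interfaces, and the unification step is deliberately naive.

```python
# A minimal sketch of the construction pipeline, assuming a toy Fact triple
# as the common entity-relation model. All names here are hypothetical
# illustrations, not Omnivore's actual interfaces.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Fact:
    """An (entity, attribute, value) triple in the common E/R model."""
    entity: str
    attribute: str
    value: str

@dataclass
class EntityRelationDB:
    facts: set = field(default_factory=set)

    def unify(self, new_facts):
        # Naive unification: deduplicate identical triples. A real system
        # would also reconcile entity names and conflicting values.
        self.facts.update(new_facts)

def build_database(crawl_pages, extractors, unfolders):
    """Run k extractors over a single crawl and merge their outputs.

    extractors: name -> (page -> extractor-specific output, with its own
                model and page-dependent schema)
    unfolders:  name -> (that output -> iterable of Fact triples), i.e. the
                "unfold into a common entity-relation model" step
    """
    db = EntityRelationDB()
    for page in crawl_pages:                      # (1) single large Web crawl
        for name, extract in extractors.items():  # (2) k extractors
            raw = extract(page)
            db.unify(unfolders[name](raw))        # (3) unfold, (4) unify
    return db                                     # (5) emit final database
```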
Potential Problems • Pressing problems: • Recall • Simple intra-source reconciliation • Time • Two data models (tables and entities) are probably enough for now • Many data sources (DBPedia, Facebook, IMDB) already match one of these two pretty well • One possible different direction: the Data-Centric Web • Addresses recall only
Data-Centric Lists • Lists of Data-Centric Entities give hints: • About what the target entity contains • About whether all members of the set are DCEs, or not • That members of the set belong to a class or type (e.g., program committee members)
Build the Data-Centric Web • Download the Web • Train classifiers to detect DCEs and DCLs • Filter out all pages that fail both tests • Use lists to fix up incorrect Data-Centric Entity classifications • Run attribute/value extractors on DCEs • Yields an E/R dataset for insertion into DBPedia, YAGO, etc. • In progress now… with student Ashwin Balakrishnan; entity detector is >95% accurate
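A minimal sketch of this pipeline follows. The classifiers, link function, attribute/value extractor, and the list-based vote threshold are all simplified placeholders for illustration, not the project’s actual components.

```python
# A minimal sketch of the Data-Centric Web pipeline above; every component
# passed in is a placeholder, and the vote threshold is an assumption.
from collections import Counter

def build_data_centric_web(pages, is_dce, is_dcl, page_links,
                           extract_attr_vals, min_votes=3):
    """
    pages:             dict of url -> page text (the downloaded Web)
    is_dce, is_dcl:    text -> bool classifiers for Data-Centric Entities/Lists
    page_links:        url -> list of outgoing link urls
    extract_attr_vals: text -> dict of attribute/value pairs
    """
    dce, dcl = set(), set()
    for url, text in pages.items():
        if is_dce(text):
            dce.add(url)
        elif is_dcl(text):
            dcl.add(url)
        # pages that fail both tests are filtered out

    # Use lists to fix up incorrect DCE classifications: a page linked from
    # several Data-Centric Lists is probably an entity page the classifier missed.
    votes = Counter(link for url in dcl for link in page_links(url))
    dce |= {url for url, n in votes.items() if n >= min_votes and url in pages}

    # Run attribute/value extractors on the DCEs, yielding an E/R dataset
    # suitable for insertion into DBPedia, YAGO, etc.
    return {url: extract_attr_vals(pages[url]) for url in dce}
```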
Research Question 1 • How many useful entities… • Lack a page in the Data-Centric Web? • (That means no homepage, no Amazon page, no public Facebook page, etc.) • AND are otherwise well-described enough online that IE can recover an entity-centric view? • Put differently: • Does every entity worth extracting already have a homepage on the Web?
Research Question 2 • Does a single real-world entity have more than one “authoritative” URL? • Note that Wikipedia provides pretty minimal assistance in choosing the right entity, but does a good job
Outline • Introduction • Data Ingestion • Previously: Parallel Extraction • Alternative: The Data-Centric Web • User Interaction • Previously: Model Generation for Output • Alternative: Data Integration as UI • Conclusion
Model Generation for Output • Previous hypothesis • Many different user applications built against a single back-end database • The difficult task is translating from the back-end data model to each application’s data model
Query Processing (1) • Query arrives at system
Query Processing (2) • Entity-relation database processor yields entity results
Query Processing (3) • Query Renderer chooses appropriate output schema
Query Processing (4) • User corrections are logged and fed into later iterations of db construction
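Taken together, query-processing steps (1) through (4) look roughly like the loop below. The entity-relation lookup, schema chooser, and correction log are hypothetical interfaces, sketched only to make the flow concrete.

```python
# A minimal sketch of query-processing steps (1)-(4); all interfaces here
# are hypothetical illustrations, not the system's actual API.
correction_log = []   # fed into later iterations of database construction

def answer_query(query, er_lookup, choose_output_schema):
    """
    er_lookup:            query -> list of entities (dicts of attribute/value)
    choose_output_schema: (query, entities) -> list of column names
    """
    entities = er_lookup(query)                      # (1)-(2) entity results
    schema = choose_output_schema(query, entities)   # (3) rendered output schema
    rows = [[e.get(col) for col in schema] for e in entities]
    return schema, rows

def record_correction(query, row, column, corrected_value):
    # (4) user corrections are logged and fed back into db construction
    correction_log.append((query, row, column, corrected_value))
```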
Potential Problems • Many plausible front-end applications, but none yet totally compelling and novel • Ad- and search-driven ones are not novel • Freebase and Wolfram Alpha are not compelling • Raw input to learners: useful, but not an end-user application • Need to explore possible applications rather than build multi-app infrastructure • One possible different direction: data integration as a user primitive
Data Integration as UI • Can we combine tables to create new data sources? • Many existing “mashup” tools ignore the realities of Web data: • A lot of useful data is not in XML • Users cannot know all sources in advance • Integrations are transient • Data is dirty
Interaction Challenge • Try to create a database of all “VLDB program committee members”
Octopus • Provides “workbench” of data integration operators to build target database • Most operators are not correct/incorrect, but high/low quality (like search) • Also, prosaic traditional operators • Originally ran on WebTable data • [VLDB 2009, Cafarella, Khoussainova, Halevy]
Walkthrough - Operator #1 • SEARCH(“VLDB program committee members”)
Walkthrough - Operator #2 • Recover relevant data with CONTEXT()
Walkthrough - Union • Combine datasets with Union()
Walkthrough - Operator #3 • Add a column to the data • Similar to a “join”, but the join target is a topic, e.g., “publications”: EXTEND(“publications”, col=0) • The user has integrated data sources with little effort • No wrappers; the data was never intended for reuse
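The walkthrough can be read as a short chain of Octopus operator calls. SEARCH, CONTEXT, and EXTEND are the operators described in the VLDB 2009 paper, combined here with a traditional union; the function signatures in this sketch are hypothetical stand-ins, not the system’s real API.

```python
# An illustrative rendering of the walkthrough as Octopus-style operator
# calls; the signatures are hypothetical, shown only to make the chaining
# of SEARCH, CONTEXT, UNION, and EXTEND concrete.
def build_pc_database(search, context, union, extend):
    # Operator #1: rank extracted Web tables by relevance to the query.
    candidates = search("VLDB program committee members")

    # Operator #2: recover relevant data from each table's source page
    # (e.g., the conference year) and attach it as an extra column.
    tables = [context(t) for t in candidates[:2]]

    # Union: combine the per-year PC tables into a single dataset.
    pc_members = union(tables)

    # Operator #3: like a join, but the join target is a topic; add a
    # "publications" column keyed on the member name in column 0.
    return extend(pc_members, topic="publications", col=0)
```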
CONTEXT Algorithms • Input: table and source page • Output: data values to add to the table • SignificantTerms sorts terms in the source page by “importance” (tf-idf)
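A minimal sketch of the SignificantTerms idea: rank terms on the table’s source page by tf-idf and propose the top-scoring ones as candidate values for the new column. The tokenizer and corpus statistics below are simplified assumptions, not the paper’s exact formulation.

```python
# A minimal sketch of SignificantTerms: rank terms in the source page by
# tf-idf; tokenization and corpus statistics are simplified assumptions.
import math
import re
from collections import Counter

def significant_terms(source_page_text, doc_freq, num_docs, k=5):
    """doc_freq: term -> number of corpus documents containing the term."""
    terms = re.findall(r"[a-z0-9]+", source_page_text.lower())
    tf = Counter(terms)

    def tf_idf(term):
        idf = math.log(num_docs / (1 + doc_freq.get(term, 0)))
        return tf[term] * idf

    # High tf-idf terms (e.g., a year on a PC listing page) become
    # candidate values for the column that CONTEXT adds to the table.
    return sorted(tf, key=tf_idf, reverse=True)[:k]
```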
Related View Partners • Looks for different “views” of the same data
Data Integration as UI • Compelling for db researchers, but will large numbers of people use it?
Conclusion • Automatic Web KBs are rapidly progressing • Recall is still not good enough for many tasks, but progress there is rapid • It is not clear what those tasks should be, and on that question progress is much slower • Difficult to predict what’s useful • Sometimes difficult to write a “new app” paper • Omnivore’s approach was not wrong, but it did not directly address these problems