
Building Data Integration Systems for the Web

Explore the importance of data integration on the web, including issues with structured data, searching and managing tables, and the future of integrating data on the internet.


Presentation Transcript


  1. Building Data Integration Systems for the Web • Alon Halevy, Google • NSF Information Integration Workshop, April 22, 2010

  2. Without (too much) Loss of Generality • Web • Enterprise, Science projects, … • Information integration ≅ data management

  3. A Few Principles • Data management “in situ” • Data meaning is derived from its context • Manipulate data in its natural location • Pay-as-you-go data management • Provide services before modeling is done • Data can be about any domain • Collaboration should be built in • Query answering is only the first step

  4. Alex Labrinidis @via Facebook

  5. Structured Data & The Web

  6. • Discover: Hard to find structured data via search engines • Publish: Requires infrastructure, concerns about losing control • Extract: Data is embedded in web pages, behind forms • Manage, Analyze, Combine: Hard to query, visualize, combine data across organizations

  7. Outline • Surfacing the Deep Web • Searching tables on the surface Web • Fusion Tables: a platform for data management on the Web.

  8. What is the Deep Web? • Deep = not accessible through general-purpose search engines • Major gap in the coverage of search engines • Examples: used cars, store locations, recipes, patents, radio stations

  9. Tree Search • Amish quilts • Parking tickets in India • Horses

  10. Solution Constraints • Can’t design a solution that requires domain engineering • (unless you can make money in that domain!) • Boundaries between domains are fuzzy • Solution needs to be integrated into general web search • Can’t assume special query syntax

  11. Surfacing the Deep Web [Madhavan et al., VLDB 2008] • Surfacing: • Find high-quality forms • Guess good queries to submit • Put the resulting HTML pages in the index • ~3M sites, 50 languages, 700 domains • 1000 queries per second get results from the deep web • 400K forms served per day, 800K per week • Impact mostly on the long and heavy tail of queries
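
The "guess good queries to submit" step on the slide above can be illustrated with a small sketch. This is not the method of Madhavan et al.; it simply enumerates combinations of candidate values for a form's inputs and builds the GET URLs a crawler could fetch and index. The function name `surfacing_urls`, the form URL, and the input names and values are all hypothetical.

```python
from itertools import product
from urllib.parse import urlencode


def surfacing_urls(action_url, inputs, max_urls=100):
    """Enumerate GET URLs for a form by combining candidate input values.

    `inputs` maps each (hypothetical) form field name to a list of
    candidate values. `max_urls` caps the output so that forms with many
    fields do not explode combinatorially.
    """
    names = sorted(inputs)
    urls = []
    for combo in product(*(inputs[name] for name in names)):
        if len(urls) >= max_urls:
            break
        query = urlencode(dict(zip(names, combo)))
        urls.append(action_url + "?" + query)
    return urls


# Made-up form for a used-car site (one of the deep-web examples in the talk).
form_inputs = {"make": ["honda", "ford"], "zip": ["94043", "15213"]}
urls = surfacing_urls("http://example.com/search", form_inputs)
```

A real system would need to choose which inputs and values are worth combining, since most combinations return empty or duplicate pages; the cap here only stands in for that selection.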

  12. Deep Web: The Future • Still an opportunity to go deeper into the deep web: • E.g., map the user query into a form submission. • Key challenge: given a keyword query, map it to forms in any domain • Understanding the meaning of forms is still hard (e.g. content, geo constraints).

  13. Outline • Surfacing the Deep Web • Searching tables on the surface Web • Fusion Tables: a platform for data management on the Web.

  14. Bad table

  15. Vertical Tables

  16. Sub-Header Rows

  17. Winners of the Boston Marathon (but that’s nowhere in the table)

  18. Schema Ok, but context is subtle (year = 2006)

  19. WebTables: Exploring the Relational Web [Cafarella et al., VLDB 2008, WebDB 08] • In a corpus of 14B raw tables, we estimate 154M are “good” relations • Single-table databases; Schema = attr labels + types • Largest corpus of databases & schemas we know of • The WebTables system: • Recovers good relations from crawl and enables search • Builds novel apps on the recovered data

  20. (Web-scale) Schema Collection • With 2.6 million schemas you can do some very interesting things. • Synonym discovery

  21. “KR”-Based Table Search [Wu, Madhavan, Miao, Pasca, Shen] • Ideally, we describe every table: • Class of entities it contains • Properties being modeled • Context, quality, … • Use Web-extracted knowledge bases • Extract isa-hierarchy using patterns: • “cities such as Paris and London” • “chemical elements including hydrogen and oxygen”
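
The pattern-based isa extraction on the slide above can be sketched with a regular expression. This is only an illustration on the slide's two example phrases, not the extractor used in the work cited: it assumes the class label is the one or two lowercase words immediately before "such as" or "including", which is crude and will mislabel many real sentences. The names `PATTERN` and `extract_isa` are hypothetical.

```python
import re

# Class label (up to two lowercase words), a cue phrase, then a list of
# instances such as "A and B" or "A, B, and C".
PATTERN = re.compile(
    r"(?P<cls>[a-z]+(?: [a-z]+)?) (?:such as|including) "
    r"(?P<items>\w+(?:, \w+)*(?:,? and \w+)?)"
)


def extract_isa(text):
    """Return (instance, class) pairs found by the pattern in `text`."""
    pairs = []
    for match in PATTERN.finditer(text):
        cls = match.group("cls")
        # Split the instance list on ", " and on "and" (with optional comma).
        for instance in re.split(r",? and |, ", match.group("items")):
            pairs.append((instance, cls))
    return pairs


cities = extract_isa("cities such as Paris and London")
elements = extract_isa("chemical elements including hydrogen and oxygen")
```

At web scale the same pair is typically seen many times, so frequency of extraction could serve as a rough confidence signal for each isa edge.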

  22. Step 1: Find “Subject” of Table • Not always the leftmost (or first non-numeric) column
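
As the slide above notes, simple positional rules do not always find the subject column. The sketch below shows one possible heuristic, which is an assumption for illustration and not the authors' method: skip columns that are mostly numeric and prefer the column whose values are most distinct, breaking ties in favor of the leftmost. The helper names and the sample rows are hypothetical.

```python
def is_number(cell):
    """Return True if the cell parses as a number (commas allowed)."""
    try:
        float(cell.replace(",", ""))
        return True
    except ValueError:
        return False


def guess_subject_column(rows):
    """Return the index of the likely subject column, or None."""
    best, best_score = None, -1.0
    for c in range(len(rows[0])):
        column = [row[c] for row in rows]
        numeric_fraction = sum(is_number(v) for v in column) / len(column)
        if numeric_fraction > 0.5:
            continue  # mostly numbers: unlikely to name the entities
        distinct_fraction = len(set(column)) / len(column)
        if distinct_fraction > best_score:  # strict: leftmost wins ties
            best, best_score = c, distinct_fraction
    return best


rows = [
    ["1", "Hydrogen", "gas"],
    ["2", "Helium", "gas"],
    ["3", "Lithium", "solid"],
]
subject = guess_subject_column(rows)
```

Here the first column is numeric and the third repeats values, so the second column is chosen.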

  23. Step 2: Associate classes with subject • Example class: chemical elements • Most of the time, the class labels are not in the attribute names

  24. Leveraging Web-extracted Ontologies • Given a query, e.g., (country, GDP) • Rank tables about countries that have GDP somewhere in the schema. • Very high precision (~90%) • Next challenge: understand binary properties and binary relationships. • Domain specialization: • System should improve if given ontologies in a particular domain.
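
The (country, GDP) query on the slide above can be sketched as follows, under stated assumptions: the isa knowledge base is a plain dictionary from instance to classes, each table already has its subject values identified, and the score is simply the fraction of subject values known to belong to the queried class. The ranking function, the data layout, and the toy tables are hypothetical and much simpler than a production ranker.

```python
def rank_tables(tables, isa, cls, prop):
    """Rank tables whose subject column is about `cls` and whose schema mentions `prop`."""
    scored = []
    for table in tables:
        # Require the property somewhere in the schema (case-insensitive).
        if not any(prop.lower() in attr.lower() for attr in table["schema"]):
            continue
        values = table["subject_values"]
        coverage = sum(cls in isa.get(v, set()) for v in values) / len(values)
        if coverage > 0:
            scored.append((coverage, table["name"]))
    return [name for coverage, name in sorted(scored, reverse=True)]


# Toy isa knowledge base and tables, for illustration only.
isa = {"France": {"country"}, "Japan": {"country"}, "Paris": {"city"}}
tables = [
    {"name": "cities", "schema": ["City", "Population"], "subject_values": ["Paris"]},
    {"name": "economies", "schema": ["Name", "GDP (USD)"], "subject_values": ["France", "Japan"]},
    {"name": "mixed", "schema": ["Place", "gdp"], "subject_values": ["France", "Paris"]},
]
ranking = rank_tables(tables, isa, "country", "GDP")
```

Note that the "economies" table is found even though its subject attribute is labeled "Name", which is the point of using extracted classes rather than attribute names.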

  25. Combine Search, Extraction, Cleaning and Integration [Cafarella, Khoussainova, H., VLDB 2009] • Try to create a database of all “VLDB program committee members”

  26. Outline • Surfacing the Deep Web • Searching tables on the surface Web • Fusion Tables: a platform for data management on the Web.

  27. Data Management for the Web Era • Integrate seamlessly with the Web: • Search, maps, … • Easy to use: • Much broader user base, pay-as-you-go • Very simple data integration • Provide incentives for sharing data • Facilitate collaboration • Fusion Tables – our current attempt [Madhavan, Gonzalez, Langen, Shapley, Shen]

  28. Incentive We store and leverage a large collection of tables.

  29. Incentive, Pay-..-Go

  30. Coffee Production

  31. Coffee Consumption

  32. Seamless integration with other web tools

  33. Toilet heat map…

  34. Database functionality on map

  35. Collaboration Table Search

  36. Show up in search results!

  37. Data Integration

  38. Merged Table • Carries attribution from both base tables. • Owners maintain control of their own data.
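
The idea of a merged table that carries attribution can be sketched as a join that records which owner supplied each column. This is an illustrative model only, not the Fusion Tables implementation or API; the table layout, owner names, and numeric values are made-up placeholders.

```python
def merge_tables(left, right, key):
    """Inner-join two tables on `key`, recording which owner supplied each column.

    Rows are copied into new dictionaries, so the base tables are left
    untouched and remain under their owners' control.
    """
    right_by_key = {row[key]: row for row in right["rows"]}
    rows = []
    for row in left["rows"]:
        match = right_by_key.get(row[key])
        if match is not None:
            rows.append({**row, **match})
    attribution = {}
    for table in (left, right):
        for column in table["rows"][0]:
            attribution.setdefault(column, []).append(table["owner"])
    return {"rows": rows, "attribution": attribution}


# Placeholder data: the owners and numbers are invented for the example.
production = {
    "owner": "alice",
    "rows": [{"country": "Brazil", "produced": 3}, {"country": "Peru", "produced": 1}],
}
consumption = {
    "owner": "bob",
    "rows": [{"country": "Brazil", "consumed": 2}],
}
merged = merge_tables(production, consumption, "country")
```

The join key is attributed to both owners, while every other column traces back to exactly one base table.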

  39. Fine Grained Discussions

  40. Example Uses of Fusion Tables • Tracking potholes in Spain • Displaying bike routes (MTBGuru) • State of California statistics • Government data from data.gov • Data about voting locations in the USA • Brazilian beaches • Chicago homicides • Most requested pop songs by year

  41. Conclusions • Information integration “in situ” • Blur the boundary between structured and unstructured data • Combine search, extraction, cleaning and integration into a single experience • Pay-as-you-go: introduce complexity as needed • Serve enterprises without IT depth • OpenII – an open-source platform for information integration.

  42. References • Fusion Tables: • tables.googlelabs.com • SIGMOD, SOCC, 2010 • Deep-web crawling: • [Madhavan et al., VLDB 08] • WebTables: • [Cafarella et al., VLDB 08] • Octopus: • [Cafarella et al., VLDB 09] • [Elmeleegy et al., VLDB 09]
