Exploring DataSpaces: A New Approach to Data Management

DataSpaces:A New Abstraction for Data Management Alon Halevy* DASFAA, 2006 Singapore *Joint work with Mike Franklin and David Maier

Outline • Dataspaces: • collections of heterogeneous (un)-structured data. • examples and characteristics • Dataspace Support Platforms: • “Pay-as-you-go” data management • Getting there: • Recent progress and research challenges

Shrapnels in Baghdad Story courtesy of Phil Bernstein

AttachedTo Recipient ConfHomePage CourseGradeIn ExperimentOf PublishedIn Sender Cites EarlyVersion ArticleAbout PresentationFor FrequentEmailer CoAuthor AddressOf OriginatedFrom BudgetOf HomePage Personal Information Management [Semex: Dong et al.]

Google Base

What do they have in common? • All dataspaces contain >20% porn. • The rest is spam.

What do they have in common? • Must manage allthe data in the space • Need best-effort services with no setup time. • Data is heterogeneous, • possibly unstructured • Do not have control over the data, just access.

Isn’t this Data Integration? Sequenceable Entity Structured Vocabulary Experiment Phenotype Gene Nucleotide Sequence Microarray Experiment Protein OMIM HUGO Swiss- Prot GO Gene- Clinics Locus- Link Entrez GEO

No, it’s Data Co-existence • Data integration systems require semantic mappings.

Semantic Mappings Books Title ISBN Price DiscountPrice Edition Authors ISBN FirstName LastName BooksAndMusic Title Author Publisher ItemID ItemType SuggestedPrice Categories Keywords BookCategories ISBN Category CDCategories ASIN Category CDs Album ASIN Price DiscountPrice Studio Artists ASIN ArtistName GroupName Inventory Database A Inventory Database B

No, It’s Data Co-existence • Data integration systems require semantic mappings. • Dataspaces are “pay-as-you-go”: • Provide some services immediately • Create more tight integrations as needed.

The Cost of Semantics Schema first vs. schema last Benefit Dataspaces Data integration solutions Investment (time, cost)

Why Now? • Data management is moving towards dataspace-like applications. • Prediction: • Data management is about people, not enterprises. • In 5 years our community will figure it out. • We’ve made relevant progress: • Combining DB & IR • Creation and management of semantic mappings. • Uncertainty, lineage, inconsistency

Dataspaces Fundamentals:Participants and Relationships sensor WSDL RDB java snapshot 1hr updates SDB sensor XML java manually created sensor schema mapping WSDL RDB XML replica view RDB

sensor WSDL RDB java snapshot 1hr updates SDB sensor XML java manually created sensor schema mapping WSDL RDB XML replica view RDB DSSP Components • Seamless flow from search to query • Query about participants, locate data • Lineage, uncertainty, completeness • Set up workflows • Find participants • Discover and refine relationships • Maintain catalog • Heterogeneous index • Reference reconciliation • Additional associations • Cache for performance & availability • participants’ capabilities • Relationships • Quality of both Catalog Local Store & Index Relax Search & query Administration Discovery & Enhancement

sensor WSDL RDB java snapshot 1hr updates SDB sensor XML java manually created sensor schema mapping WSDL RDB XML replica view RDB DSSP Components Catalog Local Store & Index Search & query Administration Discovery & Enhancement

Querying a Dataspace • Best effort, based on: • Approximate semantic mappings • Other mechanisms • Example -- searching for Beng Chin’s phone number: • Keyword search for “beng chin ooi” • Examine attributes of tuples/XML elements • Match attributes to ‘address’

Querying a Dataspace • Best effort • Combine structured and unstructured data

Two Kinds of Data SQL Keyword Queries (XQuery) Structured Data (or XML) Unstructured Data [Dong, Liu, Halevy]

Querying a Dataspace • Best effort • Combine structured and unstructured data • Rank: answers, sources

Volvo Palo alto Volvo Palo alto

Acura integra Palo alto

Querying a Dataspace • Best effort • Combine structured and unstructured data • Rank: answers, sources • Iterative -- sequences of queries: • Used cars palo alto • Saab for sale • Classified ads • Classified ads for Saabs near Palo Alto

Querying a Dataspace • Best effort • Combine structured and unstructured data • Rank: answers, sources • Iterative: sequences of queries • Reflective: need to explain and expose confidence in answers.

Dataspace Reflection • Sources of uncertainty: • Unreliable sources • Data obtained by imprecise extractions • Inconsistent data • Approximate query answering (mappings) • A DSSP must: • Expose the lineage of an answer • Web search engines already do this • Reason about relationship between sources

LUI Introspection • Currently, three distinct formalisms: • Uncertainty • Lineage • Inconsistency • Single formalism should do it all: • Uncertainty and lineage (Trio @ Stanford) • Lineage & inconsistency (Orchestra @ U.Penn)

ULDB’s[Benjelloun, Das Sarma, Halevy, Widom] • Combine uncertainty and lineage • Based on x-tuples: {(t1 | t2 | t3)} • Queries can be answered with no additional complexity • You can do even better than uncertain DBs. • Because of lineage, you can sometimes obtain completeness.

sensor WSDL RDB java snapshot 1hr updates SDB sensor XML java manually created sensor schema mapping WSDL RDB XML replica view RDB DSS Components Catalog Local Store & Index Search & query Administration Discovery & Enhancement

Personal Information Space Alon Halevy Departmental Database Semex: … authoredPaper author Luna Dong author authoredPaper Inverted List

Personal Information Space Alon Halevy Departmental Database Semex: … authoredPaper author Luna Dong author authoredPaper Inverted List Query: Dong

Personal Information Space Alon Halevy Departmental Database Semex: … authoredPaper author Luna Dong author authoredPaper Inverted List Query: FirstName “Dong”

Personal Information Space Alon Halevy Departmental Database Semex: … authoredPaper author Luna Dong author authoredPaper Inverted List Query: name “Dong”

Personal Information Space Alon Halevy Departmental Database Semex: … authoredPaper author Luna Dong author authoredPaper Inverted List Query: Paper author “Dong”

AttachedTo Recipient ConfHomePage CourseGradeIn ExperimentOf PublishedIn Sender Cites ComeFrom EarlyVersion ArticleAbout PresentationFor FrequentEmailer CoAuthor AddressOf OriginitatedFrom BudgetOf HomePage Enhancing a Dataspace • Creating associations • [Dong & Halevy, CIDR 2005] • Reference reconciliation • [Dong et al., SIGMOD 2005] • Very active field.

Reusing Human Attention • Human attention is most expensive. • Reuse whenever possible. E.g.,: • Manual schema mapping • Annotations • Queries written on data • Temporary collections of items • Operations on the data (cut & paste) • Solicit semantic information selectively • The ESP game: [von Ahn et al.]

Learning from Past Matches [Doan et. al, Transformic] • Every manual map is a learning example. • Learn models for elements in mediated schema. • Use multi-strategy learning. • Thousands of maps in very little time. Reuse for a very related task.

Corpus-based Matching Product Music (no tuples) [Madhavan et al.]

CD albumName prodID Corpus MusicCD CD Obtaining More Evidence Product

MusicCD CD albumName prodID ASIN Title album artists artistName recordLabel recordCompany discount price 4Y6DF23 The Doors Columbia $12.99 The Best of the Doors Comparing with More Evidence Product Music

Challenges • Learn from other kinds of user activities • Create other kinds of relationships between participants • Identify higher-level goals from user actions • Develop a formal framework for reusing human attention.

Conclusion “Dataspaces: because that’s the size of the problem” • Pay-as-you-go data integration • Data management for the masses • Key: reuse of human attention • Need to be very data driven in our research

Some References • www.cs.washington.edu/homes/alon • SIGMOD Record, December 2005: • Original dataspace vision paper • PODS 2006: • Specific technical challenges for dataspace research • Semex: CIDR 2005, SIGMOD 2005 • Teaching integration to undergraduates: • SIGMOD Record, September, 2003.

1. Build initial models T T S S Name: Instances: Type: … Name: Instances: Type: … Name: Instances: Type: … Name: Instances: Type: … M’s Ms M’t Mt t t s s 2. Find similar elements in corpus Corpus of schemas and mappings 3. Build augmented models 5. Use additional statistics (IC’s) to refine match 4. Match using augmented models

Exploring DataSpaces: A New Approach to Data Management