Indexing Dataspaces

Indexing Dataspaces
Xin (Luna) Dong (University of Washington), Alon Halevy (Google Inc.)
SIGMOD 2007
Presented by Aditya Sakhuja

Hi, I am
• Aditya Sakhuja
• 1st semester MS – CS, CoC
• Ongoing research: automation of schema matching
• Interested in online-enabled apps (should be path-breaking), IR, databases, security, the latest web technologies, skydiving, and scuba diving
• From India

Outline
• Motivation
• Overview of our approach
• Our algorithm
  • Indexing structure
  • Indexing hierarchies
• Experimental Results
• Conclusions

Many Data Management Applications Need to Manage Heterogeneous Data Sources
[Figure: five heterogeneous data sources, D1 to D5]

Traditional Data Integration Systems
[Figure: a mediated schema over data sources D1 to D5]
Mediated schema:
  Publication (title, year, author)
Source schema:
  Author (aid, name)
  Paper (pid, title, year)
  AuthoredBy (aid, pid)
Mapping:
  SELECT P.title AS title, P.year AS year, A.name AS author
  FROM Author AS A, Paper AS P, AuthoredBy AS B
  WHERE A.aid = B.aid AND P.pid = B.pid

Querying on Traditional Data Integration Systems
[Figure: a query Q posed on the mediated schema is reformulated into queries Q1 to Q5 over data sources D1 to D5]

In Many Applications it is Hard to Obtain Precise Semantic Mappings
[Figure: data sources D1 to D5, with the mappings marked by a question mark]

Scenario 1. Different Websites About Movies

Scenario 2. Personal Information Space
[Figure labels: Intranet, Internet]

Querying Dataspaces
• Dataspaces
  • Collections of heterogeneous data sources
  • Don’t necessarily include semantic mappings
  • Scenarios: personal information, enterprises, government agencies, smart homes, digital libraries, and the Web
How to effectively query and search a dataspace?

Example Dataspace
<publication>
  <title>Semex: Personal information management and integration</title>
  <author>Xin Dong</author>
  <author>Alon Halevy</author>
  <conference>IIWeb Workshop</conference>
</publication>
<thesis-proposal>
  <title>Semex: Personal …</title>
  <student>
    <name>Xin (Luna) Dong</name>
    <entryYear>2001</entryYear>
  </student>
</thesis-proposal>

Searching and Querying a Dataspace
• Structured query?
  • Requires detailed knowledge of the schemas
  • Requires precise attribute values
• Keyword search?
  • Does not allow specifying structure
• We consider queries that are
  • keyword-based
  • structure-aware

I. Predicate Query
• Conjunction of predicates
• Predicate: (v, {K1, …, Kn})
  • v: an attribute or association label
  • {K1, …, Kn}: a keyword set
• Example I: (title, ‘Semex’) (author, ‘Luna Dong’)
• Example II: (name, ‘Dong’)

II. Neighborhood Keyword Query
• Form: {K1, …, Kn}
• Example: ‘Semex’
• Relevant items
• Associated items

Indexing of the Heterogeneous Data
• Challenges
  • Index data from heterogeneous data sources
  • Capture both text values and structural information
• Traditional Indexes
  • Build a separate index for each attribute to support structured queries
  • Build an inverted list to support keyword search
  • XML indexes assume tree models and build multiple indexes ([Cooper et al., 01], [Kaushik et al., 05], [Wang et al., 03], etc.)
Our approach: Extend inverted lists to capture both text values and structure of the data

Contributions
• Design an index that
  • indexes data from heterogeneous data sources
  • captures both structure and text of the data
  • incorporates various types of heterogeneity, including synonyms and hierarchies of attributes and associations

View Data Sources as Triple Base
<publication>
  <title>Semex: Toward …</title>
  <authors>
    <author><name>Xin Dong</name></author>
    <author><name>Alon Halevy</name></author>
  </authors>…
</publication>
[Figure: the data drawn as a graph of objects, attributes, and associations; “Semex: …” is linked by author associations to “Luna Dong” and “Alon Halevy”, next to a departmental database]
Goal: Index triples to efficiently answer queries that combine text and structure
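One way to picture the triple base is as a flat list of (instance, label, value) triples, as in the minimal Python sketch below. The instance ids (p1, a1, a2) and the helper names are illustrative assumptions, not part of the talk.

```python
from typing import NamedTuple

class Triple(NamedTuple):
    subject: str          # instance id (hypothetical ids, chosen for illustration)
    label: str            # attribute or association name
    obj: str              # text value (attribute) or instance id (association)
    is_association: bool  # True when obj refers to another instance

TRIPLES = [
    Triple("p1", "title", "Semex: Personal information management and integration", False),
    Triple("a1", "name", "Xin Dong", False),
    Triple("a2", "name", "Alon Halevy", False),
    Triple("p1", "author", "a1", True),
    Triple("p1", "author", "a2", True),
]

def attributes_of(instance):
    """Attribute triples of one instance, as (label, text value) pairs."""
    return [(t.label, t.obj) for t in TRIPLES
            if t.subject == instance and not t.is_association]

def associated_with(instance, label):
    """Instances reachable from `instance` through the association `label`."""
    return [t.obj for t in TRIPLES
            if t.subject == instance and t.label == label and t.is_association]
```

Here `associated_with("p1", "author")` walks the author associations of the paper, and `attributes_of("a1")` lists the text attributes of one author.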

Indexing a Triple Base Using an Inverted List
Query: Dong
[Figure: the example triple base next to its inverted list]
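A plain inverted list over the attribute values can be sketched as follows. This is only an illustration under assumptions: the instance ids, the tokenizer, and the lowercasing of keywords are my choices, and each cell stores an occurrence count per instance.

```python
import re
from collections import defaultdict

# (instance, attribute, value); instance ids are hypothetical
ATTRIBUTE_TRIPLES = [
    ("p1", "title", "Semex: Personal information management and integration"),
    ("a1", "name", "Xin Dong"),
    ("a2", "name", "Alon Halevy"),
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_list(triples):
    """Map each keyword to {instance: number of occurrences}."""
    index = defaultdict(lambda: defaultdict(int))
    for instance, _attribute, value in triples:
        for word in tokenize(value):
            index[word][instance] += 1
    return index

def search(index, keyword):
    """Return the row for one keyword (empty if the keyword is not indexed)."""
    return dict(index.get(keyword.lower(), {}))

INDEX = build_inverted_list(ATTRIBUTE_TRIPLES)
```

A lookup such as `search(INDEX, "Dong")` finds the instance, but it cannot say in which attribute the keyword occurred, which is what the following slides address.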

Incorporate Attribute Labels in the Inverted List
Query: firstName “Dong” → “Dong/firstName/”
[Figure: the example triple base next to an inverted list whose keywords carry attribute labels]

Incorporate Association Labels in the Inverted List
Query: author “Dong” → “Dong/author/”
[Figure: the example triple base next to an inverted list whose keywords carry association labels]
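The idea of appending attribute and association labels to the indexed keywords might look like the sketch below. It reflects my reading of the slides rather than the authors' implementation: a keyword found in attribute a of an instance is indexed as "keyword/a/", and a keyword found in an instance reached through association r is indexed as "keyword/r/" for the source instance. The data and instance ids are hypothetical.

```python
import re
from collections import defaultdict

# (instance, attribute, value); instance ids are hypothetical
ATTRIBUTES = [
    ("p1", "title", "Semex: Personal information management and integration"),
    ("a1", "firstName", "Xin"),
    ("a1", "lastName", "Dong"),
    ("a2", "firstName", "Alon"),
    ("a2", "lastName", "Halevy"),
]
# (instance, association, associated instance)
ASSOCIATIONS = [("p1", "author", "a1"), ("p1", "author", "a2")]

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(attributes, associations):
    index = defaultdict(lambda: defaultdict(int))
    words_of = defaultdict(list)  # all words in the attributes of each instance
    for instance, attribute, value in attributes:
        for word in tokenize(value):
            index[f"{word}/{attribute}/"][instance] += 1
            words_of[instance].append(word)
    for instance, association, other in associations:
        # Index the source instance under the words of the associated instance.
        for word in words_of[other]:
            index[f"{word}/{association}/"][instance] += 1
    return index

def lookup(index, label, keyword):
    """Answer a single predicate (label, keyword) with one row lookup."""
    return dict(index.get(f"{keyword.lower()}/{label}/", {}))

INDEX = build_index(ATTRIBUTES, ASSOCIATIONS)
```

With this layout, `lookup(INDEX, "author", "Dong")` returns the paper whose author mentions "Dong", while `lookup(INDEX, "lastName", "Dong")` returns the author instance itself.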

Hierarchies of Attributes and Associations
<publication>
  <title>Semex: Toward on-the-fly personal information integration</title>
  <author>Xin Dong</author>
  <author>Alon Halevy</author>
  <conference>IIWeb Workshop</conference>
</publication>
<thesis-proposal>
  <title>Semex: Personal …</title>
  <student>
    <name>Xin (Luna) Dong</name>
    <entryYear>2001</entryYear>
  </student>
</thesis-proposal>
• Example II: (name, ‘Dong’)
Attribute Hierarchy:
  name
    firstName
    lastName

Incorporate Attribute Hierarchy in the Inverted List
Attribute Hierarchy: name, with sub-attributes firstName and lastName
Query: name “Dong”
[Figure: the example triple base next to its inverted list]

Naïve Approach: Expand Queries with Sub-Attributes
Attribute Hierarchy: name, with sub-attributes firstName and lastName
Query: name “Dong” → “Dong/name/ OR Dong/firstName/ OR …”
[Figure: the example triple base next to its inverted list]
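The naïve expansion can be sketched in a few lines of Python. The `SUB_ATTRIBUTES` table and the function names are assumptions made for illustration.

```python
# Attribute hierarchy from the slide: name has sub-attributes firstName and lastName.
SUB_ATTRIBUTES = {"name": ["firstName", "lastName"]}

def descendants(attribute, sub_attributes):
    """The attribute itself followed by all of its (transitive) sub-attributes."""
    result = [attribute]
    for child in sub_attributes.get(attribute, []):
        result.extend(descendants(child, sub_attributes))
    return result

def expand_query(attribute, keyword, sub_attributes):
    """Rewrite (attribute, keyword) into one index key per attribute in the subtree."""
    return [f"{keyword.lower()}/{a}/" for a in descendants(attribute, sub_attributes)]
```

Each expanded key needs its own index lookup, so the cost of a query grows with the number of sub-attributes below the queried attribute.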

Approach I: Duplicate Entries for Parent Attributes
Attribute Hierarchy: name, with sub-attributes firstName and lastName
Query: name “Dong” → “Dong/name/”
[Figure: the example triple base next to its inverted list]
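Duplicating entries for parent attributes amounts to indexing every keyword under its own attribute and under each ancestor. A small sketch, with a hypothetical `PARENT` table standing in for the hierarchy:

```python
# child attribute -> parent attribute (the hierarchy from the slide)
PARENT = {"firstName": "name", "lastName": "name"}

def ancestors_and_self(attribute):
    """The attribute followed by its parent, grandparent, and so on."""
    chain = [attribute]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def index_keys(word, attribute):
    """All index keys under which an occurrence of `word` in `attribute` is stored."""
    return [f"{word.lower()}/{a}/" for a in ancestors_and_self(attribute)]
```

A query on name then needs a single lookup of "dong/name/", at the price of storing each occurrence once per level of the hierarchy.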

Approach II. Concatenate a Keyword with a Hierarchy Path
Attribute Hierarchy: name, with sub-attributes firstName and lastName
Query: name “Dong” → “Dong/name/*”
[Figure: the example triple base next to its inverted list]
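With hierarchy paths in the keys, a query on name becomes a prefix search over a sorted list of rows. The sketch below assumes lowercased keywords, hypothetical instance ids, and a binary search via Python's `bisect` to find the first matching row.

```python
import bisect

# child attribute -> parent attribute (the hierarchy from the slide)
PARENT = {"firstName": "name", "lastName": "name"}

def hierarchy_path(attribute):
    """Path from the root attribute down to `attribute`, e.g. name/lastName."""
    path = [attribute]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return "/".join(reversed(path))

def key(word, attribute):
    return f"{word.lower()}/{hierarchy_path(attribute)}/"

# Rows of the inverted list, sorted by key; postings map instance -> count.
ROWS = sorted(
    [
        (key("Dong", "lastName"), {"a1": 1}),
        (key("Xin", "firstName"), {"a1": 1}),
        (key("Dong", "name"), {"s1": 1}),
        (key("Semex", "title"), {"p1": 1}),
    ],
    key=lambda row: row[0],
)
KEYS = [k for k, _ in ROWS]

def prefix_search(prefix):
    """Merge the postings of all rows whose key starts with `prefix`."""
    merged = {}
    for k, postings in ROWS[bisect.bisect_left(KEYS, prefix):]:
        if not k.startswith(prefix):
            break  # sorted order: matching rows are contiguous
        for instance, count in postings.items():
            merged[instance] = merged.get(instance, 0) + count
    return merged
```

Searching the prefix "dong/name/" reads both the "dong/name/" row and the "dong/name/lastName/" row, so the number of rows read still depends on the size of the subtree.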

Approach III. Hierarchy Path + Summary Rows
Attribute Hierarchy: name, with sub-attributes firstName and lastName
Query: name “Dong” → “Dong/name/*”
[Figure: the example triple base next to an inverted list with summary rows]

Summary Rows
• Goal: Given a threshold t, answer any prefix search by reading no more than t rows.
• Definition:
  • The indexed keyword of a summary row has the form p//, e.g. “Dong/name//”
  • Rows starting with p/ are shadowed by the summary row p//, e.g. “Dong/name/lastName/” is shadowed by “Dong/name//”

Answering Prefix Search with Summary Rows
• Once a summary row has been read, ignore the rows it shadows
• Example (t=1): Query: name “Dong” → “Dong/name/*”
• Example (t=1): Query: name “Xin” → “Xin/name/*”
[Figure: inverted list with summary rows]

Adding Summary Rows
• Step 1. Create a summary row for a prefix p if
  • Searching prefix p needs to read more than t rows
  • There is no p’ with p as a prefix such that searching prefix p’ needs to read more than t rows
• Step 2. Remove row p if summary row p/ exists
• Example (t=1)
[Figure: inverted list with summary rows]
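The two steps, together with the rule for answering a prefix search, can be sketched as below. This is one possible reading of the slides, not the authors' code: prefixes are processed from the longest to the shortest so that deeper summary rows exist before shallower ones are considered, labels are assumed to be alphanumeric so that a summary row sorts before the rows it shadows, and t is assumed to be at least 1. Data and instance ids are hypothetical.

```python
import bisect

def rows_read(keys, prefix):
    """Keys a prefix search reads from the sorted `keys`, skipping shadowed rows."""
    read, shadow = [], None
    for k in keys[bisect.bisect_left(keys, prefix):]:
        if not k.startswith(prefix):
            break
        if shadow is not None and k.startswith(shadow):
            continue  # shadowed by a summary row that was already read
        read.append(k)
        if k.endswith("//"):  # summary row p// shadows rows starting with p/
            shadow = k[:-1]
    return read

def merge(index, keys):
    merged = {}
    for k in keys:
        for instance, count in index[k].items():
            merged[instance] = merged.get(instance, 0) + count
    return merged

def add_summary_rows(index, t):
    index = {k: dict(v) for k, v in index.items()}  # work on a copy
    prefixes = set()
    for k in index:
        parts = k.split("/")[:-1]
        for i in range(1, len(parts) + 1):
            prefixes.add("/".join(parts[:i]) + "/")
    # Longest prefixes first, so deeper summary rows are created before shallower ones.
    for p in sorted(prefixes, key=lambda s: (-s.count("/"), s)):
        read = rows_read(sorted(index), p)
        if len(read) > t:                      # Step 1: too many rows to read
            index[p + "/"] = merge(index, read)
            index.pop(p, None)                 # Step 2: row p is covered by p/
    return index

def prefix_search(index, prefix):
    read = rows_read(sorted(index), prefix)
    return read, merge(index, read)

INDEX = {
    "dong/name/": {"s1": 1},
    "dong/name/lastName/": {"a1": 1},
    "dong/author/": {"p1": 1},
    "xin/name/firstName/": {"a1": 1},
}
SUMMARIZED = add_summary_rows(INDEX, t=1)
```

In this small example with t=1, the search for "dong/name/" reads only the summary row "dong/name//", while the search for "xin/name/" needs no summary row because a single row already answers it.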

Answering Neighborhood Keyword Queries
Query: Semex → “Semex/*”
[Figure: the example triple base, with author and inverse ~author associations, next to its inverted list]
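A neighborhood keyword query can then be answered by a prefix search on the bare keyword. The sketch below assumes that each association is also indexed in the inverse direction under a "~" label, which is how I read the ~author edges in the figure; the data, ids, and function names are hypothetical.

```python
import re
from collections import defaultdict

# (instance, attribute, value); instance ids are hypothetical
ATTRIBUTES = [
    ("p1", "title", "Semex: Personal information management and integration"),
    ("a1", "name", "Xin Dong"),
    ("a2", "name", "Alon Halevy"),
]
# (instance, association, associated instance)
ASSOCIATIONS = [("p1", "author", "a1"), ("p1", "author", "a2")]

def tokenize(text):
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index():
    index = defaultdict(lambda: defaultdict(int))
    words_of = defaultdict(list)
    for instance, attribute, value in ATTRIBUTES:
        for word in tokenize(value):
            index[f"{word}/{attribute}/"][instance] += 1
            words_of[instance].append(word)
    for source, label, target in ASSOCIATIONS:
        for word in words_of[target]:
            index[f"{word}/{label}/"][source] += 1
        for word in words_of[source]:  # inverse direction, labeled with "~"
            index[f"{word}/~{label}/"][target] += 1
    return index

def neighborhood_query(index, keyword):
    """Merge every row whose key starts with 'keyword/'."""
    prefix = keyword.lower() + "/"
    merged = {}
    for k in sorted(index):
        if k.startswith(prefix):
            for instance, count in index[k].items():
                merged[instance] = merged.get(instance, 0) + count
    return merged

INDEX = build_index()
```

For the keyword "Semex", the row "semex/title/" contributes the relevant paper and the row "semex/~author/" contributes the associated authors.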

Implementation Details
• Our index extends the Lucene indexing tool
  • Lucene stores an inverted list as a sorted array
• Implemented in Java
• Run on a machine with four 3.2GHz CPUs (1024KB cache) and 1GB of memory

Experimental Setting
• Data sets
  • A 50MB personal data set
  • Two 10GB XML data sets: Wikipedia, XMark Benchmark
• Queries: each with one predicate or keyword
  • Predicate Query with leaf attributes
  • Predicate Query with branch attributes
  • Predicate Query with associations
  • Neighborhood Keyword Query
• Measures (in milliseconds)
  • Index-lookup time
  • Query-answering time

Our Indexing Method Significantly Improves Query Answering

XML Index [Kaushik et al., 05]
• Three indexes
  • Inverted list: index each attribute value on its text
  • Structured index: index each attribute value on the labels of the attribute and its ancestor attributes
  • Relationship index: index each instance on its associated instances
