This presentation covers the SIGMOD 2007 paper "Indexing Dataspaces": an index that extends inverted lists to capture both the text and the structure of heterogeneous data, including hierarchies of attributes and associations. The experimental setting and conclusions are also presented.
Indexing Dataspaces • Xin (Luna) Dong (University of Washington), Alon Halevy (Google Inc.) • SIGMOD 2007 • Presented by Aditya Sakhuja
Hi, I am • Aditya Sakhuja • 1st-semester MS – CS, CoC • Ongoing research: automation of schema matching • Interested in online-enabled apps (should be path-breaking), IR, databases, security, latest web technologies, skydiving and scuba diving • From India
Outline • Motivation • Overview of our approach • Our algorithm • Indexing structure • Indexing hierarchies • Experimental Results • Conclusions
Many Data Management Applications Need to Manage Heterogeneous Data Sources [Figure: data sources D1–D5]
Traditional Data Integration Systems • SELECT P.title AS title, P.year AS year, A.name AS author FROM Author AS A, Paper AS P, AuthoredBy AS B WHERE A.aid=B.aid AND P.pid=B.pid • Publication (title, year, author) • Author (aid, name) • Paper (pid, title, year) • AuthoredBy (aid, pid) [Figure: mediated schema over data sources D1–D5]
Querying on Traditional Data Integration Systems [Figure: query Q on the mediated schema; queries Q1–Q5 on data sources D1–D5]
In Many Applications it is Hard to Obtain Precise Semantic Mappings [Figure: data sources D1–D5 with unknown mappings (?)]
Scenario 2. Personal Information Space Intranet Internet
Querying Dataspaces • Dataspaces: collections of heterogeneous data sources that don’t necessarily include semantic mappings • Scenarios: personal information, enterprises, government agencies, smart homes, digital libraries, and the Web • How to effectively query and search a dataspace?
Example Dataspace <publication> <title>Semex: Personal information management and integration</title> <author>Xin Dong</author> <author>Alon Halevy</author> <conference>IIWeb Workshop</conference> </publication> <thesis-proposal> <title>Semex: Personal …</title> <student> <name>Xin (Luna) Dong</name> <entryYear>2001</entryYear> </student> </thesis-proposal>
Searching and Querying a Dataspace • Structured query? Requires detailed knowledge of schemas; requires precise attribute values • Keyword search? Does not allow specifications on structure • We consider queries that are keyword-based and structure-aware
I. Predicate Query • Conjunction of predicates • Predicate: (v, {K1, …, Kn}) • v - an attribute or association label • {K1, …, Kn} - a keyword set <publication> <title>Semex: Personal information management and integration</title> <author>Xin Dong</author> <author>Alon Halevy</author> <conference>IIWeb Workshop</conference> </publication> <thesis-proposal> <title>Semex: Personal …</title> <student> <name>Xin (Luna) Dong</name> <entryYear>2001</entryYear> </student> </thesis-proposal>
I. Predicate Query • Example I: (title, ‘Semex’) (author, ‘Luna Dong’) <publication> <title>Semex: Personal information management and integration</title> <author>Xin Dong</author> <author>Alon Halevy</author> <conference>IIWeb Workshop</conference> </publication> <thesis-proposal> <title>Semex: Personal …</title> <student> <name>Xin (Luna) Dong</name> <entryYear>2001</entryYear> </student> </thesis-proposal>
I. Predicate Query • Example II: (name, ‘Dong’) <publication> <title>Semex: Personal information management and integration</title> <author>Xin Dong</author> <author>Alon Halevy</author> <conference>IIWeb Workshop</conference> </publication> <thesis-proposal> <title>Semex: Personal …</title> <student> <name>Xin (Luna) Dong</name> <entryYear>2001</entryYear> </student> </thesis-proposal>
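The two example predicate queries can be evaluated directly over the example data with a naive scan, before any index is involved. This is only a sketch: the flattened `instances` dictionary is a hypothetical encoding of the two XML fragments, and it assumes that a predicate holds when the labelled value contains at least one of its keywords, while the query as a whole is a conjunction of predicates.

```python
import re

def tokens(text):
    """Lower-cased word set of a value."""
    return set(re.findall(r"\w+", text.lower()))

def satisfies(instance, predicate):
    """Assumption: a predicate (v, {K1..Kn}) holds if some value under
    label v contains at least one of the keywords."""
    label, keywords = predicate
    wanted = {k.lower() for k in keywords}
    return any(wanted & tokens(value) for value in instance.get(label, []))

def predicate_query(instances, predicates):
    """Conjunction of predicates, evaluated by scanning every instance."""
    return [name for name, inst in instances.items()
            if all(satisfies(inst, p) for p in predicates)]

# Hypothetical flattened form of the two example XML fragments.
instances = {
    "publication": {
        "title": ["Semex: Personal information management and integration"],
        "author": ["Xin Dong", "Alon Halevy"],
        "conference": ["IIWeb Workshop"],
    },
    "thesis-proposal": {
        "title": ["Semex: Personal …"],
        "name": ["Xin (Luna) Dong"],
        "entryYear": ["2001"],
    },
}

example_1 = predicate_query(instances, [("title", {"Semex"}), ("author", {"Luna", "Dong"})])
example_2 = predicate_query(instances, [("name", {"Dong"})])
```

Under these assumptions Example I selects the publication and Example II selects the thesis proposal. The scan touches every instance, which is the cost an index is meant to avoid.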
II. Neighborhood Keyword Query • Form: {K1, …, Kn} • Example: ‘Semex’ • Relevant items • Associated items <publication> <title>Semex: Personal information management and integration</title> <author>Xin Dong</author> <author>Alon Halevy</author> <conference>IIWeb Workshop</conference> </publication> <thesis-proposal> <title>Semex: Personal …</title> <student> <name>Xin (Luna) Dong</name> <entryYear>2001</entryYear> </student> </thesis-proposal>
Indexing of the Heterogeneous Data • Challenges • Index data from heterogeneous data sources • Capture both text values and structural information • Traditional Indexes • Build a separate index for each attribute to support structured queries • Build an inverted list to support keyword search • XML indexes assume tree models and build multiple indexes ([Cooper et al., 01],[Kaushik et al., 05],[Wang et al., 03], etc.) Our approach: Extend inverted lists to capture both text values and structure of the data
Contributions • Design an index that • indexes data from heterogeneous data sources • captures both structure and text of the data • incorporates various types of heterogeneity, including synonyms and hierarchies of attributes and associations
View Data Sources as Triple Base <publication> <title>Semex: Toward …</title> <authors> <author><name> Xin Dong</name></author> <author><name> Alon Halevy</name></author> </authors>… </publication> [Figure: triple graph of objects (“Semex: …”, “Luna Dong”, “Alon Halevy”) linked by author associations; legend: Attribute, Association, Object]
View Data Sources as Triple Base • Goal: Index triples to efficiently answer queries that combine text and structure [Figure: triple base (“Semex: …”, “Luna Dong”, “Alon Halevy”, “Departmental Database”, author associations)]
Indexing a Triple Base Using an Inverted List [Figure: triple base (“Semex: …”, “Luna Dong”, “Alon Halevy”, “Departmental Database”, author associations) with its inverted list]
Indexing a Triple Base Using an Inverted List • Query: Dong [Figure: triple base (“Semex: …”, “Luna Dong”, “Alon Halevy”, “Departmental Database”, author associations) with its inverted list]
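A plain inverted list over the attribute triples might look like the sketch below. The instance ids (`a1`, `a2`, `p1`) and the triple values are hypothetical stand-ins for the objects in the figure, and keeping an occurrence count per (keyword, instance) cell is an assumption about the table layout.

```python
import re
from collections import defaultdict

# Hypothetical attribute triples: (instance id, attribute label, text value).
attribute_triples = [
    ("a1", "firstName", "Xin"),
    ("a1", "lastName", "Dong"),
    ("a2", "name", "Alon Halevy"),
    ("p1", "title", "Semex: Personal information management and integration"),
]

def build_inverted_list(triples):
    """Map each keyword to {instance id: occurrence count}; labels are ignored."""
    index = defaultdict(lambda: defaultdict(int))
    for instance, _label, value in triples:
        for word in re.findall(r"\w+", value.lower()):
            index[word][instance] += 1
    return index

index = build_inverted_list(attribute_triples)
hits = dict(index["dong"])  # answers the keyword query "Dong"
```

The lookup finds the instance that mentions "Dong" but cannot tell which attribute the keyword appeared in, which is the gap the later slides close.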
Incorporate Attribute Labels in the Inverted List • Query: firstName “Dong” [Figure: triple base (“Semex: …”, “Luna Dong”, “Alon Halevy”, “Departmental Database”, author associations) with its inverted list]
Incorporate Attribute Labels in the Inverted List • Query: firstName “Dong” → “Dong/firstName/” [Figure: triple base (“Semex: …”, “Luna Dong”, “Alon Halevy”, “Departmental Database”, author associations) with its inverted list]
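One way to read the “Dong/firstName/” lookup is that each indexed keyword is the word concatenated with its attribute label. The sketch below assumes that key format (lower-cased here for matching) and uses hypothetical triples.

```python
import re

# Hypothetical attribute triples: (instance id, attribute label, text value).
attribute_triples = [
    ("a1", "firstName", "Xin"),
    ("a1", "lastName", "Dong"),
    ("a2", "firstName", "Alon"),
    ("a2", "lastName", "Halevy"),
]

def build_index(triples):
    """Index each word under the key 'word/attribute/'."""
    index = {}
    for instance, attribute, value in triples:
        for word in re.findall(r"\w+", value.lower()):
            row = index.setdefault(f"{word}/{attribute}/", {})
            row[instance] = row.get(instance, 0) + 1
    return index

def lookup(index, attribute, keyword):
    """Answer a one-predicate query (attribute, keyword) with a single key lookup."""
    return dict(index.get(f"{keyword.lower()}/{attribute}/", {}))

index = build_index(attribute_triples)
first_name_hits = lookup(index, "firstName", "Dong")
last_name_hits = lookup(index, "lastName", "Dong")
```

With this data the firstName query comes back empty while the lastName query finds `a1`, so the structure in the query now filters the keyword match.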
Incorporate Association Labels in the Inverted List • Query: author “Dong” [Figure: triple base (“Semex: …”, “Luna Dong”, “Alon Halevy”, “Departmental Database”, author associations) with its inverted list]
Incorporate Association Labels in the Inverted List • Query: author “Dong” → “Dong/author/” [Figure: triple base (“Semex: …”, “Luna Dong”, “Alon Halevy”, “Departmental Database”, author associations) with its inverted list]
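For associations, a plausible reading of “Dong/author/” is that the keywords of the associated instance are indexed under the association label, in the row of the instance that holds the association. The sketch below assumes exactly that; the triples and ids are hypothetical.

```python
import re

# Hypothetical triples: attribute triples carry text, association triples link instances.
attribute_triples = [
    ("a1", "name", "Xin Dong"),
    ("a2", "name", "Alon Halevy"),
    ("p1", "title", "Semex"),
]
association_triples = [("p1", "author", "a1"), ("p1", "author", "a2")]

def build_index(attributes, associations):
    index, words_of = {}, {}

    def bump(key, instance):
        row = index.setdefault(key, {})
        row[instance] = row.get(instance, 0) + 1

    for instance, label, value in attributes:
        for word in re.findall(r"\w+", value.lower()):
            bump(f"{word}/{label}/", instance)
            words_of.setdefault(instance, []).append(word)
    # Assumption: a source instance is indexed on the words of its targets.
    for source, label, target in associations:
        for word in words_of.get(target, []):
            bump(f"{word}/{label}/", source)
    return index

index = build_index(attribute_triples, association_triples)
author_hits = index.get("dong/author/", {})
```

The query author “Dong” then returns the paper `p1` rather than the person, because the keyword is matched through the association.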
Hierarchies of Attributes and Associations • Example II: (name, ‘Dong’) • Attribute hierarchy: name, with sub-attributes firstName and lastName <publication> <title>Semex: Toward on-the-fly personal information integration</title> <author>Xin Dong</author> <author>Alon Halevy</author> <conference>IIWeb Workshop</conference> </publication> <thesis-proposal> <title>Semex: Personal …</title> <student> <name>Xin (Luna) Dong</name> <entryYear>2001</entryYear> </student> </thesis-proposal>
Incorporate Attribute Hierarchy in the Inverted List • Query: name “Dong” [Figure: attribute hierarchy name (firstName, lastName); triple base with its inverted list]
Naïve Approach: Expand Queries with Sub-Attributes • Query: name “Dong” → “Dong/name/ OR Dong/firstName/ OR …” [Figure: attribute hierarchy name (firstName, lastName); triple base with its inverted list]
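The naive expansion can be sketched as one index lookup per attribute in the subtree, with the results merged. The `hierarchy` mapping and the tiny `index` below are hypothetical.

```python
# Hypothetical attribute hierarchy (parent -> children) and index rows.
hierarchy = {"name": ["firstName", "lastName"]}
index = {
    "dong/name/": {"s1": 1},
    "dong/lastName/": {"a1": 1},
    "xin/firstName/": {"a1": 1},
}

def expand(attribute):
    """The attribute itself followed by all of its sub-attributes."""
    labels = [attribute]
    for child in hierarchy.get(attribute, []):
        labels.extend(expand(child))
    return labels

def search(attribute, keyword):
    """OR together one lookup per expanded label; returns (hits, lookups)."""
    merged, lookups = {}, 0
    for label in expand(attribute):
        lookups += 1
        for instance, count in index.get(f"{keyword.lower()}/{label}/", {}).items():
            merged[instance] = merged.get(instance, 0) + count
    return merged, lookups

hits, lookups = search("name", "Dong")
```

The number of lookups grows with the size of the sub-attribute tree, which is what the following approaches try to avoid.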
Approach I: Duplicate Entries for Parent Attributes • Query: name “Dong” → “Dong/name/” [Figure: attribute hierarchy name (firstName, lastName); triple base with its inverted list]
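Approach I can be sketched by writing every occurrence under its own attribute and again under each ancestor attribute at indexing time. The `parent` mapping and the triples are hypothetical.

```python
import re

# Hypothetical child -> parent attribute hierarchy.
parent = {"firstName": "name", "lastName": "name"}

def labels_with_ancestors(attribute):
    """The attribute followed by its ancestors up to the root."""
    labels = [attribute]
    while labels[-1] in parent:
        labels.append(parent[labels[-1]])
    return labels

def build_index(triples):
    index = {}
    for instance, attribute, value in triples:
        for word in re.findall(r"\w+", value.lower()):
            # Duplicate the entry for the attribute and every ancestor.
            for label in labels_with_ancestors(attribute):
                row = index.setdefault(f"{word}/{label}/", {})
                row[instance] = row.get(instance, 0) + 1
    return index

triples = [
    ("a1", "firstName", "Xin"),
    ("a1", "lastName", "Dong"),
    ("s1", "name", "Xin (Luna) Dong"),
]
index = build_index(triples)
name_hits = index["dong/name/"]
```

The query name “Dong” is now a single lookup, at the price of extra rows: the lastName occurrence is stored under both `dong/lastName/` and `dong/name/`.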
Approach II. Concatenate a Keyword with a Hierarchy Path • Query: name “Dong” → “Dong/name/*” [Figure: attribute hierarchy name (firstName, lastName); triple base with its inverted list]
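With hierarchy paths in the key, a query on a parent attribute becomes a prefix search over sorted keys. The sketch below uses Python's `bisect` on a sorted list as a stand-in for a sorted inverted list; the `parent` mapping and the keys are hypothetical.

```python
import bisect

# Hypothetical child -> parent attribute hierarchy.
parent = {"firstName": "name", "lastName": "name"}

def hierarchy_path(attribute):
    """Path from the root attribute down to `attribute`, e.g. 'name/lastName/'."""
    path = [attribute]
    while path[-1] in parent:
        path.append(parent[path[-1]])
    return "/".join(reversed(path)) + "/"

def index_key(keyword, attribute):
    return f"{keyword.lower()}/{hierarchy_path(attribute)}"

def prefix_search(sorted_keys, prefix):
    """All keys starting with `prefix`; they are contiguous in sorted order."""
    start = bisect.bisect_left(sorted_keys, prefix)
    matches = []
    for key in sorted_keys[start:]:
        if not key.startswith(prefix):
            break
        matches.append(key)
    return matches

keys = sorted([
    index_key("Dong", "name"),
    index_key("Dong", "lastName"),
    index_key("Dong", "title"),
    index_key("Xin", "firstName"),
])
name_rows = prefix_search(keys, index_key("Dong", "name"))
last_name_rows = prefix_search(keys, index_key("Dong", "lastName"))
```

No entry is duplicated, but a query on a high-level attribute may have to read one row per sub-attribute that contains the keyword.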
Approach III. Hierarchy Path + Summary Rows • Query: name “Dong” → “Dong/name/*” [Figure: attribute hierarchy name (firstName, lastName); triple base with its inverted list]
Summary Rows • Goal: Given a threshold t, answer any prefix search by reading no more than t rows. • Definition: • The indexed keyword: p// (e.g. “Dong/name//”) • Rows starting with p/ are shadowed by the summary row p// (e.g. “Dong/name/lastName/” is shadowed by “Dong/name//”)
Answering Prefix Search with Summary Rows • Once a summary row has been read, ignore the rows shadowed by it • Example (t=1): Query: name “Dong” → “Dong/name/*” [Figure: inverted list]
Answering Prefix Search with Summary Rows • Once a summary row has been read, ignore the rows shadowed by it • Example (t=1): Query: name “Xin” → “Xin/name/*” [Figure: inverted list]
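A prefix search that honours summary rows might be sketched as follows. The row contents are hypothetical; the sketch assumes that a summary row is recognised by its trailing `//`, that it holds the merged counts of every row it shadows, and that `/` sorts before the attribute names so the summary row comes first in its group.

```python
import bisect

def prefix_search(rows, prefix):
    """rows: sorted list of (key, {instance: count}).
    Returns (merged counts, number of rows actually read)."""
    keys = [key for key, _ in rows]
    i = bisect.bisect_left(keys, prefix)
    merged, rows_read, shadow = {}, 0, None
    while i < len(rows) and keys[i].startswith(prefix):
        key, postings = rows[i]
        i += 1
        if shadow is not None and key.startswith(shadow):
            continue  # shadowed by a summary row that was already read
        rows_read += 1
        for instance, count in postings.items():
            merged[instance] = merged.get(instance, 0) + count
        if key.endswith("//"):
            shadow = key[:-1]  # summary row 'p//' shadows rows starting with 'p/'
    return merged, rows_read

# Hypothetical index with one summary row.
rows = [
    ("dong/name//", {"a1": 1, "s1": 1}),
    ("dong/name/lastName/", {"a1": 1}),
    ("xin/name/firstName/", {"a1": 1}),
]
dong_hits, dong_rows_read = prefix_search(rows, "dong/name/")
xin_hits, xin_rows_read = prefix_search(rows, "xin/name/")
```

Both example queries read a single row, matching t=1, and a more specific search such as `dong/name/lastName/` still works because the shadowed row is kept in the index.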
Adding Summary Rows • Step 1. Create a summary row for a prefix p if (a) searching prefix p needs to read more than t rows, and (b) there is no p’ with p as prefix such that searching prefix p’ needs to read more than t rows • Step 2. Remove row p if summary row p/ exists • Example (t=1) [Figure: inverted list]
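The two steps can be sketched as a deepest-first pass over all candidate prefixes. This is one interpretation rather than the authors' exact procedure: it assumes candidate prefixes are the `/`-terminated prefixes of existing keys, that processing longer prefixes first satisfies the "no p’" condition, and that a summary row stores the merged counts of the original rows under its prefix. The rows are hypothetical.

```python
def prefixes_of(key):
    """'/'-terminated prefixes of a key, e.g. 'dong/', 'dong/name/', ..."""
    parts = key.split("/")[:-1]
    return ["/".join(parts[:i]) + "/" for i in range(1, len(parts) + 1)]

def rows_read(sorted_keys, prefix):
    """Rows a prefix search reads, skipping rows shadowed by summary rows."""
    count, shadow = 0, None
    for key in sorted_keys:
        if not key.startswith(prefix):
            continue
        if shadow is not None and key.startswith(shadow):
            continue
        count += 1
        if key.endswith("//"):
            shadow = key[:-1]
    return count

def add_summary_rows(rows, t):
    original, result = dict(rows), dict(rows)
    candidates = {p for key in original for p in prefixes_of(key)}
    # Deepest prefixes first, so every longer prefix is already within t.
    for p in sorted(candidates, key=lambda c: (-c.count("/"), c)):
        if rows_read(sorted(result), p) > t:       # Step 1
            summary = {}
            for key, postings in original.items():
                if key.startswith(p):
                    for instance, count in postings.items():
                        summary[instance] = summary.get(instance, 0) + count
            result[p + "/"] = summary
            result.pop(p, None)                    # Step 2
    return dict(sorted(result.items()))

rows = {
    "dong/name/": {"s1": 1},
    "dong/name/lastName/": {"a1": 1},
    "xin/name/firstName/": {"a1": 1},
}
indexed = add_summary_rows(rows, t=1)
```

For t=1 the sketch adds the summary row `dong/name//`, drops the row `dong/name/` that it replaces, and keeps `dong/name/lastName/` for more specific searches.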
Answering Neighborhood Keyword Queries • Query: Semex → “Semex/*” [Figure: triple base (“Semex: …”, “Luna Dong”, “Alon Halevy”, “Departmental Database”) with author and inverse ~author associations, and its inverted list]
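A neighborhood keyword query can be sketched as a prefix search on the bare keyword, so that rows under attribute labels give the relevant items and rows under association labels give the associated items. The sketch assumes that the inverse association `~author` indexes each author on the words of the paper; the triples and ids are hypothetical.

```python
import re

# Hypothetical triples for the example figure.
attribute_triples = [
    ("p1", "title", "Semex: Personal information management and integration"),
    ("a1", "name", "Xin Dong"),
    ("a2", "name", "Alon Halevy"),
]
association_triples = [("p1", "author", "a1"), ("p1", "author", "a2")]

def build_index(attributes, associations):
    index, words_of = {}, {}

    def bump(key, instance):
        row = index.setdefault(key, {})
        row[instance] = row.get(instance, 0) + 1

    for instance, label, value in attributes:
        for word in re.findall(r"\w+", value.lower()):
            bump(f"{word}/{label}/", instance)
            words_of.setdefault(instance, []).append(word)
    for source, label, target in associations:
        for word in words_of.get(target, []):
            bump(f"{word}/{label}/", source)      # e.g. dong/author/ -> p1
        for word in words_of.get(source, []):
            bump(f"{word}/~{label}/", target)     # e.g. semex/~author/ -> a1
    return index

def neighborhood_query(index, keyword):
    """Instances in any row whose key starts with 'keyword/'."""
    prefix = keyword.lower() + "/"
    hits = set()
    for key, row in index.items():
        if key.startswith(prefix):
            hits.update(row)
    return hits

index = build_index(attribute_triples, association_triples)
semex_neighborhood = neighborhood_query(index, "Semex")
```

Here the paper `p1` is the relevant item and the two authors are returned as associated items through the `~author` rows.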
Implementation Details • Our index extends the Lucene indexing tool • Lucene stores an inverted list as a sorted array • Implemented in Java • Run on a machine with four 3.2GHz CPUs (1024KB cache) and 1GB memory
Experimental Setting • Data sets: a 50MB personal data set; two 10GB XML data sets (Wikipedia, XMark Benchmark) • Queries, each with one predicate or keyword: predicate query with leaf attributes; predicate query with branch attributes; predicate query with associations; neighborhood keyword query • Measures (in milliseconds): index-lookup time; query-answering time
XML Index [Kaushik et al., SIGMOD’05] • Three indexes • Inverted list: index each attribute value on its text • Structured index: index each attribute value on the labels of the attribute and its ancestor attributes • Relationship index: index each instance on its associated instances