
Indexing Dataspaces

Indexing Dataspaces. Presented by Aditya Sakhuja. Authors: Xin (Luna) Dong and Alon Halevy (University of Washington, Google Inc.), SIGMOD 2007. Presenter: Aditya Sakhuja, 1st semester MS – CS, CoC. Ongoing research: automation of schema matching.


Presentation Transcript


  1. Indexing Dataspaces • Presented by Aditya Sakhuja • Xin (Luna) Dong, Alon Halevy • University of Washington, Google Inc. • SIGMOD 2007

  2. Hi, I am • Aditya Sakhuja • 1st semester MS – CS, CoC • Ongoing research: Automation of schema matching • Interested in online-enabled apps (should be path-breaking), IR, databases, security, latest web technologies, skydiving and scuba diving • From India

  3. Outline • Motivation • Overview of our approach • Our algorithm • Indexing structure • Indexing hierarchies • Experimental Results • Conclusions

  4. Many Data Management Applications Need to Manage Heterogeneous Data Sources [Figure: data sources D1 to D5]

  5. Traditional Data Integration Systems • Mediated schema: Publication (title, year, author) • Source schema: Author (aid, name), Paper (pid, title, year), AuthoredBy (aid, pid) • Mapping: SELECT P.title AS title, P.year AS year, A.name AS author FROM Author AS A, Paper AS P, AuthoredBy AS B WHERE A.aid=B.aid AND P.pid=B.pid [Figure: mediated schema over data sources D1 to D5]

  6. Querying on Traditional Data Integration Systems [Figure: mediated schema over data sources D1 to D5, with a query Q and per-source queries Q1 to Q5]

  7. In Many Applications it is Hard to Obtain Precise Semantic Mappings [Figure: data sources D1 to D5 with the mappings marked "?"]

  8. Scenario 1. Different Websites About Movies

  9. Scenario 2. Personal Information Space Intranet Internet

  10. Querying Dataspaces • Dataspaces • Collections of heterogeneous data sources • Don’t necessarily include semantic mappings • Scenarios: personal information, enterprises, government agencies, smart homes, digital libraries, and the Web • How to effectively query and search a dataspace?

  11. Example Dataspace <publication> <title>Semex: Personal information management and integration</title> <author>Xin Dong</author> <author>Alon Halevy</author> <conference>IIWeb Workshop</conference> </publication> <thesis-proposal> <title>Semex: Personal …</title> <student> <name>Xin (Luna) Dong</name> <entryYear>2001</entryYear> </student> </thesis-proposal>

  12. Searching and Querying a Dataspace • Structured query? • Require detailed knowledge on schemas • Require precise attribute values • Keyword search? • Does not allow specifications on structure • We consider queries that are • keyword-based • structure-aware

  13. I. Predicate Query • Conjunction of predicates • Predicate: (v, {K1, …, Kn}) • v - an attribute or association label • {K1, …, Kn} - a keyword set • Example data: <publication> <title>Semex: Personal information management and integration</title> <author>Xin Dong</author> <author>Alon Halevy</author> <conference>IIWeb Workshop</conference> </publication> <thesis-proposal> <title>Semex: Personal …</title> <student> <name>Xin (Luna) Dong</name> <entryYear>2001</entryYear> </student> </thesis-proposal>

  14. I. Predicate Query • Example I: (title, ‘Semex’) (author, ‘Luna Dong’) • Example data: <publication> <title>Semex: Personal information management and integration</title> <author>Xin Dong</author> <author>Alon Halevy</author> <conference>IIWeb Workshop</conference> </publication> <thesis-proposal> <title>Semex: Personal …</title> <student> <name>Xin (Luna) Dong</name> <entryYear>2001</entryYear> </student> </thesis-proposal>

  15. I. Predicate Query • Example II: (name, ‘Dong’) • Example data: <publication> <title>Semex: Personal information management and integration</title> <author>Xin Dong</author> <author>Alon Halevy</author> <conference>IIWeb Workshop</conference> </publication> <thesis-proposal> <title>Semex: Personal …</title> <student> <name>Xin (Luna) Dong</name> <entryYear>2001</entryYear> </student> </thesis-proposal>
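
The predicate-query form on slides 13 to 15 can be sketched in a few lines. This is only an illustration, not the authors' implementation: the `Predicate` class, the `items` dictionary, and the brute-force `answer` function are hypothetical names, keywords are matched case-insensitively, and attribute hierarchies and associations are ignored here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    label: str           # v: an attribute or association label
    keywords: frozenset  # {K1, ..., Kn}: the keyword set


def tokens(text):
    """Split a value into lowercase keywords, dropping surrounding punctuation."""
    return {t.strip("():,.").lower() for t in text.split()} - {""}


def satisfies(item, pred):
    """True if some value under pred.label contains every keyword."""
    return any(pred.keywords <= tokens(v) for v in item.get(pred.label, []))


def answer(items, query):
    """A query is a conjunction: every predicate must hold for the item."""
    return [name for name, item in items.items()
            if all(satisfies(item, p) for p in query)]


# Hypothetical flattening of the example dataspace: label -> list of values.
items = {
    "publication": {
        "title": ["Semex: Personal information management and integration"],
        "author": ["Xin Dong", "Alon Halevy"],
        "conference": ["IIWeb Workshop"],
    },
    "student": {"name": ["Xin (Luna) Dong"], "entryYear": ["2001"]},
}

q1 = [Predicate("title", frozenset({"semex"})),
      Predicate("author", frozenset({"dong"}))]
q2 = [Predicate("name", frozenset({"dong"}))]

print(answer(items, q1))
print(answer(items, q2))
```

A linear scan like this touches every item for every query; the index described in the rest of the talk exists to avoid exactly that.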

  16. II. Neighborhood Keyword Query • Form: {K1, …, Kn} • Example: ‘Semex’ • Relevant items • Associated items • Example data: <publication> <title>Semex: Personal information management and integration</title> <author>Xin Dong</author> <author>Alon Halevy</author> <conference>IIWeb Workshop</conference> </publication> <thesis-proposal> <title>Semex: Personal …</title> <student> <name>Xin (Luna) Dong</name> <entryYear>2001</entryYear> </student> </thesis-proposal>

  17. Indexing of the Heterogeneous Data • Challenges • Index data from heterogeneous data sources • Capture both text values and structural information • Traditional Indexes • Build a separate index for each attribute to support structured queries • Build an inverted list to support keyword search • XML indexes assume tree models and build multiple indexes ([Cooper et al., 01],[Kaushik et al., 05],[Wang et al., 03], etc.) • Our approach: Extend inverted lists to capture both text values and structure of the data

  18. Contributions • Design an index that • indexes data from heterogeneous data sources • captures both structure and text of the data • incorporates various types of heterogeneity, including synonyms and hierarchies of attributes and associations

  19. Outline • Motivation • Overview of our approach • Our algorithm • Indexing structure • Indexing hierarchies • Experimental Results • Conclusions

  20. View Data Sources as Triple Base • Source: <publication> <title>Semex: Toward …</title> <authors> <author><name> Xin Dong</name></author> <author><name> Alon Halevy</name></author> </authors>… </publication> • Legend: Attribute, Association, Object [Figure: triple graph with nodes “Semex: …”, “Luna Dong”, “Alon Halevy” and author edges]

  21. View Data Sources as Triple Base [Figure: triple graph with nodes “Semex: …”, “Luna Dong”, “Alon Halevy” and author edges]

  22. View Data Sources as Triple Base • Goal: Index triples to efficiently answer queries that combine text and structure [Figure: triple graph with nodes “Semex: …”, “Luna Dong”, “Alon Halevy”, “Departmental Database” and author edges]

  23. Indexing a Triple Base Using an Inverted List [Figure: triple base and its inverted list]

  24. Indexing a Triple Base Using an Inverted List • Query: Dong [Figure: triple base and its inverted list]
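
A plain inverted list over a triple base can be sketched as follows. The instance identifiers (`a1`, `p1`, ...), the attribute names, and the choice to store an occurrence count per keyword and instance are assumptions made for illustration; the slide's actual table did not survive in this transcript.

```python
from collections import defaultdict

# Hypothetical attribute triples: (instance, attribute, text value).
ATTRIBUTE_TRIPLES = [
    ("a1", "title", "Semex: Personal information management and integration"),
    ("p1", "firstName", "Luna"),
    ("p1", "lastName", "Dong"),
    ("p2", "firstName", "Alon"),
    ("p2", "lastName", "Halevy"),
    ("d1", "title", "Departmental Database"),
]


def tokenize(text):
    """Lowercase a value and split it into keywords, dropping punctuation."""
    words = (w.strip(".,:;()").lower() for w in text.split())
    return [w for w in words if w]


def build_inverted_list(triples):
    """Map keyword -> {instance: number of occurrences of the keyword}."""
    index = defaultdict(lambda: defaultdict(int))
    for instance, _attribute, value in triples:
        for keyword in tokenize(value):
            index[keyword][instance] += 1
    return {keyword: dict(cells) for keyword, cells in index.items()}


inverted = build_inverted_list(ATTRIBUTE_TRIPLES)
print(inverted.get("dong", {}))  # instances whose values mention "dong"
```

Note that this version discards the attribute name, so it cannot tell a first name from a last name; the following slides address that.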

  25. Outline • Motivation • Overview of our approach • Our algorithm • Indexing structure • Indexing hierarchies • Experimental Results • Conclusions

  26. Incorporate Attribute Labels in the Inverted List • Query: firstName “Dong” [Figure: triple base and its inverted list]

  27. Incorporate Attribute Labels in the Inverted List • Query: firstName “Dong” → “Dong/firstName/” [Figure: triple base and its inverted list]
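
The key format “Dong/firstName/” suggests concatenating each keyword with its attribute label. A minimal sketch under that reading follows; the data, the helper names, and the lowercasing of keywords are assumptions, not details taken from the slides.

```python
from collections import defaultdict

# Hypothetical attribute triples: (instance, attribute, text value).
ATTRIBUTE_TRIPLES = [
    ("p1", "firstName", "Luna"),
    ("p1", "lastName", "Dong"),
    ("p2", "firstName", "Alon"),
    ("p2", "lastName", "Halevy"),
    ("p3", "firstName", "Dong"),
]


def attribute_key(keyword, attribute):
    """Build the indexed key 'keyword/attribute/' (keyword lowercased)."""
    return f"{keyword.lower()}/{attribute}/"


def build_index(triples):
    """Map 'keyword/attribute/' -> {instance: occurrence count}."""
    index = defaultdict(lambda: defaultdict(int))
    for instance, attribute, value in triples:
        for word in value.split():
            index[attribute_key(word, attribute)][instance] += 1
    return {key: dict(cells) for key, cells in index.items()}


index = build_index(ATTRIBUTE_TRIPLES)

# The predicate (firstName, "Dong") becomes a single key lookup.
first_name_hits = index.get(attribute_key("Dong", "firstName"), {})
last_name_hits = index.get(attribute_key("Dong", "lastName"), {})
print(first_name_hits, last_name_hits)
```

The same keyword now lands in different rows depending on the attribute it occurs in, so a structure-aware predicate costs one lookup rather than a lookup plus a filter.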

  28. Incorporate Association Labels in the Inverted List • Query: author “Dong” [Figure: triple base and its inverted list]

  29. Incorporate Association Labels in the Inverted List • Query: author “Dong” → “Dong/author/” [Figure: triple base and its inverted list]
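
Association labels can be folded into the same key space. The sketch below assumes one plausible reading of the slide: if instance `s` is linked to instance `t` by an association, every keyword in `t`'s attribute values is indexed for `s` under 'keyword/association/'. The data and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical instances and their attribute values.
ATTRIBUTES = {
    "a1": {"title": "Semex"},
    "p1": {"firstName": "Luna", "lastName": "Dong"},
    "p2": {"firstName": "Alon", "lastName": "Halevy"},
}

# Hypothetical association triples: (source instance, label, target instance).
ASSOCIATIONS = [("a1", "author", "p1"), ("a1", "author", "p2")]


def build_association_rows(attributes, associations):
    """Index each source instance under the keywords of its associated targets."""
    index = defaultdict(lambda: defaultdict(int))
    for source, label, target in associations:
        for value in attributes.get(target, {}).values():
            for word in value.split():
                index[f"{word.lower()}/{label}/"][source] += 1
    return {key: dict(cells) for key, cells in index.items()}


rows = build_association_rows(ATTRIBUTES, ASSOCIATIONS)

# The predicate (author, "Dong") finds the article written by a "Dong".
author_hits = rows.get("dong/author/", {})
print(author_hits)
```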

  30. Outline • Motivation • Overview of our approach • Our algorithm • Indexing structure • Indexing hierarchies • Experimental Results • Conclusions

  31. Hierarchies of Attributes and Associations • Attribute hierarchy: name, with sub-attributes firstName and lastName • Example II: (name, ‘Dong’) • Example data: <publication> <title>Semex: Toward on-the-fly personal information integration</title> <author>Xin Dong</author> <author>Alon Halevy</author> <conference>IIWeb Workshop</conference> </publication> <thesis-proposal> <title>Semex: Personal …</title> <student> <name>Xin (Luna) Dong</name> <entryYear>2001</entryYear> </student> </thesis-proposal>

  32. Incorporate Attribute Hierarchy in the Inverted List • Query: name “Dong” [Figure: attribute hierarchy (name; firstName, lastName), triple base, and inverted list]

  33. Naïve Approach: Expand Queries with Sub-Attributes • Query: name “Dong” → “Dong/name/ OR Dong/firstName/ OR …” [Figure: attribute hierarchy (name; firstName, lastName), triple base, and inverted list]
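
The naïve approach leaves the index untouched and rewrites the query into a disjunction over the attribute and all of its sub-attributes. A rough sketch, with a hypothetical `HIERARCHY` map and a hand-written index:

```python
# Hypothetical attribute hierarchy: parent -> direct sub-attributes.
HIERARCHY = {"name": ["firstName", "lastName"]}

# Hypothetical index with flat 'keyword/attribute/' keys.
INDEX = {
    "dong/firstName/": {"p3": 1},
    "dong/lastName/": {"p1": 1},
    "dong/title/": {"a9": 1},
}


def descendants(attribute, hierarchy):
    """Return the attribute followed by all of its sub-attributes, depth first."""
    result = [attribute]
    for child in hierarchy.get(attribute, []):
        result.extend(descendants(child, hierarchy))
    return result


def expand_query(keyword, attribute, hierarchy):
    """Turn one predicate into one key per attribute in the subtree."""
    return [f"{keyword.lower()}/{a}/" for a in descendants(attribute, hierarchy)]


def answer(keyword, attribute, hierarchy, index):
    """OR the expanded keys together, summing occurrence counts."""
    merged = {}
    for key in expand_query(keyword, attribute, hierarchy):
        for instance, count in index.get(key, {}).items():
            merged[instance] = merged.get(instance, 0) + count
    return merged


print(expand_query("Dong", "name", HIERARCHY))
print(answer("Dong", "name", HIERARCHY, INDEX))
```

The number of lookups grows with the size of the attribute subtree, which is the cost the next approaches try to remove.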

  34. Approach I: Duplicate Entries for Parent Attributes • Query: name “Dong” → “Dong/name/” [Figure: attribute hierarchy (name; firstName, lastName), triple base, and inverted list]

  35. Approach I: Duplicate Entries for Parent Attributes • Query: name “Dong” → “Dong/name/” [Figure: attribute hierarchy (name; firstName, lastName), triple base, and inverted list]
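
Approach I moves the work to indexing time: each value is indexed under its own attribute and again under every ancestor attribute, so the query stays a single lookup. The sketch below assumes a hypothetical child-to-parent map `PARENT` and made-up instances.

```python
from collections import defaultdict

# Hypothetical attribute hierarchy: child -> parent.
PARENT = {"firstName": "name", "lastName": "name"}

# Hypothetical attribute triples: (instance, attribute, text value).
TRIPLES = [
    ("p1", "firstName", "Luna"),
    ("p1", "lastName", "Dong"),
    ("p3", "firstName", "Dong"),
]


def with_ancestors(attribute, parent):
    """Return the attribute followed by its ancestors up to the root."""
    chain = [attribute]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain


def build_index(triples, parent):
    """Index every keyword once per attribute on the path to the root."""
    index = defaultdict(lambda: defaultdict(int))
    for instance, attribute, value in triples:
        for word in value.split():
            for label in with_ancestors(attribute, parent):
                index[f"{word.lower()}/{label}/"][instance] += 1
    return {key: dict(cells) for key, cells in index.items()}


index = build_index(TRIPLES, PARENT)
print(index["dong/name/"])      # one lookup answers (name, "Dong")
print(index["dong/lastName/"])  # leaf-level lookups still work
```

The trade-off is index size: every entry is repeated once per level of the hierarchy above it.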

  36. Approach II. Concatenate a Keyword with a Hierarchy Path • Query: name “Dong” → “Dong/name/*” [Figure: attribute hierarchy (name; firstName, lastName), triple base, and inverted list]
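
Approach II can be sketched as a prefix search over sorted keys. Each keyword is stored once, under the full path from the root attribute to its own attribute, and the query “Dong/name/*” becomes a range scan. The data and helper names are hypothetical, and `bisect` over a sorted Python list merely stands in for whatever sorted structure a real index uses.

```python
import bisect

# Hypothetical attribute hierarchy: child -> parent.
PARENT = {"firstName": "name", "lastName": "name"}

# Hypothetical attribute triples: (instance, attribute, text value).
TRIPLES = [
    ("p1", "lastName", "Dong"),
    ("p3", "firstName", "Dong"),
    ("a1", "title", "Semex"),
]


def hierarchy_path(attribute, parent):
    """Return the root-to-attribute path, e.g. 'name/lastName'."""
    chain = [attribute]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return "/".join(reversed(chain))


def build_index(triples, parent):
    """Map 'keyword/hierarchy path/' -> {instance: occurrence count}."""
    index = {}
    for instance, attribute, value in triples:
        for word in value.split():
            key = f"{word.lower()}/{hierarchy_path(attribute, parent)}/"
            cells = index.setdefault(key, {})
            cells[instance] = cells.get(instance, 0) + 1
    return index


def prefix_search(index, prefix):
    """Merge all rows whose key starts with the prefix."""
    keys = sorted(index)  # a real index would keep its keys sorted already
    merged = {}
    for key in keys[bisect.bisect_left(keys, prefix):]:
        if not key.startswith(prefix):
            break  # keys sharing a prefix are contiguous in sorted order
        for instance, count in index[key].items():
            merged[instance] = merged.get(instance, 0) + count
    return merged


index = build_index(TRIPLES, PARENT)
print(sorted(index))
print(prefix_search(index, "dong/name/"))
```

No entry is duplicated, but a prefix search may have to read many rows when the subtree is large.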

  37. Approach III. Hierarchy Path + Summary Rows • Query: name “Dong” → “Dong/name/*” [Figure: attribute hierarchy (name; firstName, lastName), triple base, and inverted list]

  38. Summary Rows • Goal: Given a threshold t, answer any prefix search by reading no more than t rows. • Definition: • The indexed keyword: p// (e.g., “Dong/name//”) • Rows starting with p/ are shadowed by the summary row p// (e.g., “Dong/name/lastName/” is shadowed by “Dong/name//”)

  39. Answering Prefix Search with Summary Rows • Once a summary row has been read, ignore the rows shadowed by it • Example (t=1) • Query: name “Dong” → “Dong/name/*” [Figure: inverted list]

  40. Answering Prefix Search with Summary Rows • Once a summary row has been read, ignore the rows shadowed by it • Example (t=1) • Query: name “Xin” → “Xin/name/*” [Figure: inverted list]
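
The lookup rule with summary rows might look like the sketch below. The index contents are hand-written and hypothetical, and the assumption that a summary row stores the merged counts of the rows it shadows is ours; the slide's table is not available in this transcript.

```python
# Hypothetical index; the key ending in '//' is a summary row.
INDEX = {
    "dong/name//": {"p1": 1, "p3": 1},  # summarizes every row under dong/name/
    "dong/name/firstName/": {"p3": 1},
    "dong/name/lastName/": {"p1": 1},
    "xin/name/firstName/": {"p1": 1},
}


def rows_to_read(index, prefix):
    """Return the keys a prefix search must read, skipping shadowed rows."""
    # A linear scan keeps the sketch short; a sorted index would seek instead.
    matches = sorted(k for k in index if k.startswith(prefix))
    summaries = [k for k in matches if k.endswith("//")]

    def shadowed(key):
        # Summary row p// shadows every other row that starts with p/.
        return any(key != s and key.startswith(s[:-1]) for s in summaries)

    return [k for k in matches if not shadowed(k)]


def prefix_search(index, prefix):
    """Merge the cells of the rows that actually have to be read."""
    merged = {}
    for key in rows_to_read(index, prefix):
        for instance, count in index[key].items():
            merged[instance] = merged.get(instance, 0) + count
    return merged


print(rows_to_read(INDEX, "dong/name/"))   # the summary row alone suffices
print(prefix_search(INDEX, "dong/name/"))
print(rows_to_read(INDEX, "xin/name/"))    # no summary row, one ordinary row
```

With t=1, both example queries from the slides are answered by reading a single row.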

  41. Adding Summary Rows • Step 1. Create a summary row for a prefix p if • Searching prefix p needs to read more than t rows • There is no p’ with p as prefix such that searching prefix p’ needs to read more than t rows • Step 2. Remove row p if summary row p/ exists • Example (t=1) Inverted List

  42. Adding Summary Rows • Step 1. Create a summary row for a prefix p if • Searching prefix p needs to read more than t rows • There is no p’ with p as prefix such that searching prefix p’ needs to read more than t rows • Step 2. Remove row p if summary row p/ exists • Example (t=1) Inverted List

  43. Adding Summary Rows • Step 1. Create a summary row for a prefix p if • Searching prefix p needs to read more than t rows • There is no p’ with p as prefix such that searching prefix p’ needs to read more than t rows • Step 2. Remove row p if summary row p/ exists • Example (t=1) Inverted List

  44. Adding Summary Rows • Step 1. Create a summary row for a prefix p if • Searching prefix p needs to read more than t rows • There is no p’ with p as prefix such that searching prefix p’ needs to read more than t rows • Step 2. Remove row p if summary row p/ exists • Example (t=1) Inverted List
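
The two steps above can be sketched as a bottom-up pass over the key prefixes. This is one interpretation rather than the authors' algorithm: it assumes prefixes are processed longest first, that a prefix's cost is counted after summary rows for longer prefixes are already in place, and that a summary row holds the merged counts of the rows it replaces. All names and data are hypothetical.

```python
def rows_to_read(index, prefix):
    """Keys a prefix search reads once shadowed rows are skipped."""
    matches = sorted(k for k in index if k.startswith(prefix))
    summaries = [k for k in matches if k.endswith("//")]
    return [k for k in matches
            if not any(k != s and k.startswith(s[:-1]) for s in summaries)]


def add_summary_rows(index, t):
    """Return a copy of the index in which prefix searches read at most t rows."""
    index = {key: dict(cells) for key, cells in index.items()}
    prefixes = set()
    for key in index:
        parts = key.rstrip("/").split("/")
        for i in range(1, len(parts) + 1):
            prefixes.add("/".join(parts[:i]) + "/")
    # Step 1: longest prefixes first, so deeper summary rows exist before
    # a shorter prefix is measured.
    for p in sorted(prefixes, key=lambda s: s.count("/"), reverse=True):
        rows = rows_to_read(index, p)
        if len(rows) > t:
            merged = {}
            for key in rows:
                for instance, count in index[key].items():
                    merged[instance] = merged.get(instance, 0) + count
            index[p + "/"] = merged  # summary row, e.g. 'dong/name//'
    # Step 2: drop an ordinary row whose key equals a summarized prefix.
    for summary in [k for k in index if k.endswith("//")]:
        index.pop(summary[:-1], None)
    return index


# Hypothetical index with hierarchy-path keys.
INDEX = {
    "dong/name/": {"p5": 1},
    "dong/name/firstName/": {"p3": 1},
    "dong/name/lastName/": {"p1": 1},
    "dong/title/": {"a9": 1},
    "xin/name/firstName/": {"p1": 1},
}

result = add_summary_rows(INDEX, t=1)
print(sorted(result))
print(rows_to_read(result, "dong/"))
```

With t=1 this produces summary rows for “dong/name/” and “dong/”, while the “xin” keys need none because their prefix searches already read a single row.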

  45. Answering Neighborhood Keyword Queries • Query: Semex → “Semex/*” [Figure: triple base with author and ~author edges, and its inverted list]
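
A neighborhood keyword query can reuse the same keys: searching the prefix “Semex/” returns both the items that contain the keyword and the items associated with them. The sketch assumes that every association is also indexed in the reverse direction under a `~` label, as the `~author` edges in the figure suggest; the data and function names are hypothetical.

```python
# Hypothetical instances and their attribute values.
ATTRIBUTES = {
    "a1": {"title": "Semex"},
    "p1": {"name": "Luna Dong"},
    "p2": {"name": "Alon Halevy"},
}

# Hypothetical association triples: (source instance, label, target instance).
ASSOCIATIONS = [("a1", "author", "p1"), ("a1", "author", "p2")]


def build_index(attributes, associations):
    """Index attribute keywords plus association keywords in both directions."""
    index = {}

    def add(key, instance):
        cells = index.setdefault(key, {})
        cells[instance] = cells.get(instance, 0) + 1

    for instance, attrs in attributes.items():
        for attribute, value in attrs.items():
            for word in value.split():
                add(f"{word.lower()}/{attribute}/", instance)
    # Add the reverse edge (target, ~label, source) for every association.
    edges = associations + [(t, "~" + label, s) for s, label, t in associations]
    for source, label, target in edges:
        for value in attributes.get(target, {}).values():
            for word in value.split():
                add(f"{word.lower()}/{label}/", source)
    return index


def neighborhood_query(index, keyword):
    """Return relevant and associated instances via the prefix 'keyword/'."""
    prefix = keyword.lower() + "/"
    hits = set()
    for key, cells in index.items():
        if key.startswith(prefix):
            hits.update(cells)
    return hits


index = build_index(ATTRIBUTES, ASSOCIATIONS)
print(sorted(neighborhood_query(index, "Semex")))
```

Here the article is the relevant item (row “semex/title/”) and its two authors are the associated items (row “semex/~author/”).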

  46. Outline • Motivation • Overview of our approach • Our algorithm • Indexing structure • Indexing hierarchies • Experimental Results • Conclusions

  47. Implementation Details • Our index extends the Lucene Indexing Tool • Lucene stores an inverted list as a sorted array • Implemented in Java • Run on a machine with four 3.2GHz and 1024KB-cache CPUs, and 1GB memory

  48. Experimental Setting • Data sets • A 50MB personal data set • Two 10GB XML data sets: Wikipedia, XMark Benchmark • Queries: with one predicate or keyword • Predicate Query with leaf attributes • Predicate Query with branch attributes • Predicate Query with associations • Neighborhood Keyword Query • Measures (in milliseconds) • Index-lookup time • Query-answering time

  49. Our Indexing Method Significantly Improves Query Answering

  50. XML Index [Kaushik et al, Sigmod’05] • Three indexes • Inverted list: index each attribute value on its text • Structured index: index each attribute value on the labels of the attribute and its ancestor attributes • Relationship index: index each instance on its associated instances
