1 / 111

Oblivious Querying of Data with Irregular Structure

Oblivious Querying of Data with Irregular Structure. Based on Several Works. Queries with Incomplete Answers Yaron Kanza, Werner Nutt, Shuky Sagiv Flexible Queries Yaron Kanza, Shuky Sagiv SQL4X Sara Cohen, Yaron Kanza, Shuky Sagiv Computing Full Disjunctions Yaron Kanza, Shuky Sagiv.

ollie
Download Presentation

Oblivious Querying of Data with Irregular Structure

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Oblivious Querying of Data with Irregular Structure

  2. Based on Several Works • Queries with Incomplete Answers • Yaron Kanza, Werner Nutt, Shuky Sagiv • Flexible Queries • Yaron Kanza, Shuky Sagiv • SQL4X • Sara Cohen, Yaron Kanza, Shuky Sagiv • Computing Full Disjunctions • Yaron Kanza, Shuky Sagiv

  3. Agenda Why is it difficult to query semistructured data? Queries with incomplete answers (QwIA) Flexible queries (FQ) Oblivious querying = QwIA + FQ Using QwIA and FQ for information integration

  4. Agenda Why is it difficult to query semistructured data? Queries with incomplete answers (QwIA) Flexible queries (FQ) Oblivious querying = QwIA + FQ Using QwIA and FQ for information integration

  5. The Semistructured Data Model • Data is described as a rooted labeled directed graph • Nodes represent objects • Edges represent relationships between objects • Atomic values are attached to atomic nodes

  6. Movie Database 1 Movie Actor Movie Film 11 12 14 13 Actor Movie Title Actor Title Name T.V. Series Actor Year Title 29 21 22 23 24 25 28 26 27 Léon Star Wars Title Year 1977 Kyle MacLachlan Name Name Title Name 30 31 32 35 36 34 Natalie Portman 33 Harrison Ford Mark Hamill Dune Twin Peaks 1984 Magnolia A Movie Database Example

  7. <?xml version=“1.0”?> <MDB> <Movie> <Title>Star Wars</Title> <Year>1977</Year> <Actor> <Name>Mark Hamill</Name> </Actor> <Actor> <Name>Harrison Ford</Name> </Actor> </Movie> … </MDB> XML that Encodes the Semistructured Data

  8. What Should be the form of the Query? Movie Database 1 Movie Actor Movie Film 11 12 14 13 Actor Movie Title Actor Title Name T.V. Series Actor Year Title 29 21 22 23 24 25 28 26 27 Léon Star Wars Title Year 1977 Kyle MacLachlan Name Name Title Name 30 31 32 35 36 34 Natalie Portman 33 Harrison Ford Mark Hamill Dune Twin Peaks 1984 Magnolia Consider a Query that Requests Movies, Actors that Acted in the Movies and the Movies’ Year of Release

  9. The year of the movie is missing The movie has a year attribute Movie Database 1 Movie Actor Movie Film 11 12 14 13 Actor Movie Title Actor Title Name T.V. Series Actor Year Title 29 21 22 23 24 25 28 26 27 Léon Star Wars Title Year 1977 Kyle MacLachlan Name Name Title Name 30 31 32 35 36 34 Natalie Portman 33 Harrison Ford Mark Hamill Dune Twin Peaks 1984 Magnolia Incomplete Data

  10. 11 14 29 21 Movie Database Actor below movie Movie below actor 1 Movie Actor Movie Film 11 12 14 13 Actor Movie Title Actor Title Name T.V. Series Actor Year Title 29 21 22 23 24 25 28 26 27 Léon Star Wars Title Year 1977 Kyle MacLachlan Name Name Title Name 30 31 32 35 36 34 Natalie Portman 33 Harrison Ford Mark Hamill Dune Twin Peaks 1984 Magnolia Variations in Structure

  11. A movie label A film label Movie Database 1 Actor Movie Movie Film 11 12 13 13 Actor Movie Title Actor Title Name T.V. Series Actor Year Title 29 21 22 23 24 25 28 26 27 Léon Star Wars Title Year 1977 Kyle MacLachlan Name Name Title Name 30 31 32 34 35 33 Natalie Portman 34 Harrison Ford Mark Hamill Dune Twin Peaks 1984 Magnolia Dealing with ontology variations is beyond the scope of this talk Ontology Variations

  12. Irregular Data • Data is incomplete • Missing values of attributes in objects • Data has structural variations • Relationships between objects are represented differently in different parts of the database • Data has ontology variations • Different labels are used to describe objects of the same type

  13. Irregular data does not conform to a strict schema Queries over irregular data should not be rigid patterns The schema cannot guide a user in formulating a query

  14. Data is contributed by many users in a variety of designs The query should deal with different structures of data The structure of the database is changed frequently Queries should be rewritten frequently The description of the schema is large (e.g., a DTD of XML) It is difficult to use the schema when formulating queries In Which Cases is it Difficult to Formulate Queries over Semistructured Data?

  15. Can Regular Expressions Help in Querying Irregular Data? • In many cases, regular expressions can be used to query irregular data • Yet, regular expressions are • Not efficient – it is difficult to evaluate regular expressions • Not intuitive – it is difficult for a naïve user to formulate regular expressions

  16. More on UsingRegular Expressions • When querying irregular data, the size of the regular expression could be exponential in the number of labels in the database • For n types of objects, there are n! possible hierarchies • For an object with n attributes, there are 2n subsets of missing attributes

  17. Agenda Why is it difficult to query semistructured data? Queries with incomplete answers (QwIA) Flexible queries (FQ) Oblivious querying = QwIA + FQ Using QwIA and FQ for information integration

  18. Queries with Incomplete Answers • We have developed queries that deal with incomplete data in a novel way and return incomplete answers • The queries return maximal answers rather than complete answers • Different query semantics admit different levels of incompleteness

  19. Queries with Incomplete Answers Queries with complete answers Increasing level of incompleteness Queries with AND Semantics Queries with Weak Semantics Queries with OR Semantics

  20. Queries and Matchings • The queries are labeled rooted directed graphs • Query nodes are variables • Matchings are assignments of database objects to the query variables according to • the constraints specified in the query, and • the semantics of the query

  21. Constraints On Complete Matchings • Root Constraint: • Satisfied if the query root is mapped to the db root • Edge Constraint: • Satisfied if a query edge with label l is mapped to a database edge with label l Query Root r 1 Database Root x 12 l l y 25

  22. 1 11 12 27 32 35 Movie Database A Complete Matching 1 r Movie Movie Movie Movie Producer Producer 11 12 x y Uncredited Actor Actor Director Director Director Title Actor Uncredited Actor Actor Title Year z 29 21 22 23 24 27 25 26 Date of birth Star Wars Name Name 1977 Name Hook Name Name Name u v Date of birth 34 30 32 31 33 George Lucas Dustin Hoffman Steven Spielberg Harrison Ford Mark Hamill 35 14 May 1944 All the nodes are mapped to non-null values The root constraint and all the edge constraints are satisfied

  23. Date of birth 35 14 May 1944 Movie Database 1 r Movie Movie Movie Movie Producer Producer 11 12 x y Uncredited Actor Actor Director Director Director Title Actor Uncredited Actor Actor Title Year z 29 21 22 23 24 27 25 26 Date of birth Star Wars Name Name 1977 Name Hook Name Name Name u v Date of birth 34 30 32 31 33 George Lucas Dustin Hoffman Steven Spielberg Harrison Ford Mark Hamill 35 14 May 1944 Consider the case where Node 35 is removed from the database No Complete Matching Exists!

  24. 1 Movie Database 1 r Movie Movie Movie Movie Producer NULL NULL Producer 11 12 x x y y Uncredited Actor Actor Director Director Director Title Actor Uncredited Actor Actor Title Year z z 29 NULL 21 22 23 24 27 25 26 Date of birth Star Wars Name Name 1977 Name Hook Name Name Name u u v 34 30 32 31 31 33 George Lucas Dustin Hoffman Steven Spielberg Harrison Ford NULL Mark Hamill This is not a matching, since the sequence of labels from the database root to Node 31 is different from any sequence of labels that starts at the query root and ends in variable v Not Every Partial Assignment is an Incomplete Matching

  25. Database 1 l2 1 1 r l2 r l1 7 5 x l4 w l3 l4 l3 9 8 z l5 l6 y l6 l5 55 v v v 55 r l2 l1 w x l3 l4 y z l5 l6 v Query 55 The Reachability Constrainton Partial Matchings • A query node v that is mapped to a database object o satisfies the reachability constraint if there is a path from the query root to v, such that all edge constraints along this path are satisfied

  26. r x y Producer Actor Director z “And” Matchings • A partial matching is an AND matching if • The root constraint is satisfied • The reachability constraint is satisfied by every query node that is mapped to a database node • If a query node is mapped to a database node, all the incoming edge constraints are satisfied

  27. 1 11 12 27 u 32 NULL Movie Database 1 r Movie Movie Movie Movie Producer Producer 11 12 x y Uncredited Actor Actor Director Director Director Title Actor Uncredited Actor Actor Title Year z 29 21 22 23 24 27 25 26 Date of birth Star Wars Name Name 1977 Name Hook Name Name Name u v 34 30 32 31 33 George Lucas Dustin Hoffman Steven Spielberg Harrison Ford Mark Hamill An AND Matching

  28. Uncredited Actor In an AND matching, Node z must be null! Movie Database 1 r Movie Movie Movie Movie Producer Producer 11 12 x y Uncredited Actor Actor Director Director Director Title Actor Uncredited Actor Actor Title Year z 29 21 22 23 24 27 25 26 Date of birth Star Wars Name Name 1977 Name Hook Name Name Name u v 34 30 32 31 33 George Lucas Dustin Hoffman Steven Spielberg Harrison Ford Mark Hamill Suppose that we remove the edges that are labeled with Uncredited Actor

  29. null null x 12 x 12 x 12 x l l l m l m l y 25 y 25 y 25 y null null Weak Satisfaction of Edge Constraints • Edge Constraint: • Is Weakly Satisfied if it is either • Satisfied (as defined earlier), or • One (or more) of its nodes is mapped to a null value

  30. Weak Matchings • A partial matching is a weak matching if • The root constraint is satisfied • The reachability constraint is satisfied by every query node that is mapped to a database node • Every edge constraint is weakly satisfied

  31. 1 11 y NULL 27 u 32 NULL Edges that are weakly satisfied Movie Database 1 r Movie Movie Movie Movie Producer Producer 11 12 x y Actor Director Director Director Title Actor Uncredited Actor Actor Title Year z 29 21 22 23 24 27 25 26 Date of birth Star Wars Name Name 1977 Name Hook Name Name Name u v 34 30 32 31 33 George Lucas Dustin Hoffman Steven Spielberg Harrison Ford Mark Hamill A Weak Matching

  32. In an AND matching, only the first three options are permitted null null x 12 x 12 x 12 x l l l m l m l y 25 y 25 y 25 y null null In a weak matching, all four options are permitted

  33. Producer In a weak matching, Node z must be null! Movie Database 1 r Movie Movie Movie Movie Producer Producer 11 12 x y Actor Director Director Director Title Actor Uncredited Actor Actor Title Year z 29 21 22 23 24 27 25 26 Date of birth Star Wars Name Name 1977 Name Hook Name Name Name u v 34 30 32 31 33 George Lucas Dustin Hoffman Steven Spielberg Harrison Ford Mark Hamill Consider the case where edges labeled with Producer are removed

  34. “OR” Matchings • A partial matching is an OR matching if • The root constraint is satisfied • The reachability constraint is satisfied by every query node that is mapped to a database node

  35. 1 11 y NULL 27 u 32 NULL An edge which is not weakly satisfied Movie Database 1 r Movie Movie Movie Movie Producer 11 12 x y Actor Director Director Director Title Actor Uncredited Actor Actor Title Year z 29 21 22 23 24 27 25 26 Date of birth Star Wars Name Name 1977 Name Hook Name Name Name u v 34 30 32 31 33 George Lucas Dustin Hoffman Steven Spielberg Harrison Ford Mark Hamill An OR Matching

  36. Increasing Level of Incompleteness • A complete matching is an AND matching • An AND matching is a weak matching • A weak matching is an OR matching

  37. Maximal Matchings • A tuple t1subsumes a tuple t2 if t1 is the result of replacing some null values in t2 by non-null values: • A matching is maximal if no other matching subsumes it • A query result consists of maximal matchings only Matchings are represented as tuples of oid’s and null values t1=(1, 5, 2, null) t2=(1, null, 2, null)

  38. On the Complexity of Computing Queries with Incomplete Answers • The size of the result can be exponential in the size of the input (database and query) • Note that the same is true when joining relations – the size of the result can be exponential in the size of the input (database and query) • Instead of using data complexity (where the runtime depends only on the size of the database), we use input-output complexity

  39. Input-Output Complexity In input-output complexity, the time complexity is a function of the size of the query, the size of the database, and the size of the result.

  40. The Motivation for Using I/O Complexity • Measuring the time complexity with respect to the size of the input does not separate between the following two cases: • An algorithm that does an exponential amount of work simply because the size of the output is exponential in the size of the input • An algorithm that does an exponential amount of work even when the query result is small • Either the algorithm is naïve (e.g., it unnecessarily computes subsumed matchings) or the problem is hard

  41. Recent Results (PODS’03) I/O Complexity of Query Evaluation(lower bounds are for non-emptiness)

  42. Filter Constraints • Constraints that filter the results (i.e., the maximal matchings) • There are • Weak filter constraints(the constraint is satisfied if a variable in the constraint is null) • Strong filter constraints(all variables must be non-null for satisfaction) • Existence constraint:!xis true if xis not null

  43. I/O Complexity of Query Evaluationwith Existence Constraints(lower bounds are for non-emptiness)

  44. I/O Complexity of Query Evaluationwith Weak Equality/Inequality Constraints(lower bounds are for non-emptiness)

  45. Query Containment • Query containments for queries with incomplete answers is defined differently from query containment for queries with complete answers • Q1 Q2 if for all database D, every matching of Q1 w.r.t. to D is subsumed by a matchings of Q2 w.r.t. to D • Query containment (query equivalence) is useful for the development of optimization techniques

  46. Q1 Q2 r r l1 l1 l2 l2 x p l2 y z q l3 l4 l3 l4 u v u v Containment in AND Semantics • Homomorphism between the query graphs is necessary and sufficient for containment Q1 Q2 homomorphism • Deciding whether one query is contained in another is NP-Complete

  47. Containment in OR Semantics • The following is a necessary and sufficient condition for query containment in OR semantics • For every spanning tree T1 of the contained query, there a spanning tree T2 of the containing query, such that there is a homomorphism from T2 to T1 • is in ΠP2 • NP-Complete if the containee is a tree • polynomial if the container is a tree

  48. Containment in Weak Semantics • Similar to containment in OR Semantics, with the following difference • Instead of checking homomorphism between spanning trees, we check homomorphism between graph fragments • A graph fragment is a restriction of the query to a subset of the variables that includes the query root such that every node in the fragment is reachable from the root

  49. Agenda Why is it difficult to query semistructured data? Queries with incomplete answers (QwIA) Flexible queries (FQ) Oblivious querying = QwIA + FQ Using QwIA and FQ for information integration

  50. Flexible Queries • To deal with structural variations in the data, we have developed flexible queries

More Related