1.11k likes | 1.22k Views
Yaron Kanza The Rachel and Selim Benin School of Engineering and Computer Science The Hebrew University of Jerusalem. Oblivious Querying of Data with Irregular Structure. Joint Work With. Queries with Incomplete Answers Werner Nutt, Shuky Sagiv Flexible Queries Shuky Sagiv SQL4X
E N D
Yaron KanzaThe Rachel and Selim BeninSchool of Engineering and Computer ScienceThe Hebrew University of Jerusalem Oblivious Querying of Data with Irregular Structure
Joint Work With • Queries with Incomplete Answers • Werner Nutt, Shuky Sagiv • Flexible Queries • Shuky Sagiv • SQL4X • Sara Cohen, Shuky Sagiv • Computing Full Disjunctions • Shuky Sagiv
Agenda Why is it difficult to query semistructured data? Queries with incomplete answers (QwIA) Flexible queries (FQ) Oblivious querying = QwIA + FQ Using QwIA and FQ for information integration
Agenda Why is it difficult to query semistructured data? Queries with incomplete answers (QwIA) Flexible queries (FQ) Oblivious querying = QwIA + FQ Using QwIA and FQ for information integration
The Semistructured Data Model • Data is described as a rooted labeled directed graph • Nodes represent objects • Edges represent relationships between objects • Atomic values are attached to atomic nodes
Movie Database 1 Movie Actor Movie Film 11 12 14 13 Actor Movie Title Actor Title Name T.V. Series Actor Year Title 29 21 22 23 24 25 28 26 27 Léon Star Wars Title Year 1977 Kyle MacLachlan Name Name Title Name 30 31 32 35 36 34 Natalie Portman 33 Harrison Ford Mark Hamill Dune Twin Peaks 1984 Magnolia A Movie Database Example
<?xml version=“1.0”?> <MDB> <Movie> <Title>Star Wars</Title> <Year>1977</Year> <Actor> <Name>Mark Hamill</Name> </Actor> <Actor> <Name>Harrison Ford</Name> </Actor> </Movie> … </MDB> XML that Encodes the Semistructured Data
What Should be the form of the Query? Movie Database 1 Movie Actor Movie Film 11 12 14 13 Actor Movie Title Actor Title Name T.V. Series Actor Year Title 29 21 22 23 24 25 28 26 27 Léon Star Wars Title Year 1977 Kyle MacLachlan Name Name Title Name 30 31 32 35 36 34 Natalie Portman 33 Harrison Ford Mark Hamill Dune Twin Peaks 1984 Magnolia Consider a Query that Requests Movies, Actors that Acted in the Movies and the Movies’ Year of Release
The year of the movie is missing The movie has a year attribute Movie Database 1 Movie Actor Movie Film 11 12 14 13 Actor Movie Title Actor Title Name T.V. Series Actor Year Title 29 21 22 23 24 25 28 26 27 Léon Star Wars Title Year 1977 Kyle MacLachlan Name Name Title Name 30 31 32 35 36 34 Natalie Portman 33 Harrison Ford Mark Hamill Dune Twin Peaks 1984 Magnolia Incomplete Data
11 14 29 21 Movie Database Actor below movie Movie below actor 1 Movie Actor Movie Film 11 12 14 13 Actor Movie Title Actor Title Name T.V. Series Actor Year Title 29 21 22 23 24 25 28 26 27 Léon Star Wars Title Year 1977 Kyle MacLachlan Name Name Title Name 30 31 32 35 36 34 Natalie Portman 33 Harrison Ford Mark Hamill Dune Twin Peaks 1984 Magnolia Variations in Structure
A movie label A film label Movie Database 1 Actor Movie Movie Film 11 12 13 13 Actor Movie Title Actor Title Name T.V. Series Actor Year Title 29 21 22 23 24 25 28 26 27 Léon Star Wars Title Year 1977 Kyle MacLachlan Name Name Title Name 30 31 32 34 35 33 Natalie Portman 34 Harrison Ford Mark Hamill Dune Twin Peaks 1984 Magnolia Dealing with ontology variations is beyond the scope of this talk Ontology Variations
Irregular Data • Data is incomplete • Missing values of attributes in objects • Data has structural variations • Relationships between objects are represented differently in different parts of the database • Data has ontology variations • Different labels are used to describe objects of the same type
Irregular data does not conform to a strict schema Queries over irregular data should not be rigid patterns The schema cannot guide a user in formulating a query
Data is contributed by many users in a variety of designs The query should deal with different structures of data The structure of the database is changed frequently Queries should be rewritten frequently The description of the schema is large (e.g., a DTD of XML) It is difficult to use the schema when formulating queries In Which Cases is it Difficult to Formulate Queries over Semistructured Data?
Can Regular Expressions Help in Querying Irregular Data? • In many cases, regular expressions can be used to query irregular data • Yet, regular expressions are • Not efficient – it is difficult to evaluate regular expressions • Not intuitive – it is difficult for a naïve user to formulate regular expressions
More on UsingRegular Expressions • When querying irregular data, the size of the regular expression could be exponential in the number of labels in the database • For n types of objects, there are n! possible hierarchies • For an object with n attributes, there are 2n subsets of missing attributes
Agenda Why is it difficult to query semistructured data? Queries with incomplete answers (QwIA) Flexible queries (FQ) Oblivious querying = QwIA + FQ Using QwIA and FQ for information integration
Queries with Incomplete Answers • We have developed queries that deal with incomplete data in a novel way and return incomplete answers • The queries return maximal answers rather than complete answers • Different query semantics admit different levels of incompleteness
Queries with Incomplete Answers Queries with complete answers Increasing level of incompleteness Queries with AND Semantics Queries with Weak Semantics Queries with OR Semantics
Queries and Matchings • The queries are labeled rooted directed graphs • Query nodes are variables • Matchings are assignments of database objects to the query variables according to • the constraints specified in the query, and • the semantics of the query
Constraints On Complete Matchings • Root Constraint: • Satisfied if the query root is mapped to the db root • Edge Constraint: • Satisfied if a query edge with label l is mapped to a database edge with label l Query Root r 1 Database Root x 12 l l y 25
1 11 12 27 32 35 Movie Database A Complete Matching 1 r Movie Movie Movie Movie Producer Producer 11 12 x y Uncredited Actor Actor Director Director Director Title Actor Uncredited Actor Actor Title Year z 29 21 22 23 24 27 25 26 Date of birth Star Wars Name Name 1977 Name Hook Name Name Name u v Date of birth 34 30 32 31 33 George Lucas Dustin Hoffman Steven Spielberg Harrison Ford Mark Hamill 35 14 May 1944 All the nodes are mapped to non-null values The root constraint and all the edge constraints are satisfied
Date of birth 35 14 May 1944 Movie Database 1 r Movie Movie Movie Movie Producer Producer 11 12 x y Uncredited Actor Actor Director Director Director Title Actor Uncredited Actor Actor Title Year z 29 21 22 23 24 27 25 26 Date of birth Star Wars Name Name 1977 Name Hook Name Name Name u v Date of birth 34 30 32 31 33 George Lucas Dustin Hoffman Steven Spielberg Harrison Ford Mark Hamill 35 14 May 1944 Consider the case where Node 35 is removed from the database No Complete Matching Exists!
1 Movie Database 1 r Movie Movie Movie Movie Producer NULL NULL Producer 11 12 x x y y Uncredited Actor Actor Director Director Director Title Actor Uncredited Actor Actor Title Year z z 29 NULL 21 22 23 24 27 25 26 Date of birth Star Wars Name Name 1977 Name Hook Name Name Name u u v 34 30 32 31 31 33 George Lucas Dustin Hoffman Steven Spielberg Harrison Ford NULL Mark Hamill This is not a matching, since the sequence of labels from the database root to Node 31 is different from any sequence of labels that starts at the query root and ends in variable v Not Every Partial Assignment is an Incomplete Matching
Database 1 l2 1 1 r l2 r l1 7 5 x l4 w l3 l4 l3 9 8 z l5 l6 y l6 l5 55 v v v 55 r l2 l1 w x l3 l4 y z l5 l6 v Query 55 The Reachability Constrainton Partial Matchings • A query node v that is mapped to a database object o satisfies the reachability constraint if there is a path from the query root to v, such that all edge constraints along this path are satisfied
r x y Producer Actor Director z “And” Matchings • A partial matching is an AND matching if • The root constraint is satisfied • The reachability constraint is satisfied by every query node that is mapped to a database node • If a query node is mapped to a database node, all the incoming edge constraints are satisfied
1 11 12 27 u 32 NULL Movie Database 1 r Movie Movie Movie Movie Producer Producer 11 12 x y Uncredited Actor Actor Director Director Director Title Actor Uncredited Actor Actor Title Year z 29 21 22 23 24 27 25 26 Date of birth Star Wars Name Name 1977 Name Hook Name Name Name u v 34 30 32 31 33 George Lucas Dustin Hoffman Steven Spielberg Harrison Ford Mark Hamill An AND Matching
Uncredited Actor In an AND matching, Node z must be null! Movie Database 1 r Movie Movie Movie Movie Producer Producer 11 12 x y Uncredited Actor Actor Director Director Director Title Actor Uncredited Actor Actor Title Year z 29 21 22 23 24 27 25 26 Date of birth Star Wars Name Name 1977 Name Hook Name Name Name u v 34 30 32 31 33 George Lucas Dustin Hoffman Steven Spielberg Harrison Ford Mark Hamill Suppose that we remove the edges that are labeled with Uncredited Actor
null null x 12 x 12 x 12 x l l l m l m l y 25 y 25 y 25 y null null Weak Satisfaction of Edge Constraints • Edge Constraint: • Is Weakly Satisfied if it is either • Satisfied (as defined earlier), or • One (or more) of its nodes is mapped to a null value
Weak Matchings • A partial matching is a weak matching if • The root constraint is satisfied • The reachability constraint is satisfied by every query node that is mapped to a database node • Every edge constraint is weakly satisfied
1 11 y NULL 27 u 32 NULL Edges that are weakly satisfied Movie Database 1 r Movie Movie Movie Movie Producer Producer 11 12 x y Actor Director Director Director Title Actor Uncredited Actor Actor Title Year z 29 21 22 23 24 27 25 26 Date of birth Star Wars Name Name 1977 Name Hook Name Name Name u v 34 30 32 31 33 George Lucas Dustin Hoffman Steven Spielberg Harrison Ford Mark Hamill A Weak Matching
In an AND matching, only the first three options are permitted null null x 12 x 12 x 12 x l l l m l m l y 25 y 25 y 25 y null null In a weak matching, all four options are permitted
Producer In a weak matching, Node z must be null! Movie Database 1 r Movie Movie Movie Movie Producer Producer 11 12 x y Actor Director Director Director Title Actor Uncredited Actor Actor Title Year z 29 21 22 23 24 27 25 26 Date of birth Star Wars Name Name 1977 Name Hook Name Name Name u v 34 30 32 31 33 George Lucas Dustin Hoffman Steven Spielberg Harrison Ford Mark Hamill Consider the case where edges labeled with Producer are removed
“OR” Matchings • A partial matching is an OR matching if • The root constraint is satisfied • The reachability constraint is satisfied by every query node that is mapped to a database node
1 11 y NULL 27 u 32 NULL An edge which is not weakly satisfied Movie Database 1 r Movie Movie Movie Movie Producer 11 12 x y Actor Director Director Director Title Actor Uncredited Actor Actor Title Year z 29 21 22 23 24 27 25 26 Date of birth Star Wars Name Name 1977 Name Hook Name Name Name u v 34 30 32 31 33 George Lucas Dustin Hoffman Steven Spielberg Harrison Ford Mark Hamill An OR Matching
Increasing Level of Incompleteness • A complete matching is an AND matching • An AND matching is a weak matching • A weak matching is an OR matching
Maximal Matchings • A tuple t1subsumes a tuple t2 if t1 is the result of replacing some null values in t2 by non-null values: • A matching is maximal if no other matching subsumes it • A query result consists of maximal matchings only Matchings are represented as tuples of oid’s and null values t1=(1, 5, 2, null) t2=(1, null, 2, null)
On the Complexity of Computing Queries with Incomplete Answers • The size of the result can be exponential in the size of the input (database and query) • Note that the same is true when joining relations – the size of the result can be exponential in the size of the input (database and query) • Instead of using data complexity (where the runtime depends only on the size of the database), we use input-output complexity
Input-Output Complexity In input-output complexity, the time complexity is a function of the size of the query, the size of the database, and the size of the result.
The Motivation for Using I/O Complexity • Measuring the time complexity with respect to the size of the input does not separate between the following two cases: • An algorithm that does an exponential amount of work simply because the size of the output is exponential in the size of the input • An algorithm that does an exponential amount of work even when the query result is small • Either the algorithm is naïve (e.g., it unnecessarily computes subsumed matchings) or the problem is hard
Recent Results (PODS’03) I/O Complexity of Query Evaluation(lower bounds are for non-emptiness)
Filter Constraints • Constraints that filter the results (i.e., the maximal matchings) • There are • Weak filter constraints(the constraint is satisfied if a variable in the constraint is null) • Strong filter constraints(all variables must be non-null for satisfaction) • Existence constraint:!xis true if xis not null
I/O Complexity of Query Evaluationwith Existence Constraints(lower bounds are for non-emptiness)
I/O Complexity of Query Evaluationwith Weak Equality/Inequality Constraints(lower bounds are for non-emptiness)
Query Containment • Query containments for queries with incomplete answers is defined differently from query containment for queries with complete answers • Q1 Q2 if for all database D, every matching of Q1 w.r.t. to D is subsumed by a matchings of Q2 w.r.t. to D • Query containment (query equivalence) is useful for the development of optimization techniques
Q1 Q2 r r l1 l1 l2 l2 x p l2 y z q l3 l4 l3 l4 u v u v Containment in AND Semantics • Homomorphism between the query graphs is necessary and sufficient for containment Q1 Q2 homomorphism • Deciding whether one query is contained in another is NP-Complete
Containment in OR Semantics • The following is a necessary and sufficient condition for query containment in OR semantics • For every spanning tree T1 of the contained query, there a spanning tree T2 of the containing query, such that there is a homomorphism from T2 to T1 • is in ΠP2 • NP-Complete if the containee is a tree • polynomial if the container is a tree
Containment in Weak Semantics • Similar to containment in OR Semantics, with the following difference • Instead of checking homomorphism between spanning trees, we check homomorphism between graph fragments • A graph fragment is a restriction of the query to a subset of the variables that includes the query root such that every node in the fragment is reachable from the root
Agenda Why is it difficult to query semistructured data? Queries with incomplete answers (QwIA) Flexible queries (FQ) Oblivious querying = QwIA + FQ Using QwIA and FQ for information integration
Flexible Queries • To deal with structural variations in the data, we have developed flexible queries