520 likes | 638 Views
AutoJoin: Providing Freedom from Specifying Joins. Terrence Mason ( terrence-mason@uiowa.edu ) Lixin Wang ( lixin-wang@uiowa.edu ) Dr. Ramon Lawrence ( ramon-lawrence@uiowa.edu ) Iowa Database and Emerging Application Laboratory University of Iowa.
E N D
AutoJoin: Providing Freedom from Specifying Joins Terrence Mason (terrence-mason@uiowa.edu) Lixin Wang (lixin-wang@uiowa.edu) Dr. Ramon Lawrence (ramon-lawrence@uiowa.edu) Iowa Database and Emerging Application Laboratory University of Iowa 7th International Conference on Enterprise Information Systems ICEIS 2005 Miami, Florida
Presentation Outline • Define Query Inference • Query Languages that require Inference • AutoJoin Architecture • Join Graph represent a schema • Queries and Query Interpretations on a Join Graph • Pre-compute maximal join trees • Algorithm EMO • Query time processing – Example • Performance Evaluation
Query Inference Problem New Languages The query inference problemrequires enumerating and ranking query interpretations of a query such that the query interpretation desired by the user is among the highest ranked interpretations.
Motivation for Query Inference • State of the art query languages require it • Keyword Search – automatically relate keywords across relations of a schema • Conceptual Queries – Concepts mapped to database must be related • Natural Language Queries • Natural language query mapped to concepts • Relate concepts as in Conceptual Queries • Current approaches not scalable • Tied to specific language • Or conceptual model
Motivation for Query Inference • Reduces to graph problem • Connect relations (nodes) with joins (edges) • Exponential solutions for highly connected graphs (database graphs less connected) • Approaches to join determination • Grow all ways • Universal Relation (Maier and Ullman, 1983) • Discover (Keyword) (Hristidis and Papakonstantinou, 2002, 2003, 2004) • Shortest Paths • CQL Conceptual Query Language (Owei and Navathe, 2001) • Limited Interpretations • Steiner Tree (2-Trees) (Wald and Sorenson, 1984) • Limit number of joins and interpretations (Zhang et al., 1999) • Query time find spanning trees of keywords • DBXplorer Keyword Search (Agrawal et al. 2002)
Goal of AutoJoin • Consistent, Scalable Inference Engine • Abstract database schema from users • Automatically determine joins to relate relations and attributes • Consistent approach to handle ambiguity in queries • Efficient algorithm to pre-compute potential joins • Minimal overhead at query time • Demonstrate efficiency and scalability • Structured on relational model without any required conceptual models
Example Query on TPC-H Schema English Query: List all parts ordered by Customers in the United States. • Attribute-only SQL • Determine Joins with AutoJoin • New formulation for Query Inference problem.
TPC-H Schema TPC-H BENCHMARK™ (http://www.tpc.org/) List all parts ordered by Customers in the United States.
Attribute-only Query: Select Part.Name where Nation.Name=‘United States’; • Part.Name - name attribute in Part Table • Nation.Name – name attribute in Nation Table • Select and where similar to SQL • No From clause or joins specified Keyword Query: Part ‘United States’ • Maps Part toPart relation • Maps‘United States’totuple in Nationrelation • Nojoins specified
SQL QuerySelect Part.Name where Nation.Name = ‘United States’; SELECT P.name FROM part P, nation N, partsupp PS, lineitem LI, orders O, customer C WHERE N.name = ‘United States’ And P.partkey = PS.partkey And PS.partkey = LI.partkey And PS.suppkey = LI.suppkey And O.custkey = C.custkey And C.nationkey = N.nationkey And LI.orderkey = O.orderkey; Specified Joins and Tables
AutoJoin Architecture User Relational Database Execute Queries Query Interface Inference Request Interpretations Query Builder Iterator Generator Ranker XML Document Loader AutoJoin Inference Engine
Representing Joins of a SchemaJoin Graph • Graph representation of relational schema • Nodes • Relations in schema • Directed Edges • Foreign key constraint between relations • Edges directed from N to 1 cardinality of relationships • Maintain Lossless property (No spurious tuples on joins)
Create Join Graph TPC-H Tables as Nodes Line Item Part Supp Part Supplier Order Nation Customer Region
Pre-compute Maximal Join Trees • EMO Algorithm on Join Graph • Efficiently computes all Trees • Executes where previous strategy failed • Direction of edges results in lossless join trees • Pre-computed • Executed once prior to query time • Structures built for query time performance
Compute Lossless Joins • Maximal sets of lossless joins • Ambiguity inherent in the schema • Two types of ambiguity: • Single relation that plays multiple roles • Node with more than one incoming edge in join graph • Multiple semantic relationships between entities • Strongly connected components greater than one node
Creation of Maximal Join TreesLossless Joins • Efficient Algorithm EMO • Determine all reachable graphs from nodes that may be a root for Maximal Set of Lossless Joins • Identify all Strong Connected Components (SCC) • For each SCC • If SCC is single node and no incoming edges, create reachable graph from this node • If SCC has multiple nodes, for each node in SCC with no incoming edges that are not part of SCC create reachable graph. • For each reachable graph find all spanning trees • Spanning trees represent Maximal Join Trees
Maximal Join Trees of TPC-H • LineItem is the only root for a reachable graph. • No strongly connected components • Join graph is reachable graph • Enumerate spanning trees on original graph • Remove shortcut joins and re-compute
TPC-H Join Graph Line Item Part Supp Order Part Supplier Nation Customer Region
TPC-H Maximal Join Trees Line Item Line Item Line Item Line Item Part Supp Part Supp Part Supp Part Supp Supplier Order Supplier Order Supplier Order Part Part Supplier Order Part Part Nation Customer Nation Customer Nation Customer Nation Customer 4 1 3 Region Region 2 Region Region Line Item Line Item Line Item Line Item Part Supp Part Supp Part Supp Part Supp Supplier Order Supplier Order Supplier Order Part Part Supplier Order Part Part Nation Customer Nation Customer Nation Customer Nation Customer 8 6 7 5 Region Region Region Region
Shortcut Joins • Semantically equivalent join paths • A shortcut join is a join that is semantically equivalent to a longer join path • Core join path (longer) preserved in join graph • Shortcut join removed for join determination • Appears to be a semantically different interpretation of the query • Substituted back into query • No nodes on core path in query (faster) execution) • TPC-H has two shortcut joins
TPC-H Join Graph Remove Shortcut Joins Line Item Red – Shortcut Joins Part Supp Order Part Supplier Nation Customer Region
Original TPC-H Maximal Join Trees Line Item Line Item Line Item Line Item Part Supp Part Supp Part Supp Part Supp Supplier Order Supplier Order Supplier Order Part Part Supplier Order Part Part Nation Customer Nation Customer Nation Customer Nation Customer 4 1 3 Region Region 2 Region Region Line Item Line Item Line Item Line Item Part Supp Part Supp Part Supp Part Supp Supplier Order Supplier Order Supplier Order Part Part Supplier Order Part Part Nation Customer Nation Customer Nation Customer Nation Customer 8 6 7 5 Region Region Region Region
TPC-H Semantically Unique Maximal Join Trees Line Item Line Item Part Supp Part Supp Part Supplier Order Order Part Supplier Nation Customer Nation Customer Region Region 1 2
Query and Query Interpretation AutoJoin • Join Graphs • Query: • Sub-graph of the join graph • Nodes and (optionally) edges • Not connected requires inference • Query Interpretation: • Connected sub-graph of the join graph • Includes all specified nodes and edges
Example Query SELECT Part.Name WHERE Nation.Name = ‘United States’; • Relate Part.Name to Nation.Name • Part and Nation Nodes. • QueryofPart and Nation nodes to AutoJoin. • The query is ambiguous • More than one query interpretation • Nation relates to Supplier and Customer • Return the query with fewest joins first
Efficient Query Time Execution • Find maximal join trees with query nodes • Reverse index - relation to its set of join trees • Intersect lists • Build Interpretations • Least common ancestor (vs. recursive prune) • Pre-compute ancestor lists • No lossless interpretations (no trees) • Find lossy interpretation • Rank interpretations by cost function maximal sets of lossless joins
Both Trees Contain Query Nodes Select Part.Name where Nation.Name = ‘United States’; Line Item Line Item Part Supp Part Supp Part Supplier Order Order Part Supplier Nation Customer Nation Customer Region Red – Target Nodes Region 1 2
Query Processing Line Item Line Item Part Supp Part Supp Part Supplier Order Order Part Supplier Nation Customer Nation Customer Red – Target Nodes Blue – Tree Nodes Gray – Nodes to Prune Region Region 1 2
Query Interpretations Select Part.Name where Customer.Nation.Name = ‘United States’; Select Part.Name where Supplier.Nation.Name = ‘United States’; Line Item Part Supp Part Supplier Part Supp Part Order Nation Nation Customer 1 2
Unambiguous Query Select Supplier.Name where Order.Id = 73; Line Item Line Item Part Supp Part Supp Part Supplier Order Order Part Supplier Nation Customer Nation Customer Region Red – Target Nodes Region 1 2
Query Processing Select Supplier.Name where Order.Id = 73; Line Item Line Item Part Supp Part Supp Part Supplier Order Order Part Supplier Nation Customer Nation Customer Red – Target Nodes Blue – Tree Nodes Gray – Nodes to Prune Region Region 1 2
Query Interpretations Select Supplier.Name where Order.Id = 73; Line Item Line Item Part Supp Part Supp Supplier Order Supplier Order 1 2
The Unambiguous Query Interpretation Select Supplier.Name where Order.Id = 73; Line Item Part Supp Supplier Order
Additional InterpretationsLossy Joins • Related through a node involved in two distinct roles • Two maximal join trees contain all query nodes and have at least one node in common • Union maximal join trees • Common nodes provide relation for trees. • Interpretation where node will have two incoming edges • No longer lossless • Example Customer and Supplier related through Nation in TPC-H. • Cross products of Customers and Suppliers with the same nation
Beyond Natural Joins • Theta joins • Merge the two nodes related by theta join into single node and re-compute maximal objects. • Expand this node for final query interpretation with theta join • Tuple Variables • A query interface may specify tuple variables • Additional nodes and edges will be added to join graph to complete the query interpretations
Performance Experiments • Broad Range of Schemas • caBIO (NCI) 149 relations, 213 joins, and 1253 maximal join trees • TPC-H Standard Database • Inferred standard queries (21 specified queries) • Ambiguity reduced by removing shortcut joins • Tenant – 9 nodes, 50 joins, and 1286 maximal join trees
Peformance Results • Time to generate all Maximal Join Trees • Handles schemas where previous method failed • Worst test 2.7 seconds • Average < 1 second • Reduce Ambiguity • Removing shortcut joins reduces ambiguity • Increased number of unambiguous query • From 45% to 68% for TPC-H Benchmark Queries • Minimal overhead of inference at query time • Average < 1 millisecond • Worst test 7.4 milliseconds
AutoJoin Conclusions • Scalable inference engine • Efficiently pre-compute maximal join trees • Reduced ambiguity by removing shortcut joins • Overhead is minimal • Complex queries can be inferred • Built directly on relational model
Future Work • Develop a query language • Remove requirement of understanding the underlying schema • Automatically determines joins • End user interface based on AutoJoin • Query inference for integration systems.
Query Inference(Previous) The translation of a query in a query language into an unambiguous representation of the query [Wald and Sorenson, 1984]
Universal Relation • First model to require query inference • Maximal Objects (Maier and Ullman, 1983) • Lossless Join property to identify potential joins • Grows all ways on hyper-graph • Returns a union of all query interpretations • Minimum Directed Cost Steiner Tree (Wald and Sorenson, 1984) • Limited to Partial 2-Trees • Returns only lowest cost query interpretation • Generate a single interpretation • Do not meet need of new query languages • Limited query interpretations possible
State of the Art Query Languages • Keyword Searches • Keywords map to either specific data, attribute names, or relation names in a database. • Must identify joins to relate keywords spread across multiple relations. • Multiple approaches to identifying the top-k relationships between keywords.
Keyword SearchTop-K Relationships • Discover (Hristidis and Papakonstantinou, 2002, 2003, 2004) • Grow all ways from a keyword • Limit on number of joins • Creates extra graphs • DBXplorer (Agrawal et al. 2002) • Generates spanning trees at query time • BANKS ( ) • Graph of all tuples related by joins • Must fit in memory (limited to smaller databases)
State of the Art Query Languages • Conceptual Query Languages or Models • Queries built with concepts that map to a database. • Remove the burden of knowledge of the schema. • Must determine joins to relate concepts in query. • Use conceptual model to determine joins
Conceptual Query Languages • CQL (Owei and Navathe, 2001) • Queries may include roles or joins required for a query • Pathfinder algorithm for completing the query • Based on shortest path between source and target concepts in query • Semantically Constrained ER Diagram as a graph used to determine joins. • Conceptual Model (Zhang et al., 1999) • Semantic graph of database • Search algorithm constrained by number of joins or number of interpretations
State of the Art Query Languages • Natural Language Queries • Natural language queries map the language to concepts in a database • Joins must be determined to relate concepts in database similar to Conceptual Query Languages
Functional Dependencies due to Primary KeysTPC-H Primary Keys Foreign Keys