270 likes | 419 Views
Efficient Detection of Empty Result Queries. Gang Luo IBM T.J. Watson Research Center luog@us.ibm.com. Empty Result Problem. Query returns an empty result set User gets lost about where to look at next Frequently encountered in interactive exploration of massive data sets
E N D
Efficient Detection of Empty Result Queries Gang Luo IBM T.J. Watson Research Center luog@us.ibm.com
Empty Result Problem • Query returns an empty result set • User gets lost about where to look at next • Frequently encountered in interactive exploration of massive data sets • Our contribution: method for quickly detecting empty result sets
Example Percentages of Empty Result Queries • In a Customer Relationship Management (CRM) application developed by IBM • 18.07% (3,396 empty result queries in 18,793 queries) • In a real estate application developed by IBM • 5.75% • In a digital library application [JCM+00] • 10.53% • In a bioinformatics application [RCP+98] • 38%
Empty Result Queries May Not Finish Execution Quickly • Consider a query joining two relations • Query execution time is longer than join time, no matter whether or not query result set is empty • Even if a query finishes in a few seconds in a lightly loaded RDBMS, it may last longer than one minute in a heavily loaded RDBMS
Outline • Limitations of previous approaches • Fast detection method for empty result queries • Some experiments
Existing Solutions to the Empty Result Problem • Explain what leads to the empty result set • Automatically generalize the query so that the generalized query will return some answers
Limitations of Existing Solutions • Require domain specific knowledge • Only apply to a restricted form of queries • Require an excessive amount of time • Give too many reasons why the result set is empty • Users cannot reuse each other’s query results
Outline • Limitations of previous approaches • Fast detection method for empty result queries • Some experiments
Our Solution • Only consider read-only environment • From previous queries’ execution, remember the query parts that lead to empty result sets • When a new query Q comes, match it with the remembered query parts. If such a match exists, report that Q will return an empty result set without executing Q • Utilize special properties of empty result sets and thus often more powerful than traditional materialized view method
Definitions • Empty result propagating operator: An operator whose output is empty if any input is empty • Empty result propagating query: A query whose query plan only contains empty result propagating operators (our focus) • Query part: A sub-tree of a query plan • Atomic query part: An ordered pair (relation names RN, selection condition SC) • Corresponds to a relational algebra formula: first product join all relations in RN, then apply SC • SC is a conjunction of primitive terms, where each primitive term is a comparison
Definitions – Cont. • Cover: Atomic query part P1=(RN1, SC1)covers atomic query part P2=(RN2, SC2) if • RN1RN2 • Whenever SC2 is true, SC1 is true • Property: Suppose atomic query part P1 covers atomic query part P2. For a given database, if the output of P1 is empty, the output of P2 is also empty.
[0] sort-merge join B.g=C.h [0] sort [1000] sort [0] C.f<300[1000] [0] C (table-scan) [20000] hash join A.c=B.d [0] hash [200] hash [5000] 50<A.a<100 A.b=200[200] [5000] A (table-scan) [40000] B (index-scan) B.e<40 B.e=50 [5000] Given an Empty Result Query • Find the lowest-level query part P whose output is empty
Transforming P into a Simplified Query Part Ps • Drop all operators (e.g., projection, hash, sort) that have no influence on the emptiness of the output • Replace each physical join operator with a logical join operator • Replace each index-scan operator with a table-scan operator followed by a selection operator, where the selection condition is the index-scan condition
⋈A.c=B.d B.e<40 B.e=50 50<A.a<100 A.b=200 B (table-scan) A (table-scan) Transforming P into a Simplified Query Part Ps – Cont. • Corresponding relational algebra formula • (50<A.a<100 A.b=200 (A)) ⋈A.c=B.d(B.e<40 B.e=50(B))
Breaking Ps into Atomic Query Parts • Get all selection conditions in the selection/join operators • Rewrite the conjunction of these selection conditions into a disjunctive normal form (DNF) • Negations on numeric or string attributes are removed using complementary operators • Interval-based comparison is treated as a single primitive term • Generate a set of atomic query parts (RN, SC) • RN: input relations of all table-scan operators in Ps • SC: a term in the DNF
Breaking Ps into Atomic Query Parts – Cont. • Property: The following three assertions are equivalent to each other: • The output of the query part P is empty • The output of the simplified query part Ps is empty • The output of each generated atomic query part is empty (50<A.a<100 (A)) ⋈A.c=B.d(B.e<40(B)) (A.b=200 (A)) ⋈A.c=B.d(B.e<40(B)) (50<A.a<100 (A)) ⋈A.c=B.d(B.e=50(B)) (A.b=200 (A)) ⋈A.c=B.d(B.e=50(B))
Storing the Generated Atomic Query Parts • For each generated atomic query part Pa • Insert Pa into a collection Caqp of atomic query parts • Remove from Caqp all previously stored atomic query parts that are covered by Pa • See paper for details of the coverage checking algorithm
When Getting a New Query Q • Break Q into a set of atomic query parts • For each such atomic query part Pa, check whether some atomic query part Ai in Caqp covers Pa • If such an Ai exists for each Pa, report that Q will return an empty result set without executing Q
Outline • Limitations of previous approaches • Fast detection method for empty result queries • Some experiments
Setup • Testing environment • PostgreSQL 7.3.4 • Windows XP OS • Dell Inspiron 8500 PC with one 2.2GHz CPU, 512MB memory, one 40GB disk • TPC-R benchmark • See paper for detection probability analysis
Overhead Experiment • Query Q1: Find the information about certain parts that were sold on certain days select * from orders o, lineitem lwhere o.orderkey=l.orderkey and (o.orderdate=d1 or … or o.orderdate=de) and (l.partkey=p1 or … or l.partkey=pf);
Overhead Experiment – Cont. • Query Q2: Find the information about certain parts that were sold to certain customers on certain days select * from orders o, lineitem l, customer cwhere o.orderkey=l.orderkey and o.custkey=c.custkey and (o.orderdate=d1 or … or o.orderdate=de) and (l.partkey=p1 or … or l.partkey=pf) and (c.nationkey=n1 or … or c.nationkey=ng);
Overhead Experiment – Cont. • The overhead of our method increases with both query complexity and the number of atomic query parts stored in Caqp • When check fails, the overhead of our method is higher than that when check succeeds
Overhead Experiment – Cont. • The overhead of our method is trivial compared to query execution overhead
Summary • Provide a fast detection method for empty result queries • Low overhead • High detection probability once enough information has been accumulated
Open Issues • In the presence of update, correctly preserve as much stored information as possible • A hybrid method that can combine the advantages of both our method and the existing solutions • More aggressive storage saving technique