Efficient Detection of Empty Result Queries

Efficient Detection of Empty Result Queries Gang Luo IBM T.J. Watson Research Center luog@us.ibm.com

Empty Result Problem • Query returns an empty result set • User gets lost about where to look at next • Frequently encountered in interactive exploration of massive data sets • Our contribution: method for quickly detecting empty result sets

Example Percentages of Empty Result Queries • In a Customer Relationship Management (CRM) application developed by IBM • 18.07% (3,396 empty result queries in 18,793 queries) • In a real estate application developed by IBM • 5.75% • In a digital library application [JCM+00] • 10.53% • In a bioinformatics application [RCP+98] • 38%

Empty Result Queries May Not Finish Execution Quickly • Consider a query joining two relations • Query execution time is longer than join time, no matter whether or not query result set is empty • Even if a query finishes in a few seconds in a lightly loaded RDBMS, it may last longer than one minute in a heavily loaded RDBMS

Outline • Limitations of previous approaches • Fast detection method for empty result queries • Some experiments

Existing Solutions to the Empty Result Problem • Explain what leads to the empty result set • Automatically generalize the query so that the generalized query will return some answers

Limitations of Existing Solutions • Require domain specific knowledge • Only apply to a restricted form of queries • Require an excessive amount of time • Give too many reasons why the result set is empty • Users cannot reuse each other’s query results

Our Solution • Only consider read-only environment • From previous queries’ execution, remember the query parts that lead to empty result sets • When a new query Q comes, match it with the remembered query parts. If such a match exists, report that Q will return an empty result set without executing Q • Utilize special properties of empty result sets and thus often more powerful than traditional materialized view method

Definitions • Empty result propagating operator: An operator whose output is empty if any input is empty • Empty result propagating query: A query whose query plan only contains empty result propagating operators (our focus) • Query part: A sub-tree of a query plan • Atomic query part: An ordered pair (relation names RN, selection condition SC) • Corresponds to a relational algebra formula: first product join all relations in RN, then apply SC • SC is a conjunction of primitive terms, where each primitive term is a comparison

Definitions – Cont. • Cover: Atomic query part P1=(RN1, SC1)covers atomic query part P2=(RN2, SC2) if • RN1RN2 • Whenever SC2 is true, SC1 is true • Property: Suppose atomic query part P1 covers atomic query part P2. For a given database, if the output of P1 is empty, the output of P2 is also empty.

 [0] sort-merge join B.g=C.h [0] sort [1000] sort [0] C.f<300[1000]  [0] C (table-scan) [20000] hash join A.c=B.d [0] hash [200] hash [5000] 50<A.a<100  A.b=200[200]  [5000] A (table-scan) [40000] B (index-scan) B.e<40  B.e=50 [5000] Given an Empty Result Query • Find the lowest-level query part P whose output is empty

Transforming P into a Simplified Query Part Ps • Drop all operators (e.g., projection, hash, sort) that have no influence on the emptiness of the output • Replace each physical join operator with a logical join operator • Replace each index-scan operator with a table-scan operator followed by a selection operator, where the selection condition is the index-scan condition

⋈A.c=B.d B.e<40  B.e=50 50<A.a<100  A.b=200 B (table-scan) A (table-scan) Transforming P into a Simplified Query Part Ps – Cont. • Corresponding relational algebra formula • (50<A.a<100  A.b=200 (A)) ⋈A.c=B.d(B.e<40  B.e=50(B))

Breaking Ps into Atomic Query Parts • Get all selection conditions in the selection/join operators • Rewrite the conjunction of these selection conditions into a disjunctive normal form (DNF) • Negations on numeric or string attributes are removed using complementary operators • Interval-based comparison is treated as a single primitive term • Generate a set of atomic query parts (RN, SC) • RN: input relations of all table-scan operators in Ps • SC: a term in the DNF

Breaking Ps into Atomic Query Parts – Cont. • Property: The following three assertions are equivalent to each other: • The output of the query part P is empty • The output of the simplified query part Ps is empty • The output of each generated atomic query part is empty (50<A.a<100 (A)) ⋈A.c=B.d(B.e<40(B)) (A.b=200 (A)) ⋈A.c=B.d(B.e<40(B)) (50<A.a<100 (A)) ⋈A.c=B.d(B.e=50(B)) (A.b=200 (A)) ⋈A.c=B.d(B.e=50(B))

Storing the Generated Atomic Query Parts • For each generated atomic query part Pa • Insert Pa into a collection Caqp of atomic query parts • Remove from Caqp all previously stored atomic query parts that are covered by Pa • See paper for details of the coverage checking algorithm

When Getting a New Query Q • Break Q into a set of atomic query parts • For each such atomic query part Pa, check whether some atomic query part Ai in Caqp covers Pa • If such an Ai exists for each Pa, report that Q will return an empty result set without executing Q

Setup • Testing environment • PostgreSQL 7.3.4 • Windows XP OS • Dell Inspiron 8500 PC with one 2.2GHz CPU, 512MB memory, one 40GB disk • TPC-R benchmark • See paper for detection probability analysis

Overhead Experiment • Query Q1: Find the information about certain parts that were sold on certain days select * from orders o, lineitem lwhere o.orderkey=l.orderkey and (o.orderdate=d1 or … or o.orderdate=de) and (l.partkey=p1 or … or l.partkey=pf);

Overhead Experiment – Cont. • Query Q2: Find the information about certain parts that were sold to certain customers on certain days select * from orders o, lineitem l, customer cwhere o.orderkey=l.orderkey and o.custkey=c.custkey and (o.orderdate=d1 or … or o.orderdate=de) and (l.partkey=p1 or … or l.partkey=pf) and (c.nationkey=n1 or … or c.nationkey=ng);

Overhead Experiment – Cont. • The overhead of our method increases with both query complexity and the number of atomic query parts stored in Caqp • When check fails, the overhead of our method is higher than that when check succeeds

Overhead Experiment – Cont. • The overhead of our method is trivial compared to query execution overhead

Summary • Provide a fast detection method for empty result queries • Low overhead • High detection probability once enough information has been accumulated

Open Issues • In the presence of update, correctly preserve as much stored information as possible • A hybrid method that can combine the advantages of both our method and the existing solutions • More aggressive storage saving technique

Efficient Detection of Empty Result Queries

Efficient Detection of Empty Result Queries

Presentation Transcript

Empty

Efficient Context-Sensitive Intrusion Detection

Effective Cost -efficient Result-oriented

Efficient Processing of Top- k Queries in Uncertain Databases

Efficient Computation of Reverse Skyline Queries

Efficient Evaluation of Queries in a Mediator for WebSources

On Efficient Matching of Streaming XML Documents and Queries

Efficient Processing of XPath Queries Using Indexes

Efficient processing of XPath queries with structured overlay networks

EFFICIENT PROFILING FOR ESTIMATION OF QUERY RESULT QUALITY

On the Path to Efficient XML Queries

Efficient Detection of Split Personalities in Malware

Efficient Algorithm For Processing XPath Queries

Efficient Scheduling of Heterogeneous Continuous Queries

Efficient Processing of Metric Skyline Queries

On the Path to Efficient XML Queries

Efficient Data-Race Detection

Efficient Evaluation of HAVING Queries on a Probabilistic Database

Generating Efficient Plans for Queries Using Views

EMPTY

Efficient Scheduling of Heterogeneous Continuous Queries