Avoiding Sorting and Grouping In Processing Queries Sahak Maloyan

Avoiding Sorting and Grouping In Processing Queries Sahak Maloyan

Motivation Simple Example Order Properties Grouping followed by ordering Order Property Optimization Performance Results Conclusion Outline

Previous presentation: Fundamental Techniques for Order Optimization Using FDs and selection predicates Determining order propagation from input to output Infer from ordering Current presentation: Aside from orderings, we also infer how relations are grouped(i.e., how records in relations are clustered according to value of certain attributes) Infer from grouping Infer from secondary ordering Motivation

Inferred orderings Make it possible to avoid sorting when preprocessing ORDER BY clauses of SQL query Inferred groupings Avoid sorting or hashing prior to computing aggregates for GROUP BY clauses Reduce the cost of projection with duplicate elimination Complete projection and duplicate elimination in a single pass Reduce the cost of evaluating selection queries in the form σA=k(R) in the absence of indexes or an ordering on A Inference of secondary ordering and grouping Avoid unnecessary sorting or grouping over multiple attributes Infer new primary orderings or groupings (example follows) Motivation(cont.)

Benefits of inferring grouping and secondary ordering Simple Example TPC-H Query SELECT c_custkey, COUNT (*) FROM Customer, Supplier WHERE c_nationkey = s_nationkey GROUPBY c_custkey How many suppliers could supply each costumer directly without having to go through customs

Simple Example (cont.) sort-merge join result is sorted (and hence grouped) on c_nationkey; the output tuples in the same group with respect to c_nationkey, are themselves grouped on the key of outer relation (c_custkey) “c_nationkey G→c_custkeyG “ =>no sort group c_custkey, count(*) Postgres Plan first sorts the join result on the grouping attribute c_custkey so as to be able to aggregate over groups in a single pass. sort c_custkey But one-pass aggregation requires data only to be grouped and not sorted! merge-join c_nationkey= s_nationkey TPC-H Query SELECT c_custkey, COUNT (*) FROM Customer, Supplier WHERE c_nationkey = s_nationkey GROUPBY c_custkey sort c_nationkey sort s_nationkey table scan customer table scan supplier Postgres QEP of the Query

order properties have the form: each Aiis an attribute, each αieither specifies an ordering (αi= O) or a grouping (αi=G) A1α1primary ordering or grouping and A2α2 secondary Ordering properties are formalized with an algebra of constructors, following the signatures given below: Order Properties empty ordering combination of orderings basic orderings:order or group

Suppose that R=(A,B) consists of 10 tuples, t1,…,t10, and its physical representation satisfies the order property, Ao→ BG. This situation is illustrated on the next slide Grouping followed by ordering

Grouping followed by ordering (cont.) The secondary ordering (BG) says that within each group of tuples with like values of A, tuples are clustered together if they have the same value for B The primary ordering (AO) says that the group of tuples with A=1 precedes the group of tuples with A=2 whichprecedes the group with A=3 A=3 A=1 B=2 B=3 B=1 t7 t6 B=2 t1 t2 B=1 t3 t5 t4 < < A=2 t1 can precede t2 or t2 can precede t1 but the must be adjacent B=2 B=1 Two Example permutations that satisfies the order property : t2,t1,t3,t10,t8,t9,t6,t7,t4,t5 t1,t2 ,t3 ,t9,t8,t10,t4 ,t5, t6,t7 t9 t10 t8 An illustration of AO→ BG

The general properties have the form: Shorthand: Also, given and the shorthand: “o1→o2“ (concatenation of OP) denotes: Computing with Order Properties (cont.)

for any order property that holds of a physical relation, all prefixes of that order property also hold of R an ordering on any attribute implies a grouping on that attribute If X functionally determines B, and an order property that includes all attributes in X (ordered or grouped) appearing before Bα, thenBαis superfluous. Order Properties (cont.) • Identities

Order Properties (cont.) • Identities (cont.) • special case of identity #3, covering the case where X consists of a single attribute • the grouping of an attribute that is functionally determined by the attribute that follows it in the order property is superfluous

Using the algebra of order properties and their formal definitions, we can derive inference rules that state how order properties propagate through relational operators, e.g., joins: Order Property Inference

The data structures for all plan nodes in postgres include the following fields: inp1,… inpn: the fields contained in all input tuples to the node left: the left subtree of the node (set to Null for leaf nodes and Append) right: the right subtree of the node (set to Null for leaf nodes, unary operators and Append). Order Property Optimization • Postgres Plan Operators Summarized

Order Property Optimization • Postgres Plan Operators Summarized(cont.) • Additional operator-specific fields provided by Postgres and used by our refinement algorithm

Group performs two passes over its input: insert Null values between pairs of consecutive tuples with different values for attributes, att1, …,attk, apply functions Fk+1,…, Fn to the collection of values of attributes attk+1,…,attn respectively, for each set of tuples separated by Nulls. Hash: builds a hash table over its input using a predetermined hash function over attribute att. Order Property Optimization • Postgres Plan Operators Summarized (cont.)

HJoin: performs a (non-order-preserving) simple hash equijoin (att1 = att2) with the relation produced by left as the probe relation, and the relation produced by right as the build relation. Merge: performs a merge equijoin (att1 = att2) with the relation produced by left as the outer relation, and the relation produced by right as the inner relation. NOP: has been added as a dummy plan operator that is temporarily made the root of a Postgres plan prior to its refinement. Order Property Optimization

Input: query plan tree generated by Postgres Output: an equivalent plan tree with unnecessary Sort operators (used either to order or group) removed Requires: 4 new attributes associated with every node in a query plan tree Order Property Optimization • A Plan Refinement Algorithm

keys: a set of attribute sets that are guaranteed to be keys of inputs to n fds: a set of functional dependencies (attribute sets →attribute) that are guaranteed to hold of inputs to n req: a single order property that is required to hold of inputs either to n or some ancestor node of n for that node to execute sat: a set of order properties that are guaranteed to be satisfied by outputs of n Order Property Optimization • A Plan Refinement Algorithm(cont.) • New Attributes

Idea: decorate the input plan with these new attributes remove any Sort operator, whose child node produces a result that is guaranteed to satisfy an order property required by its parent node Accomplished in 3 passes over the input plan Order Property Optimization • A Plan Refinement Algorithm (cont.)

Order Property Optimization NOP group c_custkey, count(*) sort c_custkey merge-join c_nationkey= s_nationkey sort c_nationkey sort s_nationkey table scan customer table scan supplier • A Plan Refinement Algorithm (cont.) • Refinement of the query plan

Order Property Optimization group c_custkey, count(*) merge-join c_nationkey= s_nationkey sort c_nationkey sort s_nationkey table scan customer table scan supplier • A Plan Refinement Algorithm (cont.) • Resulting query plan with Sort removed:

Pass 1: Functional Dependencies and Keys A bottom-up pass, FDs and keys are propagated upwards when inferred to hold of intermediate query result Pass 2: Required Order Properties Top-down pass to propagate required order properties (req) downwards from the root of the tree Pseudocode of this pass given in SetReq (next slide) New required order properties are generated by: NOP: if its child is Sort, i.e., original query includes order by Group and Unique (whose input needs to be grouped) Join operators (propagate 1 order from above into 2 below) All other nodes pass the required order properties they inherit from parent nodes to their child nodes, except for Hash and Append which propagate the empty order property to their child nodes Order Property Optimization • A Plan Refinement Algorithm (cont.)

Order Property Optimization

Pass 3:Sort Elimination A bottom-up pass of the query plan tree that determines what order properties are guaranteed to be satisfied by outputs of each node (sat), and that concurrently removes any Sort operator, n for which n.left.sat n.req Algorithm: InferSat (next slides) Order Property Optimization • A Plan Refinement Algorithm (cont.)

Order Property Optimization • A Plan Refinement Algorithm (cont.) • InferSat

Order Property Optimization • A Plan Refinement Algorithm (cont.) • InferSat (cont.)

Example:TPC-D (now TPC-H) Query 3 TPC-D Query 3 select l.orderkey, sum (l .extendedprice*( 1- l.discount)) as rev, o.orderdate, o.shippriority from customer, order, lineitem where o.orderkey = l.orderkey and c.custkey = o.custkey and c.mktsegment =’building’ and o.orderdate < date(‘1998-11-30’) and l.shipdate > date(‘1998-11-30’) group by l.orderkey, o.orderdate, o.shippriority order by rev desc, o.orderdate

Previous presentation: optimized plan outperformed the original plan by a factor of  2 Now: Further improvements due to reasoning about groupings and secondary orderings Example:TPC-D(now TPC-H) Query 3

Example:TPC-D(now TPC-H) Query 3 sort rev, o_orderdate NLJ R=> Oo_orderkeyG(U) Identitiy#5 => Oo_orderkeyG(T) Identitiy#4 =>Oo_custkeyG→Oo_orderkeyG(T) group by o_orderkey MJ Rule =>Oc_custkeyG→c_custkeyG→o_custkeyG→o_orderkeyG(T) and c_custkey = o_custkey => Oo_custkeyG→o_custkeyG→o_custkeyG→o_orderkeyG(T) nested-loops o_orderkey = l_orderkey sort o_orderkey Index scan lineitem merge-join c_custkey = o_custkey Identitiy#5 => Oc_custkeyG→o_orderkeyG(S) sort c_custkey sort o_custkey Oo_custkeyo(S)=> Oo_custkeyG(S) Oc_custkeyo(R)=> Oc_custkeyG(R) table scan customer table scan order

TPC-D (now TPC-H) Results Database: Customer table: 150,000 rows Supplier table: 10,000 rows Order table: 1,500,000 rows LineItem table: 6,000,000 rows PC: 1 GHz Pentium III Linux, with 512 MB RAM, 120 GB HDD Performance Results

Performance Results group c_custkey, count(*) sort c_custkey merge-join c_nationkey= s_nationkey sort c_nationkey sort s_nationkey table scan customer table scan supplier Experiment #1 our example N.B.: Merge join result isHUGE (60 Mio rows)

Performance Results sort rev, o_orderdate group by o_orderkey nested-loops o_orderkey = l_orderkey sort o_orderkey Index scan lineitem merge-join c_custkey = o_custkey sort c_custkey sort o_custkey table scan customer table scan order Experiment #2 TPC-H Query 3 Same value of o_orderkey were consecutive tuples thereby increased likelihood of finding joining tuples from lineitem in the cache

Performance Results sort rev, o_orderdate group by o_orderkey nested-loops o_orderkey = l_orderkey sort o_orderkey Table scan lineitem merge-join c_custkey = o_custkey sort c_custkey sort o_custkey table scan customer table scan order Experiment #2 TPC-H Query 3 With table scan on lineitem

How much do we pay for plan refinement? We pay most, when it actually pays off!(queries Q1, Q5, Q10: no refinement) Cost of additional optimization

Formal approach to order optimization that integrates both orderings and groupings within the same comprehensive framework Also considered secondary orderings and groupings By inferring secondary orderings and groupings, it is possible to avoid unnecessary sorting or grouping over multiple attributes Use secondary orderings known of an operator's input to infer primary orderings of its output Conclusion

Avoiding Sorting and Grouping In Processing Queries Sahak Maloyan

Avoiding Sorting and Grouping In Processing Queries Sahak Maloyan

Presentation Transcript

Processing Data Intensive Queries in Scientific Database Federations

Processing and Optimization of Forecast Queries

Grouping and Segmentation

Segmentation and Grouping

Grouping and Segmentation

Segmentation and Grouping

Parallel Processing (CS 676) Lecture: Grouping Data and Communicators in MPI

Processing Location Based Queries in a Streaming Fashion

Cosequential Processing and the Sorting of Large Files

Grouping and Segmentation

Grouping Search-Engine Returned Citations for Person Name Queries

Grouping

Processing Spatio-Temporal Queries in a Streaming Fashion

Avoiding Idle Waiting in the execution of Continuous Queries

Academic performance grouping Random grouping Socio-metric grouping Psychological Grouping

Part 03 – Sorting, Paging, and Grouping

Processing Spatio-Temporal Queries in a Streaming Fashion

SQL queries ordering and grouping and joins

Avoiding duplicated queries in 11aq

Grouping Search-Engine Returned Citations for Person Name Queries

Query Processing 1: Joins and Sorting

Grouping Search-Engine Returned Citations for Person Name Queries