370 likes | 455 Views
Rewriting Procedures for Batched Bindings. Ravindra Guravannavar and S. Sudarshan Indian Institute of Technology, Bombay. Appeared in VLDB 2008. Motivation. Queries/updates are often invoked repeatedly. E.g.:. Nested subqueries (with correlated evaluation)
E N D
Rewriting Procedures for Batched Bindings Ravindra Guravannavar and S. Sudarshan Indian Institute of Technology, Bombay Appeared in VLDB 2008
Motivation Queries/updates are often invoked repeatedly. E.g.: • Nested subqueries (with correlated evaluation) • Queries embedded in user-defined functions (UDFs) that are invoked from other queries • Queries/updates in stored procedures that are invoked repeatedly (e.g. batch jobs) • Queries/updates invoked inside loops in application programs
Example: A Nested Query List the orders of high worth (> 10K) SELECT O.orderid, O.custid FROM orders O WHERE 10000 < (SELECT sum(lineprice) FROM lineitem L WHERE L.orderid=O.orderid); IterativeExecution: For each record in the result of the outer query block - bind the parameters for the nested sub-query - evaluate the nested sub-query - process the results (check predicate, output)
Optimizing Repeated Invocations • Iterative execution of queries often performs poorly • Redundant I/O • Random I/O and poor buffer effects • Network round-trip delays • Decorrelation is widely used for optimizing nested queries
Query Decorrelation Original Query SELECT O.orderid, O.custid FROM orders O WHERE 10000 < (SELECT sum(lineprice) FROM lineitem L WHERE L.orderid=O.orderid); After Decorrelation(Most systems do this automatically) SELECT O.orderid, O.custid FROM orders O, lineitem L WHERE O.orderid=L.orderid GROUPBY O.orderid, O.custid HAVING sum(L.lineprice) > 10000;
Limitations of Decorrelation Tricky at times… • COUNT aggregate • Non-equality correlation predicates The solutions may not produce the best plan Decorrelation techniques are not applicable for: • Nested invocation of procedures with complex logic and embedded query invocations • Iterative invocation of queries/updates from application code
Example: UDF Invoked from a Query SELECT * FROM category WHEREcount_items(category-id) > 50; // Count the items in a given category and its sub-categories int count_items(int categoryId) { … while(…) { … … SELECT count(item-id) INTO icount FROM item WHERE category-id = :curcat; … } } Procedural logic with embedded queries
Key Idea: Parameter Batching • Repeated invocation of an operation is replaced by a single invocation of its batched form • Batched form: Works on a set of parameters • Benefits • Choice of efficient set-oriented plans • Repeated selection → Join • Efficient integrity checks • Efficient disk access (sort RIDs before fetch) • Reduced network round-trip delay
Batched Forms of Basic Operations • Insert • insert into <table1> select … from <table2> … • Bulk load (SQLServer: bcp, Oracle: sqlldr, DB2: load) • Update • update <table1> from <table2> where … (equivalent to SQL:2003 merge statement) • Queries • Make use of join or outer-join (seen in decorrelation)
SQL Merge GRANTMASTER GRANTLOAD merge into GRANTMASTER GM using GRANTLOAD GL on GM.empid=GL.empidwhen matched then update set GM.grants=GL.grants; Notation:Mc1=c1’,c2=c2’,… cn=cn’(r, s)
Effect of Batch Size on Inserts Bulk Load: 1.3 min
Iterative and Set-Oriented Updates on the Server Side • TPC-H PARTSUPP (800,000 records), Clustering index on (partkey, suppkey) • Iterative update of all the records using T-SQL script (each update has an index lookup plan) • Single commit at the end of all updates Takes 1 minute Same update processed as a merge (update … from …) Takes 15 seconds
The Challenge Given a procedure, how to obtain its batched form? Possible to manually rewrite, but time consuming and error-prone.
Our Work • Automatic generation of batched forms of UDFs/stored procedures using rewrite rules • Automatic rewrite of programs to replace looping with batched invocation
Batched Forms • Batched form qb of a pure function q • Returns results for a set of parameters • Result in the form {(parameter, result)} • For queries: standard techniques for creating batched forms (from work on decorrelation) Example: Original query: SELECT item-id FROM item WHERE category-id=? Batched form: SELECT pb.category-id, item-id FROM param-batch pb LEFT OUTER JOIN item ON pb.category-id = item.category-id;
Batch Safe Operations • Batched forms – no guaranteed order of parameter processing • Can be a problem for operations having side-effects Batch-Safe operations • All operations without side effects • Also a few operations with side effects • E.g.: INSERT on a table with no constraints • Operations inside unordered loops (e.g., cursor loops with no order-by)
Generating Batched Forms of Procedures Step 1: Create trivial batched form. Transform: procedure p(r) { body of p} To procedure p_batched(pb) { for each record r in pb { < body of p> < collect the return value > } return the collected results paired with corrsp. params; } Step 2: Optimize query invocations in the trivial batched form
Rule 1A: Rewriting a Simple Set Iteration Loop where q is any batch-safe operation with qb as its batched form Rule 1B Handles return values
Condition for Invocation Return values Operation to Batch // Batched invocation Let s = // Merge the results s where Rule 1C: Batching Conditional Statements
Table(T) t; while(p) { ss1 modified to save local variables as a tuple in t } Collect the parameters for each r in t { sq modified to use attributes of r; } Can apply Rule 1A-1C and batch. for each r in t { ss2 modified to use attributes of r; } Process the results Rule 2: Splitting a Loop while (p) { ss1; sq; ss2; } * Conditions Apply
Rule 2: Pre-conditions • The conditions make use of the data dependence graph • Data Dependence Graph • Nodes: program statements • Edges: dependencies between statements that read/write same location • Types of Dependencies • Flow (Write Read), Anti (Read Write) and Output (Write Write) • Loop-carried flow/anti/output • Dependencies across iterations • Pre-conditions for Rule-2 • No loop-carried flow/output dependencies cross the points at which the loop is split • No loop-carried dependencies through external data (e.g., DB)
Data Dependencies Flow Dependence Anti Dependence Output Dependence Control Dependence Loop-Carried Need for Reordering Statements (s1) while (category != null) { (s2) item-count = q1(category); (s3) sum = sum + item-count; (s4) category = getParent(category); }
Reordering Statements to Enable Rule 2 while (category != null) { int item-count = q1(category); // Query to batch sum = sum + item-count; category = getParent(category); } Splitting made possible after reordering while (category != null) { int temp = category; category = getParent(category); int item-count = q1(temp); sum = sum + item-count; }
Rule 4: Control Dependencies while (…) { item = …; qty = …; brcode = …; if (brcode == 58) { brcode = 1; q(item, qty, brcode); } } Remember the branching decision in a boolean variable while (…) { item = …; qty = …; brcode = …; boolean cv = (brcode == 58); cv? brcode = 1; cv? q(item, qty, brcode); }
Cascading of Rules After applying Rule 2 Table(…) t; while (…) { r.item = …; r.qty = …; r.brcode = …; r.cv = (r.brcode == 58); r.cv? r.brcode = 1; t.addRecord(r); } for each r in t { r.cv? q(r.item, r.qty, r.brcode); } Rule 1C qb(Pitem,qty,brcode(scv=true(t))
Batching Across Multiple Levels while(…) { …. while(…) { ... q(v1, v2, … vn); … } … } while(…) { …. Table t (…); while(…) { ... } qb(t); … } Batch q w.r.t inner loop
Nest and Unnest Operations • μc(T) : Unnest T w.r.t. table-valued column c • vS s(T) : Group T on columns other than S and nest the columns in S under the name s
Implementation and Evaluation • Conceptually the techniques can be used with any language (PL/SQL, Java, C#-LINQ) • We implemented for Java using the SOOT framework for program analysis Evaluation • No benchmarks for procedural SQL • Scenarios from three real-world applications, which faced performance problems • Data Sets: TPC-H and synthetic
Application 1: ESOP Management App Process records from a file in custom format. Repeatedly called a stored procedure with mix of queries, updates and business logic. - Validate inputs - Lookup existing record - Update or Insert Rewritten program used outer-join and merge.
Application 2: Category Traversal Find the maximum size of any part in a given category and its sub-categories. Clustered Index CATEGORY (category-id) Secondary Index PART (category-id) Original Program Repeatedly executed a query that performed selection followed by grouping. Rewritten Program Group-By followed by Join
Application 3: Value Range Expansion Expand records of the form: (start-num, end-num, issued-to, …) Performed repeated inserts. Rewritten program Pulled the insert stmt out of the loop and replaced it with batched insert. ~75% improvement ~10% overhead Log scale
Related Work • Query unnesting • E.g. Kim [TODS82], Dayal [VLDB87], Seshadri et al. [ICDE96], Galindo Legaria et al. [SIGMOD01] • We extend the benefits of unnesting to procedural nested blocks • Graefe [BTW03] highlights the importance of batching in nested iteration plans (a motivation for our work) • Optimizing set iteration loops in database programming languages - Lieuwen and DeWitt [SIGMOD 92] • Also perform program rewriting, but • Do not address batching of queries/procedure calls within the loop • Limited language constructs - No WHILE loops, IF-THEN-ELSE • Parallelizing compilers Kennedy[90], Padua[95] • We borrow and extend the techniques
Conclusion • Automatic rewrite of programs for set-orientation is possible • Combining query rewrite with program analysis is the key • Our experiments on real-world scenarios show significant benefits due to batching Future Work • Cost-based selection of operations to batch • Handling exceptions • Automatically deciding whether an operation is batch-safe • Implementing rewriting for PL/SQL