Parallel Execution Plans

Parallel Execution Plans Joe Chang jchang6@yahoo.com www.sql-server-performance.com/joe_chang.asp

Parallel Execution Plans • Allows single query to use multiple processors • Query should run faster but may consume more resources • Example • 1 thread: 10 sec run time, 10 CPU-sec • 2 threads: 6 sec run time, 12 CPU-sec

Parallel Execution Configuration • Cost Threshold For Parallelism • Minimum query plan threshold for considering queries for parallel execution • Default 5: Considering increasing to 20-50 for new systems • Max Degree of Parallelism • Default 0: Can use all available processors • SQL Server determines level based on available memory and recent CPU usage

Parallel Plan Operators The Distribute Streams operator consumes a single input stream of records and produces multiple output streams. The record contents and format are not changed. Each record from the input stream appears in one of the output streams. This operator automatically preserves the relative order of the input records in the output streams. Usually, hashing is used to decide to which output stream a particular input record belongs. The Repartition Streams operator consumes multiple streams and produces multiple streams of records. The record contents and format are not changed. Each record from an input stream is placed into one output stream. If this operator is order-preserving, then all input streams must be ordered and merged into several ordered output streams. The Gather Streams operator consumes several input streams and produces a single output stream of records by combining the input streams. The record contents and format are not changed. If this operator is order-preserving, then all input streams must be ordered.

Execution Plan Cost Formulas Index Seek – Plan Formula I/O Cost = 0.006328500 + 0.000740741 per additional page (≤1GB) = 0.003203425 + 0.000740741 per additional page (>1GB) CPU Cost = 0.000079600 + 0.000001100 per additional row Bookmark Lookup – May have changed ? I/O Cost = multiple of 0.006250000 (≤1GB) = multiple of 0.003124925 (>1GB) CPU Cost = 0.0000011 per row Table Scan or Index Scan I/O: 0.0375785 + 0.0007407 per page CPU: 0.0000785 + 0.0000011 per row Table Scan or Index Scan IUD I/O Cost ~ 0.01002 – 0.01010 (>100 rows) IUD CPU Cost = 0.000001 per row

Cost Interpretation • Time in seconds? CPU time? Too fast for 7200RPM disk random I/Os. 0.0062500sec -> 160/sec 0.000740741 ->1350/sec (8KB) ->169/sec(64K)-> 10.8MB/sec About right for 1997 sequential disk transfer rate? S2K BOL: Administering SQL Server, Managing Servers, Setting Configuration Options: cost threshold for parallelism Opt Query cost refers to the estimated elapsed time, in seconds, required to execute a query on a specific hardware configuration.

Test Table CREATE TABLE M3A_20 ( GroupID int NOT NULL, ID int NOT NULL, ID2 int NOT NULL, ID3 int NOT NULL, ID4 int NOT NULL, sID smallint NOT NULL, bID1 bigint NOT NULL, bID2 bigint NOT NULL, bID3 bigint NOT NULL, rMoney money NOT NULL, rDate datetime NOT NULL, rReal real NOT NULL, rDecimal decimal (9,4) NOT NULL, CONSTRAINT [PK_M3A_20] PRIMARY KEY CLUSTERED ( [GroupID], [ID] ) WITH FILLFACTOR = 100 ) GO

Data Population Script 1 SET NOCOUNT ON DECLARE @BatchTotal int, @BatchSize int, @TotalRows int, @BatchStart int, @BatchEnd int, @BatchRow int, @I int, @RowsPerPage bigint , @Card int , @DistinctValues int SELECT @BatchStart=1, @BatchEnd=1000, @BatchTotal=1000, @BatchSize=100000, @RowsPerPage=100, @Card=100000 SELECT @TotalRows=@BatchTotal*@BatchSize SELECT @I=(@BatchStart-1)*@BatchSize+1, @DistinctValues=@TotalRows/@Card WHILE @BatchStart <= @BatchEnd BEGIN BEGIN TRANSACTION SELECT @BatchRow = @BatchStart*@BatchSize WHILE @I <= @BatchRow BEGIN INSERT M3A_20 (GroupID, ID, ID2, ID3, ID4, sID, bID1, bID2, bID3, rMoney, rDate, rReal, rDecimal) VALUES ( 1, @I, @TotalRows-@I+1, (@I-1)/@Card+1, (@TotalRows-@I)%@Card+1, @I%32768, @I, (@I-1)%@Card+1, 1+(@I-1)*@RowsPerPage/@TotalRows+((@I-1)*@RowsPerPage)%@TotalRows, 10000*rand(), DATEADD(hour,@I%3000000,'1900-01-01'), 10000*rand(), 10000*rand() ) IF @@ERROR > 0 BEGIN GOTO B END SET @I = @I+1 END COMMIT TRANSACTION CHECKPOINT PRINT CONVERT(varchar,GETDATE(),121) + ', row ' + CONVERT(varchar,@BatchRow) SET @BatchStart = @BatchStart+1 END B: IF @@TRANCOUNT > 0 COMMIT TRANSACTION PRINT '01 Complete ' + CONVERT(varchar,GETDATE(),121) + ', row ' + CONVERT(varchar,@BatchRow) + ', Trancount ' + CONVERT(varchar(10),@@TRANCOUNT)

Data Population Script 1 Notes • Double While Loop • Each Insert/Update/Delete statement is an implicit transaction • Gets separate transaction log entry • Explicit transaction – generates a single transaction log write (max 64KB per IO) • Single TRAN for entire loop requires excessively large log file • Inserts are grouped into intermediate size batches

Data Population Scripts 2 Primary table populated using single row inserts in a WHILE loop, Additional tables populated with INSERT / SELECT statement Single row inserts ~20-30K rows/sec INSERT / SELECT statement ~100K+ rows/sec DECLARE @L int SELECT @L = 1 WHILE @L <= 3 BEGIN INSERT M3A_11 (GroupID,ID,ID2,ID3,ID4,sID,bID1,bID2,bID3,rMoney,rDate,rReal, rDecimal) SELECT TOP 500000 GroupID, ID, 1500001-ID, ID3, ID4, sID, bID1, bID2, bID3, rMoney, rDate, rReal, rDecimal FROM M3A_20 WHERE GroupID = 1 AND ID BETWEEN (@L-1)*500000+1 AND @L*500000 SELECT @L = @L + 1 CHECKPOINT PRINT '11 Step ' + CONVERT(varchar,@L) + ', ' + CONVERT(varchar,GETDATE(),121) END UPDATE STATISTICS M3A_01 (PK_M3A_01) WITH FULLSCAN CREATE STATISTICS ST_01 ON M3A_01 (ID) WITH FULLSCAN, NORECOMPUTE

Index Seek Plans Many rows returned, Non-parallel plan Cost: 9.34 Parallel Execution disabled Cost: 9.82 Cost: 4.94 Parallel Plan

Index Seek Details Non-parallel plan Parallel plan

Index Seek – Non-parallel Cost assigned to SELECT Index Seek, 1M rows in 11,115 pages (81 bytes/row, 90% Fill) I/O cost is: 8.2365 CPU Cost is 1.1000785 Cost & sub-tree Cost is correct, I/O & CPU is ½ of correct value

Index Seek – Parallel Plan No cost assigned to SELECT Index Seek I/O and CPU cost ½ of non-parallel plan

Index Seek with Aggregate 4 3 2 1

Index Seek Aggregate Parallel Plan Details 4 2 3 1

Table Scan Cost: 9.01 Cost: 8.26

Table Scan Details Non-parallel plan Parallel plan I/O cost same CPU cost ½ of non parallel plan

Table Scan Details Non-parallel plan Parallel plan No cost No cost on Select I/O cost same CPU cost ½ of non parallel plan

Parallel Plan Cost Formulas Patterns • CPU costs are ½ of non-parallel plan • Index Seek I/O cost are also ½ • Scan I/O cost is same as non-parallel plan • Parallel plan costs are based on 2 processors • Actual number of processors determined at runtime • Overhead operations • Distribute, Repartition & Gather Streams

Hash Join 200,000 rows 15 byte OS row size Cost: 6.50 Cost: 4.79

Hash Join Details Non-parallel plan Parallel plan

Hash Join – Non-parallel plan

Hash Join – Parallel Plan 4 3 2 1 2 4 3 1

Hash Join with I/O Cost 900,000 rows MAXDOP 1 Cost 85.1 Cost 74.1

Hash Join – Join I/O Cost 730,000 rows 740,000 rows

Hash Join - Bitmap

Hash Join Cost Formula Index Seek – Plan Formula I/O Cost = 0.006328500 + 0.000740741 per additional page (≤1GB) = 0.003203425 + 0.000740741 per additional page (>1GB) CPU Cost = 0.000079600 + 0.000001100 per additional row Hash Join CPU Cost = 0.017750000 base + 0.0000001749 (2-30 rows) + 0.0000000720 (100 rows) 0.000015091 per row 0.000015857 (parallel) + 0.000001880 per row per 4 bytes in OS + 0.000005320 per additional row in IS I/O Cost = 0.000042100 per row over 64MB (Row Size+8) 0.0000036609 per 4 byte over 15B

Parallel Cost Formula Base Cost 0.028500 Repartition Stream Cost per row = 0.0000024705 Base (15 Bytes) + 0.000000759 per additional 4 Bytes Gather Stream Cost per row = 0.0000018735 Base(15) + 0.000000759 per additional 4 Bytes Dispatch

Loop Join

Loop Join Details Non-parallel plan Outer Source Parallel plan Outer Source

Loop Join Details Inner Source cost identical for both non-parallel and parallel plans

Loop Join Details Non-parallel plan Parallel plan

Merge Join

Merge Join Details Non-parallel plan Parallel plan

Index Seek + Aggregate Test Xeon 2.4GHz/512K Opteron2.2GHz 1M

Index Seek + Aggregate Test, Itanium 2 Itanium 2 1.5GHz/6M

Index Seek + Aggregate Test, SUM(INT) Itanium 2 1.5GHz/6M

Index Seek + Aggregate Test, NULL Itanium 2 1.5GHz/6M

Loop Join, COUNT(*) Itanium 2 1.5GHz/6M

Hash Join, COUNT(*) Itanium 2 1.5GHz/6M

Merge Join, COUNT(*) Itanium 2 1.5GHz/6M

General Recommendations • Useful in DW, ETL, and maintenance activities • Use judgment on transactions processing • Is throughput more important • Or faster expensive queries • Increase Cost Threshold from 5 to 20-50 • Limit MAXDOP to 4 • Verify or limit parallelism on Xeon systems with Hyper-Threading enabled

Additional Information www.sql-server-performance.com/joe_chang.asp SQL Server Quantitative Performance Analysis Server System Architecture Processor Performance Direct Connect Gigabit Networking Parallel Execution Plans Large Data Operations Transferring Statistics SQL Server Backup Performance with Imceda LiteSpeed jchang6@yahoo.com

Parallel Execution Plans

Parallel Execution Plans

Presentation Transcript

Parallel Execution Plans

Best practices on managing parallel execution in concurrent environments

Understanding SQL Server Query Execution Plans

Parallel Execution Models for Future Multicore Architectures

Execution plans

Identifying and Fixing Performance Problems Using Execution Plans

GPU Parallel Execution Model / Architecture

Understanding Graphical Execution Plans Level 200

Optimus: A Dynamic Rewriting Framework for Data-Parallel Execution Plans

Executing Explain Plans and Explaining Execution Plans

Customizable Parallel Execution of Scientific Stream Queries

Execution plans and tools

Panel: Development and Execution of Recovery Plans

Under the hood: Query Optimization, Query Execution plans

Rethinking Parallel Execution

GPU Parallel Execution Model / Architecture

Understanding SQL Server Query Execution Plans