김종구 대리 기술지원부 마이크로소프트

SQL 서버 성능 문제 해결 및 Locking Internals 김종구대리 기술지원부 마이크로소프트

강사 소개 • 김종구/마이크로소프트 기술지원부 (2000 ~ 현재) • Infrastructure RRE • SQL Support Engineer • KT NeOSS, Auction, 삼성생명 등 주요 사이트 기술지원 • “SQL Memory Architecture” 등 TechNet 세미나 진행

목적 • 성능 관련 문제 발생 시 데이터 수집을 위한 PSSDIAG 툴 소개 • 성능 문제 해결을 위한 전반적인 방법론 소개 및 관련 지식 전달 • Locking Internal에 대한 지식 전달

대상 기술범위 • PSSDIAG Tool • Read80Trace Tool • SQL Server Performance Troubleshooting Methodology • Locking Internals

이 주제를 이해하는 데 필요한 지식 • Performance 문제 해결을 위한 기본 데이터에 대한 이해 - sysprocesses 의 waittype 및 lastwaittype - blocking Monitoring방법 - 성능 로그 사용법 및 주요 counter - Profiler 사용법 Level 200

목차 • PSSDiag • SQL Server Performance Troubleshooting • SQL Server Locking Internals and Troubleshooting

PSSDiag

What is PSSDiag? • Wrapper around data collection APIs commonly used in PSS, particularly SQL Server Support (Profiler, blocking script, Perfmon/Sysmon, SQLDIAG, event logs, etc) • Designed to provide double-click simplicity and reduce user error • Get all the needed data the first time, collected at the same time • See KB 830232 for information and download location

Components • GUI configuration utility (PSSDiagConfig.EXE) • Configure types of data to collect • Save configuration in XML document • Can also manage the collector service • Collector (PSSDiag.EXE) • Collector app, consumes configuration file created by GUI • Can run as a service or a console app

GUI configuration utility • Typical use: • Select the target version of SQL Server • Supply authentication mode and relevant info • Select and configure the diagnostics you want to collect • Click Start • The GUI configures and starts the collector service for you • Any output messages from the collector will be displayed in the GUI • Diagnostic files will be written to the output folder (.\OUTPUT by default)

Collector (PSSDiag.EXE) • Can run as a console app or as a service (GUI always runs it as a service) • Can compress files using NTFS or ZIP compression via /Cn parameter • Logging out while running as a console will stop data collection • Works fine from Terminal Services session • Set the output folder via /O parameter • To uninstall, run PSSDiag /U to uninstall the service (if you’ve installed it), then delete the files extracted from the archive

Collector (PSSDiag.EXE) • Optional command line params: • /Cn – (/C1 for NTFS background compression, /C2 for ZIP compression at shutdown) • /Nn – erase, overwrite, or rename output folder • /B YYYYMMDD_HH:MM:SS – Start time • /E YYYYMMDD_HH:MM:SS – Automatic shutdown time • /G – Generic mode. Disables SQL Server-centric mode to permit collection on machines without SQL Server installed • /R – register as a service • /U – deregister service

Collecting Data From a Clustered SQL Instance • Two options here: • Use the default machine name (“.”) when running on a cluster node, and PSSDiag will collect data from all SQL Server virtual servers on the cluster • Supply a virtual SQL Server name for the machine name (leave the instance name set to “*”) and PSSDiag will collect from that virtual server only

Collecting Data from a Remote Server • Supply the machine name when starting the GUI • Configuration is disabled when connecting to a remote machine • Output files will be written to remote machine • Output path must exist or be creatable • Run the collector as a console app if you wish to capture diagnostic files to the local machine: • Profiler trace is always a server-side collection • Output path must exist on local machine and the SQL Server • Never capture Profiler to a UNC or network drive • .TRC files will be copied to local OUTPUT dir on shutdown • Other data types (blocker, perfmon, etc) will be captured wherever the collector console app is running

Scheduling Collection Start/Stop • When running as a service: • Schedule an NT job to run PSSDiagControl START to start the service • Schedule an NT job to run PSSDiagControl STOP to stop it • When running as a console app: • Schedule the console app to start via the NT scheduler • Use /E or /B parameters if start/stop time and day (or relative time) is known • Schedule an NT job to create a file named PSSDiag.STOP in the output folder • Don’t KILL or you will leave Profiler trace running on SQL - see KB 283786 for manually stop and delete trace.

Collecting Data for Extended Periods • Run as a service so that you can log out of the console (the GUI needn’t keep running) • Can schedule start/stop times and delete/rename old output folder • Consider using /C1 to minimize space used by rolled over .TRC and .BLG files • The fact that .TRC must always be collected on the server makes remote collection not very effective for minimizing disk space use on the server

PSSDiag Impact on Server Performance • Impact of PSSDiag.EXE itself is negligible • Perf impact of collection equals sum of costs of diagnostics being collected • Generally dominated by cost of Profiler tracing (use the “Detailed Performance” trace template only when you actually need it) • Blocking detection script and Perfmon shouldn’t have a significant impact

Troubleshooting • Intent is to save time – don’t let it become an obstacle • Main collector log is ##PSSDiag.LOG • All console output also written to application event log • Some Perfmon counter errors are normal (e.g. “Could not add counter: XYZ. - The specified object is not found on the system.”) • All scripts (and profiler trace) are started via osql.exe. Script output files are .OUT files with names like “##server__Run_sp_trace.OUT”

Demo - Using PSSDIAG Tool

SQL Server Performance Troubleshooting

Agenda • Methodology • Resource bottlenecks • Determining your bottleneck • Which queries are responsible • Tuning the identified queries

Methodology • System performance is the result of aggregate performance of all queries • At a high level what type of bottleneck does system have • Find the queries using the most of that resource • Is resource being used efficiently • Always another bottleneck

Common DB Bottlenecks • Synchronization (Locks/Latches) • 224453 is good KB for this • CPU • Single query/single CPU • Single query/parallel • Aggregate query load over all CPUs • IO • Insufficient memory or poor access path? • Memory • SQL Server throttles the number of concurrently executing queries with sorts/hashes

Performance Monitor • Synchronization • Locks: Lock Waits/sec, Lock Wait Time (ms) • Latches: Total Latch Wait Time, Latch Waits/sec • CPU • Sustained rates at 75+ percent • Compiles/sec, Recompiles/sec

Performance Monitor II • IO • SQL Server View • Page Reads/sec • Readahead pages/sec • Checkpoint & Lazywrites/sec • fn_virtualfilestats • Operating System View • Avg Disk sec/Read or Write • Disk Queue Length is often NOT a good indicator • Memory • Memory Grants Pending • Max/Granted Workspace Memory

DBCC SQLPERF(WAITSTATS) • Number of waits & total wait time for each waittype • Example • PAGEIOLATCH_SH 64.0 7748.0 761.0 • PAGEIOLATCH_UP 24.0 2381.0 10.0 • PAGEIOLATCH_EX 21.0 2274.0 60.0 • KB 822101 for description of the various waittypes • Take delta between snapshots, or clear with DBCC SQLPERF(WAITSTATS, CLEAR)

SQL Server Trace (Profiler) • Use the sp_trace procedures instead of the GUI • Significantly less performance impact • Won’t “drop” events if rate is high • Write trace files to fast drive(s) • Configure what you want to trace in GUI and use File – Script Trace option • PSS prefers PSSDiag option for one step collection

Analyzing Trace Data • Use GUI option to sort by a column • Use fn_trace_gettable to load and query the data • Problems • Time consuming and generally requires you to have a specific problem in mind • Individual queries identified may not be relevant to the problem • Too manual—easy to miss things

Introducing Read80Trace • All text is “normalized” to remove comments, white space & parameters • Database only stores the text of the first “unique” entry • Detail data is loaded in normalized format to facilitate joins, reduce redundant data • Connections • Batches/UniqueBatches • Statements/UniqueStatements • Plans/UniquePlans • See KB 887057 for download location

Demo • Using Read80trace Tool • - Queries using the most CPU • - Query that changes execution plans • - Comparing “good” trace with “bad” trace

Looking at Specific Queries • Does performance change correlate with plan differences • Different execution plan • Different amount of work performed • Majority of bad plans caused by poor cardinality estimates • Use STATISTICS PROFILE to find the problematic part of plan

Using STATISTICS PROFILE • Everything except Rows/Executes is compile time information • Executes column reflects parallelism • For example, scan that executes 4 times • Compare Rows with (EstimateRows * EstimateExecutes) • Find most deeply nested operator where the error originates; it propagates up the tree from there

Demo • - Using STATISTICS PROFILE

Acceptable Cardinality Error • Reasonable margin of error depends on operator • Loop joins – within 2x range pretty reasonable • Merge join – 5x is reasonable • Hash joins – size of build input (first table below join) affects hash table memory size. Probe input (second table) doesn’t matter much • Sorts – size affects memory grant and 2x is reasonable • Differences in estimates may not be bugs • Should the optimizer do better given the available statistics

Cardinality Estimation • Histograms contain most useful information for predicates with literal • au_lname = ‘Smith’ • StartTime BETWEEN ‘2003-01-01’ AND ‘2003-12-31’ • Density information • ColA = ‘x’ and ColB = ‘y’ • Equijoins • Equality predicates with variables • Auto create statistics only creates single column statistics • Other estimates usually based on fixed selectivity estimates • Percentage based on the comparison operator • See Inside SQL Server 2000 for table of values

Auto Statistics • Samples a percentage of the data • Minimum sampling ~4MB of data • Maximum a function of rows in table • If you have issues with bad plans • Update statistics (sampled) – if this fixes it then histogram is probably out of date or auto update not triggered soon enough for your query • If fullscan required to fix a problem use DBCC SHOW_STATISTICS to see how much the density values differ

Limitations to Consider • T-SQL variable (as opposed to parameter) • Value not known at compile time so can’t use histogram • Builtin functions • No statistics available • Multi-statement table-valued functions • No statistics available • Table variables • No statistics available • Temp table & recompile uses statistics

SQL Server Locking Internals and Troubleshooting

Introduction • UMS Scheduling and Workers • What is a SQL Resource • How SQL Server really waits on a ‘resource’ • Blocking • Crabbing • Fetch Rates • Physical vs Logical Protection • Scans and Lock Classes • Lock Escalation

UMS Scheduling • User Mode Scheduling • Precise Resource Usage • Preemptive vs Non-Preemptive

SQL Server Workers • What is a worker? • Worker Pool • Request bound to Worker for (Life Time) - Example • Target Setting - Example • Not Dynamic • Division of Workers

Connection Bound To Scheduler • Assignment: Scheduler with fewest users • Life time: Bound for connection life time • Workload Matters

A SQL Resource • Reader / Writer • Waiters list • FIFO Maintained

Blocked / Blocking • Blocked • Worker tied up in block scenario • Attentions / Query Timeout (Blocking) • Diagnostics • sysprocesses • syslockinfo • dbcc opentran - Example • Profiler and SQLTrace • xact_abort • Performance Monitor • dbcc sqlperf(waitstats) • dbcc pss – Example • PSSDiag

Resource Crabbing • Maintains Data Stability • Acquire Next • Release Previous

Client Fetch Rate Matters • Sending Results • How Crabbing Applies • Mobile Links • Lock Scope • Never Perfect • Preemptive Network Writes

Batch Size Can Matter • Touching Several Objects • External Logic • XProcs • COM Objects • The Transaction Log • Sector Alignment • Flush to LSN

DTC Locking • All Under One Roof • SQL 7.0 Behavior • SQL 2000 Behavior • Deadlocking • SPID = -1 or SPID = -2

Physical versus Logical Protection • Physical = Latch • Reading Page Into Memory • Writing A Page To Disk • Inserting A Row • Internal Data Structures • Logical = Lock • Inserting / Modifying A Row

김종구 대리 기술지원부 마이크로소프트

김종구 대리 기술지원부 마이크로소프트

Presentation Transcript