490 likes | 981 Views
Turning DBQL Data into Actionable Information. Steve Woodrow. Teradata America’s Professional Services Performance and Workload Management COE. Overview. Database Query Log (DBQL) Basics DBQL Recommendations DBQL and TD 12 Actionable DBQL Application Performance Workload Management
E N D
Turning DBQL Data into Actionable Information Steve Woodrow Teradata America’s Professional Services Performance and Workload Management COE
Overview • Database Query Log (DBQL) Basics • DBQL Recommendations • DBQL and TD 12 • Actionable DBQL • Application Performance • Workload Management • Bonus Queries
DBQL - Tables DBQLStepTblquery step timings, row counts, CPU and I/O, step actions, etc DBQLogTblFor individual queries:query origination start, stop and other timingserror codesSQL (truncated)CPU AND I/O values DBQLObjTblTracks usage of database objects(tables, columns, indexes, database,etc) DBQLSummaryTblSummary of (short-running) queries:high-volume queries:query origination response time, cpu and io sums,query counts DBQLSqlTblHolds full SQL text DBQLExplainTblHolds full EXPLAIN text
Detail Table - DBQLogTbl ProcID CollectTimeStamp QueryID UserID AcctString ExpandAcctString SessionID LogicalHostID RequestNum InternalRequestNum LogonDateTime AcctStringTime AcctStringHour AcctStringDate LogonSource AppID ClientID ClientAddr QueryBand ProfileID StartTime FirstStepTime FirstRespTime LastStateChange NumSteps NumStepswPar MaxStepsInPar NumResultRows TotalIOCount AMPCPUTime ParserCPUTime UtilityByteCount UtilityRowCount ErrorCode ErrorText WarningOnly DelayTime AbortFlag CacheFlag StatementType QueryText NumOfActiveAMPs MaxAMPCPUTime MaxCPUAmpNumber MinAmpCPUTime MaxAmpIO MaxIOAmpNumber MinAmpIO SpoolUsage WDID OpEnvID SysConID LSN INTEGER NoClassification WDOverride SLGMet ExceptionValue FinalWDID TDWMEstMaxRows TDWMEstLastRows TDWMEstTotalTime TDWMAllAmpFlag TDWMConfLevelUsed TDWMRuleID UserName DefaultDatabase AMPCPUTimeNorm ParserCPUTimeNorm MaxAMPCPUTimeNorm MaxCPUAmpNumberNorm MinAmpCPUTimeNorm EstResultRows EstProcTime EstMaxRowCount ExtraField1 ExtraField2 ExtraField3 ExtraField4 ExtraField5 ExtraField6 ExtraField7 Procid and Queryid: Ties all DBQL tables together
SQL Table - DBQLSqlTbl • One Row or more are written to DBC.DBQLSQLTBL for each SQL statement • Large statements can cause multiple rows to be written in order to log the full query text BEGIN QUERY LOGGING WITH SQL, OBJECTS, STEPINFO, EXPLAIN limit SQLTEXT=0 ON ALL ACCOUNT = ‘$H1$WG1R&S&D&H’; Select * From DBC.DBQLSQLTBL;
Object Table - DBQLObjTbl • DBQL logs one row in DBQLObjTbl for each physical data object used by the optimizer to satisfy the query • An object can be: BEGIN QUERY LOGGING WITH SQL, OBJECTS, STEPinfo, EXPLAIN limit SQLTEXT=0 ON ALL ACCOUNT = ‘$H1$WG1R&S&D&H’;
StepInfo Table - DBQLStepTbl • One row per step is written to DBC.DBQLStepTbl • Info: Step number, step name, start time, end time, row count • Diagnostic in nature, shouldn’t be left on BEGIN QUERY LOGGING WITH OBJECTS, SQL, StepInfo, Explain ON DBC; END QUERY LOGGING ON DBC; Select * from DBC.DBQLStepTbl; * * ExtraField1 = RowsWComprColumns Available in V2R6.1
Explain Table - DBQLExplainTbl • One row (or more) is written to DBC.DBQLExplainTbl • Large explains can cause multiple rows to be written in order to log the full explain text. • Diagnostic in nature, shouldn’t be left on BEGIN QUERY LOGGING WITH OBJECTS, SQL, StepInfo, Explain ON DBC; END QUERY LOGGING ON DBC; Select * from DBQLExplainTbl;
Logging Recommendations • ALL usage should be logged in DBQL 24 X 7 • Flush the DBQL Caches at default rate of 10 minutes • Recommend logging usage in DBQL at the AccountString level • This may mean some work needs to be done to the AccountStrings to ensure the appropriate level of logging is done for each type of usage • Our recommendation for AccountString setup:$H1$WXYZ&S&D&H Where… • $H1$ = The Performance Group • WXYZ = The Work Group Identifier • &S&D&H = ASE Variables – Session, Date, and Hour
Logging Recommendations • Three types of logging Classified by usage • Type 1 – Long running (Adhoc, Report Apps, Batch Loads) • Example: BEGIN QUERY LOGGING WITH SQL, OBJECTS LIMIT SQLTEXT=0 ON ALL ACCOUNT = ‘acctname’; • Type 2 - ONLY short running (subsecond) KNOWN work (TPumps, Apps w/10000s of Tactical Queries ) • Example: BEGIN QUERY LOGGING LIMIT SUMMARY=10,50,100 CPUTIME ON ALL ACCOUNT = ‘acctname’; (V2R6) • Type 3 - Mostly short (subsecond) known work with occasional long running unknown work (Tactical Apps with Some Larger or Unknown Queries) • Example: BEGIN QUERY LOGGING LIMIT THRESHOLD=100 CPUTIME AND SQLTEXT=10000 ON ALL ACCOUNT = ‘acctname’; (V2R6)
Recommendations for Historic DBQL Data • Recommendation is to remove the DBQL data from the DBC database nightly • If space is a concern consider changing retention • Recommendation for DDL on Historical DBQL * • Create new date column populated from Starttime on all tables • Put PPI on this date to provide better performing analysis • Join between DBQL tables on ProcId, QueryId and the Date • End logging on at least one user to dump the cache to ensure all DBQL history is captured • Begin logging on the user again • Retain 13 months Detail & Summary, 3 months SQL & Object, Step and Explain as needed • Keep history tables similar to DBC DBQL tables for other tools like Teradata Workload Analyzer * The standard maintenance process (for V2R6 and TD12) is now available on Teradata.com
What’s New In V12 • Column Name Changes in DBQLogTBL • Extrafield1 to InternalRequestNum • TotalCPUTime to AmpCPUTime (Linux is recorded in Milliseconds, Unix is recorded in hundredths) • Extrafield5 to StatementType • HotAmp1CPU to MaxAmpCPUTime • HotCPUAmpNumber to MaxCPUAmpNumber • LowAmp1CPU to MinAmpCPUTime • HotAmp1IO to MaxAmpIO • HotIOAmpNumber to MaxIOAmpNumber • LowAmp1IO to MinAmpIO • WDPeriodID to OPEnvID • FirstRespTime will not be null and LastRespTime column is removed. • New Columns in DBQLOGTBL • LogonSource • LastStateChange • ParserCPUTime
What’s New In V12 • More New Columns in DBQLogTBL • UtilityByteCount (not populated yet) • UtilityRowCount (only populated on the End Loading Statement entry) • SysConId • TDWMRuleID • AmpCPUTimeNorm • ParserCPUTimeNorm • MaxAmpCPUTimeNorm • MaxCPUAmpNumberNorm • MinAmpCPUTimeNorm • EstResultRows • EstProcTime • EstMaxRowCount
What’s New In V12 • Column Name Changes in DBQLStepTBL • HotAmp1CPU to MaxAmpCPUTime • HotCPUAmpNumber to MaxCPUAmpNumber • LowAmp1CPU to MinAmpCPUTime • HotAmp1IO to MaxdAmpIO • HotIOAmpNumber to MaxIOAmpNumber • LowAmp1IO to MinAmpIO • New Columns in DBQLStepTBL • EstCPUCost • RowCount2 • SpoolUsage • MaxAmpSpool • MaxSpoolAmpNumber • MinAmpSpool • StepWD • LSN
What’s New In V12 • New Columns in DBQLStepTBL • UtilityTableId • EstIOCost • EstNetCost • EstHrCost • CPUTimeNorm • MaxAmpCPUTimeNorm • MaxCPUAmpNumberNorm • MinAmpCPUTimeNorm • Columns Changes in DBQLSummaryTBL (Contd) • TotalCPUTime • AmpCPUTime • New Columns in DBQLSummaryTBL (Contd) • ParserCPUTime • AmpCPUTimeNorm • ParserCPUTimeNorm
Application Performance • Use DBQLogTBL to Identify Problem Queries • Fields to focus on • CPU • IO • Spool Usage (Combined with user spool info to proactively identify users close to spool limit) • ErrorCode • Response Time (Identify possible blocking or contention on resources) • Returned rows • Response Time Calculation: • EXTRACT (SECOND FROM lastresptime) + (EXTRACT (MINUTE FROM lastresptime) * 60 ) + (EXTRACT (HOUR FROMlastresptime) *60*60 ) + (86400 * (CAST ( lastresptime AS DATE) -CAST ( starttime AS DATE) ) ) - (EXTRACT (SECOND FROM starttime) + (EXTRACT (MINUTE FROM starttime) * 60 ) + (EXTRACT (HOUR FROM starttime) *60*60 ) )
Application Performance • Suspect Query Report • Ranked on Impact CPU “HotAMP1CPU * (HASHAMP()+1)” • Calculates CPU & IO skew, Product Join Indicator (PJI) and Unnecessary IO Indicator (UII) • Filters out queries that don’t have a certain amount of skew, PJI or UII • UII & PJI – minimum depends on hardware version and/or operating system • Large PJI indicates possible product join • Large UII indicates opportunity for reducing IO PJI Calculation:CASEWHEN totalcputime < 1 OR TotalIOCount = 0 THEN 0 ELSE (a.TotalCPUTime *1000)/a.TotalIOCount UII Calculation:CASEWHEN totalcputime < 1 OR TotalCPUTime = 0 THEN 0 ELSE a.TotalIOCount/(a.TotalCPUTime *1000)
Suspect Query Report by Process • Suspect query report can be modified for different purposes • Application • Time of day • Process
Suspect Query Rpt Post Fixes • Validate success • Iterative process
Application Performance • Use DBQLObjectTBL • Use object table before dropping or altering any DDL to identify use, users and/or times • Use to correlate suspect queries to tables involved • Object tables and index use report • Can be used after creating an index or as a recurring report • Save space and CPU used to maintain under utilized indexes and tables, resulting in reduced batch times • Exclude collect stats and maintenance requests on base table Notes : Prior to V2R6.1 – If the object is an index, its number is logged rather than a name. In V2R6.1 - There will be one row in the table for each column that makes up the index, and the ColumnNames will be logged (one row for each component column)
Evaluate Secondary Index Usage • Remove unused Indexes to Reduce Load Times and Save Space
Application Performance • Use DBQLStepTBL • Use to help diagnose problem queries • Each step contains start and finish times, resources used and skewing if present for each step • Look at differences between estimated row count and actual for each step • Can show blocking (not what was blocked) • All amp operation taking long time even though not using much resources may indicate AWT shortage • Query taking long in aggregation step may indicate aggregate join index will help • Use to identify plan changes • Can be done by # of steps or type of steps • Should be used in conjunction with DBQLExplainTBL
Application Performance • Use DBQLExplainTBL • Use to identify plan changes • Good for any changes to system (upgrades, DDL changes, etc…) • Can scan text for specific types of joins • Customer turns on merge joins, with explain turned on customer is able to identify queries that were effected • When used with DBQLStepTBL can provide more in depth analysis
Workload Management • Workload Management Example (Contd) • Previous graph shows large increase in number of queries running longer than one second at 3 AM • Additional analysis required • Schmon, Resusage or DBQL can identify CPU, IO or skewing issues • If collecting Amp Worker Task information concurrency issues could be determined • DBQL can show if non-tactical work is being done by the workload that is missing its SLAs
Workload Management • Workload Management Example (Contd) • Assume Previous example due to AWT shortage • Need to limit work, great opportunity to use DBQL concurrency report
Workload Management • Determine Success After Changes to Workload management • In TASM rule changes should be quantifiable and DBQL is just the data to do it • Graph below shows example of Workload before and after changes to system throttle
Workload Management • Use DBQL with DBC.TDWMexceptionLog • Identify queries that are exceptions and possible reasons • Identify time periods that might require Workload adjustments
CPU Histogram – Bucket Query • Shows number of queries, by account string, based on the amount of CPU used • 97.9% use < 1/2 CPU second • 99.6% use < 100 CPU seconds • 100% use < 200 CPU seconds • 10% of “Tactical” queries used > 1 CPU second • TAC used > 64% of CPU used • Useful to help set up TASM • Indicates if workloads are truly tactical • High priority for queries estimated to use < 1/2 second provides the majority of queries a better opportunity for a good response time • Medium priority < 100 CPU seconds • Queries > 300 seconds are • Candidates for being terminated as “bad queries” • Exceptions
Response Time Histogram • Average response time for each account string and/or workload • Broken down by the amount of CPU used • Classify by CPU buckets that are site appropriate • Queries that use < 1 CPU second have unacceptable response times • Evident in other areas also • Useful information to help set up and analyze TASM • Giving a very high priority for Queries estimated to use less than 1 second should provide the majority of queries a better opportunity for a good response time
Example Response Time Histogram • Red areas represent potential areas for response time improvement through: • Higher priority • Reclassification • Other action
DBC Suspect Query Macro REPLACE MACRO PMCPINFO.SuspQuerysDBC_Rpt ( topN INT) AS ( /* This Macro can be run anytime to list Poor Performing Querys */ /* That are in the DBC DBQL tables. It assumes that only the current */ /* day is in the DBC DBQL tables. The parameter requires the number */ /* of queries you want to list. If you put in 50, you will get the worst 50. */ SELECT RANK(ImpactCPU) AS CPURank ,UserName , AcctString, ExpandAcctString, appid, clientid, DefaultDatabase, SpoolUsage, StartTime, ParseBlockTime, QryRespTime, NumOfActiveAmps, NumResultRows, SumCPU, SumIO, CpuSkw, IOSkw, PJI, UII, ImpactCPU ,CASE WHEN s.sqltextinfo IS NOT NULL OR s.sqltextinfo = ' ' THEN s.SQLTextInfo ||';' ELSE a.querytext||';' END AS QryText FROM ( SELECT a.UserName, a.ProcId, a.QueryId, a.expandacctstring, appid, clientid, CAST(EXTRACT(HOUR FROM ((a.firststeptime - a.StartTime) HOUR(2) TO SECOND(2) ) ) * 3600 + EXTRACT(MINUTE FROM ((a.firststeptime - a.StartTime) HOUR(2) TO SECOND(2) ) ) * 60 + EXTRACT(SECOND FROM ((a.firststeptime - a.StartTime) HOUR(2) TO SECOND(2) ) ) AS INTEGER) AS ParseBlockTime ,((lastresptime-Starttime) DAY(4) TO SECOND) AS QryRespTime , NumofActiveAmps, NumResultRows , a.TotalCPUTime AS SumCPU , a.TotalIOCount AS SumIO, CASE WHEN totalcputime < 1 OR (TotalCPUTime / (HASHAMP()+1)) =0 THEN 0 ELSE HotAmp1CPU/(TotalCPUTime / (HASHAMP()+1)) END (DEC(8,2)) AS CPUSKW, CASE WHEN totalcputime < 1 OR (TotalIOCount / (HASHAMP()+1)) =0 THEN 0 ELSE HotAmp1IO/(TotalIOCount / (HASHAMP()+1)) END (DEC(8,2)) AS IOSKW, CASE WHEN totalcputime < 1 OR TotalIOCount = 0 THEN 0 ELSE (a.TotalCPUTime *1000)/a.TotalIOCount END AS PJI, CASE WHEN totalcputime < 1 OR TotalCPUTime = 0 THEN 0 ELSE a.TotalIOCount/(a.TotalCPUTime *1000) END AS UII, a.HotAMP1CPU * (HASHAMP()+1) /*no. of Amps*/ AS ImpactCPU, a.acctstring , a.Starttime , a.querytext, a.logdate, a.defaultdatabase, a.SpoolUsage FROM PMCPINFO.DBQLogTbl_DBC a WHERE ( CpuSkw > 1.25 /*2.0*/ OR IOSkw > 1.25 /*2.0*/ /* Start with 2.0 as a threshold for CoExistence systems */ OR UII > 6 /* UnNecessary IO Indicator - may offer an opportunity for Indexing */ OR pji > 6 /* Could use a lower threshold (3) for Older Node systems */ ) GROUP BY 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23 ) AS a LEFT OUTER JOIN (SELECT Procid, Queryid, SUBSTR(sqltextinfo,1,15000) AS sqltextinfo FROM PMCPINFO.DBQLSQLTbl_DBC s WHERE sqlrowno = 1) AS s ON s.ProcID = a.ProcID AND s.QueryID = a.QueryID QUALIFY RANK(ImpactCPU) <= :TOPN ORDER BY 1; );
TDWM Exceptions joined to DBQL SQL SELECT DL.logdate ,DL.starttime, DL.queryid, DL.Username, DL.expandacctstring, EX.WDName, EX.WDPeriod, EX.ClassificationTime, EX.ExceptionTime, EX.NewWDName, SumCPU, SumIO, CpuSkw, IOSkw, PJI, UII, ImpactCPU, trim(EX.ErrorText) as ExceptionText, Case When dl.sqltextinfo is not null or dl.sqltextinfo = ' ' Then dl.SQLTextInfo||';' Else dl.querytext||';' End as QryText FROM (select cast (a.starttime as date) as logdate, a.procid, a.queryid, a.Username, a.AcctString, a.expandacctstring, ((firstresptime-starttime) day(4) to second) as QryRespTime, NumofActiveAmps, NumResultRows, a.totalcputime as SumCPU, a.TotalIOCount as SumIO, Case when (totalcputime / (hashamp()+1) ) =0 then 0 else HotAmp1CPU /(TotalCPUTime / (hashamp()+1) ) End (dec(7,2)) as CpuSkw, Case when (TotalIOCount / (hashamp()+1) ) =0 then 0 else HotAmp1IO /(TotalIOCount / (hashamp()+1) ) End (dec(7,2)) as IOSkw, Case when a.TotalIOCount = 0 then 0 else (a.totalcputime *1000)/a.TotalIOCount end (dec(7,2)) as PJI, Case when a.totalcputime = 0 then 0 else a.TotalIOCount /(a.totalcputime *1000) end (dec(7,2)) as UII, a.HotAmp1CPU * (hashamp()+1) as ImpactCPU, a.acctstring, a.starttime ,a.querytext , s.sqltextinfo, a.defaultdatabase , a.SpoolUsage from dbc.dbqlogtbl a left outer join (Select Procid, Queryid,cast(collecttimestamp as date) as logdate, substr(sqltextinfo,1,15000) as sqltextinfo from dbc.DBQLSQLTbl s WHERE cast (collecttimestamp as date) between date and date and sqlrowno = 1) as s on s.ProcID = a.ProcID and s.QueryID = a.QueryID and s.logdate = cast(a.starttime as date)) DL, (Select cast(Bufferts as date) as Logdate, ProcID, QueryId, BufferTS, UserName, el.WDID, wl.WLCName as WDName, wp.Description as WDPeriod, el.WDPeriodID, ClassificationTime, ExceptionTime, wd.WLCName as NewWDName, NewWDID, ErrorText, ExtraInfo from dbc.TDWMExceptionLog el left outer join tdwm.wlcdefs wl on wdid=wl.wlcid left outer join tdwm.wlcdefs wd on NewWDID=wd.wlcid left outer join tdwm.WlcPeriods wp on WDPeriodID=wp.PeriodId and WDID=wp.WLCID where Cast(BufferTS as Date) between date and date and extract(hour from bufferts) > 07 and wd.removedate=0 and wl.removedate=0 and wp.removedate=0) ex where ex.Logdate=dl.Logdate and ex.Procid=DL.procid and ex.queryid=DL.queryid and qrytext not like '%COLLECT%' order by 1, 2,5;
Suspect Table Report LOCKING ROW FOR ACCESS SEL o.ObjectDatabaseName ,o.ObjectTableName ,SUM(NumOfQrys) AS TotNumOfQrys ,SUM(sumcpu) AS TotCPUConsumed ,SUM(CASE WHEN (cpuskw > 1.25 /*2.0*/ OR ioskw > 1.25 /*2.0*/ OR PJI > 6 OR UII >6 ) THEN sumcpu ELSE 0 END) AS TotSkdPJLSQrys ,SUM(CASE WHEN (cpuskw > 1.25 /*2.0*/ OR ioskw > 1.25 /*2.0*/ OR PJI > 6 OR UII >6 ) THEN sumcpu ELSE 0 END) AS TotSkdPJLSCPU ,SUM(CASE WHEN (cpuskw > 1.25 /*2.0*/ OR ioskw > 1.25 /*2.0*/ ) THEN sumcpu ELSE 0 END) (FLOAT) AS SkewedCPU ,SUM(CASE WHEN (cpuskw > 1.25 /*2.0*/ OR ioskw > 1.25 /*2.0*/ ) THEN 1 ELSE 0 END) (FLOAT) AS SkewedNumQrys ,SUM(CASE WHEN (PJI > 6) THEN sumcpu ELSE 0 END) (FLOAT) AS ProdJnCPU ,SUM(CASE WHEN (PJI > 6) THEN 1 ELSE 0 END) (FLOAT) AS ProdJnNumQrys ,SUM(CASE WHEN (UII > 6) THEN sumcpu ELSE 0 END) (FLOAT) AS LgScansCPU ,SUM(CASE WHEN (UII > 6) THEN 1 ELSE 0 END) (FLOAT) AS LgScansNumQrys FROM dba_tables.DBQLObjTbl_hst o , (SEL procid , queryid , logdate ,ZEROIFNULL( TotalCPUTime ) (FLOAT) AS sumcpu ,1 (FLOAT) AS NumOfQrys ,CASE WHEN totalcputime < 1 OR (TotalCPUTime / (HASHAMP()+1)) =0 THEN 0 ELSE HotAmp1CPU/(TotalCPUTime / (HASHAMP()+1)) END (DEC(8,2)) AS CPUSKW ,CASE WHEN totalcputime < 1 OR (TotalIOCount / (HASHAMP()+1)) =0 THEN 0 ELSE HotAmp1IO/(TotalIOCount / (HASHAMP()+1)) END (DEC(8,2)) AS IOSKW ,CASE WHEN totalcputime < 1 OR TotalIOCount = 0 THEN 0 ELSE (TotalCPUTime *1000)/TotalIOCount END AS PJI ,CASE WHEN totalcputime < 1 OR TotalCPUTime = 0 THEN 0 ELSE TotalIOCount/(TotalCPUTime *1000) END AS UII FROM dba_tables.DBQLogTbl_hst WHERE logdate BETWEEN '2007-10-22' AND '2007-10-28') AS a WHERE a.logdate BETWEEN '2007-10-22' AND '2007-10-28' AND o.objecttype = 'T' AND a.procid = o.procid AND a.queryid = o.queryid AND a.logdate = o.logdate GROUP BY 1,2 order by TotSkdPJLSCPU DESC