Analytic Modeling Techniques for Predicting Batch Window Elapsed Time
Debbie Sheetz, Sr. Staff Consultant, BMC Software, Waltham, MA USA
Presentation Overview • Motivation for Batch Window/Elapsed Time Analysis • History of Batch/Elapsed Time vs. Interactive/Response Time • Computing • Analysis techniques • Methodology • Time Period Selection • Resource Performance Objectives • Characterizing Work • Elapsed Time Analysis Techniques • Four Case Studies • Daily Batch Job • Monthly Batch Processing • Batch Processing Sequence • Batch Processing with a Large Number of Jobs
History of Batch Window/Elapsed Time Analysis • Development of Capacity Planning tools was motivated by the advent of interactive computing • User at a terminal cared about transaction response time • Batch was run overnight and was considered “internal” • The first analytic models had three types of processing representations available: 2 interactive and 1 batch • Interactive Time Sharing (TS) • Interactive Transaction Processing (TP) • Batch Processing (BP) • Usually analysis and modeling focused on the peak of the interactive workload(s), and batch was represented as a low-priority stream of work that utilized the resources not required for the interactive processing • The overnight period was then used exclusively for “production” batch and the “interactive” day would begin again the next morning • Throughout, there continued to be particular batch work whose completion was critical to the business and thus had a required time for completion • Examples are reconciliation of financial accounts, database updates, or billing cycles
Characteristics of Critical Batch Work • Interactive vs. critical batch work • Key similarity is both have a performance objective • “under .1 second response time” vs. “finish by 8 AM” • Key difference is that interactive work is typically evaluated one transaction at a time, whereas critical batch is typically a very large number of transactions submitted all at once, and the completion of the last request is the only timing of interest • Critical batch always has explicit timing and frequency • Examples: A backup is scheduled to be performed daily at midnight or a billing cycle is executed monthly/weekly • Often both the start and completion times are subject to business constraints • For example, account reconciliation at a financial institution cannot start until the “interactive” day has completed, and the reconciliation must be completed before the next “interactive” day begins
Characteristics of Critical Batch Work • The scheduling is often explicit about what state the server(s) will be in, i.e. not running any other work • Everything else needs to be “finished” • Open files cannot be backed up, and nightly account reconciliation cannot occur if customers are still entering additional transactions • Maximum available computing resources favor the best possible performance! • For large examples of critical batch work, one (or more) large servers may be required, and these servers are likely dedicated to this processing • Scheduling is usually dictated by business requirements rather than by server availability • Smaller examples like a PC backup or virus scan are intentionally scheduled outside of the “interactive” window, e.g. 3 AM, so the work is done before interactive use is resumed • The amount of critical batch work to be performed and the time available for processing often conflict with each other • Capacity planning analysis is almost always required • Critical batch exists on all platforms – it is the oldest use of computing systems and continues in many forms today • On a PC there are overnight backups and virus scans • On large mainframes/distributed systems there are overnight database updates
Batch Window/Elapsed Time Analysis Methodology – Time Period Selection • One of the most striking differences between batch and interactive work is the length of time selected for analysis • Interactive work usually peaks 1-2 times per day, so one-hour (or 15-minute or 5-minute) peak(s) are selected • The amount of time selected depends only on how SLAs are defined • If the business will spend money to guarantee performance as measured for an hour, then a one-hour peak should be selected; if the business will spend money to guarantee performance for the highest activity for five minutes, then a five-minute peak should be selected • Peaks may vary by day of week, time of month, or time of year, and that determines which day(s) to select the peak periods from • Critical Batch runs until it is done, e.g. 1 hour, 8 hours, 3 days, so the selected time period must match the specific characteristics of the batch work being studied • The frequency of the batch work, e.g. daily, weekly, monthly, determines which particular day(s) will be chosen for study
Batch Window/Elapsed Time Analysis Methodology – Time Period Selection • What is common to all performance analysis is that historical data is of the utmost importance as it enables analysis of • Performance results trends, e.g. that the batch run time is lengthening • Workload trends, e.g. the amount of workload is increasing • Objectively it doesn’t matter what the trend is or isn’t – what matters is that the analyst can confirm (1) what has been stated as the current performance results and (2) changes (if any) in the workload • Too much of what passes for performance analysis is simply uninformed opinion instead of methodically interpreted performance measurements • Historical measurement data allows the selection of multiple “baseline” study periods • No performance analysis should be based on a single time period, or at least such results should be considered tentative at best!
Batch Window/Elapsed Time Analysis Methodology – Resource Performance Objectives • A significant difference between batch and interactive work analysis is the interpretation of server resource utilization • For interactive work, capacity planning seeks to keep the resource utilization and resulting queueing below the level at which response time becomes unacceptable, e.g. CPU queue length less than 2 per processor, 80% CPU utilization, 40% disk utilization, etc. • Each interactive transaction contributes a tiny amount to overall utilization – hundreds of simultaneous users cause the aggregate high utilizations • Simultaneity isn’t coordinated – no matter what distribution of arrivals you assume (e.g. random, hyper-exponential, etc.), the loading isn’t evenly spaced over time • Batch analysis is the opposite: maximum throughput/minimum elapsed time occurs when one (or more) resources are at 100% utilization • The key insight is that there is usually no queueing at server resources because the queueing is occurring “inside” the batch processing, i.e. the next transaction doesn’t “start” until the last one is done. So expect 100% utilization with a 0 queue – that means that the processing is making optimal use of resources. • If more than one resource is measured at 100% utilization, even better! • If no server resource (CPU or disk) is at 100%, that proves that the primary performance constraint is either the application design or the software implementation
Batch Window/Elapsed Time Analysis Methodology – Characterizing Work • The biggest difference between representing interactive and batch work is the source: multiple users issuing uncoordinated requests vs. a stream of requests occurring precisely one after the other • It is essential to represent the difference between multiple simultaneous requests vs. a single (or multiple) stream(s) of requests • The resource demands are not represented differently, i.e. a “transaction” requires a certain number of milliseconds of CPU, a certain amount of I/O, etc. • How you choose to define a transaction can be different, but after that choice has been made, the analysis/modeling results have the same form • A difference is that interactive work is usually characterized by aggregating the activities of many users vs. batch work often focuses on a very small number of processes, sometimes only one. • Even when there are multiple batch processes, they are often represented separately rather than being aggregated
Batch Window/Elapsed Time Analysis Methodology – Elapsed Time Analysis Techniques • Prediction techniques • Interactive performance analysis is commonly performed using analytic modeling • Batch performance analysis involves a bit of arithmetic, frequently supplemented by analytic modeling • Building a suitable analytic model depends on having adequate measurement data available, i.e. both system resource measurements as well as detailed process data representing the application components • Elapsed time analysis typically means selecting all of the elapsed time, or all of the time for one phase of the batch processing • Process data is used to isolate the process(es) of interest into model-ready “workloads/transactions”
Batch Window/Elapsed Time Analysis Methodology – Elapsed Time Analysis Techniques • How to Apply Business Volume to the Analysis • A standard analytic model is supplemented with two pieces of additional information: • total job/process elapsed time (referred to as BATCH-RESPONSE) • total number of business requests processed during one execution (referred to as BATCH-TRANSACTIONS) • Four categories of changes may now be evaluated using an analytic model and a calculator: • Application change (e.g. more I/Os, I/O to different devices, change in CPU requirements, fewer physical I/Os) • Competition from other workloads (e.g. new workload(s), changed workload(s), workload(s) removed) • Hardware change (e.g. new CPU, additional CPUs, new devices, new caching) • Change in application volume
Batch Window/Elapsed Time Analysis Methodology – Elapsed Time Analysis Techniques • How to Apply Business Volume to the Analysis (continued) • The first three changes are expressed as changes in various model parameters (using standard analytic model “what-if” procedures) • The fourth case, change in volume, requires no change to the model • After the necessary changes have been applied to the model and the model re-evaluated, the new throughput and response time will be calculated by the model • In order to determine the new job response time (for all cases): New BATCH-RESPONSE = New BATCH-TRANSACTIONS * New BATCH-RESPONSE TIME • When only the application volume is changing, no model evaluation is required: New BATCH-RESPONSE TIME is the same as Baseline • For cases where transaction volume is not changing, New BATCH-TRANSACTIONS is the same as Baseline
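A minimal sketch of this arithmetic in Python; the function and variable names, and the 0.25-second baseline response time, are illustrative assumptions, not values from any BMC tool:

```python
def new_batch_response(new_batch_transactions: float,
                       new_batch_response_time: float) -> float:
    """New BATCH-RESPONSE = New BATCH-TRANSACTIONS * New BATCH-RESPONSE TIME (seconds)."""
    return new_batch_transactions * new_batch_response_time

# Volume-only change: the modeled per-transaction response time stays at its
# baseline value, so only the transaction count changes (hypothetical numbers).
baseline_response_time = 0.25          # sec/transaction from the baseline model
print(new_batch_response(2 * 20_000, baseline_response_time))   # 10000 s elapsed
```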
Batch Window/Elapsed Time Analysis Methodology – Elapsed Time Analysis Techniques • How to Apply Stream Constraints to the Analysis • A standard analytic model is supplemented with the “stream” configuration: • Add the maximum possible concurrency to the batch transaction, e.g. for a single job/process without threads, the maximum concurrency is 1. • In addition to the standard model calibration procedures, check the result of multiplying the transaction throughput and modeled response time • Ideally, this should come out to about 3600 seconds (1 hour) • If it is less than 3600, some other processing is missing, and should be added • If it is more than 3600, you would have received notice of saturation during model evaluation and need to consider what might be removed from the current transaction processing requirements • If the maximum concurrency is 2, use 7200 seconds as the standard of comparison, etc…
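A sketch of this calibration check, assuming the model reports throughput in transactions/hour and response time in seconds per transaction; the 2% tolerance and all names here are assumptions for illustration:

```python
def check_stream_calibration(throughput_per_hour: float,
                             resp_time_sec: float,
                             max_concurrency: int = 1,
                             tolerance: float = 0.02) -> str:
    """Compare modeled busy seconds against one fully busy hour per concurrent stream."""
    busy = throughput_per_hour * resp_time_sec
    target = 3600.0 * max_concurrency
    ratio = busy / target
    if ratio < 1.0 - tolerance:
        return f"{busy:.0f} s vs {target:.0f} s: some other processing is missing"
    if ratio > 1.0 + tolerance:
        return f"{busy:.0f} s vs {target:.0f} s: saturated; remove some demand"
    return f"{busy:.0f} s vs {target:.0f} s: calibration acceptable ({ratio:.0%})"

# Sequence case study numbers: 11.44K transactions/hr at 0.31 sec each
print(check_stream_calibration(11_440, 0.31))   # ~3546 s, ~99% -> acceptable
```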
Batch Window/Elapsed Time Analysis Case Studies • Case Studies • Demonstrate various aspects of the batch window/elapsed time analysis techniques • Show distributed systems data (Unix/Windows) • Techniques apply to any computing system • First developed for mainframes, enhanced for mid-range, then applied to distributed systems • Not every analysis uses an analytic model, but the process of interpreting and preparing the data for a model provides the necessary context for an effective analysis • No case study includes all stages of analysis since my involvement typically covered only one intermediate stage • The intended contribution is to identify the most important characteristics and factors relating to capacity planning solutions (if there was a capacity planning solution!) • There is some variety, but they don’t cover all types of batch analysis! • Four Case Studies • Daily Batch Job • Monthly Batch Processing • Batch Processing Sequence • Batch Processing with a Large Number of Jobs
Daily Batch Job Case Study Background • Performance question • Job runs within Sybase at 7:00 AM, for 15-30 minutes, on a SUN 6500 with 4 processors • The opinion is that it should run “faster” than 30 minutes • Methodology • Collect performance data for 4 days • Use Modeling and CDB Viewer
Daily Batch Job Case Study Current performance characteristics: CPU • (1) System CPU: possible 400%, max used 192%
Daily Batch Job Case Study Current performance characteristics: CPU • (2) Individual processors: possible 100%, max used 53%
Daily Batch Job Case Study Current performance characteristics: CPU • (3) Sybase individual dataserver processes: possible 100%, max used 80% • But only 2 out of 3 absorb the CPU load
Daily Batch Job Case Study Current performance characteristics: I/O, Memory, Network • I/O • Individual disk utilization: possible 100%, maximum used is 4% • Only a few disks are active • Disk service time/IO good (under 1 ms) for disks carrying most of the I/O load • Very low CPU wait for I/O % • Memory • Minor rate of paging to disk • Scan rate not high • Network • No stress during time job is running
Daily Batch Job Case Study Current performance characteristics: Summary • Summary • Hardware is NOT a constraint: • No sustained hardware constraints (CPU, I/O, Memory, Network) • Possible brief hardware constraints? • Software, application design and/or implementation must be a constraint • If there are no sustained hardware constraints, the batch job should be able to process more quickly than it does now • Phase II Analysis Methodology • Collect additional performance data with finer detail (1-minute spills vs. standard 15-minute spills) to check for brief hardware constraints • Compare performance characteristics between TEST and DEV systems • Compare these two systems since one is considered to be performing “well” and the other is not • Also obtain specific “activity” counts for each batch job
Daily Batch Job Case Study Future performance characteristics • For current level of workload • Try to identify “other” constraints • Rough model shows about 80% of hardware elapsed time is for CPU service • CPU upgrade (faster processor) would improve performance • Additional processors would not help (no CPU wait time to reduce)
Daily Batch Job Case Study Future performance characteristics • For increased level of workload • Try to identify “other”/non-hardware constraints • Try to achieve better distribution of demand • CPU: Sybase dataserver distribution • I/O: more disks possible? • Only obvious hardware improvement would be faster processors
Monthly Batch Processing Case Study Background • Performance question • Environment is a 12-processor SUN 5500 running Solaris • Application is a multi-phase billing application which uses a Sybase backend • Current workload will increase by 100% • Billing cycle must complete within 9 days. What hardware configuration is required? Can the application be reconfigured/tuned? • Methodology • Configure CPU, Disk • Use Modeling and CDB Viewer • Obtain 3 days of performance data for 2 billing cycles
Monthly Batch Processing Case Study Background • Preliminary Analysis • One phase of the billing application is already taking 1.25 days, even before the workload increases • The Batch Window modeling principle is that batch elapsed time will increase linearly, e.g. 1.25 days with projected 100% growth will result in 2.5 days, as sketched below • So this single phase is enough to “break” the billing cycle window by itself, and so merits individual study • Need to understand what is constraining current performance so that it can be addressed/understood before additional hardware is considered
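A minimal sketch of that linear-scaling rule (the 1.25-day baseline is from this slide; the function itself is illustrative):

```python
def projected_elapsed(baseline_elapsed_days: float, volume_growth: float) -> float:
    """Batch elapsed time scales linearly with business volume on a fixed configuration."""
    return baseline_elapsed_days * (1.0 + volume_growth)

print(projected_elapsed(1.25, 1.00))   # 2.5 days after 100% growth
```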
Monthly Batch Processing Case Study Data Analysis • Use Interactive Data Analysis to Study Job Characteristics • Three days of collected performance data is available for two billing cycles • Observe the resource usage pattern for three days • Select a representative hour for detailed analysis
Monthly Batch Processing Case Study Data Analysis • Use Interactive Data Analysis to Study Job Characteristics • Can clearly see structure of the application • 5 “extract98” jobs, one for each database • Accounts are divided amongst the 5 databases • One Sybase instance, with 6 “dataservers”, which means a maximum of 6 CPUs can be used
Monthly Batch Processing Case Study Workload Specification • Workload/Transaction Class Structure • Specify multiple transaction classes in Interactive Data Analysis • One for each extract job
Monthly Batch Processing Case Study Workload Specification • Workload/Transaction Class Structure • Specify multiple transaction classes in Interactive Data Analysis • One for each dataserver
Monthly Batch Processing Case Study Current Performance Characteristics • Use CDB Viewer to Study Workload Characteristics • Three days from two billing cycles • Apply workload definitions • Look at each billing cycle in isolation, then compare the two • Additional information from the customer • March cycle processed 1.7 million accounts • The extract phase was run twice in March • June cycle processed 1.825 million accounts
Monthly Batch Processing Case Study Workload Results Extract phase from 3 March 20:00 to 4 March 17:00 (19 - 21 hours). Rerun from 4 March 18:00 to 5 March 15:00 (18 - 21 hours).
Monthly Batch Processing Case Study Workload Results Extract phase runs 1 June 19:00 to 3 June at 2:00 (22 - 30 hours).
Monthly Batch Processing Case Study Performance Analysis: Workload Summary • CPU has a maximum utilization of 700% out of 1200% (about 60% normalized) • This means that there is currently significant (40%) unused CPU capacity • The June cycle processed 7% more accounts than March, but its elapsed time grew by about 40% • This means that something is already limiting performance, and we know it isn’t overall CPU capacity
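A quick sanity check of these summary figures; the hour values below are assumed midpoints of the 19-21 and 22-30 hour ranges from the preceding slides:

```python
# Normalized CPU: 700% used of 1200% possible on the 12-processor system.
print(700 / 1200)                           # ~0.58 -> roughly 60%, i.e. ~40% unused

# Volume grew ~7%, yet elapsed time grew ~40%.
march_accounts, june_accounts = 1_700_000, 1_825_000
march_hours, june_hours = 20, 28            # assumed midpoints of 19-21 h and 22-30 h
print(june_accounts / march_accounts - 1)   # ~0.07
print(june_hours / march_hours - 1)         # 0.40
```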
Monthly Batch Processing Case Study Performance Analysis: CPU Queue CPU queue length must exceed the number of processors in use (7 maximum) to indicate waiting. That doesn’t occur, so there is absolutely no waiting.
Monthly Batch Processing Case Study Performance Analysis: CPU per Processor The CPU utilization per physical processor is quite balanced. This indicates that the OS is successfully distributing the load and there is plenty of spare CPU capacity.
Monthly Batch Processing Case Study Performance Analysis: CPU per Workload The 5 extract jobs consume only 1 CPU together and are pretty evenly matched. The dataserver jobs consume up to 1 CPU per job (for 6 of them). They are evenly matched on the second run, but not the first.
Monthly Batch Processing Case Study Performance Analysis: CPU per Workload Summary • Each dataserver process is limited to 100% CPU utilization • This means that there is potential for Sybase to use the significant unused capacity (by configuring additional dataserver processes) • In general, “unmatched” CPU utilization indicates a potential for improvement in performance • The second run has a shorter elapsed time, probably because all dataservers are at 100%. What is causing this to occur?
Monthly Batch Processing Case Study Performance Analysis: CPU for Sybase In the first run, not all dataservers are at 100%; 3 are at 75% or less. In the second run, all 6 are at about 100%.
Monthly Batch Processing Case Study Performance Analysis: CPU for Data Extract When one of the 5 extracts runs longer, its CPU utilization doesn’t match the others. What is preventing work from being processed as quickly for many hours?
Monthly Batch Processing Case Study Performance Analysis: CPU for Data Extract A zoom-in on one extract job in March. The shorter elapsed time of the second run (18 hours) is clearly correlated with higher CPU utilization than during the first run (20 hours).
Monthly Batch Processing Case Study Performance Analysis: CPU Summary • So before any capacity planning can occur, we need to understand the current limitations to performance • Since the application is CPU intensive, ideal conditions for maximum batch performance would be 1200% out of a possible 1200% • Other charts have been used to rule out I/O and network as significant contributors • The CPU queueing chart shows minimal interference from any other activity on the system • A rough analytic model shows potential to absorb 75% growth with the current configuration if the current CPUs can be fully utilized
Batch Processing Sequence Case Study Background • Performance question • Environment is a 14-processor SUN 2000 running Solaris • Application has multiple phases and uses an Oracle backend • Current workload will increase by 100% • Billing cycle currently completes in 4 hours. What hardware configuration is required to maintain that elapsed time? • Methodology • Configure CPU • Use Modeling and CDB Viewer • Two days of performance data is obtained
Batch Processing Sequence Case Study Data Analysis • Use CDB Viewer to Study Job Characteristics • Two days of collected data covers the sequence, which begins in the evening and ends the following morning • The job sequence consists of the “FE_EMC_PRO” processes
Batch Processing Sequence Case Study Data Analysis • Use Interactive Data Analysis to Study Job Characteristics • Can see the sequence of start times for each process • Structure workloads EMC-A, EMC-B, EMC-C, EMC-D, EMC-E • “A” from 20:31 to 21:07 • “B” from 21:08 to 22:40 • “C” from 22:41 to 23:36 • “D” from 23:37 to 00:21 • “E” from 00:22 to 00:44 • Listing of selected processes shows process ID, CPU time, start time, and parent process ID • Only a couple of processes account for most of the system CPU utilization
Batch Processing Sequence Case Study Workload Specification • A separate analysis/model is built for each phase (example “EMC-B”) • One for each large FE_EMC_PRO job • The number of transactions is 20,000
Batch Processing Sequence Case Study Model Calibration • Model Calibration • To check calibration for the transaction of interest: • 11.44K transactions/hr * .31 sec/transaction = 3548 seconds • Compared to 3600 seconds (because throughput is expressed as transactions per hour), that’s 99%, or 1% from perfect calibration. This is acceptable.
Batch Processing Sequence Case Study Current Performance Analysis • Baseline model results show • CPU Service time accounts for most of the total response time • Only effective upgrade is a processor faster than the current processor • Additional processors will have no effect (because there’s no CPU Wait time) • I/O upgrades will have a modest effect (there is some time spent doing I/O)
Batch Processing Sequence Case Study Future Performance Analysis • ‘What-if’ model results show that CPU Service time is significantly reduced • The proposed CPU upgrade is from the current Sun 14-processor system to an 8-processor Sun system where the individual processor is about 9 times faster • Relative response time of .3 shows that we can easily double this workload volume • I/O time is relatively more important than it was in the baseline • Double the workload volume in the model and evaluate • Calculate new response time: 40,000 transactions * .10 sec/transaction = 4000 seconds
Batch Processing Sequence Case Study Future Performance Analysis • Same technique is applied to each of the 5 models • First four phases • dominated by CPU service time • account for 4 hours • Fifth phase (clean up and FTP) • uses I/O more than CPU • only 20 minutes • The baseline job stream response time is approximately 4.3 hours: 20000 transactions * (.09 + .31 + .13 + .14 + .11) sec/tran = 4.3 hours
Batch Processing Sequence Case Study Future Performance Analysis • ‘What-if’ modeling results • First four phases show the planned CPU upgrade is adequate • relative response time of .5 (or less) means double business volume can be handled without extending run time • The fifth phase • shows less improvement, only .70 of original, so that phase will run longer than the baseline • The “what-if” job stream response time is approximately 2.8 hours: 40000 transactions * (.03 + .09 + .02 + .03 + .08) sec/tran = 2.8 hours
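A short sketch reproducing both job-stream calculations; the per-phase response times (sec/transaction) are taken from this and the preceding slide, while the function itself is illustrative:

```python
baseline_times = [0.09, 0.31, 0.13, 0.14, 0.11]   # five phases, baseline models
whatif_times   = [0.03, 0.09, 0.02, 0.03, 0.08]   # after the proposed CPU upgrade

def stream_elapsed_hours(transactions, phase_times):
    """Job stream elapsed time = volume * sum of per-phase response times."""
    return transactions * sum(phase_times) / 3600

print(stream_elapsed_hours(20_000, baseline_times))  # ~4.3 hours (baseline volume)
print(stream_elapsed_hours(40_000, whatif_times))    # ~2.8 hours (doubled volume)
```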
Batch Processing w/Many Jobs Case Study Background • Performance question • Environment is a 16-processor IBM Power6 frame running AIX • Application consists of thousands of jobs and uses an Oracle backend • 4 SPLPARs process the workload (2 application partitions, 2 database partitions) • There are hundreds of job sequences to be performed, but no single sequence occupies the entire batch window • Individual sequences are around an hour in length and have 4 phases (data acquisition, data analysis, database update, cleanup phase) • The set of sequences begins at midnight and ends the next morning • Current workload will quadruple • Data processing cycle currently completes in 8 hours. What hardware configuration is required to maintain that elapsed time? • Methodology • Configure CPU • Use CDB Viewer and Analysis Techniques • Measurement data was obtained for two weeks (14 samples of daily activity) • One day is shown as an example