Balancing Batch Workloads and CPU Activity in a Parallel Sysplex Environment
Prepared by Kevin Martin, McKesson, for the CMG Canada Spring Seminar, 2006
Introduction • Pharma applications run in a data center in California; application support is in San Francisco and Dallas. • We implemented parallel sysplex environments last July to improve availability. • We also installed two z890 processors, a 2086-350 and a 2086-250. The CPU engines run at the same speed, which facilitates reporting and workload balancing.
[Chart: Z890-350 CPU Utilization by LPAR (CPU % busy)]
[Chart: DDCA Processor Utilization by Workload (CPU % busy)]
[Chart: Z890-250 CPU Utilization by LPAR (CPU % busy)]
[Chart: DDCO Processor Utilization by Workload (CPU % busy)]
Reasons for Imbalanced CPU Activity • Originally the Pharma application ran on one production LPAR. It was hard to decide how to split the processing while maintaining data integrity. • Software licenses: IMS and COMPAREX are licensed only on the 350, and SAS only on the 250. • System tasks: the TWS controller (job scheduling) runs on the 350, and DFHSM migration and backup run on the 250. • Other restrictions exist because of past problems and data integrity concerns.
Job Routing • Our goal was to avoid modifying JCL • We used WLM scheduling environments, plus a tool that assigns programs or jobs to the scheduling environments (see the sketch below)
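For illustration, this is a minimal sketch of the JCL keyword that requests a scheduling environment; the job name, accounting fields, and program are hypothetical, and in our setup the routing tool supplies the assignment so the JCL itself does not have to change.

   //DY65A01  JOB (ACCT),'PHARMA BATCH',CLASS=H,MSGCLASS=X,
   //         SCHENV=DDCOJOBS
   //STEP1    EXEC PGM=IEFBR14

A job tagged this way is only eligible for initiators on systems where the DDCOJOBS scheduling environment is available.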
WLM Scheduling Environments • DDCANY: can run on DDCA or DDCO • DDCA: DDCA jobs • DDCOJOBS: DDCO jobs • SAS: SAS programs • DDCO: jobs that run on DDCO using class 6 • EDE: jobs with an EDICKP DD statement • MQSERIES: MQSeries jobs • REEL: jobs using 3420 reel tapes • EDETEST: EDE test jobs (DM99Txxx) • DDCSPECL: programs that run on the 350
SDSF Resource Display • RESOURCE DDCA DDCO • DDCANY ON ON • DDCO OFF ON • DDCOJOBS OFF ON • DDCSPECL ON OFF • DDNAMES ON OFF • EDE ON ON • EDETEST ON ON • IMSTEST ON ON • MQSERIES ON OFF • REEL ON OFF • SAS OFF ON
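The resource states behind this display can be changed with operator commands. A minimal sketch, assuming the standard MODIFY WLM and DISPLAY WLM syntax (verify the exact form in the MVS system commands reference for your release):

   F WLM,RESOURCE=SAS,ON        make the SAS resource available on this system
   F WLM,RESOURCE=REEL,OFF      make the REEL resource unavailable on this system
   D WLM,RESOURCE=SAS           display the state of the SAS resource

Jobs assigned to a scheduling environment run only on systems where all of its required resources are in the required state.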
WLM and JES Mode Initiators • For each job class you can specify MODE=WLM or MODE=JES in the JES2 initialization parameters (see the example below) • WLM mode initiators can be started dynamically on any LPAR • JES mode initiators are defined for each LPAR as fixed initiators • WLM mode and JES mode classes can run at the same time; however, ensure that there are enough JES mode initiators
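As an illustration, the mode is set on the JOBCLASS statement in the JES2 initialization parameters. The classes shown here match the display on the next slide, but other parameters are omitted, so verify the statement syntax against the JES2 initialization reference for your release:

   JOBCLASS(H)  MODE=WLM
   JOBCLASS(1)  MODE=WLM
   JOBCLASS(5)  MODE=JES

On releases that support it, the mode can also be switched dynamically with the $T JOBCLASS command, for example $T JOBCLASS(5),MODE=JES.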
WLM and JES Mode Initiators • CLASS Status Mode Wait-Cnt Xeq-Cnt Hold-Cnt JCLim • H NOTHELD WLM 3 100 • L NOTHELD WLM 1 100 • M NOTHELD WLM 1 100 • N NOTHELD WLM 100 • O NOTHELD WLM 100 • 1 NOTHELD WLM 100 • 2 NOTHELD JES 100 • 3 NOTHELD JES 7 100 • 4 NOTHELD WLM 100 • 5 NOTHELD JES 100 • 6 NOTHELD JES 100
Problem # 1: slower turnaround on one LPAR – more jobs running. • The TWS controller is on DDCA. When it releases a job, a WLM initiator tends to become available on that same LPAR first. • For example, there could be 15 jobs on DDCA and only 5 jobs on DDCO, so the jobs on DDCA get slower turnaround than the ones on DDCO. • This gets worse when high priority jobs are running on the busy LPAR; the low priority jobs there run very slowly. • We checked DASD response times and tuned the JES2 MAS parameters. • We routed several large priority jobs to DDCO by assigning specific job names to a scheduling environment named DDCOJOBS.
Problem # 2: Releasing many jobs at the same time • 8 or 16 large jobs are released at once. They are on the critical path for a schedule, and they have a high priority. • With WLM mode initiators, most of the jobs could start on one LPAR because that LPAR was not busy at the time the jobs were released. • For example, DDCA could get 2 jobs and DDCO could get 6 jobs. The jobs on DDCA would finish earlier, and then DDCA would be idle while DDCO was still busy. • We assigned these groups of large priority jobs to JES mode job classes to balance the LPAR activity better. We defined four class 5 initiators on DDCA and four class 5 initiators on DDCO, and assigned the DY65 jobs to class 5 (see the sketch below).
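A sketch of the kind of fixed initiator definitions this implies, with one such set in each LPAR's JES2 initialization parameters. The initiator numbers are hypothetical, and the statement syntax should be checked against the JES2 initialization reference for the installed release:

   INIT(11)  CLASS=5,START=YES
   INIT(12)  CLASS=5,START=YES
   INIT(13)  CLASS=5,START=YES
   INIT(14)  CLASS=5,START=YES

Because JES mode initiators stay on the LPAR where they are defined, at most four of the class 5 DY65 jobs run on each LPAR at a time, which keeps the activity balanced between DDCA and DDCO regardless of which LPAR is busier when TWS releases the jobs.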
Problem # 3: WLM initiators and jobs on the input queue • Priority jobs would start, but lower priority jobs would wait on the input queue • With over 10,000 jobs running per day, we found some jobs that were incorrectly classified. • We defined a WLM policy override to change the BATLOW service class to importance level 3, the same importance level as the higher priority batch. After the FIXINPUT policy override was activated, the jobs on the input queue would start. Sometimes it would take 10 minutes to start all of the jobs. Afterwards the regular policy was activated again.
How to make WLM policy overrides • On the WLM service policy selection list, specify action code 2=COPY to copy the base policy to a new policy named FIXINPUT. • Then specify action code 7=Override Service Classes to modify the service class goals for FIXINPUT. • Then specify action code 3=Override Service Class to modify the goals for specific service classes in the policy override. • To activate the policy, enter: V WLM,POLICY=FIXINPUT • To display the WLM policy, enter: D WLM
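Putting the commands together, a typical cycle looked like the sequence below. FIXINPUT is the override policy described earlier; BASEPOL is only a placeholder for the name of the regular policy, which is not shown in this presentation:

   V WLM,POLICY=FIXINPUT      activate the override so the waiting jobs start
   D WLM                      confirm which policy is active across the sysplex
   V WLM,POLICY=BASEPOL       reactivate the regular policy once the input queue drains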
Jobs on the input queue • APAR UA21235 applies to z/OS 1.4 systems. • The correction was released in October 2005. • “Currently WLM does not start additional initiators for local batch work with system affinities when idle initiators exist on other systems in the sysplex. This can lead to situations where local batch jobs are delayed for a significant period of time because a local shortage of initiators exists. The situation is most visible on large sysplex environments with batch work having system affinities to only few systems. WLM improves to start initiators by looking more closely at the number of initiators which can really handle the affinity work.”
Summary • Balance LPAR activity in order to optimize capacity in a parallel sysplex environment. • WLM mode initiators work well in most cases; it is essential that the correction for UA21235 be installed. • It is OK to mix WLM mode and JES mode job classes, provided that there are always enough fixed initiators for each JES mode job class.
Changes in CPU utilization • Overall CPU activity decreased from September to January due to tuning. • DDCA decreased due to tuning improvements. • DDCO increased in August and then remained at the same utilization due to better workload balancing. • The following graphs show how the LPAR activity became more balanced.