1. Balancing Batch Workloads and CPU Activity in a Parallel Sysplex Environment
Prepared by Kevin Martin
McKesson
For CMG Canada
Spring Seminar 2006
2. Introduction Pharma applications run in a data center in California. Application support is in San Francisco and Dallas.
We implemented Parallel Sysplex environments last July to improve availability.
We also installed a 2086-350 and 2086-250. The CPU engines have the same speed, facilitating reporting and workload balancing.
4. Z890-350 CPU Utilization by LPAR
5. DDCA Processor Utilization by Workload
6. Z890-250 CPU Utilization by LPAR
7. DDCO Processor Utilization by Workload
8. Reasons for Imbalanced CPU Activity Originally the Pharma application ran on one production LPAR. It was hard to decide how to split the processing while maintaining data integrity.
Software licenses: IMS and COMPAREX only on the 350 and SAS only on the 250
System tasks: TWS controller (job scheduling) on the 350 and DFHSM migrates and backups on the 250
Other restrictions due to problems and data integrity concerns
9. Job Routing Our goal was to avoid modifying JCL
We used WLM scheduling environments, and a tool to assign programs or jobs to the scheduling environments
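One standard way to tie a job to a scheduling environment is the SCHENV parameter on the JOB statement; the routing tool supplied this association for us, so the JCL itself did not have to change. A minimal sketch (the job name, accounting data, and classes are placeholders, not values from our shop):

```jcl
//BIGJOB1  JOB (ACCT),'NIGHTLY BATCH',CLASS=6,MSGCLASS=X,
//         SCHENV=DDCOJOBS
//* SCHENV makes the job eligible to run only on systems where every
//* resource required by the DDCOJOBS scheduling environment is ON
//STEP1    EXEC PGM=IEFBR14
```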
10. WLM Scheduling Environments
DDCANY    jobs that can run on DDCA or DDCO
DDCA      DDCA jobs
DDCOJOBS  DDCO jobs
SAS       SAS programs
DDCO      jobs that run on DDCO using class 6
EDE       jobs with an EDICKP DD statement
MQSERIES  MQSERIES jobs
REEL      jobs that use 3420 reel tapes
EDETEST   EDE test jobs (DM99Txxx)
DDCSPECL  programs that run on the 350
11. SDSF Resource Display
RESOURCE  DDCA  DDCO
DDCANY    ON    ON
DDCO      OFF   ON
DDCOJOBS  OFF   ON
DDCSPECL  ON    OFF
DDNAMES   ON    OFF
EDE       ON    ON
EDETEST   ON    ON
IMSTEST   ON    ON
MQSERIES  ON    OFF
REEL      ON    OFF
SAS       OFF   ON
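The ON/OFF states shown above are per-system resource settings that can be changed with the MODIFY WLM operator command (resource names here are taken from the display; the command is issued on the system whose state should change):

```
F WLM,RESOURCE=DDCO,ON
F WLM,RESOURCE=REEL,OFF
```

The first command makes jobs that require the DDCO resource eligible on the system where it is entered; the second stops 3420 reel work from being scheduled there.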
12. WLM and JES Mode Initiators For each job class you can specify MODE=WLM or MODE=JES in the JES2 parameters
WLM mode initiators can start dynamically on any LPAR
JES mode initiators are set for each LPAR in permanent initiators
WLM and JES mode classes can run at the same time. However, ensure that there are enough JES mode initiators.
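In the JES2 initialization parameters, the mode is set on each job class. A sketch of the relevant statements (the initiator number is illustrative, not taken from our configuration):

```
JOBCLASS(H) MODE=WLM            /* initiators managed by WLM          */
JOBCLASS(5) MODE=JES            /* fixed initiators, balanced by hand */
INIT(1)     CLASS=5,START=YES   /* one permanent class 5 initiator    */
```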
13. WLM and JES Mode Initiators
CLASS Status Mode Wait-Cnt Xeq-Cnt Hold-Cnt JCLim
H NOTHELD WLM 3 100
L NOTHELD WLM 1 100
M NOTHELD WLM 1 100
N NOTHELD WLM 100
O NOTHELD WLM 100
1 NOTHELD WLM 100
2 NOTHELD JES 100
3 NOTHELD JES 7 100
4 NOTHELD WLM 100
5 NOTHELD JES 100
6 NOTHELD JES 100
14. Problem # 1: slower turnaround on one LPAR with more jobs running. The TWS controller is on DDCA, so when it releases a job, a WLM initiator tends to become available on the same LPAR first.
For example, there could be 15 jobs on DDCA and only 5 jobs on DDCO. So the jobs on DDCA get slower turnaround than the ones on DDCO.
This gets worse if high priority jobs are running on the busy LPAR. The low priority jobs will run very slowly.
We checked DASD response times and tuned the JES MAS parameters.
We routed several large priority jobs to DDCO by assigning specific job names to a scheduling environment named DDCOJOBS.
15. Problem # 2: Releasing many jobs at the same time Eight or 16 large jobs are released at once. They are on the critical path of a schedule and have a high priority.
With WLM mode initiators most of the jobs could start on one LPAR because that LPAR was not busy at the time that the jobs were released.
For example, DDCA could get 2 jobs and DDCO could get 6 jobs. The jobs on DDCA would finish earlier, and then DDCA would be idle while DDCO was still busy.
We assigned these groups of large priority jobs to JES mode job classes to balance the LPAR activity better. We defined four class 5 initiators on DDCA and four class 5 initiators on DDCO, and assigned the DY65 jobs to class 5.
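A class can also be switched and its initiators started without a JES2 restart, using operator commands. A hedged sketch (the initiator number is illustrative; the commands would be repeated for each of the four initiators on each LPAR):

```
$T JOBCLASS(5),MODE=JES
$T I4,C=5
$S I4
```

The first command puts class 5 in JES mode, the second assigns initiator 4 to class 5, and the third starts it.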
16. Problem # 3: WLM initiators and jobs on the input queue Priority jobs would start, but lower priority jobs would wait on the input queue.
With over 10,000 jobs running per day, we found some jobs that were incorrectly classified.
We defined a WLM policy override to change the BATLOW service class to importance level 3, the same importance level as the higher priority batch. After the FIXINPUT policy override was activated, the jobs on the input queue would start. Sometimes it would take 10 minutes to start all of the jobs. Afterwards the regular policy was activated again.
17. How to make WLM policy overrides On the WLM service policy selection list, specify action code 2=COPY to copy the base policy to a new policy named FIXINPUT.
Then specify action code 7=Override Service Classes to modify the service class goals for FIXINPUT.
Then specify action code 3=Override Service Class to modify the goals for specific service classes in the policy override.
To activate the policy, enter: V WLM,POLICY=FIXINPUT
To display the WLM policy, enter: D WLM
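Putting the commands together, the override cycle for Problem # 3 looks like this (PRODPOL stands in for the name of the regular base policy, which is not given here):

```
V WLM,POLICY=FIXINPUT
D WLM
V WLM,POLICY=PRODPOL
```

The first command activates the override, D WLM confirms which policy is active, and the last command restores the regular policy once the input queue has drained.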
18. Jobs on the input queue APAR UA21235 applies to z/OS 1.4 systems.
The correction was released in October 2005.
"Currently WLM does not start additional initiators for local batch work with system affinities when idle initiators exist on other systems in the sysplex. This can lead to situations where local batch jobs are delayed for a significant period of time because a local shortage of initiators exists. The situation is most visible on large sysplex environments with batch work having system affinities to only few systems. WLM improves to start initiators by looking more closely at the number of initiators which can really handle the affinity work."
19. Summary Balance LPAR activity in order to optimize capacity in a Parallel Sysplex environment.
WLM mode initiators work well in most cases, but it is essential that the fix for APAR UA21235 be installed.
It is OK to mix WLM mode and JES mode job classes, provided that there are always enough fixed initiators for each JES mode job class.
21. Changes in CPU utilization Overall CPU activity decreased from September to January due to tuning.
DDCA decreased due to tuning improvements.
DDCO increased in August and then remained at the same utilization due to better workload balancing.
The following graphs show how the LPAR activity became more balanced.