Batch Computing in Client/Server IT Infrastructures Using LSF and the MSC/Analysis Manager
Klaus-Peter Wessel
Daimler-Benz Aerospace Airbus
Dept. EIC (CAE Systems Support)
Tel.: 0421-538-3161
Fax: 0421-538-4994
E-Mail: klaus-peter.Wessel@airbus.dasa.de
About the Author
Klaus-Peter Wessel, Dipl.-Ing. (Aerospace Technology), University of Stuttgart
Responsible at DASA Airbus for:
• Installation, customizing and user support of pre-/postprocessor systems
• MSC/Patran (major system)
• I_DEAS, MSC/ARIES (used with MSC/EMAS), MENTAT
• Services
• Intranet CAE services
• Batch queueing system (Unix)
• LSF (Load Sharing Facility) and
• MSC/Analysis Manager
• Covered by this paper!
General Themes of This Paper
• CAE downsizing project at DASA Airbus
• from host-based to distributed computing
• hardware situation in the CAE area
• realization with LSF and the MSC/Analysis Manager
• CAE job management/handling
• Batch queueing system (Load Sharing Facility (LSF) from Platform Computing Ltd.)
• Job-submitting tool (MSC/Analysis Manager)
CAE Downsizing at DASA Airbus
DASA Airbus has been running a CAE downsizing project since 1994. The main goal is to move all MSC/Nastran and self-developed CAE-specific Fortran applications from the IBM mainframe to decentralized Unix-based workstations, and to reduce the elapsed-time/CPU-time ratio of typical CAE batch jobs to a maximum of 2.
Platform decision (1995):
• HP workstations as the desktop systems for the engineers
• IBM RS6000/R24 as central CAE batch server systems (at each site)
• no further DEC investment because of the lack of full IEEE compatibility
Platform decision revised (09/97):
• implement a pure HP environment in the major CAE departments
• use more decentralized batch servers instead of the former IBM RS6000/R24
What is LSF?
• LSF is the abbreviation for Load Sharing Facility
• It is developed by Platform Computing Ltd., Canada
• It is the world-leading Unix-based batch queueing and load-leveling system
• (LSF can even share batch jobs with NQS running on a CRAY)
... and why did we choose LSF?
• It combines all Unix-based systems into one "virtual cluster"
• The CAE user gets a "one system" view
• It is a tool with which the usage of workstation resources can be optimized
• It is very flexible in setting up specific batch queues
• It is flexible in setting up site-specific load indices
• It works in an optimal way together with the MSC/Analysis Manager
xlsbatch
GUI for submitting, controlling and monitoring batch jobs, hosts and queues
DA-specific implementation: a click on the "Submit" button opens the MSC/Analysis Manager
Note: This tool is currently not used by the "normal" CAE user; at DA, LSF is used more or less as a "black box" behind the MSC/Analysis Manager
Load Indices
There are a number of built-in load indices which are measured continuously by the LIM process on each participating workstation. Examples:
• r15s: 15-second run-queue length
• r1m: 1-minute run-queue length
• r15m: 15-minute run-queue length
• ut: CPU utilization
• it: interactive idle time of the workstation
• pg: paging rate
• mem: available memory
• swp: available swap space
External load indices can be defined in addition, for example:
• nas_scratch: free disk space in the directory /nas_scratch
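External indices such as nas_scratch are typically fed to LSF through an ELIM (External Load Information Manager) program that the LIM starts on each host. The following is a minimal sketch of such an elim, assuming a Bourne-shell script and the index name from the example above; the df handling is illustrative, and the index still has to be declared to the cluster (e.g. in lsf.shared):

#!/bin/sh
# Minimal ELIM sketch (illustrative, not the DASA Airbus script):
# report free space in /nas_scratch as the external load index "nas_scratch".
# The LIM reads this program's stdout periodically; the expected output format is:
#   <number of indices> <name1> <value1> [<name2> <value2> ...]
while true
do
  # free space of the scratch file system in MB (df output columns may differ between Unix flavours)
  FREE_MB=`df -k /nas_scratch | awk 'NR==2 {print int($4/1024)}'`
  echo "1 nas_scratch $FREE_MB"
  sleep 60
done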
Load Indices
All load indices can be used in a number of configuration options as well as in job-submit options. Examples:
1.) In the overall LSF workstation declarations
• declare a workstation as busy if a load index is above a specific value
• declare a workstation as busy if it is used interactively (index: it)
2.) In queue definitions
• when deciding where to run a job, consider only those workstations whose load indices do not exceed the specified thresholds
3.) In submit commands
bsub -q normal -R "select[type==any] order[ut:r1m]" command
or: bsub -q server -R "select[type==HP700 && mem>200]" command
4.) During the runtime of a job
• if a load index rises above a specified value (loadstop): suspend the job
• if a load index falls below a specified value (loadsched): resume or start a job
• if the software allows it and LSF is configured to do it: migrate the job to another machine
Flexibility in Queue Definitions
Scheduling policies:
• FIRST IN - FIRST OUT
• FAIRSHARE policy
• PREEMPTIVE and PREEMPTABLE jobs
• EXCLUSIVE jobs
• USER (and/or department) shares
Limits that can be set:
• CPULIMIT, RUNLIMIT, FILELIMIT, DATALIMIT, STACKLIMIT, CORELIMIT, MEMLIMIT, PROCLIMIT
Resources/load indices that can be used:
• RUN_WINDOWS, DISPATCH_WINDOWS
• USERS, HOSTS
• PREEXEC/POSTEXEC commands
• LOADSCHED/LOADSTOP values of all load indices
• MULTICLUSTERING
Flexibility in Queue Definitions
bigmem queue:
• contains only one big, pure CAE batch server
• accepts one job per processor (currently one)
• accepts two jobs in the queue per user
• no overload indices declared
• Note: see the Multi-Clustering description later
smallmem queue:
• contains all CAE workstations, e.g. old and small HP 710s
• jobs are killed when they require more than 22 MB of memory
• accepts two jobs per processor
• accepts five jobs in the queue per user
• jobs are suspended when a load index rises above its limit
• jobs are resumed when the load index falls below the limit again
(A configuration sketch for such a queue follows below.)
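As an illustration only (this is not DASA Airbus' actual configuration), a queue like the smallmem queue described above could be expressed as an lsb.queues entry roughly as follows; the numeric thresholds are assumptions:

Begin Queue
QUEUE_NAME  = smallmem
PRIORITY    = 30
# two job slots per processor, five jobs per user in the queue
PJOB_LIMIT  = 2
UJOB_LIMIT  = 5
# kill jobs growing beyond roughly 22 MB (value in KB)
MEMLIMIT    = 22000
# suspend jobs when the 1-minute run-queue length exceeds the second value,
# resume them when it falls below the first value again (loadSched/loadStop)
r1m         = 1.0/2.0
DESCRIPTION = small-memory queue on all CAE workstations
End Queue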
xlsmon
GUI for monitoring the cluster load; here: the different states of the hosts
• ok: host accepts new batch load
• busy: one or more load indices exceed the host's thresholds; the host will not accept new load
• unavail: host is currently offline
• closed: the host's run window is currently closed
xlsmon
GUI for monitoring the cluster load; here: the current workload indices of two hosts
The tool is mainly useful for LSF administrators and/or CAE systems-support people
LSF MultiCluster Feature
Reasons for independent (multiple) clusters may be:
• decentralized IT infrastructure
• independent clusters for organizational reasons
• geographically distributed sites
With the LSF MultiCluster application/functionality:
• independent LSF clusters can be combined into one "virtual cluster"
• jobs can be distributed from one cluster to another
Additional LSF Features
LSF provides a number of further functions that are currently not in use at DASA Airbus:
• production job scheduling (calendars)
• sharing / load leveling of interactive jobs
• load sharing shell (lstcsh)
• parallel (or distributed) make utility
What is the MSC/Analysis Manager?
• It is an easy-to-use graphical user interface (GUI) for submitting and monitoring typical CAE batch jobs
• It is developed by the MacNeal-Schwendler Corporation
... and why did we choose it?
• In downsizing projects, GUIs are needed to make complex software easy to use
• The need to develop our own GUIs/tools can be reduced by using a standard but customizable GUI
• It works in an optimal way together with LSF and MSC/Nastran
• No user-specific training is needed to use the LSF cluster features
• Other codes can easily be integrated
• The CAE user can use one interface for all his CAE batch jobs
• It is an integral part of MSC/Patran but can also be run in standalone mode
Submit CAE Jobs with the MSC/Analysis Manager
All kinds of typical CAE batch applications can be integrated into the GUI. Integrated at DASA Airbus are:
• MSC/Nastran (shown left)
• LS-Dyna 3D, Marc, Exform, VSAERO, STARS
In addition, DASA Airbus has integrated a special feature (shown right) that allows users to compile, link and run any Fortran job, even on pure batch servers without any interactive access. Later, the developed code can be integrated as a real application inside the MSC/Analysis Manager GUI.
Note: The GUI changes its appearance according to the chosen application
Submit CAE Jobs with the MSC/Analysis Manager
• The user interface is the same for all applications!
• The user just has to choose his application with a click on the GROUP button, and the GUI is adjusted to the settings applicable to that application
• The interface also changes its appearance according to the chosen organization/department
• Usage for standard jobs is very easy:
• choose the input file (e.g. MSC/Nastran *.dat file)
• click on the "Apply" button
• Note: In MSC/Patran this procedure is fully integrated
What happens after clicking on the "Apply" button?
What happens with an MSC/Nastran job after clicking on the "Apply" button?
• The monitoring window appears on the screen
• The MSC/Analysis Manager builds one MSC/Nastran input file (merging included files)
• The MSC/Analysis Manager can/will generate MSC/Nastran FMS commands
• The CAE job is submitted to the chosen LSF queue
• LSF decides which workstation is capable of running the job
• The files needed for the job are copied to the execution host
• The application program (e.g. MSC/Nastran) is executed
• All generated (or specified) files are copied back to the submit host
• Specifically for MSC/Nastran, the *.f06 file is searched for FATAL messages
Note: All of this is done automatically; the user does not have to issue any operating-system commands! (A rough sketch of the underlying LSF submission follows below.)
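Behind the GUI, this corresponds roughly to an ordinary LSF submission. A hedged sketch of the kind of command the MSC/Analysis Manager could issue on the user's behalf is shown below; the queue name, resource string and the nastran_wrapper script are illustrative assumptions, not the tool's actual internals:

# Illustrative only - roughly what the "Apply" button translates into.
# nastran_wrapper stands for a hypothetical script that copies the input files
# to the execution host, runs MSC/Nastran, copies the results back and
# scans the *.f06 file for FATAL messages.
bsub -q normal \
     -R "select[mem>200 && swp>400] order[ut:r1m]" \
     -o wing_model.lsf.out \
     nastran_wrapper wing_model.dat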
Job Monitoring with the MSC/Analysis Manager
• During the execution time, users can download selected files to their desktop system
• At the end of a job, the user is informed which files have been received on his "submit host"
• Each file can be opened directly with a mouse click
• Important tasks and the MSC/Nastran *.log file are monitored; at the end of the job, the output is checked for MSC/Nastran FATAL messages
Setting up Job Chains inside the MSC/Analysis Manager
With the PRE and POST options, users can set up any job chain; all Unix commands, scripts and programs can be chained. Shown here:
• transfer the .nastxxrc file to the execution host before MSC/Nastran is executed
• at the end of the MSC/Nastran run, generate an I_DEAS universal file on the execution host
• compress all generated files on the execution host (e.g. to reduce network traffic)
• uncompress all transferred files on the submit host
(An equivalent plain-shell sketch of such a chain follows below.)
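Since the PRE and POST fields take ordinary Unix commands, the chain shown above amounts to something like the following sketch; nas2unv is a hypothetical converter and the file names are placeholders:

# PRE command (run before MSC/Nastran starts):
rcp submit_host:/home/user/.nastxxrc .    # transfer the runtime-control file to the execution host
# POST commands (run on the execution host after the MSC/Nastran run):
nas2unv wing_model.op2 wing_model.unv     # hypothetical converter: generate an I_DEAS universal file
compress wing_model.*                     # reduce network traffic for the copy back
# ... and on the submit host, after the files have been copied back:
uncompress wing_model.*.Z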
Setting up Specific Job Options inside the MSC/Analysis Manager
• left: MSC/Nastran and LSF memory parameters
• center: MSC/Nastran restart options
• right: submit time of the job
Future Enhancements (Requests) for the MSC/Analysis Manager
• More in-depth integration of other CAE applications, as is done for MSC/Nastran (application-specific job handling)
• Better procedure for running jobs on local machines (no copying of files)
• Interaction with / integration of MSC/Estimate
• choose the right host for the job
• set up MSC/Nastran-specific job parameters automatically (e.g. memory/disk)
• Continue/enhance the partnership with Platform Computing
• e.g. integrate the LSF libraries with MSC/Nastran to make a job checkpointable, rerunnable and migratable inside the LSF environment
Conclusion
• With the combination of LSF and the MSC/Analysis Manager, CAE batch jobs can be set up, scheduled and monitored in a very easy way
• In a heterogeneous Unix environment, the combination of both tools achieved at DASA Airbus
• maximum performance for each batch job
• maximum job throughput
• and maximum usage of otherwise idle hardware resources
• (e.g. turnaround times of typical jobs were reduced by a factor of 6 compared with the mainframe)
• All workstations are coupled into a "virtual cluster"; the CAE user does not need special Unix know-how to use workstations other than "his own"