Intelligent GRID Scheduling Service (ISS)

Vincent Keller, Ralf Gruber, EPFL
K. Cristiano, A. Drotz, R. Gruber, V. Keller, P. Kunszt, P. Kuonen, S. Maffioletti, P. Manneback, M.-C. Sawley, U. Schwiegelshohn, M. Thiémard, A. Tolou, T.-M. Tran, O. Wäldrich, P. Wieder, C. Witzig, R. Yahyapour, W. Ziegler, "Application-oriented scheduling for HPC Grids", CoreGRID TR-0070 (2007), available on http://www.coregrid.net

Outline
• ISS Goals
• Applications & Resources characterization
• ISS architecture
• Decision model: CFM
• ISS Modules/Services Implementation Status
• Testbeds (HW & SW)

Goals of ISS
1. Find the most suitable computational resources in an HPC Grid for a given component
2. Make the best use of an existing HPC Grid
3. Predict the best evolution of an HPC Grid

Γ model: Characteristic parameters of an application task*
O: Number of operations per node [Flops]
W: Number of main memory accesses per node [Words]
Z: Number of messages to be sent per node
S: Number of words sent by one node [Words]
Va = O/W: Number of operations per memory access [Flops/Word]
γa = O/S: Number of operations per word sent [Flops/Word]
*assuming the parallel subtasks are well balanced

VM=R/ M: Number of operations per memory access [Flops/Word] ra= min (R, M* Va): Peak task performance on a node [Flops/s] tc= O/ra: Minimum computation time [s] Γ model : Characteristic parameters of a parallel machine P: Number of nodes in a machine R: Peak performance of a node [Flops/s] M: Peak main memory bandwidth of a node [Words/s] Note: ra= Rmin (1, Va/VM)

Vc=P R/ C: Number of operations per sent word [Flops/Word] b=C/(P*<d>): Inter-node communication bandwidth per node [Words/s] tb=S/b: Time needed to send S words through the network [s] tL=LZ: Latency time [s] T=tc+ tb+ tL: Minimum turn around time of a task* M=(ra/b)(1+tL/tb): Number of operations per word sent [Flops/Word] B=b L: Message size taking L to be transfered Γ model : Characteristic parameters of the internode network C: Total network bandwidth of a machine [Words/s] L: Latency of the network [s] <d>: Average distance (= number of links passed) *I/O is not consideredand communication cannot behidden behind computation

Γ model (one value per application and machine)
Γ = γa / γM
Γ > 1
Task/application: γa = O / S [flops/64-bit word]
Machine (if LZ/S << 1): γM = ra / b [flops/64-bit word]
Speedup, Efficiency
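As a sketch of how speedup and efficiency can be tied to Γ: assume, as stated for the turnaround time, that communication cannot be hidden behind computation, so T = tc + tb + tL. Assume further (these definitions are not given in the text) that efficiency E is the fraction of T spent computing and that speedup A is P times E. Then:

```latex
\[
\Gamma = \frac{\gamma_a}{\gamma_M} = \frac{t_c}{t_b + t_L},
\qquad
E = \frac{t_c}{T} = \frac{t_c}{t_c + t_b + t_L} = \frac{\Gamma}{1 + \Gamma},
\qquad
A = P\,E = \frac{P\,\Gamma}{1 + \Gamma}.
\]
```

Under these assumptions, the condition Γ > 1 corresponds to an efficiency above 50%.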

Cluster P R Gflops/s P R Gflops/s M Gwords/s VM f/w C Gwords/s VC f/w b Mwords/s L s B Words NoW 10 6 60 0.8 7.5 0.003 19’200 0.3 60 200 Pleiades1 132 5.6 739 0.8 7 0.4 1’792 3 60 180 Pleiades2 120 5.6 672 0.8 7 3.75 179 30 60 1’800 Mizar 224 9.6 2’150 1.6 6 14 154 62 10 620 BlueGene 4’096 5.6 22’937 0.7 8 1’065 22 100* 2.5 250 Horizon 1’664 5.2 8’650 0.8 6.5 2’650 3.3 160** 6.8 1’080 SX-5*** 16 8 128 8 1 128 128 Pleiades 2+ 99 21.3 2110 2.7 8 3.1 682 30 60 1’800 Terrane 48 121.6 5’836 25 5 Parameters of some Swiss HPC machines *<d>32 for half of C **<d>10 *** decommissioned

Example: SpecuLOOS
Pleiades 1, FE: Γ = 1.4
Pleiades 2, GbE: Γ = 3.8
Pleiades 2+, GbE: Γ = 1.6

ISS/VIOLA environment

ISS: Job Execution Process
Goal: Find the most suitable machines in a Grid to run application components

Cost Function Model

Cost Function Model
• CPU costs Ke
• Licence fees Kl
• Results waiting time Kw
• Energy costs Keco
• Data transfer costs Kd
• All costs are expressed in Electronic Cost Units (ECU)

Cost Function Model: CPU costs
The CPU costs take into account investment cost, maintenance fees, bank interest, etc.

Cost Function Model: Broker
• The broker computes a list of machines with their relative costs for a given application component
• This ordered list is sent to the MSS for final decision and submission
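The broker step can be illustrated with a small sketch. It assumes that the total cost of running a component on a machine is the plain sum of the five cost terms, which is a simplification made here and not a formula taken from the slides. The class and function names, machine names, and ECU values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Costs:
    """Cost terms of one component on one machine, all in ECU."""
    Ke: float    # CPU costs
    Kl: float    # licence fees
    Kw: float    # results waiting time
    Keco: float  # energy costs
    Kd: float    # data transfer costs

    def total(self):
        # Assumption: unweighted sum of the five terms.
        return self.Ke + self.Kl + self.Kw + self.Keco + self.Kd


def rank_machines(costs_by_machine):
    """Return (machine, total cost) pairs, cheapest first."""
    totals = {name: c.total() for name, c in costs_by_machine.items()}
    return sorted(totals.items(), key=lambda item: item[1])


# Hypothetical cost estimates for one application component.
estimates = {
    "machine_A": Costs(Ke=10.0, Kl=0.0, Kw=4.0, Keco=2.0, Kd=1.0),
    "machine_B": Costs(Ke=6.0, Kl=0.0, Kw=9.0, Keco=1.5, Kd=1.0),
    "machine_C": Costs(Ke=8.0, Kl=3.0, Kw=2.0, Keco=1.0, Kd=0.5),
}

ranking = rank_machines(estimates)
for name, cost in ranking:
    print(f"{name}: {cost:.1f} ECU")
```

In this sketch the ordered list returned by rank_machines plays the role of the list that the broker hands to the MSS.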

Another important goal of ISS
• Simulation to evolve cluster resources in a Grid (uses the same simulator as for determining the three parameters [symbols lost])
• Uses statistical application execution data collected over a long period of time (the same data as for determining those three parameters)
• Support tool for deciding how to choose new Grid resources

Side products
• VAMOS monitoring service (measurement of Ra and a second quantity [symbol lost])
• Application optimization (increase Va, Ra)
• Processor frequency adaptation (reduce energy consumption)

What exists?
• Simulator to determine three parameters [symbols lost]
• VAMOS monitoring service to determine one parameter [symbol lost]
• Cost Function Model

What is in the implementation phase?
• Interface between ISS and MSS (first version ready by end of June 07)
• Ra monitoring (ready by end of May 07)
• Cost Function Model (beta version ready by end of 07)
• Simulator to predict new cluster acquisition (by the end of 07)

Application testbed
• CFD, MPI: SpecuLOOS (3D spectral element method)
• CFD, OpenMP: Helmholtz (3D solver with spectral elements)
• Plasma physics, single proc: VMEC (3D MHD equilibrium solver)
• Plasma physics, single proc: TERPSICHORE (3D ideal linear MHD stability analysis)
• Climate, POP-C++: Alpine3D (multiphysics, components)
• Chemistry: GAMESS (ab-initio molecular quantum chemistry)

First hardware testbed
UNICORE/MSS/ISS GRID
• Pleiades 1 (132 single proc nodes, FE switch, OpenPBS/Maui)
• Pleiades 2 (120 single proc nodes, GbE switch, Torque/Maui)
• Pleiades 2+ (99 dual proc/dual core nodes, GbE switch, Torque/Maui)
• CONDOR pool EPFL (300 single & multi proc nodes, no interconnect network)

ISS as a SwissGrid metascheduler
• ETHZ: SMP/NUMA, high γM cluster
• EPFL: SMP/NUMA, high γM cluster
• CERN: EGEE Grid
• EIF: NoW
• CSCS: SMP/vector, low γM cluster
[Diagram labels: ISS, Switch, SWING]

Conclusions
Automatic:
• Find the best-suited machines for a given application
• Monitor application behaviour on a single node and on the network
Guide towards:
• Better usage of the overall GRID
• Extending an existing GRID with the machines best suited to an application set
• Single node optimization and better parallelization
http://web.cscs.ch/ISS/
