Is a Grid cost-effective? Ralf Gruber, EPFL-SIC/FSTI-ISE-LIN, Lausanne SOS7
HPC in Europe
TOP500: 176 in Europe, 12 have more than 1 Tflop/s Linpack
First is CEA-DAM: No. 7
Germany: 71, UK: 39, France: 22, Italy: 16, Others: 28
Industry: 108, first (Telecom I) at No. 96
BMW: 11, Daimler-Chrysler: 5, Car F: 6
Not one big, but many smaller machines
HPC Companies:
Quadrics
Scali, SCI-based clusters: No. 51
SCS: see Toni’s presentation
Beowulf production: Paralline, Dalco, ......
Swiss-Tx project
The Swiss-Tx machines (with TNet switch):
1998: Prototype Swiss-T0 with 16 Alpha 21164 processors
1999: Swiss-T1 (Baby) with 16 Alpha 21264 processors
2000: Swiss-T1 with 70 Alpha 21264 processors
Know-how transfer to industry:
2001: GeneProt protein sequencing machine with 1420 Alpha 21264 processors
Peak performance = 1780 Gflop/s
In June 2001, it would have been No. 12 in the Top500 and 2nd in Europe, and it was world No. 1 among industrial computer installations
It would be No. 48 (= C-Plant) in the Top500 list of November 2002 and is still No. 2 among industrial computer installations
Is a grid cost-effective?
NO!
Reasons:
For 25 years, we have been able to use machines all over the world
Those who needed good connections installed them (HEPNET, Swissprot, ..)
Using Java runs counter to HPC
Parallel machines at EPFL and CSCS
EPFL-SIC:
SGI Origin3800 (500 MHz), 128 processors
HP Alpha ES45/Quadrics (1.25 GHz), 100 processors
Institutes:
PC clusters (CFD, Chemistry, Mathematics, Physics)
IBM SP-2 (EFD)
CSCS:
NEC SX-5 (16 processors)
IBM Regatta (256 processors, 1.3 GHz)
Optimal grid scheduling
Parameterisation of
. Single processor
. Cluster
. Application
Application-tailored Grid scheduling
Characteristic single processor parameters Va and ra
Va = Operations (Ops) / Memory accesses (LS)
ra = min(R∞, R∞ * Va / Vm) = min(R∞, M∞ * Va)
Examples
SAXPY: y = y + a * x
  Ops = 2, LS = 3 (2 loads + 1 store), Va = 2/3 -> ra = 2/3 * M∞
Matrix*matrix multiply and add:
  Va = n/2 -> ra = R∞
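The bound ra = min(R∞, M∞ * Va) can be evaluated directly. The following is a minimal Python sketch, not code from the talk: the function name, the sample machine figures (R∞ = 1500 Mflop/s, M∞ = 400 Mword/s) and the matrix size n = 100 are assumptions chosen for illustration.

```python
def attainable_performance(r_inf: float, m_inf: float, va: float) -> float:
    """Return ra = min(R_inf, M_inf * Va) in Mflop/s.

    r_inf: theoretical peak performance [Mflop/s]
    m_inf: theoretical peak memory bandwidth [Mword/s]
    va:    operations per memory access of the kernel
    """
    return min(r_inf, m_inf * va)


# Illustrative machine (assumed figures): 1500 Mflop/s peak, 400 Mword/s memory.
R_INF, M_INF = 1500.0, 400.0

saxpy_va = 2 / 3     # SAXPY: 2 operations per 3 memory accesses
matmul_va = 100 / 2  # matrix*matrix: Va = n / 2, with an assumed n = 100

saxpy_ra = attainable_performance(R_INF, M_INF, saxpy_va)    # limited by memory
matmul_ra = attainable_performance(R_INF, M_INF, matmul_va)  # limited by peak

print(f"SAXPY  ra = {saxpy_ra:.0f} Mflop/s")
print(f"MATMUL ra = {matmul_ra:.0f} Mflop/s")
```

With these assumed figures SAXPY is bound by memory bandwidth (ra = 2/3 * M∞), while the matrix product reaches the processor peak (ra = R∞), matching the two cases on the slide.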
Results with MATMULT
Va = 1 (double precision)
Vm = R∞ [Mflop/s] / M∞ [Mword/s]
R∞ [Mflop/s] = Theoretical peak performance
M∞ [Mword/s] = Theoretical peak memory bandwidth

Machine           P   R∞     ra=M∞   Vm   r     %
NEC SX-5          1   8000   8000    1
Pentium 4 1.5/R   1   1500   400     4    229   57
Alpha 21264       2   2000   333     6    200   60
Pentium 4 1.7/S   1   1700   133     12   92    69
AMD 1.2/S         1   2400   133     18   57    43

r: Measured performance
%: 100 * r / ra
/S: Slow SDRAM memory
/R: Fast Rambus or RDRAM memory
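The Vm and % columns follow from the other entries of the table. A short Python sketch of that arithmetic, assuming Va = 1 as stated on the slide; the function and variable names are illustrative, and the NEC SX-5 row is left out because the slide gives no measured value for it.

```python
VA = 1.0  # operations per memory access, as stated on the slide

# (machine, R_inf [Mflop/s], M_inf [Mword/s], measured r [Mflop/s]) from the table
ROWS = [
    ("Pentium 4 1.5/R", 1500.0, 400.0, 229.0),
    ("Alpha 21264", 2000.0, 333.0, 200.0),
    ("Pentium 4 1.7/S", 1700.0, 133.0, 92.0),
    ("AMD 1.2/S", 2400.0, 133.0, 57.0),
]


def efficiency_table(rows, va=VA):
    """Map machine name -> (Vm, ra, percentage of ra actually reached)."""
    table = {}
    for name, r_inf, m_inf, r in rows:
        ra = min(r_inf, m_inf * va)  # attainable performance for this kernel
        vm = r_inf / m_inf           # machine balance
        table[name] = (vm, ra, 100.0 * r / ra)
    return table


TABLE = efficiency_table(ROWS)
for name, (vm, ra, pct) in TABLE.items():
    print(f"{name:16s} Vm={vm:5.1f}  ra={ra:6.0f}  {pct:4.0f}%")
```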
Tailoring clusters to applications
G > 1
Tailoring clusters to applications
G = ga / gm
Application: ga = O / S
Machine: gm = ra / b
O: Number of operations [Flops]
S: Number of words sent [Words]
ra: Theoretical peak performance of the application [Mflops/s]
b: Peak network bandwidth per processor [Mwords/s]
Cluster characterisation
gm = ra / b
b = C / (P * <d>)
gm = P * ra [Mflops/s] * <d> / C [Mwords/s]

Table: The gm values for MATMULT (double precision)
Machine              P      P*ra [Mflops/s]   C [Mwords/s]   <d>    gm
T1 (TNet)            32*2   21333             640            1.25   40
T1 (Fast Ethernet)   32*2   21333             48             1      444
IELNX (P4+FE)        22     8800              34             1      250
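The gm column can be recomputed from gm = P * ra * <d> / C. A minimal Python sketch using the table entries; the names are illustrative, and C and <d> are simply taken as given on the slide (presumably the total network bandwidth and an average distance between processors, which is an assumption here).

```python
def machine_gamma(total_ra: float, c: float, d: float) -> float:
    """gm = (P * ra) * <d> / C, with P*ra in Mflop/s and C in Mword/s."""
    return total_ra * d / c


# name -> (P * ra [Mflop/s], C [Mword/s], <d>) as listed in the table
CLUSTERS = {
    "T1 (TNet)": (21333.0, 640.0, 1.25),
    "T1 (Fast Ethernet)": (21333.0, 48.0, 1.0),
    "IELNX (P4+FE)": (8800.0, 34.0, 1.0),
}

GM = {name: machine_gamma(*params) for name, params in CLUSTERS.items()}
for name, gm in GM.items():
    print(f"{name:20s} gm = {gm:6.1f}")
```

The computed values (roughly 42, 444 and 259) agree with the rounded figures 40, 444 and 250 in the table.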
LAUTREC on Swiss-T1 + TNet
Swiss-T1 (TNet): ra = 1000 Mflops/s, b = 10 Mwords/s, gm = 100
Water molecules: ga = 5*P*(0.65*Norb + 4.24*log2V) / (3*(P-1))
P = 8, Norb = 128, log2V = 20: ga = 330
G = 3.3 (3.6 measured)
-> 25% of overall time is due to communication, 75% is due to computation
LAUTREC on Swiss-T1 + Fast Ethernet
Swiss-T1 (FE): ra = 2000 Mflops/s, b = 1.5 Mwords/s, gm = 1333
Water molecules: ga = 5*P*(0.65*Norb + 4.24*log2V) / (3*(P-1))
P = 8, Norb = 128, log2V = 20: ga = 330
G = 0.25 (0.25 measured)
-> 20% of overall time is due to computation, 80% is due to communication
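The two LAUTREC cases can be reproduced with a few lines of Python. This is a sketch under stated assumptions, not the authors' code: the denominator of ga is read as 3*(P-1), and the communication share is modelled as 1/(1+G), i.e. computation and communication do not overlap and G is the ratio of their times. That model is consistent with the 80%/20% split quoted for G = 0.25. Evaluating the ga formula as transcribed gives about 320, slightly below the 330 quoted on the slides, so the G values below use the quoted 330.

```python
def lautrec_ga(p: int, n_orb: int, log2_v: float) -> float:
    """ga = 5*P*(0.65*Norb + 4.24*log2V) / (3*(P-1)); denominator grouping assumed."""
    return 5 * p * (0.65 * n_orb + 4.24 * log2_v) / (3 * (p - 1))


def communication_fraction(g: float) -> float:
    """Share of run time spent communicating.

    Assumes T_total = T_comp + T_comm (no overlap) and G = T_comp / T_comm.
    """
    return 1.0 / (1.0 + g)


ga_formula = lautrec_ga(p=8, n_orb=128, log2_v=20)  # about 320
GA_SLIDE = 330.0                                    # value quoted on the slides

g_tnet = GA_SLIDE / (1000.0 / 10.0)  # gm = ra / b = 100
g_fe = GA_SLIDE / (2000.0 / 1.5)     # gm = ra / b, about 1333

print(f"ga from formula: {ga_formula:.0f}")
print(f"TNet: G = {g_tnet:.2f}, communication = {communication_fraction(g_tnet):.0%}")
print(f"FE:   G = {g_fe:.2f}, communication = {communication_fraction(g_fe):.0%}")
```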
LAUTREC: Effect of latency
TNet/Swiss-T1: L = 13 μs MPI latency, b = 80 MB/s
Break-even message length: beml = L*b = 1000 B
Fast Ethernet: L = 100 μs MPI latency, b = 10 MB/s
Break-even message length: beml = L*b = 1000 B
Average message length in Lautrec: aml = p*V/16*P2
For test case (V = 96**3, P = 8): aml = 40 kB >> beml
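The break-even message length is the message size at which latency and transfer time are equal. A small Python sketch of that calculation, taking the latencies as 13 and 100 microseconds (the reading consistent with beml = 1000 B) and assuming the usual linear cost model, time = L + n/b, which the slide does not state explicitly; the function names are illustrative.

```python
def break_even_message_length(latency_s: float, bandwidth_bytes_per_s: float) -> float:
    """beml = L * b: message size whose transfer time equals the latency."""
    return latency_s * bandwidth_bytes_per_s


def latency_share(message_bytes: float, latency_s: float,
                  bandwidth_bytes_per_s: float) -> float:
    """Fraction of one message's time spent in latency (model: L + n/b)."""
    transfer_s = message_bytes / bandwidth_bytes_per_s
    return latency_s / (latency_s + transfer_s)


# name -> (latency [s], bandwidth [bytes/s])
NETWORKS = {
    "TNet": (13e-6, 80e6),
    "Fast Ethernet": (100e-6, 10e6),
}
AML = 40_000  # average LAUTREC message length for the test case, in bytes

for name, (lat, bw) in NETWORKS.items():
    beml = break_even_message_length(lat, bw)
    share = latency_share(AML, lat, bw)
    print(f"{name:14s} beml = {beml:6.0f} B, latency share at 40 kB = {share:.1%}")
```

Under this model latency accounts for only a few percent of the time per 40 kB message on either network, which is the point of aml >> beml.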
Point-to-point applications
ga = Operations (O) / Sends (S)
FE/FV: O ~ Nb of volume nodes
       O ~ (Nb of variables per node) squared
       O ~ Nb of non-zero matrix elements
       O ~ Nb of operations per matrix element
FE/FV: S ~ Nb of surface nodes
       S ~ Nb of variables per node
FE/FV: ga ~ Nb of nodes in one direction
       ga ~ Nb of variables per node
       ga ~ Nb of non-zero matrix elements
       ga ~ Nb of operations per matrix element
       ga ~ 1/Nb of surfaces
ga (NS/FV/100**3) ≈ 2000
ga (Poisson/FD/100**3) ≈ 400
Reminder (Beowulf+Fast Ethernet): gm ≈ 250
Other quantities
Memory usage
Price per 1 h CPU time
Engineering salary
Energy consumption
Maintenance/servicing/personnel costs
User convenience
Optimal Grid scheduling
Goal: Add application-tailored Grid scheduling to RMS
. Estimate machine and application parameters by counts
. Measure machine and application parameters (PAPI, ...)
. Build up a database of these parameters
. Find and submit to the best-suited Grid resource (not always the optimum)
. Update the database dynamically
. Perform statistics on decisions and decision failures
Optimal Grid scheduling
Settle and apply rules to find the best-suited resource by:
. Match machine/application (MPI or not MPI)
. Best price/performance ratio based on parameterisation
. Availability of the resources
. Engineering costs
. Energy consumption
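A toy version of such a rule set can be sketched in Python. Everything here is hypothetical and only illustrates the idea: the `Resource` class, the prices, and the specific rule (keep available resources with G = ga/gm > 1, then take the lowest price per Mflop/s) are assumptions, and engineering and energy costs are left out. The ra and b figures reuse the Swiss-T1 values from the LAUTREC slides.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Resource:
    """Hypothetical description of one Grid resource."""
    name: str
    ra: float              # attainable performance per processor [Mflop/s]
    b: float               # network bandwidth per processor [Mword/s]
    price_per_hour: float  # hypothetical price of one CPU hour
    available: bool = True

    @property
    def gm(self) -> float:
        return self.ra / self.b


def choose_resource(ga: float, resources: Iterable[Resource]) -> Optional[Resource]:
    """Pick the cheapest available resource whose G = ga / gm exceeds 1."""
    candidates = [r for r in resources if r.available and ga / r.gm > 1.0]
    if not candidates:
        return None  # no suitable resource: a case worth recording in the statistics
    # Price/performance: cost of one CPU hour per attainable Mflop/s.
    return min(candidates, key=lambda r: r.price_per_hour / r.ra)


POOL = [
    Resource("T1-TNet", ra=1000.0, b=10.0, price_per_hour=1.0),  # gm = 100
    Resource("T1-FE", ra=2000.0, b=1.5, price_per_hour=0.5),     # gm ~ 1333
]

lautrec_choice = choose_resource(330.0, POOL)   # communication-heavy application
ns_fv_choice = choose_resource(2000.0, POOL)    # computation-heavy application
print(lautrec_choice.name, ns_fv_choice.name)
```

With these assumed prices, an application with ga = 330 is sent to the TNet machine because Fast Ethernet would give G < 1, while an application with ga = 2000 runs on the cheaper Fast Ethernet cluster.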
Optimal Grid scheduling
Perform statistics to:
. Detect unavailable resources that are requested too often
. Detect real costs of an application
. Detect applications that should be parallelised/optimised to reduce costs
. Guide decision making for the next purchase
. Guide decisions on the allocation of R&D money
Is a grid cost-effective?
Yes, it can be!
Minimise overall costs by application-adapted job execution
Purchase low-cost resources that are in demand but not available
Parallelise cost-ineffective applications
Reduce engineering and energy costs
Note: “Cheap” resources do not have to be in use 90% of the time
Results in
More computing resources for the same price
More rapid increase of application efficiencies
Questions
Do computer manufacturers play the game?
Do application owners play the game?
Can we change users, decision makers and computing centres?
Reference
R. Gruber, P. Volgers, A. de Vita, M. Stengel, T.-M. Tran, Parameterisation to tailor commodity clusters to applications, Future Generation Computer Systems 19 (2003) 111-120
see also: http://sawww.epfl.ch/SIC/SA/publications/SCR02/scr13e.html