Status of PDC’05/SC3 System stress test
LCG + ALICE + Site experts
ALICE-LCG TF meeting, Geneva, December 08, 2005
General running statistics
• Event sample (last 2 months of running):
  • 22500 jobs completed (Pb+Pb and p+p)
  • Average duration 8 hours, 67500 job cycles
  • Total CPU work: 540 KSi2K hours
  • Total output: 20 TB (90% CASTOR2, 10% site SEs)
• Participating centres (22 total):
  • 4 T1’s: CERN, CNAF, GridKa, CCIN2P3
  • 18 T2’s: Bari (I), Clermont (FR), GSI (D), Houston (USA), ITEP (RUS), JINR (RUS), KNU (UKR), Muenster (D), NIHAM (RO), OSC (USA), PNPI (RUS), SPbSU (RUS), Prague (CZ), RMKI (HU), SARA (NL), Sejong (SK), Torino (I), UiB (NO)
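Two derived averages may help put these totals in context (simple division of the figures above, not numbers from the slide itself):

$$
\frac{20\ \mathrm{TB}}{22500\ \mathrm{jobs}} \approx 0.9\ \mathrm{GB\ per\ job},
\qquad
\frac{67500\ \mathrm{cycles}}{22500\ \mathrm{jobs}} = 3\ \mathrm{cycles\ per\ job}.
$$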
General running statistics (2)
• Distribution of completed jobs per site:
  • T1’s: CERN: 19%, CNAF: 17%, GridKa: 31%, CCIN2P3: 22%
    • Very even distribution among the T1’s
  • T2’s: 11% in total
    • Extremely good stability at Prague, Torino, NIHAM, Muenster, GSI, OSC
  • Some under-utilization of T2 resources – more centres were available but could not install the Grid software needed to use them fully
Efficiency numbers
• Event failures:
  • 562 jobs with persistent AliRoot failures after up to 3 retries (2.5%)
  • Errors saving or downloading input files – non-persistent and due to temporary service malfunctions
• All other error classes (application software area not visible, connectivity issues, black holes) are non-existent with the job agent model – jobs are simply not pulled from the TQ (see the sketch after this slide)
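The pull behaviour noted above is what makes those error classes disappear: a minimal sketch of the idea follows, with all function names and checks hypothetical (this is not the actual AliEn job agent code).

```python
# Minimal sketch of a pull-model job agent (hypothetical names, not AliEn code).
# The agent validates its local environment first and only asks the central
# Task Queue (TQ) for work when the checks pass, so a broken site
# ("black hole") simply never pulls a job.
import shutil
from collections import deque

def environment_ok() -> bool:
    """Local sanity checks an agent might run before pulling work."""
    software_visible = shutil.which("aliroot") is not None  # application area mounted?
    _, _, free = shutil.disk_usage("/tmp")
    return software_visible and free > 5 * 1024**3          # e.g. 5 GB scratch space

def run_agent(task_queue: deque) -> None:
    while task_queue:
        if not environment_ok():
            return                   # misconfigured site: jobs stay safely in the TQ
        job = task_queue.popleft()   # pull the next matching job
        job()                        # run the payload

run_agent(deque([lambda: print("simulating one job")] * 3))
```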
System stress test
• Goals of the test:
  • Central services behaviour:
    • Many of the past problems (large number of proxies, overload of server machines, etc.) improved with AliEn v.2-5 and through redistribution of services over several servers
  • Site services behaviour (VO-boxes, interaction with LCG):
    • Connection to central services, stability, job submission to the RB – improved with AliEn v.2-5
  • CERN SE behaviour (CASTOR2):
    • Overflow of the xrootd tactical buffer – improved with additional protection in the migration scripts (sketched after this slide)
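The slide does not show what the added protection looks like; presumably it amounts to checking buffer occupancy before staging more data. A minimal sketch of such a guard, with hypothetical paths and thresholds:

```python
# Sketch of an occupancy guard a migration script could apply before copying
# more files into the buffer (paths and threshold are hypothetical; the
# actual CASTOR2 protection is not shown on the slide).
import shutil, time

def wait_for_space(buffer_path: str, high_watermark: float = 0.85,
                   poll_seconds: int = 60) -> None:
    """Block until buffer occupancy drops below the high watermark."""
    while True:
        total, used, _ = shutil.disk_usage(buffer_path)
        if used / total < high_watermark:
            return                    # safe to stage the next file
        time.sleep(poll_seconds)      # let tape migration drain the buffer

wait_for_space("/tmp")  # demo call on a path that exists everywhere
```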
System stress test (2)
• General targets:
  • Number of concurrently running jobs: 2500 over 24 hours (7500 jobs total)
  • Storage: CASTOR2, 15K files (2 per job), each file an archive of 5 ROOT files, 7.5 TB total
• Special target:
  • GridKa provides 1200 job slots – test of the VO-box
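These targets are internally consistent, which is worth making explicit (using the 8 h average job duration from the running statistics):

$$
2500\ \mathrm{slots} \times \frac{24\ \mathrm{h}}{8\ \mathrm{h/job}} = 7500\ \mathrm{jobs},
\qquad
7500 \times 2 = 15000\ \mathrm{files},
\qquad
\frac{7.5\ \mathrm{TB}}{15000\ \mathrm{files}} = 0.5\ \mathrm{GB\ per\ archive}.
$$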
Results: 2450 jobs
• Running job profile (plot): the negative slope is explained in Results (4)
Results:
• CPU utilization across 15 sites (80% T1 / 20% T2):
  • T1’s: CERN: 8%, CCIN2P3: 12%, CNAF: 20%, GridKa: 41%
  • T2’s: Bari: 0.5%, GSI: 2%, Houston: 2%, Muenster: 3.5%, NIHAM: 1%, OSC: 2.5%, Prague: 4%, Torino: 2%, ITEP: 1%, SARA: 0.1%, Clermont: 0.5%
• Number of concurrent jobs: 98% of target number:
  • Special thanks to Kilian and the GridKa team for making 1200 CPUs available for the test
• Duration: 12 hours (1/2 of the target duration)
• Jobs done: 2500 (33% of target number)
• Storage: 33% of target
Results (2):
• VO-box behaviour:
  • No problems with the services running, no interventions necessary
  • Load profile on the VO-boxes – on average proportional to the number of jobs running at the site, nothing special
• [Load profile plots: CERN, GridKa]
Results (3):
• Storage behaviour:
  • xrootd (interface) and CASTOR2 – no problems:
    • However, the objective was not to stress-test the MSS and network
• Central AliEn services behaviour:
  • Job submission: 3000 jobs (6 master jobs) submitted, split and available in the TQ within 2 hours (0.8 jobs/sec) – see the splitting sketch after this slide
  • Job starting and running phases – no problem with the number of jobs, no special load on the proxy, DB or any other service
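The submission figures correspond to 6 master jobs expanding into 500 sub-jobs each. A minimal sketch of such splitting into a central task queue (illustrative only, not the AliEn implementation):

```python
# Illustrative master-job splitting (not the AliEn implementation): each
# master job expands into sub-jobs that are inserted into the Task Queue.
from dataclasses import dataclass

@dataclass
class SubJob:
    master_id: int
    index: int          # which slice of the production this sub-job covers

def split_master_job(master_id: int, n_subjobs: int = 500) -> list[SubJob]:
    """Expand one master job into its sub-jobs (500 each: 6 x 500 = 3000)."""
    return [SubJob(master_id, i) for i in range(n_subjobs)]

task_queue: list[SubJob] = []
for master_id in range(6):                     # the 6 master jobs of the test
    task_queue.extend(split_master_job(master_id))
assert len(task_queue) == 3000                 # matches the slide's total
```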
Results (4):
• Negative slope on the number-of-jobs plot:
  • Occurred during the job saving phase
  • Post-mortem analysis by experts (Pablo and Predrag)
  • Prevented us from reaching the target duration of the exercise
Conclusions
• These results are still preliminary – the exercise de facto ended at 02:00 this morning
• VO-box model: shows scalability up to 1000 jobs running concurrently at a given site (the maximum number of CPUs available):
  • We are confident that it can handle much more than that
• Storage: CASTOR2 – stable interface and storage behaviour; the next target is to test throughput performance
• Central services:
  • Job submission/splitting – high performance
  • Starting/running – no problem (limited by the number of available CPUs)
  • Saving – server (DB) overload – will be analysed and fixed by experts
• We should repeat the exercise soon…
Acknowledgements
• Many thanks to the site experts for the excellent support throughout PDC’05/SC3 so far
• Special thanks to Kilian and the GridKa team for making 1200 CPUs available for the stress test
• And as usual: Patricia, Stefano, Pablo, Predrag, Andreas and Derek