Feedback on the experiences of the BioMed DC
Nicolas Jacq, LPC, IN2P3/CNRS, France
DC report
• WISDOM statistics
  • Not all instances are registered yet
  • Wisdom.egee-eu.fr unavailable
• JRA2 statistics:
  • Based on RB records only; biomed DC jobs were selected
  • 1 RB missing (SINICA, for technical reasons)
• Estimation:
  • Number of jobs run: more than 60,000
  • CPU consumed: more than 100 CPU years
  • Total size: less than 1 TB
N. Jacq, 05/09/05
Job failures
• Estimation
  • 40% done successfully
  • 20% done unsuccessfully (after checking)
  • 10% aborted
  • 10% cancelled: resubmission, errors during the process
  • 20% failed/unknown: RB or node failures, other reasons?
• Reasons for unsuccessfully done jobs:
  • 80% license server: server down, power cut at SCAI, reduced number of server licenses, FlexX jobs not possible on the CE. Important because of automatic resubmission
  • 15% CE configuration: tar problem, missing space…
  • 5% no results transfer: both the lcg and the globus commands failed
• Reasons for aborted jobs
  • 63% mismatching resources: failing middleware component or wrong request in the job JDL
  • 28% wrong configuration
  • 4% network/connection failure
  • 4% proxy problems
  • 1% JDL problem
• Finally:
  • ~30% of jobs failed because of the grid (middleware, resources…)
  • ~30% failed because of the WISDOM application
  • 40% done successfully
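The summary figures can be cross-checked against the detailed estimates with a little arithmetic. The sketch below is one possible reading, not the author's stated method: it assumes that license-server failures and cancelled jobs count against the WISDOM application, and that everything else counts against the grid. The slide does not say how the categories were combined, and all function and variable names are illustrative.

```python
# Shares of all submitted jobs, as estimated on the slide (percent).
outcomes = {
    "done_ok": 40,
    "done_failed": 20,
    "aborted": 10,
    "cancelled": 10,
    "failed_unknown": 20,
}

# Reasons within the "done unsuccessfully" category (percent of that category).
done_failed_reasons = {
    "license_server": 80,
    "ce_configuration": 15,
    "no_results_transfer": 5,
}

# ASSUMPTION: only the license server is an application-side cause.
APPLICATION_REASONS = {"license_server"}


def attribute_failures(outcomes, reasons, cancelled_to="application"):
    """Split failed jobs between application and grid (percent of all jobs)."""
    app = 0.0
    grid = 0.0
    for reason, pct in reasons.items():
        share = outcomes["done_failed"] * pct / 100
        if reason in APPLICATION_REASONS:
            app += share
        else:
            grid += share
    # ASSUMPTION: aborted and failed/unknown jobs are grid-side failures.
    grid += outcomes["aborted"] + outcomes["failed_unknown"]
    # ASSUMPTION: the slide gives no attribution for cancelled jobs.
    if cancelled_to == "application":
        app += outcomes["cancelled"]
    else:
        grid += outcomes["cancelled"]
    return {"application": app, "grid": grid, "success": outcomes["done_ok"]}


print(attribute_failures(outcomes, done_failed_reasons))
```

Under these assumptions the split is 34% grid, 26% application and 40% success, which is consistent with the rounded ~30% / ~30% / 40% on the slide. Attributing cancelled jobs to the grid instead gives 44% grid and 16% application.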
Operational issues
• RB: main bottleneck despite the 12 available RBs (3 to 7 at the same time)
  • Overload, crashes, space limit for the shared repository, disk failure, wrong status information for done jobs in the CE, output sandboxes impossible to download
• RLS/RMC: was the main bottleneck for me before the DC
  • No RLS/RMC during a Renater cut
  • 10% of lcg-cp and lcg-cr commands failed
  • 5% of globus-url-copy commands also failed
• SE: no significant problem
  • Corrupted tar in the UPV SE
  • Power cut at SCAI, Renater cut for CC
• CE: no critical problem
  • Configuration problems, air conditioning
  • BDII update problem (causing problems for job distribution)
  • Difference in priority between biomgrid and biomed certificates
• UI: a handicap for submission
  • Slow multithreaded submission process, disk space, crashes
Improvement proposals (for intensive use of the grid)
• Load balancing
  • Know all possible CE configurations specific to a VO
  • Have a dynamic information system update for intensive submission
  • Robustness of the BDII update in a CE
  • Define a ranking attribute appropriate for intensive use
• Avoid bottlenecks
  • Not only 1 RLS/RMC (and not only 1 license server)
• RB synchronisation
  • As it stands, an RB failure means losing jobs or OutputSandBoxes. There is a real need for synchronisation between RBs (especially the LB service), so that a job can be sent through a given RB and its status checked or its results retrieved via another. This would mean that job databases and OutputSandBoxes are not stored on the RB but somewhere on the grid, where any RB can reach them, and maybe replicated. The idea is to have job management similar to data management, with something like the LFN for jobs.
• Process management
  • Know all possible errors of the grid (RB, commands)
  • Small nodes need a limit on the number of scheduled jobs in their queue
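The RB synchronisation proposal can be illustrated with a small in-memory model. This is only a sketch of the idea, not existing middleware: the `JobCatalog` and `Broker` classes, the `ljn:` naming scheme and the storage paths are all hypothetical. Job state lives in a shared catalog keyed by a logical job name, so any broker can report on a job submitted through another.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class JobRecord:
    """State of one job, keyed by a logical job name (hypothetical)."""
    status: str = "SUBMITTED"
    output_replicas: List[str] = field(default_factory=list)


class JobCatalog:
    """Hypothetical shared store standing in for 'somewhere on the grid'."""

    def __init__(self) -> None:
        self._records: Dict[str, JobRecord] = {}

    def register(self, ljn: str) -> None:
        self._records[ljn] = JobRecord()

    def update(self, ljn: str, status: str, replica: Optional[str] = None) -> None:
        record = self._records[ljn]
        record.status = status
        if replica is not None:
            record.output_replicas.append(replica)

    def lookup(self, ljn: str) -> JobRecord:
        return self._records[ljn]


class Broker:
    """Toy resource broker that keeps no job state of its own."""

    def __init__(self, name: str, catalog: JobCatalog) -> None:
        self.name = name
        self.catalog = catalog
        self.alive = True

    def _require_alive(self) -> None:
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")

    def submit(self, ljn: str) -> None:
        self._require_alive()
        self.catalog.register(ljn)

    def status(self, ljn: str) -> str:
        self._require_alive()
        return self.catalog.lookup(ljn).status

    def outputs(self, ljn: str) -> List[str]:
        self._require_alive()
        return list(self.catalog.lookup(ljn).output_replicas)


catalog = JobCatalog()
rb1 = Broker("rb1", catalog)
rb2 = Broker("rb2", catalog)

job = "ljn:/biomed/wisdom/job-0001"  # made-up logical job name
rb1.submit(job)
# The job finishes; its sandbox is replicated on two made-up storage locations.
catalog.update(job, "DONE", "se-a:/sandbox/job-0001.tar")
catalog.update(job, "DONE", "se-b:/sandbox/job-0001.tar")

rb1.alive = False  # simulate the RB failure described on the slide
print(rb2.status(job))
print(rb2.outputs(job))
```

Because the brokers hold no state, losing `rb1` costs nothing here: `rb2` still reports the job as done and lists both sandbox replicas. A real design would also have to make the catalog itself replicated and consistent, which this sketch does not attempt.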