Experiences with the gLite WMS
Andrea Sciabà
JRA1 All-Hands meeting, 8-10 November 2006, Abingdon, UK
Outline • Motivations • Testing and using the gLite WMS • The ATLAS experience • The CMS experience • The LHCb experience • The ARDA experience • Current status of the gLite 3.0 WMS • First look at the gLite 3.1 WMS • Conclusions
Motivations (I) • The LCG Resource Broker is robust, but • The code is basically frozen • no new features • difficult to get bug fixes • The Network Server submission is too slow • Job submission may require tens of seconds per job when the RB is loaded • The maximum job rate is limited • Experience from the LHC experiments indicates that the LCG RB cannot handle more than ~7000 jobs/day • It does not support the renewal of VOMS proxies • VOMS proxies are now the standard because they allow for fine-grained authorization (data access, job priorities)
Motivations (II) • The gLite WMS improves on almost every aspect • Higher job submission and dispatching rate • Typical submission rate: 0.5 s/job • Typical dispatching rate: 3 s/job • More job types (DAGs, collections, parametric jobs; a JDL sketch follows below) • Automatic zipping of sandboxes • and many other features • Job perusal, support for VOViews with VOMS FQAN-based access control, … • The goal is also to have the gLite WMS at least as robust as the LCG RB • Not yet completely accomplished
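For illustration, a minimal sketch of one of the new job types (a parametric job) and its submission through WMProxy; the JDL attribute names follow the gLite WMS documentation, while the executable, file names and command options are illustrative and may differ between UI releases:

    # Write a small parametric JDL; _PARAM_ is replaced by each parameter value
    cat > parametric.jdl <<'EOF'
    [
      JobType        = "Parametric";
      Executable     = "/bin/echo";
      Arguments      = "this is job number _PARAM_";
      Parameters     = 5;
      ParameterStart = 0;
      ParameterStep  = 1;
      StdOutput      = "out_PARAM_.txt";
      StdError       = "err_PARAM_.txt";
      OutputSandbox  = { "out_PARAM_.txt", "err_PARAM_.txt" };
    ]
    EOF
    # Submit via WMProxy with automatic proxy delegation; the job IDs are stored in a file
    glite-wms-job-submit -a -o parametric_ids.txt parametric.jdl

A single call like this expands into five jobs on the WMS side, which is what makes the WMProxy submission path much faster than submitting the jobs one by one through the Network Server.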
Testing the gLite WMS (I) • An intensive testing activity started in mid-July involving • Developers (JRA1) • SA1, SA3 • EIS team (on behalf of the experiments) • A few gLite WMS instances were installed at CERN, CNAF and Milan • Initial tag: 3.0.2 RC6 • Bug fixes deployed as soon as available • All instances were kept synchronized • All the WMS instances were configured to see the EGEE production Grid • Necessary to run tests at a realistic scale • The tests were primarily done by submitting jobs to LCG CEs • The gLite 3.0 CE is still too unreliable • The focus was more on "getting the jobs done" than on debugging the gLite CE
Testing the gLite WMS (II) • The testing process was primarily driven by ATLAS and CMS • ATLAS • Committed to use the gLite WMS for the official Monte Carlo production on EGEE resources • if the WMS did not work, part of the MC production stopped! • The job load is determined by the production assignments • CMS • A real CMSSW application was run but it did not perform real work • The number of submitted jobs and collections was changed to determine the scalability of the WMS • Now the gLite WMS is being used in production for the CSA06
History of the bug fixes • Upgrade of Condor (6.7.10 → 6.7.19) • Fixed bug #18123 • LB processes stuck in an infinite loop • Condor configuration heavily modified • To cure the tendency of DAGs to die • GRIDMANAGER_MAX_JOBMANAGERS_PER_RESOURCE increased • to speed up Condor submission • Fixed bug #18725 • Job Wrapper tried to upload also non-existent files • Discovered very poor disk I/O performance of the rb102.cern.ch node and reconfigured the RAID arrays for optimal performance • This cured the high mortality of DAGs due to LB connection timeouts • Reduced the number of planners per DAG to limit the number of concurrent planners, which used a lot of swap • Reduced the number of WMProxy server processes • This limited, but did not cure, the swap memory usage of the WMProxy server
ATLAS jobs [diagram: the ATLAS production system — Don Quijote and the production database (ProdDB), one supervisor per Grid, and executors interfacing to the LCG, OSG, NG and batch resources, with the LFC, LRC and RLS file catalogues] • Production of simulated events • A central database of jobs to be run • Event generation • Simulation + digitization • Reconstruction • A "supervisor" for each Grid • takes jobs from the central database • submits them • monitors them and checks their outcome • An "executor" acting as interface to the Grid middleware • EGEE/WLCG • Lexor using the gLite WMS • Condor-G direct submission
Submitting jobs via the gLite WMS [diagram: job lifecycle — Eowyn passes the job description to Lexor, which submits the job collection through WMProxy; the Resource Broker queries the Information System cache and schedules the jobs on the selected resources (Computing Elements and their local batch systems); the job status is queried from the Logging & Bookkeeping service and the job output is returned to Lexor]
ATLAS Tests • Job characteristics • Event generation • Approximate CPU time: 3 h • Simulation • Approximate CPU time: 20 h • Reconstruction • Approximate CPU time: 3 h • Job submission • Bulk submission • The supervisor groups jobs to be executed in collections of 100 jobs each • Each job in a collection can run on a different site • Synthetic tests were also run • Very simple jobs ("Hello world") that can run anywhere (a sketch of such a collection is shown below) • To study the impact of the shallow resubmission • To assess the reliability of the bulk submission
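A sketch of the kind of "Hello world" collection used for the synthetic tests (only 3 nodes here, instead of the 100 jobs per collection used in production; attribute names follow the gLite JDL conventions, ShallowRetryCount is set at collection level as a default for the nodes, and the command options are those of the gLite UI, so details may vary between releases):

    # A tiny job collection with shallow resubmission enabled
    cat > hello_collection.jdl <<'EOF'
    [
      Type = "collection";
      ShallowRetryCount = 3;
      nodes = {
        [ Executable = "/bin/echo"; Arguments = "Hello world 1";
          StdOutput = "out1.txt"; OutputSandbox = { "out1.txt" }; ],
        [ Executable = "/bin/echo"; Arguments = "Hello world 2";
          StdOutput = "out2.txt"; OutputSandbox = { "out2.txt" }; ],
        [ Executable = "/bin/echo"; Arguments = "Hello world 3";
          StdOutput = "out3.txt"; OutputSandbox = { "out3.txt" }; ]
      };
    ]
    EOF
    # Bulk submission: the whole collection goes to the WMS in a single call
    glite-wms-job-submit -a -o hello_ids.txt hello_collection.jdl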
Latest ATLAS Results (from 7 June to 9 November) • Official Monte Carlo production • Up to ~3000 jobs/day • Less than 1% of jobs failed because of the WMS in a period of 24 hours • Synthetic tests • Shallow resubmission greatly improves the success rate for site-related problems • Efficiency = 98% after at most 4 submissions
ATLAS issues • The LB in Milan failed to update the job status • interlogd LB daemon stuck • Cured by a restart of glite-lb-logger, but not understood • Missing documentation for the new gLite 3.1 features • DAGMan-based collections are too fragile to be used by final users • Timescale to get rid of DAGMan for collections? • The Python bindings for the WMS client API are still missing from the UI
CMS jobs • Job characteristics • Software: CMSSW 0.6.1 • Data analyzed: test sample preinstalled at CMS sites • Approximate CPU time: 30 minutes • Job submission • A predefined number of jobs submitted to each CMS site • Various mechanisms tested • Network Server • WMProxy • ~20% faster submission rate than via the NS (see the command sketch below) • Collections • Best possible submission speed • Submission in parallel from up to three User Interfaces
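The two submission paths compared above correspond, on the User Interface, to two different command sets (a sketch; the JDL file name is a placeholder and the exact commands and options may differ between UI versions):

    # Submission through the Network Server (LCG-RB-style interface)
    glite-job-submit -o ids_ns.txt cmssw_analysis.jdl
    # Submission through WMProxy (~20% faster in these tests); -a delegates the proxy automatically
    glite-wms-job-submit -a -o ids_wmproxy.txt cmssw_analysis.jdl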
CMS testing process • The submission is started • The total submission time is measured • The job status changes are monitored • The time to have all the jobs moved to Scheduled status is measured • All job failures are investigated and classified • A report is sent to support-exp-glite-rb@cern.ch • Bugs are identified, understood and fixed • The process is iterated several times
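As an example of how the status monitoring in this loop can be done from the UI (a sketch; ids_wmproxy.txt is the illustrative job-identifier file written at submission time, and the option names are those of the gLite UI commands):

    # Check the status of all jobs (or all nodes of a collection) listed in the ID file
    glite-wms-job-status -i ids_wmproxy.txt
    # Once jobs reach Done, retrieve their output sandboxes into a local directory
    glite-wms-job-output -i ids_wmproxy.txt --dir ./results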
CMS results (I) • Non-bulk submission was soon found to be as reliable as with the LCG RB • Bulk submission was long plagued by serious problems • Mass cancellation of DAG nodes • Very slow Condor submission rate • Excessive use of swap memory • Service crashes (e.g. LogMonitor) • Submission failures due to timeouts under load • Truncation of file names in input sandbox files • Almost all of the serious problems have eventually been fixed • The gLite WMS is now sufficiently stable to be used for real work with some effort • But it is still much more fragile than the LCG RB
Some CMS Results • ~20000 jobs submitted • 3 parallel UIs • 33 Computing Elements • 200 jobs/collection • Bulk submission • Performance • ~2.5 h to submit all jobs (0.5 seconds/job) • ~17 hours to transfer all jobs to a CE (3 seconds/job, i.e. ~26000 jobs/day) • Job failures • Negligible fraction of failures due to the gLite WMS • Either application errors or site problems
Other CMS results • Submission of 25000 jobs/day spread over 24 hours • Already too heavy on the system • Too much memory used • Submission of 15000 jobs/day with a standalone LB server • Totally safe
gLite WMS usage in CMS CSA06 • Submission rate < 10000 jobs/day per RB so far • Issues encountered so far • libtar bug in the User Interface (it led to file name truncation; fixed) • Gang-matching bug (if used, it blocks the RB for hours) • Log Monitor crashed and could not be restarted (corrupted Condor-G logs) • rb102.cern.ch becoming very slow (LB stuck and using CPU?) • RAL CEs invisible only from the gLite WMS (problem with the published information?)
LHCb first tests • Testing new functionality • Job perusal • Sandbox from/to GridFTP servers • VOViews • VOMS extension renewal • Bulk submission • Performance tests • Timing (Submitted, Ready, Scheduled) • Efficiency (Scheduled and Done vs. shallow resubmission retry) • Scalability tests • To measure how many jobs the new gLite WMS can sustain • [plots: timing for a single job, single thread, no resubmission; efficiency vs. shallow retry, 5000 jobs with ShallowRetryCount=10] • A JDL sketch combining these features follows below
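A sketch of a JDL exercising the new functionality tested here (job perusal, GridFTP-based sandboxes, shallow resubmission); the attribute names follow the gLite WMS JDL documentation, while the executable and the GridFTP URLs are purely illustrative:

    cat > feature_test.jdl <<'EOF'
    [
      Executable        = "test_job.sh";
      StdOutput         = "std.out";
      StdError          = "std.err";
      // Input sandbox fetched from, and output sandbox written to, a GridFTP server
      InputSandbox      = { "gsiftp://gridftp.example.org/sandbox/test_job.sh" };
      OutputSandbox     = { "std.out", "std.err" };
      OutputSandboxBaseDestURI = "gsiftp://gridftp.example.org/results/";
      // Job perusal: allow inspection of output files while the job is running
      PerusalFileEnable   = true;
      PerusalTimeInterval = 300;
      // Shallow resubmission, as in the 5000-job efficiency test
      ShallowRetryCount = 10;
    ]
    EOF
    glite-wms-job-submit -a -o feature_ids.txt feature_test.jdl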
LHCb issues
• "Proxy exception: Proxy validity starting time in the future. Please check client date/time. Method: jobSubmit. Error code: 1222" • obtained during single-thread, single-job submission
• "Operation failed. Unable to submit the job to the service: https://cert-rb-04.cnaf.infn.it:7443/glite_wms_wmproxy_server. SOAP version mismatch or invalid SOAP message. Error code: SOAP-ENV:VersionMismatch" • obtained with 10 threads submitting 500 jobs each from a single UI (the UI was not overloaded)
• Inconsistent status timestamps from glite-job-status even though the job was successfully executed; glite-job-logging-info holds the right information • Submitted: Fri Sep 29 12:33:38 2006 CEST, Waiting: Fri Sep 29 12:33:38 2006 CEST, Ready: ---, Scheduled: Fri Sep 29 12:37:30 2006 CEST, Running: Fri Sep 29 12:38:43 2006 CEST, Done: Fri Sep 29 12:43:21 2006 CEST
• Default ranking problem: in less than 3 hours more than 800 jobs went to Bari • Fuzzy rank was switched off, but this nonetheless points to a wrong behaviour of the default rank
Job reliability studies [chart: most common failure reasons from 400K jobs] • Trying to improve the knowledge about the failure reasons for Grid jobs • Collecting statistics • Pictorial representations of the job history • Not yet enabled on the gLite WMS • The gLite RBs are not yet sending information to R-GMA • Now the same framework is used to collect FTS failure statistics
Job reliability [screenshots: web interface to navigate submitted jobs and failure reasons; it extracts the jobID, the worker node and additional information (error conditions breakdown)]
Configuration parameters to tune • Number of planners per DAG • Needs to be optimized as a function of the number of collections • Too many DAGs → too many planners • Number of WMProxy processes • Too many of them take too much memory • GLITE_WMS_QUERY_TIMEOUT • If too short, queries to the LB can fail under heavy load • Time between retries when a job is not matched to any CE • If too short, the WMS spends too much time in matchmaking
Some unsolved bugs in gLite 3.0 • The number of WMProxy processes can grow beyond limits if large collections are submitted (>1000 jobs) • due to callouts to create job directories • Duplicated CEs in the list of matched CEs • Due to the introduction of VOViews • LogMonitor repeatedly dies • Due to corrupted Condor-G log files • Some CEs do not appear in the Information Supermarket • But they are seen by LCG RBs • making it impossible to submit jobs to some sites (like RAL!)
Desired features • Automatic proxy delegation with a non-random delegID • Already foreseen • Configurable duration of the time interval during which the WM tries to match a job to a CE • Cancellation of single collection nodes
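For reference, a non-random delegation identifier can already be chosen explicitly on the UI; the desired feature is to have the automatic delegation (-a) behave this way. A sketch (command names as in the gLite WMProxy UI):

    # Delegate the proxy once, under a user-chosen (non-random) delegation ID
    glite-wms-job-delegate-proxy -d mydelegid
    # Reuse the same delegation for subsequent submissions instead of delegating on every call
    glite-wms-job-submit -d mydelegid -o ids.txt job.jdl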
First tests with gLite 3.1 • 21000 jobs submitted in 24 hours to 21 CEs using collections of 200 jobs

Site                          Submit  Wait  Ready  Sched  Run  Done(S)  Done(F)  Abo  Clear  Canc
ce01-lcg.cr.cnaf.infn.it          20     0      0      0    0      980        0    0      0     0
ce03-lcg.cr.cnaf.infn.it         107     0      0      0    0      893        0    0      0     0
ce04.pic.es                       37     0      0      0    2      961        0    0      0     0
ce101.cern.ch                     40     1      0      0    1      934        0   24      0     0
ce102.cern.ch                    215    12      0      0    0        0        0  773      0     0
ce105.cern.ch                    178    13      0      0    0        0        0  809      0     0
ce106.cern.ch                    236     0      0      0    0      764        0    0      0     0
ce107.cern.ch                    288    25      0      0    0      117        0  570      0     0
ceitep.itep.ru                   110     0      0      0    0      890        0    0      0     0
cmslcgce.fnal.gov                 49     0      0      0    0      951        0    0      0     0
cmsrm-ce01.roma1.infn.it         259     0      0      0    0      741        0    0      0     0
dgc-grid-40.brunel.ac.uk          26     0      0      0    0      974        0    0      0     0
grid-ce0.desy.de                 228     1      0      0    0      771        0    0      0     0
grid10.lal.in2p3.fr               69     0      0      0    0      931        0    0      0     0
grid109.kfki.hu                   50     0      0      0    0      950        0    0      0     0
gridce.iihe.ac.be                244     0      0      0    0      745       11    0      0     0
gw39.hep.ph.ic.ac.uk               0     0      0      0    0      442      168  383      0     7
lcg00125.grid.sinica.edu.tw        7    57      0      0    0      773       17  146      0     0
lcg02.ciemat.es                   16     0      0      0    0      980        0    4      0     0
oberon.hep.kbfi.ee               206     0      0      0    0      392      359   43      0     0
t2-ce-02.lnl.infn.it             375     0      0      0    0      625        0    0      0     0
Results of the first tests on 3.1 • ~20% of jobs affected by a bug which makes them look as still Submitted even if they have finished • The sequence code of some events is wrong • A RegJob event at the end • It is understood how to fix it • Some DAGs aborted • error message "cannot create LB context" • No other WMS-related errors seen
Conclusions • The gLite WMS is used in production by ATLAS and CMS • Significant progress was made in the last few months to bring it to an acceptable level of stability • It is not yet in a condition where it can run unattended for several days without problems under a realistic load • Given that performance and the feature set are good enough for the applications, the main focus should now be on robustness and reliability • Automatic refusal of new jobs when too loaded • Memory usage under control • Collections not implemented as DAGs • Services should do their best not to die • ...