Speeding up and Stabilising the CE
WLCG Operations Workshop, CERN, January 25-26, 2007

Antun Balaz
SEE-GRID-2 WP3 Leader
EGEE-SEE ROC Serbia country representative
Institute of Physics, Belgrade
antun@phy.bg.ac.yu

The SEE-GRID-2 initiative is co-funded by the European Commission under the FP6 Research Infrastructures contract no. 031775
Overview
• Many tricks, post-deployment alterations and improvements have been found over the years that the CEs have been in production
• A summary of these is presented here, roughly falling into the following categories:
  • Hardware/General
  • Deployment
  • Information system/Monitoring
  • JobManager
  • Batch system
• Experiences collected from the EGEE/WLCG community, with special contributions from the SEE-GRID project, the EGEE-SEE ROC, and: Emanouil Atanassov, Maarten Litmaath, Fotis Georgatos, Peter Love, Einat Bielopolski, Valentin Vidic
Hardware/General (1)
• A fast (multi-processor) machine obviously helps to make the CE less of a bottleneck
• A lot of memory is a good idea too (the CERN CEs have >= 2 GB); otherwise, when the CE gets really busy, the many small or short-lived processes may cause the machine to start paging
• A big farm should have multiple CEs in front of it, to spread the load but also for redundancy
• A big farm ought to be split into multiple clusters, each with its own CE (or CEs); this allows for better scaling and easier maintenance (only one cluster at a time needs to be down)
Hardware/General (2)
• Separate the site BDII from the CE: a few tens of CPUs (say 50) already make it worthwhile to prevent the site from dropping out of the top-level BDII when the CE head node is highly loaded by job submissions and cleanup
• The CE GRIS should not run on the head node either, because it could suffer from the same problem; running the GRIS on the site BDII as well has been worked on, but those changes have not been certified yet. In the meantime the site BDII allows for longer timeouts, so the problem is mitigated somewhat
• The site BDII can be a very minimal node, because it has almost no work to do, but it should be quite reliable
Hardware/General (3)
• Currently, it is too easy for a single fan (a fan!) to fail and bring down tens if not hundreds of WNs, or grid services. The grid as we know it lacks redundancy and resilience at large; the general trend now should be to remove as many SPOFs as possible
• If the WNs use a shared file system (NFS, GPFS, …) for home directories, the CE should never be their file server
• Installing the name service caching daemon (package nscd) may be a good idea
Deployment (1)
• A fabric monitoring system (Ganglia, Nagios, BigBrother, Lemon, …) should be used to monitor the health of the nodes and discover problems early on; James Casey and Ian Neilson are leading a grid service fabric monitoring workgroup whose goal is to have all grid services monitored by the local fabric monitoring system as well as remotely, so that problems can be detected and fixed as early as possible
• To stabilize operations, the admin should be careful with changes that are not demanded by middleware updates
Deployment (2)
• If the gCE is at some point installed and configured by mistake on a node intended to be the lcg-CE, it is very difficult to undo the damage; a clean reinstall is a much easier option in such and similar cases
• Installing just the minimal list of OS packages and leaving the installation of additional ones to RPM dependencies is highly advisable
• The LCG Quattor Working Group provides templates for various node types
  • https://trac.lal.in2p3.fr/LCGQWG
• Keeping just the needed set of services (daemons etc.) active on all nodes is highly advisable
Information system/Monitoring (1)
• The information system is complex
  • GIP, MDS, GRIS, site BDII
• Each component queries for information using plugins/providers, which can cause high CPU load on the CE
• Job monitoring scripts and other monitoring tools on the CE (e.g. GridICE) can also add to this high CPU load
• Example: the current lcg-info-generic script can take as long as 90 seconds to process a moderately loaded CE serving 100 CPUs in WNs
• Considerable effort has been invested in providing more efficient scripts, plugins and providers by Jeff, and also by Steve
  • https://savannah.cern.ch/bugs/?func=detailitem&item_id=16625
Information system/Monitoring (2)
• Emanouil Atanassov provided two hyper-optimized versions of lcg-info-generic
  • http://glite.phy.bg.ac.yu/GLITE-3_0_2/lcg-info-generic-1.0.22-1_sl3/
• The first one (lcg-info-generic-EMO1) uses the fact that the value of the logical expression in this statement in the code
    if (lc($dynamic_dn) eq lc($static_dn)){
  is known in advance: it is fixed as soon as the statement $dynamic_dn=$_; has been executed, so it does not need to be re-evaluated on every line (see the sketch below)
• This can reduce the script execution time considerably: in the previous example, the execution time of 90 seconds on the CE can drop to as low as 10 seconds
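A minimal Perl sketch of the idea behind EMO1 (illustrative only: the variable names, sample LDIF lines and loop structure below are assumptions, not the actual lcg-info-generic code). The DN comparison is evaluated once, when a new dn: line is encountered, and the cached result is reused for all the attribute lines that follow instead of being recomputed on every line:

    use strict;
    use warnings;

    # Illustrative stand-ins for one static LDIF DN and the dynamic plugin output
    my $static_dn     = 'dn: GlueCEUniqueID=ce.example.org:2119/jobmanager-lcgpbs-dteam';
    my @dynamic_lines = (
        'dn: GlueCEUniqueID=ce.example.org:2119/jobmanager-lcgpbs-dteam',
        'GlueCEStateWaitingJobs: 3',
        'GlueCEStateRunningJobs: 42',
    );

    my @merged;
    my $match     = 0;                # does the current dynamic block match?
    my $static_lc = lc($static_dn);   # the static DN is fixed for this pass

    foreach my $line (@dynamic_lines) {
        if ($line =~ /^dn:/i) {
            # Re-evaluate the comparison only when the dynamic DN changes
            $match = (lc($line) eq $static_lc);
            next;
        }
        # Attribute lines just test the cached flag instead of comparing DNs
        push @merged, $line if $match;
    }

    print "$_\n" for @merged;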
Information system/Monitoring (3)
• The second, favourite version (lcg-info-generic-EMO2) uses the obvious fact that it is not necessary to scan through all of the dynamic lines in order to find the relevant ones: one can store in an array the locations of the matching dynamic DN lines, and then for every static DN loop only through the dynamic attributes found after a matching DN line (see the sketch below)
• Example: on the same CE as before, this script reduces the 90 seconds of execution time to just 1-2 seconds
• In terms of CPU load this can translate into an enormous decrease, depending on the number of jobs
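A corresponding Perl sketch of the EMO2 idea (again illustrative, not the real script): the dynamic output is indexed once by DN, so each static DN only touches the attribute lines of its own matching block:

    use strict;
    use warnings;

    # Illustrative dynamic plugin output (two DN blocks) and one static DN to merge
    my @dynamic_lines = (
        'dn: GlueCEUniqueID=ce.example.org:2119/jobmanager-lcgpbs-dteam',
        'GlueCEStateRunningJobs: 42',
        'dn: GlueCEUniqueID=ce.example.org:2119/jobmanager-lcgpbs-atlas',
        'GlueCEStateRunningJobs: 7',
    );
    my @static_dns = ('dn: GlueCEUniqueID=ce.example.org:2119/jobmanager-lcgpbs-atlas');

    # One pass over the dynamic output: index attribute lines by their (lowercased) DN
    my %attrs_by_dn;
    my $current;
    foreach my $line (@dynamic_lines) {
        if ($line =~ /^dn:/i) {
            $current = lc($line);
            $attrs_by_dn{$current} ||= [];
        }
        elsif (defined $current) {
            push @{ $attrs_by_dn{$current} }, $line;
        }
    }

    # For every static DN, loop only over the attributes of its matching block
    foreach my $dn (@static_dns) {
        my $block = $attrs_by_dn{ lc($dn) } or next;
        print "$dn\n";
        print "$_\n" for @$block;
    }

Instead of rescanning all dynamic lines once per static entry, the work per static DN shrinks to the size of its own block, which is why the reduction is so large.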
JobManager (1)
• The Globus JobManager running on a CE polls the batch server frequently (by issuing commands like 'qstat -f' for torque)
• These requests can ask for very detailed output from the batch system server; since only the job IDs and the corresponding job states are actually needed, simpler commands that produce far less output can be built into the JobManagers (see the sketch below)
• Details for torque:
  • http://goc.grid.sinica.edu.tw/gocwiki/High_%28network%29_load_on_PBS_server_and_CE_caused_by_JobManager
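A minimal sketch of the kind of lighter polling meant here (the real change lives in the Globus PBS JobManager scripts; the command, column positions and hash name below are assumptions for illustration): ask the batch server for the short 'qstat' listing instead of the full 'qstat -f' dump, and keep only the job IDs and states:

    use strict;
    use warnings;

    my %state_of;    # job ID => single-letter state (Q, R, E, ...)

    open my $qstat, '-|', 'qstat' or die "cannot run qstat: $!";
    while (my $line = <$qstat>) {
        # Typical torque line:
        # 123.ce.example   STDIN   dteam001   00:00:01   R   dteam
        my @f = split ' ', $line;
        next unless @f >= 6 && $f[0] =~ /^\d+\./;   # skip header/separator lines
        $state_of{ $f[0] } = $f[-2];                # state is the next-to-last column
    }
    close $qstat;

    printf "%s => %s\n", $_, $state_of{$_} for sort keys %state_of;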
JobManager (2)
• If MPI is supported on the CE and homes are shared on the WNs, the following adjustment to the JobManager allows a local scratch directory to be used for non-MPI jobs (a sketch of the idea follows below):
  • http://listserv.cclrc.ac.uk/cgi-bin/webadmin?A2=ind05&L=LCG-ROLLOUT&P=R559098&I=-3
• If the WNs share homes from the CE over NFS and MPI is supported according to the usual MPI recipe, all jobs will be executed from the pool account home directories, which will effectively kill the CE when the number of jobs is large
• This highlights how a bad general decision (where to put the file server) can affect job execution in a non-trivial way, and how it can be avoided with a small trick
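A rough Perl sketch of the logic behind that adjustment (hypothetical: the real patch modifies the JobManager/job wrapper, and the $count variable, the JOB_CPU_COUNT fallback and the use of EDG_WL_SCRATCH below are illustrative assumptions): MPI jobs keep running from the shared home, while single-CPU jobs are moved to node-local scratch so they stop hammering the file server:

    use strict;
    use warnings;
    use File::Temp qw(tempdir);

    # Number of CPUs requested by the job, as seen by the JobManager (illustrative)
    my $count = $ENV{JOB_CPU_COUNT} || 1;

    my $workdir;
    if ($count > 1) {
        # MPI job: it genuinely needs the shared home directory
        $workdir = $ENV{HOME};
    }
    else {
        # Non-MPI job: run from node-local scratch instead of the NFS-mounted home
        my $scratch = $ENV{EDG_WL_SCRATCH} || '/tmp';
        $workdir = tempdir('job_XXXXXX', DIR => $scratch, CLEANUP => 0);
    }

    chdir $workdir or die "cannot chdir to $workdir: $!";
    print "running job from $workdir\n";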
Batch system
• The batch system should be separated from the CE
• However, if this is not the case, then since some VOs tend to submit a lot of small jobs (= a lot of perl scripts creating high load), you may want to limit the number of queueable jobs:
    set queue biomed max_queuable = 100
• If you have a lot of CPUs (> 50), it may be a good idea to introduce some kind of caching of the command used to provide job states, if this is not built in; an older example for torque was developed at NIKHEF (a simple sketch of the idea follows below):
  • http://www.dutchgrid.nl/Admin/nikhef/
• More details and tips & tricks will be given in Steve's tutorial
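A minimal sketch of such a caching wrapper (the NIKHEF tool linked above is the real example; the paths, the 30-second lifetime and the ignored command-line arguments here are simplifying assumptions): the real qstat binary is renamed and executed at most once per cache lifetime, while all other callers are served the cached output:

    #!/usr/bin/perl
    use strict;
    use warnings;

    my $real_qstat = '/usr/bin/qstat.orig';    # the renamed original binary (assumed path)
    my $cache_file = '/var/tmp/qstat.cache';
    my $lifetime   = 30;                       # seconds the cached output stays valid

    my $stale = !-e $cache_file
             || (time - (stat $cache_file)[9]) > $lifetime;

    if ($stale) {
        # Refresh the cache from the real command, renaming atomically so a
        # concurrent reader never sees a half-written file
        my $output = qx($real_qstat 2>/dev/null);
        open my $out, '>', "$cache_file.$$" or die "cannot write cache: $!";
        print $out $output;
        close $out;
        rename "$cache_file.$$", $cache_file or die "cannot rename cache: $!";
    }

    open my $in, '<', $cache_file or die "cannot read cache: $!";
    print while <$in>;
    close $in;

A production wrapper (like the NIKHEF one) would also have to key the cache on the qstat arguments and guard the refresh with a lock, but the load reduction comes from the same idea: many JobManager polls, one real qstat.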