Test gLite CE with experimental services
Di Qing, Academia Sinica & CERN
Barcelona, 23/05/2007

gLite CE Testing - Di Qing

Outline
• Goals
• Strategy
• Typical tests
• Problems found during the tests
• Future tests

Goals
• Make the gLite CE at least as good as the LCG one in terms of performance and stability
  • 5000 simultaneous jobs per CE node
  • Job failure rate caused by the CE in standard operation: <0.5%
  • Job failures due to restart of CE services or CE reboot: <0.5%
  • 5 days of unattended running without performance degradation
    • 1 month at the end of 2007
• Using a new version of Condor
• Detect new problems in the middleware

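The numeric goals above can be written as a simple acceptance check. This is only a minimal sketch: the `RunSummary` record, its field names and the `meets_goals` helper are hypothetical and not part of gLite or Condor; only the two thresholds (0.5% and 5000 jobs) come from the slide.

```python
from dataclasses import dataclass

# Thresholds taken from the goals slide; the constant names are illustrative.
MAX_FAILURE_RATE = 0.005          # fewer than 0.5% of jobs may fail because of the CE
TARGET_SIMULTANEOUS_JOBS = 5000   # simultaneous jobs per CE node


@dataclass
class RunSummary:
    """Hypothetical summary of one test run (not a gLite data structure)."""
    submitted: int
    failed_due_to_ce: int
    peak_simultaneous_jobs: int


def failure_rate(run: RunSummary) -> float:
    """Fraction of submitted jobs that failed because of the CE."""
    if run.submitted == 0:
        raise ValueError("no jobs were submitted")
    return run.failed_due_to_ce / run.submitted


def meets_goals(run: RunSummary) -> bool:
    """True if the run satisfies both the failure-rate and the load goal."""
    return (failure_rate(run) < MAX_FAILURE_RATE
            and run.peak_simultaneous_jobs >= TARGET_SIMULTANEOUS_JOBS)
```

For example, a run with 10000 submitted jobs and 40 CE-related failures has a failure rate of 0.4% and passes, while 50 failures (exactly 0.5%) does not, because the goal is a strict "<0.5%".
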
Strategy
• Involving the developers and the certification team
  • Especially Francesco Prelz and the Condor developers, and Nuno Orestes Vaz Da Silva of the certification team
• One cluster on the CERN CTB dedicated to the test
  • Increasing the number of CPUs as far as possible
  • 20 WNs, each with 10 virtual CPUs
• Tuning the parameters according to the test results
• Applying new patches as soon as possible to fix the bugs
• All changes and the results of some tests are recorded on a wiki page
  • https://twiki.cern.ch/twiki/bin/view/LCG/GliteCETest

Typical tests
• Basic tests
  • Test the basic functionalities
• Gradual stress test
  • Every few minutes a batch of 5 jobs was submitted directly from the WMS via condor_submit
  • A separate Condor instance launched for each job
• Stress test
  • Thousands of simultaneous jobs per CE node
  • From 4000 to 10000 jobs per day
  • Typically sleep jobs
  • Fill the queue as quickly as possible

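The gradual stress test described above (small batches at a fixed interval) could be driven by a loop like the following. It is a sketch under stated assumptions: the 180-second interval stands in for "every few minutes", and `submit_one` is a hypothetical callable that would wrap the actual condor_submit invocation, which the slides do not show.

```python
import time
from typing import Callable, Iterator, Tuple


def batch_schedule(total_jobs: int, batch_size: int = 5,
                   interval_s: float = 180.0) -> Iterator[Tuple[float, int]]:
    """Yield (offset_in_seconds, jobs_in_batch) for each batch.

    batch_size=5 comes from the slide; interval_s=180 is an assumed
    value for "every few minutes".
    """
    if total_jobs < 0 or batch_size <= 0 or interval_s < 0:
        raise ValueError("invalid schedule parameters")
    offset = 0.0
    remaining = total_jobs
    while remaining > 0:
        n = min(batch_size, remaining)
        yield offset, n
        remaining -= n
        offset += interval_s


def run_gradual_test(total_jobs: int,
                     submit_one: Callable[[], None],
                     batch_size: int = 5,
                     interval_s: float = 180.0,
                     sleep: Callable[[float], None] = time.sleep) -> int:
    """Submit jobs batch by batch and return how many were submitted.

    submit_one is a hypothetical hook; in a real test it would call
    the submission command for a single job.
    """
    submitted = 0
    previous = 0.0
    for offset, n in batch_schedule(total_jobs, batch_size, interval_s):
        sleep(offset - previous)   # wait until this batch is due
        previous = offset
        for _ in range(n):
            submit_one()
            submitted += 1
    return submitted
```

Passing `sleep` as a parameter keeps the schedule testable without waiting: a dry run can supply a no-op in place of `time.sleep`.
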
Problems found during the tests
• Limit of 100 jobs maximum on the gLite CE
  • Bug in older Condor versions, fixed in Condor 6.8.4
• Stale Condor launcher job blocks all user jobs
  • As in bug #23779; typically jobs failed with “PeriodicHold expression 'Matched =!= TRUE && CurrentTime > QDate + 900'”
• gridmonitor on the WMS could not update the GRAM job status (the GRAM job that launches the Condor instance) on the CE
  • Solution: create a link from /var/glite/gram_job_state to /opt/globus/tmp/gram_job_state, or from /opt/glite/etc/grid-services to /opt/globus/etc/grid-services

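The PeriodicHold expression quoted above can be read as "put the job on hold if it has not been matched and it was queued more than 900 seconds ago". The sketch below mirrors that logic in Python, assuming that the ClassAd `=!=` operator means "is not identical to", so that an undefined `Matched` attribute also counts as not matched. The function name and the use of `None` for an undefined attribute are illustrative choices, not Condor APIs.

```python
from typing import Optional

HOLD_AFTER_S = 900  # from the expression: CurrentTime > QDate + 900


def periodic_hold(matched: Optional[bool], current_time: int, qdate: int) -> bool:
    """Mirror 'Matched =!= TRUE && CurrentTime > QDate + 900'.

    matched=None models an undefined Matched attribute; under the
    assumed "is not identical to" semantics it is treated the same
    as False.
    """
    not_matched = matched is not True
    return not_matched and current_time > qdate + HOLD_AFTER_S
```

With this reading, a launcher job that is never matched goes on hold 15 minutes after it was queued, which is consistent with the failure message seen in the tests.
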
Problems found during the tests
• Condor jobs can go on hold when pending removal
  • Under load, the periodic_hold condition is evaluated between the submission of a condor_rm command and its execution, causing jobs to go on hold permanently in the Condor queue (bug 25053)
• Attempts to submit failed
  • Under high load, BLAH failed to get the job info from BLParser
  • Solved by making the interval between retries increase exponentially
• BLAH proxy renewal can create corrupted files (bug 25841)

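The retry fix mentioned above (an exponentially increasing interval between retries) follows a common pattern that can be sketched as below. This is not the BLAH code: the base delay, growth factor, retry count and cap are assumed values, and `query` is a hypothetical callable standing in for the job-info request, returning `None` when no answer was obtained.

```python
import time
from typing import Callable, Iterator, Optional, TypeVar

T = TypeVar("T")


def backoff_delays(base_s: float = 1.0, factor: float = 2.0,
                   max_retries: int = 5, cap_s: float = 60.0) -> Iterator[float]:
    """Yield the wait before each retry: base, base*factor, ... capped at cap_s.

    All default values are assumptions chosen for illustration.
    """
    delay = base_s
    for _ in range(max_retries):
        yield min(delay, cap_s)
        delay *= factor


def query_with_backoff(query: Callable[[], Optional[T]],
                       sleep: Callable[[float], None] = time.sleep,
                       **backoff_kwargs: float) -> Optional[T]:
    """Call query() until it returns a value or the retries are exhausted.

    query is a hypothetical stand-in for the job-info lookup.
    """
    result = query()
    for delay in backoff_delays(**backoff_kwargs):
        if result is not None:
            return result
        sleep(delay)        # wait longer after each failed attempt
        result = query()
    return result
```

Growing the interval gives a loaded service progressively more time to recover, instead of adding a steady stream of retries to the load that caused the failure.
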
Problems found during the tests
• Repeated submit attempts (GAHP reports)
  • Under high CPU load, the connection to the Schedd on the CE occasionally timed out
  • Increase the timeout parameters on the WMS
• C-GAHP -> Schedd errors
  • Increase the timeout parameter on the CE as a workaround
• Firewall on WNs
  • Port range 20000-20100 must be open for incoming connections from the CE for proxy renewal
• C-GAHP worker thread crash
  • Corrupted file name
  • Another Condor bug, caused by a race condition

Achievement
• Understood the long-standing issue
  • Condor instance could not be launched
• Found several bugs in Condor, BLAH and WMS
• Performance improvement
  • Job success rate from around 50% to more than 98%
• Stability improvement
  • From 1 or 2 days to more than 5 days of unattended running

Future tests
• Move from user-based Condor instances to VO-based ones on the gLite CE
  • The gLite CE needs a static Condor-C per VO, submitting jobs to the batch system via glexec
  • Condor 6.9.3
  • Changes needed on both the WMS and the CE
• Stress tests with other batch systems
• Measure and reduce job failures due to restart of CE services or CE reboot
• Extend to longer unattended running
• Other issues?