CREAM Status and plans
Massimo Sgaravatto, INFN Padova
On behalf of the CREAM product team
Status
• CREAM CE v. 1.5 now in production
  • In gLite 3.1 (sl4_i386)
  • In gLite 3.2 (sl5_x86_64)
• CREAM CE v. 1.6 for gLite 3.2 / sl5_x86_64 certified 2 weeks ago
  • Certified according to the new certification model
  • While trying it at UCSD last week (in the context of the OSG CREAM evaluation), bug #64516 in trustmanager was found
    • Issue not in CREAM, but affecting (also) CREAM
    • It affects certificates signed by the DoE CA
    • CREAM needs to be rebuilt against a fixed version of trustmanager, when ready
GDB - Amsterdam, March 24, 2010
New certification model
• Each middleware product team is responsible for certifying its software
• Two types of patches
  • Metapackage patch: a patch corresponding to a Grid node (e.g. CREAM CE, WMS node, LB server, …)
  • Internal patch: software to be included in one or more metapackage patches (e.g. trustmanager, lcas, lcmaps, voms, etc.)
  • When a metapackage patch is certified, the relevant internal patches certified so far are also “included”
• Old-style patches no longer accepted by the gLite release team since December 9, 2009
• The transition to the new certification model is taking much longer than originally hoped
  • Certification of patches under the new model possible for gLite 3.2 / sl5_x86_64 since about mid-February
  • Still not technically possible to certify patches for gLite 3.1 / sl4_i386
    • Not only for CREAM, but for gLite in general
CREAM CE 1.6
• New operation QueryEvent
  • To be used by WMS (ICE)
  • To improve ICE’s detection of job status changes
  • Scalability problems in the current (CEMon + polling) approach
• Glexec sudo
  • Glexec used only once per job submission (just to find the uid to be used in sudo calls)
  • Better performance
  • Will facilitate the integration with Argus
• New BLAH BLparser for LSF and Torque/PBS
  • Uses the status/history commands instead of parsing the log files
    • Log files might still be needed by the batch system commands (e.g. tracejob, bhist)
  • Also allows easier configuration
    • No longer necessary to configure the CE and then the blparser
  • Old parser still supported
    • The BLparser type (new/old) can be chosen at configuration time
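The idea behind an event query, as opposed to polling each job, can be sketched as follows. This is only an illustration under stated assumptions: `FakeCE`, `query_events`, `sync_statuses` and the event fields are hypothetical names, not the actual QueryEvent interface of CREAM, and the status strings are used purely as sample values. The client keeps a cursor (the id of the last event seen) and asks only for newer events.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    event_id: int   # monotonically increasing id (assumed)
    job_id: str
    status: str


class FakeCE:
    """Hypothetical in-memory stand-in for a CE that records status events."""

    def __init__(self):
        self._events = []

    def record(self, job_id, status):
        self._events.append(Event(len(self._events) + 1, job_id, status))

    def query_events(self, after_id, max_events=100):
        """Return at most max_events events with id greater than after_id."""
        newer = [e for e in self._events if e.event_id > after_id]
        return newer[:max_events]


def sync_statuses(ce, known, last_id):
    """Apply all events newer than last_id to `known`; return the new cursor."""
    while True:
        batch = ce.query_events(last_id)
        if not batch:
            return last_id
        for event in batch:
            known[event.job_id] = event.status
        last_id = batch[-1].event_id
```

With this pattern the cost of one synchronisation depends on the number of new events rather than on the number of jobs being tracked, which is the kind of scalability gain the slide alludes to.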
CREAM CE 1.6 (cont'd)
• Support of SGE in BLAH (as contributed by CESGA/LIP)
• Limiter to protect CREAM when the machine is overloaded
  • New job submissions are blocked when this happens
  • Takes into account load, memory usage, number of file descriptors, etc.
  • Very similar to the limiter used in the WMS
• Proxy purger
  • Cleans expired delegations from the delegationDB and from the file system
• Support of file transfers from/to gridftp servers started with user credentials
  • Requested by Condor
• Improved CREAM startup
  • Startup could take a while when the status of jobs in a non-terminal state had to be retrieved
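A limiter of this kind can be sketched as a simple threshold check. The snippet below is a minimal illustration, not the CREAM limiter itself: the metric names, the `Limits` class and all threshold values are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    # All thresholds are illustrative values, not CREAM defaults.
    max_load_1min: float = 40.0
    max_mem_used_pct: float = 95.0
    max_open_fds: int = 500


def exceeded_limits(metrics, limits=Limits()):
    """Return the names of the resources whose threshold is exceeded."""
    reasons = []
    if metrics["load_1min"] > limits.max_load_1min:
        reasons.append("load")
    if metrics["mem_used_pct"] > limits.max_mem_used_pct:
        reasons.append("memory")
    if metrics["open_fds"] > limits.max_open_fds:
        reasons.append("file descriptors")
    return reasons


def accept_submission(metrics, limits=Limits()):
    """New submissions are accepted only while no threshold is exceeded."""
    return not exceeded_limits(metrics, limits)
```

Only new submissions are gated in this sketch; jobs already accepted are not touched, which matches the behaviour described on the slide.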
CREAM CE 1.6 (cont'd)
• Improved proxy renewals when multiple jobs share the same delegationid
  • Typical use case, considering at least the submissions from WMS and from Condor
• Improved performance of some DB queries
• Several bug fixes
  • User ‘tomcat’ no longer added to VO groups and glexec group
  • Glexec’s lcmaps conf file fixed
    • A user could get mapped to a local user different from the one mapped by gridftpd
  • GLITE_WMS_RB_BROKERINFO properly set in CREAM jobwrapper
  • …
• Configuration changes already communicated to M. Jouvin for Quattor QWG templates
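The slides do not say how the renewal for shared delegations was improved. One plausible way to reduce the work, shown here purely as an assumption, is to renew each delegation once rather than once per job. The function names and the `(job_id, delegation_id)` input format are hypothetical.

```python
from collections import defaultdict


def plan_renewals(jobs):
    """Group job ids by delegation id.

    jobs: iterable of (job_id, delegation_id) pairs (assumed format).
    """
    by_delegation = defaultdict(list)
    for job_id, delegation_id in jobs:
        by_delegation[delegation_id].append(job_id)
    return dict(by_delegation)


def renew(jobs, renew_delegation):
    """Call renew_delegation once per distinct delegation id.

    Returns the number of renewals actually performed.
    """
    plan = plan_renewals(jobs)
    for delegation_id in plan:
        renew_delegation(delegation_id)
    return len(plan)
```

For many jobs sharing a single delegation, the number of renewal operations then depends on the number of delegations, not on the number of jobs.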
Batch system support status
• Batch system support is coordinated by NIKHEF/SA3
• A batch system xyz is supported in the CREAM CE
  • When xyz is supported in BLAH
  • When glite-xyz-utils is provided
    • Apel
    • Information providers
    • Configuration (yaim-xyz)
• Torque/PBS
  • Supported in gLite 3.1 and gLite 3.2
  • New blparser also available with CREAM CE 1.6
• LSF
  • Supported in gLite 3.1
  • Supported in gLite 3.2 once glite-lsf-utils is released
    • Current status of relevant patch (#3403): ready for production
  • New blparser also available with CREAM CE 1.6
Batch system support status (cont'd)
• SGE
  • CESGA-LIP responsibility
  • Support in BLAH provided with CREAM CE 1.6
  • glite-sge-utils provided with patch #3764
    • Current status: ready for roll-out
  • A couple of sites already trying it
    • Uni-Muenchen, LIP
  • Only new blparser
• Condor
  • PIC responsibility
  • Support in BLAH has been in place for a while
  • glite-condor-utils still missing in gLite 3.2
    • According to Pau Tallada (PIC) they are close to finalizing it
  • In the meantime CREAM with Condor is possible with some manual configuration
    • E.g. this is what was done at UCSD (OSG), where they use Condor as batch system
  • Only new blparser
• BQS
  • “Running in production and available to the 4 LHC VOs” (S. Reynaud)
WMS CREAM
• Main issue is the detection of job status changes by the ICE component of the WMS
  • Job status changes in some cases are not detected by ICE (bug #61405)
    • Job finished, but reported as Running by WMS/LB
  • Job status changes might be detected late
    • Current approach based on CEMon notification + polling doesn’t scale
• Problems addressed in WMS 3.2.14 (patch #3621) for gLite 3.1
  • Bug fixes and use of the new QueryEvent operation provided by CREAM CE 1.6
  • Our tests look promising wrt ICE (see next slide)
  • In some tests the main bottleneck now appears to be the LB, which in some cases (in particular when the LB DB is not properly configured) cannot sustain a high submission rate
• Stuck in finalizing and certifying patch #3621
  • Metapackage preparation, etc.
  • Still not possible to certify patches for gLite 3.1
Job status changes detection by ICE
Interoperability
• New ICE (WMS 3.2.14) with new CREAM CE (CREAM 1.6)
  • All job status changes are expected to be detected, and much more promptly than now
• New ICE (WMS 3.2.14) with old CREAM CE (CREAM < 1.6)
  • All job status changes are expected to be detected if there is a valid proxy around (i.e. no more bug #61405)
  • But there could still be problems in promptly detecting job status changes
• Old ICE (the one in production now) with new CREAM CE (CREAM 1.6)
  • Works no worse than now
Condor CREAM
• Issues
• Lease not renewed
  • Confirmed by the Condor people that the problem is on the Condor side
  • They are going to provide a fix in ~ 2 weeks
• Problems with proxy renewals
  • Noticed that the proxy renewal is done very (too) often by Condor
    • Still not fully clear why
  • Proxy renewal not very efficient in CREAM when many jobs use the same delegationid
    • CREAM can take too long to satisfy such requests, and the queue of commands to be executed in CREAM can grow too much
    • Problem fixed in CREAM CE 1.6
• Overload of the gridftp server on the Condor host when many short jobs start together
  • Condor is going to use a different approach for sandbox management (see next slide)
Sandbox transferring
• Condor (and WMS) now uses 1
  • File transfers 1 done when job starts running
  • Gridftpd on Condor host can get overloaded when many jobs start running together
• Going to move to 2
  • 2a done when job is submitted
  • 2b done when job starts running
• OSG is also asking to use batch system staging facilities instead of gridftp for 2b
  • Likely appropriate (only) if Condor is used as batch system
  • With e.g. Torque (when ssh is used) I am afraid it will make things worse
  • It will be configurable
[Diagram: Condor submitting host, CREAM CE, WN and Job, with transfers labelled 1, 2a and 2b]
Output Sandbox
• Right now the JDL must specify where (which gridftp/https server) the OSB has to be staged
• LHCb has very recently (last week) asked for a different approach
  • Possibility to store the OSB files in the CREAM CE
  • Possibility to then retrieve them using a glite-ce-job-output command
• This was also discussed with ALICE some time ago, but the outcome was to use gridftp servers on their VOBOXes
• M. Jouvin also raised the issue recently
• This can be done, but requires some work
  • Also requires some “space management” in the CREAM CE
    • Enforcing a max sandbox size, purging old sandboxes when the free disk space is getting too low
• Still to be understood how critical this request is, in order to evaluate how the current plans must be modified
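The two “space management” rules mentioned above (a size cap and purging when free space runs low) can be sketched as follows. This is a hypothetical illustration only: nothing like it is described as implemented in the slides, and the function names, the tuple layout and the oldest-first purge order are assumptions.

```python
def check_sandbox_size(size_bytes, max_bytes):
    """Return True if a sandbox of this size may be stored."""
    return size_bytes <= max_bytes


def select_for_purge(sandboxes, free_bytes, min_free_bytes):
    """Pick sandboxes to delete, oldest first, until enough space is free.

    sandboxes: list of (job_id, mtime, size_bytes) tuples (assumed layout).
    Returns the job ids to purge, in purge order.
    """
    to_purge = []
    for job_id, _mtime, size in sorted(sandboxes, key=lambda s: s[1]):
        if free_bytes >= min_free_bytes:
            break
        to_purge.append(job_id)
        free_bytes += size  # space reclaimed once this sandbox is removed
    return to_purge
```

The selection is kept separate from the actual deletion so that the policy can be checked without touching the file system.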
Other issues
• LCG-CE coupled with CREAM-CE
  • Having a cluster used by a gLite 3.1/sl4 LCG CE and by a gLite 3.2/sl5 CREAM CE is a common use case
  • Open issue for Torque: maui client/server mismatch (bug #61698)
    • Unfortunately this is not in our domain
    • Documented a workaround (suggested in the LCG-ROLLOUT mailing list) in the CREAM known issues page (http://grid.pd.infn.it/cream/field.php?n=Main.KnownIssues)
• Some recent requests by LHCb
  • Define the CREAMjobid in the environment of the job
    • Needed for some DIRAC monitoring issues
    • Easy to do
    • To be provided with CREAM CE 1.6 (since we have to rebuild it for the trustmanager issue)
What’s next (current plan)?
• Certification of CREAM CE 1.6 client for gLite 3.2/sl5_x86_64 (patch #3671): on-going
  • Support of QueryEvent operation
  • Some minor fixes
  • Certification expected to be quite fast
• Certification of CREAM CE 1.6 (server side) for gLite 3.1/sl4_i386 (patch #3898)
  • Not possible now (as for all patches for gLite 3.1)
  • As far as CREAM is concerned, same software as the one used for gLite 3.2 (i.e. ready)
  • … but built and run against different versions of other software components
• Certification of CREAM CE 1.6 client for gLite 3.1/sl4
  • Not possible now (as for all patches for gLite 3.1)
• New ICE for gLite 3.1 (provided with WMS 3.2.14): patch #3621
  • Certification not possible now (as for all patches for gLite 3.1)
  • CREAM CE 1.6 client must be certified first
• CREAM CE 1.7
CREAM CE v. 1.7
• Integration with Argus
  • Argus used to decide if a certain operation on a CREAM CE is authorized
  • Also used to get the local user id
  • Single AuthZ system in the CREAM CE
    • Now there is an AuthZ layer in CREAM, LCAS/LCMAPS for glexec, LCAS/LCMAPS for gridftpd
    • Because of bugs/misconfigurations, inconsistent AuthZ decisions could be taken
  • Gridftpd also integrated with Argus
  • Not using glexec (and dependencies) anymore
  • Initially the old code will be maintained
    • At configuration time it will be possible to decide whether Argus or the “old system” has to be used
  • Some work still to be done also on the Argus side
• Bug fixes
• Unlikely to be finalized by the end of EGEE-III