SouthGrid Status

Pete Gronbech: 31st March 2010 Technical Meeting

Timetable
• Review of Performance
• GridPP3 H/w Money
• MoUs
• Site Tickets
• Site Status Reports
• Lunch
• Monitoring
• Pakiti - Kashif
• Ganglia - Ewan
• PBSWEBMON - Ewan
• ipmi temperature monitoring? - Ewan
• ATLAS - Pete
• AOB

UK Tier 2 reported CPU – Historical View to present

SouthGrid Sites: Accounting as reported by APEL
Sites are upgrading to SL5 and recalibrating their published SI2K values.
RALPP seems low, even after my compensation for publishing 1000 instead of 2500.
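The compensation mentioned for RALPP amounts to rescaling the reported figure by the ratio of the intended to the published SI2K value. A minimal sketch, assuming the accounted figure is linear in the published per-core value; the function name and the sample input are hypothetical and are not taken from the APEL data:

```python
def compensate(reported, published_si2k=1000, intended_si2k=2500):
    """Rescale a normalised CPU figure that was accounted with the wrong
    per-core SI2K value. Assumes the figure is linear in that value."""
    if published_si2k <= 0:
        raise ValueError("published_si2k must be positive")
    return reported * intended_si2k / published_si2k


# Hypothetical input: 400 units reported while publishing 1000 instead of 2500.
print(compensate(400))  # 1000.0
```

If the corrected figure still looks low, as reported here, the shortfall would have to come from something other than the published value.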

GridPP3 h/w money

New MoU Figures
• Birmingham: CPU up to 576, disk down to 114TB
• Bristol: CPU down to 252, disk down to 25TB
• Cambridge: CPU up to 329, disk up to 164TB
• Oxford: CPU up to 638, disk up to 318TB
• RALPP: CPU up to 4129, disk down to 573TB
• JET: MoU should be shared amongst the other sites

Tickets

ATLAS Production Monitoring
http://panda.cern.ch:25980/server/pandamon/query?dash=prod

SL5
• Need to complete any migration to SL5

CREAM CEs
• Request for sites to set up test CREAM CEs
• SCAS / glexec
• ARGUS

BDIIs? (Also DNS servers. We decided other locations would suffice.)
May 2009 and then again this year: Discuss
Hi Pete and Ewan,
In the light of today's Tier 1 downtime (and coincident Manchester downtime), Chris and I were having a chat and thinking that perhaps it would be a good idea for there to be a top BDII somewhere in SouthGrid. Our feeling is that it wouldn't be a great idea to host it at RAL as we are often affected by the same issues as the Tier 1. So would Oxford be a sensible location?
And while on the subject, Chris says he remembers seeing some configuration incantation to give a list of BDIIs to query in order. Do you folks happen to know anything about this (preferably what it is)?
Cheers,
Rob
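The "list of BDIIs to query in order" idea in Rob's email is ordered failover. The sketch below only illustrates that behaviour and is not the gLite client configuration itself: the endpoints are hypothetical, and the query function is a stand-in that simulates the first BDII being down.

```python
def query_first_available(bdiis, query):
    """Try each BDII in order; return (endpoint, result) from the first that answers."""
    errors = []
    for endpoint in bdiis:
        try:
            return endpoint, query(endpoint)
        except OSError as exc:  # covers connection refused, timeouts, etc.
            errors.append(f"{endpoint}: {exc}")
    raise RuntimeError("no BDII answered: " + "; ".join(errors))


# Hypothetical endpoints, listed in order of preference.
BDIIS = ["bdii-a.example.org:2170", "bdii-b.example.org:2170"]


def fake_query(endpoint):
    """Stand-in for a real LDAP query; pretends the first BDII is in downtime."""
    if endpoint.startswith("bdii-a"):
        raise ConnectionError("simulated downtime")
    return "ok"


print(query_first_available(BDIIS, fake_query))
```

With a list like this, a Tier 1 downtime would only cost the time taken to fail over to the next entry.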

VOMS CERTS - the need or not!
Dear All,
In Dec'09 (trying to debug new CE & SL5 WN) I asked Maarten Litmaath:
> Question: on SL4 WN there's lots of *pem in /etc/grid-security/vomsdir
> On SL5 WN there's no *pem in /etc/grid-security/vomsdir
> Should there be? (Maybe SL5 WN do things different)
And Maarten Litmaath said:
> Indeed, on gLite 3.2 we only put the VOMS certs where they are needed,
> i.e. on none of the node types so far.
So, re: new gridpp: the rpm contains
/etc/grid-security/vomsdir/voms.gridpp.ac.uk-20090112.pem
/etc/grid-security/vomsdir/voms.gridpp.ac.uk-20100119.pem
So one doesn't need to install the new gridpp rpm on SL5 WN - confirmed?
The SL5 WN .lsc for gridpp are correct, no change needed - is that right?

Biomed
Hi,
We still haven't solved this problem. Our version of /etc/grid-security/vomsdir/biomed/cclcgvomsli01.in2p3.fr.lsc is identical to that provided by Chris.
The relevant part of our site-info.def reads:
VO_BIOMED_VOMS_CA_DN="'/C=FR/O=CNRS/CN=GRID2-FR'"
VO_BIOMED_VOMS_SERVERS="vomss://cclcgvomsli01.in2p3.fr:8443/voms/biomed?/biomed/"
VO_BIOMED_VOMSES="'biomed cclcgvomsli01.in2p3.fr 15000 /O=GRID-FR/C=FR/O=CNRS/OU=CC-LYON/CN=cclcgvomsli01.in2p3.fr biomed'"
lcg-vomscerts is at version 5.7.0-1
The dpm is at version glite-SE_dpm_mysql-3.1.33-0
In our srmv2.2 logs, we see
02/09 16:56:01 31534,0 srmv2.2: SRM02 - soap_serve error : [::ffff:134.158.72.154] (grid04.lal.in2p3.fr) : CGSI-gSOAP running on grid001.jet.efda.org reports Could not accept security context
which seems to be a symptom of this problem.
Anyone have any idea on how to proceed?
Thanks
Dave

Dear All,
Our CE running lcg-CE-3.1.37-0 had its site-info.def changed yesterday & yaim rerun. Since then it fails all OPS SAM tests, being unable to map DN to opssgm account. Alice, LHCb & CMS SAM tests all ok though.
In /etc/grid-security, grep -l opssgm * results in nothing, whereas on sister CE (running 3.1.38 if it matters) it's found in several files.
The change to site-info.def was changing
QUEUE_GROUPS=" /dteam/ROLE=lcgadmin /dteam/ROLE=production dteam /ops/ROLE=lcgadmin /ops/ROLE=production ops /alice/ROLE=lcgadmin /alice/ROLE=production /alice/ROLE=pilot alice etcetc-for-other-VOs "
QUEUES="long medium short express"
LONG_GROUP_ENABLE="$QUEUE_GROUPS"
MEDIUM_GROUP_ENABLE="$QUEUE_GROUPS"
SHORT_GROUP_ENABLE="$QUEUE_GROUPS"
to
TESTQUEUE_GROUPS=" /dteam/ROLE=lcgadmin /dteam/ROLE=production dteam /ops/ROLE=lcgadmin /ops/ROLE=production ops "
REALQUEUE_GROUPS=" /alice/ROLE=lcgadmin /alice/ROLE=production /alice/ROLE=pilot alice etcetc-for-other-VOs-except-ops+dteam "
QUEUES="long medium short express"
QUEUE_GROUPS="${TESTQUEUE_GROUPS} ${REALQUEUE_GROUPS}"
SHORT_GROUP_ENABLE=$QUEUE_GROUPS
EXPRESS_GROUP_ENABLE=$QUEUE_GROUPS
# don't give ops access to long/medium = SAM jobs time out in queue
LONG_GROUP_ENABLE=$REALQUEUE_GROUPS
MEDIUM_GROUP_ENABLE=$REALQUEUE_GROUPS
Are "" really needed around $QUEUE_GROUPS in <queue>_GROUP_ENABLE? Is the space in the new defined QUEUE_GROUPS likely the bug? Is there some other bug there?
Having rerun yaim, qmgr now says ops only allowed to short & express.
The symptoms are in /var/log/globus-gatekeeper.log
lcas client name: /DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=samoper/CN=582979/CN=Judit Novak
LCAS 0: lcas_plugin_voms-plugin_confirm_authorization_from_x509(): Did not find a matching VO entry in the authorization file
Until finding the post http://scotgrid.blogspot.com/2008/08/nanocmos-lcas-fail.html
So as they said, commenting out the lcas_voms.mod in /opt/glite/etc/lcas/lcas.db
Now the error in /var/log/globus-gatekeeper.log changes to
LCMAPS 0: 2010-02-14.04:02:48.0000006664.0000000000 : lcmaps_plugin_voms_localaccount-plugin_run(): Could not find a VOMS localaccount in /etc/grid-security/grid-mapfile (failure)
Help! How does yaim make files in /etc/grid-security & 'lose' knowledge of opssgm? (Afraid don't know much about this)
This CE shares a gridmapdir with its sister CE (different arch SL5 WN) via NFS so that pool accounts should get mapped the same on both clusters, but opssgm is not in there. Now, anyway.
Sister CE passing all OPS SAM tests, still has older form of site-info.def.
(PS both CEs actually in Scheduled MAINT ATM for HPC machine room Power Maintenance)
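The intended effect of the new site-info.def layout can be checked by modelling it directly. The sketch below is an illustration in Python, not yaim's own logic: the group lists are the ones quoted in the email (the elided "other VOs" entries are left out), and the helper name queues_for is hypothetical.

```python
# Group lists as quoted in the email; the elided "other VOs" entries are omitted.
TESTQUEUE_GROUPS = ("/dteam/ROLE=lcgadmin /dteam/ROLE=production dteam "
                    "/ops/ROLE=lcgadmin /ops/ROLE=production ops")
REALQUEUE_GROUPS = ("/alice/ROLE=lcgadmin /alice/ROLE=production "
                    "/alice/ROLE=pilot alice")
QUEUE_GROUPS = f"{TESTQUEUE_GROUPS} {REALQUEUE_GROUPS}"

# Mirrors the <queue>_GROUP_ENABLE settings after the change.
GROUP_ENABLE = {
    "long": REALQUEUE_GROUPS,
    "medium": REALQUEUE_GROUPS,
    "short": QUEUE_GROUPS,
    "express": QUEUE_GROUPS,
}


def queues_for(group):
    """Return the queues whose whitespace-separated group list contains group."""
    return [q for q, groups in GROUP_ENABLE.items() if group in groups.split()]


print(queues_for("ops"))    # ['short', 'express']
print(queues_for("alice"))  # ['long', 'medium', 'short', 'express']
```

This matches what qmgr reports, so the queue access itself looks as intended. On the quoting question: in ordinary shell, an assignment such as SHORT_GROUP_ENABLE=$QUEUE_GROUPS does not word-split the expanded value, so, assuming site-info.def is read as plain shell, the embedded spaces alone would not be expected to cause the failure.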

Cambridge Publishing
>> Hi Santanu/All
>>
>> I've noticed that UKI-SOUTHGRID-CAM-HEP has some rather odd publishing
>> figures in the latest WLCG Tier-2 report (see page 8 of
>> https://twiki.cern.ch/twiki/bin/viewfile/LCG/SamMbReports?filename=Tier2_Reliab_201001.pdf).
>> In particular the site KSI2K value is given as "-0". I'm sure there is an
>> interesting philosophical discussion about negative zero but in this case
>> it is evidently wrong (perhaps a Condor config issue?). I'd like to feed
>> back a correct figure to the WLCG management and understand the figure by
>> Wednesday morning if possible (I have to report corrections by that
>> afternoon). On the wider publishing front the storage also looks a bit
>> screwy unless there is no storage:
>> http://gstat-prod.cern.ch/gstat/summary/GRID/GRIDPP/
>>
>> Many thanks,
>> Jeremy
Hi Santanu,
I would suggest raising a GGUS ticket against gstat so that you can work through the details with Laurence et al. Alternatively the support list email address is: project-grid-info-support (Grid Information System Support) <project-grid-info-support@cern.ch>.
Cheers,
Jeremy
On 22 Feb 2010, at 12:45, Santanu Das wrote:
> Hi Jeremy,
>
> I have to agree that I don't understand GStat-2 at all. I don't know how
> it's being calculated but the ldapsearch result is not too bad.
>
> [root@serv07 ~]# ldapsearch -x -H ldap://serv07.hep.phy.cam.ac.uk:2170 -b mds-vo-name=UKI-SOUTHGRID-CAM-HEP,o=grid | egrep 'GlueCECapability|GlueSubClusterPhysicalCPUs|GlueSubClusterLogicalCPUs'
> GlueCECapability: unset
> GlueCECapability: unset
> GlueCECapability: unset
> GlueCECapability: unset
> GlueCECapability: unset
> GlueSubClusterPhysicalCPUs: 40
> GlueSubClusterLogicalCPUs: 160
> GlueCECapability: CPUScalingReferenceSI00=2013
> GlueSubClusterPhysicalCPUs: 3
> GlueSubClusterLogicalCPUs: 12
> GlueCECapability: CPUScalingReferenceSI00=2013
>
> Doesn't CPUScalingReferenceSI00 represent the new KSI2K values? I have
> two CEs (serv03 & serv07), running two different sets of WNs (SL4 and SL5
> respectively), and they are publishing the correct number of Physical and
> Logical CPUs, whereas GStat-2 reports all 0 for them. Pete may have
> in-depth knowledge on this.
>
> cheers,
> Santanu
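The ldapsearch output in Santanu's email can be totalled mechanically, which gives a figure to compare against what GStat-2 shows. A minimal sketch, assuming the filtered output has one "attribute: value" pair per line as quoted; the function name summarise is hypothetical.

```python
# Filtered ldapsearch output as quoted in the email.
LDAP_OUTPUT = """\
GlueCECapability: unset
GlueCECapability: unset
GlueCECapability: unset
GlueCECapability: unset
GlueCECapability: unset
GlueSubClusterPhysicalCPUs: 40
GlueSubClusterLogicalCPUs: 160
GlueCECapability: CPUScalingReferenceSI00=2013
GlueSubClusterPhysicalCPUs: 3
GlueSubClusterLogicalCPUs: 12
GlueCECapability: CPUScalingReferenceSI00=2013
"""


def summarise(text):
    """Sum the published CPU counts and collect the distinct SI00 reference values."""
    totals = {"physical": 0, "logical": 0}
    si00 = set()
    for line in text.splitlines():
        key, _, value = line.partition(":")
        value = value.strip()
        if key == "GlueSubClusterPhysicalCPUs":
            totals["physical"] += int(value)
        elif key == "GlueSubClusterLogicalCPUs":
            totals["logical"] += int(value)
        elif key == "GlueCECapability" and value.startswith("CPUScalingReferenceSI00="):
            si00.add(int(value.split("=", 1)[1]))
    return totals, sorted(si00)


totals, si00 = summarise(LDAP_OUTPUT)
print(totals)  # {'physical': 43, 'logical': 172}
print(si00)    # [2013]
```

The site is therefore publishing 43 physical and 172 logical CPUs across the two subclusters, so a value of 0 in GStat-2 would not appear to come from these attributes.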

EGEE Broadcasts

Minutes
Present:
• Pete Gronbech (Chair/Minutes)
• Ewan MacMahon
• Kashif Mohammad
• Chris Brew
• Rob Harper
• David Robson
• Santanu Das
• Winnie Lacesso
• Chris Curtis
PDG used these slides to start the meeting and encourage round-the-table exchanges of information. It would be good if more of our sites could pick up CMS work; CB said it's down to him finding time to enable a site.
Site Update:
Birmingham: Have been in downtime; now just the HPC is down, for 3 more days for power work. There was a s/w area issue. 60TB SE (40TB new + 20TB fixed). Twins*8 = 64 job slots; uses Xen for VMs. KVM is supported natively in SL5.4, so that should be the way to go in future (EM). New CREAM CE, ALICE issues, local WN glexec, ARGUS server. A common gridmap dir is required across the CREAM and lcg-CEs. HPC same as before. Problems with GPFS disappearing randomly, exported to NFS. Now using SL5, no longer a chroot environment. ATLAS pilot jobs failing on the tarball-installed HPC. ATLAS supposed to fix but fudged in the meantime.
Bristol: lcgce02 GPFS sluggish, small files cause a problem. Need to set up SL5 HPC cluster, but it is busy with other work so it may not be easy to get an allocation. SL3 mon nodes need to be upgraded. SL4 HPC is at 94 job slots. Is using VMware Server 2 for new CEs but sees slow performance. Intermittent BDII performance, probably down to timeouts from the CEs.
JET: Has been quiet but stable. Fusion/hone and biomed work only.

Minutes 2
RALPP: Uses Xen for VMs with the GUI manager. Black-holing WNs seem to be on a rack delivered by Streamline. Disk goes read-only but no SMART errors. Taken rack offline for disk read/write testing. Will keep better logging of which nodes fail to help spot trends. Old A/C needed to be upgraded; planned temporary A/C was not up to the job. There had not been a full risk assessment and it started on Friday pm! One group of systems is having a memory upgrade (£50k+ end of year money!!) to 32GB, i.e. 4GB per core.
CAM: 40TB disk added, SL5 WN, SL3 CE to go.
Oxford: Mostly stable. A/C issues at Xmas, now monitoring closely and looking at hot spots. (Later turned out to be RACU switched off.) T2se15 and DPM pool node failed today, causing loss of some ATLAS data.
Talked about Pakiti 2: the client sends a list of its rpms to the server, and the server compares that to what is in the repos for that type of node. All felt this was a retrograde step compared to Pakiti 1, which did the equivalent of yum check-update on the node itself and so follows all local rules, such as priorities and the enabled-or-not config options for each repo.
Workers and UIs can be configured to point to a list of BDIIs, which helps when one is overloaded. This did cause a failure with a Nagios test but that has been fixed now.
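The objection to the Pakiti 2 model can be illustrated with a small sketch. This is not Pakiti's code: the package names and versions are hypothetical, plain string inequality stands in for real RPM version comparison, and apply_local_rules is a hypothetical stand-in for what a node-side check would do with local excludes or disabled repos.

```python
def flag_outdated(installed, repo_latest):
    """Server-side view: packages whose installed version differs from the
    version the server believes is current for this node type."""
    return {
        name: (version, repo_latest[name])
        for name, version in installed.items()
        if name in repo_latest and version != repo_latest[name]
    }


def apply_local_rules(flagged, excluded):
    """Node-side view: drop packages the node deliberately holds back."""
    return {name: v for name, v in flagged.items() if name not in excluded}


# Hypothetical package list sent by a client, and the server's repo view.
installed = {"kernel": "2.6.18-164", "openssl": "0.9.8e-12"}
repo_latest = {"kernel": "2.6.18-194", "openssl": "0.9.8e-12"}

flagged = flag_outdated(installed, repo_latest)
print(flagged)                                          # server flags kernel
print(apply_local_rules(flagged, excluded={"kernel"}))  # node-side check would not
```

The server only sees the package list, so a package held back on purpose shows up as outdated, which is the difference the meeting objected to.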

Monitoring tools etc
