A GSI-secured job manager for connecting PBS servers in independent administrative domains
John Walsh, Brian Coghlan, Stephen Childs, Eamonn Kenny (Trinity College Dublin/EGEE)
EGEE 2nd User Forum, Manchester, May 2007

Introduction
• RemotePBS
  • Based on/extends the lcgpbs job manager on the LCG-CE
  • Implements secure execution of grid jobs on remote batch systems (RBS)
  • Separate administrative domains
  • Single gatekeeper, multiple RBS model
• RBS head/submit node
  • Installed with gLite WN (+ YAIM/Quattor)
  • Lightweight
  • Additional “mini” information provider (IP)
• Remote access uses grid credentials
• A work in progress, but used at three production EGEE sites
  • Restricted VO/users
EGEE 2nd User Forum, Manchester, May 11th 2007

Current GK problems
lcgpbs
• Allows “remote” execution using routing queues
• Requires /etc/hosts.equiv authentication
  • Known PBS issue
  • Remote batch submit node → gatekeeper
• Weak security model
gLite-CE
• Separate CE/RBS possible
• RBS requires /etc/hosts.equiv
• Same administrative domain

Mini IP
# gridgate.ucd.ie:2119/jobmanager-remotepbs-rowan, mpUCDie, local, grid
dn: GlueCEUniqueID=gridgate.ucd.ie:2119/jobmanager-remotepbs-rowan,mds-vo-name=mpUCDie,mds-vo-name=local,o=grid
GlueCEHostingCluster: gridgate.ucd.ie
GlueCEName: rowan
GlueCEUniqueID: gridgate.ucd.ie:2119/jobmanager-remotepbs-rowan
GlueCEInfoGatekeeperPort: 2119
GlueCEInfoHostName: gridgate.ucd.ie
GlueCEInfoLRMSType: remotepbs
GlueCEInfoLRMSVersion: 2.1.8
GlueCEInfoTotalCPUs: 194
GlueCEInfoJobManager: remotepbs
GlueCEInfoContactString: gridgate.ucd.ie:2119/jobmanager-remotepbs-rowan
GlueCEInfoApplicationDir: /home/

# cosmo, gridgate.ucd.ie:2119/jobmanager-remotepbs-rowan, mpUCDie, local, grid
dn: GlueVOViewLocalID=cosmo,GlueCEUniqueID=gridgate.ucd.ie:2119/jobmanager-remotepbs-rowan,mds-vo-name=mpUCDie,mds-vo-name=local,o=grid
GlueVOViewLocalID: cosmo
GlueCEAccessControlBaseRule: VO:cosmo

RemotePBS network architecture
(Diagram slide; figure not reproduced.)

Job execution flow
Remote PBS queue info published by site BDII
TLGS/RB ↔ GK interaction remains the same
However, no local queue required on GK
• Queue name used by remotepbs as lookup to config data
  • Remote submission node name
  • Remote gsisshd port
  • Real remote queue name on RBS
  • Additional PBS server directives (PPN etc.)
Job script/data constructed on GK
• Minor modifications
  • Symbolic links are now relative
• Copied to remote submission node via gsissh
• Job submitted via gsissh using qsub on remote submission node
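
The lookup-and-submit steps on this slide can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the RemotePBS implementation: the `REMOTE_QUEUES` table, its field names, the host `rbs.example.org`, port 2222, the queue name `grid`, and the paths are all hypothetical. The slide only says the job data is copied "via gsissh"; the sketch uses `gsiscp` for that step and assumes `gsissh`/`gsiscp` accept the usual OpenSSH `-p`/`-P` port options. The function only builds the command lines and does not run them.

```python
import shlex
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class RemoteQueue:
    """Config data looked up by queue name (hypothetical field names)."""
    submit_host: str      # remote submission node name
    gsissh_port: int      # remote gsisshd port
    real_queue: str       # real queue name on the RBS
    directives: List[str] = field(default_factory=list)  # extra PBS directives


# Hypothetical table: queue name seen by the GK -> remote settings.
REMOTE_QUEUES: Dict[str, RemoteQueue] = {
    "rowan": RemoteQueue("rbs.example.org", 2222, "grid", ["nodes=1:ppn=2"]),
}


def build_commands(queue_name: str, local_dir: str, remote_dir: str,
                   script: str) -> Tuple[List[str], List[str]]:
    """Return (copy_cmd, submit_cmd) as argv lists for one job."""
    cfg = REMOTE_QUEUES[queue_name]

    # Step 1: copy the job script/data directory to the submission node.
    copy_cmd = ["gsiscp", "-r", "-P", str(cfg.gsissh_port),
                local_dir, f"{cfg.submit_host}:{remote_dir}"]

    # Step 2: run qsub on the submission node against the real queue.
    qsub = ["qsub", "-q", cfg.real_queue]
    for directive in cfg.directives:
        qsub += ["-l", directive]
    qsub.append(f"{remote_dir}/{script}")
    submit_cmd = ["gsissh", "-p", str(cfg.gsissh_port),
                  cfg.submit_host, shlex.join(qsub)]
    return copy_cmd, submit_cmd
```

Keeping the mapping in one table means the gatekeeper needs no local PBS queue: the published queue name is only a key into the configuration.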

Job status
Job ID tracked by GK
Monitor process on GK looks up all jobs
• Iterates over all remote jobs
• Gets unique remote host/queue name pairs
• Runs qstat via gsissh on each unique host for the user's jobs
• Removes completed jobs
• Safely cleans up job data on RBS
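
One polling pass of such a monitor could look like the Python sketch below. It is a simplified illustration, not the actual job manager code: `TrackedJob`, `poll_once`, and the injected `qstat` callable are hypothetical names, and the real remote query (qstat over gsissh) is abstracted behind that callable. The sketch additionally assumes that a host which cannot be queried should keep its jobs rather than have them marked complete.

```python
from typing import Callable, Iterable, List, NamedTuple, Optional, Set, Tuple


class TrackedJob(NamedTuple):
    job_id: str   # remote PBS job ID tracked by the GK
    host: str     # remote submission node
    queue: str    # real queue name on the RBS


def poll_once(
    jobs: Iterable[TrackedJob],
    qstat: Callable[[str], Optional[Set[str]]],
) -> Tuple[List[TrackedJob], List[TrackedJob]]:
    """Return (still_tracked, completed) after one monitoring pass.

    qstat(host) returns the set of job IDs the host still knows about,
    or None if the host could not be queried.
    """
    jobs = list(jobs)

    # Unique host/queue pairs, then one query per unique host.
    pairs = {(job.host, job.queue) for job in jobs}
    hosts = sorted({host for host, _queue in pairs})
    known = {host: qstat(host) for host in hosts}

    still_tracked: List[TrackedJob] = []
    completed: List[TrackedJob] = []
    for job in jobs:
        active = known[job.host]
        if active is None or job.job_id in active:
            # Unreachable host or job still queued/running: keep tracking.
            still_tracked.append(job)
        else:
            completed.append(job)
    return still_tracked, completed
```

Jobs returned in `completed` are the ones whose data on the RBS would then be cleaned up.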

RBS setup
Gsisshd from VDT
• RBS needs host cert
• Config can limit connections to only those from GK
Shared home directory on RBS/W
Modules (optional)
• User Grid Context can be determined at login to RBS
• Grid environment set up
• “module load grid.ie” implicit with gsissh connection
• Allows static user to use local batch + grid access

Current issues
JM doesn’t yet implement Access Control on Users/VO
• Globus monitoring process connects for invalid user
Lifetime of LCG-CE
• Move to gLite-CE(?)
• Timeframe to implement equivalent + improvements
• gLite-CE BLAHP could simplify matters
Independent pool accounts not yet possible
• Username and $HOME must be same on GK and RBS
• Use static accounts
• Need to implement pool on CE + pool or static on RBS
gsissh needs quick timeout
• RBS responsive?
APEL accounting records
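
The "quick timeout" concern can be illustrated with a small Python wrapper that gives up on a remote command after a fixed time, so an unresponsive RBS does not stall the gatekeeper. This is only a sketch of one possible approach: `run_with_timeout` is a hypothetical helper, and the command passed to it stands in for whatever gsissh invocation the job manager would use.

```python
import subprocess
import sys
from typing import List, Optional


def run_with_timeout(cmd: List[str], timeout_s: float) -> Optional[str]:
    """Run cmd and return its stdout, or None on timeout or failure."""
    try:
        result = subprocess.run(
            cmd,
            capture_output=True,
            text=True,
            timeout=timeout_s,   # child is killed if it exceeds this
            check=True,          # non-zero exit raises CalledProcessError
        )
    except (subprocess.TimeoutExpired,
            subprocess.CalledProcessError,
            OSError):
        return None
    return result.stdout


if __name__ == "__main__":
    # Stand-in command; a real caller would pass a gsissh argv list.
    print(run_with_timeout([sys.executable, "-c", "print('ok')"], 5.0))
```

Returning `None` for both timeouts and errors lets the caller treat the RBS as "not responsive right now" and retry later instead of concluding anything about job state.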

Summary
• RemotePBS
  • Implements secure execution of grid jobs on RBS
  • Separate administrative domains
  • Single gatekeeper, multiple RBS model
  • Accommodates Compute Centres with headnode-only model
  • A work in progress, but used at three production EGEE sites
• Acknowledgements
  • David Golden (UCD & DIAS)
  • Maarten Litmaath & David Smith (CERN)
  • Alastair McKinstry (ICHEC)
  • Stephane Dudzinski (DIAS & TCD)
  • CosmoGrid project consortium