UKI-SouthGrid Report and Final Preparation Steps Kashif Mohammad, Deputy SouthGrid Technical Coordinator GridPP 23 - Cambridge, 9th September 2009
SouthGrid Tier 2 • The UK is split into four geographically distributed Tier 2 centres • SouthGrid comprises all the southern sites not in London • New sites are likely to join
UK Tier 2 reported CPU – Historical View to present
SouthGrid Sites Accounting as reported by APEL
New Total Q2 2009 – SouthGrid
Site Setup Summary
SL5 Migration and Benchmarking • RAL-PPD has already moved its whole cluster to SL5 • Oxford has moved a small part of its cluster to SL5 and plans to move the rest of the cluster before the end of September • Bristol has a small dedicated cluster and a shared HPC cluster; ready to move, but some problems due to the shared resources and the GPFS file system • Birmingham also has a dedicated cluster and a shared HPC; plans to move in October • Cambridge is planning to move in October; Condor support is an issue • Benchmarking: all sites have benchmarked their systems with HEP-SPEC06, but are not yet publishing the figures in the BDII (see the conversion sketch below)
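To make the last point concrete, below is a minimal sketch of how a HEP-SPEC06 result could be turned into the legacy SI00 figure published in the BDII. It assumes the WLCG-agreed scaling of 250 SpecInt2000 per HEP-SPEC06 and the GLUE 1.3 attribute names; the per-core score and core count are made-up illustrative numbers, not measurements from any SouthGrid site.

```python
# Minimal sketch: turning a HEP-SPEC06 benchmark result into the legacy SI00
# value for the BDII. Assumes the WLCG-agreed scaling of 1 HEP-SPEC06 = 250
# SpecInt2000; the per-core score and core count are made up for illustration.

HS06_TO_SI2K = 250            # assumed scaling factor, SpecInt2000 per HEP-SPEC06
hs06_per_core = 8.4           # hypothetical per-core HEP-SPEC06 result
cores_per_node = 8            # hypothetical worker node

si00 = int(round(hs06_per_core * HS06_TO_SI2K))
print("GlueHostBenchmarkSI00: %d" % si00)
print("GlueHostProcessorOtherDescription: Cores=%d,Benchmark=%.2f-HEP-SPEC06"
      % (cores_per_node, hs06_per_core))
```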
New Staff • May 2009: Chris Curtis, SouthGrid hardware support, based at Birmingham • June 2009: Bob Cregan, HPC support at Bristol
GRIDPPNAGIOS • https://gridppnagios.physics.ox.ac.uk/nagios • http://www.gridpp.ac.uk/wiki/UKI_Regional_Nagios • Many new features are available • Uses a messaging bus through a message broker (illustrated below) • Most of the SAM-equivalent tests are available • Still at the development stage
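The messaging bus mentioned above boils down to probes publishing their results to a topic on a message broker. The sketch below illustrates that pattern using raw STOMP 1.0 frames over a plain socket; the broker host, port, credentials, topic name and message fields are all placeholder assumptions, not the actual GridPP/SAM monitoring configuration.

```python
import socket

# Sketch only: broker endpoint, credentials, topic and payload fields below are
# illustrative placeholders, not the real GridPP/SAM monitoring configuration.
BROKER = ("broker.example.org", 61613)               # hypothetical STOMP endpoint
DESTINATION = "/topic/grid.probe.metricOutput"       # hypothetical result topic


def stomp_frame(command, headers, body=""):
    """Build a raw STOMP 1.0 frame: command line, headers, blank line, body, NUL."""
    head = "".join("%s:%s\n" % (k, v) for k, v in headers.items())
    return ("%s\n%s\n%s\x00" % (command, head, body)).encode("utf-8")


def publish(result_text):
    """Connect to the broker, publish one test result, and disconnect."""
    sock = socket.create_connection(BROKER)
    try:
        sock.sendall(stomp_frame("CONNECT", {"login": "guest", "passcode": "guest"}))
        sock.recv(4096)  # expect a CONNECTED frame back from the broker
        sock.sendall(stomp_frame("SEND", {"destination": DESTINATION}, result_text))
        sock.sendall(stomp_frame("DISCONNECT", {}))
    finally:
        sock.close()


if __name__ == "__main__":
    # Made-up, SAM-flavoured payload for a single CE test result.
    publish("serviceURI: ce.example.ac.uk\n"
            "metricName: org.sam.CE-JobSubmit\n"
            "metricStatus: OK\n")
```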
Prepare To Run
LHC VO Usage in the last 9 months
CMS and LHCb • CMS jobs are running very efficiently • Bristol and RALPP are the two Tier 2 CMS sites in SouthGrid • At Bristol there is a problem with GPFS/StoRM: CMS jobs using the file protocol change the ACLs/permissions; the temporary solution is to run a cron job every half hour (sketched below), which is not very efficient • Oxford is running CMS jobs using the PhEDEx server at RAL-PPD • LHCb jobs are also running very efficiently • Sometimes sites were banned and the site admins had no idea that they were banned; there should be a mechanism to notify a site before banning it • Otherwise no major problems • But do we need a stress test for LHCb and CMS? • Once data taking commences, will the load at sites increase significantly?
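As a rough illustration of the Bristol workaround, the sketch below shows the kind of half-hourly clean-up job that could re-open group access on files whose permissions have been tightened by CMS jobs. The GPFS path, the permission policy and the crontab entry are assumptions for illustration, not the site's actual configuration.

```python
# Sketch of the half-hourly clean-up job described above: restore group read
# access under the CMS storage area after jobs using the file protocol have
# tightened the permissions. Path and policy are assumptions, not Bristol's
# actual configuration.
#
# Hypothetical crontab entry:
#   */30 * * * * root /usr/local/sbin/fix_cms_perms.py

import os
import stat

CMS_AREA = "/gpfs/storm/cms"   # hypothetical GPFS area exported through StoRM


def fix_permissions(root):
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            mode = os.stat(path).st_mode
            wanted = mode | stat.S_IRGRP               # group read everywhere
            if stat.S_ISDIR(mode):
                wanted |= stat.S_IXGRP                 # group execute on directories
            if wanted != mode:
                os.chmod(path, wanted)


if __name__ == "__main__":
    fix_permissions(CMS_AREA)
```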
Step09 and HammerCloud Tests • Four SouthGrid sites participated in Step09 • Very useful in finding bottlenecks and configuration problems • http://gangarobot.cern.ch/st/step09summary.html
Bottlenecks and Solutions • The first series of HC tests used RFIO access • At RAL-PPD, the network connection between the two machine rooms housing the WNs and the storage was found to be the problem; it is currently 2 x 1 Gbps, with a plan to upgrade to 10 Gbps in the very near future • At Oxford we faced a similar problem, with the network link to the storage pool node becoming saturated; it is currently a 1 Gbps connection, and we wish to upgrade to 10 Gbps (see the estimate below)
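A back-of-the-envelope estimate of why these links saturate: with an assumed sustained read rate per analysis job (the 5 MB/s figure below is an illustrative assumption, not a Step09 measurement), a 2 x 1 Gbps bonded link tops out at around fifty concurrent jobs, while 10 Gbps raises that ceiling by a factor of five.

```python
# Back-of-the-envelope check of the storage-link bottleneck. The per-job read
# rate is an illustrative assumption, not a measured Step09 number.

per_job_MBps = 5.0            # assumed sustained read rate per analysis job


def jobs_supported(link_gbps, per_job_MBps):
    capacity_MBps = link_gbps * 1000.0 / 8.0   # Gbit/s -> MByte/s (decimal units)
    return int(capacity_MBps / per_job_MBps)


print("2 x 1 Gbps bonded link: ~%d concurrent jobs" % jobs_supported(2.0, per_job_MBps))
print("10 Gbps link:           ~%d concurrent jobs" % jobs_supported(10.0, per_job_MBps))
```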
Bottlenecks and Solutions • The second series of HC tests used file staging • At Oxford we periodically increased the number of job slots available to ATLAS pilot jobs • We then faced disk contention as the number of jobs on a single WN increased
Conclusion (options listed roughly from low cost to high cost) • Control the number of jobs per WN through MAUI: currently not available in MAUI in a clean way • RFIO read-ahead buffer: experimenting with different read-ahead buffer sizes, so far inconclusive • Channel bonding: helped, but not much • SSD for the DPM head node database: one has been ordered for Oxford and will be tested • 10 Gbps network connection between WNs and disk pools: would certainly help
Conclusion (options listed roughly from low cost to high cost) • File systems: require manpower and expertise • Lustre: seems to have improved performance at QMUL • GPFS: bad experience at Bristol • Xrootd: no idea • SSDs in worker nodes: too expensive, maybe next time
Thank You