NorthGrid Alessandra Forti GridPP 16 27 June 2006
Outline • Current problems • Sheffield support • Good news • Conclusions
Current problems • Lancaster and Manchester have been heavily affected by the infamous 4444 problem. • Caused by heavy load on the CE. • Torque server hangs -> Maui server hangs -> Information System publishes wrong responses (a quick check for this is sketched below). • Affects job submission from users. • Different solutions have been more or less effective; some had drawbacks, e.g. caching of Torque queries had to be removed. • Manchester seems OK since the installation of nscd (S. Traylen's suggestion). • Sheffield is having support post problems (see next slide). • Manchester dCache unstable since the upgrade to 2_7_0. • Lately it has gone from unstable to broken; not understood why yet.
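The 4444 problem takes its name from the placeholder value 444444 that the information provider publishes for GlueCEStateWaitingJobs when it cannot reach the batch system. As a purely illustrative sketch (not part of the slides), the snippet below queries a site's resource BDII with python-ldap and flags CEs publishing that placeholder; the hostname and base DN are assumptions that would need to match the actual site.

```python
# Minimal sketch of a 4444 check (not from the slides): query the site's
# resource BDII and flag CEs publishing the 444444 placeholder that the
# information provider emits when Torque/Maui stop responding.
# The hostname and base DN below are assumptions; adjust for the real site.
import ldap

BDII_URI = "ldap://ce.example.ac.uk:2170"   # hypothetical CE / resource BDII
BASE_DN = "mds-vo-name=resource,o=grid"     # typical GIP base DN, may differ

def first(attrs, key, default="?"):
    """Return the first value of an LDAP attribute as a string."""
    value = attrs.get(key, [default])[0]
    return value.decode() if isinstance(value, bytes) else value

conn = ldap.initialize(BDII_URI)
conn.simple_bind_s()                         # anonymous bind, as for normal BDII queries
results = conn.search_s(BASE_DN, ldap.SCOPE_SUBTREE,
                        "(GlueCEStateWaitingJobs=*)",
                        ["GlueCEUniqueID", "GlueCEStateWaitingJobs"])

for dn, attrs in results:
    ce_id = first(attrs, "GlueCEUniqueID")
    waiting = first(attrs, "GlueCEStateWaitingJobs")
    if waiting == "444444":
        print("STALE INFO: %s is publishing the 444444 placeholder" % ce_id)
    else:
        print("%s: %s waiting jobs" % (ce_id, waiting))
```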
Current problems • Manchester has put online only a third of its nodes • partly due to computing room rearrangement • but mostly because of the 4444 problem and dCache instability • if these are caused by increasing load, it does no good to increase the load further • Liverpool has had problems with an unscheduled downtime • it was still receiving jobs • the problem was solved by adding the downtime to the Freedom of Choice tool • is that tool usable by normal users? • Sheffield: adding a few VOs has affected its Classic SE
Sheffield support • The Sheffield support post is probably going away. • Meeting held with the Computing Centre people with whom the post is shared. • It may be replaced by someone else from the university Computing Centre. • A person located at the Computing Centre is not involved enough with the PP community. • PP people will hold a weekly meeting to follow the situation with the new person (or the old one if he stays). • Explained LCG requirements • software upgrades vs. service uptime.
Concerns • Lancaster: • Storage: required ratio is 1 TB for every 2 kSpecInt. Even if we "dCache-up" all our spare WN disk we will have about half of this, and that's giving it all to ATLAS! Even if we get the funding for the extra disk, it'll be hell finding somewhere to put it. • Network: Gb/s links between the WNs and the SE are going to be challenging to get, particularly with the NAT. • Sheffield: • Importing data for local ATLAS users with both lcg-utils and the DQ2 tools has a ~50% failure rate (a retry wrapper is sketched below). • Manchester: • SFT partial failures and site suspension: Manchester risked being suspended due to RM test failures, despite the fact that the cluster is constantly loaded with running jobs.
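Given the roughly 50% per-transfer failure rate reported for Sheffield above, a blunt but common workaround is simply to retry each copy. The sketch below is only an illustration, not something from the slides: it wraps lcg-cp (from lcg-utils) in a retry loop, and the SRM URL, local path and retry parameters are hypothetical placeholders.

```python
# Illustrative retry wrapper around lcg-cp (lcg-utils). The SRM endpoint,
# file path, VO and retry parameters are hypothetical placeholders.
import subprocess
import time

def copy_with_retries(src, dst, vo="atlas", attempts=4, pause=30):
    """Run lcg-cp up to `attempts` times, sleeping between failures."""
    for i in range(1, attempts + 1):
        ret = subprocess.call(["lcg-cp", "--vo", vo, src, dst])
        if ret == 0:
            print("transfer succeeded on attempt %d" % i)
            return True
        print("attempt %d failed (exit code %d), retrying..." % (i, ret))
        time.sleep(pause)
    return False

if __name__ == "__main__":
    ok = copy_with_retries(
        "srm://se.example.ac.uk/dpm/example.ac.uk/home/atlas/some/file.root",
        "file:///scratch/atlas/file.root")
    if not ok:
        raise SystemExit("all attempts failed")
```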
Good news • All the sites participated in the dteam SC4 • it helped to understand bottlenecks • ATLAS SC4 • Lancaster will participate • networking work under way to put the UKLight/SRM switch on the same subnet as the cluster • Manchester was volunteered before the dCache problems manifested themselves • it hasn't been contacted by ATLAS yet anyway • 3 sites have already upgraded to gLite 3.0 • Lancaster and Liverpool SFTs are almost completely green carpets • Liverpool is working on networking and firewall bottlenecks • Manchester now has a 1 Gb/s dedicated link directly to NNW (a simple link check is sketched below) • it skips the campus network completely and should be upgraded to 10 Gb/s
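As a simple illustration of how such a dedicated link can be sanity-checked (not something the slides describe), the snippet below drives an iperf client from Python; the target hostname is a hypothetical placeholder and an iperf server ("iperf -s") must already be running on the far end of the link.

```python
# Illustrative throughput check over the dedicated link: run an iperf client
# against a host beyond the campus network and print the measured rate.
# The target hostname is a hypothetical placeholder; "iperf -s" must be
# running on that host.
import subprocess

TARGET = "gw.nnw.example.net"   # hypothetical host on the far side of the link
DURATION = "30"                 # seconds

proc = subprocess.run(["iperf", "-c", TARGET, "-t", DURATION, "-f", "m"],
                      capture_output=True, text=True)
print(proc.stdout)
if proc.returncode != 0:
    print("iperf failed:", proc.stderr)
```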
Conclusions • Despite a number of problems, NorthGrid is delivering resources quite successfully • Biggest issues: • SE stability • Support post stability
VOMS deployment Alessandra Forti Sergey Dolgobrodov
gLite 1.5 VOMS production • 1 production machine, 1 backup and 1 public testing machine • 9 VOs supported, local and regional: • manmace, ralpp, ltwo, gridpp, t2k, minos, cedar, gridcc, mice • 28 users • However, not much load comes from the users • mostly from services building gridmap files • A few bugs make support difficult: • users can't have more than one role in a VO • users in the VO admin role cannot be cleanly deleted • the Admin interface hangs easily after simple VO management, requiring reinstallation from scratch • tools to mirror the database content don't work properly, making it difficult to maintain a backup (a crude fallback is sketched below) • developers respond, but slowly (and mostly don't bother to acknowledge) • Some problems with the same VO being supported across the Atlantic
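Since the database mirroring tools were unreliable, one crude fallback for keeping a backup is to dump the VOMS MySQL databases on a schedule. The sketch below is purely illustrative and not from the slides; the voms_<vo> database naming, the credentials file and the backup directory are assumptions that would have to match the real installation.

```python
# Illustrative backup fallback: dump each VO's VOMS MySQL database with
# mysqldump. The "voms_<vo>" naming, the credentials file and the output
# directory are assumptions; adjust them before use.
import datetime
import subprocess

VOS = ["manmace", "ralpp", "ltwo", "gridpp", "t2k",
       "minos", "cedar", "gridcc", "mice"]
CREDENTIALS = "/root/.voms-backup.cnf"        # hypothetical [client] user/password file
BACKUP_DIR = "/var/backups/voms"              # hypothetical backup location

stamp = datetime.date.today().isoformat()
for vo in VOS:
    dbname = "voms_%s" % vo                   # assumed naming convention
    outfile = "%s/%s-%s.sql" % (BACKUP_DIR, dbname, stamp)
    with open(outfile, "w") as out:
        ret = subprocess.call(
            ["mysqldump", "--defaults-extra-file=%s" % CREDENTIALS, dbname],
            stdout=out)
    print("%s -> %s (exit %d)" % (dbname, outfile, ret))
```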
gLite 3.0 VOMS tests • 1 machine, not publicly accessible, is dedicated to trash-and-test work • currently used for gLite 3.0 evaluation • production configuration (9 VOs, 28 users) • Testing has shown: • the incomplete deletion of VO Admin users has been corrected • a user can now have more than one role in a VO (see the sketch below) • the administration service has improved significantly: one can manipulate separate VOs and accounts without pain, i.e. without the risk of hanging the whole Admin interface.
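One way to exercise the "more than one role" fix from the client side is to request two Role FQANs in a single proxy and inspect what actually gets embedded. This is only an illustration (not from the slides); the VO name and role names are hypothetical, and it assumes the standard voms-proxy-init and voms-proxy-info clients are installed and configured for that VO.

```python
# Illustrative client-side check that one proxy can carry two roles.
# VO and role names are hypothetical; assumes the standard voms-proxy-init
# and voms-proxy-info command-line clients are installed and configured.
import subprocess

VO = "gridpp"                     # hypothetical VO
ROLES = ["/gridpp/Role=admin", "/gridpp/Role=production"]

# Request both roles in one proxy: one -voms option per FQAN.
cmd = ["voms-proxy-init"]
for fqan in ROLES:
    cmd += ["-voms", "%s:%s" % (VO, fqan)]
if subprocess.call(cmd) != 0:
    raise SystemExit("proxy creation failed")

# List the FQANs actually embedded in the proxy.
subprocess.call(["voms-proxy-info", "-fqan"])
```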
VOMS • The configuration process was also improved and simplified. • There is a single configuration file rather than several in this version. • Many parameters are defined automatically by the system. • Looks like YAIM… ;-) • Stability and performance under load haven't been tested yet. • A test with fake requests has been planned. • New bugs: • wrong permissions on /etc/cron.d files meant the CRL files were not updated and some proxies were refused (a quick check is sketched below) • as with the previous version, the log entries are not helpful. • Waiting for the summer to upgrade the production system. • Possibly already in a position to upgrade the public test machine.
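The CRL-update bug above is easy to spot with a small check: crond ignores files under /etc/cron.d with overly permissive permissions, and stale CRLs show up as old *.r0 files under /etc/grid-security/certificates. The sketch below is illustrative only; the paths are the conventional ones on an LCG node and the two-day staleness threshold is an arbitrary choice.

```python
# Illustrative check for the CRL-update bug: flag /etc/cron.d entries with
# suspicious permissions and any CRL (*.r0) files that look stale. Paths are
# the conventional LCG ones; the two-day threshold is an arbitrary choice.
import glob
import os
import stat
import time

# 1. Cron fragments: crond ignores files that are group/world-writable.
for path in glob.glob("/etc/cron.d/*"):
    mode = os.stat(path).st_mode
    if mode & (stat.S_IWGRP | stat.S_IWOTH):
        print("suspicious permissions on %s: %o" % (path, stat.S_IMODE(mode)))

# 2. CRLs: fetch-crl should refresh the *.r0 files regularly.
now = time.time()
for crl in glob.glob("/etc/grid-security/certificates/*.r0"):
    age_days = (now - os.stat(crl).st_mtime) / 86400.0
    if age_days > 2:
        print("stale CRL: %s is %.1f days old" % (crl, age_days))
```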