NorthGrid Alessandra Forti GridPP 16 27 June 2006
Outline • Current problems • Sheffield support • Good news • Conclusions
Current problems • Lancaster and Manchester have been heavily affected by the infamous 4444 problem. • Caused by heavy load on the CE. • Torque server hangs -> Maui server hangs -> Information System publishes wrong responses (a quick check for this is sketched below). • Affects job submission from users. • Different solutions have been more or less effective; some had drawbacks, e.g. caching of Torque queries had to be removed. • Manchester seems OK since the installation of nscd (S. Traylen's suggestion). • Sheffield is having support post problems (see next slide). • Manchester dCache unstable since the upgrade to 2_7_0. • Lately it has gone from unstable to broken; not understood why yet.
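The 4444 problem takes its name from the placeholder value 444444 that the information provider publishes for GlueCEStateWaitingJobs when it cannot reach the batch system. As a purely illustrative sketch (not part of the slides), the snippet below queries a site's resource BDII with python-ldap and flags CEs publishing that placeholder; the hostname and base DN are assumptions that would need to match the actual site.

```python
# Minimal sketch of a 4444 check (not from the slides): query the site's
# resource BDII and flag CEs publishing the 444444 placeholder that the
# information provider emits when Torque/Maui stop responding.
# The hostname and base DN below are assumptions; adjust for the real site.
import ldap

BDII_URI = "ldap://ce.example.ac.uk:2170"   # hypothetical CE / resource BDII
BASE_DN = "mds-vo-name=resource,o=grid"     # typical GIP base DN, may differ

def first(attrs, key, default="?"):
    """Return the first value of an LDAP attribute as a string."""
    value = attrs.get(key, [default])[0]
    return value.decode() if isinstance(value, bytes) else value

conn = ldap.initialize(BDII_URI)
conn.simple_bind_s()                         # anonymous bind, as for normal BDII queries
results = conn.search_s(BASE_DN, ldap.SCOPE_SUBTREE,
                        "(GlueCEStateWaitingJobs=*)",
                        ["GlueCEUniqueID", "GlueCEStateWaitingJobs"])

for dn, attrs in results:
    ce_id = first(attrs, "GlueCEUniqueID")
    waiting = first(attrs, "GlueCEStateWaitingJobs")
    if waiting == "444444":
        print("STALE INFO: %s is publishing the 444444 placeholder" % ce_id)
    else:
        print("%s: %s waiting jobs" % (ce_id, waiting))
```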
Current problems • Manchester has put online only a third of its nodes • partly due to computing room rearrangement • but mostly because of the 4444 problem and dCache instability • if these are caused by increasing load, it does no good to increase the load further • Liverpool has had problems with an unscheduled downtime • it was still receiving jobs • the problem was solved by adding the downtime to the Freedom of Choice tool • is that tool usable by normal users? • Sheffield: adding a few VOs has affected its Classic SE
Sheffield support • The Sheffield support post is probably going away. • Meeting held with the Computing Centre people with whom the post is shared. • It may be replaced by someone else from the university Computing Centre. • A person located at the Computing Centre is not involved enough with the PP community. • PP people will hold a weekly meeting to follow the situation with the new person (or the old one if he stays). • Explained LCG requirements • software upgrades vs. service uptime.
Concerns • Lancaster: • Storage: required ratio is 1 TB for every 2 kSpecInt. Even if we "dCache-up" all our spare WN disk we will have about half of this, and that's giving it all to ATLAS! Even if we get the funding for the extra disk, it'll be hell finding somewhere to put it. • Network: Gb/s links between the WNs and the SE are going to be challenging to get, particularly with the NAT. • Sheffield: • Importing data for local ATLAS users with both lcg-utils and the DQ2 tools has a ~50% failure rate (a retry wrapper is sketched below). • Manchester: • SFT partial failures and site suspension: Manchester risked being suspended due to RM test failures, despite the fact that the cluster is constantly loaded with running jobs.
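Given the roughly 50% per-transfer failure rate reported for Sheffield above, a blunt but common workaround is simply to retry each copy. The sketch below is only an illustration, not something from the slides: it wraps lcg-cp (from lcg-utils) in a retry loop, and the SRM URL, local path and retry parameters are hypothetical placeholders.

```python
# Illustrative retry wrapper around lcg-cp (lcg-utils). The SRM endpoint,
# file path, VO and retry parameters are hypothetical placeholders.
import subprocess
import time

def copy_with_retries(src, dst, vo="atlas", attempts=4, pause=30):
    """Run lcg-cp up to `attempts` times, sleeping between failures."""
    for i in range(1, attempts + 1):
        ret = subprocess.call(["lcg-cp", "--vo", vo, src, dst])
        if ret == 0:
            print("transfer succeeded on attempt %d" % i)
            return True
        print("attempt %d failed (exit code %d), retrying..." % (i, ret))
        time.sleep(pause)
    return False

if __name__ == "__main__":
    ok = copy_with_retries(
        "srm://se.example.ac.uk/dpm/example.ac.uk/home/atlas/some/file.root",
        "file:///scratch/atlas/file.root")
    if not ok:
        raise SystemExit("all attempts failed")
```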
Good news • All the sites participated in the dteam SC4 • it helped to understand bottlenecks • ATLAS SC4 • Lancaster will participate • networking work under way to put the UKLight/SRM switch on the same subnet as the cluster • Manchester was volunteered before the dCache problems manifested themselves • it hasn't been contacted by ATLAS yet anyway • 3 sites have already upgraded to gLite 3.0 • Lancaster and Liverpool SFTs are almost completely green carpets • Liverpool is working on networking and firewall bottlenecks • Manchester now has a 1 Gb/s dedicated link directly to NNW (a simple link check is sketched below) • it skips the campus network completely and should be upgraded to 10 Gb/s
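As a simple illustration of how such a dedicated link can be sanity-checked (not something the slides describe), the snippet below drives an iperf client from Python; the target hostname is a hypothetical placeholder and an iperf server ("iperf -s") must already be running on the far end of the link.

```python
# Illustrative throughput check over the dedicated link: run an iperf client
# against a host beyond the campus network and print the measured rate.
# The target hostname is a hypothetical placeholder; "iperf -s" must be
# running on that host.
import subprocess

TARGET = "gw.nnw.example.net"   # hypothetical host on the far side of the link
DURATION = "30"                 # seconds

proc = subprocess.run(["iperf", "-c", TARGET, "-t", DURATION, "-f", "m"],
                      capture_output=True, text=True)
print(proc.stdout)
if proc.returncode != 0:
    print("iperf failed:", proc.stderr)
```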
Conclusions • Despite a number of problems, NorthGrid is delivering resources quite successfully • Biggest issues: • SE stability • Support post stability
VOMS deployment Alessandra Forti Sergey Dolgobrodov
gLite 1.5 VOMS production • 1 production machine, 1 backup and 1 public testing machine • 9 VOs supported, local and regional: • manmace, ralpp, ltwo, gridpp, t2k, minos, cedar, gridcc, mice • 28 users • However, not much load comes from the users • mostly from services building gridmap files • A few bugs make support difficult: • users can't have more than one role in a VO • users in the VO admin role cannot be cleanly deleted • the Admin interface hangs easily after simple VO management, requiring reinstallation from scratch • tools to mirror the database content don't work properly, making it difficult to maintain a backup (a crude fallback is sketched below) • developers respond, but slowly (and mostly don't bother to acknowledge) • Some problems with the same VO being supported across the Atlantic
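Since the database mirroring tools were unreliable, one crude fallback for keeping a backup is to dump the VOMS MySQL databases on a schedule. The sketch below is purely illustrative and not from the slides; the voms_<vo> database naming, the credentials file and the backup directory are assumptions that would have to match the real installation.

```python
# Illustrative backup fallback: dump each VO's VOMS MySQL database with
# mysqldump. The "voms_<vo>" naming, the credentials file and the output
# directory are assumptions; adjust them before use.
import datetime
import subprocess

VOS = ["manmace", "ralpp", "ltwo", "gridpp", "t2k",
       "minos", "cedar", "gridcc", "mice"]
CREDENTIALS = "/root/.voms-backup.cnf"        # hypothetical [client] user/password file
BACKUP_DIR = "/var/backups/voms"              # hypothetical backup location

stamp = datetime.date.today().isoformat()
for vo in VOS:
    dbname = "voms_%s" % vo                   # assumed naming convention
    outfile = "%s/%s-%s.sql" % (BACKUP_DIR, dbname, stamp)
    with open(outfile, "w") as out:
        ret = subprocess.call(
            ["mysqldump", "--defaults-extra-file=%s" % CREDENTIALS, dbname],
            stdout=out)
    print("%s -> %s (exit %d)" % (dbname, outfile, ret))
```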
gLite 3.0 VOMS tests • 1 machine, not publicly accessible, is dedicated to trash-and-test work • currently used for gLite 3.0 evaluation • production configuration (9 VOs, 28 users) • Testing has shown: • the incomplete deletion of VO Admin users has been corrected • a user can now have more than one role in a VO (see the sketch below) • the administration service has improved significantly: one can manipulate separate VOs and accounts without pain, i.e. without the risk of hanging the whole Admin interface.
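One way to exercise the "more than one role" fix from the client side is to request two Role FQANs in a single proxy and inspect what actually gets embedded. This is only an illustration (not from the slides); the VO name and role names are hypothetical, and it assumes the standard voms-proxy-init and voms-proxy-info clients are installed and configured for that VO.

```python
# Illustrative client-side check that one proxy can carry two roles.
# VO and role names are hypothetical; assumes the standard voms-proxy-init
# and voms-proxy-info command-line clients are installed and configured.
import subprocess

VO = "gridpp"                     # hypothetical VO
ROLES = ["/gridpp/Role=admin", "/gridpp/Role=production"]

# Request both roles in one proxy: one -voms option per FQAN.
cmd = ["voms-proxy-init"]
for fqan in ROLES:
    cmd += ["-voms", "%s:%s" % (VO, fqan)]
if subprocess.call(cmd) != 0:
    raise SystemExit("proxy creation failed")

# List the FQANs actually embedded in the proxy.
subprocess.call(["voms-proxy-info", "-fqan"])
```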
VOMS • The configuration process was also improved and simplified. • There is a single configuration file rather than several in this version. • Many parameters are defined automatically by the system. • Looks like YAIM… ;-) • Stability and performance under load haven't been tested yet. • A test with fake requests has been planned. • New bugs: • wrong permissions on /etc/cron.d files meant the CRL files were not updated and some proxies were refused (a quick check is sketched below) • as with the previous version, the log entries are not helpful. • Waiting for the summer to upgrade the production system. • Possibly already in a position to upgrade the public test machine.
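The CRL-update bug above is easy to spot with a small check: crond ignores files under /etc/cron.d with overly permissive permissions, and stale CRLs show up as old *.r0 files under /etc/grid-security/certificates. The sketch below is illustrative only; the paths are the conventional ones on an LCG node and the two-day staleness threshold is an arbitrary choice.

```python
# Illustrative check for the CRL-update bug: flag /etc/cron.d entries with
# suspicious permissions and any CRL (*.r0) files that look stale. Paths are
# the conventional LCG ones; the two-day threshold is an arbitrary choice.
import glob
import os
import stat
import time

# 1. Cron fragments: crond ignores files that are group/world-writable.
for path in glob.glob("/etc/cron.d/*"):
    mode = os.stat(path).st_mode
    if mode & (stat.S_IWGRP | stat.S_IWOTH):
        print("suspicious permissions on %s: %o" % (path, stat.S_IMODE(mode)))

# 2. CRLs: fetch-crl should refresh the *.r0 files regularly.
now = time.time()
for crl in glob.glob("/etc/grid-security/certificates/*.r0"):
    age_days = (now - os.stat(crl).st_mtime) / 86400.0
    if age_days > 2:
        print("stale CRL: %s is %.1f days old" % (crl, age_days))
```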