RAL Tier1/A Status Report
GridPP 13 – 4th July 2005
Steve Traylen <s.traylen@rl.ac.uk>
Talk Overview
• Who we now are.
• Current utilisation.
• Grid interfaces to RAL.
• VO policies.
• Batch scheduling.
• Next six months.
Who We Are Now
• Tier1 Manager – Andrew Sansum.
• CPU Manager – Martin Bly.
• Disk Manager – Nick White.
• Grid Deployment – Steve Traylen and Derek Ross.
• Hardware – Georgios Prassas.
• Application Support – two posts…
Application Support Posts
• Matthew Hodges and Catalin Condurache.
• Both started one week ago.
• Running experiment-specific software and services outside the batch farm and storage:
  • LFC, FTS, FPS, PhEDEx, …
• Ensuring that the experiments' needs are met by the whole of the Tier1.
• Controlling the throughput of the farm.
Current Utilisation
• CPU undersubscribed.
• Grid use going up.
• CPU/walltime statistics will be produced.
Job Efficiency (2005)
• Jan – 1.010
• Feb – 0.890
• Mar – 0.850
• Apr – 0.530
• May – 0.560
• Jun – 0.740
• Jul – 0.920
• Efficiency = CPU time / walltime.
• Very early figures (as of today); the Tier1 will produce a better breakdown by VO.
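As a minimal sketch of how such a per-VO breakdown could be produced, the snippet below totals CPU time and walltime per group from PBS/Torque-style accounting records. The log path, record layout, and keying on the Unix group are assumptions, not the Tier1's actual tooling.

```python
# Hedged sketch: per-group CPU/walltime efficiency from PBS-style
# accounting logs. The path and field layout are assumptions.
from collections import defaultdict

def hms_to_seconds(hms):
    """Convert 'HH:MM:SS' into seconds."""
    h, m, s = (int(x) for x in hms.split(":"))
    return h * 3600 + m * 60 + s

cpu = defaultdict(int)   # total CPU seconds per group
wall = defaultdict(int)  # total wallclock seconds per group

with open("/var/spool/pbs/server_priv/accounting/20050601") as log:
    for line in log:
        # Records look like: date;type;jobid;key=value key=value ...
        fields = line.strip().split(";", 3)
        if len(fields) != 4 or fields[1] != "E":  # 'E' = job-end record
            continue
        attrs = dict(kv.split("=", 1) for kv in fields[3].split() if "=" in kv)
        group = attrs.get("group", "unknown")
        try:
            cpu[group] += hms_to_seconds(attrs["resources_used.cput"])
            wall[group] += hms_to_seconds(attrs["resources_used.walltime"])
        except KeyError:
            pass  # record carries no usage information
for group in sorted(cpu):
    if wall[group]:
        print("%-10s efficiency = %.3f" % (group, cpu[group] / float(wall[group])))
```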
Job Efficiency Grid/Non-Grid (2005)
• Very early figures (not verified).
• [Chart comparing grid and non-grid efficiency; values not preserved in this transcript.]
Job Efficiency By Group (2005)
• Early figures (not verified).
• [Per-group efficiency chart; values not preserved in this transcript.]
Grid Compute Resources
• There is now no hard divide between grid and non-grid jobs on the farm.
• Currently running LCG-2_5_0 for SC3.
• LCG releases take about one day to install on the batch farm.
Gatekeeper Changes
• JobManagers (see the sketch after this slide):
  • Fork JobManager.
  • LCG PBS JobManager.
  • A SAM JobManager is being introduced now.
• Queues with different memory limits are now published:
  • Ideas stolen from the ARDA testbed.
  • This has increased occupancy.
• There is currently only one gatekeeper for the whole farm:
  • A second one will be added for redundancy.
  • SAM is expected to add a lot more load to the gatekeeper.
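A gatekeeper exposes each job manager under its own resource contact string. As an illustration only (the host name below is a placeholder, not RAL's published contact), a client would target the fork and LCG PBS job managers like this:

```python
# Hypothetical sketch: targeting different job managers on one gatekeeper.
# The host name is a placeholder, not a real RAL endpoint.
import subprocess

GK = "gatekeeper.example.rl.ac.uk"

# Fork job manager: runs the command directly on the gatekeeper node.
subprocess.call(["globus-job-run", GK + "/jobmanager-fork", "/bin/hostname"])

# LCG PBS job manager: routes the job into the batch farm.
subprocess.call(["globus-job-run", GK + "/jobmanager-lcgpbs", "/bin/hostname"])
```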
Grid Storage Interfaces
• Storage – dCache, dCache, dCache…
• The Tier1 is very happy with dCache; no plans to do anything else at all at the moment (except xrootd).
• CMS and Atlas are by far the largest users:
  • Atlas use SRMget and SRMput.
  • CMS use SRMcp.
• We see completely different problems for the two groups – there are many ways to use SRM.
Current dCache Deployment
• [Deployment diagram; details not preserved in this transcript.]
Tier1 dCache Tape Integration
• Tape enabled for LHCb, Atlas, CMS and DTeam (as of last week).
• One dCache – two storage paths per VO, e.g. for LHCb:
  • /pnfs/gridpp.rl.ac.uk/data/lhcb
  • /pnfs/gridpp.rl.ac.uk/tape/lhcb
• Files in the tape directory are placed on tape.
• Glue/LCG dictates only one storage root per SRM endpoint, so an alias is published: dcache-tape.gridpp.rl.ac.uk.
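To make the two paths concrete, here is a hedged sketch of copying a file into the disk-only and tape-backed areas with the srmcp client. The endpoint host names and port are illustrative assumptions; only the /pnfs paths come from the slide.

```python
# Hypothetical sketch: writing to disk-only vs. tape-backed dCache paths.
# Endpoint hosts/port are assumptions; the /pnfs paths are from the slide.
import subprocess

LOCAL = "file:////tmp/output.dat"
DISK = "srm://dcache.gridpp.rl.ac.uk:8443/pnfs/gridpp.rl.ac.uk/data/lhcb/output.dat"
TAPE = "srm://dcache-tape.gridpp.rl.ac.uk:8443/pnfs/gridpp.rl.ac.uk/tape/lhcb/output.dat"

# Disk-only copy: the file stays on a disk pool.
subprocess.call(["srmcp", LOCAL, DISK])

# Tape-backed copy: dCache migrates the file to tape behind the scenes.
subprocess.call(["srmcp", LOCAL, TAPE])
```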
Tier1 dCache Access
• SRMv1, and then access is possible via:
  • GridFTP.
  • DCAP (anonymous, read-only, LAN only) – is this useful?
  • GSIDCAP (LAN only).
• An LD_PRELOAD library is available for POSIX access to the DCAP/GSIDCAP protocols.
• We are looking forward to some serious use of this.
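As an illustration of the LD_PRELOAD route, the fragment below reads a file through DCAP using ordinary POSIX calls. The library path, port, and file name are assumptions.

```python
# Hypothetical sketch: POSIX-style read over DCAP via the dCache preload
# library. Run as, for example:
#   LD_PRELOAD=/opt/d-cache/dcap/lib/libpdcap.so python read_dcap.py
# The library path, port, and file name are assumptions.

# With libpdcap preloaded, the ordinary open() call is intercepted and the
# dcap:// URL is read from the dCache pool over the LAN.
url = "dcap://dcache.gridpp.rl.ac.uk:22125/pnfs/gridpp.rl.ac.uk/data/lhcb/output.dat"
with open(url, "rb") as f:
    header = f.read(1024)
    print("read %d bytes" % len(header))
```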
Tier1 Migration to dCache
• All allocations for new experiments will be pushed in the direction of dCache.
• Existing NFS file systems at RAL can be migrated to dCache – please ask us.
• This will allow us to swap disk around faster.
• Currently a lot of disk is allocated and empty.
Other gLite/LCG Services
• JRA1 testbed infrastructure:
  • R-GMA, VOMS and gLite I/O <-> dCache.
• File Transfer Service (FTS):
  • Functionality tests to some Tier2s are now complete.
• LCG File Catalogue (LFC):
  • Recently installed; testing yet to begin.
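For reference, a newly installed LFC is typically exercised with the lfc-* client tools. A minimal smoke test might look like the following; the LFC host name is an assumption, not a confirmed RAL endpoint.

```python
# Hypothetical sketch: first smoke test of a fresh LFC instance.
# The LFC host name is an assumption; a valid grid proxy is required.
import os
import subprocess

os.environ["LFC_HOST"] = "lfc.gridpp.rl.ac.uk"

# Create a VO directory and list it back.
subprocess.call(["lfc-mkdir", "-p", "/grid/dteam/tests"])
subprocess.call(["lfc-ls", "-l", "/grid/dteam"])
```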
VOMS Services
• VOMS server:
  • phenoGrid nearly in production.
  • Nearly in production since before GridPP 12.
  • Slowed down by trying to use gLite software too early.
• SA1/LCG testing and subsequent fixes of VOMS are ongoing.
• Once phenoGrid is good, we will welcome more VOs.
  • There are no others in the queue – please ask.
Other Grid Interfaces
• MyProxy service:
  • Currently configured so that only RAL's Resource Broker and FTS can retrieve proxies.
  • Will be configured so users can specify authorised renewers, e.g. for use with PhEDEx or Grid Canada.
• GSI KLogD:
  • Obtain an rl.ac.uk AFS token from your GSI credentials. Please ask to be added.
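A minimal sketch of the intended MyProxy workflow, assuming standard myproxy-init options; the server name is an assumption and the renewer policy noted in the comment is enforced server-side, not by this command.

```python
# Hypothetical sketch: storing a credential in MyProxy so a named service
# can later renew it. The host and account name are placeholder assumptions.
import subprocess

subprocess.call([
    "myproxy-init",
    "-s", "myproxy.gridpp.rl.ac.uk",  # assumed server name
    "-l", "straylen",                  # account under which it is stored
    "-c", "168",                       # credential lifetime in hours
])
# Which DNs may renew the credential is controlled server-side, e.g. via an
# authorized_renewers entry in myproxy-server.config.
```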
RAL VO Policy
• VOs supported for computing:
  • Atlas, BaBar, CMS, D0, LHCb, Alice, Zeus, H1, CDF, Biomedical.
• VOs supported for storage:
  • Atlas, CMS, LHCb – now.
  • Zeus, Biomedical, phenoGrid – coming online shortly.
• We propose that all minor VOs receive 150 GB if they request it.
• Recent request to add support for ILC @ DESY.
• Any other VOs – please ask.
Batch Scheduling
• The scheduling method is now settled after a lot of tuning and experimenting.
• Using MAUI from www.supercluster.org.
• A couple of local patches are added to the standard release.
Maui Fair Share
• A queued job's priority is calculated from decayed historical usage, where:
  • %userUsed_i is the user's % of the farm used in the 24 hours i days ago.
  • %groupUsed_i is the group's % of the farm used in the 24 hours i days ago.
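The formula itself did not survive this transcript. Assuming MAUI's standard decayed fair-share over the nine 24-hour windows mentioned on the next slide (the decay factor d, weights w_u and w_g, and targets are illustrative, not confirmed Tier1 values), a plausible reconstruction is:

```latex
% Plausible reconstruction of MAUI's decayed fair-share usage, assuming a
% decay factor d per 24-hour window and nine windows of history.
\[
  \%\text{userUsed} = \frac{\sum_{i=0}^{8} d^{\,i}\,\%\text{userUsed}_i}
                           {\sum_{i=0}^{8} d^{\,i}},
  \qquad
  \%\text{groupUsed} = \frac{\sum_{i=0}^{8} d^{\,i}\,\%\text{groupUsed}_i}
                            {\sum_{i=0}^{8} d^{\,i}}
\]
\[
  \text{priority} \;\propto\;
    w_u\,\bigl(\text{target}_u - \%\text{userUsed}\bigr)
    \;+\;
    w_g\,\bigl(\text{target}_g - \%\text{groupUsed}\bigr)
\]
```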
Tier1 Fairshare
• What's good:
  • Light users have priority over heavy users.
  • All users have a % allocation of 30%.
  • Having nine days of history allows us to let groups be optimistic – we never reserve a CPU as empty for someone.
  • Zeus have run many times their allocation.
  • Good tools for users to understand current status (better than qstat): showq, showq -i, diagnose -f.
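A toy sketch of the same calculation in code, using an assumed decay factor of 0.8 together with the 30% target from this slide; neither value is the Tier1's tuned configuration.

```python
# Toy sketch of decayed fair-share priority over 9 days of history.
# The decay factor and weight are illustrative assumptions.
DECAY = 0.8
TARGET = 30.0  # % allocation per user, from the slide

def effective_usage(daily_used, decay=DECAY):
    """Decay-weighted average of %used per 24-hour window, newest first."""
    weights = [decay ** i for i in range(len(daily_used))]
    return sum(w * u for w, u in zip(weights, daily_used)) / sum(weights)

def fs_priority(user_daily, weight=100.0, target=TARGET):
    """Positive when the user is under target, negative when over."""
    return weight * (target - effective_usage(user_daily))

# A light user (5% per day) outranks a heavy one (60% per day).
print(fs_priority([5.0] * 9))   # large positive boost
print(fs_priority([60.0] * 9))  # negative: pushed down the queue
```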
Tier1 Fairshare
• What's bad:
  • History can cause huge swings, e.g. LHCb switch on their data challenge and no one gets a job start for two days – a problem for analysis.
  • 100% utilisation = total walltime used, not walltime available, so people are penalised when the farm moves from quiet to full. Solution: we need a full farm.
  • Currently no extra priority for grid jobs – H1 grid jobs are always waiting when the farm is full. We should add priority to grid work…?
Service Challenge 2
• RAL was successful – 73 Mbytes/s.
• Sustained continuous transfer for 12 days during March.
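For scale, that sustained rate corresponds to a total transferred volume of roughly:

```latex
% Back-of-envelope volume for the SC2 run.
\[
  73\ \text{MB/s} \times 86\,400\ \text{s/day} \times 12\ \text{days}
  \;\approx\; 7.6 \times 10^{7}\ \text{MB} \;\approx\; 76\ \text{TB}
\]
```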
Next Six Months
• SC3 is starting now – we are ready, with no more software to deploy.
• SC4 preparations are beginning.
• We have found the service challenges extremely useful.
• Everything not in the service challenges has suffered, though.
Next Six Months
• gLite:
  • Deployment of some gLite components into the main farm, e.g. a gLite CE.
  • Much of it is yet to be proven for production use, but the plans for gradual deployment sound good.
• VOMS groups and roles are being deployed (see the sketch after this slide).
  • These must be mapped into the local fabric, e.g. production manager rights.
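To illustrate what that mapping consumes, a user would request a role-bearing proxy along the following lines. The VO and role names are examples only, and how a site maps the attribute to a local account is deployment-specific.

```python
# Hypothetical sketch: requesting a VOMS proxy that carries a production
# role. The VO and role names are examples, not RAL configuration.
import subprocess

subprocess.call([
    "voms-proxy-init",
    "--voms", "lhcb:/lhcb/Role=production",
])
# Site-side, the gatekeeper must map the /lhcb/Role=production attribute
# to a privileged local account – the "local fabric" mapping on the slide.
```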
Challenges to Come
• There is a blur between production and testing:
  • SC2 was an infrastructure test.
  • SC3 is an infrastructure test with production data.
• Over the next few months we must improve the deployment quality of essential services such as FTS; e.g. there is currently little backup on the Oracle instance we are using.
Challenges to Come
• We must spread some knowledge around:
  • Some critical services are currently in the hands of just a single person.
• Address under-utilisation:
  • Introducing new users.
• Maintaining production:
  • Solidifying services.
  • Availability, reliability.
• Finalise the architecture and software stack for LHC.