Deployment issues and SC3
Jeremy Coles
GridPP Tier-2 Board and Deployment Board
Glasgow, 1st June 2005
Current deployment issues
Main GridPP concerns:
• gLite migration, fabric management & the future of YAIM
• dCache
• Data migration – classic SE to SRM SE
• Security
• Ganglia deployment
• Use of the ticketing system
• Use of the UK testzone
General:
• Jobs at sites – improving (n.b. Freedom of Choice is coming!)
• Few general EGEE VOs are supported at GridPP sites
2nd LCG Operations Workshop
• Took place in Bologna last week: http://infnforge.cnaf.infn.it/cdsagenda//fullAgenda.php?ida=a0517
• Covered the following areas:
  • Daily operations
  • Pre-production service
  • gLite deployment and migration
  • Future monitoring (metrics)
  • Interoperation with OSG
  • User support (Executive Support Committee!)
  • VO management processes
  • Fabric management
  • Accounting (DGAS and APEL)
• Little on security! Romain presented potential tools.
LCG-2_4_0 deployment plan
CPUs by middleware version:
• LCG-2_4_0: 10642
• LCG-2_3_1: 912
• LCG-2_3_0: 2167
[Chart: version change over the last 100 days – all sites in LCG-2; “Others” = sites on older versions or down]
[Per-region upgrade charts: Russia, Canada, Italy, Germany/Switzerland, France, SW, Northern, Asia Pacific, Central, SE, UKI. Regions with fewer than 5 sites are not shown.]
LCG-2_4_0 lessons learned:
• Harder than expected (rate independent of packaging)
• Differences between regions --> ROCs matter
• Release definition is non-trivial with 3-month intervals
• Component dependencies: X without Y and V is useless…
• During certification we still find problems
• Both upgrade and installation from scratch are needed (time consuming)
• Test pilots for deployment are useful
• Early announcement of releases is useful
• We need to introduce “updates” via APT to fix bugs that show up during deployment
• Number of sites is the wrong metric to measure success: CPUs on the new release need to be tracked, not sites
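To make the last point concrete, a minimal sketch of a CPU-weighted uptake metric; the site list below is invented for illustration, and only the aggregate CPU counts (10642 / 912 / 2167) come from the earlier slide.

```python
# Sketch: measure release uptake by CPUs rather than by site count.
# The site list below is invented; real input would come from the
# information system / GOC database.
sites = {
    "site-a": {"version": "LCG-2_4_0", "cpus": 1200},
    "site-b": {"version": "LCG-2_3_1", "cpus": 60},
    "site-c": {"version": "LCG-2_4_0", "cpus": 800},
    "site-d": {"version": "LCG-2_3_0", "cpus": 40},
}

latest = "LCG-2_4_0"
total_cpus = sum(s["cpus"] for s in sites.values())
latest_cpus = sum(s["cpus"] for s in sites.values() if s["version"] == latest)
site_share = sum(1 for s in sites.values() if s["version"] == latest) / len(sites)

print(f"sites on {latest}: {site_share:.0%}")
print(f"CPUs on {latest}:  {latest_cpus / total_cpus:.0%}")

# With the real numbers from the earlier slide (10642 + 912 + 2167 CPUs),
# about 78% of CPUs were already on 2_4_0.
```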
The next release
• Why?
  • SC3 is approaching and the needed components are not yet deployed at the sites
• What?
  • File transfer service (will need VDT 1.2.2) – servers for Tier-1 and Tier-0, clients for the rest
  • Improved monitoring sensors for GridFTP
  • RFC proxy extension for VOMS
  • New version of the GLUE schema (compatible)
  • LFC production service
  • Interoperability with Grid3/OSG
  • User-level stdio monitoring (maybe later)
  • Bug fixes… as always
• When?
  • Aimed at mid-June
• Who?
  • Tier-1 centres and Tier-2 centres participating in SC3 – as fast as possible
  • Others at their own pace
• An updated release (fixes from the 1st release) is expected by July 1st.
[Diagram: VOMS coexistence & extended pre-production – LCG and gLite stacks side by side, sharing VOMS, myProxy, FTS and the SRM SE. LCG components: LFC, RB, BD-II, APEL, R-GMA, LCG CE, UIs; gLite components: gLite WLM, FIREMAN, dgas, R-GMA, gLite-CE, gLite-IO. Notes: the two R-GMAs can be merged (security ON); the CEs use the same batch system; FTS for LCG uses the user proxy, gLite uses a service certificate; data from LCG is owned by VO and role, while the gLite-IO service owns gLite data.]
[Diagram: gradual transition, step 1 – gLite WLM added as an optional additional WLM and dgas as optional accounting, alongside the LCG data management (LFC, BD-II, APEL, R-GMA). The LCG CE and gLite-CE share the site batch system; FTS for LCG uses the user proxy, gLite a service certificate.]
[Diagram: gradual transition, step 2 – LCG WLM removed; FIREMAN available as an optional catalogue; R-GMA now runs in gLite mode.]
[Diagram: gradual transition, step 3 – gLite-IO added as a second path to the data with an additional security model; data migration phase. Data from LCG is owned by VO and role; the gLite-IO service owns gLite data.]
[Diagram: gradual transition, step 4 – finalise the switch to the new security model; LFC becomes a local catalogue under VO control; the BDII is later replaced by R-GMA.]
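Read together, the four diagrams describe a stepwise migration. The sketch below encodes that sequence as data; the groupings are my reading of the slides, not an official migration script.

```python
# Sketch: the gradual LCG -> gLite transition from the preceding diagrams,
# encoded as data. Groupings are my reading of the slides.
transition_steps = [
    {"step": 1,
     "changes": ["gLite WLM added as optional additional WLM",
                 "dgas added as optional accounting",
                 "gLite-CE added, sharing the site batch system with the LCG CE"]},
    {"step": 2,
     "changes": ["LCG WLM (RB) removed",
                 "FIREMAN available as optional catalogue",
                 "R-GMA switched to gLite mode"]},
    {"step": 3,
     "changes": ["gLite-IO added: second path to data, additional security model",
                 "data migration phase begins"]},
    {"step": 4,
     "changes": ["switch to the new security model finalised",
                 "LFC becomes a local catalogue under VO control",
                 "BDII later replaced by R-GMA"]},
]

for s in transition_steps:
    print(f"Transition {s['step']}:")
    for change in s["changes"]:
        print(f"  - {change}")
```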
Metrics - EGEE
• General agreement on the concept; detailed discussions on:
  • Time windows – sliding windows (week, month, 3 months)
  • Quantities to watch (RCs, ROCs, CICs…): ROCs measured on their RCs, CICs on their services
  • Release quality has to be measured
• To make progress: a workgroup to define the quantities
  • Organised by Ognjen Prnjat (oprnjat@admin.grnet.gr)
  • Small (~5): Ognjen, Markus, Helene, Jeff T. and Jeremy
  • Ognjen will collect input
• ROCs, CICs and the OMC have to agree on ONE set of quantities
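As an illustration of the sliding-window idea, a small sketch; the daily pass/fail history is invented, and real input would presumably come from SFT results.

```python
# Sketch: trailing-window availability for one resource centre, computed
# from daily test outcomes (1 = passed). The history below is invented.
from collections import deque
from statistics import mean

daily_ok = [1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]

def trailing_availability(history, window_days):
    """Availability over the trailing window, recomputed each day."""
    buf = deque(maxlen=window_days)
    series = []
    for ok in history:
        buf.append(ok)
        series.append(mean(buf))
    return series

for window, label in [(7, "week"), (30, "month"), (90, "3 months")]:
    series = trailing_availability(daily_ok, window)
    print(f"{label:>9} window: latest availability {series[-1]:.0%}")
```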
Operations summary
• CIC On Duty (COD) is now well established – and it is just 6 months old!
• Tools have evolved at a dramatic pace (portal, SFT, …) through many rapid iterations
• A truly distributed effort: integration of the new COD partner (Russia) went smoothly
• Tuning of procedures is an ongoing process – no dramatic changes (take resource size more into account)
Accounting
Last November this was still an area of concern.
• APEL is now well established
  • Support for batch systems is improving
  • Several privacy-related problems have been understood and solved
• gLite accounting: DGAS
  • Some concerns about the amount of information published – can this be handled by proper authorization?
  • Collaboration with APEL on batch sensors (BBQS, Condor, …); DGAS agreed to provide them
  • Will be introduced initially on a voluntary basis; sites will give feedback (including on privacy issues)
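For a rough picture of what the accounting chain aggregates, a sketch; the record fields are invented and do not follow the actual APEL or DGAS schema.

```python
# Sketch: roll per-job usage records up into per-VO totals, the kind of
# aggregation an accounting chain (APEL/DGAS) performs. Field names are
# invented, not the real APEL/DGAS schema.
from collections import defaultdict

job_records = [
    {"vo": "atlas", "cpu_s": 3600, "wall_s": 4000},
    {"vo": "lhcb",  "cpu_s": 1800, "wall_s": 2100},
    {"vo": "atlas", "cpu_s": 7200, "wall_s": 7500},
]

totals = defaultdict(lambda: {"jobs": 0, "cpu_s": 0, "wall_s": 0})
for rec in job_records:
    t = totals[rec["vo"]]
    t["jobs"] += 1
    t["cpu_s"] += rec["cpu_s"]
    t["wall_s"] += rec["wall_s"]

for vo, t in sorted(totals.items()):
    print(f"{vo}: {t['jobs']} jobs, {t['cpu_s']} s CPU, {t['wall_s']} s wall")
```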
Current deployment issues (recap)
Main GridPP concerns:
• gLite migration, fabric management & the future of YAIM
• dCache
• Data migration – classic SE to SRM SE
• Security
• Ganglia deployment
• Use of the ticketing system
• Use of the UK testzone
General:
• Jobs at sites – improving (n.b. Freedom of Choice is coming!)
• Few general EGEE VOs are supported at GridPP sites
Freedom of choice - VO Page
Service Challenge 3
SC timelines
[Timeline 2005–2008: SC2 → SC3 → SC4 → LHC Service Operation; first physics/cosmics 2007, first beams and full physics run 2008]
• Jun 05 – Technical Design Report
• Sep 05 – SC3 service phase
• May 06 – SC4 service phase
• Sep 06 – initial LHC service in stable operation
• Apr 07 – LHC service commissioned
• SC2 – reliable data transfer (disk-network-disk): 5 Tier-1s, aggregate 500 MB/s sustained at CERN
• SC3 – reliable base service: most Tier-1s, some Tier-2s; basic experiment software chain; grid data throughput 500 MB/s, including mass storage (~25% of the nominal final throughput for the proton period)
• SC4 – all Tier-1s, major Tier-2s; capable of supporting the full experiment software chain, including analysis; sustain the nominal final grid data throughput
• LHC Service in Operation – September 2006; ramp up to full operational capacity by April 2007; capable of handling twice the nominal data throughput
Service Challenge 3 - phases
High-level view:
• Throughput phase
  • 2 weeks sustained in July 2005 – the “obvious target” is the GDB of July 20th
  • Primary goals: 150 MB/s disk-to-disk to Tier-1s; 60 MB/s disk (T0) to tape (T1s)
  • Secondary goals: include a few named T2 sites (T2 -> T1 transfers); encourage the remaining T1s to start disk-to-disk transfers
• Service phase
  • September to end of 2005
  • Start with ALICE & CMS; add ATLAS and LHCb in October/November
  • All offline use cases except analysis
  • More components: WMS, VOMS, catalogues, experiment-specific solutions
  • Implies a production setup (CE, SE, …)
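For scale, a quick worked calculation of what those rates imply in volume over the two-week window (rates taken from the goals above):

```python
# Sketch: total data volume implied by the SC3 throughput-phase targets,
# sustained over the two-week window. Rates are from the slide.
SECONDS_PER_DAY = 86_400
DAYS = 14

disk_disk_mb_s = 150  # per Tier-1, disk to disk
disk_tape_mb_s = 60   # T0 disk to T1 tape

disk_disk_tb = disk_disk_mb_s * SECONDS_PER_DAY * DAYS / 1e6  # MB -> TB
disk_tape_tb = disk_tape_mb_s * SECONDS_PER_DAY * DAYS / 1e6

print(f"disk-disk: ~{disk_disk_tb:.0f} TB per Tier-1 over {DAYS} days")
print(f"disk-tape: ~{disk_tape_tb:.0f} TB over {DAYS} days")
# 150 MB/s sustained for 14 days is roughly 181 TB per Tier-1 site.
```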
SC implications
• SC3 will involve the Tier-1 sites (plus a few large Tier-2s) in July
  • The release to be used in SC3 must be available in mid-June, and the involved sites must upgrade for July
  • It is not reasonable to expect those sites to commit to other significant work (pre-production etc.) on that timescale
  • T1s: ASCC, BNL, CCIN2P3, CNAF, FNAL, GridKA, NIKHEF/SARA, RAL and …
• Expect the SC3 release to include FTS, LFC and DPM, but otherwise to be very similar to LCG-2.4.0
• September–December: experiment “production” verification of the SC3 services; in parallel, set-up for SC4
• Expect the “normal” support infrastructure (CICs, ROCs, GGUS) to support service challenge usage
• Bio-med is also planning data challenges – we must make sure these are all correctly scheduled
SC3 issues
• The Tier-1 network is being extensively re-configured. Tests showed up to 40% packet loss! Waiting for UKLight to be fixed. Not intending to use dual-homing, but dCache have provided a solution
• The Lancaster link is up at the link level – what is the bandwidth of the Lancaster connection?
• Edinburgh has a hardware problem with the RAID array to be used as the SE – IBM is investigating
• Lancaster has set up a test system and is now deploying more hardware
• Need clarification on the classification of volatile vs permanent data with respect to Tier-2s
• The file transfer service should be ready now, but has problems with the client component
• RAL would like a longer period for testing tape than suggested in the SC3 plans
• There has been an issue with CMS preferring to use PhEDEx rather than FTS for transfers. We need to add a period of PhEDEx-only transfer tests into the plans
• The dCache mailing list is very active now. There have been problems with the installation scripts
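The 40% packet-loss figure is worth translating into throughput terms. A rough sketch using the standard Mathis et al. TCP approximation; the MSS and RTT values are assumptions for illustration, not measurements.

```python
# Sketch: single-stream TCP throughput vs packet loss, using the
# Mathis et al. approximation: rate <= MSS / (RTT * sqrt(loss)).
# MSS and RTT below are assumed values, not measurements.
from math import sqrt

MSS_BYTES = 1460   # typical Ethernet segment size
RTT_S = 0.010      # assumed ~10 ms round-trip time

for loss in (0.0001, 0.01, 0.40):
    rate_mb_s = MSS_BYTES / (RTT_S * sqrt(loss)) / 1e6
    print(f"loss {loss:7.2%}: ~{rate_mb_s:6.2f} MB/s per stream")

# At 40% loss a single stream drops well below 1 MB/s, so the network
# re-configuration has to land before the July throughput phase.
```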
SC3 issues continued
• We have questions about whether FTS uses SRM-put or SRM-cp
• From September onwards the SC3 infrastructure is to provide a production-quality service for all experiments – remember the comments about UKLight being a research network: a risk!?
• Engagement with the experiments differs. Edinburgh needs a better relationship with LHCb
• There is an LCG workshop in mid-June where the experiment plans should be almost final!
• GridPP needs to do more load testing than is anticipated in SC3
• Planning for SC4 needs to start soon. Currently we are pushing dCache, but DPM is also supposed to be available
Imperial (London Tier-2)
• SRM/dCache status
  • Production server installed: gfe02.hep.ph.ic.ac.uk
  • Information provider still under development
  • 1.5TB pool node added: RHEL 4, 64-bit system, installed using the dcache.org instructions http://www.dcache.org/downloads/dCache-instructions.txt
  • An extra 1.5TB is ready to add when CMS is ready
  • 6TB is being purchased and should be in place by the start of the Setup Phase
• CMS software
  • Service node provided; PhEDEx installed
  • Confirmation on the FTS/PhEDEx issue is being sought
Edinburgh
• Current LCG production setup
  • Compute Element (CE), classic Storage Element (SE), 3 Worker Nodes (2 machines, 3 CPUs). Monitoring takes place on the SE, running LCG 2.4.0. About to add 2 Worker Nodes (2 CPUs in 1 machine), with a User Interface (UI) in testing. A 22TB datastore is available
• Plans
  • £2000 available for 2 machines – one for dCache work and one to connect to EPCC's SAN (10 TBytes promised)
  • Considering the procurement of more WNs, but have no clear requirements from LHCb
Lancaster (current)
[Diagram of the current Lancaster setup]
Lancaster (planned)
• LightPath and terminal endbox installed
• Still require some hardware for our internal network topology
• Increase in storage to ~84TB, possibly ~92TB, with a working resilient dCache from the CE
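One caveat worth noting: a resilient dCache keeps multiple replicas of each file across pools, so usable capacity is the raw figure divided by the replica count. A trivial sketch, assuming a replica count of 2:

```python
# Sketch: usable capacity under resilient dCache, which maintains N
# replicas of each file across pools. Replica count 2 is an assumption.
raw_tb = 92
replicas = 2
print(f"~{raw_tb} TB raw -> ~{raw_tb // replicas} TB usable at {replicas} replicas")
```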
Other areas…
JRA4 request
• We have some idea of the requirements from networking experts within JRA4
  • Draft requirements document: https://edms.cern.ch/document/593620/1
  • Draft use-case document: https://edms.cern.ch/document/591777/1
• We're looking for more input from NOCs and GOCs
  • If you have requirements, use cases or opinions on interfaces or needed metrics, please send them to us
  • Even if you don't have ideas at the moment but would like to be involved in the process, please get in contact
  • Contact details are at the end of the talk
DTEAM discussion
• Review of team objectives – what is the team focus for the next 3 & 5 months?
• Communications with the experiments
• Using a project tool to work better as a team
• Metrics!!
• Review of plans and what needs to be done to keep them up to date, including GridPP challenges and SC4
• Web-page status
• Areas raised at the T2B and DB meetings
• Security challenge involvement
• Accounting – status and making further progress
• Libraries and understanding experiment needs
• Review of dCache efforts
• Address issues with quarterly and weekly reports
• Next release, test-zone and test-zone machines
• Data management – guidelines required
• Improving robustness
• GI – documentation (esp. releases), multi-Tier R-GMA, introducing new sites, LCFGng distribution (Kickstart & Pixieboot…), jobs – how to get