Status of the use of LCG
Stefano Bagnasco, INFN Torino
ALICE Offline Week, CERN, June 1, 2004
Being bold: from AliEn to a Meta-Grid
• LCG-2 core sites
  • CERN, CNAF, FZK, NIKHEF, RAL, Taiwan, Cyfronet, IC, Cambridge (more than 1000 CPUs)
• GRID.IT sites
  • LNL.INFN, PD.INFN and several smaller ones (about 400 CPUs, not including CNAF)
• The pull model is well suited to implementing higher-level submission systems, since it does not require knowledge of the periphery, which may be very complex:
  “A Grid is a system that […] coordinates resources that are not subject to centralized control […] using standard, open, general-purpose protocols and interfaces […] to deliver nontrivial qualities of service.”
  I. Foster, “What is the Grid? A Three Point Checklist”, Grid Today (2001)
From AliEn to a Meta-Grid – cont’d
Design strategy:
• Use AliEn as a general front-end
  • Owned and shared resources are exploited transparently
• Minimize points of contact between the systems
  • No need to reimplement services, etc.
  • No special services required to run on remote CEs/WNs
• Make full use of the services provided: data catalogues, scheduling, monitoring…
  • Let the Grids do their job (they should know how)
• Use high-level tools and APIs to access Grid resources
  • The developers put a lot of effort into abstraction, hiding the complexity and shielding the user from implementation changes
Implementation
• Manage LCG (and Grid.it) resources through a “gateway”: an AliEn client (CE+SE) sitting on top of an LCG User Interface
  • The whole of LCG computing is seen as a single, large AliEn CE associated with a single, large SE
• Job management interface
  • JDL translation (including InputData statements); a sketch of this translation step is given below
  • JobID bookkeeping
  • Command proxying (Submit, Cancel, Status query…)
• Data management interface
  • Put() and Get() implementation
  • The AliEn PFN is the LCG GUID (plus the SE host, to allow for optimisation)
  • AliEn knows nothing about LCG replicas (but the RLS/ROS does!)
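The JDL translation step is easiest to picture with a small example. The following Python sketch shows the general idea only: the attribute names on the LCG side follow the usual EDG JDL conventions, but the mapping rules, the defaults, the example LFN and the helper name translate_jdl are illustrative assumptions, not the actual AliEn interface code.

def translate_jdl(alien_job: dict) -> str:
    """Build an EDG/LCG-style JDL string from a simplified AliEn job description."""
    lcg = {
        "Executable": alien_job.get("Executable", "/bin/sh"),
        "Arguments": alien_job.get("Arguments", ""),
        "StdOutput": "stdout",
        "StdError": "stderr",
        "OutputSandbox": '{"stdout", "stderr"}',
    }
    # Pass the InputData LFNs through so the LCG RB can schedule the job close
    # to the data (the real interface resolves them in the catalogue first).
    lfns = alien_job.get("InputData", [])
    if lfns:
        lcg["InputData"] = "{" + ", ".join('"lfn:%s"' % l for l in lfns) + "}"

    lines = []
    for key, value in lcg.items():
        if str(value).startswith("{"):          # ClassAd list: emit as-is
            lines.append("%s = %s;" % (key, value))
        else:                                   # plain string attribute
            lines.append('%s = "%s";' % (key, value))
    return "[\n  " + "\n  ".join(lines) + "\n]"

if __name__ == "__main__":
    print(translate_jdl({
        "Executable": "aliroot",
        "Arguments": "-b -q grunPPR.C",
        "InputData": ["/alice/sim/2004/some_run/galice.root"],   # hypothetical LFN
    }))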
Interfacing AliEn and LCG
[Diagram: the AliEn server submits jobs through the interface site (AliEn CE + AliEn SE on top of an LCG UI) to the LCG RB, which dispatches them to the EDG CE and WNs of each LCG site; status reports flow back the same way. Data are registered in the AliEn Data Catalogue (LFN, with PFN = LFN) and, via LFN/PFN, in the LCG Replica Catalogue, with the LCG SE holding the files.]
Implementation details
• Job management interface
  • JDL translation (including InputData statements)
    • A remote catalogue query is needed for each job, which may slow down submission significantly
  • JobID bookkeeping
    • AliEn job states do not map onto LCG ones
    • Job states from the two systems are often out of sync (the LCG bookkeeping is slower)
  • Command proxying (Submit, Cancel, Status query…)
    • Have to take care of LCG L&B delays
• Data management interface
  • Put() and Get() implementation
  • The AliEn PFN is the LCG GUID (plus the SE host, to allow for optimisation), e.g. lcg://tbed0115.cern.ch/8f6064e7-b580-11d8-ba2a-f8adc5715cd0 (see the sketch below)
  • AliEn knows nothing about LCG replicas (but the RLS/ROS does!)
  • Again, let the RB do its job!
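Since the AliEn PFN is just the LCG GUID prefixed with the SE host, the interface can split it apart again whenever it needs to hand the GUID back to LCG data management. A minimal sketch of that convention, assuming only the lcg://<SE host>/<GUID> layout shown above (the helper names are hypothetical):

from urllib.parse import urlparse

def make_alien_pfn(se_host: str, guid: str) -> str:
    """Encode the LCG GUID plus the SE host holding a copy as an AliEn PFN."""
    return "lcg://%s/%s" % (se_host, guid)

def split_alien_pfn(pfn: str):
    """Return (se_host, guid) from an lcg:// PFN produced by make_alien_pfn()."""
    parsed = urlparse(pfn)
    if parsed.scheme != "lcg":
        raise ValueError("not an LCG-backed AliEn PFN: %s" % pfn)
    return parsed.netloc, parsed.path.lstrip("/")

if __name__ == "__main__":
    pfn = make_alien_pfn("tbed0115.cern.ch", "8f6064e7-b580-11d8-ba2a-f8adc5715cd0")
    print(split_alien_pfn(pfn))   # ('tbed0115.cern.ch', '8f6064e7-b580-11d8-ba2a-f8adc5715cd0')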
Cheating: two grids, same resources!
• “Double access” for selected sites (CNAF and CT.INFN)
[Diagram: the user submits jobs to the AliEn server; the same site WNs can be reached either directly through the native AliEn CE/SE or through the LCG UI, the LCG RB and the site’s LCG CE/SE.]
Software installation
• Both AliEn and AliRoot are installed via “plain” LCG jobs (installAliEn.sh/installAliEn.jdl, installAlice.sh/installAlice.jdl)
  • They do some checks, download the tarballs, uncompress them, build the environment script and publish the relevant tags
• A single command is available to get the list of available sites, send the jobs everywhere and wait for completion (see the sketch below)
  • A full update on LCG-2 + GRID.IT (16 sites) takes ~30’ (if the queues are not full…)
• Manual intervention is still needed at a few sites (e.g. CERN/LSF)
• Ready for integration into the AliEn automatic installation system
• A misconfigured experiment-software shared area caused most of the trouble in the beginning
[Diagram: installation jobs fanned out from the LCG UI to NIKHEF, Taiwan, RAL, …, CNAF, TO.INFN]
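The “send everywhere and wait” command can be pictured as a loop over the candidate CEs. The sketch below shells out to the LCG-2 edg-job-submit and edg-job-status commands; the exact flags, the output parsing, the placeholder CE name and the polling interval are assumptions for illustration, not the actual installation tool.

import subprocess
import time

def submit_install(jdl_file: str, ce: str) -> str:
    """Submit the installation JDL to one CE; assume the job id is the last output line."""
    out = subprocess.run(["edg-job-submit", "-r", ce, jdl_file],
                         capture_output=True, text=True, check=True).stdout
    return out.strip().splitlines()[-1]

def wait_for(job_ids, poll_seconds=300):
    """Poll edg-job-status until every job has reached a terminal state."""
    pending = set(job_ids)
    while pending:
        for job_id in list(pending):
            status = subprocess.run(["edg-job-status", job_id],
                                    capture_output=True, text=True).stdout
            if any(state in status for state in ("Done", "Aborted", "Cancelled")):
                pending.discard(job_id)
        if pending:
            time.sleep(poll_seconds)

if __name__ == "__main__":
    ces = ["ce.example.org:2119/jobmanager-pbs-alice"]   # placeholder CE list
    wait_for([submit_install("installAlice.jdl", ce) for ce in ces])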
Monitoring
• AliEn job monitoring
  • Through the interface “keyhole”
  • State misalignment
• GridIce (LCG native monitoring)
  • Only CE/SE load, integrated information
  • No job monitoring (but coming soon)
• MonALISA sensors on the server
  • Checking storage capacity etc.
  • Looking directly at the LCG information providers via LDAP (see the sketch below)
• MonALISA sensors on the interface
  • Job distribution
• Not much automatism (but then, they do the job)
• AliEn knows nothing about LCG replicas (but the RLS and ROS do!)
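Looking at the LCG information providers directly means querying the Glue schema over LDAP, which is what the sensors on the server do for load and storage figures. A small sketch of such a query, shelling out to ldapsearch; the BDII host, port, base DN and attribute names are typical LCG-2 values but should be read as assumptions here:

import subprocess

def free_cpus(bdii_host="bdii.example.org", port=2170):
    """Return {CE id: free CPUs} by querying a BDII's GlueCE entries over LDAP."""
    out = subprocess.run(
        ["ldapsearch", "-x", "-LLL",
         "-H", "ldap://%s:%d" % (bdii_host, port),
         "-b", "mds-vo-name=local,o=grid",
         "(objectClass=GlueCE)", "GlueCEUniqueID", "GlueCEStateFreeCPUs"],
        capture_output=True, text=True, check=True).stdout

    result, current_ce = {}, None
    for line in out.splitlines():
        if line.startswith("GlueCEUniqueID:"):
            current_ce = line.split(":", 1)[1].strip()
        elif line.startswith("GlueCEStateFreeCPUs:") and current_ce:
            result[current_ce] = int(line.split(":", 1)[1].strip())
    return result

if __name__ == "__main__":
    for ce, cpus in sorted(free_cpus().items()):
        print("%5d free CPUs on %s" % (cpus, ce))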
The debugging loop
[Diagram: the correspondence between LCG statuses and AliEn statuses (Waiting, Assigned, Queued, Ready, Scheduled, Started, Running, Saving, Done, Aborted, Error_*), with where the logs come from at each stage: (some) logs from the LCG L&B, logs from the LCG OutputSandbox, logs from the AliEn catalogue. Couldn’t save and ran out of WallClockTime? No logs at all!]
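To keep the two sets of job states roughly aligned, the interface needs some translation between what the LCG L&B reports and what AliEn expects. The grouping below is a hypothetical illustration built from the states in the diagram; the real correspondence is looser, which is exactly why the statuses drift out of sync.

# Hypothetical LCG -> AliEn status grouping, for illustration only.
LCG_TO_ALIEN = {
    "Submitted": "ASSIGNED",
    "Waiting":   "QUEUED",
    "Ready":     "QUEUED",
    "Scheduled": "QUEUED",
    "Running":   "RUNNING",
    "Done":      "DONE",      # may still end up as ERROR_* if saving failed
    "Aborted":   "ERROR_E",
}

def alien_status(lcg_status: str) -> str:
    """Best-effort translation; unknown LCG states are passed through unchanged."""
    return LCG_TO_ALIEN.get(lcg_status, lcg_status)

assert alien_status("Scheduled") == "QUEUED"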
Statistics after round 1 (ended April 4): job distribution
• Alice::CERN::LCG is the interface to LCG-2
• Alice::Torino::LCG is the interface to GRID.IT
[Plot: PDC2004 status, job distribution per site]
AliEn vs. AliEn+LCG
• LCG-2 jobs seen through the AliEn MonALISA monitoring
• The ramp-up slope shows no major performance degradation
[Plot: running jobs over time for AliEn native sites, LCG-2 and GRID.IT]
Grid.it starting up
• Larger sites are filled first
Phase II: Reconstruction
• Local storage will be needed at all sites
  • So we will need to use native LCG storage
  • SRM in front of CASTOR, but the ugly “Classic SE” in front of disk pools
  • The SRM-to-dCache interface is not working yet
• The interface system is available, installed everywhere and tested on the EIS testbed
• SRM everywhere would simplify things a lot
• The use of GUIDs for files also simplified things (see the sketch below)
• Eagerly waiting to see whether large-scale production works…
• Many sites have ludicrously small storage
  • The LCG management is informed…
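With GUIDs, the Put() side against native LCG storage essentially reduces to “copy-and-register, keep the GUID”. A hedged sketch, assuming the LCG-2 lcg-cr command and that its last output line is the new GUID; the flags, the output parsing and the helper name are assumptions, not the interface code.

import subprocess

def put(local_path: str, se_host: str, vo: str = "alice") -> str:
    """Copy-and-register a local file on an LCG SE and return an AliEn-style PFN."""
    out = subprocess.run(
        ["lcg-cr", "--vo", vo, "-d", se_host, "file://%s" % local_path],
        capture_output=True, text=True, check=True).stdout
    guid = out.strip().splitlines()[-1].replace("guid:", "")   # assumed output format
    return "lcg://%s/%s" % (se_host, guid)

if __name__ == "__main__":
    print(put("/tmp/EMCAL.Hits.root", "tbed0115.cern.ch"))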
Phase II
• The full system was tested in the EIS testbed
  • With a somewhat simpler JDL than for production
  • It is awkward to do tests on the real production infrastructure

[aliendb3.cern.ch:3307] /proc/296915/ > ls -l
-rwxr-xr-x sbagnasc z2  40190 Jun 16 19:06 ConfigPPR.C
-rwxr-xr-x sbagnasc z2   9416 Jun 16 19:10 EMCAL.Hits.root
-rwxr-xr-x sbagnasc z2   8538 Jun 16 19:11 FMD.Hits.root
-rwxr-xr-x sbagnasc z2 568379 Jun 16 19:12 galice.root
-rwxr-xr-x sbagnasc z2    405 Jun 16 19:06 grunPPR.C
-rwxr-xr-x sbagnasc z2  13703 Jun 16 19:12 ITS.Hits.root
-rwxr-xr-x sbagnasc z2  19652 Jun 16 19:13 Kinematics.root
-rwxr-xr-x sbagnasc z2   9552 Jun 16 19:14 MUON.Hits.root
-rwxr-xr-x sbagnasc z2   7195 Jun 16 19:14 PHOS.Hits.root
-rwxr-xr-x sbagnasc z2   6822 Jun 16 19:15 PMD.Hits.root
-rwxr-xr-x sbagnasc z2   1431 Jun 16 19:20 resources
-rwxr-xr-x sbagnasc z2   7222 Jun 16 19:15 RICH.Hits.root
-rwxr-xr-x sbagnasc z2  10608 Jun 16 19:16 START.Hits.root
-rwxr-xr-x sbagnasc z2   2105 Jun 16 19:20 stderr
-rwxr-xr-x sbagnasc z2 277336 Jun 16 19:20 stdout
-rwxr-xr-x sbagnasc z2   9799 Jun 16 19:17 TOF.Hits.root
-rwxr-xr-x sbagnasc z2  35321 Jun 16 19:17 TPC.Hits.root
-rwxr-xr-x sbagnasc z2 100688 Jun 16 19:18 TrackRefs.root
-rwxr-xr-x sbagnasc z2  32475 Jun 16 19:19 TRD.Hits.root
-rwxr-xr-x sbagnasc z2  10992 Jun 16 19:19 VZERO.Hits.root
-rwxr-xr-x sbagnasc z2   8069 Jun 16 19:20 ZDC.Hits.root
[aliendb3.cern.ch:3307] /proc/296915/ > whereis -o EMCAL.Hits.root
And the file is in Alice::CERN::LCGtest
lcg://tbed0115.cern.ch/139fff74-bfb8-11d8-91c1-be96c4d6b0ee
Conclusions & lessons learned - 1
• In the first half of Phase I, LCG (+ GRID.IT) was able to provide ~50% of the computing cycles
  • The fraction was reduced afterwards, but we should be able to get it back
• The LCG Workload Management proved itself reasonably robust
  • We never lost a job because of RB failures
• Remote site configuration is still the major source of problems on the LCG side
  • Things are unlikely to get better with the use of LCG storage…
  • Software management tools are still rudimentary
  • Large sites often have tighter security restrictions and other idiosyncrasies
  • Investigating and fixing problems is hard and time-consuming
Conclusions & lessons learned - 2
• The most difficult part of the management is monitoring LCG through a “keyhole”
  • Only integrated information is available natively
  • MonALISA for AliEn, GridICE for LCG
• Some safety mechanisms are too coarse for this approach (queue blocking)
  • Anti-black-hole mechanisms are being studied in grid.it, based on run-time statistics, the CPUTime/WallClockTime ratio, etc. (see the sketch below)
• For short jobs, submission time (and thus the performance of the interface system) can limit the number of jobs
  • But the system is trivially scalable
• InputData translation may slow things down significantly
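As a concrete picture of the anti-black-hole idea: a site whose finished jobs show a very low CPUTime/WallClockTime ratio, or that terminate suspiciously quickly, can be excluded from matchmaking for a while. The sketch below only illustrates the statistic; the thresholds, the record layout and the function name are assumptions, not the mechanism under study in grid.it.

from statistics import mean

def blackhole_candidates(job_records, min_efficiency=0.2, min_wallclock=300, min_jobs=10):
    """job_records: iterable of (site, cpu_time_s, wallclock_s) tuples for finished jobs."""
    per_site = {}
    for site, cpu, wall in job_records:
        per_site.setdefault(site, []).append((cpu, wall))

    flagged = []
    for site, jobs in per_site.items():
        if len(jobs) < min_jobs:
            continue                                    # not enough statistics yet
        ratios = [cpu / wall for cpu, wall in jobs if wall > 0]
        if not ratios:
            continue
        avg_efficiency = mean(ratios)
        avg_wallclock = mean(wall for _, wall in jobs)
        if avg_efficiency < min_efficiency or avg_wallclock < min_wallclock:
            flagged.append(site)                        # candidate black hole: stop sending jobs here
    return flagged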