LHCb on the Grid
Raja Nandakumar (with contributions from Greig Cowan)
GridPP21, 3rd September 2008
LHCb computing model
• CERN (Tier-0) is the hub of all activity
  • Full copy at CERN of all raw data and DSTs
  • All Tier-1s hold a full copy of the DSTs
• Simulation at all possible sites (CERN, T1, T2)
  • LHCb has used about 120 sites on 5 continents so far
• Reconstruction, stripping and analysis at T0 / T1 sites only
  • Some analysis may be possible at "large" T2 sites in the future
• Almost all computing (except for development / tests) will run on the grid
  • Large productions: production team
  • Ganga (on top of DIRAC) as the grid user interface (see the sketch below)
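To make the last point concrete, here is a minimal sketch of how a user job reaches the grid through Ganga's DIRAC backend, as typed inside an interactive "ganga" session (Job, Executable and Dirac are Ganga GPI objects; the job name and executable below are placeholders, and a real analysis job would use an LHCb application object such as DaVinci instead of Executable):

```python
# Sketch only: assumes an interactive ganga session with the LHCb/DIRAC plugins loaded.
j = Job(name="gridpp21-demo")                          # placeholder job name
j.application = Executable(exe="/bin/echo", args=["hello", "grid"])
j.backend = Dirac()        # route the job through DIRAC rather than a local or batch backend
j.submit()
j.status                   # 'submitted' -> 'running' -> 'completed' as DIRAC processes it
```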
LHCb on the grid
• Only a small amount of activity over the past year
  • DIRAC3 has been under development
  • Physics groups have not asked for new productions
• The situation has changed recently...
LHCb on the grid
• DIRAC3
  • Nearing a stable production release
  • Extensive experience with CCRC08 and follow-up exercises
  • Used as THE production system for LHCb
  • Ganga developers are now testing its interfaces (see the API sketch below)
• Generic pilot agent framework
  • Critical problems found with gLite WMS 3.0 and 3.1
    • Mixing of VOMS roles under certain reasonably common conditions
    • Cannot have people with different VOMS roles!
    • Savannah bug #39641
    • Being worked on by the developers
  • Waiting for this to be solved before restarting tests
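For reference, the interface Ganga drives underneath is DIRAC's Python job API. A rough sketch of direct submission through it is shown below (module paths and method names as I understand the DIRAC3 API of that era; the job name and executable are placeholders):

```python
# Sketch only: assumes a configured DIRAC3 client installation and a valid grid proxy.
from DIRAC.Interfaces.API.Dirac import Dirac
from DIRAC.Interfaces.API.Job import Job

job = Job()
job.setName("gridpp21-test")                     # placeholder name
job.setExecutable("/bin/echo", arguments="hello grid")
job.setOutputSandbox(["StdOut", "StdErr"])       # bring back stdout/stderr with the sandbox

dirac = Dirac()
result = dirac.submit(job)                       # returns an S_OK/S_ERROR style dictionary
if result["OK"]:
    print("Submitted job %s" % result["Value"])  # job ID; a pilot agent will pick it up at a site
```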
DIRAC3 Production
• >90,000 jobs in the past 2 months
• Real production activity and testing of the gLite WMS
DIRAC3 Job Monitor https://lhcbweb.pic.es/DIRAC/jobs/JobMonitor/display
LHCb storage at RAL
• LHCb storage is primarily at the Tier-1s and CERN
• CASTOR is used as the storage system at RAL
  • Fully migrated off dCache in May 2008
  • One tape damaged and the file on it marked lost
• Was (more or less) stable until 20 Aug 2008
  • Has not been able to take a heavy load on the servers
  • Low upper limit (8) on LSF job slots on various CASTOR disk servers
    • Too many jobs (>500) can arrive in the batch system, and the affected service class then hangs
    • Temporarily fixed for now; needs to be monitored (probably by the shifter on duty?)
  • Increase the limit to >100 rfio jobs per server
    • Not all hardware can handle a limit of 200 jobs (they start using swap space)
  • Problem seen many times over the last few months
• CASTOR currently in downtime
  • This is worrying given how close we are to data taking
LHCb at RAL
• Move to srm-v2 by LHCb
  • Needed so that RAL can retire the srm-v1 endpoints and hardware
  • Will happen when DIRAC3 becomes the baseline for user analysis
    • Already used for almost all production
    • Ganga working on submitting through DIRAC3
  • Also requires LHCb to rename files in the LFC
  • All space tokens, etc. have been set up
  • Target: turn off srm-v1 access by the end of September
  • srm-v1 is currently used for user analysis
    • DIRAC2 does not support srm-v2
• Batch system:
  • Pausing of jobs during downtime?
    • Not clear about the status of this
    • For now, stop the batch system from accepting LHCb jobs a few hours before scheduled downtimes
    • No LHCb job should run for >24 hours
  • Announce the beginning and end of downtimes
    • Problems with the broadcast tools
    • GGUS ticket opened by Derek Ross
LHCb and CCRC08
• Planned tasks: test the LHCb computing model
• Raw data distribution from the pit to the T0 centre
  • Use of rfcp into CASTOR from the pit – T1D0
• Raw data distribution from T0 to the T1 centres
  • Use of FTS – T1D0 (see the sketch below)
• Reconstruction of raw data at CERN & the T1 centres
  • Production of rDST data – T1D0
  • Use of SRM 2.2
• Stripping of data at CERN & the T1 centres
  • Input data: RAW & rDST – T1D0
  • Output data: DST – T1D1
  • Use of SRM 2.2
• Distribution of DST data to all other centres
  • Use of FTS
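As an illustration of the kind of third-party copy the FTS steps exercise, the sketch below submits a single SRM-to-SRM transfer with the gLite FTS command-line tools and polls its state. The FTS endpoint and both SURLs are placeholders, not the actual CCRC08 channels, and the gLite data-transfer CLI is assumed to be installed on the UI:

```python
# Sketch only: wraps the glite-transfer-* CLI; endpoint and SURLs are invented examples.
import subprocess

fts = "https://fts.example.org:8443/glite-data-transfer-fts/services/FileTransfer"  # placeholder FTS endpoint
source = "srm://srm-lhcb.cern.ch/castor/cern.ch/grid/lhcb/data/raw/run1234.raw"     # placeholder source SURL
dest = "srm://srm-lhcb.gridpp.rl.ac.uk/castor/ads.rl.ac.uk/prod/lhcb/data/raw/run1234.raw"  # placeholder destination SURL

# Submit the transfer; the CLI prints the FTS job identifier on success.
job_id = subprocess.check_output(
    ["glite-transfer-submit", "-s", fts, source, dest]).decode().strip()

# Poll the job state (Submitted / Active / Done / Failed).
status = subprocess.check_output(
    ["glite-transfer-status", "-s", fts, job_id]).decode().strip()
print(status)
```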
LHCb and CCRC08
[Plots: CCRC08 Reconstruction and Stripping activity]
LHCb CCRC08 Problems
• CCRC08 highlighted areas to be improved
• File access problems
  • Random or permanent failures to open files using gsidcap
    • Requested IN2P3 and NL-T1 to allow the dcap protocol for local read access
    • Now using xroot at IN2P3 – appears to be successful
  • Wrong file status returned by the dCache SRM after a put
    • bringOnline was not doing anything
• Software area access problems
  • Site banned for a while until the problem is fixed
• Application crashes
  • Fixed with a new software release and deployment
• Major issues with the LHCb bookkeeping
  • Especially for stripping
• Lessons learned
  • Better error reporting in pilot logs and workflow
  • Alternative forms of data access needed in emergencies
    • Downloading of files to the WN (used at IN2P3, RAL; see the sketch below)
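A rough sketch of the "download to the worker node" fallback: copy the input file from grid storage to local scratch with lcg-cp before the application opens it locally. The LFN and local path are placeholders; resolving an lfn: source relies on the LFC host being set in the job environment, which is assumed here:

```python
# Sketch only: assumes lcg-utils on the WN and LFC_HOST pointing at the LHCb LFC.
import os
import subprocess

lfn = "lfn:/grid/lhcb/production/DC06/00001234_00000056_1.dst"  # placeholder LFN
local = "file:%s/input.dst" % os.getcwd()                       # local scratch copy on the WN

rc = subprocess.call(["lcg-cp", "--vo", "lhcb", lfn, local])
if rc != 0:
    # If the download fails, fall back to remote protocol access (gsidcap/rfio/xroot).
    raise RuntimeError("lcg-cp failed with exit code %d" % rc)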
LHCb Grid Operations
• A Grid Operations and Production team has been created
Communications
• LHCb sites
  • The grid operations team keeps track of problems
  • Reports to sites via GGUS and the eLogger
  • All posts are reported on lhcb-production@cern.ch
    • Please subscribe if you want to know what is going on
  • http://lblogbook.cern.ch/Operations (RSS feed available)
• LHCb users
  • Mailing lists
    • lhcb-distributed-analysis@cern.ch – all problems directed here
    • Specific lists for each LHCb application and Ganga
  • Ticketing systems (Savannah, GGUS) for DIRAC, Ganga and the applications
    • Used by developers and "power" users
  • Software weeks provide training sessions on using the Grid tools
  • Weekly distributed analysis meetings (starting this Friday)
    • DIRAC, Ganga and core software developers along with some users
    • Aim to identify needs and coordinate release plans
Summary
• Concerned about CASTOR stability this close to data taking
• DIRAC3 workload and data management system is now online
  • Has been extensively tested running LHCb productions
  • Now being moved into the user analysis system
    • Ganga needs some additional development
• Grid operations team working with sites, users and developers to identify and resolve problems quickly and efficiently
• LHCb is looking forward to the imminent switch-on of the LHC!