LHCb on the Grid
Raja Nandakumar (with contributions from Greig Cowan)
GridPP21, 3rd September 2008
LHCb computing model
• CERN (Tier-0) is the hub of all activity
  • Full copy at CERN of all raw data and DSTs
  • All Tier-1s hold a full copy of the DSTs
• Simulation at all possible sites (CERN, T1, T2)
  • LHCb has used about 120 sites on 5 continents so far
• Reconstruction, stripping and analysis at T0 / T1 sites only
  • Some analysis may be possible at "large" T2 sites in the future
• Almost all computing (except for development / tests) will run on the grid
  • Large productions: production team
  • Ganga (on top of DIRAC) as the grid user interface (see the sketch below)
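To make the last point concrete, here is a minimal sketch of how a user job reaches the grid through Ganga's DIRAC backend, as typed inside an interactive "ganga" session (Job, Executable and Dirac are Ganga GPI objects; the job name and executable below are placeholders, and a real analysis job would use an LHCb application object such as DaVinci instead of Executable):

```python
# Sketch only: assumes an interactive ganga session with the LHCb/DIRAC plugins loaded.
j = Job(name="gridpp21-demo")                          # placeholder job name
j.application = Executable(exe="/bin/echo", args=["hello", "grid"])
j.backend = Dirac()        # route the job through DIRAC rather than a local or batch backend
j.submit()
j.status                   # 'submitted' -> 'running' -> 'completed' as DIRAC processes it
```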
LHCb on the grid
• Only a small amount of activity over the past year
  • DIRAC3 has been under development
  • Physics groups have not asked for new productions
• The situation has changed recently...
LHCb on the grid
• DIRAC3
  • Nearing a stable production release
  • Extensive experience with CCRC08 and follow-up exercises
  • Used as THE production system for LHCb
  • Ganga developers are now testing its interfaces (see the API sketch below)
• Generic pilot agent framework
  • Critical problems found with gLite WMS 3.0 and 3.1
    • Mixing of VOMS roles under certain reasonably common conditions
    • Cannot have people with different VOMS roles!
    • Savannah bug #39641
    • Being worked on by the developers
  • Waiting for this to be solved before restarting tests
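For reference, the interface Ganga drives underneath is DIRAC's Python job API. A rough sketch of direct submission through it is shown below (module paths and method names as I understand the DIRAC3 API of that era; the job name and executable are placeholders):

```python
# Sketch only: assumes a configured DIRAC3 client installation and a valid grid proxy.
from DIRAC.Interfaces.API.Dirac import Dirac
from DIRAC.Interfaces.API.Job import Job

job = Job()
job.setName("gridpp21-test")                     # placeholder name
job.setExecutable("/bin/echo", arguments="hello grid")
job.setOutputSandbox(["StdOut", "StdErr"])       # bring back stdout/stderr with the sandbox

dirac = Dirac()
result = dirac.submit(job)                       # returns an S_OK/S_ERROR style dictionary
if result["OK"]:
    print("Submitted job %s" % result["Value"])  # job ID; a pilot agent will pick it up at a site
```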
DIRAC3 Production
• >90,000 jobs in the past 2 months
• Real production activity and testing of the gLite WMS
DIRAC3 Job Monitor https://lhcbweb.pic.es/DIRAC/jobs/JobMonitor/display
LHCb storage at RAL
• LHCb storage is primarily at the Tier-1s and CERN
• CASTOR is used as the storage system at RAL
  • Fully migrated off dCache in May 2008
  • One tape damaged and the file on it marked lost
• Was (more or less) stable until 20 Aug 2008
  • Has not been able to take a heavy load on the servers
  • Low upper limit (8) on LSF job slots on various CASTOR disk servers
    • Too many jobs (>500) can arrive in the batch system, and the affected service class then hangs
    • Temporarily fixed for now; needs to be monitored (probably by the shifter on duty?)
  • Increase the limit to >100 rfio jobs per server
    • Not all hardware can handle a limit of 200 jobs (they start using swap space)
  • Problem seen many times over the last few months
• CASTOR currently in downtime
  • This is worrying given how close we are to data taking
LHCb at RAL
• Move to srm-v2 by LHCb
  • Needed so that RAL can retire the srm-v1 endpoints and hardware
  • Will happen when DIRAC3 becomes the baseline for user analysis
    • Already used for almost all production
    • Ganga working on submitting through DIRAC3
  • Also requires LHCb to rename files in the LFC
  • All space tokens, etc. have been set up
  • Target: turn off srm-v1 access by the end of September
  • srm-v1 is currently used for user analysis
    • DIRAC2 does not support srm-v2
• Batch system:
  • Pausing of jobs during downtime?
    • Not clear about the status of this
    • For now, stop the batch system from accepting LHCb jobs a few hours before scheduled downtimes
    • No LHCb job should run for >24 hours
  • Announce the beginning and end of downtimes
    • Problems with the broadcast tools
    • GGUS ticket opened by Derek Ross
LHCb and CCRC08
• Planned tasks: test the LHCb computing model
• Raw data distribution from the pit to the T0 centre
  • Use of rfcp into CASTOR from the pit – T1D0
• Raw data distribution from T0 to the T1 centres
  • Use of FTS – T1D0 (see the sketch below)
• Reconstruction of raw data at CERN & the T1 centres
  • Production of rDST data – T1D0
  • Use of SRM 2.2
• Stripping of data at CERN & the T1 centres
  • Input data: RAW & rDST – T1D0
  • Output data: DST – T1D1
  • Use of SRM 2.2
• Distribution of DST data to all other centres
  • Use of FTS
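As an illustration of the kind of third-party copy the FTS steps exercise, the sketch below submits a single SRM-to-SRM transfer with the gLite FTS command-line tools and polls its state. The FTS endpoint and both SURLs are placeholders, not the actual CCRC08 channels, and the gLite data-transfer CLI is assumed to be installed on the UI:

```python
# Sketch only: wraps the glite-transfer-* CLI; endpoint and SURLs are invented examples.
import subprocess

fts = "https://fts.example.org:8443/glite-data-transfer-fts/services/FileTransfer"  # placeholder FTS endpoint
source = "srm://srm-lhcb.cern.ch/castor/cern.ch/grid/lhcb/data/raw/run1234.raw"     # placeholder source SURL
dest = "srm://srm-lhcb.gridpp.rl.ac.uk/castor/ads.rl.ac.uk/prod/lhcb/data/raw/run1234.raw"  # placeholder destination SURL

# Submit the transfer; the CLI prints the FTS job identifier on success.
job_id = subprocess.check_output(
    ["glite-transfer-submit", "-s", fts, source, dest]).decode().strip()

# Poll the job state (Submitted / Active / Done / Failed).
status = subprocess.check_output(
    ["glite-transfer-status", "-s", fts, job_id]).decode().strip()
print(status)
```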
LHCb and CCRC08
[Plots: CCRC08 Reconstruction and Stripping activity]
LHCb CCRC08 Problems
• CCRC08 highlighted areas to be improved
• File access problems
  • Random or permanent failures to open files using gsidcap
    • Requested IN2P3 and NL-T1 to allow the dcap protocol for local read access
    • Now using xroot at IN2P3 – appears to be successful
  • Wrong file status returned by the dCache SRM after a put
    • bringOnline was not doing anything
• Software area access problems
  • Site banned for a while until the problem is fixed
• Application crashes
  • Fixed with a new software release and deployment
• Major issues with the LHCb bookkeeping
  • Especially for stripping
• Lessons learned
  • Better error reporting in pilot logs and workflow
  • Alternative forms of data access needed in emergencies
    • Downloading of files to the WN (used at IN2P3, RAL; see the sketch below)
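A rough sketch of the "download to the worker node" fallback: copy the input file from grid storage to local scratch with lcg-cp before the application opens it locally. The LFN and local path are placeholders; resolving an lfn: source relies on the LFC host being set in the job environment, which is assumed here:

```python
# Sketch only: assumes lcg-utils on the WN and LFC_HOST pointing at the LHCb LFC.
import os
import subprocess

lfn = "lfn:/grid/lhcb/production/DC06/00001234_00000056_1.dst"  # placeholder LFN
local = "file:%s/input.dst" % os.getcwd()                       # local scratch copy on the WN

rc = subprocess.call(["lcg-cp", "--vo", "lhcb", lfn, local])
if rc != 0:
    # If the download fails, fall back to remote protocol access (gsidcap/rfio/xroot).
    raise RuntimeError("lcg-cp failed with exit code %d" % rc)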
LHCb Grid Operations
• A Grid Operations and Production team has been created
Communications
• LHCb sites
  • The grid operations team keeps track of problems
  • Reports to sites via GGUS and the eLogger
  • All posts are reported on lhcb-production@cern.ch
    • Please subscribe if you want to know what is going on
  • http://lblogbook.cern.ch/Operations (RSS feed available)
• LHCb users
  • Mailing lists
    • lhcb-distributed-analysis@cern.ch – all problems directed here
    • Specific lists for each LHCb application and Ganga
  • Ticketing systems (Savannah, GGUS) for DIRAC, Ganga and the applications
    • Used by developers and "power" users
  • Software weeks provide training sessions on using the Grid tools
  • Weekly distributed analysis meetings (starting this Friday)
    • DIRAC, Ganga and core software developers along with some users
    • Aim to identify needs and coordinate release plans
Summary
• Concerned about CASTOR stability this close to data taking
• DIRAC3 workload and data management system is now online
  • Has been extensively tested running LHCb productions
  • Now being moved into the user analysis system
    • Ganga needs some additional development
• Grid operations team working with sites, users and developers to identify and resolve problems quickly and efficiently
• LHCb is looking forward to the imminent switch-on of the LHC!