CCRC08 post-mortem: LHCb activities at PIC. G. Merino, PIC, 19/06/2008
LHCb Computing • Main user analysis supported at CERN + 6 Tier-1s • Tier-2s are essentially Monte Carlo production facilities
CCRC08: Planned tasks • May activities: maintain the equivalent of 1 month of data taking, assuming a 50% machine cycle efficiency • Raw data distribution from pit → T0 centre • Raw data distribution from T0 → T1 centres • Use of FTS - T1D0 • Reconstruction of raw data at CERN & T1 centres • RAW (T1D0) → rDST (T1D0) • Stripping of data at CERN & T1 centres • RAW & rDST (T1D0) → DST (T1D1) • Distribution of DST data to all other centres • Use of FTS - T0D1 (except CERN: T1D1)
Activities across the sites • Planned breakdown of processing activities (CPU needs) prior to CCRC08
Tier-0 → Tier-1 • FTS from CERN to Tier-1 centres • Transfer of RAW only occurs once the data have migrated to tape & the checksum is verified • Rate out of CERN ~35 MB/s averaged over the period • Peak rate far in excess of the requirement • In smooth running, sites matched the LHCb requirements
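To make this transfer step concrete, here is a minimal sketch of how such a RAW replication with checksum verification could be submitted through the present-day FTS3 Python bindings; CCRC08 itself used the 2008-era FTS client, and the endpoint, SURLs and retry count below are hypothetical placeholders.

```python
# Minimal FTS submission sketch (modern fts3-rest "easy" bindings; endpoint and
# SURLs are hypothetical placeholders, not the real CCRC08 paths).
import fts3.rest.client.easy as fts3

endpoint = 'https://fts3.example.cern.ch:8446'
context = fts3.Context(endpoint)

# One RAW file, CERN -> PIC (placeholder SURLs).
transfer = fts3.new_transfer(
    'srm://srm-lhcb.cern.ch/castor/cern.ch/grid/lhcb/RAW/run1234_0001.raw',
    'srm://srmlhcb.pic.es/pnfs/pic.es/data/lhcb/RAW/run1234_0001.raw')

# Ask FTS to verify the checksum on transfer and to write into the RAW space token.
job = fts3.new_job([transfer],
                   verify_checksum=True,
                   spacetoken='LHCb_RAW',
                   retry=3)

job_id = fts3.submit(context, job)
print(fts3.get_job_status(context, job_id)['job_state'])
```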
Tier-0 → Tier-1 • To first order, all transfers eventually succeeded • Plot shows the efficiency on the 1st attempt; annotated incidents: issue with UK certificates, restart of the IN2P3 SRM endpoint, CERN outage, CERN SRM endpoint problems
Reconstruction • Used SRM 2.2 • LHCb space tokens are: LHCb_RAW (T1D0); LHCb_RDST (T1D0) • Data shares need to be preserved • Important for resource planning • Input: 1 RAW file; output: 1 rDST file (1.6 GB) • Reduced the number of events per reconstruction job from 50k to 25k (job of ~12 hour duration on a 2.8 kSI2k machine) in order to fit within the available queues • Need queues at all sites that match our processing time • Alternative: reduce the file size!
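The queue-length argument behind the 50k → 25k change can be made explicit with a little arithmetic; the sketch below only reuses the numbers quoted on the slide and assumes CPU time scales linearly with the number of events and with kSI2k power.

```python
# Back-of-the-envelope job-length arithmetic (numbers from the slide above;
# linear scaling with events and kSI2k power is an assumption).
HOUR = 3600.0

events_per_job = 25000
walltime_h = 12.0        # observed duration on a 2.8 kSI2k worker node
node_power = 2.8         # kSI2k

sec_per_event = walltime_h * HOUR / events_per_job    # ~1.7 s/event on that node
ksi2k_sec_per_event = sec_per_event * node_power      # ~4.8 kSI2k.s per event

def job_hours(events, power_ksi2k):
    """Estimated wall-clock hours for a reconstruction job of `events` events."""
    return events * ksi2k_sec_per_event / power_ksi2k / HOUR

print(job_hours(50000, 2.8))   # ~24 h: too long for typical batch queue limits
print(job_hours(25000, 2.8))   # ~12 h: fits the available queues
```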
Reconstruction • After data transfer the file should be online, as the job is submitted immediately • NOTE: in principle only LHCb has this requirement of "online reconstruction" • Reco jobs will read the input data from the T1D0 write buffer • Just in case, LHCb pre-stages the files (srm_bringonline) & then checks the status of each file (srm_ls) before submitting the pilot job, via GFAL • Pre-staging should ensure access availability from the cache • Only issue at NL-T1, with the reporting of the file status
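A minimal sketch of that pre-stage-then-check pattern is given below, written against the present-day gfal2 Python bindings rather than the GFAL version used in 2008; the SURL and pin time are placeholders, and the exact return values should be checked against the gfal2 documentation.

```python
import gfal2

# Hypothetical SURL of a RAW file sitting in the T1D0 write buffer.
surl = 'srm://srmlhcb.pic.es/pnfs/pic.es/data/lhcb/raw/run1234_0001.raw'

ctx = gfal2.creat_context()

# Ask the SRM to stage the file to disk (the srm_bringonline step),
# pinning it for 12 hours; the last argument selects synchronous mode.
ctx.bring_online(surl, 12 * 3600, 3600, False)

# Check the file status and metadata (the srm_ls step) before the
# pilot job is submitted.
status = ctx.getxattr(surl, 'user.status')   # e.g. 'ONLINE' or 'NEARLINE'
info = ctx.stat(surl)

if 'ONLINE' in status:
    print('File staged (%d bytes), safe to submit the pilot job' % info.st_size)
else:
    print('File still only on tape, wait before submitting')
```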
Reconstruction • 41.2k reconstruction jobs submitted • 27.6k jobs proceeded to done state • Done/created ~67%
Reconstruction • 27.6k reconstruction jobs in done state • 21.2k jobs processed 25k events • Done/25k events ~77% • 3.0k jobs failed to upload rDST to local SE • Only 1 attempt before trying Failover • Failover/25k events ~13%
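The percentages on these two slides are simple ratios of the quoted job counts; the fragment below just reproduces them.

```python
# Reconstruction bookkeeping from the two slides above (job counts in thousands).
submitted = 41.2   # reconstruction jobs submitted
done = 27.6        # jobs that reached the Done state
full_25k = 21.2    # Done jobs that processed the full 25k events

print('Done/created    : %.0f%%' % (100 * done / submitted))   # ~67%
print('25k events/Done : %.0f%%' % (100 * full_25k / done))    # ~77%
```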
Human error at PIC: a WN with a misconfigured network (24-27 May) acted as a black hole (ticket-4386)
Reconstruction • CPU efficiency: ratio of CPU to wall-clock time of running jobs • CNAF: more jobs than cores on a WN … • IN2P3 & RAL: problems reading input data
Reconstruction • CPU efficiency: ratio of CPU to wall-clock time of running jobs • PIC: the most CPU-efficient T1
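For reference, the efficiency figure used on these two slides is just consumed CPU time divided by the wall-clock time of the running job; the numbers in the example below are made up for illustration only.

```python
# CPU efficiency as used on the two slides above: CPU time over wall-clock time.
# A job that stalls while reading its input data shows a low value.
def cpu_efficiency(cpu_seconds, wall_seconds):
    return cpu_seconds / wall_seconds

# Hypothetical job timings, for illustration only.
print(cpu_efficiency(10.5 * 3600, 12.0 * 3600))   # ~0.88: healthy reconstruction job
print(cpu_efficiency(2.0 * 3600, 12.0 * 3600))    # ~0.17: job stuck waiting for input
```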
dCache Observations • Official LCG recommendation - 1.8.0-15p3 • LHCb ran smoothly at half of the T1 dCache sites • PIC OK - version 1.8.0-12p6 (dcap) • GridKa OK - version 1.8.0-15p2 (dcap) • IN2P3 - problematic - version 1.8.0-12p6 (gsidcap) • Seg faults - needed to ship a version of GFAL with the jobs to run • Could this explain the CGSI-gSOAP problem? • NL-T1 - problematic (gsidcap) • Many versions deployed during CCRC to solve a number of issues: 1.8.0-14 → 1.8.0-15p3 → 1.8.0-15p4
Databases • Conditions DB used at CERN & Tier-1 centres • No replication tests of the conditions DB Pit ↔ Tier-0 (and beyond) • Switched to using the Conditions DB for reconstruction on 15th May • LFC • Use "streaming" to populate the read-only instances at the T1s from CERN • A problem with the CERN instance revealed that the local instances were not being used by LHCb! • Testing underway now
Stripping • Stripping runs on rDST files: 1 rDST file & the associated RAW file • Space tokens: LHCb_RAW & LHCb_rDST • DST files & ETC produced during the process are stored locally on T1D1 (additional storage class) • Space token: LHCb_M-DST • DST & ETC files are then distributed to all other computing centres on T0D1 (except CERN: T1D1) • Space tokens: LHCb_DST (LHCb_M-DST)
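As an illustration of the space-token handling described above, the sketch below shows what an upload into a specific token could look like with the present-day gfal2 Python bindings (CCRC08 used the GFAL/lcg-utils tools of 2008); the source file and destination SURL are hypothetical placeholders.

```python
import gfal2

ctx = gfal2.creat_context()

# Transfer parameters: verify the checksum and target the T1D1 master-DST token.
params = ctx.transfer_parameters()
params.checksum_check = True
params.overwrite = False
params.dst_spacetoken = 'LHCb_M-DST'

# Hypothetical source (stripping job output on the WN) and destination SURL.
src = 'file:///scratch/lhcb/stripping/00001234_00000001_1.dst'
dst = 'srm://srmlhcb.pic.es/pnfs/pic.es/data/lhcb/DST/00001234_00000001_1.dst'

ctx.filecopy(params, src, dst)
print('DST stored under the LHCb_M-DST space token')
```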
Stripping • 31.8k stripping jobs were submitted • 9.3k jobs ran to "Done" • Major issues with the LHCb book-keeping
Stripping: T1-T1 transfers • [Transfer plots for CNAF, PIC, GridKa and RAL] • Initial problems uploading to the M-DST token at PIC; caught up OK once solved • Stripping test limited to 4 T1 centres
Conclusions • Despite being the smallest LHCb Tier-1, PIC delivered the highest quality of service in CCRC08 • The following Tier-1 processes were tested: • Reception of data from CERN • Reconstruction • Stripping and shipping of DSTs to the other Tier-1s • The results at PIC were positive: • Reception of data from CERN (~5 MB/s) • Reading data from the WNs (dcap) - OK • Demonstrated replication of DSTs to the other Tier-1s at above the required rate (catch-up) • The exercise was also useful for LHCb to detect the weak points of its DIRAC Grid infrastructure • Improve the book-keeping system, log files, etc.