CCRC08 post-mortem: LHCb activities at PIC. G. Merino, PIC, 19/06/2008
LHCb Computing • Main user analysis supported at CERN + 6 Tier-1s • Tier-2s are essentially Monte Carlo production facilities
CCRC08: Planned tasks • May activities: maintain the equivalent of 1 month of data taking, assuming a 50% machine cycle efficiency • Raw data distribution from pit → T0 centre • Raw data distribution from T0 → T1 centres • Use of FTS - T1D0 • Reconstruction of raw data at CERN & T1 centres • RAW (T1D0) → rDST (T1D0) • Stripping of data at CERN & T1 centres • RAW & rDST (T1D0) → DST (T1D1) • Distribution of DST data to all other centres • Use of FTS - T0D1 (except CERN: T1D1)
Activities across the sites • Planned breakdown of processing activities (CPU needs) prior to CCRC08
Tier-0 → Tier-1 • FTS from CERN to Tier-1 centres • Transfer of RAW only occurs once the data have migrated to tape & the checksum is verified • Rate out of CERN ~35 MB/s averaged over the period • Peak rate far in excess of the requirement • In smooth running, sites matched the LHCb requirements
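To make this transfer step concrete, here is a minimal sketch of how such a RAW replication with checksum verification could be submitted through the present-day FTS3 Python bindings; CCRC08 itself used the 2008-era FTS client, and the endpoint, SURLs and retry count below are hypothetical placeholders.

```python
# Minimal FTS submission sketch (modern fts3-rest "easy" bindings; endpoint and
# SURLs are hypothetical placeholders, not the real CCRC08 paths).
import fts3.rest.client.easy as fts3

endpoint = 'https://fts3.example.cern.ch:8446'
context = fts3.Context(endpoint)

# One RAW file, CERN -> PIC (placeholder SURLs).
transfer = fts3.new_transfer(
    'srm://srm-lhcb.cern.ch/castor/cern.ch/grid/lhcb/RAW/run1234_0001.raw',
    'srm://srmlhcb.pic.es/pnfs/pic.es/data/lhcb/RAW/run1234_0001.raw')

# Ask FTS to verify the checksum on transfer and to write into the RAW space token.
job = fts3.new_job([transfer],
                   verify_checksum=True,
                   spacetoken='LHCb_RAW',
                   retry=3)

job_id = fts3.submit(context, job)
print(fts3.get_job_status(context, job_id)['job_state'])
```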
Tier-0 → Tier-1 • To first order, all transfers eventually succeeded • Plot shows the efficiency on the 1st attempt; annotated incidents: issue with UK certificates, restart of the IN2P3 SRM endpoint, CERN outage, CERN SRM endpoint problems
Reconstruction • Used SRM 2.2 • LHCb space tokens are: LHCb_RAW (T1D0); LHCb_RDST (T1D0) • Data shares need to be preserved • Important for resource planning • Input: 1 RAW file; output: 1 rDST file (1.6 GB) • Reduced the number of events per reconstruction job from 50k to 25k (job of ~12 hour duration on a 2.8 kSI2k machine) in order to fit within the available queues • Need queues at all sites that match our processing time • Alternative: reduce the file size!
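The queue-length argument behind the 50k → 25k change can be made explicit with a little arithmetic; the sketch below only reuses the numbers quoted on the slide and assumes CPU time scales linearly with the number of events and with kSI2k power.

```python
# Back-of-the-envelope job-length arithmetic (numbers from the slide above;
# linear scaling with events and kSI2k power is an assumption).
HOUR = 3600.0

events_per_job = 25000
walltime_h = 12.0        # observed duration on a 2.8 kSI2k worker node
node_power = 2.8         # kSI2k

sec_per_event = walltime_h * HOUR / events_per_job    # ~1.7 s/event on that node
ksi2k_sec_per_event = sec_per_event * node_power      # ~4.8 kSI2k.s per event

def job_hours(events, power_ksi2k):
    """Estimated wall-clock hours for a reconstruction job of `events` events."""
    return events * ksi2k_sec_per_event / power_ksi2k / HOUR

print(job_hours(50000, 2.8))   # ~24 h: too long for typical batch queue limits
print(job_hours(25000, 2.8))   # ~12 h: fits the available queues
```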
Reconstruction • After data transfer the file should be online, as the job is submitted immediately • NOTE: in principle only LHCb has this requirement of "online reconstruction" • Reco jobs will read the input data from the T1D0 write buffer • Just in case, LHCb pre-stages the files (srm_bringonline) & then checks the status of each file (srm_ls) before submitting the pilot job, via GFAL • Pre-staging should ensure access availability from the cache • Only issue at NL-T1, with the reporting of the file status
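A minimal sketch of that pre-stage-then-check pattern is given below, written against the present-day gfal2 Python bindings rather than the GFAL version used in 2008; the SURL and pin time are placeholders, and the exact return values should be checked against the gfal2 documentation.

```python
import gfal2

# Hypothetical SURL of a RAW file sitting in the T1D0 write buffer.
surl = 'srm://srmlhcb.pic.es/pnfs/pic.es/data/lhcb/raw/run1234_0001.raw'

ctx = gfal2.creat_context()

# Ask the SRM to stage the file to disk (the srm_bringonline step),
# pinning it for 12 hours; the last argument selects synchronous mode.
ctx.bring_online(surl, 12 * 3600, 3600, False)

# Check the file status and metadata (the srm_ls step) before the
# pilot job is submitted.
status = ctx.getxattr(surl, 'user.status')   # e.g. 'ONLINE' or 'NEARLINE'
info = ctx.stat(surl)

if 'ONLINE' in status:
    print('File staged (%d bytes), safe to submit the pilot job' % info.st_size)
else:
    print('File still only on tape, wait before submitting')
```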
Reconstruction • 41.2k reconstruction jobs submitted • 27.6k jobs proceeded to done state • Done/created ~67%
Reconstruction • 27.6k reconstruction jobs in done state • 21.2k jobs processed 25k events • Done/25k events ~77% • 3.0k jobs failed to upload rDST to local SE • Only 1 attempt before trying Failover • Failover/25k events ~13%
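The percentages on these two slides are simple ratios of the quoted job counts; the fragment below just reproduces them.

```python
# Reconstruction bookkeeping from the two slides above (job counts in thousands).
submitted = 41.2   # reconstruction jobs submitted
done = 27.6        # jobs that reached the Done state
full_25k = 21.2    # Done jobs that processed the full 25k events

print('Done/created    : %.0f%%' % (100 * done / submitted))   # ~67%
print('25k events/Done : %.0f%%' % (100 * full_25k / done))    # ~77%
```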
Human error at PIC: a WN with a misconfigured network (24-27 May) acted as a black hole (ticket-4386)
Reconstruction • CPU efficiency: ratio of CPU to wall-clock time of running jobs • CNAF: more jobs than cores on a WN … • IN2P3 & RAL: problems reading input data
Reconstruction • CPU efficiency: ratio of CPU to wall-clock time of running jobs • PIC: the most CPU-efficient T1
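For reference, the efficiency figure used on these two slides is just consumed CPU time divided by the wall-clock time of the running job; the numbers in the example below are made up for illustration only.

```python
# CPU efficiency as used on the two slides above: CPU time over wall-clock time.
# A job that stalls while reading its input data shows a low value.
def cpu_efficiency(cpu_seconds, wall_seconds):
    return cpu_seconds / wall_seconds

# Hypothetical job timings, for illustration only.
print(cpu_efficiency(10.5 * 3600, 12.0 * 3600))   # ~0.88: healthy reconstruction job
print(cpu_efficiency(2.0 * 3600, 12.0 * 3600))    # ~0.17: job stuck waiting for input
```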
dCache Observations • Official LCG recommendation - 1.8.0-15p3 • LHCb ran smoothly at half of the T1 dCache sites • PIC OK - version 1.8.0-12p6 (dcap) • GridKa OK - version 1.8.0-15p2 (dcap) • IN2P3 - problematic - version 1.8.0-12p6 (gsidcap) • Seg faults - needed to ship a version of GFAL with the jobs to run • Could this explain the CGSI-gSOAP problem? • NL-T1 - problematic (gsidcap) • Many versions deployed during CCRC to solve a number of issues: 1.8.0-14 → 1.8.0-15p3 → 1.8.0-15p4
Databases • Conditions DB used at CERN & Tier-1 centres • No replication tests of the conditions DB Pit ↔ Tier-0 (and beyond) • Switched to using the Conditions DB for reconstruction on 15th May • LFC • Use "streaming" to populate the read-only instances at the T1s from CERN • A problem with the CERN instance revealed that the local instances were not being used by LHCb! • Testing underway now
Stripping • Stripping runs on rDST files: 1 rDST file & the associated RAW file • Space tokens: LHCb_RAW & LHCb_rDST • DST files & ETC produced during the process are stored locally on T1D1 (additional storage class) • Space token: LHCb_M-DST • DST & ETC files are then distributed to all other computing centres on T0D1 (except CERN: T1D1) • Space tokens: LHCb_DST (LHCb_M-DST)
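As an illustration of the space-token handling described above, the sketch below shows what an upload into a specific token could look like with the present-day gfal2 Python bindings (CCRC08 used the GFAL/lcg-utils tools of 2008); the source file and destination SURL are hypothetical placeholders.

```python
import gfal2

ctx = gfal2.creat_context()

# Transfer parameters: verify the checksum and target the T1D1 master-DST token.
params = ctx.transfer_parameters()
params.checksum_check = True
params.overwrite = False
params.dst_spacetoken = 'LHCb_M-DST'

# Hypothetical source (stripping job output on the WN) and destination SURL.
src = 'file:///scratch/lhcb/stripping/00001234_00000001_1.dst'
dst = 'srm://srmlhcb.pic.es/pnfs/pic.es/data/lhcb/DST/00001234_00000001_1.dst'

ctx.filecopy(params, src, dst)
print('DST stored under the LHCb_M-DST space token')
```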
Stripping • 31.8k stripping jobs were submitted • 9.3k jobs ran to "Done" • Major issues with the LHCb book-keeping
Stripping: T1-T1 transfers • [Transfer plots for CNAF, PIC, GridKa and RAL] • Initial problems uploading to the M-DST token at PIC; caught up OK once solved • Stripping test limited to 4 T1 centres
Conclusions • Despite being the smallest LHCb Tier-1, PIC delivered the highest quality of service in CCRC08 • The following Tier-1 processes were tested: • Reception of data from CERN • Reconstruction • Stripping and shipping of DSTs to the other Tier-1s • The results at PIC were positive: • Reception of data from CERN (~5 MB/s) • Reading data from the WNs (dcap) - OK • Demonstrated replication of DSTs to the other Tier-1s at above the required rate (catch-up) • The exercise was also useful for LHCb to detect the weak points of its DIRAC Grid infrastructure • Improve the book-keeping system, log files, etc.