MC production at the CMS Tiers in 2007: CMS-wide perspectives and the Italian contribution. N. De Filippis, M. Abbrescia, G. Cuscela, G. Donvito, G. Maggi, S. My, A. Pierro, A. Pompili, plus contributions from developers (Kavka, Fanfani, Codispoti, Bacchi). Università, Politecnico and INFN Bari.
Outline
• Status of CMS Monte Carlo production:
  • organization and current requests
• Monte Carlo production in Italy:
  • Activity post-CSA06
  • Problems with sites
  • Efficiency of Italian sites
  • Reliability of sites
• CMS plans and milestones for 2007
MC production cycle
• Goal of MC production: to produce events for CMSSW validation (simulation/reconstruction) and physics studies
• Small RelVal samples upon each new CMSSW release
• PhysVal / HLT groups make requests in the form of cfg files
• Experts provide ProdAgent workflows
• Assignments to production teams are posted on the twiki: https://twiki.cern.ch/twiki/bin/view/CMS/ProdOps
• Currently 6 teams: LCG(1,2,3,5,6) and OSG
• Each team has O(10) dedicated T1/T2 sites
• When done, files are merged and injected into PhEDEx
• Too many manual steps and too many extra-production duties (e.g. monitoring and dealing with site availability and stability)
• A lot of pressure from the SDV group (P. Janot) to produce events ASAP
Current official requests (P. Kreuzer)
• After CSA06: CMSSW_1_1_1 and 1_1_2 used until Xmas
• CMSSW_1_2_0 released mid-Dec06
• Production with CMSSW_1_2_0 running continuously since Dec06:
  • PhysVal requests (10M w/o PU + 16.5M w PU)
  • HLT requests (100M w/o PU + 20M w PU x 2)
  • HLT + PU in 2 steps: GEN-SIM / DIGI-RECO
• About 20M done, many running, but very tight schedule!
• Some samples:
  – QCD di-jets (0 < pt-bin < 3.5 TeV), w & w/o PU
  – Excl. W & Z decays, Wjets (0 < pt < 1 TeV), w & w/o PU
  – Inclusive ttbar, …
• See https://twiki.cern.ch/twiki/bin/view/CMS/ProdOps120
PhysVal samples with CMSSW_1_2_0: LCG(3)
HLT samples with CMSSW_1_2_0: LCG(3)
• Once the 120 bulk production is over, a few «special» requests will be addressed:
  – Muon-enriched sample with 121: a few hundred thousand events
  – Cosmics for Tracker with 122: 2.5-5M events
Ongoing effort of OSG and LCG(1,2,5,6)
• Conclusions of P. Kreuzer:
  • with 2 new and efficient production teams on board, the remaining 120 assignments should be delivered (at least partially) within 10 days.
Post-CSA06 activity (1)
• Official CSA06 note completed
• Internal CMS note on CSA06 in the Italian Tiers completed
• CSA06 analyses completed
Post-CSA06 activity (2)
• From October 2006 until today the LCG(3) team:
  • restarted the Monte Carlo production and ran it without interruption, including during the Xmas break
  • increased the number of experts able to run ProdAgent
  • exported the monitoring tool developed at Bari to the other LCG teams
  • produced about 15M events for Physics validation and HLT studies, with and without PU, i.e. 1/3 of the entire CMS production
  • used the European LCG resources continuously, providing a large amount of feedback for solving problems at remote sites
Sites used by the LCG(3) team: CERN (used intensively before and after Xmas), Italian sites, English sites, Hungary, Taiwan, IN2P3
Ongoing effort of LCG(3): GEN-SIM and DIGI-RECO with low-luminosity pile-up are in progress
Issues about ProdAgent
• Production setup at Bari:
  • 3 instances of ProdAgent (PA) running at Bari:
    • two for FEVT and GEN-SIM production
    • one for DIGI-RECO production with PU
  • one machine for the on-line dump of the DBs
• Monitoring tool exported to other LCG teams, with positive feedback.
• Job submission is rather slow (up to 2-3 jobs/min) due to:
  • performance of the PA machines, which are two years old
  • overhead of the RBs
  • no bulk submission
• Controlling jobs that failed or aborted because of middleware problems is difficult. Killing the jobs of a given production, or those submitted to a given site, was problematic; the PA developers provided a script to do this.
• LCG(3) will gradually hand over the English CEs to LCG(6) (the English team) and IN2P3 to LCG(5) (the Belgian team) as far as debugging and intensive use are concerned.
• In the long run: BulkSubmission & Resource Monitor
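To give a feel for what a submission rate of 2-3 jobs/min means in practice, here is a minimal sketch of the arithmetic. The workload of 1000 jobs is a hypothetical figure chosen only for illustration; it does not come from the production records, and the function name is likewise made up for this example.

```python
def submission_hours(n_jobs: int, jobs_per_min: float) -> float:
    """Wall-clock hours needed to submit n_jobs at a constant rate."""
    if jobs_per_min <= 0:
        raise ValueError("jobs_per_min must be positive")
    return n_jobs / jobs_per_min / 60.0


# Hypothetical workload (assumption, not a number from the slides).
N_JOBS = 1000

# The two ends of the 2-3 jobs/min range quoted for the Bari setup.
for rate in (2.0, 3.0):
    hours = submission_hours(N_JOBS, rate)
    print(f"{rate:.0f} jobs/min -> {hours:.1f} h to submit {N_JOBS} jobs")
```

Under these assumptions the submission alone takes several hours, which is why bulk submission appears in the long-term plan.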
Problems with sites
• Most LCG(3) sites had various problems before and during the Xmas break
  • November: Bari, Pisa and Rome when restarting production; CNAF: problems with CASTOR
  • English sites and IN2P3 were active only intermittently, also during the last month. Italian sites were really efficient during the last month.
• Debugging sites is typically very painful and requires continuous interaction with the site administrators.
• Problems:
  • stage-out was the main cause of job failures
  • site validation: storage, software tag, software mount points, local copy of PU
  • grid problems: instabilities of the CE because of high load, and overload of the RBs, which caused:
    • the RB not changing the status of jobs («Waiting» status forever)
    • no chance to monitor: FWJobreport and log files lost
    • difficult/tedious for production teams to kill jobs via BOSS commands
• The debugging of sites is not a task that should be covered by production teams.
• CMS is reacting and preparing centralized tests to ensure the reliability of sites.
Efficiency of the Italian sites (last month): CNAF
Plot annotations: «No PU», «CE replaced».
Except for a few days, CNAF worked very well, ensuring high efficiency of the CMS production during the last month.
Statistics of use of CNAF (last month)
CPU hours and percentage of Tier-1 resources used by CMS:

Week            | CPU hr | %
---------------------------------
15 Jan - 21 Jan |        | 33.4%
22 Jan - 28 Jan |        | 19.0%
29 Jan - 4 Feb  |        | 24.8%
5 Feb - 11 Feb  |        | 22.4%

The percentage of use depends on the fairshare setup at CNAF.
Plot: successful jobs.
Queues always full of jobs; CMS at its maximum usage at CNAF.
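As a quick cross-check of the four weekly percentages above, the following sketch computes their plain average. It assumes the weeks should be weighted equally, which is a simplification: the CPU-hour values that would allow a proper weighting are not available here.

```python
# Weekly share of CNAF Tier-1 resources used by CMS, in percent,
# as listed in the table above.
weekly_share = {
    "15 Jan - 21 Jan": 33.4,
    "22 Jan - 28 Jan": 19.0,
    "29 Jan - 4 Feb": 24.8,
    "5 Feb - 11 Feb": 22.4,
}

# Unweighted mean over the four weeks (assumption: equal weights).
mean_share = sum(weekly_share.values()) / len(weekly_share)
print(f"Mean CMS share over {len(weekly_share)} weeks: {mean_share:.1f}%")
```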
Efficiency of the Italian sites (last month): INFN
Except for limited storage problems at Bari, Pisa and Rome, all the Italian Tier-2-like sites worked very well during the last month.
Reliability of sites: tests
Following feedback on the problems found by production operators, CMS is defining centralized tests, to be run at regular intervals, to certify sites for production and analysis. The ideas are:
• Submit a small processing job for each advertised CMSSW release at a site. This job checks that:
  • the job can be submitted to the site
  • local stage-out can be done
  • a report can be sent back via the grid middleware
  • 10-event Minimum Bias?
  • test Frontier access as well?
• Following completion of the test job, submit a read-back job, which:
  • verifies job submission
  • checks data access
  • cleans up the file, to test the cleanup procedure
• Check global DBS datasets at the site:
  • check read access to all fileblocks at the site
  • report back bad files and invalidate them in DBS
  • perhaps randomly select a dataset to test every day/week etc.
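The sequence of checks listed above can be sketched as a small staged runner. This is only an illustration of the idea, not the CMS or SAM implementation: `SiteReport`, `certify_site`, the stage names and the site name `T2_EXAMPLE` are all hypothetical, and the checks are placeholder functions standing in for real grid submissions. Stopping at the first failed stage is a design assumption, based on the read-back job depending on the processing job having staged out its output.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# A check receives a site name and returns True on success (hypothetical API).
Check = Callable[[str], bool]


@dataclass
class SiteReport:
    site: str
    results: List[Tuple[str, bool]] = field(default_factory=list)

    @property
    def certified(self) -> bool:
        # Certified only if at least one stage ran and none failed.
        return bool(self.results) and all(ok for _, ok in self.results)


def certify_site(site: str, stages: List[Tuple[str, Check]]) -> SiteReport:
    """Run the stages in order and stop at the first failure."""
    report = SiteReport(site)
    for name, check in stages:
        try:
            ok = bool(check(site))
        except Exception:
            ok = False  # a crashing check counts as a failed stage
        report.results.append((name, ok))
        if not ok:
            break
    return report


# Placeholder checks: real ones would submit jobs through the grid middleware.
def always_ok(site: str) -> bool:
    return True


def stage_out_fails(site: str) -> bool:
    return False


stages = [
    ("submit processing job", always_ok),
    ("local stage-out", stage_out_fails),
    ("report back via middleware", always_ok),
    ("read-back job: data access", always_ok),
    ("cleanup of test file", always_ok),
    ("DBS fileblock read access", always_ok),
]

report = certify_site("T2_EXAMPLE", stages)
for name, ok in report.results:
    print(f"{report.site}: {name}: {'OK' if ok else 'FAILED'}")
print("certified:", report.certified)
```

In this example the site fails at the stage-out step, so the later stages are never attempted and the site is not certified.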
Reliability of sites: SAM tests
SAM (Service Availability Monitoring).
Hopefully the human resources needed for MC production will decrease, so that fewer production teams are needed, each able to submit jobs to any site.
2007 milestones
• Finalize 120 production (aim for mid-Feb!)
  • Expecting small 12x requests (RelVal, muon-enriched HLT, …)
• 130 release (all HLT components) end Feb07
  • 130 HLT production in Mar07
• In parallel, Alpgen integration in production
  • Timescale: integrate until Mar07 + test samples, PH prod. Apr-May07
• 140 release (new geo) end Mar07
  • 140 Physics production Apr-May07 (30M / month)
• 150 release mid-May07 with improved reco algorithms (re-RECO)
• Launch CSA07 with 16x end-July07
• The contribution of Italy to the activities above, and the manpower for it, are still to be defined. In addition, CSA07 during the summer could be a real problem.
Conclusions
• The Monte Carlo production of the LCG(3) team has run continuously from the end of CSA06 until now
• About 15M events produced (1/3 of the overall CMS production)
• Italian sites have been working very well during the last month, ensuring high-efficiency production.
• Warning: keep close attention on the Italian Tiers, mainly CNAF
• Effective interaction between operators and PA developers
• The load on production operators should decrease as soon as the centralized SAM tests are running to certify sites for production.
• The Italian contribution to the activities in preparation for CSA07, and to CSA07 itself, has to be discussed.