CMS data analysis N. De Filippis - LLR-Ecole Polytechnique
Outline
• CMS computing/analysis model
• Physics analysis tools and resources:
  • at the CAF: local analysis
  • at Tier-2s: distributed analysis, local analysis
  • at Tier-3s: interactive analysis
• Experience of Higgs analysis at CC-IN2P3
CMS computing model
• Online system: filters the raw data; recorded data are passed to the offline farm
• Tier 0 (CERN Computer Centre): data reconstruction, data recording, distribution to Tier-1
• Tier 1 (regional centres: France, Italy, Fermilab, ...): permanent data storage and management, skimming analysis, re-processing, simulation, regional support
• Tier 2 (Tier-2 centres): well-managed disk storage, massive simulation, official physics group analysis, end-user analysis
• Tier 3 (institutes, workstations): end-user analysis
CMS dataflow
• Data are collected, filtered online, stored, reconstructed with HLT information at Tier-0 (CERN Computer Centre) and registered in the Data Bookkeeping Service (DBS) at CERN.
• RECO data are moved from Tier-0 to the Tier-1 centres (IN2P3, CNAF, FermiLab, ...) via PhEDEx.
• Data are filtered (into a reduced AOD format) at Tier-1 according to the physics analysis group selection for skimming; skim outputs are shipped via PhEDEx to the Tier-2 sites (GRIF/LLR, LNL, Wisconsin, Rome, ...).
• Data are analysed at Tier-2 and Tier-3 by physics group users and published in a DBS instance dedicated to the physics group.
CMS analysis model (CTDR)
• In the computing model CMS stated that the analysis resources are:
• Central Analysis Facility (CAF) at CERN:
  • intended for specific varieties of analysis requiring low-latency access to the data (data are on a CASTOR disk pool)
  • the CAF is a very large resource in terms of absolute capacity
  • policy decisions govern the users and the use of the resource
  • processing at the CAF: calibration and alignment, a few physics analyses
• Tier-2 resources:
  • two groups of analysis users to be supported at Tier-2:
    1. support for analysis groups / specific analyses
    2. support for local communities
  • 1. is addressed by using the distributed analysis system
  • 2. is addressed via local batch queues/access and the distributed systems
  • the association between Tier-2 sites and analysis groups is under definition
• Tier-3 resources: local and interactive analysis, private resources
Tier-2 resources
• A nominal Tier-2 should provide:
  • 0.9 MSI2k of computing power, 200 TB of disk and 1 Gb/s of WAN
  • that means several hundred batch slots and disk for large skim samples
• A reasonably large fraction of the nominal resources is devoted to analysis group activities and the remainder is assignable to the local community
• Proposal under discussion: ~50% of the nominal processing resources for simulation, ~40% for analysis groups (specific and well-organized analyses), and the remainder for local users
• 10% of the local storage for simulation, 60% for analysis groups and 30% for local communities
• For 2008, a lower guideline of 60 TB of disk and 0.4 MSI2k for analysis groups
Distributed analysis in Tier-2
• Access to the resources:
  • Mandatory: a grid certificate, possibly with a specific role to set priorities or to comply with policies
  • No direct login access to the site is needed
  • Mandatory: a user interface environment
  • Registration in a virtual organization (VO): CMS
• Also needed:
  • data discovery
  • a job analysis builder (CRAB)
  • storage allocation for physics group and local user data
CMS Data Discovery
• The data discovery page gives the following information:
  • what data are available and where they are stored (storage element)
  • the list of file LFNs, the number of events included, size, checksum
  • run information, parameters used, provenance, parentage
  • analysis datasets created from an initial sample
• https://cmsweb.cern.ch/dbs_discovery/_navigator?userMode=user
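For orientation, an example of the kind of query one issues on the discovery page; the DBS query-language syntax and the dataset pattern below are assumptions to be checked against the page itself.

  # Sketch of a DBS discovery query; the "find ... where ..." syntax and the
  # dataset name pattern are assumptions, not copied from the discovery page.
  query = "find dataset, site where dataset like */Higgs*/*RECO*"

  # The query is pasted into the discovery page (or passed to a DBS client);
  # it is shown here simply as a string for illustration.
  print("Example DBS discovery query:")
  print("  " + query)
  print("Discovery page: https://cmsweb.cern.ch/dbs_discovery/")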
CRAB (CMS Remote Analysis Builder)
• CRAB is a user-friendly Python tool for:
  • preparation, splitting and submission of CMS analysis jobs
  • analysing data available at remote sites by using the Grid infrastructure
• Features:
  • user settings provided via a configuration file (dataset, data type)
  • data discovery by querying DBS for remote sites
  • job splitting performed per number of events
  • Grid details mostly hidden from the user
  • status monitoring, job tracking and output management of the submitted jobs
• Use cases supported: official and private code analysis of published remote data
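As an illustration of the "configuration file" point above, a minimal sketch that writes a CRAB 2-style crab.cfg; the section and parameter names (datasetpath, pset, events_per_job, ...) follow CRAB 2 conventions as I recall them and should be checked against the CRAB documentation for the installed version; the dataset path and pset name are placeholders.

  # Sketch: write a minimal CRAB 2-style crab.cfg.
  # Parameter names are quoted from memory; dataset and pset are placeholders.
  crab_cfg = """\
  [CRAB]
  jobtype   = cmssw
  scheduler = glite

  [CMSSW]
  datasetpath            = /Higgs/SomeSkim/RECO
  pset                   = analysis_cfg.py
  total_number_of_events = -1
  events_per_job         = 10000

  [USER]
  return_data = 0
  copy_data   = 1
  """

  with open('crab.cfg', 'w') as f:
      f.write(crab_cfg)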
The end-user analysis workflow
• The user provides, on the user interface (UI):
  • the dataset (runs, #events, ...) taken from DBS
  • private CMSSW code
• CRAB discovers the data and the sites hosting them by querying the dataset catalogue (DBS)
• CRAB prepares, splits and submits jobs to the Resource Broker / Workload Management System (RB/WMS)
• The RB/WMS sends the jobs to Computing Elements at sites hosting the data, provided the CMSSW software is installed there; the jobs run on the worker nodes
• CRAB automatically retrieves the logs and the output files of the jobs; it is possible to store output files on a Storage Element (best solution)
• CRAB can publish the output of the jobs in DBS to make the output data officially available for subsequent processing
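To make the sequence above concrete, a minimal sketch of the usual CRAB 2 command flow, driven from Python here only for illustration; the -create/-submit/-status/-getoutput options are quoted from memory and should be checked against the installed CRAB version.

  # Sketch of the typical CRAB 2 command sequence on the user interface.
  import subprocess

  def crab(*args):
      """Run one crab command and stop on failure."""
      cmd = ['crab'] + list(args)
      print('>>> ' + ' '.join(cmd))
      subprocess.check_call(cmd)

  crab('-create')     # read crab.cfg, query DBS, split into jobs
  crab('-submit')     # submit the jobs to the RB/WMS
  crab('-status')     # track the jobs on the Grid
  crab('-getoutput')  # retrieve logs and output files once jobs are done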
Monitor of analysis jobs (last month)
• Job distribution: ~25% CERN (reduced by 5%), ~20% FNAL, ~10% FZK, ~7% IN2P3
• (Plots: users, datasets accessed)
User job statistics
• All Tiers: ~400k jobs: 53% successful, 20% failed, 27% unknown
• Tier-2 only (32%): ~130k jobs: 49% successful, 20% failed, 31% unknown
• Errors breakdown:
  • 20% output file not found on the WN
    • misleading error: CMSSW 16x not reporting failures
  • 17% CMSSW did not produce a valid FWJobReport
    • same as above, but CRAB 210 fixes it to report a less misleading message
  • 20% termination + killed
    • likely hitting the batch queue's limits
  • 10% SegV
  • 4% failure staging out remotely (it was 27% the month before CMS week)
• Overall consistent with the 40% failure rate of the month before CMS week
Storage management at Tier-2
• CMS users/physics groups can produce and store large quantities of data:
  • due to re-processing of skims
  • due to common preselection processing and iterated reprocessing
  • due to private productions (especially fast simulation)
  • due to end-user analysis ROOT-ples
  -> care is needed in managing the utilization of the storage
• Tier-2 disk-based storage will be of fixed size, so care is needed to ensure that every site can meet its obligations with respect to the collaboration and that analysis users are treated fairly:
  • user quotas and policies are under definition
  • physics analysis data namespace (official physics groups)
  • user data namespace (private users)
Local analysis in Tier-2
• Goal: to run many iterations of the analysis and derive the results in a few hours
• Direct access to the site (login), but not more than about 40 users, to avoid congestion
• Direct batch system access: PBS, BQS, LSF
  • options to be supplied for an efficient use of resources, e.g.:
    qsub -eo -q T -l platform=LINUX32+LINUX,hpss,u_sps_cmsf,M=2048MB,T=200000,spool=50MB <job.sh>
• Storage area for temporary files: gpfs, hpss
• Support for dcap and xrootd for direct access to ROOT files
• Eventually PROOF could be a solution for massive interactive analysis, but I did not test it directly
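A minimal sketch of what "direct access to ROOT files via dcap/xrootd" looks like from PyROOT; the server names, file paths and tree name are placeholders, only the TFile.Open call is standard ROOT.

  # Sketch: open a remote ROOT file directly through xrootd (or dcap) from PyROOT.
  import ROOT

  # xrootd access; redirector and path are placeholders
  f = ROOT.TFile.Open("root://xrootd.example.site//store/user/someone/ntuple.root")
  # dcap access would look like:
  #   f = ROOT.TFile.Open("dcap://dcache.example.site/pnfs/.../ntuple.root")

  if f and not f.IsZombie():
      tree = f.Get("Events")              # tree name is analysis-dependent
      print("Entries: %d" % tree.GetEntries())
      f.Close()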
Interactive analysis in Tier-3
• Goal: fast feedback on changes in the code; iterations of the analysis supposed to run on a very reduced format; results should be derived in no more than a few hours
• Login access needed; about 10 users at most
• An interactive cluster of machines has to be set up; CMSSW requires a lot of RAM for complex processing (1-8 GB depending on the number of concurrent users and applications)
• Disk pools accessible directly (but also via xrootd and dcap); fast and possibly dedicated access required; mostly private resources; user ROOT-ples only
• Eventually PROOF could be a solution for massive interactive analysis
Iterations of end-user analysis
• Reasonable hypotheses:
  • AOD-SIM of CMS (250 kB/event) and RECO-SIM (600 kB/event) for physics studies
  • about 5 TB of simulated data per analysis (7.5 M events including fast simulation)
  • 5 concurrent analyses of the same physics group to be run on signal and background samples in two weeks, with at least 3 iterations
  • assuming at least 2 physics groups (Higgs, EWK) running at the same time
  • power needed to analyse an event: ~0.5 kSI2K.s for AOD-SIM and 1 kSI2K.s for RECO-SIM
  -> a CPU power of about ~200 kSI2K (15 WN)
• Access to a big fraction of control samples -> ~1.5 TB of additional disk (2.5 M events) and ~65 kSI2k of CPU power (5 WN) to analyse the samples in three days
• In total, about 20 WN for physics studies
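A back-of-the-envelope check of the ~200 kSI2K figure, under one possible reading of the hypotheses above; the exact assumptions behind the quoted number may differ.

  # Back-of-the-envelope estimate of the CPU power quoted above.
  # Assumptions (one possible reading): 2 physics groups, 5 concurrent analyses
  # each, 7.5M events per analysis, 3 iterations, 1 kSI2K.s per event (RECO-SIM),
  # everything completed within two weeks of wall-clock time.
  groups       = 2
  analyses     = 5
  events       = 7.5e6
  iterations   = 3
  cost_per_evt = 1.0                 # kSI2K.s per event (RECO-SIM)
  wall_time    = 14 * 24 * 3600.0    # two weeks, in seconds

  total_work = groups * analyses * events * iterations * cost_per_evt  # kSI2K.s
  cpu_power  = total_work / wall_time                                  # kSI2K
  print("Required CPU power: ~%.0f kSI2K" % cpu_power)
  # prints ~186 kSI2K, of the order of the ~200 kSI2K quoted above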
Discussion items for CC-IN2P3
• How can we make sure the facility will not be impacted by centrally submitted jobs overloading the Tier-1 part?
  • dCache resources are shared between Tier-1 and Tier-2
  • this brings a lot of advantages in terms of data placement and access
  • but data access is also affected by the overload of the official jobs at Tier-1 -> it is a delicate balance
  • increasing the number of dCache servers
  • duplicating the most frequently accessed data
  • monitoring facilities are fundamental
• How to guarantee free CPUs / priority:
  • priority of official jobs set by VO roles
  • dynamic share to ensure 50% simulation, 40% official analysis, 10% local access
  • local priorities agreed with the site administrators and the physics conveners of the groups related to the Tier-2
  • a bunch of CPUs only accessible by local users; their number has to be sized according to the number of users
...a few words about the roadmap of the Higgs WG
• Evolution towards "official Higgs WG software", providing the flexibility and adaptability required at analysis level for fast and efficient implementation of new ideas:
  • provide common basic tools (and reference tunes) to do Higgs analyses
  • provide basic code for the official Higgs software in CVS (hierarchical structure, modularity enforced)
  • provide start-up software code for Higgs analysis users (make life easier)
  • provide guidelines for benchmark Higgs analysis software development
  • provide control histograms to validate the results (e.g. pre-selection)
  • provide user support and profit from central user support
  • provide support for distributed analysis
• A more coherent and organized effort for the ongoing analyses
HiggsAnalysis package schema
• Code in CVS, organized in packages with a hierarchical structure
• Subpackages of HiggsAnalysis include: Configuration, Skimming, HiggsToZZ4Leptons (HiggsToZZ4e, HiggsToZZ4m, HiggsToZZ2e2m), HiggsToWW2Leptons (HiggsToWW2e, HiggsToWW2m, HiggsToWWem), HiggsTo2photons, SusyHiggs (SUSYHiggsTo2m), VBFHiggsTo2Tau (VBFHiggsTo2TauLJet, VBFHiggsTo2TauLL), VBFHiggsToWW (VBFHiggsToWWto2l2nu, VBFHiggsToWWtolnu2j), HeavyChHiggsToTauNu, NMSSMqqH1ToA1A1To4Tau
Analysis code modelization
• Physics objects provided through the POGs: electrons, muons, jets, photons, taus, tracks, MET, particle flow
• Common tools already provided, or to be provided centrally:
  • combinatorics (PhysicsTools in CMSSW)
  • vertex fitting (PhysicsTools in CMSSW)
  • boosting particles (PhysicsTools in CMSSW)
  • dumping MC truth (PhysicsTools in CMSSW)
  • matching reco to MC truth (PhysicsTools in CMSSW)
  • kinematic fit (PhysicsTools in CMSSW)
  • selection of particles based on pT (PhysicsTools in CMSSW)
  • isolation, electron-ID and muon-ID algorithms provided by the POGs
  • optimization machinery (Garcon, N95, neural networks, multivariate analysis)
  • significance and CL calculation
  • systematics evaluation tools for objects (mostly through the POGs)
• Official analysis: a clear combination of framework modules, PhysicsTools and, at the end, bare ROOT. C++ classes used in ROOT can be interfaced to CMSSW if needed and of general interest.
Analysis operation mode
• Different options are possible:
  • Full or partial CMSSW: at this time we decided on full CMSSW up to the preselection level, for official and common processing
  • FWLite: reading events in their EDM/CMSSW format and then using EDM trees/collections to iterate the analysis
  • Bare ROOT: hopefully only for the last step of the analysis (no provenance information can be retrieved), for fast iterations
• The best approach is a combination of the three modes according to the step of the analysis
• Example: the HZZ->2e2mu analysis uses full CMSSW plus a ROOT tree, filled with the relevant information and used for fast iterations
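As an illustration of the FWLite mode, a minimal sketch of a Python FWLite event loop; the file name and collection label are placeholders, and the exact bootstrap (library loading) depends on the CMSSW release in use.

  # Sketch of an FWLite-style event loop: read EDM files directly, without
  # running the full framework.  File name and collection label are placeholders.
  import ROOT
  # Depending on the release, the FWLite libraries may need to be loaded first:
  ROOT.gSystem.Load("libFWCoreFWLite")
  ROOT.AutoLibraryLoader.enable()

  from DataFormats.FWLite import Events, Handle

  events = Events("skim_output.root")            # placeholder EDM file
  muons  = Handle("std::vector<reco::Muon>")

  for event in events:
      event.getByLabel("muons", muons)           # placeholder collection label
      for mu in muons.product():
          if mu.pt() > 5.0:                      # simple kinematic requirement
              pass                               # fill histograms / ROOT trees here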
Official analysis workflows run at CC-IN2P3
• HZZ analyses run with common software and tools
• Generator filter:
  • filter at generator level for the 2e2m, 4e and 4m final states, with an acceptance cut (|η| < 2.5) on the leptons from the Z; three sets of output samples
• Skim:
  • HLT selection based on the single-muon, single-electron, double-muon and double-electron triggers
  • Skim 4 Leptons: at least 2 leptons with pT > 10 GeV and one with pT > 5 GeV
• Preselection run for 2e2mu:
  • removal of duplicate electrons sharing the same supercluster/track
  • "loose" electron ID
  • at least 2 electrons with pT > 5 GeV/c irrespective of the charge
  • at least 2 muons with pT > 5 GeV/c irrespective of the charge
  • at least 1 Z->ee candidate with mll > 12 GeV/c2
  • at least 1 Z->mm candidate with mll > 12 GeV/c2
  • at least one l+l-l+l- (H) candidate with mllll > 100 GeV/c2
• CRAB_2_1_0 version used; output files published in the local DBS instance "local09"; output files stored at Lyon in /store/users/nicola/
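To illustrate how preselection requirements of this kind translate into CMSSW configuration, a minimal sketch using the generic candidate selector/filter modules; the module choices, input labels, file names and path structure are illustrative and do not reproduce the official HiggsAnalysis configuration.

  # Sketch: two of the preselection requirements above (at least 2 muons with
  # pT > 5 GeV/c) expressed with generic candidate selector/filter modules.
  import FWCore.ParameterSet.Config as cms

  process = cms.Process("PRESEL")

  process.source = cms.Source("PoolSource",
      fileNames = cms.untracked.vstring("file:skim_output.root")   # placeholder input
  )
  process.maxEvents = cms.untracked.PSet(input = cms.untracked.int32(100))

  # Select muons above the pT threshold (generic candidate selector)
  process.goodMuons = cms.EDFilter("CandViewSelector",
      src = cms.InputTag("muons"),          # placeholder input collection
      cut = cms.string("pt > 5.0")          # pT > 5 GeV/c, irrespective of the charge
  )

  # Require at least two selected muons in the event
  process.muonCountFilter = cms.EDFilter("CandViewCountFilter",
      src = cms.InputTag("goodMuons"),
      minNumber = cms.uint32(2)
  )

  process.preselection = cms.Path(process.goodMuons + process.muonCountFilter)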
…additional needs
• A DBS instance of the Higgs working group, maintained by the DBS and CMS Facilities operation teams
  • Examples of specific processing chains that require a dedicated registration in the physics group DBS instance:
    • prefiltering of samples
    • iterations of the skim processing
    • iterations of improved or new pre-selection strategies
    • iterations of the selection chain for each physics channel
    • common processing for homogeneous analyses up to a given step
  -> no overloaded access to the central database
  • DBS instance obtained, working and in production
• A CRAB server at CC-IN2P3 specific for Higgs analysis users
  • waiting for a stable CRAB server version; in contact with Claude and Tibor
Data access / job submission: the last 15 days of my processing
• Data access:
  • dCache, HPSS and GPFS all used to support the physics analysis use case
  • no processing stops, although the load was not so heavy; sometimes overload is caused by centrally submitted jobs
  • temporary files written in /sps and deleted from time to time
  • important files from re-processing written in the dCache /store/users area, but this should be in /store/results/Higgs_group
• Job submission: via CRAB (Grid tools), via direct access to the batch system, and interactive -> all working efficiently
• Prioritization of jobs and users: the issue is addressed by the CMS management via VO roles; a solution is under investigation
• Priority of local users can be agreed with the administrators and the physics groups
Conclusions
• CMS is currently validating the analysis model, resolving the ambiguities, defining policies and setting up a coherent strategy to use distributed and local resources for analysis purposes
• CC-IN2P3 is advanced and the data access and job speed are efficient, although the problem of facilities shared between Tier-1 and Tier-2 can impact the latency of analysis results and hence the productivity of physicists in France
• Scalability problems seem to be addressed at CC-IN2P3; the solution is a delicate balance and tuning of a few parameters of the dCache configuration and of the batch system
• The CMS support at CC-IN2P3 is efficient
• My feeling is that most analysis users are not aware of how to use tools and resources in an efficient way -> CMS is organizing tutorials to make life easier for the physicists
Ex: CAF workflow/groups
• Handling of CAF users under CMS control
• CAF groups able to add/remove users
• Aim for about 2 users per sub-group