CMS Data Distribution

Outline: General Thoughts · Data Volume · Computing Model · Data Processing Guidelines · The Plan · Primary & Secondary Datasets · Skims

Journées “Grilles France”, October 16, 2009, IPN de Lyon
Tibor Kurča, Institut de Physique Nucléaire de Lyon
General Thoughts
• Imperative: data must be accessible to all of CMS quickly → the initial data distribution model should be simple; later it will be adapted to reality
• How to use the data most efficiently?
  - large data volumes, mostly background
  - computing resources distributed at an unprecedented level
  - very broad physics program with diverse needs
• Reduce the amount of data to be processed → reduce the strain on computing resources
• A core set of data has to be small enough yet representative for a given analysis
  - most analyses don’t need access to all data → split the data into datasets
  - easily manageable day-to-day jobs
  - enables full tests of analysis components before running on the full statistics
  - allows prioritisation of data analysis
CMS Data Volume
• Estimated rates / data volumes for 2009/10: 70 days of running, 300 Hz rate in the physics stream, 2.3×10⁹ events, assuming 26% mean overlap
• 3.3 PB RAW data (1.5 MB/evt): detector data + L1 and HLT results after online formatting
• 1.1 PB RECO (0.5 MB/evt): reconstructed objects with their associated hits
• 220 TB AOD (0.1 MB/evt): main analysis format: clusters, tracks, particle ID
• Multiple re-reco passes
• Data placement:
  - RAW/RECO: one copy across all T1s, disk1tape1
  - Sim RAW/RECO: one copy across all T1s, on tape with a 10% disk cache
  - AOD: one copy at each T1, disk1tape1
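A quick back-of-the-envelope check of these volumes (a sketch only; it assumes 70 live days at 300 Hz and that the 26% mean overlap inflates the stored event count):

```python
# Back-of-the-envelope check of the 2009/10 data-volume estimates.
seconds = 70 * 86400          # 70 days of running
rate_hz = 300                 # physics-stream rate
overlap = 0.26                # mean overlap between primary datasets

events = rate_hz * seconds * (1 + overlap)   # ~2.3e9 stored events

for name, size_mb in [("RAW", 1.5), ("RECO", 0.5), ("AOD", 0.1)]:
    volume_pb = events * size_mb / 1e9       # 1 PB = 1e9 MB
    print(f"{name}: {volume_pb:.2f} PB")
# RAW ~3.4 PB, RECO ~1.1 PB, AOD ~0.23 PB -- consistent with the slide's numbers.
```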
Tier-0/1/2 Roles
• T0: prompt reconstruction (24 h/day), FEVT storage, data distribution
• T1: data storage, processing (re-reco, skims, AOD extraction), raw-data access, Tier-2 support, data serving to T2s
• T2: analysis, MC production, specialised support tasks, local + group use
T1/T2 Associations
• Associated Tier-1: hosts MC production + serves as the reference for AOD serving
  - full AOD sample at the Tier-1 (after T1→T1 transfers of re-recoed AODs)
• Stream “allocation” ~ available disk storage at the centre
T2 – PAG/POG/DPG Associations (association table shown on the original slide)
Data Processing Guidelines
• We aim for prompt reconstruction and analysis → reduce backlogs
• We need the possibility of prioritisation
  - cope with backlogs without delaying critical data
  - prompt calibration using low-latency data
• We are using data streaming based on trigger bits → need to understand the trigger and event selection
  - early event classification allows later prioritisation (see the sketch after this list)
  1. Express stream: ‘hot’ physics events, calibration events, Data Quality Monitoring (DQM) events
  2. Physics stream: propose O(7) ‘primary datasets’, immutable but possibly overlapping
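A minimal sketch of the idea of streaming on trigger bits (the bit names here are invented for illustration; they are not the actual CMS trigger table):

```python
# Toy event-streaming logic: route events to streams from their trigger bits.
EXPRESS_BITS = {"HLT_HotPhysics", "HLT_Calibration", "HLT_DQM"}   # hypothetical names

def streams_for(trigger_bits):
    """Return the set of streams an event belongs to (streams may overlap)."""
    out = set()
    if trigger_bits & EXPRESS_BITS:          # any express-worthy trigger fired
        out.add("Express")
    if trigger_bits - EXPRESS_BITS:          # any ordinary physics trigger fired
        out.add("Physics")
    return out

print(streams_for({"HLT_Jet50", "HLT_DQM"}))   # {'Physics', 'Express'}
```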
Commissioning: Express Streams → T1s (flow diagram shown on the original slide)
Trigger – Dataset Connection
• The goal: create datasets based on triggers for specific physics objects
  → datasets distributed to central storage at T2s
  → run skims on those datasets (or a skim of a skim …)
• Purpose and benefits of such datasets:
  - group similar events together to facilitate data analyses
  - data separation (streaming) is based on trigger bits → no need for additional processing
  - triggers are persistent and datasets resilient
  - recognized drawback: events are not selected with optimal reco & AlCa
• Obvious consequences (illustrated in the sketch below):
  - every trigger has to be assigned to at least one dataset
  - increasing dataset overlap → increasing storage requirements
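The two consequences can be made concrete with a toy trigger-to-dataset map (all trigger and dataset names below are invented for illustration, not the real CMS tables):

```python
# Toy trigger -> primary-dataset assignment.
DATASET_TRIGGERS = {
    "Jet":      {"HLT_Jet50", "HLT_DiJetAve15"},
    "Muon":     {"HLT_Mu9", "HLT_Mu15"},
    "MultiJet": {"HLT_QuadJet30", "HLT_Jet50"},   # overlap with "Jet" is allowed
}

def unassigned(all_triggers):
    """Triggers not assigned to any PD -- this set must be empty."""
    assigned = set().union(*DATASET_TRIGGERS.values())
    return all_triggers - assigned

def datasets_for(fired):
    """Datasets that receive an event, given its fired trigger bits.
    An event can land in several datasets -> the storage cost of overlap."""
    return {ds for ds, bits in DATASET_TRIGGERS.items() if bits & fired}

print(datasets_for({"HLT_Jet50"}))   # {'Jet', 'MultiJet'}: the event is stored twice
```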
The Plan
1. Primary Datasets (PD)
   - immutable, based on triggers, split at T0
   - RAW/RECO/AOD PDs distributed (limited) to central storage at T1s
2. Secondary Datasets (SD)
   - produced from PDs at T1s by dataOps and distributed to central storage at T2s
   - RECO or AOD format, trigger-based as well
3. Central skims
   - produced at T1s by dataOps
   - very few initially, for key applications that cannot be served otherwise
4. Group skims
   - run on datasets stored at T2s; flexibility in the choice of event content, but provenance must be maintained
   - approved by group conveners; a tool allowing them to be registered in the global DBS is expected → tested in the October exercise!
   - subscribable to group space
5. User analysis skims
   - a dataset that is no more than one skim away from provenance
From PD to SD & Skims
• The goal: provide easy access to interesting data
• A PD could be quite large → reduce the amount of data to easily manageable sizes
• Secondary Datasets (SD)
  - each SD is a centrally produced subset of one PD, selected using trigger info … not more than ~30% of its events
  - RECO format initially, later AOD
• Central skims
  - produced centrally at T1s, 1-2 per PD → 10% of the most interesting events, or a uniform subset (prescale 10)
• Group skims
  - designed by groups, run on any data at a T2
  - could also be run on a PD if it is at the T2
  - ideally central skims as input → 1% of the PD … manageable day-to-day
• User skims: final skims by individual users for their needs
• Procedures tested during the October exercise (the reduction chain is sketched below)
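Putting the stated fractions together gives a feel for the reduction chain (a sketch using the numbers quoted above; the PD size is a hypothetical value and real fractions vary per dataset):

```python
# Rough reduction chain from a primary dataset to day-to-day analysis samples.
pd_events = 300e6                  # hypothetical primary-dataset size

sd_events    = pd_events * 0.30   # SD: at most ~30% of the PD
central_skim = pd_events * 0.10   # central skim: ~10%, or a prescale-10 subset
group_skim   = pd_events * 0.01   # group skim: ~1% of the PD

print(f"SD: {sd_events:.0f}, central: {central_skim:.0f}, group: {group_skim:.0f}")
# A 300M-event PD -> 90M-event SD -> 30M central skim -> 3M group skim
```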
Physics Objects (example only)
• Physics objects are not well balanced in size → combine or split them for balance … keep overlaps low
Object DS → Primary DS
• For rate equalisation:
  - some are too big → split
  - some are too small → merge
• From object datasets to PDs:
  - splitting based only on trigger bits
  - merge correlated triggers
  - keep unprescaled triggers together
  - allow duplication of triggers if meaningful from a physics point of view
• The example here is for L >> 8E29; for L = 8E29 only 8 PDs are needed (J. Incandella)
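The merge side of rate equalisation can be pictured as a simple greedy pass over object-dataset rates (an illustrative toy only; the rates and floor are invented, and the real grouping also weighs trigger correlations, which this sketch ignores):

```python
# Toy rate equalisation: merge object datasets whose rate is below a floor.
rates_hz = {"Jet": 25.0, "Muon": 12.0, "Bjet": 0.4, "LeptonX": 0.6, "MultiJet": 8.0}
FLOOR = 1.0   # hypothetical minimum rate for a standalone primary dataset

merged = dict(rates_hz)
for small in [n for n, r in rates_hz.items() if r < FLOOR]:
    # Fold each too-small dataset into the smallest surviving one.
    target = min((n for n in merged if n != small), key=merged.get)
    merged[target] += merged.pop(small)

print(merged)   # Bjet and LeptonX have been absorbed into other datasets
```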
Object DS → Primary DS (2)
• Bjet and Lepton+X datasets have very small rates
  - Bjet: merged with the MultiJet dataset
  - Lep+X: these are combined-object triggers → split & absorbed into the 2 relevant lepton datasets → the same trigger appears in 2 datasets
1st Iteration of Secondary Datasets (8E29)
• For each 8E29 Primary Dataset (PD), the Secondary Datasets (SD) are specified by:
  - dataset/skim name
  - data format (RECO, AOD, reduced AOD, etc.)
  - prescale w.r.t. the PD data
  - rate in Hz
  - fraction of events w.r.t. the total parent PD
  - fraction of disk w.r.t. the PD size (RECO assumed)
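Those bookkeeping fields map naturally onto a small record (a sketch; the field names are mine and the example values are placeholders, not real SD parameters):

```python
from dataclasses import dataclass

@dataclass
class SecondaryDataset:
    """Bookkeeping fields that define an SD, per the list above."""
    name: str              # dataset / skim name
    data_format: str       # RECO, AOD, reduced AOD, ...
    prescale: int          # prescale w.r.t. the parent PD
    rate_hz: float         # event rate in Hz
    event_fraction: float  # fraction of the parent PD's events
    disk_fraction: float   # fraction of the parent PD's disk size (RECO assumed)

# Placeholder values for illustration only:
sd = SecondaryDataset("DiJetAve15", "reduced RECO", 1, 4.2, 0.15, 0.05)
```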
Secondary Datasets, L=8E29: Jet SDs
• Low-pT jet triggers are already prescaled at the HLT → further prescaled at SD level
• Full DiJetAve15 statistics needed for JEC → to be stored in reduced format
• Keep 50 GeV as the lowest unprescaled single-jet threshold
• Full statistics for DiJetAve30 again needed for JEC … reduced event size
• Keep 35 GeV as the lowest unprescaled MET trigger
• Also keep events from the Btag and HSCP triggers
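A prescale at SD level is simply "keep one event in every N that pass"; a minimal sketch:

```python
def prescaled(events, n):
    """Yield one event in every n (i.e. apply a prescale of n)."""
    for i, evt in enumerate(events):
        if i % n == 0:           # keep the 1st, (n+1)-th, (2n+1)-th, ...
            yield evt

# 1000 triggered events with a prescale of 10 -> 100 survive.
print(sum(1 for _ in prescaled(range(1000), 10)))   # 100
```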
Summary
• The goal of the CMS data distribution model is to make data access easier, more reliable and more efficient
• Many components are already in place:
  - the tier structure
  - T2 associations and data transfer tools
  - trigger tables
  - Primary Datasets and the 1st iteration of trigger-based SDs
• PDs & SDs in standard formats will be distributed to T2s (AOD, RECO if possible)
• Central and group skims run on PDs accessible at T2s and are more manageable
• No restrictions on group and user skims … even if special data formats are required
• A post-mortem analysis of the October exercise, where all of this was tested before real data taking, is in progress