CMS Data Distribution



  1. CMS Data Distribution
• General Thoughts
• Data Volume
• Computing Model
• Data Processing Guidelines
• The Plan
• Primary & Secondary Datasets
• Skims
Journées “Grilles France”, October 16, 2009, IPN de Lyon
Tibor Kurča, Institut de Physique Nucléaire de Lyon

  2. General Thoughts
• Imperative: data must be accessible to all of CMS quickly
→ the initial data distribution model should be simple
→ it will later be adapted to reality
• How to use the data most efficiently?
- large data volumes, mostly background
- computing resources distributed at an unprecedented level
- a very broad physics programme with diverse needs
• Reduce the amount of data to be processed → reduce the strain on computing resources
• A core set of data has to be small enough and very representative for a given analysis; most analyses don’t need access to all data
→ split the data into easily manageable datasets for day-to-day jobs
→ enable full tests of analysis components before running on the full statistics
→ allow prioritisation of data analysis

  3. CMS Data Volume
• Estimated rates / data volumes for 2009/10:
- 70 days of running
- 300 Hz in the physics stream → 2.3 × 10^9 events
- assume 26% mean overlap between datasets
• 3.3 PB RAW data (1.5 MB/evt): detector data + L1 and HLT results after online formatting
• 1.1 PB RECO (0.5 MB/evt): reconstructed objects with their associated hits
• 220 TB AOD (0.1 MB/evt): the main analysis format: clusters, tracks, particle ID
• Multiple re-reco passes
• Data placement:
- RAW/RECO: one copy spread across all T1s, disk1tape1
- simulated RAW/RECO: one copy spread across all T1s, on tape with a 10% disk cache
- AOD: one copy at each T1, disk1tape1
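
The totals above follow from simple arithmetic. Here is a minimal back-of-envelope check in Python, assuming the 26% overlap is applied on top of the raw trigger rate to reproduce the quoted 2.3 × 10^9 events (all inputs are the slide's numbers, none are official figures):

```python
# Back-of-envelope check of the 2009/10 volume estimates quoted above.
SECONDS_PER_DAY = 86_400
days, rate_hz = 70, 300        # 70 days of running, 300 Hz physics stream
overlap = 0.26                 # assumed mean overlap between primary datasets

events = days * SECONDS_PER_DAY * rate_hz * (1 + overlap)
print(f"events: {events:.2e}")                 # ~2.3e9, as quoted

for fmt, size_mb in [("RAW", 1.5), ("RECO", 0.5), ("AOD", 0.1)]:
    volume_pb = events * size_mb / 1e9         # 1e9 MB per PB (decimal units)
    print(f"{fmt}: {volume_pb:.2f} PB")        # ~3.4 PB, ~1.1 PB, ~0.23 PB
```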

  4. Tier-0 / Tier-1 / Tier-2 Roles
• T0: prompt reconstruction (24/7), FEVT storage, data distribution
• T1: data storage, processing (re-reco, skims, AOD extraction), raw data access, Tier-2 support → data serving to T2s
• T2: analysis, MC production, specialised support tasks, local + group use
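
As a plain illustration of this division of labour (not a real workflow-management configuration), the roles can be written as a lookup table:

```python
# The tier responsibilities above, as a purely illustrative lookup table.
TIER_TASKS = {
    "T0": ["prompt reconstruction", "FEVT storage", "data distribution"],
    "T1": ["data storage", "re-reco", "skims", "AOD extraction",
           "raw data access", "serving data to T2s"],
    "T2": ["analysis", "MC production", "specialised support", "local + group use"],
}

def tiers_for(task: str) -> list[str]:
    """Which tier(s) are responsible for a given task."""
    return [tier for tier, tasks in TIER_TASKS.items() if task in tasks]

print(tiers_for("skims"))   # ['T1']
```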

  5. T1/T2 Associations
• Associated Tier-1: hosts MC production and is the reference for AOD serving
- full AOD sample at the Tier-1 (after T1→T1 transfers for re-recoed AODs)
• Stream “allocation” ~ available disk storage at the centre

  6. T2 – PAG, POG, DPG Associations (association table shown in the original slide)

  7. Data Processing Guidelines
• We aim for prompt reconstruction and analysis → reduce backlogs
• We need the possibility of prioritisation:
- cope with backlogs without delaying critical data
- prompt calibration using low-latency data
• We use data streaming based on trigger bits → we need to understand the trigger and event selection
- early event classification allows later prioritisation
1. Express stream: ‘hot’ physics events, calibration events, Data Quality Monitoring (DQM) events
2. Physics stream: propose O(7) ‘primary datasets’, immutable but allowed to overlap
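
A minimal sketch of such trigger-bit streaming, with invented trigger and dataset names (the real trigger table and the O(7) primary datasets are defined elsewhere):

```python
# Route one event to the express stream and/or to the primary datasets
# whose triggers it fired. All names here are illustrative assumptions.
EXPRESS_TRIGGERS = {"HotPhysics", "AlCa", "DQM"}
PRIMARY_DATASETS = {                 # O(7) primary datasets in reality
    "Mu":     {"HLT_Mu9", "HLT_DoubleMu3"},
    "EGamma": {"HLT_Ele15", "HLT_Photon20"},
    "JetMET": {"HLT_Jet50", "HLT_MET35"},
}

def route(fired: set[str]) -> tuple[bool, list[str]]:
    """Classify one event from the set of trigger bits it fired.
    Overlap is allowed: one event may enter several primary datasets."""
    to_express = bool(fired & EXPRESS_TRIGGERS)
    datasets = [pd for pd, bits in PRIMARY_DATASETS.items() if fired & bits]
    return to_express, datasets

print(route({"HLT_Mu9", "HLT_MET35"}))   # (False, ['Mu', 'JetMET'])
```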

  8. Commissioning: Express Streams → T1s (flow diagram in the original slide)

  9. Trigger–Dataset Connection
• The goal: create datasets based on triggers for specific physics objects
→ datasets distributed to central storage at T2s
→ run skims on those datasets (or a skim of a skim …)
• Purpose and benefits of this kind of dataset:
- groups similar events together to facilitate data analyses
- data separation (streaming) is based on trigger bits → no need for additional processing
- triggers are persistent, so the datasets are resilient
- recognised drawback: events are not selected with optimal reconstruction & AlCa
• Obvious consequences (illustrated in the sketch below):
- every trigger has to be assigned to at least one dataset
- increasing dataset overlaps → increasing storage requirements
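
A short sketch of both consequences on invented inputs: the assignment check, and the storage inflation caused by a trigger duplicated across datasets:

```python
# Trigger-to-dataset map; names are invented for illustration.
trigger_to_datasets = {
    "HLT_Mu9":   ["Mu"],
    "HLT_Ele15": ["EGamma"],
    "HLT_MET35": ["JetMET", "Mu"],   # trigger duplicated across two datasets
}

# (1) every trigger must belong to at least one dataset
unassigned = [t for t, ds in trigger_to_datasets.items() if not ds]
assert not unassigned, f"triggers without a dataset: {unassigned}"

# (2) overlap inflates storage: an event is stored once per dataset it enters
def copies_stored(fired: set[str]) -> int:
    """How many dataset copies one event occupies, given its fired triggers."""
    return len({ds for t in fired for ds in trigger_to_datasets[t]})

events = [{"HLT_Mu9"}, {"HLT_MET35"}, {"HLT_Ele15", "HLT_MET35"}]
factor = sum(copies_stored(e) for e in events) / len(events)
print(f"storage inflation factor: {factor:.2f}")   # > 1.0 because of overlap
```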

  10. The Plan
1. Primary Datasets (PD): immutable, based on triggers, split at T0
- RAW/RECO/AOD → PDs distributed (with limits) to central storage at T1s
2. Secondary Datasets (SD): produced from PDs at T1s by dataOps and distributed to central storage at T2s
- RECO or AOD format, trigger-based as well
3. Central skims: produced at T1s by dataOps
- very few initially, for key applications that cannot be served otherwise
4. Group skims: run on datasets stored at T2s; flexible in the choice of event content, but provenance must be maintained
- approved by group conveners; a tool is expected to allow registering them in the global DBS → tested in the October exercise!
- subscribable to group space
5. User analysis skims: a dataset that is no more than a skim away from provenance
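
A minimal sketch of the provenance chain this plan implies, with each derived dataset keeping a link to its parent; the class and field names are illustrative, not a real DBS or dataOps schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Dataset:
    name: str
    kind: str                       # "PD", "SD", "central-skim", "group-skim"
    parent: Optional["Dataset"] = None

    def provenance(self) -> list[str]:
        """Walk back to the primary dataset this sample descends from."""
        chain, node = [], self
        while node is not None:
            chain.append(f"{node.kind}:{node.name}")
            node = node.parent
        return chain

pd = Dataset("Mu", "PD")                               # split at T0
sd = Dataset("Mu_SingleMu", "SD", parent=pd)           # made at T1 by dataOps
gs = Dataset("Mu_TopGroup", "group-skim", parent=sd)   # made by a physics group
print(gs.provenance())  # ['group-skim:Mu_TopGroup', 'SD:Mu_SingleMu', 'PD:Mu']
```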

  11. From PD to SD & Skims
• The goal: provide easy access to interesting data
• A PD can be quite large → reduce the amount of data to easily manageable sizes
• Secondary Datasets (SD): each SD is a centrally produced subset of one PD, selected using trigger info … not more than ~30% of the events
- RECO format initially, later AOD
• Central skims: produced centrally at T1s, 1-2 per PD → the 10% most interesting events, or a uniform subset (prescale 10)
• Group skims: designed by the groups, run on any data at a T2
- can also run on a PD if it is present at the T2
- ideally take central skims as input → ~1% of the PD … manageable day-to-day
• User skims: final skims made by individual users for their own needs
• Procedures tested during the October exercise
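
The reduction chain can be sized with simple arithmetic; in this sketch only the fractions come from the slide, while the primary-dataset event count is an invented example:

```python
pd_events = 300_000_000            # hypothetical primary-dataset size

sd_events    = pd_events * 0.30    # SD: not more than ~30% of the PD
central_skim = pd_events * 0.10    # central skim: ~10%, or a prescale-10 subset
group_skim   = pd_events * 0.01    # group skim: ~1% of the PD

def keep(event_index: int, prescale: int = 10) -> bool:
    """Uniform prescale: keep every prescale-th event."""
    return event_index % prescale == 0

print(f"SD {sd_events:.0f} / central {central_skim:.0f} / group {group_skim:.0f}")
```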

  12. Physics Objects (example only)
• Physics objects are not well balanced in size → combine or split them for balance … keeping overlaps low

  13. Object DS → Primary DS
• For rate equalisation:
- some datasets are too big → split
- some are too small → merge
• From object datasets to PDs:
- splitting is based only on trigger bits
- merge correlated triggers
- keep unprescaled triggers together
- allow duplication of triggers if it is meaningful from a physics point of view
• The example shown here is for L >> 8×10^29; for L = 8×10^29 only 8 PDs are needed (J. Incandela)
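
A sketch of the rate-equalisation rule; dataset names, rates and thresholds are invented, and the real split/merge decisions also weigh trigger correlations and physics meaning, as listed above:

```python
# Merge datasets far below a target rate; split those far above it.
target_hz = 25.0
object_ds = {"MultiJet": 40.0, "Bjet": 1.5, "SingleMu": 60.0, "LepX": 2.0}

primary: dict[str, float] = {}
for name, rate in object_ds.items():
    if rate < 0.2 * target_hz:                 # too small -> merge
        primary["Merged"] = primary.get("Merged", 0.0) + rate
    elif rate > 2.0 * target_hz:               # too big -> split on trigger bits
        primary[name + "_1"] = rate / 2
        primary[name + "_2"] = rate / 2
    else:
        primary[name] = rate

print(primary)   # Bjet and LepX merged; SingleMu split into two PDs
```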

  14. Object DS → Primary DS (2)
• The Bjet and Lepton+X datasets have very small rates
• Bjet: merged with the MultiJet dataset
• LepX: these are combined-object triggers
→ split & absorbed into the 2 relevant lepton datasets
→ the same trigger then appears in 2 datasets

  15. 1st Iteration of Secondary Datasets
• Starting point: the 8×10^29 Primary Datasets (PD)
• For each Secondary Dataset (SD) the following are specified (made concrete in the sketch below):
- dataset/skim name
- data format (RECO, AOD, reduced AOD, etc.)
- prescale with respect to the parent PD
- rate in Hz
- fraction of events with respect to the total parent PD
- fraction of disk with respect to the PD size (RECO assumed)
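
Those bookkeeping fields can be made concrete as a simple record type; the field names follow the slide, while the example values are invented:

```python
from typing import NamedTuple

class SecondaryDataset(NamedTuple):
    name: str
    data_format: str        # RECO, AOD, reduced AOD, ...
    prescale: int           # with respect to the parent PD
    rate_hz: float
    event_fraction: float   # of the total parent PD
    disk_fraction: float    # of the PD size (RECO assumed)

sd = SecondaryDataset("DiJetAve15", "reduced RECO", 1, 3.0, 0.08, 0.03)
print(sd)
```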

  16. Secondary Datasets at L = 8×10^29: Jet SDs
• Low-pT jet triggers are already prescaled at the HLT → further prescaled at the SD level
• The full DiJetAve15 statistics are needed for JEC → to be stored in a reduced format
• Keep 50 GeV as the lowest unprescaled single-jet threshold
• The full statistics for DiJetAve30 are again needed for JEC … reduced event size
• Keep 35 GeV as the lowest unprescaled MET trigger
• Also keep events from the Btag and HSCP triggers

  17. Summary
• The goal of the CMS data distribution model is to make data access easier, more reliable and more efficient
• Many components are already in place:
- the tier structure
- T2 associations and data transfer tools
- trigger tables
- Primary Datasets and a 1st iteration of trigger-based SDs
• PDs & SDs in standard formats will be distributed to T2s (AOD, and RECO if possible)
• Central and group skims run on PDs are accessible at T2s and more manageable
• No restrictions on group and user skims … even if special data formats are required
• A post-mortem analysis of the October exercise, where all of this was tested before real data taking, is in progress
