LHC Computing Models

Presentation Transcript


  1. LHC Computing Models Commissione I 31/1/2005 Francesco Forti, Pisa Referee group (Gruppo di referaggio): Forti (chair), Belforte, Menasce, Simone, Taiuti, Ferrari, Morandin, Zoccoli

  2. Outline • Comparative analysis of the computing models (little about Alice) • Referee comments • Roadmap: what’s next Disclaimer: it is hard to digest and summarize the available information; apologies in advance for errors and omissions. LHC Computing Models - F. Forti

  3. A Little Perspective • In 2001 the Hoffmann Review was conducted to quantify the resources needed for LHC computing • Documented in CERN/LHCC/2001-004 • As a result, the LHC Computing Grid (LCG) project was launched to start building up the needed capacity and competence and to provide a prototype for the experiments to use. • In 2004 the experiments ran Data Challenges to verify their ability to simulate, process and analyze their data • In Dec 2004 the Computing Model documents were submitted to the LHCC, which reviewed them on Jan 17-18, 2005 • The Computing TDRs and the LCG TDR are expected this spring/summer. LHC Computing Models - F. Forti

  4. Running assumptions • 2007 Luminosity 0.5×10^33 cm^-2 s^-1 • 2008-9 Luminosity 2×10^33 cm^-2 s^-1 • 2010 Luminosity 1×10^34 cm^-2 s^-1 • but trigger rate is independent of luminosity • 7 months pp run = 10^7 s live time (real time 1.8×10^7 s) • 1 month AA run = 10^6 s live time (real time 2.6×10^6 s) • 4 months shutdown LHC Computing Models - F. Forti
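
A minimal numerical cross-check of the live-time figures quoted above; the inputs are exactly those on the slide, nothing else is assumed.

```python
# Cross-check of the live-time assumptions on slide 4.
pp_live_s = 1.0e7    # pp live time per year
pp_real_s = 1.8e7    # wall-clock time of the ~7-month pp run
aa_live_s = 1.0e6    # heavy-ion (AA) live time
aa_real_s = 2.6e6    # wall-clock time of the ~1-month AA run

print(f"pp duty factor: {pp_live_s / pp_real_s:.0%}")   # ~56%
print(f"AA duty factor: {aa_live_s / aa_real_s:.0%}")   # ~38%
```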

  5. Data Formats • Names differ, but the concepts are similar: • RAW data • Reconstructed event (ESD, RECO, DST) • Tracks with associated hits, Calorimetry objects, Missing energy, trigger at all levels, … • Can be used to refit, but not to do pattern recognition • Analysis Object Data (AOD, rDST) • Tracks, particles, vertices, trigger • Main source for physics analysis • TAG • Number of vertices, tracks of various types, trigger, etc. • Enough information to select events, but otherwise very compact. LHC Computing Models - F. Forti
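
To make the format hierarchy concrete, here is an illustrative sketch in Python. The class and field names are invented for this summary and do not correspond to any experiment's actual event model; only the roles of the tiers follow the slide.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RawEvent:
    """RAW data: full detector readout, needed for pattern recognition."""
    run: int
    event: int
    payload: bytes

@dataclass
class RecoEvent:
    """ESD/RECO/DST: tracks with hits, calorimetry objects, trigger.
    Enough to refit, but not to redo pattern recognition."""
    run: int
    event: int
    tracks: List[dict] = field(default_factory=list)
    calo_objects: List[dict] = field(default_factory=list)

@dataclass
class AnalysisEvent:
    """AOD/rDST: compact physics objects, main source for analysis."""
    run: int
    event: int
    particles: List[dict] = field(default_factory=list)
    vertices: List[dict] = field(default_factory=list)

@dataclass
class EventTag:
    """TAG: a few summary quantities, enough to select events."""
    run: int
    event: int
    n_tracks: int
    n_vertices: int
    trigger_bits: int
```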

  6. General strategy • Similar general strategy for the models: • Tier 0 at CERN: • 1st pass processing in quasi-real time after rapid calibration • RAW data storage • Tier 1s (6 for Alice, CMS, LHCb; 10 for Atlas): • Reprocessing; Centrally organized analysis activities • Copy of RAW data; some ESD; all AOD; some SIMU • Tier 2s (14-30) • User analysis (chaotic analysis); Simulation • Some AOD depending on user needs LHC Computing Models - F. Forti
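
The same tiered strategy, condensed into a plain data structure for reference; the site counts and data placement are those quoted on the slide.

```python
# Condensed view of the tiered model on slide 6.
TIER_MODEL = {
    "Tier-0 (CERN)": {
        "roles": ["first-pass reconstruction", "RAW archival"],
        "holds": ["RAW (primary copy)", "first-pass reco output"],
    },
    "Tier-1 (6-10 sites)": {
        "roles": ["reprocessing", "centrally organized analysis"],
        "holds": ["RAW (shared second copy)", "some ESD", "all AOD", "some SIMU"],
    },
    "Tier-2 (14-30 sites)": {
        "roles": ["user (chaotic) analysis", "simulation"],
        "holds": ["AOD subsets, depending on user needs"],
    },
}
```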

  7. Event sizes LHC Computing Models - F. Forti

  8. First Pass Reconstruction • Assumed to be in real time • CPU power calculated to process data in 10^7 s • Fast calibration prior to reconstruction • Disk buffer at T0 to hold events before reconstruction • Atlas: 5 days; CMS: 20 days; LHCb: ? LHC Computing Models - F. Forti
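
A hedged illustration of how the T0 CPU and disk-buffer figures follow from a few parameters. The trigger rate and per-event CPU time below are placeholders, not numbers from the computing model documents; the 1.5 MB event size is the conservative CMS startup figure quoted on slide 35, and the 5-day buffer is the ATLAS assumption above.

```python
# Back-of-the-envelope T0 sizing; all starred inputs are hypothetical.
trigger_rate_hz = 200        # * placeholder HLT output rate
raw_event_mb    = 1.5        # conservative CMS startup RAW size (slide 35)
cpu_per_event_s = 10.0       # * placeholder reconstruction time on one CPU
live_time_s     = 1.0e7      # pp live time per year (slide 4)
buffer_days     = 5          # ATLAS T0 disk buffer (slide 8); CMS assumes 20

# "Process data in 10^7 s": CPUs needed to keep up with data taking.
events_per_year = trigger_rate_hz * live_time_s
cpus_needed = events_per_year * cpu_per_event_s / live_time_s
print(f"CPUs to keep up with data taking: ~{cpus_needed:.0f}")

# Disk buffer in front of the first-pass reconstruction.
buffer_tb = trigger_rate_hz * raw_event_mb * buffer_days * 86400 / 1e6
print(f"T0 disk buffer for {buffer_days} days: ~{buffer_tb:.0f} TB")
```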

  9. Streaming • All experiments foresee RAW data streaming, but with different approaches. • CMS: O(50) streams based on trigger path • Classification is immutable, defined by L1+HLT • Atlas: 4 streams based on event types • Primary physics, Express line, Calibration, Debugging and diagnostic • LHCb: >4 streams based on trigger category • B-exclusive, Di-muon, D* Sample, B-inclusive • Streams are not created in the first pass, but during the “stripping” process • It is not clear what the best/right solution is; it is probably bound to evolve in time. LHC Computing Models - F. Forti
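
A minimal sketch of trigger-based streaming in the spirit of the slide. The stream names follow the LHCb categories quoted above; the routing logic and the fallback stream are invented for illustration only.

```python
# Illustrative routing of events into RAW streams by trigger category.
LHCB_STREAMS = ["B-exclusive", "Di-muon", "D* Sample", "B-inclusive"]

def assign_stream(trigger_category: str) -> str:
    """Route an event to an output stream based on its trigger category."""
    # CMS would instead key on O(50) L1+HLT trigger paths, and ATLAS on
    # four broad classes (physics, express, calibration, debugging).
    return trigger_category if trigger_category in LHCB_STREAMS else "Debug"

print(assign_stream("Di-muon"))    # -> Di-muon
print(assign_stream("Min-bias"))   # -> Debug (hypothetical fallback)
```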

  10. Data Storage and Distribution • RAW data and the output of the 1st reconstruction are stored on tape at the T0. • Second copy of RAW shared among T1s. • CMS and LHCb distribute reconstructed data together (zipped) with RAW data. • No navigation between files to access RAW. • Space penalty, especially if RAW turns out to be larger than expected • Storing multiple versions of reconstructed data can become inefficient • Atlas distributes RAW immediately, before reco • T1s could do processing in case of T0 backlog LHC Computing Models - F. Forti

  11. Data Storage and Distribution • Number of copies of reco data varies • Atlas assumes ESD have 2 copies at T1s • CMS assumes a 10% duplication among T1s for optimization reasons • Each T1 is responsible for permanent archival of its share of RAW and reconstructed data. • When and how to throw away old versions of reconstructed data is unclear • All AOD are distributed to all T1s • AOD are the primary source for data analysis LHC Computing Models - F. Forti
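
An order-of-magnitude storage sketch for one year of RAW data, using the same placeholder trigger rate as in the earlier sketch and the conservative 1.5 MB event size. These numbers illustrate the scaling only; they are not the experiments' own estimates.

```python
# Rough yearly RAW volume and tape copies (illustrative inputs).
trigger_rate_hz = 200       # * placeholder HLT output rate
raw_event_mb    = 1.5       # conservative CMS startup figure (slide 35)
live_time_s     = 1.0e7     # pp live time (slide 4)
raw_copies      = 2         # primary copy at T0 + one copy shared among T1s

raw_pb_per_year = trigger_rate_hz * raw_event_mb * live_time_s / 1e9
print(f"RAW per year:      ~{raw_pb_per_year:.1f} PB")
print(f"Total RAW on tape: ~{raw_copies * raw_pb_per_year:.1f} PB")
# Every version of reconstructed data kept adds a comparable fraction,
# which is why retiring old versions (previous slide) matters.
```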

  12. Calibration • Initial calibration is performed at the T0 on a subset of the events • It is then used in the first reconstruction • Further calibration and alignment is performed offline at the T1s • Results are inserted in the conditions database and distributed • Plans are still very vague • Atlas’s is maybe a bit more defined LHC Computing Models - F. Forti

  13. Reprocessing • Data need to be reprocessed several times because of: • Improved software • More accurate calibration and alignment • Reprocessing mainly at T1 centers • LHCb is planning to use the T0 during the shutdown – not obvious it will be available • Number of passes per year LHC Computing Models - F. Forti

  14. Analysis • The analysis process is divided into: • Organized and scheduled (by working groups) • Often requires large data samples • Performed at T1s • User-initiated (chaotic) • Normally on small, selected samples • Largely unscheduled, with huge peaks • Mainly performed at T2s • Quantitatively very uncertain LHC Computing Models - F. Forti

  15. Analysis data source • Steady-state analysis will use mainly AOD-style data, but… • … initially access to RAW data in the analysis phase may be needed. • CMS and LHCb emphasize this need by storing raw+reco (or raw+rDST) data together, in streams defined by physics channel • Atlas relies on Event Directories formed by querying the TAG database to locate the events in the ESD and in the RAW data files LHC Computing Models - F. Forti
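
A sketch of the ATLAS-style TAG/Event Directory idea described above: query a compact TAG table to select events, then resolve where the corresponding ESD/RAW records live. The table layout, column names and file names are invented for illustration.

```python
# Illustrative TAG query building an "event directory" (hypothetical schema).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE tag (
    run INTEGER, event INTEGER, n_muons INTEGER,
    trigger_bits INTEGER, esd_file TEXT)""")
conn.execute("INSERT INTO tag VALUES (1001, 42, 2, 8, 'esd_1001_A.root')")
conn.execute("INSERT INTO tag VALUES (1001, 43, 0, 1, 'esd_1001_A.root')")

# An event directory is just the list of (file, run, event) locations that
# satisfy a selection -- here, di-muon candidates.
event_directory = conn.execute(
    "SELECT esd_file, run, event FROM tag WHERE n_muons >= 2").fetchall()
print(event_directory)   # [('esd_1001_A.root', 1001, 42)]
```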

  16. Simulation • Simulation is performed at T2 centers, dynamically adapting the share of CPU with analysis • Simulation data is stored at the corresponding T1 • Amount of simulation data planned varies: • Dominated by CPU power • 100% may be too much; 10% may be too little LHC Computing Models - F. Forti
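
What the "100% may be too much; 10% may be too little" range means in event counts, using the same placeholder trigger rate as in the earlier sketches; only the live time comes from the slides.

```python
# Simulated-sample sizes as a fraction of the real data rate (illustrative).
trigger_rate_hz = 200          # * placeholder HLT output rate
live_time_s     = 1.0e7        # pp live time per year (slide 4)

real_events = trigger_rate_hz * live_time_s            # ~2e9 events/year
for fraction in (1.0, 0.5, 0.2, 0.1):                  # range discussed in the talk
    print(f"MC at {fraction:4.0%} of data: {fraction * real_events:.1e} events")
```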

  17. GRID • The level of reliance on/use of GRID middleware is different for the 4 experiments: • Alice: heavily relies on advanced, not yet available, Grid functionality to store and retrieve data, and to distribute CPU load among T1s and T2s • Atlas: the Grid is built into the project, but basically assuming stability of what is available now. • CMS: designed to work without the Grid, but will make use of it if available. • LHCb: flexibility to use the Grid, but no strict dependence on it. • Number of times the word “grid” appears in the computing model documents (all included) LHC Computing Models - F. Forti

  18. @CERN • Computing at CERN beyond the T0 • Atlas: “CERN Analysis Facility” • but only for CERN-based people, not for the collaboration • CMS: T1 and T2 at CERN • but the T1 has no tape since the T0 does the storing • LHCb: unclear; explicit plan to use the event filter farm during the shutdown periods • Alice: doesn’t need anything at CERN, the Grid will supply the computing power. LHC Computing Models - F. Forti

  19. Overall numbers (2005 plan vs. 2001 Review) LHC Computing Models - F. Forti

  20. Referee comments • Sum of comments from the LHCC review and the Italian referees • We still need to interact with the experiments • We will compile a list of questions after today’s presentations • We plan to hold four phone meetings next week to discuss the answers • Some are just things the experiments know they need to do • Stated here to reinforce them LHC Computing Models - F. Forti

  21. LHCC Overall Comments • The committee was very impressed with the quality of the work that was presented. In some cases, the computing models have evolved significantly from the time of the Hoffmann review. • In general there is a large increase in the amount of disk space required. There is also an increase in overall CPU power wrt the Hoffmann Review. The increase is primarily at Tier-1's and Tier-2's. Also the number of Tier-1 and Tier-2 centers has increased. • The experiences from the recent data challenges have provided a foundation for testing the validity of the computing models. The tests are at this moment incomplete. The upcoming data challenges and service challenges are essential to test key features such as data analysis and network reliability. LHC Computing Models - F. Forti

  22. LHCC Overall Comments II • The committee was concerned about the dependence on precise scheduling required by some of the computing models. • The data analysis models in all 4 experiments are essentially untested. The risk is that distributed user analysis is not achievable on a large scale. • Calibration schemes and use of conditions data have not been tested. These are expected to have an impact of only about 10% in resources but may impact the timing and scheduling. • The reliance on the complete functionality of GRID tools varies from one experiment to another. There is some risk that disk/cpu resource requirements will increase if key GRID functionality is not used. There is also a risk that additional manpower will be required for development, operations and support. LHC Computing Models - F. Forti

  23. LHCC Overall Comments III • The contingency factors on processing times and RAW data size vary among the experiments. • The committee did not review the manpower requirements required to operate these facilities. • The committee did not review the costs. Will this be done? It would be helpful if the costing could be somewhat standardized across the experiments before it is presented to the funding agencies. • The committee listened to a presentation on networks for the LHC. A comprehensive analysis of the peak network demands for the 4 experiments combined is recommended (see below). LHC Computing Models - F. Forti

  24. LHCC Recommendations • The committee recommends that the average and the peak computing requirements of the 4 experiments be studied in more detail. A month-by-month analysis of the CPU, disk, tape access and network needs for all 4 experiments is required. A clear statement on computing resources required to support HI running in CMS and ATLAS is also required. Can the peak demands during the shutdown period be reduced/smoothed? Plans for distributed analysis during the initial period should be worked out. • Dependence of the computing model on raw event size, reconstruction time, etc. should be addressed for each experiment. • Details of the ramp up (2006-2008) should be determined and a plan for the evolution of required resources should be worked out. • A complete accounting of the offline computing resources required at CERN is needed for 2006-2010. In addition to production demands, the resource planning for calibration, monitoring, analysis and code testing and development should be included - even though the resources may seem small. The committee supports the requests for Tier-1/Tier-2 functionality at CERN. This planning should be refined for the 4 experiments. LHC Computing Models - F. Forti

  25. LHCC Conclusions • Aside from issues of peak capacity, the committee is reasonably certain that the computing models presented are robust enough to handle the demands of LHC production computing during early running (through 2010). There is a concern about the validity of the data analysis components of the models. LHC Computing Models - F. Forti

  26. Additional comments from INFN Referees • Basic parameters such as event size and reconstruction CPU time have very large uncertainties • Study the dependence of the computing models on these key parameters and determine the brick-wall limits • Data formats are not well defined • Some are better than others • Need to verify that the proposed formats are good for real-life analysis. For example: • can you do event display on AODs? • can you run an alignment systematic study on ESDs? LHC Computing Models - F. Forti

  27. Additional Comments II • Many more people need to try to do analysis with the existing software and provide feedback • Calibration and conditions database access have not been sufficiently defined and can represent bottlenecks • No cost-benefit analysis has been performed so far • Basically the numbers are what the experiments would like to have • No optimization done yet on the basis of the available resources • In particular: amount of disk buffers; duplication of data; reuse of tapes LHC Computing Models - F. Forti

  28. Additional Comments III • Are the models flexible enough? • Given the large unknowns, will the models be able to cope with large changes in the parameters? For example: • assuming all reconstructed data is on disk may drive the experiments (and the funding agencies) into a cost brick-wall if the size is larger than expected, or effectively limit the data acquisition rate. • evolution after 2008 is not fully charted and understood. Is there enough flexibility to cope with a resource-limited world? • Are the models too flexible? • Assuming the grid will optimize things for you (Alice) may be too optimistic • Buffers and safety factors aimed at flexibility are sometimes large and not fully justified LHC Computing Models - F. Forti

  29. Additional Comments IV • The bandwidth is crucial • Peaks in T0→T1 need to be understood • The required bandwidth has not been fully evaluated, especially at lower levels and for “reverse” flow • T1→T1, T2 (e.g. MC data produced at T2) • Incoming at CERN (not T0) of reprocessed data and MC • Need to compile tables with the same safety factors assumed LHC Computing Models - F. Forti
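
A rough illustration of why the T0→T1 peak matters: exporting a second copy of RAW plus the first-pass reco in real time. The rate and reco size are placeholders (the same hypothetical inputs as in the earlier sketches), not computing-model figures.

```python
# Back-of-the-envelope T0->T1 export bandwidth (illustrative inputs).
trigger_rate_hz = 200        # * placeholder HLT output rate
raw_event_mb    = 1.5        # conservative startup RAW size (slide 35)
reco_event_mb   = 0.5        # * placeholder ESD/RECO size

export_mb_s = trigger_rate_hz * (raw_event_mb + reco_event_mb)
print(f"Average T0->T1 export: ~{export_mb_s:.0f} MB/s "
      f"(~{export_mb_s * 8 / 1000:.1f} Gb/s aggregate over 6-10 T1s)")
# A peak-to-average factor of 2 (cf. the LHCb slides) doubles this, and the
# "reverse" flows (MC from T2s, reprocessed data back to CERN) add more.
```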

  30. Specific comments on experiments • Coming from LHCC review • Not fully digested and not yet integrated by INFN referees • Useful to collect them here for future reference • Some duplication unavoidable. Your patience is appreciated. LHC Computing Models - F. Forti

  31. ATLAS I • Impressed by overall level of thought and planning which have gone into the overall computing model so far. • In general fairly specific and detailed • Welcome thought being given to the process of and support for detector calibration and conditions database. • needs more work • looking forward to the DC3 and LCG Service Challenge results • An accurate, rapid calibration on 10% of data is crucial for the model LHC Computing Models - F. Forti

  32. ATLAS II • Concern about the evidence basis and experience with several aspects of the computing model • large reduction factor assumed in event size and processing time, not really justified • data size and processing time variation with background and increasing luminosity • lead to large (acknowledged but somewhat hidden) uncertainties in estimates • Data size and number of copies, particularly for the ESD, have significant impact on the total costs. • We note that these are larger for Atlas than for other experiments. • Also very large number of copies of the AOD • Depend critically on analysis patterns which are poorly understood at this time and require a fair amount of resources LHC Computing Models - F. Forti

  33. ATLAS III • Concern about the lack of practical experience with the distributed analysis model • especially if AOD are not the main data source at the beginning • need resources to develop the managerial software needed to handle the distributed environment (based on Grid MW), for example if Tier1s need to help in case of backlog at Tier0 • Need to include HI physics in the planning. • Availability of computing resources during the shutdown should not be taken for granted. • Real-time data processing introduces a factor 2 extra resource requirement for reconstruction. • It is not clear that this assumption is justified/valid cf. the ability to keep up with data taking on average. • The ATLAS TAG model is yet to be verified in practice. • We are unclear exactly how it will work. • Primary interface for physicists, need iterations to get it right. LHC Computing Models - F. Forti

  34. ATLAS IV • Monte Carlo • Agree that the assumption of 20% fully reconstructed Monte Carlo is a risk and a larger number would be better/safer. • Trigger rates • We note that the total cost of computing scales with trigger rates. This is clearly a knob that can be turned. • The CERN Analysis Facility is more a mixture of a Tier-1 and a Tier-2 • No doubt Atlas needs computing at CERN for calibration and analysis LHC Computing Models - F. Forti

  35. CMS I • Uncertainty of factor ~2 on many numbers taken as input to the model • c.f. ATLAS assumptions • Event size 0.3 MB MC inflated to 1.5 MB • factor 2.5 for conservative thresholds/zero suppression at startup • Safety factor of 2 in the Tier-0 RECO resources should be made explicit • Should we try to use same factor for all four experiments? • Fully simulated Monte Carlo • 100% of real data rate seems like a reasonable goal • but so would 50% (Atlas assumes 20%) • Heavy Ion • Need a factor of 10 improvement in RECO speed wrt current performance • Ratio of CPU to IO means that this is possibly best done at Tier-2 sites! LHC Computing Models - F. Forti

  36. CMS II • Use of "CMS" Tier-0 resources during 4-month shutdown? • Maybe needed for CMS and/or ALICE heavy ion RECO • Re-RECO of CMS pp data on Tier-0 may not be affordable? • We find clear justification for a sizable CERN-based “analysis” facility • Especially for detector-related (time critical) activities • monitoring, calibration, alignment • Is distinction between Tier-1 and Tier-2 at CERN useful? • c.f. ATLAS LHC Computing Models - F. Forti

  37. CMS III • CMS attempt to minimize reliance on some of the currently least mature aspects of the Grid • e.g., global data catalogues, resource brokers, distributed analysis • Streaming by RECO physics objects • Specific streams placed at specific Tier-1 sites • RECO+RAW (FEVT, full event) is the basic format for the first year or two • Conservative approach, but in our view not unreasonably so • Some potential concerns: • More difficult to balance load across all Tier-1s • Politics: which Tier-1s get the most sexy streams? • Analysis at Tier-1 restricted largely to organized production activities • AOD production, dataset skimming, calibration/alignment jobs? • except perhaps for one or two "special" T1s LHC Computing Models - F. Forti

  38. CMS IV • Specific baseline presented, but • A lot of thought has gone into considering alternatives • Model has some flexibility to respond to real life • Presented detailed resources for 2008 • Needs for 2007 covered by need to ramp up for 2008 • No significant scalability problems apparent for future growth • The bottom line: • Assumptions and calculation of needed resources seem reasonable • Within overall uncertainty of perhaps a factor ~2? LHC Computing Models - F. Forti

  39. LHCb I • LHCb presented a computing model based on a significantly revised DAQ plan, with a planned output of 2 kHz • The committee did not try to evaluate the merit of the new data collection strategy, but tried to assess whether the computing resources seem appropriate given the new strategy. • It’s notable that the computing resources required for the new plan are similar (within 50%, except for disk) to those in the Hoffmann report even though the event rate is increased by an order of magnitude, largely because of the reduction in simulation requirements in the new plan. • The committee was impressed by the level of planning that has gone into the LHCb computing model, and by the clarity and detail of the presentations. In general, the committee believes that LHCb presented a well-reasoned plan with appropriate resources for their proposed computing model. LHC Computing Models - F. Forti

  40. LHCb II • Time variation of resource requirements: In the LHCb computing plan as presented, the peak cpu and network needs exceed the average by a factor of 2. This variation must be considered together with the expected resource use patterns of other experiments. LHCb (and others) should consider scenarios to smooth out peaks in resource requirements. • Monte Carlo: Even in the new plan, Monte Carlo production still consumes more than 50% of cpu resources. Any improvement in the performance of MC or reduction in MC requirements would therefore have a significant impact on cpu needs. The group’s current MC estimates, while difficult to justify in detail, seem reasonable for planning. • Event size: The committee was concerned about the LHCb computing model’s reliance on the small expected event size (25 kB). The main concern is I/O during reconstruction and stripping. LHCb believe that a factor of 2 larger event size would still be manageable. • rDST size: The rDST size has almost as large an impact on computing resources as the raw event size. The committee recommends that LHCb develop an implementation of the rDST as soon as possible to understand whether the goal of 50 kB (including raw) can be achieved. LHC Computing Models - F. Forti
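
A worked check of the LHCb numbers quoted on these two slides: 2 kHz of 25 kB raw events, a 50 kB raw+rDST target, and a 10^7 s year. Only these quoted figures are used.

```python
# LHCb data rate and yearly volume from the figures on slides 39-40.
rate_hz     = 2_000       # planned HLT output rate
raw_kb      = 25          # expected raw event size
rdst_kb     = 50          # raw + rDST target size
live_time_s = 1.0e7       # pp live time per year

raw_rate_mb_s = rate_hz * raw_kb / 1000
raw_pb_year   = rate_hz * raw_kb * live_time_s / 1e12
rdst_pb_year  = rate_hz * rdst_kb * live_time_s / 1e12
print(f"RAW into storage:  ~{raw_rate_mb_s:.0f} MB/s, ~{raw_pb_year:.2f} PB/year")
print(f"RAW+rDST on tape:  ~{rdst_pb_year:.2f} PB/year (before reprocessing copies)")
# A factor 2 larger raw event, as discussed above, doubles both figures.
```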

  41. LHCb III • Event reconstruction and stripping strategy: The multi-year plan of event reconstruction and stripping seems reasonable, although 4 strippings per year may be ambitious. If more than 4 streams are written, there may be additional storage requirements. • User analysis strategy: The committee was concerned about the use of Tier 1 centers as the primary user analysis facility. Are Tier 1 centers prepared to provide this level of individual user support? Will LHCb’s planned analysis activities interfere with Tier 1 production activities? • Calibration: Although it is not likely to have a large impact on computing plans, we recommend that details of the calibration plan be worked out as soon as possible. • Data challenges: Future data challenges should include detector calibration and user analysis to validate those parts of the computing model. • Safety factors: We note that LHCb has included no explicit safety factors (other than prescribed efficiency factors) in computing needs given their model. This issue should be addressed in a uniform way among the experiments. LHC Computing Models - F. Forti

  42. The Grid and the experiments • Use of Grid functionality will be crucial for the success of LHC computing. • Experiments in general, and the Italian community in particular, need to ramp up their use of LCG in the data challenges • Verify the models • Feedback to developers • Strong interaction between the experiments and the LCG team is mandatory to match requirements and implementation • Cannot accommodate large overheads due to lack of optimization of resource usage. LHC Computing Models - F. Forti

  43. Conclusion and Roadmap • These computing models are one step on the way to LHC computing • Very good outcome, in general specific and concrete • Some interaction and refinement in the upcoming months • In the course of 2005: • Computing TDRs of the experiments. • Memorandum of understanding for the computing resources for LCG phase II. • Specific planning for CNAF and Tier2s in Italy. • Expect to start building up the capacity in 2006. LHC Computing Models - F. Forti
