ALICE Computing TDR: Questions and Answers • Federico Carminati • October 8, 2005
Q1: Milestones • Have you met your 2005 milestones?
Q1: Milestones • MS1-May 2005: PDC05 - Start of event production (phase 1) • Started only in September in order to • Synchronise with SC3 • Improve the integration with LCG middleware and increase the usage of common components • Working with LCG via a combined task force toward a stable long-term solution for our distributed computing environment • Delayed from May to September 2005
Q1: Milestones • MS2-June 2005: AliRoot framework release • Released summer 2005 in time for the PDC05 • FLUKA is the standard transport model • Detector geometry via the ROOT Geometrical Modeller (TGeo) • Calibration and Alignment framework and the prototype for the Condition infrastructure implemented • Milestone has been met as planned
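For illustration only, a minimal sketch of a ROOT TGeo geometry follows; the volume names, material and dimensions are invented placeholders and do not correspond to the actual ALICE geometry built by AliRoot.

// Minimal TGeo sketch (illustrative; names and sizes are invented,
// not the real ALICE geometry).
#include "TGeoManager.h"
#include "TGeoMaterial.h"
#include "TGeoMedium.h"
#include "TGeoVolume.h"

void simple_geometry()
{
   // The geometry manager owns materials, media and volumes
   TGeoManager *geom = new TGeoManager("toy", "Toy detector geometry");

   // A dummy material/medium pair (vacuum)
   TGeoMaterial *matVac = new TGeoMaterial("Vacuum", 0, 0, 0);
   TGeoMedium   *vac    = new TGeoMedium("Vacuum", 1, matVac);

   // Top volume: a box with 300 cm half-lengths acting as the "cave"
   TGeoVolume *top = geom->MakeBox("CAVE", vac, 300., 300., 300.);
   geom->SetTopVolume(top);

   // A cylindrical "TPC-like" volume placed at the centre of the cave
   TGeoVolume *tpc = geom->MakeTube("TPC", vac, 80., 250., 250.);
   top->AddNode(tpc, 1);

   // Close the geometry so it can be used for navigation and transport
   geom->CloseGeometry();
}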
Q1: Milestones • MS3-June 2005: Computing TDR submitted to the LHCC • This milestone has been met as planned
Q1: Milestones • MS4-July 2005: PDC05 – Start of combined test with SC3 (phase 2) • Goal • Distributed production and merging of signal and underlying events and the subsequent reconstruction of the merged event • No additional developments or new services are required for this phase • Delayed to December 2005 (following the delay of MS1)
Q1: Milestones • MS5-September 2005: PDC05 – Start of distributed analysis (phase 3) • Goal • Non-organised distributed analysis of ESD data by many users • The delay has allowed us to further integrate with LCG • All components ready • User interface to the Storage Index (gShell), ROOT API for SI access and deployment of PROOF • Will be released to selected users at the end of 2005 or early 2006 • Tests are ongoing • Batch and interactive distributed analysis will be demonstrated at SC’05 • Delayed to January 2006
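As an illustration of what the ROOT API to the file catalogue looks like from a user macro, a hedged sketch follows; the catalogue path, file pattern, tree name and selector are invented placeholders, not actual ALICE catalogue entries.

// Sketch of a catalogue query through the ROOT Grid interface
// (TGrid/TGridResult); all paths and names are placeholders.
#include "TGrid.h"
#include "TGridResult.h"
#include "TChain.h"

void query_catalogue()
{
   // Connect to the AliEn Grid services (authentication via the user's proxy)
   if (!TGrid::Connect("alien://")) return;

   // Query the file catalogue for ESD files under a hypothetical path
   TGridResult *res = gGrid->Query("/alice/sim/2005/toy_production", "AliESDs.root");

   // Build a chain of ESD trees from the returned transport URLs
   TChain chain("esdTree");
   for (Int_t i = 0; i < res->GetEntries(); i++)
      chain.Add(res->GetKey(i, "turl"));

   // The chain can now be processed locally or handed over to PROOF
   chain.Process("MySelector.C+");
}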
Q1: Milestones • MS7-December 2005: Condition infrastructure deployed • Initial user requirements collected • Prototype of the condition and tag infrastructure demonstrated • Further development according to user feedback ongoing • First release scheduled for December • No delay foreseen for this milestone
Q1: Milestones • MS8-December 2005: Preliminary implementation of algorithms for alignment and calibration ready for all detectors • Alignment and calibration framework prototype available • Implementation of the detector algorithms has started on the prototype • Good results obtained for the TPC • This milestone may be delayed
Q2: CDC VII • Page 12 mentions the next Data Challenge (CDC VII) but the schedule on page 77 does not seem to mention it; when is this planned to take place? Also, the goals for CDC VII include testing new network technologies; what are these technologies?
Q2: CDC VII • The planning could not be fixed at the time of the TDR • Network equipment to be purchased in 2005 after a large market survey • Planning discussed and agreed with IT • Nov-Dec '05: initial tests of the network equipment in the computing centre • April '06: generation of data in the DAQ at the experimental area; recording in the computing centre • Software will include • DATE V5, AliRoot, ROOT data formatting, algorithms from HLT, Linux SLC3 in 2005 (possibly SLC4 in 2006) • CASTOR2 • New technologies to be tested • 10 Gbit Ethernet router and Fibre Channel network for the storage
Q3: Tag and Grid Collector • In Chapter 2, the TAG and Grid Collector indexing mechanism looks very similar to functionalities provided by a relational DB. What are the reasons for your choice? Do you have proof that this system is scalable? Do you have quantitative results on achieved performances from the DCs?
Q3: Tag and Grid Collector • GC is based on compressed bitmap index technology • We do not need most of the RDBMS functionality • e.g. concurrent read and write access • Large gain in performance over classical RDBMS queries • Great benefits from a single I/O technology, i.e. ROOT files • Scalability and performance demonstrated by STAR • We have performed several standalone tests • Suited to ALICE needs • Confirm the STAR performance results • Framework developed in collaboration with ROOT and STAR • GC and the Index Builder will be included in the PDC 06
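Purely as a conceptual illustration of why bitmap indices answer multi-attribute tag queries quickly (this is not the Grid Collector or FastBit code, and real implementations additionally compress the bitmaps): each selection cut corresponds to a precomputed bitmap over the events, and a combined query reduces to cheap bitwise operations.

// Conceptual bitmap-index sketch (not the actual Grid Collector
// implementation): one bit per event, one bitmap per selection cut;
// a combined cut is a word-wise AND of the bitmaps.
#include <cstdint>
#include <cstddef>
#include <vector>
#include <iostream>

using Bitmap = std::vector<uint64_t>;   // 64 events per word

static Bitmap bitwise_and(const Bitmap &a, const Bitmap &b)
{
   Bitmap out(a.size());
   for (std::size_t w = 0; w < a.size(); ++w) out[w] = a[w] & b[w];
   return out;
}

int main()
{
   // Precomputed bitmaps, e.g. "high multiplicity" and "muon trigger fired"
   Bitmap highMult = {0xF0F0F0F0F0F0F0F0ULL};
   Bitmap muonTrig = {0xFF00FF00FF00FF00ULL};

   // Combining cuts costs one pass of ANDs, independent of how many
   // attributes the tag database stores
   Bitmap selected = bitwise_and(highMult, muonTrig);

   // The set bits are the event numbers to be read back from the ROOT files
   for (std::size_t w = 0; w < selected.size(); ++w)
      for (int b = 0; b < 64; ++b)
         if ((selected[w] >> b) & 1) std::cout << "event " << w * 64 + b << "\n";
   return 0;
}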
Q3: Tag and Grid Collector • From J. Wu's presentation at the September 2005 ROOT workshop • http://agenda.cern.ch/askArchive.php?base=agenda&categ=a055638&id=a055638s2t10/transparencies
Q4: PROOF requirements • Page 30: what are the hardware and software constraints associated with the PROOF system in the remote computing centres? Could you please give more details on the required architecture in an analysis centre?
Q4: PROOF requirements • Software constraints • All components (ROOT, proofd, proofserv and xrootd) are part of ROOT • Hardware constraints • Dictated by the target performance of the system • PROOF scales linearly up to a few hundred nodes • Nodes can simply be added to increase the performance • Commodity components • High-end CPUs (top-end P4 or AMD64) • SATA disks of a few hundred GB • A few GB of RAM and Gigabit Ethernet • A cluster of several tens of nodes can already deliver considerable performance for ad-hoc analysis • ALICE plans to instrument the CERN AF as a PROOF-enabled cluster • Intention to test a large PROOF cluster with the SC 4 / ALICE PDC 06
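A minimal sketch of how an analysis macro would use such a cluster from a ROOT session is shown below; the master host name, file URLs, tree name and selector are placeholders, not an actual ALICE configuration.

// Sketch of an interactive PROOF session (host, files and selector
// names are placeholders).
#include "TProof.h"
#include "TChain.h"

void run_on_proof()
{
   // Open a session on the PROOF master; workers are started on the nodes
   TProof::Open("proof-master.example.org");

   // A chain of ESD files; with SetProof() the processing is
   // transparently distributed over the PROOF workers
   TChain chain("esdTree");
   chain.Add("root://se.example.org//data/AliESDs_001.root");
   chain.Add("root://se.example.org//data/AliESDs_002.root");
   chain.SetProof();

   // The selector is shipped to, compiled on and run by the workers
   chain.Process("MyAnalysisSelector.C+");
}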
Q4: PROOF requirements • Already in use at PHOBOS; see M. Ballintijn's presentation at the ROOT workshop • http://agenda.cern.ch/askArchive.php?base=agenda&categ=a055638&id=a055638s2t5/transparencies
Q5: MONARC and Cloud models • How do you envisage the migration from the MONARC model to the cloud model as mentioned on page 63? Is this migration compatible with other LHC experiments' plans? The computing model described in the TDR seems to fit very well with the hierarchical model; why do you think that the cloud model is better?
Q5: MONARC and Cloud models • ALICE still sees the cloud model as appealing • Redundancy and resilience to failures • Flexibility in optimising the usage of resources • Tested successfully in PDC04 • The computing model follows a more “hierarchised” pattern • The LCG infrastructure is developing in a hierarchical fashion • Funding agencies plan around the large T1s • Resource evaluation and planning is easier • Strongly recommended by the LHCC during the Computing Model Review in January 2005, which suggested not to rely entirely on a cloud-enabled Grid and to adopt a stricter hierarchical model
Q6: First pass reconstruction • Page 65 states the first pass reconstruction will be done on Tier 0 at CERN for both pp and AA data. As a consequence the AA reconstructed data will not be available before at least 4 months after data taking. This delay is rather uncomfortable for fast physics feedback. Is it impossible to foresee a faster distributed first pass reconstruction on Tier 1's, maybe not in the first year when data will be scarce, but for the following years?
Q6: First pass reconstruction • After the heavy-ion run the T1s will be busy reprocessing the previous years' data • pp and HI reconstruction, organised analysis • “Pushing” more computation outside CERN would penalise ongoing physics activities • And it would critically depend on the performance of the Grid! • In the current model enough time is available to provide feedback for the running conditions of the next heavy-ion run • First significant results will be obtained from a subset of the data, allowing for early discovery • One of the goals of the CERN AF cluster
Q7: Tier 2 bandwidth • The computing model described implies reconstruction will be mainly done at Tier 1 while analysis will be done at Tier 2. Page 70 says the data at Tier 2 will be copied (and hence deleted to make space) as required. However, the implications of this in terms of extra bandwidth do not seem to have been included fully; can they be estimated?
Q7: Tier 2 bandwidth • T1s do second and third reconstruction passes and organised analysis • T2s do MC generation / reconstruction and non-scheduled analysis • T2s export the data to the nearby T1 MSS and keep the ESD/AOD • One copy of the current reconstruction pass until the new one is produced • The distributed analysis splits jobs to maximise data locality • Minimisation of additional data traffic • This has been taken into account in the estimation of the network traffic • It depends on our disk space requirements at the T2s being satisfied
Q8: MC data movement • Similarly, page 69 says a copy of every MC event will go to Tier 1 (for reconstruction) and then these are copied back to the Tier 2 where they were produced. There will be further MC events moved around as they are analysed; while signal can be easily produced for a particular analysis, the large numbers of generic background events needed will have to be pooled and hence will be needed at all Tier 2s. Has the bandwidth required been included? If not, what would this add in terms of rate in and out of Tier 2 as well as disk space needs at Tier 2?
Q8: MC data movement • T2s do MC generation and reconstruction • Underlying events are generated and shipped to the T1 MSS • Signal events are generated on-the-fly and merged with underlying events from the “local” pool • MC ESDs generated by the T2s are shipped to the T1 MSS but also kept at the T2s for subsequent analysis • T2-T2 traffic should be very low • This has been taken into account in the network traffic estimation
Q9: Data volume in 2007 • Page 62 states the assumptions for 2007 are 40% pp and 20% AA of a standard year. However, the event rate will be kept at nominal by loosening the triggers, which will allow studies of them. What are the financial implications of this, in terms of resources at the Tier 0 (or elsewhere) which need to be purchased in time for the 2007 run rather than delayed until the 2008 run, when they will be cheaper? How much would be saved by e.g. a factor of two reduction in the event rate, which would still give large amounts of data to debug the detector with? Page 72 states the full rate of looser triggers is needed to allow the discovery physics to be done, which implies the triggers are not very selective and many important events will be lost with the nominal settings. This needs further justification; specify the critical physics which is essential to do during the 2007 run which could not be done with e.g. half the event sample.
Q9: Data volume in 2007 • Event Samples & Triggers presented to the LHCC • PPR vol. 1 (CERN/LHCC 2003-049) • LHCC special session June 2002 (CERN/LHCC 2002-023, http://sks.home.cern.ch/sks/LHCbeamreq.ppt) • Large cross-section processes, global event properties • Measured in MB or central events (5-10% of MB) • First physics (e.g. multiplicity distributions) can be extracted from several hundred events • Rare probes (e.g. -meson pt, charm mesons) or signals with a very small signal-to-background ratio (e.g. -mesons at low pt, thermal photons) • Require many 10⁷ MB and central events • Rare events with specific triggers in ALICE • J/ψ or ϒ decays in the central detector and the muon arm • High-pt photons, jets, etc. • Require selective triggers, good DAQ livetime and maximum integrated luminosity; will take a longer time to address • Given the multiplicity ratio we need two orders of magnitude more MB pp events than heavy-ion events for comparable statistical errors (signal dependent)
Q9: Data volume in 2007 • Maximum rate limited by the SDD dead time to ~500 Hz • Can be raised to 1 kHz for pp (by reducing the SDD sampling rate) • DAQ bandwidth (1.2 GB/s) limits the HI rate • 100 Hz of MB Pb-Pb or 25 Hz of central events, assuming dN_ch/dy = 4000 for central Pb-Pb • DAQ and trigger guarantee a good livetime for rare triggers and fill the remaining bandwidth with MB events • We are not rate-limited • The DAQ bandwidth is a compromise between technical / financial constraints and the running time needed to accumulate a few 10⁷ Pb-Pb MB and central events (10⁹ for pp)
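As a back-of-the-envelope check, the average event sizes implied by the bandwidth and rates quoted on this slide follow directly (implied averages only, not official TDR figures):

\frac{1.2\ \mathrm{GB/s}}{100\ \mathrm{Hz}} \approx 12\ \mathrm{MB\ per\ MB\ Pb\text{-}Pb\ event},
\qquad
\frac{1.2\ \mathrm{GB/s}}{25\ \mathrm{Hz}} \approx 48\ \mathrm{MB\ per\ central\ Pb\text{-}Pb\ event}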
Q9: Data volume in 2007 • Short setup time in 2007 with cosmic triggers and single beams • Main systems (e.g. trigger scintillators and TPC) ready for MB events soon after • RHIC collected physics data within days of the first collisions and published less than four weeks later. We intend to do at least the same! • Initial low-luminosity pp running is of particular interest • At L < 10²⁹ cm⁻²s⁻¹ there is no pile-up in the TPC: cleaner and smaller events • We intend to take MB events (pp and, if possible, Pb-Pb) at the maximum possible DAQ rate for physics analysis • Even at ultra-low luminosities, the rate will be limited by the experiment, not the machine • Limiting the event rate is not an efficient use of a detector that is expensive to build and a machine that is expensive to operate • This is a one-off chance • The LHC luminosity (and therefore the event pile-up) will increase • Limiting the rate in 2007 would be harmful to the quality of the physics • The number of events, and the CPU requirements, depend on the initial LHC running time (pp and HI) • At 500 Hz we can collect 4 × 10⁸ MB pp events (40% of a standard year) in ≤ 10⁶ seconds
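The last bullet is simple arithmetic; written out explicitly (all numbers are taken from the slide above):

\frac{4\times10^{8}\ \mathrm{events}}{500\ \mathrm{Hz}} = 8\times10^{5}\ \mathrm{s} \;\lesssim\; 10^{6}\ \mathrm{s}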
Q10: AliEn • How much of the AliEn software is (or will become) common with LCG is not made clear. What is the overlap of these efforts and how do they mutually coordinate between them? How much effort is it to maintain AliEn? How much LCG code is foreseen to be incorporated into AliEn over the next two years? When is it expected that AliEn will be phased out completely?
Q10: AliEn • Coherent set of modular services • Used in production in 2001-2004 • Common Grid projects have progressively offered the opportunity to replace some AliEn services • Consistent with the plan announced by ALICE since 2001 • This will continue as suitable components become available • ALICE is taking an active part in the definition and testing of these components • Whenever possible, we will use “common” services • AliEn offers a single interface for ALICE users into the complex, heterogeneous (multiple grids and platforms) and fast-evolving Grid reality
Q10: AliEn • AliEn interfaces to the LCG services • LCG data management components (LFC, SRM) • Workload Management System (Resource Broker) • gLite Data Management components (FTS) • Virtual Organisation Management System (VOMS) • Common authentication model (GLOBUS) • Discovery service (planned) • Discussed by the Baseline Services WG, coordinated by the ALICE-LCG Task Force, tested in the Data Challenges • The interface with ARC is in progress; discussions with OSG are ongoing • The services provided by AliEn are • The ALICE job database and related distributed tools and services • The ALICE file and dataset catalogue and related distributed tools and services • ALICE-specific monitoring services • Essential components for distributed data processing • Their functionality is ALICE-specific and not found elsewhere • They are an integral part of the ALICE Computing Environment • We do not foresee phasing out these elements
Services for SC3 timeframe