This document contains the draft response to the questions posed in the PPRP document, focusing on priority areas for investment and potential impacts of funding at lower levels. It also includes inputs to scenario planning for GridPP3, specifically regarding changes in resource requirements for ATLAS, CMS, and LHCb experiments.
Collaboration Board Meeting: Draft Response to the PPRP. David Britton, 25/Oct/06
10 Questions The 10 questions are contained in the document “PPRP Sep GridPP feedback.doc” that was circulated with the Agenda for this meeting. The draft answers (not all completed) are contained in the document “DraftResponseToPPRP_V0.4.doc” also circulated with the Agenda. For this meeting, propose that just the response to Q2 needs to be presented in detail (it covers some of the other questions) but comments or questions are invited on any of the others. GridPP3
Question-2 The Panel would like to explore the priorities and potential options for descope. If funding were only available to support 30%, 50% or 70% of the total request, what would be the priority areas for investment in terms of obtaining the best UK science return? What would be the political and experimental impacts of funding at a much lower level? How would you prioritise the work packages?
Input to Scenario Planning - Resources Changes in the LHC schedule have prompted another round of resource planning. New global resource requirements were presented to the CRRB yesterday (Oct 24th). New UK resource requirements have been derived and incorporated in the scenario planning. Hardware prices have been re-examined following the recent Tier-1 purchase (CPU was much cheaper than expected). The decision was taken to use a more aggressive prediction for future hardware costs (i.e. to use our “best empirical estimate” rather than “a conservative estimate”) but also to increase the declared contingency on hardware spend from 15% to 25% over the lifetime of the project.
Input to Scenario Planning - ATLAS The priority of the ATLAS-UK collaboration to ensure the best science return is the hardware and its operation. Within this, ATLAS notes that UK Tier-2 resources contribute directly to the UK output, whereas shortages in Tier-1 resources affect all ATLAS physicists globally. For Tier-1 resources, ATLAS regard 70% of the requested capacity as barely manageable; 50% would do serious damage to the analysis capacity for the large UK physics community and would also threaten the calibration and commissioning of the SCT. To reduce the Tier-2 hardware, cuts would have to be made in simulation, calibration, and then analysis capability, but even the first of these will degrade physics output. Tier-2 cannot be cut below the 70% level. ATLAS has derived the UK fraction of the global requirements by noting that UK authorship is 12.5% of the global ATLAS Tier-1 authorship and that 4 out of 30 (13.3%) ATLAS Tier-2s are in the UK.
Input to Scenario Planning - CMS The priority of the CMS-UK collaboration is access to Tier-2 resources in the UK and access to Tier-1 resources, preferably in the UK. CMS argue that, given the savings in hardware due to changes in the cost estimates and the change in the LHC schedule, the 70% scenario could be achieved with only a small reduction in the level of hardware compared to the request. This would be at the threshold for CMS to host a UK Tier-1. In the 50% scenario, the priority for CMS would be to protect their Tier-2 resources, which would have to be hosted by a Tier-1 external to the UK. The revised CMS UK hardware request is based on a more detailed algorithm than a simple fraction of the global requirements. The scale is set by the dual requirements of (a) a minimum size for a CMS Tier-1 of 50% of the average CMS Tier-1 (~7% of global requirements) and (b) the UK fraction of Tier-1 authors (same basis as ATLAS) of ~8%. The details are calculated from the dual requirements to accept 4 out of CMS’s 50 data-streams (8%) and the need for the Tier-1 to serve an entire AOD dataset.
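The CMS scale-setting arithmetic can be sketched as below. This is only an illustration using the figures quoted on the slide; taking the larger of the two requirements is an assumed reading of "dual requirements", and the variable names are hypothetical rather than taken from the CMS algorithm.

```python
# Figures quoted on the slide (approximate where the slide says "~").
min_tier1_fraction = 0.07   # (a) 50% of the average CMS Tier-1, ~7% of global requirements
uk_author_fraction = 0.08   # (b) UK fraction of Tier-1 authors, ~8%

# Data-stream requirement: accept 4 of CMS's 50 data-streams.
streams_accepted = 4
streams_total = 50
stream_fraction = streams_accepted / streams_total

# Assumption: both requirements must be satisfied, so the larger one sets the scale.
scale = max(min_tier1_fraction, uk_author_fraction)

print(f"data-stream fraction: {stream_fraction:.0%}")
print(f"scale set by dual requirements: {scale:.0%}")
```

Under this reading the authorship fraction, not the minimum Tier-1 size, is the binding requirement, and it matches the 4-in-50 data-stream share.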
Input to Scenario Planning - LHCb The LHCb collaboration has a somewhat different computing model from ATLAS and CMS, with most analysis performed at the Tier-1 and the Tier-2 used predominantly for Monte Carlo simulation. LHCb prioritises Tier-1 hardware and its operation, followed by Tier-2 hardware and its operation, and finally support etc. As with the other experiments, the revised hardware requests from UK LHCb are based on the new global requirements presented to the CRRB on October 24th 2006. The UK fraction is calculated from the UK authorship fraction of 18.6% (revised from 16.6% at the time of the GridPP3 submission). The Tier-2 resource request also includes 18.6% of the global LHCb Tier-2 resource shortfall of 30%, to give a total of about 24% of the global Tier-2 requirements. It is noted that any fall below the global authorship fraction of 18.6% at either the Tier-1 or Tier-2 would have to be negotiated in a global context.
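The "about 24%" figure for the LHCb Tier-2 request can be reproduced with a short calculation. This is a minimal sketch that assumes the request is the authorship share of the global requirements plus the same share of the 30% shortfall; the variable names are illustrative only.

```python
# Figures quoted on the slide.
uk_authorship = 0.186         # UK fraction of LHCb authorship
global_t2_shortfall = 0.30    # global LHCb Tier-2 resource shortfall

# Assumed reading: authorship share of the requirements,
# plus the same authorship share of the global shortfall.
uk_t2_fraction = uk_authorship + uk_authorship * global_t2_shortfall

print(f"UK share of global LHCb Tier-2 requirements: {uk_t2_fraction:.1%}")
```

The result is 24.2%, consistent with the "about 24%" quoted above.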
70% Scenario An example 70% scenario based on Experiment Inputs and a bottom-up examination of all posts.
What has been lost in the 70% scenario? - 15% of Hardware
• Hardware at the Tier-1 and Tier-2 is reduced by 15%.
• This contributes to a global shortfall of Tier-1 resources for all three LHC experiments.
• If cuts are applied uniformly, this takes CMS to the threshold level for a UK Tier-1.
• If cuts are applied uniformly, this would reduce LHCb UK Tier-1 resources below the UK authorship level (which will require negotiation within the collaboration and may result in other costs).
• The reduction of hardware directly impacts the ability of UK groups to produce physics output and will be a competitive disadvantage.
What has been lost in the 70% scenario? - 7% of Tier-1 Staff Effort
• Staffing effort at the Tier-1 is barely adequate to meet MOU quality of service as it is. Staffing effort does not scale linearly with hardware.
• Cuts are achieved by removing the 3-FTE ramp-up of Tier-1 staff in the GridPP2+ period and 1 FTE during the GridPP3 period (probably from the incident response team).
• The working allowance, previously included to address the risk of failing to meet MOU service levels, has also been removed.
• The net result is a significant increase in the risk that the Tier-1 service levels will not be met.
What has been lost in the 70% scenario? - 11% of Tier-2 Staff Effort
• The Tier-2 staff would be reduced by 1.75 FTE out of 14.75.
• This is likely to contribute to either or both of: (a) a reduction of Tier-2 resources levered from the institutes; (b) a reduction in the service level achieved at the Tier-2s.
• The net result is an increase in the risk that the Tier-2 resource and service levels will not be met.
What has been lost in the 70% scenario? - 31% of Support Staff Effort The Data Management post (1 FTE) for Replica Optimisation is not funded. This work was judged to be a potentially good investment to optimise the use of limited storage resources. Removing funding for this post removes the likelihood of much greater savings on the purchase of storage resources in the future, as the use of storage will remain inefficient for longer. A reduction in data storage support (0.5 SY) reduces the flexibility to support multiple storage technologies in the UK. (GridPP does not wish to support more storage technologies than necessary but recognises the possible need.) Continuing support (0.5 FTE) for the GridPP Real Time Monitor would not be funded. The RTM is the face of the LCG/EGEE grid: a highly visible and acclaimed demonstration showpiece that has repeatedly illustrated the UK’s position as a major international player in this field.
What has been lost in the 70% scenario? - 31% of Support Staff Effort A 1-FTE reduction in the support for the R-GMA information and monitoring system. This major UK contribution is deeply embedded in the EGEE/LCG stack. The ultimate success of this endeavour may now hinge on securing external support, significantly increasing the risk of turning a UK success story into a UK failure. The Security Vulnerability work (0.5 FTE) would be dropped. During GridPP2 the UK has proactively taken a leading international role in developing security vulnerability policies and procedures. Support for GridSite would be reduced by 0.5 FTE. The GridSite security toolkit, developed by GridPP, is embedded in the EGEE/LCG middleware and used as the basis for the GridPP and other websites, together with the GridSiteWiki. A Networking post in the GridPP3 proposal, designed to help network provision and network monitoring, would be reduced to 50%. This reduces the network support at a time when the network will be coming under intense stress and production standards are required.
What has been lost in the 70% scenario? 12% of Operations; 10% of Management; 25% of Outreach. Support for the UK Grid Operations Centre in GridPP3 would be reduced from 3 to 2 FTE. The current manpower is 5.5 FTE, funded by EGEE. This increases the risk that the Grid Operations Centre, on which GridPP relies to provide Grid monitoring, ticketing and accounting, would not function effectively. In the 70% scenario the task of managing the project is likely to be as great as, if not greater than, for the full proposal. Nevertheless, management effort would be reduced by not buying out 25% of an FTE for the User Board Chair. There is a risk that the User Board would not be as proactive in collecting or presenting the users’ requirements and concerns as we had desired. The 0.5 FTE requested for Industrial Liaison would be dropped. This means that we are unlikely to establish much industrial outreach.
50% Scenario An example 50% scenario based on Experiment Inputs and a bottom-up examination of all posts.
What has been lost in the 50% scenario? - 40% of Tier-1 Hardware 40% of the Tier-1 HW will be lost. All three LHC experiments will need to negotiate the consequences of providing significantly fewer Tier-1 resources than their UK author fraction. The UK could no longer host a CMS Tier-1 centre, and special arrangements would need to be made to provide UK CMS Tier-2s with access to resources and support at a non-UK Tier-1. For ATLAS and LHCb, this level of Tier-1 resource would do serious damage to the analysis capacity for the large UK physics communities, and for ATLAS it would also threaten the calibration and commissioning of the SCT.
What has been lost in the 50% scenario? - 30% of Tier-2 Hardware 30% of the Tier-2 HW will be lost. The physics output for all three experiments would be reduced. Competitive advantage would be completely lost. ATLAS would apply reductions to simulation, calibration, and then analysis capability, but even the first of these will degrade physics output. LHCb would reduce Monte Carlo simulation, similarly compromising physics output. As the Tier-2 would be CMS’s sole UK resource, the reduction would directly scale the CMS physics output.
What has been lost in the 50% scenario? 22% of Tier-1 Staff; 23% of Tier-2 Staff. Tier-1 staff would be further reduced from 17 to 14 FTE. Comparing this with the current level of 13 FTE, it is quite apparent that the Tier-1 level of service defined in the MOU signed by PPARC could not be met. There would need to be international negotiations as to whether the Tier-1 could function as such for either of the remaining two experiments. Tier-2 staff would be further reduced from 13 to 11 FTE. This is likely to contribute to either or both of (a) a reduction of Tier-2 resources levered from the institutes; (b) a reduction in the service level achieved at the Tier-2s.
What has been lost in the 50% scenario? - 66% of the Support Staff lost. The support post for generic metadata issues would be lost and all support would have to be via the experiments. Support for grid storage technologies would be reduced from 7 SY to 2 SY over the project. This would (probably) be limited to Castor support at CCLRC. Institutes would need to look elsewhere for support on the technologies likely to be deployed therein. The portal work would be stopped, leaving the smaller or future experiments with a higher hurdle to getting on the Grid. The testing and performance monitoring work associated with the Work Load Management system would stop. This is an area where there is strong European pressure to continue, and it is of potentially direct benefit to UK physics by providing knowledge about the current condition of the Grid on a site-by-site basis.
What has been lost in the 50% scenario? - 66% of the Support Staff lost. Support for information and monitoring systems would be reduced to 1 FTE. (R-GMA could not be supported; this post would have to help the transition to whatever new system evolved internationally.) Security support would be reduced to 1 FTE. This would be split as deemed appropriate at the time between VOMS support and Operational Security. The support for GridSite (an international obligation) would be dropped. The LCG/EGEE middleware stack would be at risk. The networking support post for monitoring and provision would be lost. This would be in a regime where the need for network support has become more critical, with at least one of the major experiments attempting to use a non-UK Tier-1.
What has been lost in the 50% scenario? - 30% of the Operations Staff; 25% of Management; all dissemination (except GridPP2+ period). Support for the Grid Operations Centre would be further reduced from 2 to 1.5 FTE, further increasing the risk that the GOC, on which GridPP relies to provide Grid monitoring, ticketing and accounting, would not function effectively. One of the four Tier-2 coordinators would be lost. This increases the risk of failure of part of the Tier-2 organisation, reduces the deployment team, and increases the likelihood that delays to upgrades at some sites will reduce the available resources, with a direct impact on physics output. Management would be further reduced (this would have to be optimised). There is a risk that the management becomes less engaged and therefore less effective. All dissemination and outreach activities would be stopped after the GridPP2+ phase is complete.
30% Scenario GridPP has examined the original PPARC call and has determined that it is unable to form a proposal that meets any of the criteria listed with funding at the 30% level: 2. a) Underpin the particle physics programme by delivering the functional Tier 1 centre for the LHC experiments and for the other experiments where UK groups will require computing GRID access and facilities. The 50% scenario presented above already fails to meet this criterion because the Tier-1 would be sub-threshold for at least one of the LHC experiments. At the 30% funding level there could only be a Tier-1 for (probably) one LHC experiment. Most likely, in a 30% scenario there would be no Tier-1 and the resources would be used as a Tier-2 (though it is not clear what to do about LHCb). Etc. (see document)
Summary GridPP has taken input from the three large LHC experiments as guidance in an attempt to design a GridPP3 project in 70% and 50% funding scenarios. The outcome is a 74% funding scenario that preserves 85% of the hardware (just about the threshold for a UK CMS Tier-1) and a 55% scenario that basically doesn’t work (it does not respect the criteria of the call and there are large political and financial unknowns associated with delivering less than a pro-rata share of LHC hardware). Given the latter, we did not attempt to address the 30% scenario in a detailed manner. We do not regard the fine details of these scenarios as fixed, but they are proposed as the starting points to address reduced funding outcomes. GridPP asks the CB to endorse this approach for presentation to the PPRP.