Climbing Hills. Prof. David Britton, GridPP Project Leader, University of Glasgow. GridPP25 Collaboration Meeting, 26th August 2010. IET, Oct 09
Grid Growth [Plot panels: “Last Month”; “Last Week”; “Last 5 years”; “Last Year”] GridPP25
A Grid for All “Last Quarter” GridPP25
Step by Step
GridPP1: “From Web to Grid”
GridPP2: “From Prototype to Production”
GridPP3: “From Production to Exploitation”
GridPP4: “Computing In the LHC era”
[Plot annotations: “Last 5 years”; “Next 5 years”; Finishing Line]
GridPP25
GridPP4 Time Line
• Nov 5th – Invitation to bid.
• Dec 10th – Face-to-face PMB to agree structure.
• Dec 11th – CB meeting to agree structure.
• Jan 15th – Face-to-face PMB to agree draft v6.
• Jan 21st – CB meeting to discuss v6.
• Jan 28th – Submission of draft to Oversight Committee.
• Feb 4th – Meeting with Oversight Committee + STFC.
• Feb 12th – Near-final draft incorporating feedback.
• Feb 22nd – Final comments/typos/corrections done.
• Feb 24th – Submitted!
• Mar 4th – Last possible submission date.
• Apr 15th – PPRP.
• May 14th – PPRP Visiting Panel.
[Timeline annotations: Eyjafjallajökull; 12 Weeks; 3 Weeks]
GridPP24
PPRP Question-2
2. What consideration has been given to the impact of the revised LHC schedule announced by CERN following the Chamonix meeting in January 2010? Can more information be provided as to why you require the manpower and the hardware on the proposed schedule and at the level requested with respect to the new LHC schedule?
[Diagram labels: Chamonix 2010 (Jan 2010); UK Collaboration sizes; LHC Schedule; Global Resource Requirements; UK Resource Requirements; Experiment Computing Models; UK Resource Request; Hardware Costings; Scrutiny by C-RSG (April 2010), CRRB, LHCC (May 2010); Experience with real data (March+ 2010)]
• Requirements have, and will continue to, evolve. GridPP planning now reflects latest info.
• Chamonix has made a significant change but financial impact mainly in final year of GridPP3.
• Additional complication with GridPP3 funding profile requires dove-tailing with GridPP4.
PPRP
PPRP Question-3
3. Can you please clearly explain the current situation with regard to cooperation with and support from the other Tier-1s. How does this proposal benefit from collaboration with other Tier-1 centers (for example joint work, best practices etc.)?
Strategically: Tier-1 benefits through common policies agreed at MB and GDB level (plus working groups).
Tactically: Tier-1 benefits by learning of problems/solutions/best-practice via the bi-weekly Tier-1 Service Coordination Meetings and other technical forums such as HEPiX and its working sub-groups.
Operationally: Tier-1 benefits from immediate feedback via the Daily Operations Meeting.
Cooperation and collaboration between Tier-1s also happens at the experiment level via the computing operations teams and by pairing, wherein RAL has special relationships with specific Tier-1s to exchange custodial data.
[Diagram labels: DB, JG; GDB; MB; RAL; Tier-1 (repeated); Biweekly S. C. M.; Daily Ops Meeting]
PPRP
PPRP Question-4 4. The Panel would like to understand how the UK is performing in comparison with the wider wLCG effort. Has a performance comparison been done for Tier-1s (globally) and Tier-2s (both nationally and globally)? How was this done (e.g. what performance metrics were used) and how has the output been used? A variety of measures were presented in the written document (thumbnails below). The basic message is that the UK Tier-1 and Tier-2s perform above average to excellent when compared globally. The Tier-2s are compared nationally by wLCG and the experiments and these were used to inform the choices made in the GridPP4 proposal. PPRP
PPRP Questions 5 + 6
5. Please explain how usage of GridPP by the wider community has been considered? What inputs from non LHC experiments have been considered (e.g. T2K, SUPERNEMO)?
6. Please clarify why you were unable to arrive at reasonable resource estimates for non-LHC Particle Physics users? Please explain how you arrived at the 10% additional resource requested for non-LHC experiments.
• The wider community were invited to provide written input.
• Inputs received from BaBar/SuperB, H1, ILC, MINOS, NA62, PhenoGrid, and UKQCD.
• No input received from MICE, SuperNemo, T2K, or SNO+ …
• GridPP took this partial input and factored in our observations that (a) resource use by the wider community was unlikely to fall over GridPP4; (b) 15% of resources had been used by non-LHC experiments in 2009; and (c) 28% of UK particle physicists likely to be doing data analysis were on non-LHC experiments, to arrive at a minimum reasonable request of 10% of the LHC resources for the wider community.
PPRP
PPRP Questions 7 + 13
7. What are the targeted benefits to gain from European collaborations, such as EGI, other EU projects/infrastructures? What are the risks and implications in case EGI will not be successful?
13. Can you please clearly explain your dependency on other infrastructure and initiatives (e.g. JISC, EGI and wLCG) and how it will be managed should dependencies not be met?
• Targeted benefits of EGI:
  • Strengthen the GridPP Operational Security team.
  • Expand and develop our operational management (GOCDB) and accounting (APEL) of the Grid to broaden opportunities for future support.
  • Expand UK distributed operations support team by levering matching posts.
  • Harmonization of collaborative computing operations in Europe to ensure longevity.
  • Reduce load on wLCG Tier-1s by enabling alternative support of Tier-2s in countries without a Tier-1.
• Targeted benefits of JISC and GEANT: Deployment and operation of the UK academic network, JANET, and the OPN across Europe (PC and RT on high level committees).
• Targeted benefits of EMI: Ensuring the wLCG middleware is integrated into the future European strategy.
• Targeted benefits of wLCG: Fundamental dependence on, and fully integrated partner of, wLCG; represented at the highest levels (GDB, MB, OB).
PPRP
PPRP Questions 7 + 13 [Diagram of funding scenarios; labels: NGS4 and EGI funded; GridPP4 funded; EGI not funded; NGS4 not funded; RCUK e-Science Review]
PPRP Question-8
8. There seems to be marked difference between the manpower required for Tier-1 and Tier-2 centres when scaled for comparison against size of operation. Please can you justify? The Panel would like to understand what impact it will have on the UK contribution if the manpower is reduced in those centres where manpower is currently higher than the norm.
Clarification on point 8: For the first bit of the question we are referring to Tier2. The question for the second part should read as follows: "The size of operation at the Tier-2 sites varies considerably. The Panel would like to explore the requested manpower for operation and maintenance of the Tier-2 sites and how a reduced effort could be accommodated by pooling expertise at neighbouring sites."
Comparisons of “size of operation” need to consider the level-of-service delivered and the type of resources, in addition to the capacity. In particular, large disk storage systems, tape robot infrastructure, and large Oracle databases are all significantly more challenging than CPU. We are confident that the estimate of 26.6 FTE for the Tier-1 is a robust estimate of the effort required, based both on an international survey and on our own experience.
To descope the project by 20%, the advantages/disadvantages of pooling Tier-2 effort were explored. The Tier-2 roles were distinguished in terms of Group Analysis, Simulation, and User Analysis, and manpower was optimised to reflect these roles. We believe this is the optimum balance between pooling effort and gaining advantage from a distributed system (local support; leverage of institutional support and resources; mitigation of risk; developing future options).
PPRP
PPRP Question-9
9. Recognising the transition of GridPP to a production phase, can you outline what plans /steps you will be taking to seek ongoing efficiency improvements?
• UK Grid must triple in capacity over lifetime of GridPP4: Efficiency improvements are central to our delivery plan with a flat manpower profile.
• Three areas:
  • Efficiency improvements in deploying/managing/operating the hardware. This will be achieved by continuing to develop our tools and procedures; by adopting best practice from our international partners; and by identifying and disseminating best practice at the Tier-2 sites in the UK. This will be delivered by actively participating in all the relevant national and international meetings; by monitoring and comparing performance; and by reviewing progress.
  • Increasing the efficiency with which the experiment soft/middle-ware can use the Grid infrastructure. This will be delivered by the posts requested in WP-D.
  • Increasing the efficiency with which data is handled at all levels from the basic i/o to worker-nodes up to the handling of file transfers and metadata. This will be delivered by the data management posts in WP-C.
PPRP
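To put the tripling target in perspective, the implied compound annual growth can be sketched as below. The four-year lifetime is an assumption made here purely for illustration (the slide does not state the duration), and `annual_growth_factor` is a hypothetical helper name, not something from the proposal.

```python
def annual_growth_factor(total_factor: float, years: float) -> float:
    """Constant yearly multiplier that compounds to total_factor over years."""
    if total_factor <= 0 or years <= 0:
        raise ValueError("total_factor and years must be positive")
    return total_factor ** (1.0 / years)


TRIPLE = 3.0        # from the slide: capacity must triple
ASSUMED_YEARS = 4   # assumption: the duration is not stated on the slide

factor = annual_growth_factor(TRIPLE, ASSUMED_YEARS)
growth_percent = 100 * (factor - 1)
print(f"Implied capacity growth: about {growth_percent:.0f}% per year")
```

With a flat manpower profile, the same factor is also the yearly increase in capacity that each FTE would need to support, which is why the slide treats efficiency as central.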
PPRP Question-10
10. Can you provide more information about the hardware experts that you are hiring?
Only limited new hiring due to redistribution of Tier-2 posts. Roles are described in Appendix-A of the proposal. The roles require much more than hardware expertise at both the Tier-1 and the Tier-2. At the latter, the roles are particularly multifaceted.
We maintain a list of publications (best-effort basis, so incomplete) on the GridPP website at: http://www.gridpp.ac.uk/papers/. Currently 220 publications (2001 – 2009).
• 2009 papers: 22 papers, 105 (~70 unique) authors
PPRP
PPRP Question-11
11. We understand that the request of 3.5k of travel per year, per person, is based on GridPP3 experience; however, given the current funding climate, is it really essential to allocate travel at this level and have you considered reducing travel costs by increasing the use of video/audio conferencing tools?
GridPP works in an international collaboration (wLCG) but does not have staff abroad, so some travel is required, and one might expect this to be commensurate with the (non LTA) travel of the experiments. In practice, GridPP manages travel at its current level by requesting co-funding from experiments for some trips. This dual-key approach ensures engagement with the experiments in a relevant way.
GridPP uses video/audio conferencing extensively and has significant expertise and experience in this area. Our website contains recommendations and the UK contributed significantly to the LHC report on collaborative tools. However, there are times when national and international travel cannot be replaced by alternatives. To do so would reduce the influence and effectiveness of GridPP, limit engagement and compromise technical progress.
We believe the proposal as submitted requests a reasonable and responsible travel budget that is necessary for the functioning of the project.
PPRP
PPRP Question-12
12. Tier-2 contributions by some host institutes are not high proportionally. Please provide details of how you intend to involve UK Universities in the Tier-2 centre investment and increase contributions to infrastructure and operation. What is the timeframe for this involvement?
The collective investment from the Tier-2 institutes in GridPP is extremely large. Although a bottom-up estimate on an institute-by-institute basis is not possible, a top-down order-of-magnitude estimate of contributions in 2005-2009 is as follows:
Capital costs (machine rooms etc.): £10.7m (extrapolated from specific examples)
Hardware (above that funded by GridPP): £3.3m (from looking at resources delivered to EGEE)
Electricity costs: £2.5m (based on average power costs)
Manpower not funded by GridPP: £1.9m (based on GridPP quarterly reports)
-----------------------------------------------------------------
Total non-GridPP Tier-2 investment: £18.4m
-----------------------------------------------------------------
This compares with GridPP investment of: £9.7m (staff and hardware)
This investment has come through multiple paths including JIF, SRIF, HEFCE, SFC, Regional-Development Grants, etc. We believe all institutes have contributed to infrastructure and operational costs. This is a significant (and probably unique) contribution to the costs of the project, demonstrating the involvement of the institutions, and for which we are very grateful.
PPRP
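The top-down total can be checked with a few lines of arithmetic. The sketch below simply re-adds the four figures quoted on the slide and compares the result with the GridPP investment; the variable names are illustrative and not taken from the proposal.

```python
# Non-GridPP Tier-2 contributions for 2005-2009, in £m, as quoted on the slide.
tier2_contributions_gbp_m = {
    "capital costs (machine rooms etc.)": 10.7,
    "hardware above that funded by GridPP": 3.3,
    "electricity": 2.5,
    "manpower not funded by GridPP": 1.9,
}
gridpp_investment_gbp_m = 9.7  # staff and hardware, from the slide

# Round to one decimal place, matching the precision of the quoted figures.
total_non_gridpp = round(sum(tier2_contributions_gbp_m.values()), 1)
ratio = total_non_gridpp / gridpp_investment_gbp_m

print(f"Total non-GridPP Tier-2 investment: £{total_non_gridpp}m")
print(f"Ratio to GridPP investment: {ratio:.1f}x")
```

On these figures the institutes contributed a little under twice what GridPP itself invested in the Tier-2s.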
PPRP Question - 14
14. Can you expand on how you intend to promote Technology or Knowledge Transfer? For example, with regard to the exploitation of middleware/security capability?
Build on the recognised successes of GridPP3, formalising the current ad-hoc process by introducing a steering group with members from within and beyond GridPP. This would enable a more targeted approach to complement the current opportunistic and reactive environment. The KT activities would be better linked to external bodies (e.g. Digital Systems KTN, Scalable Computing, NGS/NGI/EGI, Impact QM!, etc) with a more structured approach.
In the middleware/security area, there are two jewels preserved in the GridPP4 proposal: GANGA (WP-D posts), which has been taken up widely and has the potential to attract new interest; and the GridSite security toolkit (WP-C post) that is embedded in the gLite middleware and also has some uptake as a website-construction toolkit. GridPP has also led the development of security policy for wLCG and EGEE and this has applicability to many Grids.
PPRP
Descoping The GridPP4 proposal as submitted had already undergone an extensive process of reduction to assimilate cuts of 20% required in December 2009. During that process, all the potential options for descoping were explored and all investments were prioritised based on our past experience and our understanding of the future requirements. The submitted proposal was carefully balanced and left no scope for further reduction by simply prioritising one work package over another; there are no optional extras included in the bid. We have attempted to respond to the PPRP question on further reductions, at two levels: In order to go from a 20% to a 25% reduction, we have provided a detailed list of additional items that we would consider removing in order to save about £1.5m. For the other scenarios, we give examples of areas where the scope of the project might be reduced but presume this would require consideration by PPAN as this is essentially a science prioritisation exercise. PPRP
£1.5m Reduction
• Hardware re-planning (depends on GridPP3): £330k
• Reduced travel (WP-G) – This will reduce operational efficiency, experiment engagement, and international influence, and impede technical progress: £150k
• Reduced project management (in 2013-14, WP-E) – Would not re-appoint retiring deputy project leader in 2012. Some risk that this reduces our potential international impact and removes some high-level oversight at the Tier-1: £140k
• Reduced Impact/KE activity (WP-F) – We will not be able to respond to the increased emphasis on Impact: £147k
• Reduced support for non-LHC experiments (WP-D) – Community will not be able to fully capitalize on the investment in a Grid infrastructure: £180k
• Remove Working Allowance (W.A.) – The 20% cuts have already reduced effort to a critical level in WP-A and WP-B. This was addressed by the W.A. It makes no sense to cut effort in WP-A and WP-B further whilst preserving the W.A.: £491k
Total Reduction: £1.44m.
Project would now have been reduced by 25% but tries to preserve the core mandate to deliver a computing Grid to the LHC experiments in an internationally competitive way.
PPRP
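As a quick consistency check, the six line items above can be summed to confirm the quoted total. This is only a sketch that re-adds the slide's own numbers; the dictionary keys are shorthand labels chosen here.

```python
# Line items of the additional reduction, in £k, as quoted on the slide.
reduction_items_k = {
    "hardware re-planning": 330,
    "reduced travel (WP-G)": 150,
    "reduced project management (WP-E)": 140,
    "reduced Impact/KE activity (WP-F)": 147,
    "reduced non-LHC support (WP-D)": 180,
    "remove Working Allowance": 491,
}

total_k = sum(reduction_items_k.values())
total_m = round(total_k / 1000, 2)

print(f"Total reduction: £{total_k}k (about £{total_m}m)")
```

The items add up to £1,438k, which rounds to the £1.44m on the slide and is slightly under the "about £1.5m" target.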
Working Allowance and Contingency
• 2 years of effort to address Risk-1 (CASTOR) and Risk-6 (Tier-1 Service Level).
• 4 years of effort to address Risk-9 (Tier-2 Service Level and EGI end).
• Hardware costings: Risk-11, 19, 20 (Tier-1); Risk-14 and 21 (Tier-2s); Risk-18 (both).
• 4 years of effort to address Risk-15 (EGI/NGI transition) and Risk-22 (EGI funding).
• 3 years of effort to address Risk-12 (NGS4).
PPRP
Larger Reductions
• To make larger reductions, the project scope would need to be redefined. It was assumed this would have to be done at the PPAN level. Two possible scenarios were considered:
• Remove all support for non-LHC experiments:
  • 10% of Tier-1 hardware: £580k
  • 0.5 FTE at Tier-1: £170k
  • 10% of Tier-2 hardware: £250k
  • 0.5 FTE at Tier-2: £160k
  • 0.5 – 1.0 FTE of support: £180k - £350k
  -------------------------------------------------------------------------------
  Total Reduction: up to £1.5m
• Reduce support for LHC experiments (example scenario):
  • Remove ALICE support for 11/12: £100k
  • 20% reduction in ATLAS group analysis (1.5 FTE): £475k
  • 20% reduction in CMS group analysis (0.75 FTE): £237k
  • Comparable reduction in LHCb activities: £158k
  • Reduction in Tier-2 hardware: £300k
  • Reduction in data-support (1 FTE): £325k
  ----------------------------------------------------------------------------------
  Total Reduction: £1.6m
PPRP
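The two scenario totals can be reproduced from their line items. The sketch below re-adds the figures from the slide, treating the support line in the first scenario as a range; all names are illustrative.

```python
# Scenario 1: remove all support for non-LHC experiments (figures in £k).
non_lhc_fixed_k = {
    "10% of Tier-1 hardware": 580,
    "0.5 FTE at Tier-1": 170,
    "10% of Tier-2 hardware": 250,
    "0.5 FTE at Tier-2": 160,
}
non_lhc_support_range_k = (180, 350)  # 0.5 to 1.0 FTE of support

fixed_k = sum(non_lhc_fixed_k.values())
scenario1_low_k = fixed_k + non_lhc_support_range_k[0]
scenario1_high_k = fixed_k + non_lhc_support_range_k[1]

# Scenario 2: reduce support for LHC experiments (example scenario, £k).
lhc_items_k = {
    "remove ALICE support for 11/12": 100,
    "20% reduction in ATLAS group analysis": 475,
    "20% reduction in CMS group analysis": 237,
    "comparable reduction in LHCb activities": 158,
    "reduction in Tier-2 hardware": 300,
    "reduction in data-support": 325,
}
scenario2_k = sum(lhc_items_k.values())

print(f"Scenario 1: £{scenario1_low_k}k to £{scenario1_high_k}k")
print(f"Scenario 2: £{scenario2_k}k")
```

The upper end of the first scenario (£1,510k) matches "up to £1.5m", and the second scenario (£1,595k) rounds to the quoted £1.6m.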
PPAN Feedback
PPAN feedback received August 11th:
“The GridPP proposal was considered by the Particle Physics, Astronomy and Nuclear Physics Science Committee (PPAN) at the meeting held on 20 July 2010. PPAN has recommended support for the proposal. This is at a reduced level to the original request, but broadly in line with the advice received from the Projects Peer Review Panel (PPRP).”
“However, while agreed in principle, STFC is unfortunately not able to make the recommended commitment in full at this time due to the funding uncertainties arising from the challenging CSR 2010 exercise now underway. Consequently, it has been decided to make an interim award, pending the CSR outcome.”
Which means some, but not all, of the money for the first two years.
[Diagram labels: STFC Council; Science Board; PPAN; PPRP]
GridPP25
What is the recommendation?
• There are two types of money – Capital (most of the hardware) and Resource (everything else). The balance of these has been fixed by PPAN, which is an additional constraint.
• PPAN’s recommendation is basically our first (£1.5m) reduction scenario plus a hybrid of the two additional scenarios we proposed for larger reductions.
• There is reduced support (but not zero) for non-LHC experiments.
• There is a 10% reduction (not 20%) in the group analysis support for the LHC experiments.
GridPP25
Implementation
• In some instances the PPAN/PPRP feedback has explicitly cut posts and/or roles. Group leaders have received this information.
• Implementation of the other cuts is being/will be discussed with the Experiments and Institutes concerned, guided by strategic decisions at Monday’s PMB meeting.
• Following all discussions, a revised GridPP4 plan will be prepared in early September, ratified by the CB, and presented to STFC for final approval.
• It is hoped that the uncertainty generated by the CSR will be resolved before GridPP4 commences.
• We should not lose sight of the fact that it is a major success to secure this level of funding at this difficult time. Although things will be challenging, I believe we can deliver the Grid that is required.
GridPP25
Top 10 Challenges
• Funding
• Manpower
• Data Management
• Data Storage
• Evolving computing models
• Hardware management
• Hardware provision
• EGI/NGI
• Security
• The unexpected
• Moving Roger to the A-Team by 2011
GridPP25