GridPP Overview Tony Doyle GridPP13 Collaboration Meeting
Contents
• Technical Design Reports
• Timescales
• Oversight Committee Summary
• Current concerns
• Actions (and how these were addressed)
• Feedback from the July 1 (OC7) meeting
• “Get Fit” Plan and Problem Solving
• Beyond GridPP2..
June Reports
Computing Technical Design Reports: http://doc.cern.ch/archive/electronic/cern/preprints/lhcc/public/
• ALICE: lhcc-2005-018.pdf
• ATLAS: lhcc-2005-022.pdf
• CMS: lhcc-2005-023.pdf
• LHCb: lhcc-2005-019.pdf
• LCG: lhcc-2005-024.pdf
LCG Baseline Services Group Report: http://cern.ch/LCG/peb/bs/BSReport-v1.0.pdf
Contains all you (probably) need to know about LHC computing
Timescales
• Service Challenges – UK deployment plans
Functionality
Fits on a page. Concentrate on robustness and scale. Experiments have assigned priorities.
July Documents
• PPARC Oversight Committee Papers
• Seventh GridPP Oversight Committee (July 2005)
• Executive Summary
• Project Map
• Link to Project Map Database (Excel) Version (v2)
• Resource Report
• LCG Report
• EGEE Report
• Deployment Report
• Middleware/Security/Network Report
• Applications Report
• User Board Report
• Tier-1/A Report
• Tier-2 Report
• Dissemination Report
• UK Analysis
• Metrics and Deployment
• Middleware Planning
• Experiment engagement questionnaire
• See http://www.gridpp.ac.uk/docs/oversight/
Addressed various concerns of the OC
Exec2 Summary
• GridPP2 has already met 21% of its original targets, with 86% of the metrics within specification
• “Get fit” plan described (requested by OC)
• gLite 1 was released in April as planned, but components have not yet been deployed or their robustness tested by the experiments
• Service Challenge (SC) 2, addressing networking, was a success at CERN and the Tier-1
• SC3, addressing file transfers for the experiments, is about to commence
• Long-term concern: hardware at the Tier-1 in 2007-08
• Short-term concerns: under-utilisation of resources and the deployment of Tier-2 resources
RAL joins labs worldwide in successful Service Challenge 2
• The GridPP team at Rutherford Appleton Laboratory (RAL) in Oxfordshire recently joined computing centres around the world in a networking challenge that saw RAL transfer 60 terabytes of data over a ten-day period. A home user with a 512 kilobit per second broadband connection would be waiting 30 years to complete a download of the same size.
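The "30 years" comparison can be checked with a few lines of arithmetic. The sketch below is only a back-of-envelope check, assuming decimal terabytes (1 TB = 10^12 bytes), a link that sustains its full nominal rate, and no protocol overhead; the function and constant names are illustrative and do not come from the talk.

```python
SECONDS_PER_DAY = 24 * 3600
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY
TERABYTE = 1e12  # assumption: decimal terabytes


def transfer_time_seconds(size_bytes: float, rate_bits_per_s: float) -> float:
    """Time needed to move size_bytes over a link sustaining rate_bits_per_s."""
    return size_bytes * 8 / rate_bits_per_s


def average_rate_bits_per_s(size_bytes: float, duration_s: float) -> float:
    """Average rate needed to move size_bytes within duration_s."""
    return size_bytes * 8 / duration_s


data = 60 * TERABYTE

# Home broadband at 512 kbit/s (figure quoted on the slide).
home_years = transfer_time_seconds(data, 512e3) / SECONDS_PER_YEAR

# Average rate implied by moving the same data in ten days.
ral_rate = average_rate_bits_per_s(data, 10 * SECONDS_PER_DAY)

print(f"Home download: {home_years:.1f} years")
print(f"Implied average rate over ten days: {ral_rate / 1e6:.0f} Mbit/s")
```

Under these assumptions the home download takes a little under 30 years, consistent with the slide, and the ten-day transfer corresponds to an average of roughly 0.5 Gbit/s.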
gLite 1
100 green sites sitting on a grid
• Thu 16 Jun 2005
• Last week the UK CIC-on-duty team celebrated the milestone of having 100 sites passing the Sites Functional Test. Thanks to all the sites who responded promptly to trouble tickets raised by the UK team during their shift.
Current concern 1: under-utilisation
• Under-utilisation of existing Tier-1/A resources
• improving overall and w.r.t. Grid fraction from 2004 to 2005
[Figure: Tier-1/A usage, Non-Grid vs Grid]
Current concern 2: under-delivery
• The current situation is somewhat better than these 2005 Q1 numbers indicate
• Some late procurements (OK given under-utilisation)
• Technical problems (being overcome)
Longer-Term concern: allocations
Starting point: fair shares input to BaBar and LHC MoUs
Metrics and Deployment
• GridPP is a significant contributor to EGEE (20%)
• CPU utilisation is low
• Disk utilisation is climbing (but still very low)
Metrics and Deployment
• Site upgrade improvements – quarterly upgrades within 3 weeks
• gradual improvement in site configuration and stability
• Reflects systematic approach and measurable improvements in deployment
GridPP Deployment Status 2/7/05 (9/1/05)
Measurable Improvements
Actions
GridPP to submit the proposal for LCG phase 2 funding to the Committee prior to its submission to Science Committee (minute 4.9).
• Done. 27-page report inc. input from OC at http://www.gridpp.ac.uk/docs/gridpp2/SC_GridPP2_LCG_1.0.doc (unfunded)
GridPP to clarify the situation with regard to ATLAS production run tests for the next physics workshop (minute 5.3).
• See News Item http://www.gridpp.ac.uk/news/-1119651840.463358.wlg
• (and slide)
GridPP to provide an update on progress resolving problems caused by mismatches between local batch systems and the capabilities of the Grid Resource Broker (minute 6.3).
• (See slide)
GridPP to more fully document its alignment with each of the individual experiments (minute 15.2).
• An experiment engagement questionnaire has been used (initial input in February and further [updated] input in June). See http://www.gridpp.ac.uk/eb/workdoc/gridusebyexpts_0605.doc
ATLAS steps up Grid production
RB Action
GridPP to provide an update on progress resolving problems caused by mismatches between local batch systems and the capabilities of the Grid Resource Broker (minute 6.3).
• The problem of connecting the local CE to a batch queue is largely overcome – many (all shared) sites now do this.
• There were subsequent problems deploying the accounting system (APEL) to point to the local batch system.
• Overcome (13 of 18 sites), but not as straightforward as it could be.
• The JDL from the job is not passed to the local system. Hence there is no way for the local scheduler to use information from the Grid scheduler.
• This is a limitation from a (shared) site viewpoint (attempting to balance Grid and local jobs).
• The short-term solution is to set up separate batch queues.
• It is not a limitation for the experiments (affects efficiency).
• It is noted as a requirement and it is intended that this will be delivered in Year 2 of JRA1 for the WMS.
Actions
GridPP to define its usage policy with respect to Tier-1 allocations (minute 15.4).
• See http://www.gridpp.ac.uk/docs/oversight/GridPP-PMB-57-Tier1A_1.0.doc and documents within (“fair shares” using PPARC Form X information)
GridPP to produce an updated risk register (minute 15.5).
• Incorporated in the new Project Map (with 7 “high” risks) at http://www.gridpp.ac.uk/pmb/ProjectManagement/GridPP2_ProjectMap_2.htm
GridPP to produce a “get-fit” plan for production metrics (minute 15.6).
• See Metrics and Deployment document http://www.gridpp.ac.uk/docs/oversight/GridPP-PMB-64-Metrics.doc and its incorporation into the Project Map
GridPP to define its metrics for job success (minute 15.7).
• Adopted EGEE-wide definition at http://ccjra2.in2p3.fr/EGEE-JRA2/QAmeasurement/showstatsVO.php (See slides)
GridPP to produce a statement of intent regarding its adoption of gLite (minute 15.8).
• See Middleware Selection document http://www.gridpp.ac.uk/docs/oversight/GridPP-PMB-65-Middleware.doc
Metrics Action
GridPP to define its metrics for job success (minute 15.7).
• GridPP adopts the EGEE-wide definition at http://ccjra2.in2p3.fr/EGEE-JRA2/QAmeasurement/showstatsVO.php
The (web-based) QA system accounts for Workload Management System (WMS) registered job successes (which can then be categorised by Virtual Organisation or Resource Broker).
Before introducing the figures it should be understood that there are caveats:
• It only measures what the WMS “sees”
  • it does not catch a failure of the WMS to register the job in the first place (but this is a rare occurrence)
  • if a job fails half way through the script (for example, it tries but fails to copy a file) but the script completes successfully, then the WMS sees everything as OK.
• If a VO (e.g. LHCb) deploys an agent then the WMS only registers the success of the initial (python) script: this strategy enables higher overall LHCb performance (combined push-PULL model). (This currently leads to other problems in overall accounting should contention become an issue.)
• Overall, an end user may see either:
  1. a worse efficiency
    • jobs failing for other, hidden reasons, e.g. data management problems
  2. a better efficiency, by
    • choosing selected sites according to the Site Functional Test performance index;
    • deploying an agent to initiate real jobs at sites where the agent succeeded.
• Physicists are “smart” and now “see” > 90% efficiency, but that definition is one defined within a given VO adopting its own methods (and comes from informed input from people currently submitting jobs to the system).
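The second caveat (a step inside the job fails, yet the script still completes) is easy to reproduce in miniature. The sketch below is a toy simulation, not the WMS itself: `run_job_script`, `wms_view` and the step names are hypothetical, and the only assumption carried over from the slide is that success is judged from the final status of the submitted script.

```python
def run_job_script(steps):
    """Run each step in order, carrying on after failures as a naive script would.

    steps is a list of (name, callable) pairs; each callable returns an
    exit status (0 means success).
    Returns (script_exit_status, names_of_failed_steps).
    """
    failed = []
    for name, step in steps:
        status = step()
        if status != 0:
            failed.append(name)  # noted locally, but never propagated
    # The script reached its last line, so it reports success regardless.
    return 0, failed


def wms_view(script_exit_status):
    """What a monitor sees if it only looks at the script's exit status."""
    return "success" if script_exit_status == 0 else "failure"


# Hypothetical job: the file copy fails, the rest runs.
steps = [
    ("setup", lambda: 0),
    ("copy_input_file", lambda: 1),
    ("analysis", lambda: 0),
]

exit_status, failed = run_job_script(steps)
print("Monitor records:", wms_view(exit_status))
print("Steps that actually failed:", failed)
```

A script that stopped at the first non-zero status and returned it would make the two views agree, which is one reason end-user efficiency and monitored efficiency can differ.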
Overview
Integrated over all VOs and RBs for first half of 2005:
Successes/Day: 13806
Success %: 64%
• Key point: improving from 42% to 78% during 2005
[For the UK RB (lcgrb01.gridpp.rl.ac.uk): Successes/Day 319, Success % 69%]
LHC VOs
                 ALICE   ATLAS   CMS    LHCb
Successes/Day    N/A     2796    452    3463
Success %        42%     83%     61%    68%
Other VOs
                 BaBar   CDF     D0     BioMed
Successes/Day    37      1       207    1074
Success %        76%     30%     84%    76%
PMB request: please enable the BioMed VO at your site
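The tables give successes per day and a success percentage, but not the number of jobs registered. A rough figure can be backed out if one assumes that the percentage is simply successes divided by WMS-registered jobs over the same period; that reading is an assumption of this sketch, not a definition stated on the slides. ALICE is omitted because its successes/day entry is N/A.

```python
# (successes per day, success fraction) as quoted on the slides.
vo_stats = {
    "All VOs": (13806, 0.64),
    "ATLAS": (2796, 0.83),
    "CMS": (452, 0.61),
    "LHCb": (3463, 0.68),
    "BaBar": (37, 0.76),
    "CDF": (1, 0.30),
    "D0": (207, 0.84),
    "BioMed": (1074, 0.76),
}


def implied_registered_per_day(successes_per_day, success_fraction):
    """Registered jobs/day, ASSUMING success % = successes / registered jobs."""
    return successes_per_day / success_fraction


implied = {
    vo: implied_registered_per_day(successes, fraction)
    for vo, (successes, fraction) in vo_stats.items()
}

for vo, registered in implied.items():
    successes = vo_stats[vo][0]
    print(f"{vo:8s} ~{registered:7.0f} registered/day, "
          f"~{registered - successes:6.0f} not successful/day")
```

Because the quoted percentages are rounded, the implied totals are only good to a few percent.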
Interlude..
Angels & Demons introduces the character of Robert Langdon, professor of religious iconology and art history at Harvard University. As the novel begins, he's awakened in the middle of the night by a phone call from Maximilian Kohler, the director of CERN, the world's largest scientific research facility in Geneva, Switzerland. One of their top physicists, Leonardo Vetra, had been murdered, with his chest branded with the word “Illuminati.” Leonardo Vetra created antimatter in canisters to simulate the Big Bang. Vetra's murder, though, allows one of the canisters to be stolen. Langdon and Vittoria Vetra are quickly sent off to Rome and Vatican City, to help find the canister and return it to CERN before it explodes at midnight...
Agents and Daemons
The future for the experiments?
OC Preliminary Feedback
ALL earlier actions were considered as “done” from OC perspective
GridPP to investigate alternative procurement strategies in order to improve Tier-1/A utilisation
Actions:
• Tier-1/A Board: evaluate alternative approaches
• User Board (THIS MEETING): improve experiment estimates
GridPP to associate more resources for technical documentation (for end users and system administrators)
Actions:
• Internal advertising: is anyone within GridPP willing/able to take up the role of “Documentation Officer”?
• (There will be an incentive for this)
• If this fails, advertise the post using the role description (being drafted)
Deployment Board – THIS MEETING
OC Preliminary Feedback
• GridPP to develop a deployment model that works for smaller T2 centres in association with CERN
• GridPP to provide a gap analysis for LCG (using the baseline services and the [classified] experiment components as described in the TDRs)
• GridPP to address UB questionnaire outcomes (perceptions as well as actual shortcomings)
• GridPP to document the high-level “value” GridPP is adding/delivering (using Project Map)
• OC8 in February 2006 “important” (not “G8 on Wednesday”)
The “Get Fit” Plan
• Set SMART (Specific Measurable Achievable Realistic Time-phased) Goals
“I take it plea bargaining is out of the question?”
• See Dave’s talk
Our 14 problems…
• 0.104: Number of LCG/EGEE job slots published by the UK. The current total is 2477 and the target was 3000.
• 0.105: Number of LCG/EGEE job slots used. The current fraction is 19% compared to a target of 70%. This demonstrates that 0.104 above is clearly not an issue but that usage is presently low.
• 0.106: GridPP KSI2K available. By the end of March 2005 the combined Tier-1 and Tier-2 CPU power was expected to be 5184 KSI2K, compared to 2277 KSI2K achieved. This number is dominated by the 4397 KSI2K expected from the Tier-2s, which has been slowly becoming available.
• 0.108: GridPP disk storage available. Similar to 0.106 above. Only 280TB available compared to 968TB anticipated, but the situation is improving.
• 0.111: GridPP tape storage made available to LCG/EGEE. At present the tape storage is being used but not really via the Grid route.
• 0.112: Fraction of available KSI2K used in quarter. At present a rough estimate shows about 42% of the available CPU was used, compared to a target value of 70%.
• 0.113: Fraction of available disk used in quarter. This is estimated at 64% compared to the target of 70%.
• 0.114: Fraction of available tape used in quarter. This is estimated at 61% compared to the target of 70%.
• 0.131: Tier-1 service disaster recovery plans up to date. This has not been updated within the last 6 months.
• 0.143: Accumulated scheduled downtime in the last quarter. The current value of 418 days is almost identical to the (current) target of 411 days. The metric expects the 25% figure to reduce to 5% by the third year.
• 3.6.3: LCG Deployment evaluation reports. The first report, due in March 05, was delayed to the second quarter.
• 5.2.4: Tier-2 hardware realisation. This flags the same issue as 0.106 and 0.108 above. Tier-2 hardware has been delayed but the situation is improving.
• 5.2.7: Quarterly reports received within 1 month of the end of the quarter. The 05Q1 reports were received late. Some of the delay was due to the unfortunate timing of EGEE meetings.
• 6.2.11: Non-HEP applications tested on the GridPP Grid (submitted via the NGS submission mechanism). The NGS submission mechanism is not yet adequate.
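For the metrics above that quote both a current value and a target, the shortfall can be ranked with a small script. This is a minimal sketch: the `Metric` class and the "fraction of target" ranking are illustrative choices, not part of the GridPP Project Map, and only higher-is-better metrics with both numbers on the slide are included (0.143, where lower is better, is left out).

```python
from dataclasses import dataclass


@dataclass
class Metric:
    metric_id: str
    description: str
    current: float
    target: float

    @property
    def fraction_of_target(self) -> float:
        """Current value as a fraction of the target (1.0 means target met)."""
        return self.current / self.target


# Values as quoted on the slide; percentages are entered as plain numbers.
metrics = [
    Metric("0.104", "LCG/EGEE job slots published", 2477, 3000),
    Metric("0.105", "job slots used (%)", 19, 70),
    Metric("0.106", "KSI2K available", 2277, 5184),
    Metric("0.108", "disk storage available (TB)", 280, 968),
    Metric("0.112", "available KSI2K used (%)", 42, 70),
    Metric("0.113", "available disk used (%)", 64, 70),
    Metric("0.114", "available tape used (%)", 61, 70),
]

# Furthest from target first.
ranked = sorted(metrics, key=lambda m: m.fraction_of_target)

for m in ranked:
    print(f"{m.metric_id}  {m.description:32s} "
          f"{m.fraction_of_target:5.0%} of target")
```

Ranked this way, job-slot usage (0.105) and disk availability (0.108) sit furthest from target, while the disk and tape utilisation fractions are closest.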
The “Get Fit” Plan
• … not (yet) “The Final Solution”
• We hope this drives the right behaviour
• Plea bargaining is (probably) OK..
Some Problem Solving Strategies
Beyond GridPP2..
LHC EXPLOITATION PLANNING REVIEW
Input is requested from the UK project spokespersons, for ATLAS and CMS for each of the financial years 2008/9 to 2011/12, and for LHCb, ALICE and GridPP for 2007/8 to 2011/12.
Physics programme
Please give a brief outline of the planned physics programme. Please also indicate how this planned programme could be enhanced with additional resources. In total this should be no more than 3 sides of A4. The aim is to understand the incremental physics return from increasing resources.
Input will be based upon PPAP roadmap input E-Science and LCG-2 (26 Oct 2004) and feedback from CB (12 Jan & 7 July 2005)
Problem Solving and Improved Communication
• “Communication, in essence, is the shift of a particle from one part of space to another part of space. A particle is the thing being communicated. It can be an object, a written message, a spoken word or an idea. In its crudest definition, this is communication.
• This simple view of communication leads to the full definition:
• Communication is the consideration and action of impelling an impulse or particle from source-point across a distance to receipt-point, with the intention of bringing into being at the receipt-point a duplication and understanding of that which emanated from the source-point.”
• from The Scientology Handbook
This may be a clue to how we will overcome our problems
But we can always improve this..
Summary
• LHC Technical Design Reports define an endpoint
• Responsive-mode deployment/development
• Timescales for LHC are soon – first cosmics data taken
• Oversight Committee – improve “efficiency”
• Some particular issues:
  • Tier-1/A utilisation
  • Documentation Officer
• “Get Fit” plan endorsed by OC
  • requires support from everyone to improve metrics
• There are 14 deployment problems (some interdependency) that need to be solved
• Many areas are now quantifiable (significant progress here)
• Service Challenges will help focus attention
• Improved communication and documentation (become a scientologist?!)
• Aim: measured end-to-end performance improvements during 2005
• Beyond GridPP2: input required over the summer to PPARC LHC exploitation planning review