Tier 1 status, a summary based upon an internal review. Volker Gülzow, DESY. Information sources. Input: Review of Tier 1 readiness, June 8th 2006 at CERN. Reviewers: John Gordon (RAL), Volker Gülzow (DESY, Chair), Alessandro de Salvo (INFN Rome), Jeff Templon (NIKHEF), Frank Würthwein (UCSD)
Tier 1 status, a summary based upon an internal review
Volker Gülzow, DESY
LHCC comprehensive review 2006
Information sources
Input:
• Review of Tier 1 readiness, June 8th 2006 at CERN
• Reviewers: John Gordon (RAL), Volker Gülzow (DESY, Chair), Alessandro de Salvo (INFN Rome), Jeff Templon (NIKHEF), Frank Würthwein (UCSD)
• A questionnaire to the Tier 1s and questions to the experiments (Tier 1s, Middleware, Interoperability)
• Documents from the MB and CRRB
• CTDRs + supplement
• Tier 1 milestone plans
• LCG wikis
• Other sources
Review Process I
Mandate (discussed in MB): “… review pays specific attention to the following topics:
• state of readiness of CERN and the Tier-1 centres, including operational procedures and expertise, 24 X 7 support, resource planning to provide the required capacity and performance, site test and validation programme;
• the essential components and services missing in SC4 and the plans to make these available in time for the initial LHC service;
• the EGEE-middleware deployment and maintenance process, including the relationship between the development and deployment teams, and the steps being taken to reduce the time taken to deploy a new release;
• the plans for testing the functionality, reliability and performance of the overall service;
• interoperability between the LCG sites in EGEE, OSG and NDGF;”
http://www.cern.ch/lcg/documents/mb/service_review_mandate_jun06.doc
Tier1/2 Summary Table
• 40 Tier 2 centres have their data included in the summary table.
• 9 more centres plan to join as soon as possible.
Source: Chris Eck, CRRB April 2006
Overall Comments on the Tier 1s
The Tier 1 requirements are currently changing because of the accelerator time schedule; new resource planning from the experiments is expected in October.
There is a lot of diversity among the Tier 1s, e.g. in:
• Background
• Technology
• Funding
• Staffing
• Number of experiments, size
Overall Comments on the Tier 1s (June 2006)
• Not all Tier 1s have reached the level of readiness required for LHC start-up.
• Key factors are organisational gaps in implementing off-hour service, funding problems, and communication with the experiments (a two-sided problem).
• There are severe risks concerning the scalability of the resources.
• The manpower situation at the Tier 1s was not always transparent during the review.
Source: Les Robertson
Overall Comments on the Tier 1s
• Overall monitoring of the Tier 0/1/2 complex is of very great importance.
• The Tier 2 associations are not completely clear. This needs immediate clarification.
• The concept for support of Tier 2/Tier 3 centres by the Tier 1s is not well defined. This is partly because of unclear requirements from the experiments.
• At this stage, one should no longer distinguish between production and SC4 infrastructure (the experiments complain about this).
Milestone plans: https://twiki.cern.ch/twiki/pub/LCG/MilestonesPlans
„Communication“
• Clear (and redundant) contact persons (e.g. liaison officers) have to be nominated on both sides.
• The experiments have to provide clear, precise and well-structured information.
• Web-based monitoring pages for operational issues should be made available by the experiments.
„Communication“
• The operations meetings OPS/SCM/RSM are important -> mandate etc. reviewed by MB.
• GGUS is a well-accepted tool and should be used as the main tracking tool. Further improvements are needed (e.g. GUI, volume of e-mails, support for the full set of problem categories, “when can a case be declared closed?”).
„24x7“
• Full 24x7 operation is required, in the sense of live monitoring and alarming plus an „immediate“ reaction for a certain class of problems.
• An „on call“ service still has to be set up at many sites. This requires:
  • the right tools; those in place are often not sufficient. An initiative (e.g. via HEPIX) should be started to sharpen the tool set, which would help Tier 2s and Tier 3s as well.
  • adequate staff availability, which is a management matter. This is in the focus of the MB.
„Management issues“
• The funding situation is not clear at every centre. A revised ramp-up plan may help. This has to be followed carefully.
• Clear, up-to-date and realistic requirements from the experiments would help the Tier 1s to acquire resources on time.
• At some centres critical work is carried out by temporary staff; depending on the country, this can cause severe problems.
„Middleware“
• The introduction of gLite 3 was a bit “bumpy”; people were somewhat confused.
• Many emotions were expressed before any real experience had been gained, which was not helpful.
• There were lots of complaints but very little error reporting.
• The “post mortem” analysis of the process was very much appreciated.
„Middleware“
• Sites were not able to meet the tight time constraints.
• Reasons were (and are):
  • lack of manpower,
  • lack of understanding,
  • site localization,
  • coordination with the needs of non-LHC experiments.
„Middleware“
• Stable production environments have to be the no. 1 goal today. There is concern about effort being diverted to side projects.
• The software was not mature enough; we need to find ways to guarantee that software is ready when it is released.
• The representation of operational issues in the TCG is not adequate. The Tier 1s should be better represented and their input has to be taken into account.
• The TCG should include operational issues in the priority list and allow sites to influence the ranking.
• Full VOMS is needed!
• Error reporting from the users has to improve.
• The middleware urgently needs proper operational interfaces:
  • Logging
  • Diagnostics
  • Service operation interfaces
„Interoperability“
• The experiments should make clear how important the problem is.
• If it is required, interoperability of the grids needs more attention and manpower than it receives today.
• Can we expect uniform testing (SFTs), monitoring, accounting, and metrics for ALL WLCG sites?
Conclusion:
• Excellent work was done at the Tier 1s on many tasks.
• The cultural gap has to be bridged.
• The 24x7 case is still largely open.
• Monitoring of sites is strongly recommended.
• The funding and staffing situation needs careful attention.
• Middleware robustness and operational hooks are needed.
• More binding action is required in certain areas (on all Tier levels).
• The new ramp-up does not allow us to lean back.