
LHC Computing Grid Project Launch Workshop Conclusions


Presentation Transcript


  1. LHC Computing Grid Project Launch Workshop Conclusions HEPCCC – CERN, 22 March 2002 Les Robertson, IT Division, CERN 15 March 2002 www.cern.ch/lcg

  2. Disclaimer The workshop finished one week ago. Much has still to be digested and interpreted. These are my personal conclusions.

  3. Launch Workshop, CERN – March 11-15
  Goal: arrive at an agreement within the community on the main elements of a high-level work-plan
  • Define the scope and priorities of the project
  • Agree major priorities for the next 15-18 months
  • Develop coordination mechanisms
  • Foster collaboration

  4. Funding & Timing • The LCG Project has two facets • management of the resources at CERN • CERN base budget and special pledges • materials and people • coordination of the overall activities • important external resources – materials and personnel • regional centres • grid projects • experiment applications effort • Phase I – is probably now 2002-2005 • extension does not significantly change the integral of the budget • rather smaller prototype, delayed ramp-up • staff later than expected in arriving

  5. Project Resources – identify and establish commitment to the project
  Five components:
  • CERN personnel and materials (including those contributed by member and observer states)
  • openlab industrial contributions
  • experiment resources
  • regional centres
  • grid projects

  6. Funding at CERN – preliminary
  Planning will evolve as the requirements and priorities are defined.
  Special funding staff status:
  • arrived – 3
  • contract – 8
  • in process – 4-5
  Recruitment under way:
  • FZK – 8
  • PPARC – up to 15
  • EPSRC – up to 5
  • Italy – ~5
  Funded by EU-Datagrid – 7

  7. Materials funding at CERN

  8. The LHC Computing Grid Project Structure (organisation chart)
  Bodies: LHCC, Common Computing RRB (funding agencies), Project Overview Board, Project Execution Board, Software and Computing Committee (SC2)
  Relationships shown: reviews, resources, reports, requirements, monitoring

  9. Workflow around the organisation chart (borrowed from John Harvey)
  • SC2 sets the requirements: it mandates an RTAG, which delivers prioritised requirements in ~2 months
  • PEB develops the workplan from those requirements; SC2 approves the workplan (project plan, workplan feedback)
  • PEB manages the LCG resources and work packages (WPn) and tracks progress towards each release
  • SC2 reviews the status (~4-month cycle), giving review feedback and updated requirements for the next release

  10. Project Management
  • Applications – Torre Wenaus
    • infrastructure, tools, libraries
    • common framework, applications
  • Fabric – Wolfgang von Rüden
    • CERN Tier 0 + 1 centre
    • autonomous systems administration and operation
    • fault tolerance
    • basic technology watch
  • Grid Technology – Fabrizio Gagliardi
    • provision of grid middleware, drawn from many different grid projects
    • common/interoperable across US & European grid project "domains"
  • Grid Deployment
    • regional centre policy – Mirco Mazzucato
    • building the LHC Computing Facility – t.b.a.

  11. Status
  • The SC2 has created several RTAGs
    • limited scope, two-month lifetime with an intermediate report
    • one member per experiment + IT and other experts
  • RTAGs started: data persistency; software process and support tools; maths library; common Grid requirements; Tier 0 mass storage requirements
  • RTAG being defined: Tier 0-1-2 requirements and services
  • PEB
    • input collected on applications requirements, experiment priorities, data challenges
    • re-planning of the Phase I CERN prototype, and its relationship to CERN production services
    • work towards common/interoperable trans-Atlantic Grids: HICB, DataTAG, iVDGL, GLUE
    • high-level planning initiated following the workshop; first plan to be delivered to the LHCC for the July meeting

  12. Applications – foils by John Harvey

  13. Scope of Applications Area
  • Application software infrastructure
  • Common frameworks for simulation and analysis
  • Support for physics applications
  • Grid interface and integration
  • Physics data management
  Issues:
  • 'How will the applications area cooperate with the other areas?'
  • 'Not feasible to have a single LCG architect to cover all areas.'
  • Need mechanisms to bring coherence to the project

  14. Issues related to organisation
  • There is a role for a Chief Architect in overseeing a coherent architecture for applications software and infrastructure – this is Torre's role
  • 'We need people who can speak for the experiments – this should be formalised in the organisation of the project'
  • Experiment architects have a key role in getting project deliverables integrated into the experiment software systems
    • they must be involved directly in the execution of the project
    • mechanisms need to be found to allow for this, e.g.
      • regular monthly meetings of the Applications subproject, attended by the architects, at which decisions are taken
      • difficult decisions are flagged and referred to a separate meeting attended by the experiment architects and coordinators
      • ultimate responsibility and authority rests with the Chief Architect
  A proposal has now been formulated and sent to the SC2 for endorsement.

  15. Approach to making the workplan
  "Develop a global workplan from which the RTAGs can be derived"
  Considerations for the workplan:
  • Experiment need and priority
  • Is it suitable for a common project?
  • Is it a key component of the architecture, e.g. the object dictionary?
  • Timing: when will the conditions be right to initiate a common project?
    • Do established solutions exist in the experiments? Are they open to review, or are they entrenched?
  • Availability of resources
    • Allocation of effort: is there existing effort which would be better spent doing something else?
  • Availability and maturity of associated third-party software, e.g. grid software
  • Pragmatism and seizing opportunity: a workplan derived from a grand design does not fit the reality of this project

  16. Candidate RTAGs by activity area
  • Application software infrastructure: software process; math libraries; C++ class libraries; software testing; software distribution; OO language usage; benchmarking suite
  • Common frameworks for simulation and analysis: simulation tools; detector description and model; interactive frameworks; statistical analysis; visualization
  • Support for physics applications: physics packages; data dictionary; framework services; event processing framework
  • Grid interface and integration: distributed analysis; distributed production; online notebooks
  • Physics data management: persistency framework; conditions database; lightweight persistency

  17. Candidate RTAG timeline

  18. Initial Workplan – 1st priority level (initiate: first half 2002)
  • Establish process and infrastructure – nicely covered by the software process RTAG
  • Address core areas essential to building a coherent architecture
    • object dictionary – an essential piece (see the illustrative sketch below)
    • persistency – strategic
    • interactive frameworks – also driven by assigning personnel optimally
  • Address priority common project opportunities, driven by a combination of experiment need, appropriateness to a common project, and 'the right moment' (existing but not entrenched solutions in some experiments)
    • detector description and geometry model
  • Driven by need and available manpower
    • simulation tools
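To make the "object dictionary" idea concrete, here is a minimal illustrative sketch (not the LCG design): classes register a description of their persistent fields, and a generic persistency service uses those descriptions instead of hand-written per-class I/O code. All class and field names are hypothetical.

```python
# Minimal illustrative object dictionary: class descriptions are registered
# once and then drive generic persistency, without per-class I/O code.
# All class and field names here are hypothetical examples.

DICTIONARY = {}  # class name -> {field name: type}

def register(cls, fields):
    """Record the persistent layout of a class in the dictionary."""
    DICTIONARY[cls.__name__] = fields
    return cls

class TrackHit:
    def __init__(self, x, y, z, energy):
        self.x, self.y, self.z, self.energy = x, y, z, energy

register(TrackHit, {"x": float, "y": float, "z": float, "energy": float})

def to_record(obj):
    """Generic 'streamer': uses the dictionary, not the class, to persist."""
    fields = DICTIONARY[type(obj).__name__]
    return {name: getattr(obj, name) for name in fields}

def from_record(class_name, record):
    """Rebuild an object of a registered class from a stored record."""
    obj = object.__new__(globals()[class_name])
    obj.__dict__.update({k: DICTIONARY[class_name][k](v) for k, v in record.items()})
    return obj

hit = TrackHit(1.0, 2.0, 3.0, 0.42)
stored = to_record(hit)                  # {'x': 1.0, 'y': 2.0, 'z': 3.0, 'energy': 0.42}
copy = from_record("TrackHit", stored)   # reconstructed without TrackHit-specific I/O code
```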

  19. Fabric

  20. Scope of Fabric Area
  • CERN Tier 0+1 centre
    • management automation
    • evolution & operation of the CERN prototype
    • integration as part of the LCG grid
  • Tier 1, 2 centre collaboration
    • exchange information on planning and experience
    • look for areas of collaboration and cooperation
  • Grid integration middleware
  • Common systems management package
    • not required for large independent centres (Tier 1, 2)
    • may be important for Tier 3 and beyond – but this is on a different scale from Tier 1 tools, and there may be suitable standard packages, e.g. Datagrid WP4 tools, ROCKS
  • Technology assessment (PASTA) – evolution, cost estimates

  21. Potential RTAGs
  • Definition of Tier 0, Tier 1 and Tier 2 centres – requirements and services
    • high priority – to be initiated in April
  • Tier 0 mass storage requirements
    • high priority for ALICE – RTAG constituted
  • Monitoring services
    • grid, application and system performance, error logging, …

  22. PASTA III
  • Technology survey, evolution assessment, cost estimates
  • To be organised by Dave Foster (CTO) in the next few weeks
  • Experts from external institutes, the experiments and CERN/IT
  • Report in the summer

  23. Information Exchange
  • Use HEPiX – there is already a "Large Cluster Special Interest Group"
  • A Large Cluster Workshop was organised at FNAL in 2001
  • Encourage Tier 1 and Tier 2 centres to participate
  • Next meeting in Catania in April

  24. Short-term goals at CERN
  • Configuration management testbed
    • LCFG, rapid reconfiguration (an illustrative sketch of the general pattern follows below)
    • replacement of SUE functions
    • then expand to LXSHARE
  • Low-level control & monitoring – evaluation of an industrial controls approach (SCADA)
  • Deploy LXSHARE as a node in the LCG grid
  • Mass storage for the ALICE CDC
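The configuration-management goal above is about driving node setup from a declared desired state rather than by hand. The sketch below is not LCFG; it is a generic, hypothetical illustration of the pattern: compare a node's declared profile with its actual state and apply only the differences, so reconfiguration is fast and repeatable.

```python
# Hypothetical illustration of declarative node configuration (not LCFG):
# a central profile declares the desired state; each node applies only
# the differences between desired and actual state.

DESIRED_PROFILE = {            # desired state, e.g. published by a config server
    "packages": {"openssh", "batch-client", "monitoring-agent"},
    "services": {"sshd": "running", "batchd": "running"},
}

def current_state():
    """Placeholder: a real tool would query the node (packages, init scripts...)."""
    return {
        "packages": {"openssh"},
        "services": {"sshd": "running", "batchd": "stopped"},
    }

def plan_changes(desired, actual):
    """Compute the minimal set of actions needed to reach the desired state."""
    actions = []
    for pkg in desired["packages"] - actual["packages"]:
        actions.append(("install", pkg))
    for svc, state in desired["services"].items():
        if actual["services"].get(svc) != state:
            actions.append(("set_service", svc, state))
    return actions

for action in plan_changes(DESIRED_PROFILE, current_state()):
    print("would apply:", action)   # a real tool would execute these steps
```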

  25. Grid Technology

  26. Grid Technology: Summary (N. McCubbin, L. Perini)
  • What is "Grid Technology"?
    • It is the bit between the user's or experiment's application and the (Grid-enabled) computer system ("fabric") at a particular institute or laboratory
    • For users, it is the bit we don't want to have to worry about, provided it is there!
  • Note the analogy with the electric-power grid: you plug your device into the socket in the wall; you do not care, or want to care, about the power stations, distribution system, HV cables, …

  27. Web Services (Foster) (N. McCubbin, L. Perini)
  • "Web services" provide
    • a standard interface definition language (WSDL)
    • a standard RPC protocol (SOAP) [but not required]
    • emerging higher-level services (e.g. workflow)
    • nothing to do with the Web (!)
  • A useful framework/toolset for Grid applications? See the proposed Open Grid Services Architecture (OGSA)
  • They represent a natural evolution of current technology
    • no need to change any existing plans; introduce in a phased fashion when available
    • maintain focus on the hard issues: how to structure services, build applications, operate Grids
  For more info: www.globus.org/research/papers/physiology.pdf
  (A minimal illustration of a SOAP call follows below.)
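To make the SOAP point concrete: a SOAP RPC is just an XML envelope sent over HTTP POST. The sketch below builds a SOAP 1.1 request by hand with the Python standard library; the service URL, SOAPAction and method name are hypothetical placeholders, not a real LCG or Globus service. In practice a WSDL-aware toolkit would generate this plumbing.

```python
# Minimal hand-rolled SOAP 1.1 request (illustrative only).
# The endpoint, SOAPAction and method below are hypothetical placeholders.
import urllib.request

envelope = """<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetJobStatus xmlns="urn:example-grid-service">
      <jobId>12345</jobId>
    </GetJobStatus>
  </soap:Body>
</soap:Envelope>"""

request = urllib.request.Request(
    "http://gridservice.example.org/soap",          # hypothetical endpoint
    data=envelope.encode("utf-8"),
    headers={
        "Content-Type": "text/xml; charset=utf-8",
        "SOAPAction": "urn:example-grid-service#GetJobStatus",
    },
)

# The reply is another XML envelope; a WSDL description of the service
# lets a toolkit generate this request/response handling automatically.
with urllib.request.urlopen(request) as reply:
    print(reply.read().decode("utf-8"))
```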

  28. What do users expect…? (N. McCubbin, L. Perini)
  Even if users/applications "don't want to know" the gory details, we expect:
  • data management across the Grid
  • efficient "scheduling"
  • access to information about what is going on, either on the Grid itself or on the component computer systems (fabric layer)
  • … and we know that someone will have to worry about SECURITY, but we don't want to!
  (The sketch below illustrates the data-management expectation.)
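As one concrete reading of "data management across the Grid": a replica catalogue maps a logical file name to the physical copies held at different sites, and the user (or the middleware) simply asks for the "best" copy. The sketch below is a generic, hypothetical illustration, not the EDG or Globus replica catalogue API; all site names, file names and costs are invented.

```python
# Hypothetical replica catalogue: logical file name -> physical replicas.
# Site names, costs and file names are illustrative, not real LCG data.

REPLICA_CATALOGUE = {
    "lfn:higgs-candidates-run42.root": [
        ("cern.ch",   "gsiftp://storage.cern.ch/data/run42.root"),
        ("fnal.gov",  "gsiftp://storage.fnal.gov/data/run42.root"),
        ("ral.ac.uk", "gsiftp://storage.ral.ac.uk/data/run42.root"),
    ],
}

# Toy "network cost" from the user's site to each storage site.
COST_FROM_MY_SITE = {"cern.ch": 1, "ral.ac.uk": 3, "fnal.gov": 8}

def best_replica(logical_name):
    """Return the physical URL of the cheapest replica of a logical file."""
    replicas = REPLICA_CATALOGUE[logical_name]
    site, url = min(replicas, key=lambda r: COST_FROM_MY_SITE.get(r[0], 99))
    return url

print(best_replica("lfn:higgs-candidates-run42.root"))
# -> gsiftp://storage.cern.ch/data/run42.root  (closest copy for this user)
```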

  29. Grid Technology and the LCG
  • The LCG is a Grid deployment project, so the LCG is a consumer of Grid technology rather than a developer
  • There are many Grid technology projects, with different objectives, timescales and spheres of influence
  • The LCG Grid Technology Area Manager is responsible for ensuring the supply of Grid middleware/services that
    • satisfy the LHC requirements (RTAG starting up)
    • can be deployed at all LHC regional centres (either the same middleware, or compatible middleware)
    • have good, solid support and maintenance behind them

  30. Grid Technology: Issues (1) – adapted from McCubbin & Perini
  Co-ordination of Grid efforts:
  • Users/experiments must see a common world-wide interface; Grid Technology must deliver this (see the sketch below)
  • More than one approach in Grid Technology is healthy at this stage – it takes time to learn what should be common!
  • The Global Grid Forum (GGF) has a role in coordinating all Grid activity, not just HEP, and is starting a standardisation process
  • The HENP Intergrid Collaboration Board (HICB) addresses co-ordination for HEP and Nuclear Physics
  • GLUE (Grid Laboratory for a Universal Environment) was initiated at the February HICB meeting to select (develop) a set of Grid services which will inter-operate world-wide (at least HEP-wide)
    • funding for GLUE is/will come from iVDGL (US) and DataTAG (EU)
    • iVDGL and Datagrid are committed to implementing the GLUE recommendations
    • the first LCG services will be based on the GLUE recommendations
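A "common world-wide interface" over differing middleware is essentially an adapter layer: experiment code talks to one abstract job-submission interface, and project- or site-specific backends implement it. The sketch below is a hypothetical illustration of that design idea, not GLUE itself and not any real middleware API; all class and method names are invented.

```python
# Hypothetical common interface over different grid middlewares (not GLUE):
# experiment code depends only on GridJobService; each backend adapts one
# middleware's native submission mechanism to it.
from abc import ABC, abstractmethod

class GridJobService(ABC):
    """The single interface the experiments program against."""
    @abstractmethod
    def submit(self, executable: str, arguments: list[str]) -> str:
        """Submit a job and return a middleware-independent job id."""

class EuropeanTestbedBackend(GridJobService):
    def submit(self, executable, arguments):
        # would call the European testbed's native submission command here
        return "eu-" + executable

class USTestbedBackend(GridJobService):
    def submit(self, executable, arguments):
        # would call the US testbed's native submission command here
        return "us-" + executable

def run_production(service: GridJobService):
    """Experiment-side code: identical whichever backend is plugged in."""
    return service.submit("simulate_events", ["--events", "1000"])

print(run_production(EuropeanTestbedBackend()))   # eu-simulate_events
print(run_production(USTestbedBackend()))         # us-simulate_events
```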

  31. Grid Technology: Issues (2) – adapted from McCubbin & Perini
  • If the HICB did not exist, the LCG would have to do this for the LHC. The LCG still has responsibility for defining the Grid requirements of the LHC experiments (RTAG about to start), but will obviously hope to use the GLUE recommendations
  • The LCG should, nevertheless, keep a watchful eye on the various 'middleware' projects to minimise the risk of divergences
  • The LCG does not expect to devote significant new resources to middleware development, but continued financial support for EDG, PPDG, GriPhyN… and their successors is vital
  • The LCG does expect to devote significant resources to deploying the middleware developed by these projects

  32. Grid Technology: Issues (3) – adapted from McCubbin & Perini
  • It is absolutely essential that the LHC experiments test, as realistically as possible, the Grid Technology as it becomes available. This has already started, but will increase, particularly in the context of the Data Challenges. This is vital not just for the usual reason of incremental testing, but also to discover the potential of the Grid for analysis. (René Brun)
  • This 'realistic' testing generates "tension" between developing new features of Grid Technology and supporting the versions used by the experiments
  • The implications of the Web Services evolution are not yet clear

  33. Grid Deployment

  34. Deployment issues (Federico Carminati)
  • Data Challenges are an integral part of experiment planning
    • a large effort from the experiments, almost like a testbeam
    • planning and commitment are essential from all sides: LCG, IT at CERN, the remote centres and the experiments
    • this has to be a priority for the project
  • Grid tools (mostly Globus / Condor) already facilitate the day-to-day work of physicists (a hypothetical submit sketch follows below)
  • Now we need to deploy a pervasive, ubiquitous, seamless Grid testbed between the LHC sites
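To make "day-to-day work" concrete: with Condor a physicist describes a job in a small submit file and hands it to the scheduler. The sketch below generates such a description from Python and calls condor_submit; the executable name, arguments and file names are hypothetical, and the Grid-enabled (Condor-G / Globus) variants add further settings not shown here.

```python
# Sketch of classic batch submission as a physicist might script it.
# The executable, arguments and file names are hypothetical examples.
import subprocess
from pathlib import Path

submit_description = """\
universe   = vanilla
executable = simulate_events
arguments  = --run 42 --events 1000
output     = run42.out
error      = run42.err
log        = run42.log
queue
"""

Path("run42.sub").write_text(submit_description)

# Hand the description to the Condor scheduler; on a Grid-enabled setup the
# same pattern applies, with extra settings pointing at a remote gatekeeper.
subprocess.run(["condor_submit", "run42.sub"], check=True)
```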

  35. Grid issues (Federico Carminati)
  • The testbeds need close collaboration between all parties
    • support is very labour intensive
    • planning and resource commitment are necessary
  • We need development and production testbeds in parallel – and stability, stability, stability
  • The experiments are buying into the common testbed concept
    • this is straining the existing testbeds
    • but they should not be deceived
  • Middleware from the different Grid projects is at risk of diverging
    • from the bottom up, the SC2 has launched a common use case RTAG for the LHC experiments
    • from the top down, there are coordination bodies
    • the LCG has to find an effective way to contribute to and follow these processes

  36. Grid issues (continued) (Federico Carminati)
  • Interoperability is a key issue (common components, or different components that interoperate)
    • on national/regional and transatlantic testbeds
  • This world-wide close coordination has never been done before
    • we need mechanisms here to make sure that all centres are part of a global planning process
    • and these mechanisms should be complementary to existing ones, not parallel or conflicting

  37. Application testbed requirements (DataGRID conference)
  Three main priorities: stability, stability, stability. (Federico Carminati)

  38. So many conflicts! – future development vs current development vs production testbed; short term vs long term (Federico Carminati)
  • The LCG needs a testbed now!
    • let's not be hung up on the concept of 'production'
    • the testbed must be stable and supported – the software is unstable, and we know it
  • Consolidation of the existing testbeds is necessary – a small amount of additional effort may be enough!
  • Several things can be done now
    • set up the support hierarchy / chain
    • consolidate authorisation / authentication
    • consolidate / extend the VO organisation
    • test EDG / VDT interoperability – this has been pioneered by DataGRID
  • … and we need development testbed(s) – John Gordon's trident

  39. Developers' testbed / development testbed / production Grid
  • Grid projects are responsible for:
    • a facility on which developers can build and test their software
    • "testbeds" for demos and beta testing
  • The LCG is responsible for operating a production Grid:
    • a 24 x 7 service
    • as easy to use as LXBATCH

  40. Short-term user expectation is out of line with the resources and capabilities of the project

  41. Current state
  • The experiments can do (and are doing) their event production using distributed resources, with a variety of solutions
    • classic distributed production – send jobs to specific sites, simple bookkeeping (see the sketch below)
    • some use of Globus, and some of the HEP Grid tools
    • other integrated solutions (AliEn)
  • The hard problem for distributed computing is ESD and AOD analysis
    • chaotic workload, unpredictable data access patterns
  • First data is in 2007 – why is there such a hurry?
  • Crawl → walk → run
    • use current solutions for physics productions
    • consolidate (stabilise, maintain) the middleware
    • learn what a "production grid" really means
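"Classic distributed production with simple bookkeeping" amounts to: split the work into jobs, assign each job to a site, and record which site got what so that failures can be retried. The sketch below is a hypothetical, much-simplified illustration of that pattern; the site names and job parameters are invented.

```python
# Hypothetical sketch of classic distributed production bookkeeping:
# assign jobs round-robin to sites and keep a record for retries.
# Sites and run numbers are invented examples.
import itertools

SITES = ["cern.ch", "in2p3.fr", "fnal.gov", "ral.ac.uk"]
jobs = [{"run": run, "events": 1000} for run in range(1, 9)]

bookkeeping = []   # one record per submitted job

for job, site in zip(jobs, itertools.cycle(SITES)):
    # a real system would hand the job to the site's batch system here
    bookkeeping.append({"run": job["run"], "site": site, "status": "submitted"})

# Later, mark completions as results come back and list what needs resubmission.
bookkeeping[2]["status"] = "failed"
to_retry = [rec for rec in bookkeeping if rec["status"] == "failed"]
print("jobs to resubmit:", to_retry)
```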

  42. This year – Prototype Production Grids
  Goals:
  • understand what "production" involves, train staff, build confidence
  • "productisation" of the Datagrid Testbed 1, in close collaboration with Datagrid
    • Datagrid toolkit
    • start with one (?) experiment and a few sites
  • creative simplification: cut-down requirements – simplified, realisable
  • focus on packaging and process
    • operational infrastructure
    • support infrastructure
  • a prototype – do not confuse it with a production service
    • it does not replace the Datagrid testbed
    • it does not replace the iVDGL testbed

  43. Next year – First production service
  Be realistic:
  • resources are appearing slowly, key people still to be hired
  • first consolidate the existing testbeds
  • the planning process will define the date for the first worldwide production grid – but expect a 12-month timescale
    • need mature, supported middleware
    • need converged/interoperable US/European technology
    • need to know how to manage a production grid
    • need to develop solid infrastructure services
    • need to set up support services

  44. First LCG Production Grid – Global Grid Service
  • deploy a sustained 24 x 7 service
  • based on the "converged" toolkit emerging from GLUE
  • ~10 sites – all the Tier 1s and a few Tier 2s – including sites in Europe, Asia and North America
  • > 2 times the CERN prototype capacity (≈ 1,500 processors, 1,500 disks)
  • a permanent service for all experiments
  • achieve target levels for physics event generation/processing rate, reliability and availability

  45. Fabric – Grid – Deployment: decision making
  • Must be able to make practical, binding community agreements between the experiments and regional centre management, covering:
    • resources, schedules, security, usage policies, …
    • software environment (operating system, data services, grid services, …)
    • operating standards and procedures
    • operation of the common infrastructure
  • Grid Deployment Management Board – membership (still under discussion):
    • experiment data & analysis challenge coordinators
    • regional centre delegates – Tier 1, Tier 2 (empowered, with responsibility, but not one per regional centre!)
    • fabric, grid technology and grid deployment area coordinators
  • Planning and strategy brought to the SC2 for endorsement

  46. Important issues for this year
  • Defining the overall architecture of the project
    • applications, fabric & grid deployment
    • grid technology
    • computing and analysis model
  • Thorough re-assessment of the requirements, technology and costs
    • LHCC re-assessment of physics rates
    • new technology survey (PASTA) – particularly the tape issues
    • revised computing and analysis model

  47. Cautionary remarks
  • Staff are arriving at CERN slowly – and staff are also leaving CERN
  • To what extent will external and experiment resources be committed to the project?
  • Importance of agreeing priorities within the community, taking realistic account of timing and resources
    • do not expect to solve all of the problems
    • do not expect immediate results
    • concentrate on a few priority areas while staff numbers and experience build up
  • Long-term support
    • importance of ownership – ensuring that developments belong to long-term structures
    • temporary staff must be replaced by longer-term staff – an unsolved problem given the budget crisis

  48. Final remark Additional staff at CERN will be extremely useful, but I hope that the main benefit of the project will be collaborating on the common goal of LHC data analysis.
