LHC Computing Grid Project – Launch Workshop Conclusions HEPCCC – CERN, 22 March 2002 Les Robertson, IT Division, CERN 15 March 2002 www.cern.ch/lcg
Disclaimer The workshop finished one week ago. Much has still to be digested and interpreted. These are my personal conclusions.
Launching Workshop CERN – March 11-15 Goal: Arrive at an agreement within the community on the main elements of a high-level work-plan • Define the scope and priorities of the project • Agree the major priorities for the next 15-18 months • Develop coordination mechanisms • Foster collaboration
Funding & Timing • The LCG Project has two facets • management of the resources at CERN • CERN base budget and special pledges • materials and people • coordination of the overall activities • important external resources – materials and personnel • regional centres • grid projects • experiment applications effort • Phase I is probably now 2002-2005 • the extension does not significantly change the integral of the budget • rather a smaller prototype, with delayed ramp-up • staff arriving later than expected
Project Resources Five components – to be identified, with their commitment to the project established • CERN personnel and materials (including those contributed by member and observer states) • openlab industrial contributions • Experiment resources • Regional centres • Grid projects
Funding at CERN – preliminary planning, will evolve as the requirements and priorities are defined • Special funding staff status: arrived – 3; contract – 8; in process – 4-5 • Recruitment under way: FZK – 8; PPARC – up to 15; EPSRC – up to 5; Italy – ~5 • Funded by EU-Datagrid – 7
The LHC Computing Grid Project Structure [Organisation chart: LHCC (reviews); Common Computing RRB – the funding agencies (resources, reports); Project Overview Board; Project Execution Board (PEB); Software and Computing Committee (SC2) – requirements, monitoring]
Workflow around the organisation chart [Workflow diagram, borrowed from John Harvey: the SC2 sets the requirements, giving a mandate to each RTAG (RTAGm); prioritised requirements come back after ~2 months; the PEB develops a workplan from the requirements, the SC2 approves the workplan, and the PEB then manages the LCG resources and work packages (WPn); the PEB tracks progress and reports status, the SC2 reviews the status of each release (~4 months per cycle), and review feedback flows into an updated workplan and project plan]
Project Management • Applications – Torre Wenaus • infrastructure, tools, libraries • common framework, applications • Fabric – Wolfgang von Rüden • CERN Tier 0 + 1 centre • autonomous systems administration and operation • fault tolerance • basic technology watch • Grid Technology – Fabrizio Gagliardi • provision of grid middleware • drawn from many different grid projects • common/interoperable across US & European grid project “domains” • Grid Deployment • regional centre policy – Mirco Mazzucato • building the LHC Computing Facility – t.b.a.
Status • The SC2 has created several RTAGs • limited scope, two-month lifetime with intermediate report • one member per experiment + IT and other experts • RTAGs started - • data persistency • software process and support tools • maths library • common Grid requirements • Tier 0 mass storage requirements • RTAG being defined – Tier 0-1-2 requirements and services • PEB • Input collected on applications requirements, experiment priorities, data challenges • Re-planning of the Phase I CERN prototype, and relationship to CERN production services • Work towards common/interoperable Trans-Atlantic Grids • HICB, DataTAG, iVDGL, GLUE • High-level planning initiated following workshop • First plan to be delivered to LHCC for July meeting
Applications foils by John Harvey
Scope of Applications Area • Application software infrastructure • Common frameworks for simulation and analysis • Support for physics applications • Grid interface and integration • Physics data management Issues: • ‘How will applications area cooperate with other areas?’ • ‘Not feasible to have a single LCG architect to cover all areas.’ • Need mechanisms to bring coherence to the project
Issues related to organisation • There is a role for a Chief Architect in overseeing a coherent architecture for Applications software and infrastructure • This is Torre's role • 'We need people who can speak for the experiments – this should be formalised in the organisation of the project' • Experiment architects have a key role in getting project deliverables integrated into experiment software systems • They must be involved directly in the execution of the project • Mechanisms need to be found to allow for this, e.g. • Regular monthly meetings of the Applications subproject, which must be attended by the architects and at which decisions are taken • Difficult decisions are flagged and referred to a separate meeting attended by experiment architects and coordinators • Ultimate responsibility and authority rest with the Chief Architect Proposal now formulated & sent to SC2 for endorsement
Approach to making workplan • “Develop a global workplan from which the RTAGs can be derived” • Considerations for the workplan: • Experiment need and priority • Is it suitable for a common project • Is it a key component of the architecture e.g. object dictionary • Timing: when will the conditions be right to initiate a common project • Do established solutions exist in the experiments • Are they open to review or are they entrenched • Availability of resources • Allocation of effort • Is there existing effort which would be better spent doing something else • Availability, maturity of associated third party software • E.g. grid software • Pragmatism and seizing opportunity. A workplan derived from a grand design does not fit the reality of this project
Candidate RTAGs by activity area • Application software infrastructure • Software process; math libraries; C++ class libraries; software testing; software distribution; OO language usage; benchmarking suite • Common frameworks for simulation and analysis • Simulation tools; detector description, model; interactive frameworks; statistical analysis; visualization • Support for physics applications • Physics packages; data dictionary; framework services; event processing framework • Grid interface and integration • Distributed analysis; distributed production; online notebooks • Physics data management • Persistency framework; conditions database; lightweight persistency
Initial Workplan – 1st priority level Initiate: First half 2002 • Establish process and infrastructure • Nicely covered by software process RTAG • Address core areas essential to building a coherent architecture • Object dictionary – essential piece • Persistency - strategic • Interactive frameworks - also driven by assigning personnel optimally • Address priority common project opportunities • Driven by a combination of experiment need, appropriateness to common project, and ‘the right moment’ (existing but not entrenched solutions in some experiments) • Detector description and geometry model • Driven by need and available manpower • Simulation tools
Scope of Fabric Area • CERN Tier 0+1 centre • management automation • evolution & operation of the CERN prototype • integration as part of the LCG grid • Tier 1, 2 centre collaboration • exchange information on planning and experience • look for areas of collaboration and cooperation • Grid integration middleware • Common systems management package • not required for large independent centres (Tier 1, 2) • may be important for Tier 3 … – but this is on a different scale from Tier 1 tools – and there may be suitable standard packages – e.g. Datagrid WP4 tools, ROCKS • Technology assessment (PASTA) • evolution, cost estimates
Potential RTAGs • Definition of Tier 0, Tier 1 and Tier 2 centres – requirements and services (high priority – to be initiated in April) • Tier 0 mass storage requirements – high priority for ALICE; RTAG constituted • Monitoring services – grid, application, system performance, error logging, …
PASTA III • Technology survey, evolution assessment, cost estimates • To be organised by Dave Foster (CTO) in the coming weeks • Experts from external institutes, experiments, CERN/IT • Report in summer
Information Exchange • Use HEPiX • Already a “Large Cluster Special Interest Group” • Large Cluster Workshop organised at FNAL in 2001 • Encourage Tier 1 and Tier 2 centres to participate • Next meeting in Catania in April
Short-term goals at CERN • configuration management testbed • LCFG, rapid reconfiguration • replacement of SUE functions • then expand to LXSHARE • low-level control & monitoring • evaluation of industrial controls approach (SCADA) • deploy LXSHARE as node in the LCG grid • mass storage for ALICE CDC
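As a rough illustration of the configuration-management idea behind the LCFG evaluation above: the desired state of each node is described centrally, and the node reconfigures itself to match, which is what makes "rapid reconfiguration" possible. The sketch below only illustrates that concept; the profile format, resource names and node name are invented and are not LCFG syntax.

```python
# Minimal sketch of declarative node configuration: a central "desired state"
# profile is compared with a node's actual state, and the differences are the
# reconfiguration actions to apply.  Names and values here are invented examples.
DESIRED = {
    "lxshare0042": {          # hypothetical node name
        "kernel": "2.4.18",
        "batch_system": "lsf",
        "grid_gatekeeper": True,
    }
}

def reconfigure(node: str, actual: dict) -> list[str]:
    """Return the list of changes needed to bring `actual` in line with the profile."""
    wanted = DESIRED.get(node, {})
    return [f"set {key} = {value}"
            for key, value in wanted.items()
            if actual.get(key) != value]

if __name__ == "__main__":
    print(reconfigure("lxshare0042", {"kernel": "2.2.19", "batch_system": "lsf"}))
    # -> ['set kernel = 2.4.18', 'set grid_gatekeeper = True']
```

A real deployment would distribute the profiles to the nodes and have a local agent apply the changes, which is roughly the division of labour in LCFG and in the Datagrid WP4 fabric-management tools mentioned earlier.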
Grid Technology: Summary • What is “Grid Technology”? • It’s the bit between the user’s or experiment’s application and the (Grid-enabled) computer system (“fabric”) at a particular institute or laboratory; • For users, it’s the bit we don’t want to have to worry about, provided it’s there! • Note the analogy with electric-power grid: you plug your device into the socket in the wall; you do not care, or want to care, about the power stations, distribution system, HV cables,…… N.McCubbin, L.Perini
Web Services (Foster) • “Web services” provide • A standard interface definition language (WSDL) • Standard RPC protocol (SOAP) [but not required] • Emerging higher-level services (e.g., workflow) • Nothing to do with the Web (!) • Useful framework/toolset for Grid applications? • See proposed Open Grid Services Architecture (OGSA) • Represent a natural evolution of current technology • No need to change any existing plans • Introduce in phased fashion when available • Maintain focus on hard issues: how to structure services, build applications, operate Grids For more info: www.globus.org/research/papers/physiology.pdf N.McCubbin, L.Perini
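To make the "standard RPC protocol (SOAP)" bullet concrete, here is a minimal, hypothetical sketch of a SOAP call as a client might issue it, using only the Python standard library. The endpoint URL, XML namespace and operation name are invented for illustration and do not refer to any real Grid or LCG service.

```python
# Illustrative SOAP request: an XML envelope POSTed over HTTP.
# Endpoint, namespace and operation name are hypothetical.
import urllib.request

ENDPOINT = "https://gridservice.example.org/soap"   # hypothetical service endpoint

envelope = """<?xml version="1.0" encoding="UTF-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <getJobStatus xmlns="urn:example:grid">
      <jobId>42</jobId>
    </getJobStatus>
  </soap:Body>
</soap:Envelope>"""

request = urllib.request.Request(
    ENDPOINT,
    data=envelope.encode("utf-8"),
    headers={
        "Content-Type": "text/xml; charset=utf-8",
        "SOAPAction": "urn:example:grid#getJobStatus",  # identifies the remote operation
    },
)

# The response body would be another XML envelope describing the job status.
with urllib.request.urlopen(request) as response:
    print(response.read().decode("utf-8"))
```

The matching WSDL document would declare the getJobStatus operation and its message types, which is what lets toolkits generate such client code automatically – the "standard interface definition language" role noted above.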
What do users expect…? Even if users/applications “don’t want to know” the gory details, we expect: - Data Management across the Grid; - Efficient “scheduling”; - Access to information about what’s going on, either on the Grid itself, or on the component computer systems (fabric layer). - …and we know that someone will have to worry about SECURITY, but we don’t want to! N.McCubbin, L.Perini
Grid Technology and the LCG • The LCG is a Grid deployment project • So LCG is a consumer of Grid technology, rather than a developer • There are many Grid technology projects, with different objectives, timescales and spheres of influence • The LCG Grid Technology Area Manager is responsible for ensuring the supply of Grid middleware/services • that satisfy the LHC requirements (RTAG starting up) • and can be deployed at all LHC regional centres (either the same middleware, or compatible middleware) • and have good solid support and maintenance behind them
Grid Technology: Issues (1)adapted from McCubbin & Perini Co-ordination of Grid efforts: • Users/experiments must see a common world-wide interface; Grid Technology must deliver this. • Note that more than one approach in Grid Technology is healthy at this stage. It takes time to learn what should be common! • Global Grid Forum (GGF) has a role in coordinating all Grid activity, not just HEP, and is starting a standardisation process • HENP Intergrid Collaboration Board (HICB) addresses co-ordination for HEP and Nuclear Physics. • GLUE (Grid Laboratory for a Universal Environment) initiated at February HICB meeting to select (develop) a set of Grid services which will inter-operate world-wide (at least HEP-wide). • Funding for GLUE is/will come from iVDGL (US) and DataTAG (EU). • iVDGL and Datagrid committed to implementing GLUE recommendations • LCG first services will be based on GLUE recommendations
Grid Technology: Issues (2)adapted from McCubbin & Perini • If HICB did not exist, LCG would have to do this for LHC. LCG still has responsibility for defining the Grid requirements of the LHC experiments (“RTAG” about to start), but will obviously hope to use the GLUE recommendations. • LCG should, nevertheless, keep a watchful eye on the various ‘middleware’ projects to minimise risk of divergences. • LCG does not expect to devote significant new resources to middleware development. But continued financial support for EDG, PPDG, GriPhyN… and their successors is vital. • LCG does expect to devote significant resources to deploying the middleware developed by these projects.
Grid Technology: Issues (3) adapted from McCubbin & Perini • It is absolutely essential that the LHC experiments test, as realistically as possible, the Grid Technology as it becomes available. This has already started, but will increase, particularly in the context of Data Challenges. This is vital not just for the usual reason of incremental testing, but also to discover the potential of the Grid for analysis. (Rene Brun) • This ‘realistic’ testing generates “tension” between developing new features of Grid Technology, and supporting versions used by the experiments. • The implications of the Web Services evolution are not yet clear
Deployment issues • Data Challenges are an integral part of experiment planning • A large effort from the experiments, almost like a testbeam • Planning and commitment are essential from all sides • LCG, IT at CERN, remote centres and the experiments This has to be a priority for the project • GRID tools (mostly GLOBUS / Condor) already facilitate the day-to-day work of physicists • Now we need to deploy a pervasive, ubiquitous, seamless GRID testbed between the LHC sites Federico Carminati
GRID issues • Testbeds need close collaboration between all parties • Support is very labour intensive • Planning and resource commitment are necessary • Development and production testbeds are needed in parallel • And stability, stability, stability • Experiments are buying into the common testbed concept • This is straining the existing testbeds • But they should not be deceived • Middleware from the different GRID projects is at risk of diverging • From the bottom up, the SC2 has launched a common use case RTAG for the LHC experiments • From the top down, there are coordination bodies • LCG has to find an effective way to contribute to and follow these processes Federico Carminati
GRID issues • Interoperability is a key issue (common components, or different components that interoperate) • On national / regional and transatlantic testbeds • Such close world-wide coordination has never been done before • We need mechanisms to make sure that all centres are part of a global planning process • And these mechanisms should complement existing ones, not run in parallel with or conflict with them Federico Carminati
Application testbed requirements (DataGRID conference) • Three main priorities: Stability, Stability, Stability Federico Carminati
So many conflicts! Short term / long term: future development vs. current development vs. the production testbed • LCG needs a testbed from now! • Let's not be hung up on the concept of production • The testbed must be stable and supported • The software is unstable, and we know it • Consolidation of the existing testbeds is necessary • A small amount of additional effort may be enough! • Several things can be done now • Set up the support hierarchy / chain • Consolidate authorisation / authentication (a sketch of the VO mapping idea follows below) • Consolidate / extend the VO organisation • Test EDG / VDT interoperability • This has been pioneered by DataGRID • … and we need development testbed(s) • John Gordon's trident Federico Carminati
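To give an idea of what consolidating authorisation / authentication and the VO organisation involves at the simplest level, the sketch below maps certificate subject names to local accounts per virtual organisation, in the spirit of a Globus grid-mapfile. The DNs, VO names and account names are invented examples, not real entries.

```python
# Illustrative VO-based authorisation: certificate subject names (DNs) are
# mapped to local accounts, grouped by virtual organisation.
# All DNs, VO names and account names below are invented examples.
from typing import Optional

GRIDMAP = {
    # VO name -> {certificate DN: local account}
    "alice": {"/C=CH/O=CERN/CN=Example User One": "alice001"},
    "atlas": {"/C=UK/O=PPARC/CN=Example User Two": "atlas001"},
}

def authorise(dn: str, vo: str) -> Optional[str]:
    """Return the local account for this DN within the given VO, or None if unauthorised."""
    return GRIDMAP.get(vo, {}).get(dn)

if __name__ == "__main__":
    print(authorise("/C=CH/O=CERN/CN=Example User One", "alice"))  # -> alice001
    print(authorise("/C=CH/O=CERN/CN=Example User One", "atlas"))  # -> None
```

Consolidation across testbeds would mean agreeing who maintains such mappings and keeping them consistent at every participating site.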
Developers' testbed – development testbed – production Grid • Grid projects are responsible for a facility on which developers can build and test their software, and for "testbeds" for demos and beta testing • LCG is responsible for operating a production Grid – a 24 X 7 service, as easy to use as LXBATCH
Short-term user expectation is out of line with the resources and capabilities of the project
Current state • Experiments can do (and are doing) their event production using distributed resources, with a variety of solutions • classic distributed production – send jobs to specific sites, with simple bookkeeping (see the sketch below) • some use of Globus, and some of the HEP Grid tools • other integrated solutions (ALIEN) • The hard problem for distributed computing is ESD and AOD analysis • chaotic workload • unpredictable data access patterns • First data is in 2007 – why is there such a hurry? • Crawl, walk, run • use current solutions for physics productions • consolidate (stabilise, maintain) the middleware • learn what a "production grid" really means
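For illustration, a minimal sketch of what "classic distributed production – send jobs to specific sites, simple bookkeeping" can look like: jobs are dealt out to named sites and logged in a flat file. The site list, the submit() stub and the bookkeeping file name are hypothetical placeholders for whatever batch or Grid submission mechanism an experiment actually uses.

```python
# Illustrative sketch of "classic" distributed production: jobs are assigned
# to specific sites round-robin and recorded in a simple bookkeeping file.
# Site names and the submit() stub are hypothetical placeholders.
import csv
import itertools
from datetime import datetime, timezone

SITES = ["CERN", "FZK", "RAL", "CNAF", "FNAL"]   # example regional centres

def submit(site: str, job_id: int) -> str:
    """Stand-in for a real submission call (e.g. a batch or Globus job submit)."""
    print(f"submitting job {job_id} to {site}")
    return "SUBMITTED"

def run_production(n_jobs: int, bookkeeping_file: str = "production_log.csv") -> None:
    site_cycle = itertools.cycle(SITES)
    with open(bookkeeping_file, "w", newline="") as f:
        log = csv.writer(f)
        log.writerow(["job_id", "site", "status", "submitted_at"])
        for job_id in range(n_jobs):
            site = next(site_cycle)
            status = submit(site, job_id)
            log.writerow([job_id, site, status,
                          datetime.now(timezone.utc).isoformat()])

if __name__ == "__main__":
    run_production(10)
```

The chaotic, unpredictable ESD/AOD analysis workload described above is exactly what such a static scheme does not handle – which is where Grid scheduling and data management are needed.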
This year - Prototype Production Grids Goals • to understand what "production" involves, train staff, build confidence • "productisation" of the Datagrid Testbed 1 • in close collaboration with Datagrid – Datagrid toolkit • start with one (?) experiment • start with a few sites • Creative simplification: cut-down requirements – simplified, realisable • focus on packaging, process • operational infrastructure • support infrastructure • prototype – do not confuse with a production service • does not replace the Datagrid testbed • does not replace the iVDGL testbed
Next year - First production service • be realistic • resources appearing slowly, key people to be hired • first consolidate the existing testbeds • the planning process will define the date for the first worldwide production grid – but expect a 12-month timescale • need mature, supported middleware • need converged/interoperable US/European technology • need to know how to manage a production grid • need to develop solid infrastructure services • need to set up support services
First LCG Production Grid Global Grid Service • deploy a sustained 24 X 7 service • based on the "converged" toolkit emerging from GLUE • ~10 sites – all the Tier 1s and a few Tier 2 sites – including sites in Europe, Asia, North America • > 2 times CERN prototype capacity (= ~1,500 processors, ~1,500 disks) • permanent service for all experiments • achieve target levels for physics event generation/processing rate, reliability, availability
Fabric – Grid – Deployment – decision making • Must be able to make practical, binding community agreements • between the experiments and regional centre management • covering: resources, schedules, security, usage policies, …; the software environment (op. sys., data services, grid services, …); operating standards, procedures, …; operation of the common infrastructure • Grid Deployment Management Board – membership (still under discussion) • experiment data & analysis challenge coordinators • regional centre delegates – Tier 1, Tier 2 (empowered, with responsibility, but not one per regional centre!) • fabric, grid technology and grid deployment area coordinators • Planning and strategy brought to the SC2 for endorsement
Important issues for this year • Defining the overall architecture of the project • Applications, fabric & grid deployment • Grid technology • Computing and analysis model • Thorough re-assessment of the requirements, technology, costs • LHCC re-assessment of physics rates • new technology survey (PASTA) – particularly the tape issues • revised computing and analysis model
Cautionary remarks • Staff arriving at CERN slowly • staff also leaving CERN • To what extent will external and experiment resources be committed to the project? • Importance of agreeing priorities within the community, taking realistic account of timing and resources • Do not expect to solve all of the problems • Do not expect immediate results • Concentrate on a few priority areas, while staff numbers and experience build up • Long-term support • importance of ownership – ensuring that developments belong to long-term structures • temporary staff must be replaced by longer-term staff – an unsolved problem given the budget crisis
Final remark Additional staff at CERN will be extremely useful, but I hope that the main benefit of the project will be the collaboration on the common goal of LHC data analysis