GridPP Status Report
Tony Doyle
GridPP12 Collaboration Meeting
Contents
• What was GridPP1?
• What is GridPP2?
• Challenges abound
• LCG
  • Issues
  • Deployment Status (9-28-30/1/05)
  • Tier-1/A, Tier-2, NGS
• M/S/N
• Middleware
• Food chains
• Applications
• Dissemination
• The UK mountain climb
• Summary
What was GridPP1?
• A team that built a working prototype grid of significant scale (UK figures, with international totals in parentheses):
  • > 2,000 (9,000) CPUs
  • > 1,000 (5,000) TB of available storage
  • > 1,000 (6,000) simultaneous jobs
• A complex project in which 88% of the milestones were completed and all metrics were within specification
A success: “the achievement of something desired, planned, or attempted”
Executive Summary I • “The GridPP1 Project is now complete: following 3 years of development, a prototype Grid has been established, meeting the requirements of the experiments and fully integrated with LCG, currently the World’s largest Grid. Starting from this strong foundation, a more complex project, GridPP2, has now started, with an extended team in the UK working towards a production Grid deployed for the benefit of all experiments by September 2007.” • We achieved (almost exactly) what we stated we would do in building a prototype…
Executive Summary II • “2004 was a pivotal year, marked by extraordinary and rapid change with respect to Grid deployment, in terms of scale and throughput. The scale of the Grid in the UK is more than 2000 CPUs and 1PB of disk storage (from a total of 9,000 CPUs and over 5PB internationally), providing a significant fraction of the total resources required by 2007. A peak load of almost 6,000 simultaneous jobs in August, with individual Resource Brokers able to handle up to 1,000 simultaneous jobs, gives confidence that the system should be able to scale up to the required 100,000 CPUs by 2007. A careful choice of sites leads to acceptable (>90%) throughput for the experiments, but the inherent complexity of the system is apparent and many operational improvements are required to establish and maintain a production Grid of the required scale. Numerous issues have been identified that are now being addressed as part of GridPP2 planning in order to establish the required resource for particle physics computing in the UK.” • Most projects fail in going from prototype to production… • There are many issues: a methodical approach is required.
What is GridPP2? Structures agreed and in place (except LCG phase-2) • 253 Milestones, 112 Monitoring Metrics at present. • Must deliver a “Production Grid”: a robust, reliable, resilient, secure, stable service delivered to end-user applications. • The Collaboration aims to develop, deploy and operate a very large Production Grid in the UK for use by the worldwide particle physics community.
What are the Grid challenges? • Must • share data between thousands of scientists with multiple interests • link major (Tier-0 [Tier-1]) and minor (Tier-1 [Tier-2]) computer centres • ensure all data is accessible anywhere, anytime • grow rapidly, yet remain reliable for more than a decade • cope with the different management policies of different centres • ensure data security • be up and running routinely by 2007
What are the Grid challenges?
1. Software process
2. Software efficiency
3. Deployment planning
4. Link centres
5. Share data
6. Manage data
7. Install software
8. Analyse data
9. Accounting
10. Policies
(centred on Data Management, Security and Sharing)
What are the limits on Data? Advanced Areal Density Trends
[Chart (M. Leonhardt, 4-9-02): areal density (Gb/in²) vs. year, 1987-2022, with the LHC era marked; technology generations run from parallel-track, serpentine and helical tape through tape demos, optical and magnetic disk, the superparamagnetic effect, probe contact-area viability, volumetric optical and atom-level storage, up to the atom surface-density limit of ~1 Petabit/in²; technical progress vs. technology boundaries.]
Currently disk capacity doubles every year (or so) for unit cost.
What are the limits on CPU? Moore’s Law
“No Exponential is Forever … but We Can Delay ‘Forever’” (Gordon Moore, ISSCC 2003: ftp://download.intel.com/research/silicon/Gordon_Moore_ISSCC_021003.pdf)
Technical progress vs. technology boundaries, with the LHC era marked.
Currently CPU performance doubles every two years (or so) for unit cost.
Applies to our problem? (See Dave’s talk)
• Step 1: financial planning
• Step 2: compare to (e.g. Tier-1) experiment requirements
• Step 3: conclude that more than one centre is needed
• Step 4: a Grid?
Ian Foster / Carl Kesselman: “A computational Grid is a hardware and software infrastructure that provides dependable, consistent, pervasive and inexpensive access to high-end computational capabilities.”
Currently network performance doubles every year (or so) for unit cost.
How do I start? http://www.gridpp.ac.uk/start/
• Getting started as a Grid user
• Quick start guide for LCG2: GridPP’s guide to starting as a user of the LHC Computing Grid.
• Getting an e-Science certificate: in order to use the Grid you need a Grid certificate. This page introduces the UK e-Science Certification Authority, which issues certificates to users. You can get a certificate from here.
• Using the LHC Computing Grid (LCG): CERN’s guide on the steps you need to take in order to become a user of the LCG. This includes contact details for support.
• LCG user scenario: describes in a practical way the steps a user has to follow to send and run jobs on LCG and to retrieve and process the output successfully (a minimal walk-through is sketched below).
• Currently being improved.. DTEAM
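To make the user scenario concrete, here is a minimal sketch of a first LCG-2 session from a configured User Interface machine, once a certificate is in place. The JDL contents, the dteam VO and the <jobId> placeholder are illustrative only; exact commands and options vary between LCG-2 releases.

```sh
# Create a Grid proxy from your e-Science certificate (prompts for your passphrase).
grid-proxy-init

# Write a minimal JDL (Job Description Language) file describing a trivial job.
cat > hello.jdl <<'EOF'
Executable    = "/bin/hostname";
StdOutput     = "hello.out";
StdError      = "hello.err";
OutputSandbox = {"hello.out", "hello.err"};
EOF

# Submit the job via a Resource Broker; note the job identifier it returns.
edg-job-submit --vo dteam hello.jdl

# Poll the job and, once it is Done, retrieve the output sandbox.
edg-job-status <jobId>
edg-job-get-output <jobId>
```

If all goes well, the retrieved hello.out simply contains the name of the worker node that ran the job, which is enough to confirm the whole submission chain works.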
Where do we start? Issues
• Overall efficiency ~60%
• “LCG-2 Middleware Problems and Requirements for LHC Experiment Data Challenges”: https://edms.cern.ch/file/495809/2.2/LCG2-Limitations_and_Requirements.pdf
• First large-scale Grid production problems being addressed… at all levels
[Chart: breakdown of the problems, ~¼ in one category and ~¾ in the other]
GridPP Deployment Status (9-28-30/1/05)
• GridPP deployment is part of LCG (currently the largest Grid in the world)
• The future Grid in the UK is dependent upon LCG releases
• Three Grids on a global scale in HEP (similar functionality):

  Grid           Sites     CPUs
  LCG (GridPP)   90 (16)   9000 (2242)
  Grid3 [USA]    29        2800
  NorduGrid      30        3200
UK Tier-1/A Centre: Rutherford Appleton Laboratory
• High quality data services
• National and international role
• UK focus for international Grid development
• 1000 CPU, 200 TB disk, 60 TB tape (capacity 1 PB)
• Grid resource discovery time = 8 hours
[Charts: 2004 disk use and 2004 CPU utilisation, showing peak utilisation and a fall-off in Q4]
UK Tier-2 Centres: the whole is greater than the sum of the parts..
Level-2 Grid (NGS)
[Map: Leeds, Manchester, DL, Oxford, RAL]
• In future will include services to facilitate collaborative (grid) computing:
  • Authentication (PKI X509)
  • Job submission/batch service
  • Resource brokering
  • Authorisation
  • Virtual Organisation management
  • Certificate management
  • Information service
  • Data access/integration (SRB/OGSA-DAI/DQPS)
  • National Registry (of registries)
  • Data replication
  • Data caching
  • Grid monitoring
  • Accounting
Middleware Development
• Network Monitoring
• Configuration Management (deployment area: LCFG, generic, Quattor)
• Grid Data Management
• Storage Interfaces
• Information Services
• Security
Prototype Middleware: Status & Plans (I)
• Workload Management
  • AliEn TaskQueue
  • EDG WMS (plus new TaskQueue and Information Supermarket)
  • EDG L&B
• Computing Element
  • Globus Gatekeeper + LCAS/LCMAPS
  • Dynamic accounts (from Globus)
  • CondorC
  • Interfaces to LSF/PBS (blahp)
• “Pull components”
  • AliEn CE
  • gLite CEmon (being configured)
Blue: deployed on development testbed; Red: proposed
(LHCC Comprehensive Review, November 2004)
Prototype Middleware: Status & Plans (II)
• Storage Element
  • Existing SRM implementations: dCache, Castor, …
  • FNAL & LCG DPM
  • gLite-I/O (re-factored AliEn-I/O)
• Catalogs
  • AliEn FileCatalog – global catalog
  • gLite Replica Catalog – local catalog
  • Catalog update (messaging)
  • FiReMan interface
  • RLS (Globus)
• Data Scheduling
  • File Transfer Service (Stork + GridFTP)
  • File Placement Service
  • Data Scheduler
• Metadata Catalog
  • Simple interface defined (AliEn + BioMed)
• Information & Monitoring
  • R-GMA web service version; multi-VO support (a query sketch follows)
(LHCC Comprehensive Review, November 2004)
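Since R-GMA presents monitoring information through SQL-style queries, a user-level sanity check might look like the sketch below. This is illustrative only: the table and column names are assumed from the GLUE schema as commonly published into R-GMA, and the exact client invocation varies between releases.

```sh
# Start the R-GMA command-line client on an LCG-2 User Interface
# (assumes the R-GMA client tools are installed and configured).
rgma

# At the prompt, a one-off "latest" query against an assumed
# GLUE-schema table of computing elements might read:
#   rgma> select UniqueID, FreeCPUs from GlueCE
```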
Prototype Middleware: Status & Plans (III)
• Security
  • VOMS as Attribute Authority and VO management (usage sketched below)
  • myProxy as proxy store
  • GSI security and VOMS attributes as enforcement
    • fine-grained authorization (e.g. ACLs)
    • Globus to provide a set-uid service on CE
• Accounting
  • EDG DGAS (not used yet)
• User Interface
  • AliEn shell
  • CLIs and APIs
  • GAS (catalogs; integrate remaining services)
• Package manager
  • Prototype based on AliEn backend
  • Evolve to final architecture agreed with ARDA team
(LHCC Comprehensive Review, November 2004)
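As a sketch of how the VOMS and myProxy pieces above fit together from a user’s point of view; the dteam VO and the MyProxy hostname are placeholders, and option spellings differ slightly between VOMS releases.

```sh
# Create a proxy carrying VOMS attributes for the dteam VO
# (instead of a plain grid-proxy-init, for VOMS-aware services).
voms-proxy-init --voms dteam

# Inspect the proxy, including the VOMS attribute certificate (FQANs)
# that services can use for fine-grained authorization decisions.
voms-proxy-info --all

# Optionally store a long-lived credential in a MyProxy server so that
# long-running jobs can have their short-lived proxies renewed.
myproxy-init -s myproxy.example.ac.uk
```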
Middleware & OGSA-compliance: we need an “open” “grid” “services” “architecture”
• Infrastructure Services that enable communication between disparate resources (computer, storage, applications, etc.), removing barriers associated with shared utilization.
• Resource Management Services that enable the monitoring, reservation, deployment, and configuration of grid resources based on quality of service requirements.
• Data Services that enable the movement of data where it is needed – managing replicated copies, query execution and updates, and transforming data into new formats if required.
• Context Services that describe the required resources and usage policies for each customer that utilizes the grid – enabling resource optimization based on service requirements.
• Information Services that provide efficient production of, and access to, information about the grid and its resources, including status and availability of a particular resource.
• Self-Management Services that support the attainment of stated levels of service with as much automation as possible, to reduce the costs and complexity of managing the system.
• Security Services that enforce security policies within a virtual organization, promoting safe resource-sharing and appropriate authentication and authorization of users.
• Execution Management Services that enable both simple and more complex workflow actions to be executed, including placement, provisioning, and management of the task lifecycle.
OASIS WS-RF & WS-I+
• WS-RF (the OASIS standard) vs. WS-I+ (an implementation?)
• UK e-Science Core Programme services (July 2004): WS-I+
  • WS-I Basic Profile (XSD, WSDL 1.1, SOAP 1.1, UDDI)
  • WS-I Basic Security Profile (parts of WS-Security)
  • BPEL
  • WS-Addressing (to be replaced by the ongoing W3C activity)
  • WS-ReliableMessaging
  • WS-Eventing
• A service built with WS-RF will not interoperate with a WS-I+ client…
• UK e-Science meeting today
gLite & ARDA Metadata
• gLite (a standard?) vs. ARDA (an implementation?)
• End-user throughput or standards driven?
• gSOAP optimisation important
• Early days..
• Some overlapping functionality – missing extensibility in gLite
• APIs differ
• Testing ongoing: middle ground – adapt to gLite interfaces (e.g. AMI-gLite), test ARDA implementation
The OASIS : OGSA : WS-RF : WS-I+ : gLite : ARDA : experiment food chain?
1. A hierarchy?
2. A virtuous(?) circle: experiment : ARDA : gLite : WS-I+ : WS-RF : OGSA : OASIS
• Depends on your World view…
• Only works if there is sufficient decomposition…
• Discussion required
Workshop on eInfrastructures (Internet and Grids): best practices and challenges (Conference xxx, August 2003). Fabrizio Gagliardi, DataGrid Project Manager and EGEE designated Project Director, CERN, Geneva, Switzerland.
Need to relate the high-level plan to what is required on the ground.
LCG Robustness: e.g. data management • LCG File Catalog (LFC) developed to address the performance and scalability problems seen in the 2004 Data Challenges • Features include hierarchical namespace, transactions, cursors, timeouts & retries, GSI security, ACLs... (see the sketch below) • Performance testing almost complete • Tests of insert, query and delete rates with up to 40,000,000 entries and 10 clients / 100 concurrent threads • Insert rates almost independent of the number of entries in the LFC; much more scalable than EDG RLS • Higher delete rate than EDG RLS • Query rate lower than Globus but higher than EDG.. however, LFC retrieves much more information per query, so it matches user patterns better • Scales well to many replicas and LFNs per GUID, and to many concurrent users http://ppewww.ph.gla.ac.uk/~caitrian/LFC
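The hierarchical namespace and ACLs can be exercised directly with the LFC client tools; a minimal sketch, in which the LFC hostname and the VO directory are placeholders:

```sh
# Point the LFC client tools at a catalogue instance (hostname is a placeholder).
export LFC_HOST=lfc.example.ac.uk

# Browse and extend the hierarchical namespace, Unix-style.
lfc-ls -l /grid/dteam
lfc-mkdir /grid/dteam/tests

# Inspect the POSIX-style ACLs on the new directory.
lfc-getacl /grid/dteam/tests
```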
Testing & Documentation: e.g. data management
• lcg-aa “add-alias”: adds an alias in the RMC for a given GUID.
• lcg-cp “copy”: copies a grid file to a specific location on the UI area.
• lcg-cr “copy-and-register”: copies a file to an SE and registers the file in the SE’s LRC.
• lcg-del “delete”: deletes a file.
• lcg-gt “get-turl”: gets the TURL for a given SURL + transfer protocol.
• lcg-infosites “list all sites’ information”: lists important information for all sites on the grid.
• lcg-la “list-aliases”: lists all the aliases for a given LFN, GUID or SURL.
• lcg-lg “list-GUID”: lists the GUID for a given LFN or SURL.
• lcg-lr “list-replicas”: lists the replicas for a given LFN, GUID or SURL.
• lcg-ra “remove-alias”: removes an alias in the RMC for a given GUID.
• lcg-rep “replicate”: copies a file from one SE to another SE and registers it in the destination SE’s LRC.
• lcg-rf “register-file”: registers in the LRC a file residing on an SE.
• lcg-uf “unregister-file”: unregisters in the LRC a file residing on an SE.
Preliminary tests completed for all 91 data management commands; simple additional documentation added (a typical round trip is sketched below). http://ppewww.ph.gla.ac.uk/~fergusjk/
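A typical round trip with the commands above might look like the following sketch; the VO name, SE hostnames, local paths and LFN are placeholders, and exact options vary between lcg_utils releases.

```sh
# Copy a local file to a Storage Element and register it in the catalogue
# under a logical file name (VO and SE names are placeholders).
lcg-cr --vo dteam -d se1.example.ac.uk -l lfn:demo-data.dat \
       file:/home/user/data.dat

# Replicate the file to a second SE, then list all of its replicas.
lcg-rep --vo dteam -d se2.example.ac.uk lfn:demo-data.dat
lcg-lr --vo dteam lfn:demo-data.dat

# Copy a replica back to the local UI area, then delete all replicas.
lcg-cp --vo dteam lfn:demo-data.dat file:/tmp/data.dat
lcg-del --vo dteam -a lfn:demo-data.dat
```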
Application Development
• ATLAS
• LHCb
• CMS
• SAMGrid (FermiLab)
• BaBar (SLAC)
• QCDGrid
• PhenoGrid
Applications: there is a (slightly wonky?) wheel; use it to get to where you need to be
• ZEUS uses LCG
  • needs the Grid to respond to increasing demand for MC production
  • up to 6 million Geant events per week on the Grid since August 2004
• The system developed for the large LHC experiments works (more) effectively for other (less resource-intensive) applications
• Experiments need to work together with the deployment team/sites
• The de-facto deployment standard is LCG – it ~works. We can add components as required, to meet each experiment’s needs
Dissemination: much has happened.. more people are reading about it..
• GridPP2 gets its first term report (Fri 28 Jan 2005)
• BaBar UK moves into the Grid era (Tue 11 Jan 2005)
• LHCb-UK members get up to speed with the Grid (Wed 5 Jan 2005)
• GridPP in Pittsburgh (Thu 9 Dec 2004)
• GridPP website busier than ever (Mon 6 Dec 2004)
• Optorsim 2.0 released (Wed 24 Nov 2004)
• ZEUS produces 5 million Grid events (Mon 15 Nov 2004)
• CERN 50th anniversary reception (Tue 26 Oct 2004)
• GridPP at CHEP'04 (Mon 18 Oct 2004)
• LHCb data challenge first phase a success for LCG and UK (Mon 4 Oct 2004)
• Networking in Nottingham - GLIF launch meeting (Mon 4 Oct 2004)
• GridPP going for Gold - website award at AHM (Mon 6 Sep 2004)
• GridPP at the All Hands Meeting (Wed 1 Sep 2004)
• R-GMA included in latest LCG release (Wed 18 Aug 2004)
• LCG2 administrators learn tips and tricks in Oxford (Tue 27 Jul 2004)
• Take me to your (project) leader (Fri 2 Jul 2004)
• ScotGrid's 2nd birthday: ScotGrid clocks up 1 million CPU hours (Fri 25 Jun 2004)
• Meet your production manager (Fri 18 Jun 2004)
• GridPP10 report and photographs (Wed 9 Jun 2004)
• CERN recognizes UK's outstanding contribution to Grid computing (Wed 2 Jun 2004)
• UK particle physics Grid takes shape (Wed 19 May 2004)
• A new monitoring map for GridPP (Mon 10 May 2004)
• Press reaction to EGEE launch (Tue 4 May 2004)
• GridPP at the EGEE launch conference (Tue 27 Apr 2004)
• LCG2 released (Thu 8 Apr 2004)
• University of Warwick joins GridPP (Thu 8 Apr 2004)
• Grid computing steps up a gear: the start of EGEE (Thu 1 Apr 2004)
• EDG gets glowing final review (Mon 22 Mar 2004)
• Grids and Web Services meeting, 23 April, London (Tue 16 Mar 2004)
• EU DataGrid Software License approved by OSI (Fri 27 Feb 2004)
• GridPP Middleware workshop, March 4-5 2004, UCL (Fri 20 Feb 2004)
• Version 1.0 of the Optorsim grid simulation tool released by EU DataGrid (Tue 17 Feb 2004)
• Summary and photographs of the 9th GridPP Collaboration Meeting (Thu 12 Feb 2004)
• 138,976 hits in December
The UK mountain climb has started..
• The summit: 10 Million SPECint2000, i.e. ~10,000 PCs (3 GHz Pentium 4), and annual data storage of 2.4-2.8 PetaBytes per year? (~20%): a CD stack ~4 km high
• For the Ben Nevis climb? A step-by-step plan is in place…
• In production terms, we have left base camp: quantitatively, we’re ~10% of the way there in terms of UK CPU (~1,000 of ~10,000) and disk (~1 of ~10 PB)
• We are here (0.4 km)
Summary
• Introduction: the Grid is a reality
• Project Management: a project was/is needed
• Resources: under control
• LCG Deployment: LCG2 support; SC case presentation 3/2/05
• Tier-1/A production: 16 UK sites are on the Grid
• Tier-2 resources: MoUs, planning, deployment, monitoring each underway as part of GridPP2
• M/S/N: developments established, R-GMA deployed
• EGEE: gLite designed, incorporating web services
• Applications: interfaces developed, testing phase
• Dissemination: area transformed
• Beyond GridPP2: incorporation in HEP programme..
(GRIDPP-PMB-40-EXEC)
Top 10 Issues?
• Issues are the ones that your oversight committee tells you are issues?
• Issues are long-term (endemic) problems – they were around 3 years ago?
• Issues are wider than this? The ones you thought might be problems at the start (but they were called challenges)?
PPARC Oversight Committee Issues
1. GridPP may be underestimating the difficulty of engaging with each of the experiment teams.
2. A document with a plan to support the UK physics analysis community in 2007 is needed.
3. Tier-1 allocation policy: define the usage policy, i.e. what is the absolute scale? Are we under/over-committing from the PPARC perspective?
4. Need to update the GridPP2 Risk Register.
5. The OC requires the LCG funding case to be put to them before going to Science Committee. (This has been done.)
6. Get-fit plan on Production Metrics: how do we move from 60% to >90%, and how will this be monitored in the UK?
7. Nail down the metrics - no sensible values yet established. Iterations are required.
8. Clarify the probable direction of GridPP in terms of middleware.
2002 Challenges
• Complete rollout of TB-1 and plan future upgrades
• Reconvened ATF to work closely with applications
• Make TB-2 a success
• Deploy and exploit Tier-1/A
• Applications to make good use of testbeds
• Solve interoperability issues
• We are part of many larger collaborations/structures/groupings - we need to collaborate/discuss/engage here, and
• Focus on implementation in the UK… this will tell us what works (and what doesn’t) at any given point.
What are the Grid challenges? (revisited)
1. Software process
2. Software efficiency
3. Deployment planning
4. Link centres
5. Share data
6. Manage data
7. Install software
8. Analyse data
9. Accounting
10. Policies
(centred on Data Management, Security and Sharing)
Top 10 Issues?
• Three methods to identify issues:
  1. “If you cannot measure it, you cannot improve it.” Need to quantify end-to-end throughput… measurements are important…
  2. Tackle the issues as they present themselves, in a timely way… LHC data is imminent…
  3. Is there a GridPP top 10? Answer?: No (probably)