A Grid For Particle Physics. From testbed to production. 3rd September 2004 All Hands Meeting – Nottingham, UK. Jeremy Coles J.Coles@rl.ac.uk. Contents. Review of GridPP1 and the European Data Grid Project. The middleware components of the testbed. Lessons learnt from the project.
A Grid For Particle Physics From testbed to production 3rd September 2004 All Hands Meeting – Nottingham, UK Jeremy Coles J.Coles@rl.ac.uk
Contents • Review of GridPP1 and the European Data Grid Project • The middleware components of the testbed • Lessons learnt from the project • Status of the current operational Grid • Future plans and challenges • Summary
The physics driver: the LHC
• 1 Megabyte (1 MB): a digital photo
• 1 Gigabyte (1 GB) = 1000 MB: a DVD movie
• 1 Terabyte (1 TB) = 1000 GB: world annual book production
• 1 Petabyte (1 PB) = 1000 TB: annual production of one LHC experiment
• 1 Exabyte (1 EB) = 1000 PB: world annual information production
• 40 million collisions per second
• After filtering, 100-200 collisions of interest per second
• 1-10 Megabytes of data digitised for each collision = recording rate of 0.1-1 Gigabytes/sec
• 10^10 collisions recorded each year = ~10 Petabytes/year of data
Experiments: CMS, LHCb, ATLAS, ALICE
les.robertson@cern.ch
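The figures on this slide can be checked with a few lines of arithmetic. The sketch below is only a back-of-envelope illustration: the function names are made up for this example, and the yearly estimate assumes an average of about 1 MB per recorded collision, which the slide does not state explicitly.

```python
# Back-of-envelope check of the data-rate figures quoted on the slide.
# Decimal prefixes are used, as on the slide (1 GB = 1000 MB, and so on).
MB_PER_GB = 1000
MB_PER_PB = 1000 ** 3


def recording_rate_gb_per_s(collisions_per_s, mb_per_collision):
    """Recording rate in GB/s for a given filtered collision rate and event size."""
    return collisions_per_s * mb_per_collision / MB_PER_GB


def yearly_volume_pb(collisions_per_year, mb_per_collision):
    """Yearly data volume in PB for a given number of recorded collisions."""
    return collisions_per_year * mb_per_collision / MB_PER_PB


# 100 collisions/s at 1 MB and at 10 MB each span the quoted 0.1-1 GB/s range.
low_rate = recording_rate_gb_per_s(100, 1)
high_rate = recording_rate_gb_per_s(100, 10)

# 10^10 recorded collisions at an assumed ~1 MB each give ~10 PB per year.
yearly = yearly_volume_pb(10 ** 10, 1)

print(low_rate, high_rate, yearly)
```

Note that the upper end of both ranges together (200 collisions per second at 10 MB each) would give 2 GB/s, so the quoted 0.1-1 GB/s is best read as an order-of-magnitude figure.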
Data handling [Diagram of the data flow at CERN. Labels: detector; event filter (selection & reconstruction); raw data; reconstruction; event summary data; event reprocessing; event simulation; simulation; processed data; analysis objects (extracted by physics topic); batch physics analysis; interactive physics analysis; analysis] les.robertson@cern.ch
The UK response: GridPP
GridPP – A UK Computing Grid for Particle Physics
• 19 UK Universities, CCLRC (RAL & Daresbury) and CERN
• Funded by the Particle Physics and Astronomy Research Council (PPARC)
• GridPP1 – Sept. 2001-2004, £17m, "From Web to Grid"
• GridPP2 – Sept. 2004-2007, £16(+1)m, "From Prototype to Production"
The project
People: 500 registered users; 12 Virtual Organisations; 21 Certificate Authorities; >600 people trained; 456 person-years of effort
Application Testbed: ~20 regular sites; >60,000 jobs submitted (since 09/03, release 2.0); peak >1000 CPUs; 6 Mass Storage Systems
Software: >65 use cases; 7 major software releases (>60 in total); >1,000,000 lines of code
Scientific Applications: 5 Earth Observation institutes; 10 bio-medical apps; 6 HEP experiments
http://eu-datagrid.web.cern.ch/eu-datagrid/
Contents • The middleware components of the testbed • Lessons learnt from the project
The infrastructure developed [Diagram labels: User Interface; Computing Element; Storage Element; UI; JDL; AA server (VOMS); Resource broker (C++ Condor MM libraries, Condor-G for submission); Logging & Bookkeeping; MySQL DB – stores job state info; Berkeley Database Information Index; Replica catalogue per VO (or equiv.); gridFTP; Gatekeeper (Perl script) + Scheduler; Batch workers; Job submission; Python – default; Java – GUI; APIs (C++, J, P); NFS, Tape, Castor]
Integration
(MDS +) BDII or R-GMA • Data services: RLS, RC
Much time spent on:
• Controlling the direct and indirect interplay of the various integrated components
• Addressing stability issues (often configuration linked) and bottlenecks in a non-linear system
• Predicting (or failing to predict) where the next bottleneck will appear in the job processing network
The storage element
[Diagram labels: The Grid Storage Element interfaces; “Handlers”; TAPE storage (or disk); File Metadata; Access Control]
• Manages storage and provides common interfaces to Grid clients.
• Higher-level data management tools use replica catalogues and metadata about files to locate replicas and to choose which replica to use.
• Since EDG, work has provided the SE with an SRM 1 interface. SRM 2.1, with added functionality, will be available soon.
• The SRM interface is a file control interface; there is also an interface for publishing information. Internally, “handlers” ensure modularity and flexibility.
Lessons learnt
• Separating file control (e.g. staging, pinning) from data transfer is useful (different nodes, better performance)
• Can be used for load balancing, redirection, etc.
• Easy to add new data transfer protocols
• However, files in cache must be released by the client or time out
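The last lesson, that a pinned file stays in cache until the client releases it or the pin times out, can be illustrated with a toy model. This is a minimal sketch, not the SRM interface: the class `PinCache`, its method names and the injectable clock are all hypothetical and exist only for this example.

```python
import time


class PinCache:
    """Toy disk cache: a staged file stays pinned until released or expired."""

    def __init__(self, pin_lifetime_s, clock=time.monotonic):
        self.pin_lifetime_s = pin_lifetime_s
        self.clock = clock          # injectable so the example can fake time
        self.pins = {}              # file name -> pin expiry time

    def stage_and_pin(self, name):
        # A real storage element would recall the file from tape here;
        # this sketch only records when the pin expires.
        self.pins[name] = self.clock() + self.pin_lifetime_s

    def release(self, name):
        # The client signals that it has finished with the file.
        self.pins.pop(name, None)

    def evictable(self, name):
        # Cache space can be reclaimed if the file was never pinned,
        # was released, or its pin has timed out.
        expiry = self.pins.get(name)
        return expiry is None or self.clock() >= expiry


# Usage with a fake clock, so the timeout can be shown without waiting.
now = [0.0]
cache = PinCache(pin_lifetime_s=60, clock=lambda: now[0])
cache.stage_and_pin("run1.raw")
pinned_at_start = not cache.evictable("run1.raw")
now[0] = 61.0                       # the client never released the file
expired_later = cache.evictable("run1.raw")
```

The timeout is what protects the cache from clients that crash or forget to release their files.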
Information & monitoring
Based on the (simple model of the) Grid Monitoring Architecture (GMA) from the GGF [Diagram labels: Producer; Registry/Schema; Consumer]
For the Relational Grid Monitoring Architecture (R-GMA):
• Hide the Registry mechanism from the user: the Producer registers on behalf of the user, and the Mediator (in the Consumer) transparently selects the correct Producer(s) to answer a query
• Users just think in terms of Producers and Consumers
• Use the relational model (the R of R-GMA)
• Facilitate expression of queries over all the published information
Lessons learnt
• Release working code early
• Distributed software system testing is hard – a private WP3 testbed was very useful
• Automate as much as possible (CruiseControl always runs all tests!)
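The Producer/Registry/Consumer pattern described on this slide can be sketched in a few lines. This is a toy illustration of the idea only and not the R-GMA API: the class and method names, the table name and the sample rows are all invented for this example, and a Python predicate stands in for a relational query.

```python
class Registry:
    """Maps table names to the producers that publish rows for them."""

    def __init__(self):
        self._producers = {}

    def register(self, table, producer):
        self._producers.setdefault(table, []).append(producer)

    def lookup(self, table):
        return list(self._producers.get(table, []))


class Producer:
    """Publishes rows for one table; registration is done on the user's behalf."""

    def __init__(self, registry, table):
        self.rows = []
        registry.register(table, self)

    def insert(self, row):
        self.rows.append(row)


class Consumer:
    """Answers queries; the user never talks to the registry directly."""

    def __init__(self, registry):
        self._registry = registry

    def query(self, table, predicate=lambda row: True):
        # Mediator step: find every producer for the table, then merge
        # the rows that satisfy the query.
        return [row
                for producer in self._registry.lookup(table)
                for row in producer.rows
                if predicate(row)]


registry = Registry()
site_a = Producer(registry, "cpu_load")
site_b = Producer(registry, "cpu_load")
site_a.insert({"site": "A", "load": 0.9})
site_b.insert({"site": "B", "load": 0.2})

busy = Consumer(registry).query("cpu_load", lambda row: row["load"] > 0.5)
```

The user code only creates Producers and Consumers; the registry lookup stays hidden inside `query`, which is the point the slide makes.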
The security model [Diagram of certificate flows, split into low-frequency and high-frequency operations. Labels: CA; host cert (long life); user cert (long life); crl update; registration; VO-VOMS; voms-proxy-init; proxy cert (short life); service cert (short life); authz cert (short life); service; user; mutual authentication & authorization info; LCAS] LCAS: Local Centre Authorisation Service
The security model (2)
• Authentication - GridPP led the EDG/LCG CA infrastructure (trust)
• Authorisation: VOMS for global policy; LCAS for local site policy; GACL (fine-grained access control) and GridSite for HTTP
• LCG/EGEE security policy led by GridPP
Lessons learnt
• Be careful collecting requirements (integration is difficult)
• Security must be an integral part of all development (from the start)
• Building and maintaining “trust” between projects and continents takes time
• Integration of security into existing systems is complex
• There must be a dedicated activity dealing with security
• EGEE benefited greatly – now has separate activity
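The split between global policy (VOMS) and local site policy (LCAS) amounts to a two-stage check in which both stages must pass. The sketch below only illustrates that idea: the function, its parameters and the sample names are hypothetical, and real VOMS and LCAS decisions are based on signed certificates and attributes rather than plain dictionaries.

```python
def authorise(user_dn, vo, vo_members, site_allowed_vos, site_banned_dns):
    """Grant access only if both the global and the local check pass."""
    # Global policy (VOMS-like): is the user a member of the claimed VO?
    in_vo = user_dn in vo_members.get(vo, set())
    # Local policy (LCAS-like): does this site accept the VO, and has it
    # not banned this particular user?
    site_ok = vo in site_allowed_vos and user_dn not in site_banned_dns
    return in_vo and site_ok


# Hypothetical sample data for the example.
vo_members = {"atlas": {"/C=UK/CN=alice"}, "babar": {"/C=UK/CN=bob"}}

alice_ok = authorise("/C=UK/CN=alice", "atlas", vo_members,
                     site_allowed_vos={"atlas"}, site_banned_dns=set())
bob_ok = authorise("/C=UK/CN=bob", "babar", vo_members,
                   site_allowed_vos={"atlas"}, site_banned_dns=set())
```

Here `bob_ok` is false even though the user is a valid VO member, because the site does not accept that VO: the site keeps the final say.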
Networking
• A network transfer “cost” estimation service provides applications and middleware with the costs of data transport
• Used by RBs for optimized matchmaking (getAccessCost), and also directly by applications (getBestFile)
• GEANT network tests campaign: network Quality of Service; high-throughput transfers
• Close collaboration with DANTE: set-up of the testbed; analysis of results; access granted to all internal GEANT monitoring tools
• Network monitoring is a key activity, both for provisioning and to provide accurate aggregate functions for global grid schedulers
• The investigations carried out on network QoS have led to a much greater understanding of how to utilise the network to benefit Grid operations
• Benefits resulted from close contact with DANTE and DataTAG, at both technical and management level
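Choosing a replica by estimated transfer cost can be sketched as a simple minimisation. The code below is an assumption-laden illustration and not the EDG getAccessCost or getBestFile interface: the function `best_replica`, the cost table and the site names are invented, and the cost is an arbitrary number where lower means cheaper.

```python
def best_replica(replica_sites, destination, access_cost):
    """Return the replica site with the lowest estimated transfer cost."""
    if not replica_sites:
        raise ValueError("no replicas available")
    return min(replica_sites, key=lambda site: access_cost(site, destination))


# Hypothetical cost table: (source, destination) -> estimated cost.
COSTS = {
    ("site_a", "site_c"): 40.0,
    ("site_b", "site_c"): 12.5,
    ("site_c", "site_c"): 0.0,   # a local copy costs nothing to transfer
}


def table_cost(source, destination):
    # Unknown routes are treated as infinitely expensive.
    return COSTS.get((source, destination), float("inf"))


choice = best_replica(["site_a", "site_b"], "site_c", table_cost)
```

A broker could apply the same comparison across candidate computing elements, which is one way to read the "optimized matchmaking" bullet.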
Project lessons learnt
• Formation of Task Forces (applications + middleware) was a very important step midway through the project. Applications should have played a larger role in architecture discussions from the start
• Loose Cannons (team of 5) were crucial to all developments. They worked across experiments and work packages
• Site certification needs to be improved, and validation needs to be automated and run regularly. Misconfigured sites may cause many failures
• It is important to provide a stable environment to attract users, but at the start get working code out to known users as quickly as possible
• Quality should start at the beginning of the project for all activities, with defined procedures, standards and metrics
• Security needs to be an integrated part from the very beginning
Contents • Status of the current operational Grid
Our grid is working …
NorthGrid ****: Daresbury, Lancaster, Liverpool, Manchester, Sheffield
SouthGrid *: Birmingham, Bristol, Cambridge, Oxford, RAL PPD, Warwick
ScotGrid *: Durham, Edinburgh, Glasgow
LondonGrid ***: Brunel, Imperial, QMUL, RHUL, UCL
… and is part of LCG • Rutherford Laboratory, together with a site in Taipei, is currently providing the Grid Operations Centre. It will also run the UK/I EGEE Regional Operations Centre and Core Infrastructure Centre • Resources are being used for data challenges • Within the UK we have some VO/experiment Memoranda of Understanding in place • Tier-2 structure is working well
Scale
GridPP prototype Grid
• > 1,000 CPUs: 500 CPUs at the Tier-1 at RAL; > 500 CPUs at 11 sites across UK organised in 4 Regional Tier-2s
• > 500 TB of storage
• > 800 simultaneous jobs
Integrated with international LHC Computing Grid (LCG)
• > 5,000 CPUs
• > 4,000 TB of storage
• > 70 sites around the world
• > 4,000 simultaneous jobs
• Monitored via Grid Operations Centre (RAL)
[Figure: picture yesterday (hyperthreading enabled on some sites)] http://goc.grid.sinica.edu.tw/gstat/
Past upgrade experience at RAL Previously utilisation of new resources grew steadily over weeks or months.
Tier-1 update 27-28th July 2004 Hardware Upgrade With the Grid we see a much more rapid utilisation of newly deployed resources.
Contents • Future plans and challenges
GridPP2 management [Organisation chart labels: Project Map; Collaboration Board; Project Leader; Project Manager; Production Manager; Dissemination Officer; Project Management Board; Risk Register; GGF, LCG, EGEE, UK e-Science, Liaison; Deployment Board; User Board]
There are still challenges • Middleware validation • Improving Grid “efficiency” • Meeting experiment requirements with the Grid • Provision of work group computing • Distributed file (and sub-file) management • Experiment software distribution • Provision of distributed analysis functionality • Production accounting • Encouraging an open sharing of resources • Security
Middleware validation
Is starting to be addressed through a Certification and Testing testbed…
[Diagram labels: JRA1; SA1; DEVELOPMENT & INTEGRATION; UNIT & FUNCTIONAL TESTING; CERTIFICATION TESTING; APP INTEGR; SERVICES; DEPLOY; DEPLOYMENT PREPARATION; PRE-PRODUCTION; PRODUCTION; Integrate; Basic Functionality Tests; Run Certification Matrix; Run tests; C&T suites; Site suites; APPS SW Installation; HEP EXPTS; BIO-MED; OTHER TBD; Dev Tag; Release candidate tag; Certified release tag; Deployment release tag; Production tag]
Distributed analysis • AliEn (ALICE Grid) provided a pre-Grid implementation [Perl scripts] • ARDA provides a framework for PP application middleware
Software distribution
• ATLAS Data Challenge to validate world-wide computing model
• Packaging, distribution and installation:
• Scale: one release build takes 10 hours and produces 2.5 GB of files
• Complexity: 500 packages, Mloc, 100s of developers and 1000s of users
• ATLAS collaboration is widely distributed: 140 institutes, all wanting to use the software
• Needs ‘push-button’ easy installation
Step 1: Monte Carlo Data Challenges. Step 2: Real Data.
[Diagram labels: Physics Models; Monte Carlo Truth Data; Trigger System; Detector Simulation; Data Acquisition; Run Conditions; Level 3 trigger; MC Raw Data; Calibration Data; Raw Data; Trigger Tags; Reconstruction; Event Summary Data (ESD); Event Tags; MC Event Summary Data; MC Event Tags]
Production accounting GOC aggregates data across all sites. http://goc.grid-support.ac.uk/ROC/docs/accounting/accounting.php
Deployment Metrics Accounting and Monitoring Documentation Support Procedures Security Middleware Stable fabric Porting to new platforms…
Grevolution
• 2001: Separate Experiments, Resources, Multiple Accounts (CERN Computer Centre; RAL Computer Centre; 19 UK Institutes)
• 2004: Prototype Grids (CERN Prototype Tier-0 Centre; UK Prototype Tier-1/A Centre; 4 UK Prototype Tier-2 Centres)
• 2007: 'One' Production Grid (CERN Tier-0 Centre; UK Tier-1/A Centre; 4 UK Tier-2 Centres)
[Diagram labels: BaBarGrid; BaBar; EGEE; SAMGrid; CDF; D0; ATLAS; EDG; LHCb; ARDA; GANGA; LCG; ALICE; CMS; LCG]
Contents • Summary
Summary
• The Large Hadron Collider data volumes make Grid computing a necessity
• GridPP1 with EDG developed a successful Grid prototype
• GridPP members have played a critical role in most areas – security, workload management, monitoring & operations
• GridPP involvement continues with the Enabling Grids for e-Science in Europe (EGEE) project – driving the federating of Grids
• As we move towards a full production service we face many challenges in areas such as deployment, accounting and true open sharing of resources
Or, to see a possible analogy of developing a Grid, follow this link! http://www.fallon.com/site_layout/work/clientview.aspx?clientid=12&projectid=85&workid=25784
Useful links GRIDPP and LCG: • GridPP collaboration http://www.gridpp.ac.uk/ • Grid Operations Centre (inc. maps) http://goc.grid-support.ac.uk/ • The LHC Computing Grid http://lcg.web.cern.ch/LCG/ Others • PPARC http://www.pparc.ac.uk/Rs/Fs/Es/intro.asp • The EGEE project http://egee-intranet.web.cern.ch/egee-intranet/index.html • The European Data Grid final review http://eu-datagrid.web.cern.ch/eu-datagrid/