Worldwide LHC Computing Grid (WLCG). Markus Schulz, LCG Deployment, CERN IT Department. 14 January 2009.
Outline • LHC, the computing challenge • Data rate, computing, community • Grid Projects @ CERN • WLCG, EGEE • gLite Middleware • Code Base • Software life cycle • EGEE operations • Outlook and summary
View of the ATLAS detector (under construction). 150 million sensors deliver data … 40 million times per second.
The LHC Computing Challenge • Signal/Noise: 10^-9 • Data volume • High rate × large number of channels × 4 experiments • 15 PetaBytes of data each year • Compute power • Event complexity × number of events × thousands of users • 200k of (today's) fastest CPUs • Worldwide analysis & funding • Computing funded locally in major regions & countries • Efficient analysis everywhere • Grid technology
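To make the data-volume claim concrete, here is a back-of-the-envelope sketch in Python. The trigger rate, event size and running time are illustrative assumptions (roughly the ballpark for a single LHC experiment at the time), not figures taken from this slide.

    # Rough, illustrative estimate of the annual raw-data volume for one experiment.
    # All inputs are assumptions for the sake of the arithmetic, not official numbers.
    trigger_rate_hz = 200        # events written to storage per second (assumed)
    event_size_mb = 1.6          # average raw event size in MB (assumed)
    seconds_of_running = 1e7     # ~10^7 s of data taking per year (common rule of thumb)

    raw_per_experiment_pb = trigger_rate_hz * event_size_mb * seconds_of_running / 1e9
    print(f"raw data per experiment: ~{raw_per_experiment_pb:.1f} PB/year")

    # With four experiments plus derived and simulated data, the total lands in the
    # ~15 PB/year range quoted on the slide.
    print(f"four experiments (raw only): ~{4 * raw_per_experiment_pb:.1f} PB/year")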
Timeline: LHC Computing (diagram). Milestones: LHC approved; ATLAS & CMS approved; ALICE approved; LHCb approved; ATLAS & CMS CTP; "Hoffmann" Review; Computing TDRs; LHC start. The ATLAS (or CMS) estimates of the requirements for the first year at design luminosity grew at each step: from 10^7 MIPS and 100 TB disk (CTP), to 7×10^7 MIPS (140 MSi2K) and 1,900 TB disk ("Hoffmann" Review), to 55×10^7 MIPS and 70,000 TB disk (Computing TDRs). Ever increasing requirements.
LHC community (world map): over 6,000 LHC scientists worldwide (this figure is already outdated); more than 9,500 registered CERN "users". Europe: 267 institutes, 4,603 users. Other regions: 208 institutes, 1,632 users.
Flow to the CERN Computer Center (diagram: multiple 10 Gbit links into the center).
Flow out of the center: a total of 1.5 GByte/s is required.
LHC Computing Grid project (LCG) • Dedicated 10 Gbit links between T0 & T1s • Tier-0: • Data acquisition & initial processing • Long-term data curation • Distribution of data to Tier-1 centres • Tier-1 (11): • Managed Grid mass storage • Data-heavy analysis • National, regional support • Tier-2: ~200 in ~35 countries • Simulation • End-user analysis – batch and interactive
LHC Data Analysis: HEP code key characteristics • modest memory requirements: ~2 GB/job • performs well on PCs • independent events, so trivial parallelism • large data collections (TB to PB) • shared by very large user collaborations. For all four experiments: • ~15 PetaBytes per year • ~200K processor cores • > 6,000 scientists & engineers
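Because events are statistically independent, analysis parallelises trivially: each worker processes its own events and the results are simply merged. A minimal sketch, with a made-up process_event function standing in for real reconstruction or analysis code:

    from multiprocessing import Pool

    def process_event(event):
        # Placeholder for real reconstruction/analysis of one event;
        # events do not depend on each other, so no coordination is needed.
        return sum(event) % 2  # e.g. a trivial "selection" decision

    if __name__ == "__main__":
        events = [[i, i + 1, i + 2] for i in range(100_000)]  # fake event data
        with Pool() as pool:                  # one worker per available core
            results = pool.map(process_event, events, chunksize=1000)
        print(f"selected {sum(results)} of {len(results)} events")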
From LHC computing at CERN to multi-science grids • 1999: MONARC project • First LHC computing architecture, a hierarchical distributed model • 2000: growing interest in grid technology • HEP community main driver in launching the DataGrid project • 2001-2004: EU DataGrid project • middleware & testbed for an operational grid • 2002-2005: LHC Computing Grid (LCG) • deploying the results of DataGrid to provide a production facility for LHC experiments • 2004-2006: EU EGEE project phase 1 • starts from the LCG grid • shared production infrastructure • expanding to other communities and sciences • 2006-2008: EU EGEE project phase 2 • expanding to other communities and sciences • scale and stability • interoperation/interoperability • 2008-2010: EU EGEE project phase 3 • more communities • efficient operations • less central coordination
WLCG Collaboration • The Collaboration • 4 LHC experiments • ~250 computing centres • 12 large centres (Tier-0, Tier-1) • 38 federations of smaller “Tier-2” centres • Growing to ~40 countries • Grids: EGEE, OSG, Nordugrid (NDGF) • Technical Design Reports • WLCG, 4 Experiments: June 2005 • Memorandum of Understanding • Agreed in October 2005 • Resources • 5-year forward look • Relies on EGEE and OSG • and other regional efforts like NDGF
The EGEE project • EGEE • Started in April 2004, with 91 partners in 35 countries • Now in its 3rd phase (2008-2010) • Objectives • Large-scale, production-quality grid infrastructure for e-Science • Attracting new resources and users from industry as well as science • Maintain and further improve the "gLite" Grid middleware
Registered Collaborating Projects • Infrastructures: geographical or thematic coverage • Support Actions: key complementary functions • Applications: improved services for academia, industry and the public • 25 projects had registered as of September 2007 (see web page)
Collaborating infrastructures
Virtual Organizations • Total VOs: 204 • Registered VOs: 116 • Median sites per VO: 3 • Total users: 5,034 • Affected people: 10,200 • Median members per VO: 18
Archeology • Astronomy • Astrophysics • Civil Protection • Comp. Chemistry • Earth Sciences • Finance • Fusion • Geophysics • High Energy Physics • Life Sciences • Multimedia • Material Sciences • … Infrastructure: >250 sites, 48 countries, >50,000 CPUs, >20 PetaBytes, >10,000 users, >200 VOs, >550,000 jobs/day
For more information: www.opensciencegrid.org www.eu-egee.org www.cern.ch/lcg www.gridcafe.org www.eu-egi.org/
Grid Activity: CPU hours. 380 million kSpecInt2000 hours in 2008; 26% non-LHC.
Grid Activity: jobs. 600K jobs/day; 150 million jobs in 2008; 13% non-LHC.
CPU Contributions: > 85% of CPU usage is external to CERN; the distribution between T1s and T2s is task dependent.
Data Transfers (diagram): 2.8 GByte/s between CERN and the sites.
Site-to-site Data Transfers (CMS): many, many sites.
Grid Computing at CERN • Core grid infrastructure services (~300 nodes) • CA, VOMS servers, monitoring hosts, information system, testbeds • Grid catalogues • Using ORACLE clusters as backend DB • 20+ instances • Workload management nodes • 20+ WMS (different flavours, not all fully loaded) • 15+ CEs (for headroom) • Worker nodes • LSF-managed cluster • 16,000 cores, currently adding 12,000 cores (2 GB/core) • Node disks are used only as scratch space and for OS installation • Extensive use of fabric management • Quattor for installation and configuration, Lemon + LEAF for fabric monitoring
Grid Computing at CERN • Storage (CASTOR-2) • Disk caches: 5 PByte (20k disks); since mid 2008 an additional 12k disks (16 PB) • Linux boxes with RAID disks • Tape storage: 18 PB (~30k cartridges, 700 GB/cartridge) • We have to add 10 PB this year (the robots can be extended) • Why tapes? • Still 3 times lower system costs • Long-term stability is well understood • The gap is closing • Networking • T0 -> T1: dedicated 10 Gbit links • CIXP Internet exchange point for links to T2s • Internal: 10 Gbit infrastructure
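A quick consistency check of the tape figures quoted above, using only the slide's own numbers (the small gap to the quoted ~30k cartridges presumably comes from older, smaller cartridges, which is an assumption here):

    # Arithmetic check of the tape figures (18 PB total, ~30k cartridges, 700 GB each).
    capacity_pb = 18
    cartridge_gb = 700

    cartridges_needed = capacity_pb * 1e6 / cartridge_gb   # 1 PB = 1e6 GB
    print(f"~{cartridges_needed:,.0f} cartridges for 18 PB")          # ~25,700, consistent with ~30k

    # Adding 10 PB at 700 GB/cartridge means roughly another 14,300 cartridges.
    print(f"~{10 * 1e6 / cartridge_gb:,.0f} additional cartridges for 10 PB")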
www.glite.org
gLite Middleware Distribution • Combines components from different providers • Condor and Globus (via VDT) • LCG • EDG/EGEE • Others • After prototyping phases in 2004 and 2005, convergence with the LCG-2 distribution was reached in May 2006 • gLite 3.0 • gLite 3.1 (2007) • Focus on providing a deployable middleware distribution for the EGEE production service
gLite Services gLite offers a range of services
Middleware structure • Applications have access both to Higher-Level Grid Services and to Foundation Grid Middleware • Higher-Level Grid Services are meant to help users build their computing infrastructure but should not be mandatory • Foundation Grid Middleware will be deployed on the EGEE infrastructure • Must be complete and robust • Should allow interoperation with other major grid infrastructures • Should not assume the use of Higher-Level Grid Services • Higher-Level Grid Services: Workload Management, Replica Management, Visualization, Workflow, Grid Economies, ... • Foundation Grid Middleware: security model and infrastructure, Computing (CE) and Storage Elements (SE), Accounting, Information and Monitoring • Overview paper: http://doc.cern.ch//archive/electronic/egee/tr/egee-tr-2006-001.pdf
GIN and Standards • EGEE needs to interoperate with other infrastructures, to give users access to resources available on collaborating infrastructures • The best solution is common interfaces, through the development and adoption of standards • The gLite reference forum for standardization activities is the Open Grid Forum • Many contributions (e.g. OGSA-AUTH, BES, JSDL, new GLUE-WG, UR, RUS, SAGA, INFOD, NM, …) • Problems: infrastructures are already in production; standards are still evolving and often underspecified • OGF-GIN follows a pragmatic approach: a balance between application needs and technology push
gLite code base
gLite code details
gLite code details (chart of component sizes, 1K-10K scale)
gLite code details. The list is not complete; some components are provided as binaries and are only packaged by the ETICS system.
Complex Dependencies
Data Management
Component-based software life cycle process: weekly releases.
The process is monitored to spot problems and to manage resources.
Change Management: changes arrive at an almost constant rate of about 50 bugs/week and ~40 patches/month; this is a challenge.
Some Middleware components www.glite.org
Authentication • gLite authentication is based on X.509 PKI • Certificate Authorities (CAs) issue (long-lived) certificates identifying individuals (much like a passport) • Commonly used in web browsers to authenticate to sites • Trust between CAs and sites is established (offline) • To reduce vulnerability, user identification on the Grid is done using (short-lived) proxies of their certificates • Support for Short-Lived Credential Services (SLCS) • issue short-lived certificates or proxies to their local users • e.g. from Kerberos or from Shibboleth credentials (new in EGEE-II) • Proxies can • be delegated to a service so that it can act on the user's behalf • be stored in an external proxy store (MyProxy) • be renewed (in case they are about to expire) • include additional attributes
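As an illustration of why short lifetimes limit exposure, the sketch below inspects a PEM-encoded certificate and reports how long it remains valid. It uses the generic `cryptography` Python package, not any gLite tool, and assumes a file containing a single PEM certificate; the path is a placeholder (checking a real VOMS proxy would first require extracting its leading certificate block).

    # Minimal sketch: report the remaining lifetime of a PEM-encoded certificate.
    # Uses the generic 'cryptography' package, not gLite-specific libraries.
    from datetime import datetime
    from cryptography import x509

    CERT_PATH = "usercert.pem"   # placeholder: a single PEM certificate

    with open(CERT_PATH, "rb") as f:
        cert = x509.load_pem_x509_certificate(f.read())

    remaining = cert.not_valid_after - datetime.utcnow()
    print(f"subject : {cert.subject.rfc4514_string()}")
    print(f"expires : {cert.not_valid_after.isoformat()}")
    print(f"left    : {remaining}")
    if remaining.total_seconds() < 3600:
        print("credential has less than one hour left; renew it before submitting jobs")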
Authorization • VOMS is now a de-facto standard • Attribute Certificates provide users with additional capabilities defined by the VO. • Basis for the authorization process • Authorization: currently via mapping to a local user on the resource • glexec changes the local identity (based on suexec from Apache) • Designing an authorization service with a common interface agreed with multiple partners • Uniform implementation of authorization in gLite services • Easier interoperability with other infrastructures • Prototype being prepared now
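To illustrate the mapping step, here is a toy sketch of how a VOMS attribute (an FQAN such as /atlas/Role=production) might be mapped to a local account or pool-account prefix. The mapping table and account names are invented for the example; real sites do this with LCAS/LCMAPS or glexec, not with code like this.

    # Toy illustration of FQAN -> local account mapping (not the real LCMAPS logic).
    POOL_MAP = {                       # invented mapping table for the example
        "/atlas/Role=production": "atlasprd",
        "/atlas":                 "atlas",   # pool prefix, e.g. atlas001..atlas050
        "/dteam":                 "dteam",
    }

    def map_fqan(fqan: str) -> str:
        """Return the local account (or pool prefix) for the most specific match."""
        for pattern in sorted(POOL_MAP, key=len, reverse=True):  # longest match first
            if fqan == pattern or fqan.startswith(pattern + "/"):
                return POOL_MAP[pattern]
        raise PermissionError(f"no mapping for {fqan!r}: authorization denied")

    print(map_fqan("/atlas/Role=production"))   # -> atlasprd
    print(map_fqan("/atlas/higgs"))             # -> atlas (pool account)
    print(map_fqan("/dteam"))                   # -> dteam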
Common AuthZ interface (diagram): on both EGEE and OSG, services such as CREAM, the pre-WS and GT4 gatekeepers, gridftp, opensshd, edg-gk, dCache, glexec and pilot jobs on the worker node call a common SAML-XACML library and interface; on the EGEE side this talks to site-central LCAS + LCMAPS (with L&L plug-ins and GPBox), on the OSG side to Prima + gPlazma and site-central GUMS (+ SAZ). Example SAML-XACML exchange: Q: map user to some pool; R: obligation: user001, somegrp, <other obligations>.
Information System • The information system is used for: • Service discovery (what kind of services are around) • Service state monitoring (up/down, resource utilization) • It is the nervous system of LCG • Availability and scalability are the key issues • gLite uses the GLUE schema (version 1.3) • abstract modelling of Grid resources and mapping to concrete schemas that can be used in Grid information services • The definition of this schema started in April 2002 as a collaboration between the EU DataTAG and US iVDGL projects • The GLUE schema is now an official activity of OGF • Starting points for GLUE 2.0 are the GLUE schema 1.3, the NorduGrid schema and CIM (used by NAREGI) • GLUE 2.0 has been standardized and will be introduced during the next year(s)
Information System Architecture (diagram): information providers feed resource BDIIs (>260 × 5), which are aggregated by site BDIIs (>260) and then by the top-level BDIIs, reached through DNS round-robin aliases (one alias, many (~80) instances) with FCR (Freedom of Choice for Resources) applied; clients query the top-level BDIIs.
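The BDIIs are ordinary LDAP servers publishing GLUE 1.3 data, so service discovery boils down to an LDAP search. A minimal sketch using the ldap3 Python package; the top-level BDII host name is a placeholder for whatever alias your infrastructure publishes, while port 2170 and base "o=grid" follow the usual BDII convention.

    # Minimal service-discovery sketch: list computing elements and their free CPUs
    # from a top-level BDII. Host name is a placeholder; 2170/o=grid is the BDII convention.
    from ldap3 import Server, Connection, ALL

    server = Server("ldap://top-bdii.example.org", port=2170, get_info=ALL)
    conn = Connection(server, auto_bind=True)   # BDIIs allow anonymous binds

    conn.search(
        search_base="o=grid",
        search_filter="(objectClass=GlueCE)",
        attributes=["GlueCEUniqueID", "GlueCEStateFreeCPUs"],
    )

    for entry in conn.entries:
        print(entry.GlueCEUniqueID, entry.GlueCEStateFreeCPUs)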
Performance Improvements (chart; note the logarithmic scale).
EGEE/LCG Data Management (diagram of the layered stack): VO frameworks and user tools on top; a data-management layer with lcg_utils and FTS; below it cataloging (LFC, formerly RLS), storage (SRM, Classic SE with vendor-specific APIs) and data transfer (GFAL, gridftp, RFIO); the information system and environment variables tie the layers together.
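As a usage illustration, the lcg_utils command-line tools hide this stack: a copy by logical file name is resolved through the LFC catalogue to an SRM storage URL and the data is then transferred, typically via GridFTP. A hedged sketch driving the CLI from Python; the VO name and LFN are placeholders, and the exact lcg-cp options may differ between releases.

    # Hedged sketch: copy a file by its logical name (LFN) to local disk using lcg-cp.
    # The VO and LFN below are placeholders; option names may vary between releases.
    import subprocess

    vo = "dteam"                                   # placeholder VO
    lfn = "lfn:/grid/dteam/some/test-file.txt"     # placeholder logical file name
    dest = "file:///tmp/test-file.txt"

    # lcg-cp consults the LFC to resolve the LFN to an SRM SURL, then transfers
    # the data (typically via GridFTP) to the local destination.
    subprocess.run(["lcg-cp", "-v", "--vo", vo, lfn, dest], check=True)
    print("copy finished")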