390 likes | 499 Views
eScience – Grid Computing Graduate Lecture. 2 nd March 2011 Robin Middleton – PPD/RAL/STFC. I am indebted to the EGEE, EGI, LCG and GridPP projects and to colleagues therein for much of the material presented here. eScience Graduate Lecture.
E N D
eScience – Grid ComputingGraduate Lecture 2nd March 2011 Robin Middleton – PPD/RAL/STFC I am indebted to the EGEE, EGI, LCG and GridPP projects and to colleagues therein for much of the material presented here.
eScience Graduate Lecture A high-level look at some aspects of computing for particle physics today • What is eScience, what is the Grid ? • Essential grid components • Grids in HEP • The wider picture • Summary
What is eScience ? • …also : e-Infrastructure, cyberinfrastructure, e-Research, … • Includes • grid computing (e.g. WLCG, EGEE, EGI, OSG, TeraGrid, NGS…) • computationally and/or data intensive; highly distributed over wide area • digital curation • digital libraries • collaborative tools (e.g. Access Grid) • …other areas • Most UK Research Councils active in e-Science • BBSRC • NERC (e.g. climate studies, NERC DataGrid) • ESRC (e.g. NCeSS • AHRC (e.g. studies in collaborative performing arts) • EPSRC (e.g. eMinerals, MyGrid, …) • STFC (e.g. GridPP)
eScience – year ~2000 • Professor Sir John Taylor, former (1999-2003) Director General of the UK Research Councils, defined eScience thus: • science increasingly done through distributed global collaborations enabled by the internet, using very large data collections, terascale computing resources and high performance visualisation’. • Also quotes from Professor Taylor… • ‘e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it.’ • ‘e-Science will change the dynamic of the way science is undertaken.’
What is Grid Computing ? • Grid Computing • term invented in 1990s as metaphor for making computer power as easy to access as the electric power grid(Foster & Kesselman - "The Grid: Blueprint for a new computing infrastructure“) • combines computing resources from multiple administrative domains • CPU and storage…loosely coupled • serve the needs of one of more virtual organisations (e.g. LHC experiments) • different from Cloud Computing, Volunteer Computing (SETI@home)
Essential Grid Components • Middleware • Information System • Workload Management; Portals • Data Management • File transfer • File catalogue • Security • Virtual Organisations • Authentication • Authorisation • Accounting
Information System • At the heart of the Grid • Hierarchy of BDII (LDAP) servers • GLUE information schema • LDAP (Lightweight Directory Access Protocol) • tree structure • DN: Distinguished Name o = grid (root of the DIT) c= US c=UK c=Spain st = Chilton or = STFC ou = PPD ou = ESC
Workload Management • For example - composed of the following parts: • User Interface (UI) : access point for the user to the WMS • Resource Broker (RB) : the broker of GRID resources, responsible to find the “best” resources where to submit jobs • Job Submission Service (JSS) : provides a reliable submission system • Information Index (BDII) : a server (based on LDAP) which collects information about Grid resources – used by the Resource Broker to rank and select resources • Logging and Bookkeeping services (LB) : store Job Info available for users to query • However, you are much more likely to use a portal to submit work… • Executable = “gridTest”; • StdError = “stderr.log”; • StdOutput = “stdout.log”; • InputSandbox = {“/home/robin/test/gridTest”}; • OutputSandbox = {“stderr.log”, “stdout.log”}; • InputData = “lfn:testbed0-00019”; • DataAccessProtocol = “gridftp”; • Requirements = other.Architecture==“INTEL” && \ other.OpSys==“LINUX” && other.FreeCpus >=4; • Rank = “other.GlueHostBenchmarkSF00”; Example JDL
Job Definition & Management • Implemented in Python • Extensible – plug-ins • Used ATLAS, LHCb & non-HEP • http://ganga.web.cern.ch/ganga/index.php Portals - Ganga
Data Management • Storage Element (SE) • >1 implementation; all are accessed through SRM (Storage Resource Manager) • DPM – Disk Pool Manager (disk only) • secure: authentication via GSI, authorisation via VOMS • full POSIX ACL support with DN (userid) and VOMS groups • disk pool management (direct socket interface) • storage name space (aka. storage file catalog) • DPM can act as a site local replica catalog • SRMv1, SRMv2.1 and SRMv2.2 • gridFTP, rfio • dCache (disk & tape) – developed at DESY • ENSTORE – developed at Fermilab • CASTOR – devloped at CERN • Cern Advanced STORage manager • HSM – Hierarchical Storage Manager • disk cache & tape
File Transfer Service • File Transfer Service is a data movement fabric service • multi-VO service, balance usage of site resources according to VO and site policies • uses SRM and gridFTP services of an Storage Element (SE) • Why is it needed ? • For the user, the service it provides is the reliable point to point movement of Storage URLs (SURLs) among Storage Elements • For the site manager, it provides a reliable and manageable way of serving file movement requests from their VOs • For the VO manager, it provides ability to control requests coming from users(re-ordering, prioritization,...)
File Catalogue • LFC – LHC File Catalogue - a file location service • Glossary • LFN = Logical File Name; GUID = Global Unique ID; SURL = Storage URL • Provides a mapping from one or more LFN to the physical location of file • Authentication & authorisation is via a grid certificate • Provides very limited metadata – size, checksum • Experiments usually have a metadata catalogue layered above LFC • e.g. AMI – ATLAS Metadata Interface
Grid Security • Based around X.509 certificates – Public Key Infrastructure (PKI) • issued by Certificate Authorities • forms a hierarchy of trust • Glossary • CA – Certificate Authority • RA – Registration Authority • VA – Validation Authority • How it Works… • User applies for certificate with public key at a RA • RA confirms user's identity to CA which in turn issues the certificate • User can then digitally sign a contract using the new certificate • User identity is checked by the contracting party with VA • VA receives information about issued certificates by CA
Virtual Organisations • Group of individuals sharing use of (distributed) resources to a common end under an agreed set of policies • a semi-informal structure orthogonal to normal institutional allegiances • e.g. A HEP Experiment • Grid Policies • Acceptable use; Grid Security; New VO registration; • http://proj-lcg-security.web.cern.ch/proj-lcg-security/security_policy.html • VO specific environment • experiment libraries, databases,… • resource sites declare which VOsit will support
The Three As • Authentication • verifying that you are who you say you are • your Grid Certificate is your “passport” • Authorisation • knowing who you are, validating what you are permitted to do • e.g. submit analysis jobs as a member of LHCb • e.g. VO capability to manage production software • Accounting (auditing) • local logging what you have done – your jobs ! • aggregated into grid-wide respository • provides • usage statistics • information source in event of security incident
Grids in HEP • LCG; EGEE & EGI Projects • GridPP • The LHC Computing Grid • Tiers 0,1,2 • The LHC OPN • Experiment Computing Models • Typical data access patterns • Monitoring • Resource providers view • VO view • End-user view
LCGEGEE->EGI • LCG LHC Computing Grid • Distributed Production Environment for Physics Data Processing • World’s largest production computing grid • In 2011 : >250,000 CPU cores, 15PB/Yr, 8000 physicist, ~500 institutes • EGEE Enabling Grids for E-sciencE • Starts from LCG infrastructure • Production Grid in 27 countries • HEP, BioMed, CompChem, • Earth Science, … • EU Support
Integrated within the LCG/EGI framework UK Service Operations (LCG/EGI) Tier-1 & Tier-2s HEP Experiments @ LHC, FNAL, SLAC GANGA (LHCb & ATLAS) Working with NGS informing the UK NGI for EGI • Phase 1 : 2001-2004 • Prototype (Tier-1) • Phase 2 : 2004-2008 • “From Prototype to Production” • Production (Tier-1&2) • Phase 3 : 2008-2011 • “From Production to Exploitation” • Reconstruction, Monte Carlo, Analysis • Phase 4 : 2011-2014… • routine operation during LHC running GridPP Tier-1 Farm Usage
LCG – The LHC Computing Grid • Worldwide LHC Computing Grid - http://lcg.web.cern.ch/lcg/ • Framework to deliver distributed computing forLHC experiments • Middleware / Deployment • (Service/Data Challenges) • Security (operations & policy) • Applications (Experiment) Software • Distributed Analysis • Private Optical Network • Experiments Resources MoUs • Coverage • Europe EGI • USA OSG • Asia Naregi, Taipei,China… • Other…
Lab m Uni x regional group CERN Tier 1 Uni a UK USA Lab a France Tier 1 Tier3 (physics department) Uni n Tier2 ………. Italy Desktop Lab b Germany ………. Lab c Uni y Uni b physics group LHC Computing Model The LHC Computing Centre CERN Tier 0
LHCOPN – Optical Private Network • Principle means to distribute LHC data • Primarily linking Tier-0 and Tier-1s • Some Tier-1 to Tier-1 Traffic • Runs over leased lines • Some resilience • Mostly based on10 Gigabit technology • Reflects Tierarchitecture
LHC Experiment Computing Models • General (ignoring experiment specifics) • Tier-0 (@CERN) • 1st pass reconstruction (including initial calibration) • RAW data storage • Tier-1 • Re-processing; some centrally organised analysis; • Custodial copy of RAW data, some ESD, all AOD, some SIMU • Tier-2 • (chaotic) user analysis; simulation • some AOD (depends on local requirements) • Event sizes disk buffers at experiments & Tier-0 • Event formats (RAW, ESD, AOD, etc); placement (near analysis); replicas • Data streams – physics specific, debug, diagnostic, express, calibration • CPU & storage requirements • Simulation
Typical Data Access Patterns Typical LHC particle physics experiment One year of acquisition and analysis of data Access Rates (aggregate, average) 100 Mbytes/s (2-5 physicists) 500 Mbytes/s (5-10 physicists) 1000 Mbytes/s (~50 physicists) 2000 Mbytes/s (~150 physicists) Raw Data ~1000 Tbytes Reco-V1 ~1000 Tbytes Reco-V2 ~1000 Tbytes ESD-V1.1 ~100 Tbytes ESD-V1.2 ~100 Tbytes ESD-V2.1 ~100 Tbytes ESD-V2.2 ~100 Tbytes AOD ~10 TB AOD ~10 TB AOD ~10 TB AOD ~10 TB AOD ~10 TB AOD ~10 TB AOD ~10 TB AOD ~10 TB AOD ~10 TB
Monitoring • A resource provider’s view
Monitoring • Virtual Organisation specifics
Monitoring • Virtual Organisation view • e.g. ATLAS dashboard
Monitoring • For the end user • available through dashboard
The wider Picture • What some other communities do with Grids • The ESFRI projects • Virtual Instruments • Digital Curation • Clouds • Volunteer Computing • Virtualisation
What are other communities doing with grids ? • Astronomy & Astrophysics • large-scale data acquisition, simulation, data storage/retrieval • Computational Chemistry • use of software packages (incl. commercial) on EGEE • Earth Sciences • Seismology, Atmospheric modeling, Meteorology, Flood forecasting, Pollution • Fusion (build up to ITER) • Ion Kinetic Transport, Massive Ray Tracing, Stellarator Optimization. • Computer Science • collect data on Grid behaviour (Grid Observatory) • High Energy Physics • four LHC experiments, BaBar, D0, CDF, Lattice QCD, Geant4, SixTrack, … • Life Sciences • Medical Imaging, Bioinformatics, Drug discovery • WISDOM – drug discovery for neglected / emergent diseases(malaria, H5N1, …)
ESFRI Projects(European Strategy Forum on Research Infrastructures) • Many are starting to look at their e-Science needs • some at a similar scale to the LHC (petascale) • project design study stage • http://cordis.europa.eu/esfri/ Cherenkov Telescope Array
Virtual Instruments • Integration of scientific instruments into the Grid • remote operation, monitoring, scheduling, sharing… • GridCC - Grid enabled Remote Instrumentation with Distributed Control and Computation CR: build workflows to monitor & control remote instruments in real-time CE, SE, ES , IS & SS: as in a “normal” grid Monitoring services Instrument Element (IE) - interfaces for remote control & monitoring • CMS run control includes an IE…but notreally exploited (yet) ! • DORII – Deployment Of Remote Instrumentation Infrastructure • Consolidation of GridCC with EGEE, g-Eclipse,Open MPI, VLab • The Liverpool Telescope - robotic • not just remote control, but fully autonomous • scheduler operates on basis of observingdatabase • (http://telescope.livjm.ac.uk/)
Digital Curation • Preservation of digital research data for future use • Issues • media; data formats; metadata; data management tools; reading (FORTRAN); ... • digital curation lifecycle - http://www.dcc.ac.uk/digital-curation/what-digital-curation • Digital Curation Centre - http://www.dcc.ac.uk/ • NOT a repository ! • strategic leadership • influence national (international) policy • expert advice for both users and funders • maintains suite of resources and tools • raise levels of awareness and expertise
JADE (1978-86) • New results from old data • new & improved theoretical calculations & MC models; optimised observables • better understanding of Standard Model (top, W, Z) • re-do measurements – better precision, better systematics • new measurements, but at (lower) energies not available today • new phenomena – check at lower energies • Challenges • rescue data from (very) old media; resurrect old software; data management; implement modern analysis techniques • but, luminosity files lost – recovered from ASCII printout in an office cleanup • Since 1996 • ~10 publications (as recent as 2009) • ~10 conference contributions • a few PhD Theses (ack S.Bethke)
What is HEP doing about it ? • ICFA Study Group on Data Preservation and Long Term Analysis in High Energy Physicshttps://www.dphep.org/ • 4 Workshops so far intermediate report to ICFA • 5th Workshop at Fermilab in May 2011 • Initial recommendations December 2009 • “Blueprint for Data Preservation in High Energy Physics” to follow
(Ack: Bob Jones – former EGEE Project Director) Grids, Clouds, Supercomputers, … • Grids • Collaborative environment • Distributed resources (political/sociological) • Commodity hardware (also supercomputers) • (HEP) data management • Complex interfaces (bug not feature) • Supercomputers • Expensive • Low latency interconnects • Applications peer reviewed • Parallel/coupled applications • Traditional interfaces (login) • Also SC grids (DEISA, Teragrid) • Clouds • Proprietary (implementation) • Economies of scale in management • Commodity hardware • Virtualisation for service provision and encapsulating application environment • Details of physical resources hidden • Simple interfaces (too simple?) • Volunteer computing • Simple mechanism to access millions CPUs • Difficult if (much) data involved • Control of environment check • Community building – people involved in Science • Potential for huge amounts of real work Bob Jones - October 2009 36
Clouds / Volunteer Computing • Clouds are largely commercial • Pay for use • Interfaces from grids exist • absorb peak demands(e.g. before a conference !) • CernVM images exist • Volunteer Computing • LHC@Home • SixTrack – study particle orbitstability in accelerators • Garfield – study behaviour of gas-based detectors
Virtualisation • Virtual implementation of a resource – e.g. a hardware platform • a current buzzword, but not new – IBM launched VM/370 in 1972 ! • Hardware virtualisation • one or more virtual machines running an operating system within a host system • e.g. run Linux (guest) on a virtual machine (VM) with Microsoft Windows (host) • independent of hardware platform; migration between (different) platforms • run multiple instances on one box; provides isolation (e.g. against rogue s/w) • Hardware-assisted virtualisation • not all machine instructions are “virtualisable” (e.g. some privileged instructions) • h/w-assist traps such instructions and provides hardware emulation of them • Implementations • Zen, VMware, VirtualBox, Microsoft Virtual PC, … • Interest to HEP ? • the above + opportunity to tailor to experiment needs (e.g. libraries, environment) • CernVM – CERN specific Linux environment - http://cernvm.cern.ch/portal/ • CernVM-FS – network filesystem to access experiment specific software • Security – certificate to assure origin/validity of VM
Summary • What is eScience about and what are Grids • Essential components of a Grid • middleware • virtual organisations • Grids in HEP • LHC Computing GRID • A look outside HEP • examples of what others are doing