The EGEE Production Grid

Presentation Transcript


  1. The EGEE Production Grid Dr. Ian Bird EGEE Grid Operations & Management Leader IT Department, CERN

  2. EGEE • Flagship grid infrastructure project co-funded by the European Commission • Now in 2nd phase with 91 partners in 32 countries Objectives • Large-scale, production-quality grid infrastructure for e-Science • Attracting new resources and users from industry as well as science • Maintain and further improve gLite Grid middleware Ian Bird - OGF/EGEE User Forum - May 9th 2007

  3. Outline EGEE infrastructure & services • How we got to this point • Overview of services • Status • Middleware • Training etc. Applications • Some key successes Interoperation/interoperability • … and related projects EGEE and standards … Open issues What next?

  4. Evolution of production grid • Starts from LCG - Shared production infrastructure - Extended production service to other applications - Growth from 40 to 190 sites • Timeline labels (2001, 2002, 2004, 2006): Globus, Condor; Middleware & test-beds for an operational grid; Continued expansion of resources and applications communities; Deploying results of EDG to provide 1st production service for LHC

  5. Applications • Many applications from a growing number of domains • Astrophysics • Computational Chemistry • Earth Sciences • Financial Simulation • Fusion • Geophysics • High Energy Physics • Life Sciences • Multimedia • Material Sciences • … ~200 Virtual Organisations Applications list: https://edms.cern.ch/file/722132/3/EGEE-II-DNA4.2.1-722132-v2.5-1.pdf

  6. Test-beds & Services Operations Coordination Centre Production Service Pre-production service Regional Operations Centres Certification test-beds Global Grid User Support EGEE Network Operations Centre Operational Security Coordination Team Security & Policy Groups Joint Security Policy Group EuGridPMA (& IGTF) Grid Security Vulnerability Group Operations Advisory Group The EGEE Infrastructure Support Structures & Processes Training activities Training infrastructure

  7. Growth

  8. CPU, countries, sites • 35000 CPU • 45 countries (31 partner countries) • 237 sites (131 partner sites)

  9. Workload (chart labels: 98000 jobs/day; 13000 jobs/day)

  10. CPU time delivered (chart labels: 14000 CPU-month/month; 3600 CPU-month ~ 1/3 of total)

  11. Overall load • 19.6 million jobs run in 1st year of EGEE-II • 56000 per day sustained average • Peak of 98000 • Non-LHC 13500/day • Level of total in EGEE in 2005 • 8400 CPU-years delivered in 1 year • ~1/3 of total available sustained over the year • Peak of 50% of available in Feb ’07 • ~1/3 of total was non-LHC in Dec ’06
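
The load figures on this slide can be cross-checked with simple arithmetic. The sketch below uses only numbers quoted on slides 8 and 11. The 365-day year, the function names and the idea of back-computing an "implied" average capacity are assumptions made for illustration; the slides do not say how their own averages were computed.

```python
# Figures quoted on slides 8 and 11.
JOBS_IN_YEAR = 19.6e6        # jobs run in the 1st year of EGEE-II
CPU_YEARS_DELIVERED = 8400   # CPU-years delivered in that year
CPUS_AT_TALK = 35000         # CPUs reported at the time of the talk

DAYS_PER_YEAR = 365          # assumption: plain calendar year


def mean_jobs_per_day(jobs, days=DAYS_PER_YEAR):
    """Calendar-year mean of the daily job count."""
    return jobs / days


def implied_capacity(cpu_years_delivered, utilisation):
    """Average number of CPUs that would have to be available for the
    delivered CPU-years (over one year) to match the given utilisation."""
    return cpu_years_delivered / utilisation


if __name__ == "__main__":
    print(round(mean_jobs_per_day(JOBS_IN_YEAR)))              # jobs/day
    print(round(implied_capacity(CPU_YEARS_DELIVERED, 1 / 3)))  # CPUs
    print(round(CPU_YEARS_DELIVERED / CPUS_AT_TALK, 2))         # fraction
```

Under these assumptions the calendar-year mean comes out a little below the quoted 56000 jobs/day, which suggests the slide averaged over a different window. The quoted "~1/3 of total available" corresponds to an average of roughly 25000 available CPUs over the year, which would be consistent with capacity still growing towards the 35000 CPUs reported in May 2007.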

  12. Grid Middleware • Higher-Level Grid Services • Additional functionality • Foundation Grid Middleware • Robustness • Coexistence • Interoperability Applications Higher-Level Grid Services Workload Management Replica Management Visualization Workflow Grid Economies ... Foundation Grid Middleware Security model and infrastructure Computing (CE) and Storage Elements (SE) Accounting Information and Monitoring

  13. gLite Grid Middleware Services Access CLI API Security Information & Monitoring Authorization Auditing Information & Monitoring Application Monitoring Authentication Data Management Workload Management Metadata Catalog File & Replica Catalog Job Provenance Package Manager Accounting Storage Element Data Movement Computing Element Workload Management Site Proxy Overview paper: http://doc.cern.ch//archive/electronic/egee/tr/egee-tr-2006-001.pdf

  14. Middleware and Certification • Test-beds • Virtual test-beds for individual testers (~5) • Dynamically allocated test nodes (> 50 nodes) • Central certification test-bed • Distributed test-beds for specific functions • The goal is to produce a middleware distribution that can be deployed widely • Certification testing: • Installation and configuration • Component (service) functionality • System testing (trying to emulate real workloads and stress testing)

  15. Pre-production service • Pre-production service is now ~27 sites in 16 countries • Provides access to some 3000 CPU • Some sites allow access to their full production batch systems for scale tests • Sites install and test different configurations and sets of services • Services may be initially demonstrated in this environment • Before further development • New VOs: adapt their applications & gain experience • (e.g. DILIGENT)

  16. Grid Management Structure • Regional Operations Centres • Core support infrastructure • Grid User Support (GGUS) • Coordination, management of user support • EGEE Network Operations Centre (ENOC) • Coordination with NRENs & GEANT2 • Operations Coordination Centre • Management, oversight, coordination

  17. Grid Operations Fully distributed – key are the Regional Operations Centres • Many of the ROCs are themselves distributed organizations • Grid Operator on Duty • Weekly rotation of teams • Critical activity in maintaining usability and stability of sites • Important tools • Site Availability Monitoring and Testing (SAM) • Information system monitoring • GGUS system for trouble ticket management • Portal for operations: https://cic.gridops.org Significant work on operations procedures • Evolved throughout EGEE and EGEE-II • Contribute to establishment of regional grid infrastructures through related projects – well beyond Europe now

  18. User Support • Chart: number of tickets processed (Operations, Network, User, All) GGUS – now well established • Use continues to grow • Most ROCs provide dedicated effort to manage the process – similar to operator on duty teams • Setting up user support advisory groups to steer the priorities GGUS tool used for all support activities • Interlinks many local ticketing systems

  19. Policy & Security • Grid PMAs: APGridPMA (Asia-Pacific Grid PMA), EUGridPMA (European Grid PMA), TAGPMA (The Americas Grid PMA) Joint Security Policy Group (JSPG) • Produces and maintains security policy and procedures • for EGEE, OSG, NDGF, WLCG, and other EU Grid infrastructures • Achieved common policy between EGEE and OSG (for interoperation) • New Grid Site Operations Policy & Updated top-level Security Policy • Grid User AUP accepted by eIRG as good approach • Current work • New policy addressing User-level Accounting (data privacy issues) • New policy on VO and Grid service responsibilities Operational Security Coordination Team (OSCT) focuses on: • Incident Response & improvement • Security Monitoring • Best practice for system managers • Pan-regional security coordination Grid Security Vulnerability Group • New group analyzing potential vulnerabilities

  20. Grid Monitoring • System Management: Fabric management, Best Practices, Security, … • Grid Services: Grid sensors, Transport, Repositories, Views, … • System Analysis: Application monitoring, … • “… To help improve the reliability of the grid infrastructure …” • “… provide stakeholders with views of the infrastructure allowing them to understand the current and historical status of the service …” • “… to gain understanding of application failures in the grid environment and to provide an application view of the state of the infrastructure …” • “… improving system management practices”: • Provide site manager input to requirements on grid monitoring and management tools • Propose existing tools to the grid monitoring working group • Produce a Grid Site Fabric Management cook-book • Identify training needs Becoming a critical activity to achieve reliability and stability

  21. Monitoring Important to have standard solutions for: • Sensors • Repository schema • Interfaces

  22. Experiment Dashboard INPUT: Multiple sources of information • Increasing the reliability • Providing both global and very detailed view Information sources: • Generic Grid Services: Monitoring systems (RGMA, GridIce, SAM, ICRTMDB, MonALISA, BDII, GridView…) • Experiment specific services: Experiment work load management and data management systems • Jobs instrumented to report monitoring information Dashboard functions: • Collect data of VO interest coming from various sources • Store it in a single location • Provide UI following VO requirements • Analyze collected statistics • Define alarm conditions OUTPUT: Providing output in various formats (Web pages, XML, CSV, image formats) • Can be used by various clients, both users and applications • Clients: PANDA, ATLAS production • Potentially other Can satisfy VO users with various roles: • Generic user running his jobs on the Grid • Site administrator • VO manager, production or analysis group coordinator, data transfer coordinator… This will be shown in the demo session
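
The dashboard functions listed on this slide (collect data from several sources, store it in one place, analyse it, define alarm conditions) can be illustrated with a minimal sketch. Everything in it is hypothetical: the class and field names, the in-memory list used as the "single location", and the 80% success-rate threshold are assumptions for illustration and do not describe the actual Experiment Dashboard implementation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class JobRecord:
    """One monitoring record; all field names are hypothetical."""
    source: str      # label of the reporting system, e.g. "SAM"
    site: str
    vo: str
    succeeded: bool


class Dashboard:
    """Toy aggregator: one store, per-site statistics, simple alarms."""

    def __init__(self):
        self._records = []   # the "single location" (in memory here)

    def collect(self, records):
        """Merge records coming from any number of sources."""
        self._records.extend(records)

    def success_rate_by_site(self, vo):
        """Fraction of successful jobs per site for one VO."""
        counts = defaultdict(lambda: [0, 0])   # site -> [ok, total]
        for rec in self._records:
            if rec.vo == vo:
                counts[rec.site][0] += int(rec.succeeded)
                counts[rec.site][1] += 1
        return {site: ok / total for site, (ok, total) in counts.items()}

    def alarms(self, vo, min_success_rate=0.8):
        """Sites whose success rate falls below the (assumed) threshold."""
        rates = self.success_rate_by_site(vo)
        return sorted(s for s, r in rates.items() if r < min_success_rate)


# Example with made-up records from two hypothetical sources.
dash = Dashboard()
dash.collect([JobRecord("SAM", "site-a", "atlas", True)] * 9)
dash.collect([JobRecord("job-report", "site-a", "atlas", False),
              JobRecord("job-report", "site-b", "atlas", True),
              JobRecord("job-report", "site-b", "atlas", False)])
print(dash.alarms("atlas"))
```

Keeping collection separate from analysis mirrors the slide's split between information sources and the views offered to different roles; a real system would replace the in-memory list with a persistent repository.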

  23. Training • Broad range of courses to many disciplines and clients with very different backgrounds • Close relationships with applications and infrastructure activities for provision of material and lecturers • Needs are expanding rapidly with new communities and ‘beginner’ users • 110 events; 1600 participants

  24. Infrastructure for training • GILDA is an effective t-Infrastructure for EGEE and other European projects, providing resources and knowledge for training events • Besides training events, GILDA is available around the clock for grid novices, with dedicated facilities • The GILDA t-Infrastructure is currently supported by 12 sites, managed on a best-effort basis • GILDA is also available for application porting

  25. Interoperability/interoperation Well established with Open Science Grid in U.S. • In production use by CMS – submits work to OSG from EGEE • Weekly operations meetings attended by OSG staff • Processes set up with OSG for operations and user support workflows • OPS VO defined to support joint operations – for testing/monitoring use • Collaboration on monitoring tools and procedures EGEE also working with other grid projects on specific interoperability at the level of middleware: • NAREGI, Unicore, NDGF (ARC) Effort in GIN in several areas key for EGEE Important to have a user community/use case driving this

  26. GIN Worldwide Grid Infrastructures • APAC • DEISA • EGEE • Naregi • NDGF • NGS • OSG • Pragma • Teragrid

  27. Collaborating e-Infrastructures TWGRID Potential for linking ~80 countries

  28. Registered Collaborating Projects • Infrastructures: geographical or thematic coverage • Support Actions: key complementary functions • Applications: improved services for academia, industry and the public • 24 projects have registered as of February 2007 (web page)

  29. Applications on EGEE This is an exciting year for science – LHC, the largest scientific instrument ever built, comes on-line - Grids are key to the success of LHC analysis Multitude of applications from a growing number of domains • Astrophysics • Computational Chemistry • Earth Sciences • Financial Simulation • Fusion • Geophysics • High Energy Physics • Life Sciences • Multimedia • Material Sciences • …

  30. Virtual Organizations • Total VOs: 204 • Registered VOs: 116 • Median sites per VO: 3 • Total Users: 5034 • Affected People: 10200 • Median members per VO: 18

  31. Active VOs • Number of “active” VOs growing with time. • Turnover not shown: not same VOs every week!

  32. Reported Applications • Chart labels: Condensed Matter Physics, Comp. Fluid Dynamics, Computer Science/Tools, Civil Protection • Disciplines: 10 • Sub-disciplines: 36 • See growth and diversification of applications. Reported apps. only

  33. High Energy Physics

  34. User Analysis with Ganga • Used by the ATLAS and LHCb experiments, developed with the contribution of EGEE NA4 • ~550 different users, ~100 users weekly • Usage monitoring started end 2006 • ~60% ATLAS • ~25% LHCb • ~15% others • Chart annotation: Easter

  35. CMS analysis • Users on the grid (April 2007 statistics) • CMS users submitting jobs to Grids via CRAB (developed by CMS) • Over 1,000 jobs/day • Efficiency over 90% • Chart titles: CRAB Jobs @ FNAL (OSG); CRAB Jobs @ CERN (EGEE)

  36. ALICE Grid Access Service • ALICE Grid Access (commands executed) • Slope changes because of optimised access (fewer commands executed to interact with data management)

  37. High Energy Physics Data management: • Demonstrated data transfers at nominal rates: 1.6 GB/s through FTS • 1 GB/s with real (simulated) workloads • 2 large experiments transferred >1 PB/month in summer 2006 Workload management • CMS – computing service challenge achieved 50k jobs/day • CMS aim this year for 100k jobs/day; ATLAS for 60k Reliability and availability • Significant effort to ensure Tier 1 sites meet MoU commitments – using site and service monitoring Grid is now the primary source of computing resources for LCG
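
The transfer figures on this slide are quoted in two different units (GB/s and PB/month). The short conversion sketch below puts them on a common scale. The 30-day month and decimal prefixes (1 PB = 10^6 GB) are assumptions, as are the function names.

```python
SECONDS_PER_DAY = 86_400
DAYS_PER_MONTH = 30        # assumption: nominal 30-day month
GB_PER_PB = 1_000_000      # assumption: decimal (SI) prefixes


def gb_per_s_to_pb_per_month(rate_gb_s):
    """Volume moved in one month at a constant rate in GB/s."""
    return rate_gb_s * SECONDS_PER_DAY * DAYS_PER_MONTH / GB_PER_PB


def pb_per_month_to_gb_per_s(volume_pb):
    """Constant rate in GB/s needed to move a monthly volume in PB."""
    return volume_pb * GB_PER_PB / (SECONDS_PER_DAY * DAYS_PER_MONTH)


if __name__ == "__main__":
    # Nominal rate quoted on the slide, sustained for a whole month.
    print(f"{gb_per_s_to_pb_per_month(1.6):.2f} PB/month")
    # Average rate corresponding to 1 PB moved in a month.
    print(f"{pb_per_month_to_gb_per_s(1.0):.3f} GB/s")
```

Under these assumptions, the nominal 1.6 GB/s would amount to about 4 PB/month if sustained continuously, while 1 PB/month corresponds to an average of roughly 0.4 GB/s.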

  38. Biomedical applications on different layers Applications 12 applications ported on the EGEE grid in areas of Medical Data management, Imaging, Bioinformatics and Drug Discovery High-level interfaces Generic portals Application specific interface Applications level Specific biomedical services Medical Data Management Data-intensive workflow management Middleware Infrastructure level Resources Communication layer

  39. WISDOM WISDOM (http://wisdom.healthgrid.org/) • Developing new drugs for neglected and emerging diseases with a particular focus on malaria. • Reduced R&D costs for neglected diseases • Accelerated R&D for emerging diseases Three large calculations: • WISDOM-I (Summer 2005) • Avian Flu (Spring 2006) • WISDOM-II (Autumn 2006) WISDOM calculations used FlexX from BioSolveIT in addition to Autodock.

  40. Docking Results

  41. Confirming in vitro the results obtained in silico • LPC Clermont-Ferrand: Biomedical grid • SCAI Fraunhofer: Knowledge extraction, Chemoinformatics • CEA, Acamba project: Biological targets, Chemogenomics • Chonnam Nat. Univ.: In vitro testing • Univ. Modena: Biological targets, Molecular Dynamics • New HealthGrid: Biomedical grid, Dissemination • ITB CNR: Bioinformatics, Molecular modelling • Academia Sinica: Grid user interface, Biological targets, In vitro testing • Univ. Los Andes: Biological targets, Malaria biology • Univ. Pretoria: Bioinformatics, Malaria biology Avian flu data challenge: in the selection of 2250 compounds out of an initial 308585 compounds, an enrichment factor of 111 was observed. Experimental trials confirmed 7 actives (“potential hits”) out of 123 compounds tested. Data challenges on malaria: the 25 most promising compounds out of 500,000 are now being tested in vitro at Chonnam National University
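
The screening numbers on this slide can be read as simple ratios. The sketch below computes the selection fraction and the experimental hit rate from the quoted counts, and shows one common definition of an enrichment factor (hit rate in the selected subset divided by hit rate in the whole library). The slide does not state which definition was used or give the counts needed to reproduce the value of 111, so the inputs to the final call are made-up illustration values, not WISDOM data.

```python
def fraction(part, whole):
    """Plain ratio with a guard against an empty denominator."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


def enrichment_factor(actives_selected, n_selected, actives_total, n_total):
    """One common definition: hit rate in the selection relative to the
    hit rate in the full library (assumed here, not taken from the slide)."""
    return fraction(actives_selected, n_selected) / fraction(actives_total, n_total)


if __name__ == "__main__":
    # Counts quoted on the slide.
    print(f"selected: {fraction(2250, 308585):.2%} of the library")
    print(f"confirmed actives: {fraction(7, 123):.1%} of compounds tested")

    # Hypothetical counts, only to show how the formula behaves.
    print(enrichment_factor(10, 100, 20, 1000))
```

An enrichment factor well above 1 means the docking-based selection concentrates actives far better than picking compounds at random would.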

  42. Earthsystem Sciences: typical workflow • Goal: learn about the past, the present, and possible futures of the earth system • Community: internationally and interdisciplinary distributed but strongly interconnected • Method: Analysing, comparing and processing data • Input: data from observations and/or other modelling studies • Workflow: Distributed Climate Data (Model Data, Scenario data, Observation Data) • 1 Find & Select: Data description • 2 Collect & Prepare: Analysis Dataset • 3 Analyse: Result Dataset • 4 Visualize

  43. An example workflow: “qflux” • 1 Find & Select relevant & available datasets in the Distributed Climate Data (temperature, specific humidity, wind speed) • 2 Collect & Prepare a temporal and spatial subset of the data: Analysis Dataset • 3 Analyse the integrated transport of humidity between selected levels: Result Dataset • 4 Visualize selected result • Location: Various data centers & portals; Institutional storage & computing facilities; local facilities; Personal Computer • Data volume: Several PB; ~3.1 TB (300-500 files); ~10.3 GB (28 files); ~76 MB; ~6 MB; ~66 KB
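
The data volumes listed for the “qflux” workflow shrink by orders of magnitude from stage to stage. The sketch below computes the reduction factor between successive quantified volumes. Decimal unit prefixes and the helper names are assumptions, and the initial "Several PB" entry is left out because the slide gives no number for it.

```python
# Bytes per unit, assuming decimal (SI) prefixes.
UNIT_BYTES = {"KB": 1e3, "MB": 1e6, "GB": 1e9, "TB": 1e12}

# Quantified volumes in the order they appear on the slide.
VOLUMES = [(3.1, "TB"), (10.3, "GB"), (76, "MB"), (6, "MB"), (66, "KB")]


def to_bytes(value, unit):
    """Convert a (value, unit) pair to bytes."""
    return value * UNIT_BYTES[unit]


def reduction_factors(volumes):
    """Ratio of each volume to the next one in the sequence."""
    sizes = [to_bytes(v, u) for v, u in volumes]
    return [before / after for before, after in zip(sizes, sizes[1:])]


if __name__ == "__main__":
    for (v1, u1), (v2, u2), factor in zip(VOLUMES, VOLUMES[1:],
                                          reduction_factors(VOLUMES)):
        print(f"{v1} {u1} -> {v2} {u2}: about {factor:.0f}x smaller")
```

With these assumptions the first quantified step alone reduces the data by a factor of about 300, which illustrates why where each step runs (data centre, institutional facility, personal computer) matters for the workflow.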

  44. Potential use of grid technology Current issues: Search & select • Different portals with different authentications and data descriptions Collect & prepare • Different access mechanisms of the different providers • Pre-processing requires sufficient local facilities Analyse • Existing tools and already processed data are available locally and miss proper description Visualize • Detached from the remaining workflow Potential use of grid technology: • Central unique authentication to a common catalogue with standardized metadata • Shared resources with standardized access hiding proprietary access mechanisms • Commonly defined tool description • Log processing steps and automatically republish processed data • Integrate basic visualization (first peep) into the workflow

  45. Presentations in User Forum on applications in EGEE and Related Projects • Specific applications • Atmosphere and Ocean Models • Earthquake modelling • Fusion • Range of biomedical applications • Computational Chemistry • Astrophysics • Space applications • HEP (LHC and non-LHC) • Applications in Related Projects • EUMEDgrid • BalticGrid • EELA • EUChinaGrid • EUIndiaGrid • G-Eclipse • SymGrid • DILIGENT • BeInGrid

  46. Sustainability: Beyond EGEE-II Need to prepare permanent, common Grid infrastructure • Ensure the long-term sustainability of the European e-infrastructure independent of short project funding cycles • Coordinate the integration and interaction between National Grid Infrastructures (NGIs) • Operate the European level of the production Grid infrastructure for a wide range of scientific disciplines to link NGIs

  47. EGEE and standards See also: http://egee-na5.web.cern.ch/egee-na5/NA5Standardisation.html EGEE and other grid infrastructures need to co-exist and interoperate • At many levels – campus, local, national, regional, international A large production system has inertia – cannot change quickly • Introducing new software and standards is slow, need to maintain backward compatibility • Cannot frequently change the infrastructure gLite choice of standard adoption is based on interoperability needs and impact assessment on the infrastructure Operational experience essential • Leads to best practices which in turn should drive standardization efforts • Actively pushing convergence for most pressing needs The EGI/NGI era will rely on interoperability and coexistence • Appropriate and workable standards will be essential • Care not to fix standards too soon – this is not mature technology

  48. Examples EGEE has worked on real community implementations of standards Example 1: SRM (Storage Resource Manager) • SRM v2.2 defined > 1 year ago to satisfy LCG requirements • Dedicated effort to reach today's state, with beta versions of real interoperating implementations (5) – and this was vital for LCG • Needed many iterations on details of the specifications • Interoperation test suites and real use case testing were essential • Also required changes to all clients – the APIs were completely changed from SRM v1.1 Example 2: GLUE (information system schema) • Today this is the accumulated knowledge of experience in real large scale production of EGEE, OSG, ARC over 5 years • The information systems are not perfect – we see scalability problems • The experience is in the schema • It can and should evolve to something better – but it must evolve • Is an OGF working group

  49. Areas of standardization Driven by the need for interoperation, co-existence, etc. EGEE is actively involved in many areas, including with OGF Security (AAA) • Policy work & IETF wg on Incident Response • VOMS and proxy certificates • Interoperability with Shibboleth Data Management • SRM, FTS Accounting & monitoring • Common usage record, schema, sensors Job Management • Gatekeeper interfaces Information system • Common schema Important for coexistence/interoperability: • areas close to fabric (accounting, monitoring, sensors, etc.) need to be common

  50. Open Issues General issues: • Making grid tools easily usable by non-experts • Failures not easy to understand • Lack of consistent or thorough error reporting • Lack of consistent administrative interfaces makes them hard to manage EGEE issues: • Portability of current gLite distribution prevents wider acceptance and coexistence
