EGI-InSPIRE SA3 “Heavy User Communities”: Past, Present & Future Jamie.Shiers@cern.ch
EGI-InSPIRE SA3: Status & Plans Jamie.Shiers@cern.ch WLCG Grid Deployment Board, June 2010
The EGI-InSPIRE Project Integrated Sustainable Pan-European Infrastructure for Researchers in Europe • A proposal for an FP7 project • Work in progress, i.e. this may all change! • Targeting call objectives: • 1.2.1.1: European Grid Initiative • 1.2.1.2: Service deployment for Heavy Users • Targeting a 3-year project (this did change!) • Seeking a total EC contribution of 25M€ Slides from S. Newhouse
II. Resource Infrastructure [Chart: PY1–PY2 trend in normalised CPU wall-clock hours] SA1 and JRA1 – June 2012
II. Resource Infrastructure [Chart: CPU usage] SA1 and JRA1 – June 2012
Communities & Activities These and other communities are supported by shared tools & services: • High Energy Physics (TSA3.3): The LHC experiments use grid computing for data distribution, processing and analysis, with a strong focus on common tools and solutions. Areas supported include Data Management, Data Analysis and Monitoring. Main VOs: ALICE, ATLAS, CMS and LHCb, but the activity covers many other HEP experiments and related projects. • Life Sciences (TSA3.4): Focuses on the medical, biomedical and bioinformatics sectors, connecting laboratories worldwide to share resources and ease access to data in a secure and confidential way. Supports 5 VOs (biomed, lsgrid, vlemed, pneumogrid and medigrid) across 6 NGIs via the Life Science Grid Community. • Astronomy & Astrophysics (TSA3.5): Covers the European Extremely Large Telescope (E-ELT), the Square Kilometre Array (SKA), the Cherenkov Telescope Array (CTA) and others. Activities focus on visualisation tools and database/catalogue access from the grid. Main VOs: Argo, Auger, Glast, Magic, Planck and CTA, plus others (23 in total) across 7 NGIs. • Earth Sciences (TSA3.6): Supports a large variety of ES disciplines. Also provides grid access to resources within the Ground European Network for Earth Science Interoperations – Digital Earth Community (GENESI-DEC), and assists scientists working on climate change via the Climate-G testbed. Main VOs: esr, egeode, climate-g, env.see-grid-sci.eu, meteo.see-grid-sci.eu and seismo.see-grid-sci.eu, supported by ~20 NGIs. EGI-InSPIRE Review 2012
SA3 Overview [Chart: SA3 effort by beneficiary] • Beneficiaries: CERN, EMBL and partners in France, Slovenia, Slovakia, Italy, Spain, Finland, Poland, Ireland and Germany • 9 countries, 11 beneficiaries • 725 PMs, 20.1 FTEs of SA3 effort EGI-InSPIRE Review 2012
SA3 Objectives • Transition to sustainable support: • Identify tools of benefit to multiple communities • Migrate these as part of the core infrastructure • Establish support models for those relevant to individual communities EGI-InSPIRE Review 2012
Achievements in Context • As an explicit example, we use the case of HEP / support for WLCG • The 3 phases of EGEE (I/II/III) overlapped almost exactly with the final preparations for LHC data taking: • WLCG Service Challenges 1–4, CCRC’08, STEP’09 • EGI-InSPIRE SA3 covered virtually all of the LHC’s initial data-taking run (3.5 TeV/beam): first data taking and discoveries! • The transition from EGEE to EGI was non-disruptive • Continuous service improvement has been demonstrated • Problems encountered during initial data taking were rapidly solved • Significant progress in the identification and delivery of common solutions • Active participation in the definition of the future evolution of WLCG EGI-InSPIRE Review 2012
WLCG Service Incidents These are significant service incidents with respect to targets defined in the WLCG MoU: they basically mean a major disruption to data taking, distribution, processing or analysis. A Service Incident Report is required. [Chart: incidents over time; “Scale Test” period marked] EGI-InSPIRE Review 2012
WLCG Service Incidents [Chart: incidents over time; “Scale Test” and “Start of Data Taking” marked] EGI-InSPIRE Review 2012
Resolution of Incidents [Chart: incident resolution versus data-taking periods] EGI-InSPIRE Review 2012
Services for HEP Focus on common solutions across (all) VOs [Diagram: common solutions] EGI-InSPIRE Review 2012
The Common Solutions Strategy of the Experiment Support group at CERN for the LHC Experiments Maria Girone, CERN On behalf of the CERN IT-ES Group CHEP, New York City, May 2012
Motivation • Despite their differences as experiments at the LHC, from a computing perspective many of the workflows are similar and can be handled with common services • While the collaborations are huge and highly distributed, the effort available for ICT development is limited and decreasing • Effort is focused on analysis and physics • Common solutions are a more efficient use of effort and more sustainable in the long run Maria Girone, CERN
Anatomy of a Common Solution [Diagram: Experiment-Specific Elements → Higher-Level Services that translate between them → Common Infrastructure Components and Interfaces] • Most common solutions can be diagrammed as the interface layer between common infrastructure elements and the truly experiment-specific components • One of the successes of the grid deployment has been the use of common grid interfaces and local site service interfaces • The experiments have environments and techniques that are unique • In common solutions we target the box in between: a lot of effort is spent in these layers, and there are big savings of effort in commonality (see the sketch below) • Not necessarily a common implementation, but a common approach & architecture • The LHC schedule presents a good opportunity for technology changes Maria Girone, CERN
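To make the layering concrete, here is a minimal Python sketch of the pattern described above. The class names (ExperimentAdapter, CommonService, AtlasAdapter) and mappings are invented for illustration, not drawn from any production service.

```python
# Illustrative sketch only: the shared middle layer delegates the
# experiment-specific pieces to a per-VO adapter (hypothetical names).
from abc import ABC, abstractmethod


class ExperimentAdapter(ABC):
    """Experiment-specific layer: each VO supplies its own mapping."""

    @abstractmethod
    def site_name(self, grid_site: str) -> str:
        """Translate a common grid site name into the VO's naming scheme."""

    @abstractmethod
    def classify_activity(self, job_type: str) -> str:
        """Map a raw job type onto the VO's activity categories."""


class CommonService:
    """Shared middle layer: identical logic for every experiment."""

    def __init__(self, adapter: ExperimentAdapter):
        self.adapter = adapter

    def record_job(self, grid_site: str, job_type: str) -> dict:
        # Common bookkeeping; only naming/classification is delegated.
        return {
            "site": self.adapter.site_name(grid_site),
            "activity": self.adapter.classify_activity(job_type),
        }


class AtlasAdapter(ExperimentAdapter):
    """A toy adapter standing in for one experiment's conventions."""

    def site_name(self, grid_site: str) -> str:
        return grid_site.upper()

    def classify_activity(self, job_type: str) -> str:
        return "analysis" if "user" in job_type else "production"


service = CommonService(AtlasAdapter())
print(service.record_job("cern-prod", "user_analysis"))
```

The design point is the one on the slide: the common layer shares approach and architecture, while each experiment keeps its own thin adapter rather than its own full stack.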
The Group • IT-ES is a unique resource in WLCG • The group is currently supported with substantial EGI-InSPIRE project effort • Careful balance of effort embedded in the experiments & on common solutions • Development of expertise in experiment systems & across experiment boundaries • People uniquely qualified to identify and implement common solutions • Matches well with the EGI-InSPIRE mandate of developing sustainable solutions • A strong and enthusiastic team Maria Girone, CERN
Activities • Monitoring and Experiment Dashboards • Allows experiments and sites to monitor and track their production and analysis activities across the grid • Includes services for data popularity, data cleaning, data integrity and site stress testing • Distributed Production and Analysis • Design and development of experiment workload-management and analysis components • Data Management support • Covers development and integration of experiment-specific and shared grid middleware • The LCG Persistency Framework • Handles the event and detector-conditions data from the experiments Maria Girone, CERN
Achievements in Context • SA3 has fostered and developed cross-VO and cross-community solutions beyond what was previously achieved • Benefits of a multi-community WP • Production use of the grid at the petascale and “Terra” scale has been fully and smoothly achieved • Benefits of many years of grid funding EGI-InSPIRE Review 2012
Reviewers’ Comments • “In view of the recent news from CERN, it can easily be seen that the objectives of WP6 (=SA3) for the current period have not only been achieved but exceeded. Technically, the work carried out in WP6 is well managed and is of a consistently high quality, meeting the goals, milestones and objectives described in the DoW.” [ etc. ] EGI-InSPIRE Review 2012
LHC Timeline EGI-InSPIRE Review 2012
Sustainability Statements EGI-InSPIRE D6.8 Draft
FP8 / Horizon 2020 • Expect first calls in 2013, with funding from late 2013 / early 2014 • IMHO, calls relating to data management and/or data preservation plus specific disciplines (e.g. Life Sciences) are likely • Will we be part of these projects? • Actively pursuing leads now with this objective • This will not solve the problem directly related to experiment support, nor address “the gap” • EU projects need not have a high overhead!
Summary • EGI-InSPIRE SA3 has provided support for many disciplines – the key grid communities at the end of EGEE III • It has played a key role in the overall support provided to experiments by IT-ES • All of the main grid communities will be affected by the end of the work package • The “sustainability plans” are documented in D6.8, due January 2013 • Expect no miracles
Examples: Data Popularity [Diagram: File Opens and Reads → Files accessed, users and CPU used → Mapping Files to Datasets → Experiment Booking Systems] • Experiments want to know which datasets are used, how much, and by whom • Good chance of a common solution • Data popularity uses the fact that all experiments open files and access storage • The monitoring information can be accessed in a common way using generic and common plug-ins • The experiments have systems that identify how those files are mapped onto logical objects like datasets, reprocessing and simulation campaigns (see the sketch below) Maria Girone, CERN
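A minimal sketch of the idea, assuming a generic access-record format and an experiment-supplied file-to-dataset mapping; the record fields and names are hypothetical, not the actual popularity service schema.

```python
# Sketch: fold generic per-file access records (the common part) into
# per-dataset popularity counters via an experiment-specific mapping.
from collections import Counter

# Generic access records, as a common storage plug-in might report them.
accesses = [
    {"file": "/store/a/f1.root", "user": "alice", "cpu_hours": 2.0},
    {"file": "/store/a/f2.root", "user": "bob", "cpu_hours": 1.5},
    {"file": "/store/b/g1.root", "user": "alice", "cpu_hours": 0.5},
]

# Experiment-specific part: which logical dataset each file belongs to.
file_to_dataset = {
    "/store/a/f1.root": "DatasetA",
    "/store/a/f2.root": "DatasetA",
    "/store/b/g1.root": "DatasetB",
}

opens = Counter()
users = {}
for rec in accesses:
    dataset = file_to_dataset[rec["file"]]
    opens[dataset] += 1
    users.setdefault(dataset, set()).add(rec["user"])

for dataset, n in opens.most_common():
    print(f"{dataset}: {n} accesses by {len(users[dataset])} users")
```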
Popularity Service • Used by the experiments to assess the importance of computing processing work, and to decide when the number of replicas of a sample needs to be adjusted up or down (a policy sketch follows) • See D. Giordano et al., [176], Implementing data placement strategies for the CMS experiment based on a popularity model Maria Girone, CERN
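A hedged illustration of what a popularity-driven replica policy could look like; the thresholds and replica limits are invented values, not those of the CMS popularity model cited above.

```python
# Invented policy: raise replicas for hot data, lower them for cold data.
def suggest_replicas(accesses_last_month: int, current_replicas: int) -> int:
    """Return a suggested replica count for one dataset."""
    if accesses_last_month > 1000:          # hot data: spread it out
        return min(current_replicas + 1, 5)
    if accesses_last_month == 0:            # cold data: reclaim space
        return max(current_replicas - 1, 1)
    return current_replicas                 # leave warm data alone


print(suggest_replicas(accesses_last_month=2500, current_replicas=2))  # -> 3
print(suggest_replicas(accesses_last_month=0, current_replicas=3))     # -> 2
```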
Cleaning Service • The Site Cleaning Agent is used to suggest obsolete or unused data that can be safely deleted without affecting analysis • The information about space usage is taken from the experiment’s dedicated data-management and transfer system (see the sketch below) Maria Girone, CERN
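A sketch of the cleaning logic under stated assumptions: datasets untouched for some period are suggested (never automatically deleted) as candidates, largest first. The threshold, field names and data layout are hypothetical.

```python
# Sketch: pick deletion candidates from last-access dates and sizes.
from datetime import date, timedelta

UNUSED_FOR = timedelta(days=180)  # invented staleness threshold

datasets = [
    {"name": "DatasetA", "size_tb": 40, "last_access": date(2012, 5, 1)},
    {"name": "DatasetB", "size_tb": 15, "last_access": date(2011, 9, 3)},
]


def cleaning_candidates(datasets, today=date(2012, 6, 1)):
    """Suggest (never auto-delete) stale data, largest savings first."""
    stale = [d for d in datasets if today - d["last_access"] > UNUSED_FOR]
    return sorted(stale, key=lambda d: d["size_tb"], reverse=True)


for d in cleaning_candidates(datasets):
    print(f"candidate: {d['name']} ({d['size_tb']} TB)")
```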
Dashboard Framework and Applications [Diagram: job submission & data transfers → sites and activities → framework & visualisation] • The Dashboard is one of the original common services • All experiments execute jobs and transfer data • Dashboard services rely on experiment-specific information for site names, activity mapping and error codes • The job-monitoring system centrally collects information from workflows about job status and success • The database, framework and visualisation are common (a minimal sketch follows) • See D. Tuckett et al., [300], Designing and developing portable large-scale JavaScript web applications within the Experiment Dashboard framework Maria Girone, CERN
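A minimal sketch of the split the slide describes: the aggregation logic is common, while site aliases and error-code meanings come from per-experiment configuration. The config contents and function names are invented.

```python
# Sketch: common job-monitoring aggregation, experiment-specific config.
from collections import defaultdict

# Per-experiment configuration (contents here are made up).
CMS_CONFIG = {
    "site_aliases": {"T1_CH_CERN": "CERN"},
    "error_codes": {8001: "stage-out failure"},
}


def aggregate(job_reports, config):
    """Common framework code: per-site success/failure summary."""
    summary = defaultdict(lambda: {"ok": 0, "failed": []})
    for report in job_reports:
        site = config["site_aliases"].get(report["site"], report["site"])
        if report["exit_code"] == 0:
            summary[site]["ok"] += 1
        else:
            reason = config["error_codes"].get(report["exit_code"], "unknown")
            summary[site]["failed"].append(reason)
    return dict(summary)


jobs = [
    {"site": "T1_CH_CERN", "exit_code": 0},
    {"site": "T1_CH_CERN", "exit_code": 8001},
]
print(aggregate(jobs, CMS_CONFIG))
```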
Site Status Board • Another example of a good common service • Takes specific lower-level checks on the health of common services • Combines them with some experiment-specific workflow probes • Includes links into the ticketing system • Combines everything into a common view (sketched below) Maria Girone, CERN
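A toy sketch of how such a board might combine results; the probe names and the simple all-checks-must-pass rule are assumptions for illustration.

```python
# Sketch: merge common health checks with experiment-specific probes.
def site_status(common_checks, experiment_probes):
    """A site is 'ok' only if every common check and every probe passes."""
    checks = {**common_checks, **experiment_probes}
    failed = [name for name, ok in checks.items() if not ok]
    return ("ok", []) if not failed else ("degraded", failed)


common = {"srm_endpoint_up": True, "ce_responding": True}
probes = {"cms_merge_workflow": False}   # experiment-specific probe
status, failed = site_status(common, probes)
print(status, failed)  # degraded ['cms_merge_workflow']
```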
HammerCloud [Diagram: distributed analysis frameworks → testing and monitoring framework → computing & storage elements] • HammerCloud is a common testing framework for ATLAS (PanDA), CMS (CRAB) and LHCb (DIRAC) • A common layer for functional testing of CEs and SEs from a user perspective • Continuous testing and monitoring of site status and readiness • Automatic site exclusion based on defined policies (see the sketch below) • Same development, same interface, same infrastructure: less workforce Maria Girone, CERN
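A hedged sketch of policy-based exclusion in the HammerCloud spirit: a rolling success rate over recent functional-test jobs drives the blacklist decision. The window size and 80% threshold are invented policy values, not HammerCloud's actual ones.

```python
# Sketch: exclude a site when its recent test-job success rate drops.
from collections import deque


class SiteTester:
    def __init__(self, window=20, threshold=0.80):
        self.results = deque(maxlen=window)   # last N test-job outcomes
        self.threshold = threshold

    def record(self, passed: bool):
        self.results.append(passed)

    def excluded(self) -> bool:
        if len(self.results) < self.results.maxlen:
            return False                      # not enough data yet
        return sum(self.results) / len(self.results) < self.threshold


tester = SiteTester()
for outcome in [True] * 12 + [False] * 8:     # 60% recent success rate
    tester.record(outcome)
print(tester.excluded())  # True: below the 80% policy threshold
```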
HammerCloud D. van der Ster et al. [283], Experience in Grid Site Testing for ATLAS, CMS and LHCb with HammerCloud
New Activities – Analysis Workflow [Diagram: data discovery, environment configuration and job splitting → job tracking, resubmission and scheduling → job submission and pilots] • Up to now, services have generally focused on monitoring activities • All of these are important, and commonality saves effort • They are not normally in the core workflows of the experiments • Success with the self-contained services has provided confidence to move into core functionality • Now looking at the analysis workflow • Feasibility study for a common analysis framework between ATLAS and CMS Maria Girone, CERN
Analysis Workflow Progress [Diagram: data discovery, job splitting and packaging of the user environment → job tracking, resubmission and scheduling → job submission and pilots] • Looking at ways to make the workflow engine common between the two experiments • Improving the sustainability of the central components that interface to low-level services • A thick layer that handles prioritisation, job tracking and resubmission • Maintaining experiment-specific interfaces • Job splitting, environment and data discovery would remain experiment-specific (see the sketch below) Maria Girone, CERN
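A sketch of the proposed split, with invented names: a common engine owns tracking and resubmission, while job splitting stays behind an experiment-specific plug-in.

```python
# Sketch: common thick layer (tracking, retries) + per-VO splitter.
def atlas_split(dataset, files_per_job=2):
    """Experiment-specific stand-in: how one VO might split a task."""
    files = [f"{dataset}/file{i}.root" for i in range(6)]
    return [files[i:i + files_per_job]
            for i in range(0, len(files), files_per_job)]


class CommonEngine:
    """Common thick layer: job tracking and resubmission."""

    def __init__(self, splitter, max_retries=3):
        self.splitter = splitter
        self.max_retries = max_retries

    def run(self, dataset, submit):
        for job_files in self.splitter(dataset):
            for _attempt in range(self.max_retries):
                if submit(job_files):         # delegate actual submission
                    print(f"submitted job with {len(job_files)} files")
                    break
            else:
                print(f"job for {job_files[0]} failed permanently")


engine = CommonEngine(splitter=atlas_split)
engine.run("DatasetA", submit=lambda files: True)
```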
Proof of Concept Diagram [Diagram] • The feasibility study proved that there are no show-stoppers to designing a common analysis framework • The next step is a proof of concept Maria Girone, CERN
Even Further Ahead [Diagram: datasets-to-file mapping → file locations and files in transfer → File Transfer Service (FTS)] • As we move forward, we would also like to assess and document the process • This should not be the only common project • The diagram for data management would look similar • A thick layer between the experiments’ logical definitions of datasets and the service that moves files • It deals with persistent location information, tracks files in progress and validates file consistency (a sketch follows) • Currently no plans for a common service, but it has the right properties Maria Girone, CERN
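A sketch only, under the assumptions above: a hypothetical thick layer resolves a dataset to files, queues the missing replicas with a file mover, and returns tracked state. The submit_transfer call is a stand-in, not the real FTS API.

```python
# Sketch: dataset resolution + transfer tracking above a generic mover.
def transfer_dataset(dataset, replicas, target_site, submit_transfer):
    """Resolve a dataset to files, queue the missing ones, track them."""
    files = {f"{dataset}/file{i}.root" for i in range(4)}  # dataset -> files
    already_there = replicas.get(target_site, set())
    in_flight = {}
    for f in sorted(files - already_there):
        in_flight[f] = submit_transfer(f, target_site)     # returns an id
    return in_flight                                       # tracked state


replicas = {"T2_EXAMPLE": {"DatasetA/file0.root"}}
queued = transfer_dataset("DatasetA", replicas, "T2_EXAMPLE",
                          submit_transfer=lambda f, s: f"xfer-{hash(f) % 100}")
print(len(queued), "files queued")
```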
Outlook • IT-ES has a good record of identifying and developing common solutions between the LHC experiments • The setup and expertise of the group have helped • Several services, focused primarily on monitoring, have been developed and are in production use • As a result, more ambitious services closer to the experiments’ core workflows are under investigation • The first is a feasibility study and proof of concept of a common analysis framework between ATLAS and CMS • Both better and more sustainable solutions could result, with lower operational and maintenance costs Maria Girone, CERN