IT-ES – PoW Input • Achievements 2010 • Plans for 2011 • Issues & Concerns: Vision for the future. IT Department Programme of Work Meeting, November 2010
Strategy • Mandate is to support (in particular) the LHC experiments in their use of the WLCG service • The priority is to increase commonality – e.g. exporting tools / ideas from one experiment to others • Examples: • HammerCloud & other analysis tools, Site & Link Commissioning, Dataset popularity, Storage accounting, Active Error Handling for real-time site monitoring, Persistency Framework, FroNTier/Squid, Monitoring / Dashboards, Security framework services (web redirector) • Common workplan with other activities supported in the group, mirroring EGI-InSPIRE
Strategy – detail • Mandate is to support (in particular) the LHC experiments in their use of the WLCG service • The group is also involved in a number of projects covering HEP, life sciences (LS), earth sciences (ES) and others • EGI-InSPIRE, PARTNER / ULICE, EnviroGRIDS • UNOSAT: a new GRID run is being discussed • And looks forward to future (FP8) activities, such as Data Preservation • And we have a common strategy that unifies all of these areas • As “experiment-ware” has expanded, this role is arguably more critical than in the past! • Today we have two sections which reflect the structure of EGI-InSPIRE SA3 • Shared Tools and Services (TSA3.2) [ DNG ] • Services for HEP (TSA3.3) [ VOS – all other ] • EGI-InSPIRE requires us to have concrete plans and measurable deliverables quarter by quarter • And a major long-term goal in terms of sustainability
Achievements (1/2) • A significant achievement was the launch of EGI-InSPIRE SA3 “Heavy User Communities” • This funds five people in Services for HEP and one person in each of the Dashboard and Ganga areas • We also consolidated the PARTNER fellows in B28 / IT and selected a ULICE fellow to be part of the Ganga support team • PARTNER was encouraged to submit a proposal (with greater IT content), eventually leading to a third FP7/FP8 project • EnviroGRIDS – contributes to DNG goals; mid-term project report; documented creation of the VO • Much better understanding of the (high) costs – and potential benefits – of such projects
Achievements (2/2) • We have shown that we can participate in experiment-related activities & projects – such as experiment operations, ATLAS DDM, site commissioning – for a finite period of time and then pull back [ already several examples – to be repeated… ] • Value to IT in these exercises is clear (policy, implementation, commonality) • A major push in terms of common solutions is paying off • Specific examples include the adoption of HammerCloud by 3 of the 4 LHC experiments • Tools to help analysis: accounting, dataset popularity, automatic site blacklisting, error logging etc. being shared • Delivered technical achievements recognized by the experiments and which clearly add value to IT @ CERN • Numerous reports, milestones and deliverables, papers at CHEP, WLCG Collaboration Workshop • Analysis of major service incidents and GGUS tickets presented at CHEP – the first time such an analysis has been done: better tools for measuring the service & monitoring improvements quarter by quarter • Group remains the king-pin of the regular WLCG operations and T1SCMs • Managed staff mobility seamlessly – including critical periods (HI run) • Detailed achievements by activity follow in later slides…
Plans for 2011 • Address priority issues from LHC experiments / WLCG • Further promote common solutions – technical working groups – to better align activities within the group with needs • More homogeneous / better analysis support • There is an opportunity here for a common ATLAS/CMS approach! • Monitoring / Dashboard requirements • Changes in Data / Storage management & Detector Conditions • Specific requirements for experiment services • Sustainability – not only an EGI-InSPIRE deliverable but essential to prepare for 2013+ • Some specific “one-time” re-engineering will have delivered by then, but IT does not stand still… (could Marie Curie funding help?) • Address multi-community issues within the context of EGI-InSPIRE and CERN’s PoW • “Physics for Health”, for example; UNOSAT, others? • Always with clear targets, timelines, metrics & review
Issues & Concerns – The Future? • LHC exploitation phase: • Production, no longer deployment… • Service, no longer gridification & development • A high number of very good project personnel, but on short contracts • Asymmetry between EGI-InSPIRE / non-InSPIRE staff may well cause friction • Contract / career opportunities • Preparing for the future – temporary staff cut to the bare minimum (1 TECH / section) • Need investment for the future • Some WLCG / IT “projects” foreseen – funding? • The activities supported by IT-ES add clear value to the department, the lab and also its plans for the future • All are joint collaborations with sites, experiments and/or EU funded • The corresponding M + P (materials + personnel) needs to be foreseen – it is not there today
What are we trying to do? • We are not only trying to help the experiments but also to guide / influence them • Examples include experiment operations where we have pushed a common model – plus the fact that it needs to be largely staffed by the experiments • Another is data management, where we have helped introduce methodology and discipline • We have 2+ years of manpower with which to ‘help’ further in this fashion • What we really need is the ability to plan in the longer term • And on timescales of 3 years (not less…) • Building on individual TECH / FELL rounds is not an efficient model for this type of work • This translates to core expertise (we need to have critical mass and to be trusted) plus a budget for “project-style” activities
IT-ES-VOS • Work (achievements, plans) divided into several areas: • Infrastructure, e.g. GGUS, Nagios, Security framework services, VOC working group, VO boxes; • Data management support: ATLAS DDM, CMS PhEDEx, LHCb DIRAC, AliEn • Analysis support (Ganga/CRAB) & HammerCloud; • Operations support – site / link commissioning, debugging of issues such as CERN-BNL BDII, shared software area issues, conditions data access • Persistency Framework: POOL, COOL, CORAL • EGI-InSPIRE SA3: helps support most of the above areas • Guided by the experiments regarding their requirements and priorities: regular reviews & targets
IT-ES-VOS: Nagios • Achievements: • Migration of experiment-specific tests from SAM to Nagios: development and deployment of new Nagios tests • Deployed the first version of Nagios sensors for SRM (a minimal probe sketch follows below) • Plans: • Nagios tests in production to test experiment services – Q1 2011
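As an illustration of how such experiment-specific tests plug into Nagios: a probe is just a program whose exit code Nagios interprets (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). Below is a minimal Python sketch of an SRM probe of this kind; the endpoint URL is hypothetical and the use of the srmping client is an assumption for illustration, the real sensors being built within the SAM/Nagios framework.

```python
#!/usr/bin/env python
# Minimal sketch of a Nagios-style SRM probe. Nagios interprets the exit
# code: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN.
import subprocess
import sys

ENDPOINT = "srm://srm.example.cern.ch:8443/srm/managerv2"  # hypothetical

def main():
    try:
        # 'srmping' stands in for whatever SRM client the test would call.
        rc = subprocess.call(["srmping", ENDPOINT], timeout=60)
    except subprocess.TimeoutExpired:
        print("CRITICAL: SRM ping timed out")
        return 2
    except OSError as exc:
        print("UNKNOWN: could not run the probe (%s)" % exc)
        return 3
    if rc == 0:
        print("OK: SRM endpoint responded")
        return 0
    print("CRITICAL: srmping exited with code %d" % rc)
    return 2

if __name__ == "__main__":
    sys.exit(main())
```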
IT-ES-VOS: GGUS • Achievements: • GGUS became(?) the sole incident tracker for WLCG • Plans: • Ensure the GGUS to Service Now transition does not perturb WLCG service delivery / incident follow-up • Defend WLCG needs in EGI and EMI user support activities
IT-ES-VOS: VOC Working Group • Achievements: • Developed and promoted common tools for the automatic deployment, management, monitoring/recovery and operation of experiment-specific, highly critical services. These tools have greatly increased the efficiency of operations and reduced the unavailability previously caused by the need for manual intervention. Security framework services (web redirector) have been made available and adopted by 3 LHC experiments and by the monitoring dashboards. • Plans • Continue in the area of security and automation for LHC experiment services in order to ensure long-term sustainability for services at CERN and outside.
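To make the automation concrete, here is a minimal sketch of the monitoring/recovery idea behind such tools, assuming hypothetical service names and a SysV-style control command; the actual VOC tools also cover deployment, configuration and alarms.

```python
# Minimal watchdog sketch: poll each critical service and restart it on
# failure. Service names and control commands are hypothetical.
import subprocess
import time

SERVICES = {
    "vobox-proxyrenewal": ["/sbin/service", "vobox-proxyrenewal"],  # assumed
}

def watchdog(interval=300):
    while True:
        for name, ctl in SERVICES.items():
            if subprocess.call(ctl + ["status"]) != 0:  # non-zero: not running
                print("restarting %s" % name)
                subprocess.call(ctl + ["restart"])
        time.sleep(interval)

if __name__ == "__main__":
    watchdog()
```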
IT-ES-VOS: Persistency Framework • Main achievements in 2010 • LHC experiment support and follow-up of service incidents • CORAL server successfully used by ATLAS HLT for LHC data taking • Analysis of Oracle server problems after the April security update • Analysis of GSSAPI problems in Globus and Oracle client libraries • Twelve releases in 2010 of the full PF stack (CORAL, COOL, POOL) • Bug fixes and a few enhancements as requested by the experiments • New platforms (osx10.6, gcc4.5, icc, llvm), external upgrades (ROOT...) • Started to consolidate the POOL and CORAL test infrastructures • Installed Oracle 11.2 client patches used by the new releases • Complete fix for SELinux/SLC5, fix for ATLAS crash on AMD quad-core • Plans for 2011 • Continue support and software releases for the LHC experiments • Fix CORAL handling of database connection glitches • Complete the consolidation of the CORAL test infrastructure • Support and optimizations for the use of Frontier and CORAL server [A. Valassi, IT-ES]
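On the plan to fix the handling of database connection glitches: CORAL itself is C++, so the following Python fragment is only a sketch of the reconnect-and-retry idea, with an assumed exception type and back-off policy.

```python
# Sketch of retry-on-glitch: reconnect and retry the operation with
# exponential back-off. Exception type and policy are assumptions.
import time

def with_retries(operation, connect, max_attempts=3, backoff=2.0):
    """Run operation(session); on a connection glitch, reconnect and retry."""
    delay = 1.0
    for attempt in range(1, max_attempts + 1):
        session = connect()                   # (re)open the database session
        try:
            return operation(session)
        except ConnectionError:               # stand-in for a network glitch
            if attempt == max_attempts:
                raise
            time.sleep(delay)
            delay *= backoff
```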
IT-ES-VOS: HammerCloud • Achievements: • HammerCloud improved ATLAS distributed analysis reliability by adding a suite of frequent validation tests at all sites; in parallel, the service was generalized and successfully deployed for CMS and LHCb. • Plans: • We plan to continue our support for ATLAS and to finalize the integration of HC into the site commissioning and continuous validation procedures as they apply to CMS and LHCb. • On the development side, we will improve the test-result analysis tools to make it easier to identify specific site problems.
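The validation-and-blacklisting logic can be pictured with a small sketch: each site receives frequent functional test jobs, and a site whose recent success rate falls below a threshold is flagged for removal from the analysis pool. The threshold and data layout below are assumptions, not HammerCloud's actual policy.

```python
# Sketch of threshold-based site blacklisting from functional test results.
SUCCESS_THRESHOLD = 0.80  # assumed value; the real policy is per experiment

def site_status(results):
    """results: outcomes (True/False) of the most recent test jobs at a site."""
    if not results:
        return "unknown"
    rate = sum(results) / float(len(results))
    return "online" if rate >= SUCCESS_THRESHOLD else "blacklisted"

recent = {"SITE_A": [True, True, True, True, False],   # 80% -> online
          "SITE_B": [False, False, True]}              # 33% -> blacklisted
for site, outcomes in sorted(recent.items()):
    print(site, site_status(outcomes))
```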
IT-ES-VOS: ATLAS • Achievements: • Our effort in the development, integration, testing, operations and support of ATLAS distributed computing services with the WLCG infrastructure contributed significantly to the success of ATLAS physics analysis in the first year of data taking. • Plans: • We plan to focus our effort on the ATLAS distributed analysis activity, through improvements to the Ganga framework, evaluation and development of new data management functionality, and operational support with a special focus on the CERN facility.
IT-ES-VOS: CMS • Achievements • Successfully integrated the HammerCloud testing system into the CMS computing tools • Successfully migrated the SAM tests to the Nagios framework • Improved monitoring of Tier-1 resource utilization • Improved coordination with Tier-1 sites • Expanded CMS Site Readiness to include more metrics • Supported the CMS effort in commissioning the full T2-T2 transfer matrix • Involved CMS Computing Run Coordinators in daily WLCG operations • First version of FTS transfer monitoring deployed and used to spot issues in CMS transfer operations • Plans • Consolidate monitoring information on site storage usage • Review and improve site monitoring • Investigate data access protocols (NFS 4.1 etc.)
IT-ES-VOS: CMS PhEDEx • PhEDEx achievements in 2010 • Deployed to Production improvements to the site storage consistency tools, now routinely used in CMS operations • Developed new features for dataset subscriptions as requested by CMS computing management, currently undergoing final validation tests before release to Production during the LHC winter shutdown • Added several new modules to the new beta PhEDEx web site, including views for CMS computing shifters, and the new subscription management panel • PhEDEx PoW for 2011 • Improve transfer latency and routing (reducing PhEDEx overhead as much as possible), and provide realistic latency monitoring for users (block replica ETA) • Review the UI and complete the functionality of the new PhEDEx web site, gradually replacing the current web page, and adding new features requested by CMS users • Begin development of new authentication/authorization chains for data management request approval, to be deployed during the next major shutdown (2012?)
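The "block replica ETA" mentioned above can be thought of, in its simplest form, as a rate-based estimate: bytes still to transfer divided by recently observed throughput. A sketch with assumed inputs follows; the real latency monitoring must also account for routing and queueing.

```python
# Naive rate-based ETA for a block replica; inputs are assumed to be known
# from transfer bookkeeping (bytes left) and recent monitoring (throughput).
def replica_eta(bytes_remaining, recent_rate_bps):
    """Estimated seconds until the block replica completes."""
    if recent_rate_bps <= 0:
        return float("inf")   # no recent throughput: no meaningful ETA
    return bytes_remaining / float(recent_rate_bps)

# e.g. 1.5 TB left at a sustained 200 MB/s -> about 2.1 hours
print(replica_eta(1.5e12, 2.0e8) / 3600.0)
```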
IT-ES-VOS: CRAB • Plan: Deploy CRAB3 by the end of 2011. The main items that will be part of CRAB3 are: • Improved integration with data management, to manage users' produced data. • Improved client-server communication, based on a RESTful interface (a minimal sketch follows below). • A flexible, generic client that is not application-specific. • A Log-Archive feature. • A reviewed Job State Machine.
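A minimal sketch of the planned REST-style client-server exchange follows, using Python 2's urllib2 (current at the time); the server URL, resource name and payload fields are all hypothetical.

```python
# Sketch of a generic REST client call: the client only marshals a JSON
# payload to a resource URL; nothing here is application-specific.
import json
import urllib2  # Python 2; urllib.request in Python 3

SERVER = "https://crabserver.example.cern.ch/api"  # hypothetical

def submit_task(task):
    req = urllib2.Request(SERVER + "/task",
                          data=json.dumps(task),
                          headers={"Content-Type": "application/json"})
    return json.load(urllib2.urlopen(req))

# the application-specific part lives entirely in the payload
print(submit_task({"dataset": "/Example/Dataset/RECO", "jobs": 10}))
```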
IT-ES-VOS: LHCb • Evaluation and eventual integration of an alternative to Oracle for detector conditions at Tier-1 sites • Evaluation and eventual integration of a CernVM-FS based solution for s/w distribution • Improvements in data management / transfer monitoring • DIRAC maintenance and development, documentation • Maintenance of the Dashboard SiteView Gridmap application
IT-ES-VOS: ALICE • Plans: • Contribute to the new AliEn version 2.19 and following releases • Automate and improve daily operations tasks related to ALICE • Support ALICE operations • Reorganize the VO-box templates to make them as general as possible, so that they are more maintainable and easier for newcomers • Introduce more Lemon monitoring for the services on the VO boxes, as other experiments do, and generalize it as much as possible so that it can be shared with other experiments
IT-ES-DNG: monitoring • The Experiment Dashboard monitoring applications proved to be an essential component of the monitoring infrastructure of the LHC experiments during the first year of data taking. • Though the load on the applications steadily increased in terms of data volume, update rate and number of users, the applications scaled well. • Usage of the Dashboard applications grew steadily during 2010. • The CMS Dashboard, for example, serves up to 5K unique visitors per month, and more than 100K pages are viewed daily. • The performance of several Dashboard applications was substantially increased. • New functionality was enabled for the existing applications and new applications were developed.
IT-ES-DNG: monitoring • The number of generic monitoring solutions provided by the Experiment Dashboard and shared by several LHC experiments has increased. • The Site Status Board application is set up for all 4 experiments, and both ATLAS and CMS use it for computing shifts. • The generic solution for instrumenting the experiment workload management systems to report monitoring data via the Messaging System for the Grid (MSG) was successfully applied to ATLAS jobs. • A new version of the generic Dashboard job monitoring was deployed for ATLAS and CMS; ATLAS and CMS now share a common job monitoring system. • Developed and deployed a new application focused on the needs of the analysis support team, which provides analysis statistics and generates weekly reports. • Developed and deployed in production (currently for CMS) a data access monitoring application, which makes it possible to understand which datasets are accessed by the physics community, and how intensively. The application is generic and can be shared with ATLAS (a sketch of the underlying aggregation follows below). • Developed a prototype of a new monitoring display for data transfer systems, which gives a complete overview of data transfers from source to destination, with intuitive navigation to detailed information about each transfer.
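The core of such a popularity application is a simple aggregation over access records, counting accesses and distinct users per dataset; the (dataset, user) record format below is an assumption for illustration.

```python
# Sketch of dataset popularity aggregation from raw access records.
from collections import defaultdict

def popularity(access_log):
    """access_log: iterable of (dataset, user) access records."""
    naccess = defaultdict(int)
    users = defaultdict(set)
    for dataset, user in access_log:
        naccess[dataset] += 1
        users[dataset].add(user)
    return dict((d, (naccess[d], len(users[d]))) for d in naccess)

log = [("/A/B/RAW", "alice"), ("/A/B/RAW", "bob"), ("/C/D/AOD", "alice")]
print(popularity(log))   # dataset -> (accesses, distinct users)
```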
IT-ES-DNG: monitoring • The global monitoring system integrates data coming from multiple VO-specific monitoring systems and provides a high-level monitoring view of the LHC activities on the Grid. • The WLCG Google Earth Dashboard has been demonstrated at many computing sites and was used as a dissemination and publicity tool at multiple exhibitions, conferences and workshops.
IT-ES-DNG: dashboard • Dashboard modules have been migrated to SLC5. • Introduced a new, Python-version-agnostic deployment model. • Migrated the Dashboard repository from CVS to SVN. • The Dashboard cluster is being migrated to virtual machines. • Dashboard web applications started to use the web redirector with CERN single sign-on (Shibboleth). • Enabled in the framework the possibility to connect from the same module to multiple database instances. • Started to redesign user interfaces using modern web technologies such as jQuery, Django and client-side plotting.
IT-ES-DNG: distributed analysis • During the first year of data taking the Ganga job submission system was used intensively by LHC users, notably in the ATLAS and LHCb experiments. The average monthly number of Ganga users is 240 for ATLAS and 220 for LHCb. • Developed generic plug-ins for Ganga which enable the reporting of job status information from various submission backends to the Messaging System for the Grid (see the sketch below). • Developed the Ganga task monitoring application. • Multiple improvements were made for the ALICE experiment in the AliEn workload management system, among them support for glexec, SE discovery, integration with site analysis facilities (such as CAF, LAF, SAF), and https authentication using grid_site. • A generic error reporting system shared by ATLAS, LHCb and CMS was developed and is currently being validated by the user community and analysis support teams.
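MSG is built around a message broker spoken to via the STOMP text protocol, so a reporting plug-in essentially publishes small status records to a topic. A bare-bones sketch follows; the broker endpoint, topic name and payload format are assumptions, and a real plug-in would use a proper STOMP client library rather than raw sockets.

```python
# Bare-bones STOMP publish: open a session, send one frame, close.
# STOMP frames are text blocks terminated by a NUL byte.
import socket

def publish(host, port, destination, body):
    s = socket.create_connection((host, port))
    s.sendall("CONNECT\n\n\x00".encode())          # open a STOMP session
    s.recv(4096)                                   # CONNECTED reply (ignored)
    frame = "SEND\ndestination:%s\n\n%s\x00" % (destination, body)
    s.sendall(frame.encode())
    s.close()

publish("msg.example.cern.ch", 6163,               # hypothetical broker
        "/topic/ganga.jobstatus",
        "jobid:1234 status:running backend:LCG")
```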
IT-ES-DNG plans for monitoring • Concentrate effort on the generic applications which are shared by several LHC experiments, namely job monitoring, Site Usability and the Site Status Board. • Develop and deploy a new version of the Site Usability interface compatible with the new version of SAM, which is currently being redesigned. • Implement in the Site Status Board a common solution for providing information about scheduled and unscheduled downtimes. • Develop and deploy in production a new version of the ATLAS Data Management monitoring. The part of this work which relates to visualization can be reused in a wider scope for any generic data transfer monitoring application. • Improve Dashboard user interfaces using modern technologies such as jQuery and client-side plotting.
IT-ES-DNG plans for analysis • Enable the following new functionality in AliEn: • Automatic removal of unused physical files • Automatic collocation of files according to job requirements • Integration of PoD (Proof on Demand) • Improve the scalability of the system • Within the framework of the EGI-InSPIRE work, exploit the new task monitoring applications for Ganga/DIANE for communities outside the LHC scope
2010 PARTNER – GRID (D. Abler, V. Kanellopoulos, F. Roman) • Conceptual services and ethical & legal requirements for • Scientific use case: rare tumour database • Clinical use case: patient referral • First steps towards a collaborative portal using grids • Prototype status: • Liferay, VINE, AMGA, cgMDR installed • VOMS integration • Basic interfaces/services to test functionality • Compilation of a dataset of relevant medical/biological information to record and evaluate hadron therapy • Combined project to record and evaluate toxicities in hadron therapy and to test core functionalities needed for an information sharing platform • Abler, D.; Kanellopoulos, V. & Roman, F. L., “Future information sharing in Hadron Therapy”, poster at the 'Physics for Health in Europe' workshop, 2-4 Feb. 2010, CERN, Geneva • FP7/2007-2013 PITN-GA-2008-215840-PARTNER
2010 PARTNER – Monte Carlo (T. Böhlen) • Benchmarking of nuclear models of FLUKA and GEANT4 for carbon ion therapy (published in Phys. Med. Biol. 55 5833) • Development of a MC simulation code for the FIRST Experiment • Simulation studies for the design of the detectors • Simulation of detector responses • Simulating microdosimetric measurements for hadron therapy with FLUKA (contribution to the MC2010 conference + paper) • FP7/2007-2013 PITN-GA-2008-215840-PARTNER
PARTNER plans for 2011 (D. Abler, V. Kanellopoulos, F. Roman) • Development of a “Semantic Framework” within HISP and implementation of an adverse-event (AE) reporting system combining aspects of the clinical and scientific use cases. Modeling and service logics for data recording and data typing using cancerGrid developments, and for metadata management and reasoning • Development of evaluation methods for AE data • Workflow engine integration & security framework • Evaluation of the performance and scalability of the approach in HISP • Development of a model to predict the outcome of radiotherapy: comparison of photon, proton and carbon therapy of chordoma • FP7/2007-2013 PITN-GA-2008-215840-PARTNER
PARTNER plans for 2011 (T. Böhlen) • Finalizing the development of a physics model for FLUKA which describes the acollinearity of two-quantum annihilation for PET • Continued work for the FIRST Experiment • Coupling of a radiobiological model with Monte Carlo simulations • FP7/2007-2013 PITN-GA-2008-215840-PARTNER