Consolidation of Grid operations
Costin Grigoras, ALICE Offline
Preamble
• In the period of steady LHC operation, Grid usage is constant and high. As foreseen, the Grid is used for massive RAW and MC production and also (quite successfully) for end-user analysis
• To help Grid users and administrators, many applications were developed in the early years of the Grid. ALICE has made an effort to consolidate all of them into a coherent set of monitoring and control tools
• This presentation is a quick overview of some of them
Central production management - LPM
• Speed is of the essence: RAW reconstruction promptly follows data taking, allowing for immediate QA and physics analysis
• LPM (Lightweight Production Manager)
• Several triggers to assure RAW and conditions data integrity
• Fully automatic
• Also replicates RAW data to T1
• Manages not only Pass1, but all central RAW and MC productions and the organized analysis trains
• So far, 360 production cycles have been handled by LPM
Dependent tasks - LPM chains
• Data processing jobs that must be launched only after a previous process has completed successfully
• For example, the QA tasks are 'cascaded' after the Pass1 RAW reconstruction is completed
• The same applies to AOD production and data merging
• The depth of cascading is unlimited
• Considerably speeds up data production!
LPM chains logic
[Flowchart; recoverable labels: Reco. (1 job/chunk); QA (1 job/chunk); QA merging; Delete partial output; Merge ROOT tags; When complete, start in parallel; Resubmit error jobs; AOD (1 job/chunk); AOD merging; Delete partial output]
• The same mechanism is also used for Monte Carlo productions and analysis trains on MC and RAW data
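The cascading rule described above (a task starts only once the tasks it depends on have completed) can be sketched in a few lines of Python. This is an illustrative sketch only, not LPM's actual code: the `Task` class, the `ready_tasks` function, and the task names are hypothetical, and the dependency layout is an assumption read from the flowchart labels.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    # Hypothetical representation of one step in a chain.
    name: str
    depends_on: list = field(default_factory=list)  # names of prerequisite tasks
    enabled: bool = True  # a branch can be temporarily disabled


def ready_tasks(tasks, completed):
    """Return the names of enabled, not yet completed tasks whose
    prerequisites have all completed."""
    return [
        t.name
        for t in tasks
        if t.enabled
        and t.name not in completed
        and all(dep in completed for dep in t.depends_on)
    ]


# Assumed layout: QA and AOD both cascade after reconstruction,
# and each merging step cascades after its own production step.
chain = [
    Task("reco"),
    Task("qa", ["reco"]),
    Task("aod", ["reco"]),
    Task("qa_merging", ["qa"]),
    Task("aod_merging", ["aod"]),
]

print(ready_tasks(chain, completed=set()))     # ['reco']
print(ready_tasks(chain, completed={"reco"}))  # ['qa', 'aod']
```

Because readiness is recomputed from the set of completed tasks, the depth of the chain is not limited by the sketch, which matches the "unlimited depth" property stated on the slide.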
LPM chains logic – example
• Parallel productions are possible, with different weights / priorities
• Branches can be temporarily disabled
• Tasks can be simple JDLs or more complex code that prepares the execution (creating collections, checking conditions)
Integration of Grid status monitoring
• Monitoring data (MonALISA) is used to trigger the LPM activity
• New jobs are submitted when the number of waiting tasks falls below a threshold
• Pre-staging of data from tape is triggered before the reconstruction jobs are submitted
• Running jobs are tracked individually for resource usage
• Automatic alerts are raised in case of unreasonable disk/memory/CPU consumption; jobs can be terminated…
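The threshold rule for submitting new jobs can be illustrated with a minimal Python sketch. The function name, the parameters, and the fixed batch size are assumptions made for illustration; the slides do not say how many jobs LPM submits at a time or how it reads the waiting count from MonALISA.

```python
def jobs_to_submit(waiting, threshold, batch_size):
    """Return how many new jobs to submit.

    Assumption: a fixed-size batch is submitted whenever the number of
    waiting jobs drops below the threshold, and nothing otherwise.
    """
    if waiting < threshold:
        return batch_size
    return 0


# Hypothetical numbers, for illustration only.
print(jobs_to_submit(waiting=50, threshold=100, batch_size=200))   # 200
print(jobs_to_submit(waiting=150, threshold=100, batch_size=200))  # 0
```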
Resource usage alerts
• Alerts currently trigger at 2 GB RSS
• Mail is sent to both the admins and the user
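A check of this kind might look like the following Python sketch. Only the 2 GB RSS threshold and the recipients (admins plus the job owner) come from the slide; the record layout, the function names, the example addresses, and the binary reading of "2 GB" are assumptions.

```python
# 2 GB threshold from the slide; interpreting it as 2 * 1024**3 bytes is an assumption.
RSS_LIMIT_BYTES = 2 * 1024**3


def jobs_over_limit(jobs, limit=RSS_LIMIT_BYTES):
    """Return the jobs whose resident set size exceeds the limit."""
    return [job for job in jobs if job["rss_bytes"] > limit]


def alert_recipients(job, admins):
    """Both the admins and the job owner are notified (duplicates removed)."""
    return sorted(set(admins) | {job["owner"]})


# Hypothetical job records, for illustration only.
jobs = [
    {"id": 1, "owner": "alice@example.org", "rss_bytes": 1 * 1024**3},
    {"id": 2, "owner": "bob@example.org", "rss_bytes": 3 * 1024**3},
]

for job in jobs_over_limit(jobs):
    print(job["id"], alert_recipients(job, admins=["admin@example.org"]))
```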
Opportunistic storage discovery
• A client-to-storage metric allows automatic discovery of the closest (working) storage elements from every job
• Based on the network topology information collected by MonALISA
• Continuous functional tests of storages
• SE occupancy status
• Users specify the number of output replicas and the type of storage (disk, custodial), but not the SEs
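The selection described above can be sketched as a filter followed by a sort. This is a simplified illustration, not the AliEn implementation: the `StorageElement` fields, the single numeric `distance`, the boolean `full` flag standing in for occupancy status, and the SE names are all hypothetical.

```python
from typing import NamedTuple


class StorageElement(NamedTuple):
    # Hypothetical view of what the monitoring knows about one SE.
    name: str
    distance: float  # client-to-storage metric, lower means closer
    working: bool    # result of the continuous functional tests
    full: bool       # simplified stand-in for the occupancy status
    kind: str        # "disk" or "custodial"


def pick_storages(elements, replicas, kind="disk"):
    """Return the names of the closest working, non-full SEs of the requested type."""
    candidates = [
        se for se in elements
        if se.working and not se.full and se.kind == kind
    ]
    candidates.sort(key=lambda se: se.distance)
    return [se.name for se in candidates[:replicas]]


# Hypothetical storage elements, for illustration only.
storages = [
    StorageElement("SE_A", 0.1, True, False, "disk"),
    StorageElement("SE_B", 0.3, False, False, "disk"),      # failing tests
    StorageElement("SE_C", 0.5, True, False, "disk"),
    StorageElement("SE_D", 0.2, True, False, "custodial"),
    StorageElement("SE_E", 0.05, True, True, "disk"),       # full
]

print(pick_storages(storages, replicas=2))                    # ['SE_A', 'SE_C']
print(pick_storages(storages, replicas=1, kind="custodial"))  # ['SE_D']
```

The caller states only how many replicas and which storage type it wants, which mirrors the point that users do not name the SEs themselves.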
[Figure panels: France, Nordic Countries, Italy, Russia, USA]
User catalogue and job management
• Web-based access to the AliEn catalogue (with certificate authentication)
• Insert your favorite plugin (ROOT) here
Catalogue browser – view and edit
• Viewer with syntax highlighting and catalogue links
• SE discovery syntax is highlighted
Job management
• Full job tracking, with submission and resubmission capabilities
Job management
• Detailed view of a particular masterjob
• All trace logs can be accessed online
Summary
• The Grid has been in full production mode for almost one year
• Its operation is very successful, providing millions of CPU days and PBs of storage
• To use these resources efficiently, consolidated tools are needed
Thanks a lot for your attention! Questions please? http://alimonitor.cern.ch/