THE INFN GRID PROJECT
• Scope: study and develop a general INFN computing infrastructure, based on GRID technologies, to be validated (as a first use case) by implementing distributed Regional Center prototypes for the LHC expts ATLAS, CMS and ALICE and, later on, also for other INFN expts (Virgo, Gran Sasso, …)
• Project status:
  • Outline of the proposal submitted to INFN management on 13-1-2000
  • 3-year duration
  • Next meeting with INFN management on 18th of February
  • Feedback documents from the LHC expts by end of February (sites, FTEs, …)
  • Final proposal to INFN by end of March
INFN & “Grid Related Projects” • Globus tests • “Condor on WAN” as general purpose computing resource • “GRID” working group to analyze viable and useful solutions (LHC computing, Virgo…) • Global architecture that allows strategies for the discovery, allocation, reservation and management of resource collection • MONARC project related activities
Evaluation of the Globus Toolkit
• 5-site testbed (Bologna, CNAF, LNL, Padova, Roma1)
• Use case: CMS HLT studies
  • MC production, complete HLT chain
• Services to test/implement:
  • Resource management
    • fork() interface to different local resource managers (Condor, LSF)
    • Resources currently chosen by hand; a Smart Broker to implement a global resource manager (sketched below)
  • Data Mover (GASS, GsiFTP, …)
    • to stage executables and input files
    • to retrieve output files
  • Bookkeeping (is this worth a general tool?)
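A minimal sketch of what such a Smart Broker could do, replacing the by-hand choice of resources: rank candidate sites by free CPUs and by whether the input data is already staged there. The site names, CPU counts and dataset labels are invented for illustration and do not correspond to any real Globus interface.

```python
# Illustrative Smart Broker sketch: pick an execution site automatically
# instead of by hand. All site data below is invented for the example.

from dataclasses import dataclass

@dataclass
class Site:
    name: str
    free_cpus: int
    staged_datasets: set

def rank(site: Site, dataset: str) -> tuple:
    # Prefer sites that already hold the input data, then the most free CPUs.
    return (dataset in site.staged_datasets, site.free_cpus)

def choose_site(sites, dataset):
    return max(sites, key=lambda s: rank(s, dataset))

if __name__ == "__main__":
    sites = [
        Site("Bologna", 12, {"hlt_minbias"}),
        Site("LNL", 40, set()),
        Site("Padova", 8, {"hlt_minbias"}),
    ]
    best = choose_site(sites, "hlt_minbias")
    print(f"Submit job to {best.name}")  # actual submission (e.g. via GRAM) not shown
```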
Status
• Globus installed on 5 Linux PCs at 3 sites
• Globus Security Infrastructure
  • works!
• MDS
  • Initial problems accessing data (long response times and timeouts)
• GRAM, GASS, Gloperf
  • Work in progress
Condor on WAN: Objectives
• Large INFN project of the Computing Commission involving ~20 sites
• INFN collaboration with the Condor Team at UWISC (University of Wisconsin-Madison)
• First goal: Condor “tuning” on the WAN
  • Verify Condor reliability and robustness in a Wide Area Network environment
  • Verify suitability to INFN computing needs
  • Network I/O impact and measurements
Second goal: the Network as a Condor Resource
• Dynamic checkpointing and checkpoint domain configuration
  • Pool partitioned into checkpoint domains (a dedicated ckpt server for each domain)
  • A checkpoint domain is defined according to (a possible grouping rule is sketched below):
    • the presence of a sufficiently large CPU capacity
    • the presence of a set of machines with efficient network connectivity
• Sub-pools
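A possible grouping rule, sketched under simple assumptions: machines are grouped by network locality (here crudely approximated by subnet), a group becomes a checkpoint domain only if it brings enough aggregate CPU capacity, and one machine per domain is designated as its dedicated checkpoint server. The threshold and host data are invented for the example; the real configuration is done in Condor, not Python.

```python
# Illustrative checkpoint-domain definition: group hosts by subnet, keep
# groups with enough aggregate capacity, pick a dedicated ckpt server each.

from collections import defaultdict

MIN_DOMAIN_CAPACITY = 20.0  # arbitrary aggregate-capacity threshold

def define_ckpt_domains(hosts):
    """hosts: list of (hostname, subnet, cpu_capacity) tuples."""
    by_subnet = defaultdict(list)
    for name, subnet, capacity in hosts:
        by_subnet[subnet].append((name, capacity))

    domains = {}
    for subnet, members in by_subnet.items():
        if sum(c for _, c in members) >= MIN_DOMAIN_CAPACITY:
            # Dedicated checkpoint server: the highest-capacity machine.
            server = max(members, key=lambda m: m[1])[0]
            domains[subnet] = {"ckpt_server": server,
                               "hosts": [m[0] for m in members]}
    return domains

if __name__ == "__main__":
    hosts = [("pc1.bo", "131.154.1", 10.0), ("pc2.bo", "131.154.1", 15.0),
             ("pc1.pd", "193.205.2", 5.0)]
    print(define_ckpt_domains(hosts))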
Checkpointing: next step
• Distributed dynamic checkpointing
  • Pool machines select the “best” checkpoint server (from a network point of view)
  • The association between execution machine and checkpoint server is decided dynamically (sketched below)
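A minimal sketch of the idea, assuming the network "view" is reduced to a simple round-trip measurement: each execution machine probes the available checkpoint servers and binds to the closest one at run time. The server names are hypothetical and the TCP-connect timing is only a stand-in for a real network metric.

```python
# Illustrative distributed dynamic checkpointing: choose the "best"
# checkpoint server by measured network distance (here, TCP connect time).

import socket
import time

def probe_rtt(host: str, port: int = 22, timeout: float = 2.0) -> float:
    """Return the time to open a TCP connection, or infinity on failure."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return float("inf")

def best_ckpt_server(servers):
    return min(servers, key=probe_rtt)

if __name__ == "__main__":
    # Hypothetical checkpoint-server names, for illustration only.
    servers = ["ckpt.cnaf.infn.it", "ckpt.lnl.infn.it", "ckpt.roma1.infn.it"]
    print("Selected checkpoint server:", best_ckpt_server(servers))
```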
Implementation Characteristics of the INFN Condor pool:
• Single pool
  • To optimize CPU usage across all INFN hosts
• Sub-pools
  • To define policies/priorities on resource usage (a toy matching rule is sketched below)
• Checkpoint domains
  • To guarantee the performance and the efficiency of the system
  • To reduce network traffic due to checkpointing activity
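A toy illustration of how sub-pool policies could steer jobs while everything remains part of a single INFN-wide pool: each sub-pool declares which experiments it accepts and with what priority. The sub-pool names and policies are invented; the real mechanism would live in Condor configuration, not in Python.

```python
# Illustrative sub-pool policy matching: route a job to the highest-priority
# sub-pool whose usage policy accepts it. Data below is invented.

SUBPOOLS = {
    "cms-bologna": {"accepts": {"cms"}, "priority": 1},
    "atlas-roma1": {"accepts": {"atlas"}, "priority": 1},
    "general":     {"accepts": {"cms", "atlas", "virgo"}, "priority": 5},
}

def match_subpool(experiment: str):
    """Return the best (lowest priority number) sub-pool accepting the job."""
    candidates = [(cfg["priority"], name)
                  for name, cfg in SUBPOOLS.items()
                  if experiment in cfg["accepts"]]
    return min(candidates)[1] if candidates else None

if __name__ == "__main__":
    print(match_subpool("cms"))    # -> cms-bologna
    print(match_subpool("virgo"))  # -> general
```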
[Figure: INFN Condor Pool on WAN: checkpoint domains mapped onto the GARR-B topology (155 Mbps ATM), showing network access points (PoP) and main transport nodes. Each checkpoint domain has a dedicated ckpt server; the default CKPT domain and the Central Manager are at CNAF. Current scale: ~180 machines with 6 ckpt servers; planned: 500-1000 machines with 25 ckpt servers. US connectivity via ESnet at 155 Mbps.]
Management
• Central management (condor-admin@infn.it)
• Local management (condor@infn.it)
• Steering committee
• Software maintenance contract with the Condor support team of the University of Wisconsin-Madison
INFN-GRID project requirements
Networked Workload Management:
• Optimal co-allocation of data, CPU and network for a specific “grid/network-aware” job (a toy cost model is sketched below)
• Distributed scheduling (data and/or code migration)
• Unscheduled/scheduled job submission
• Management of heterogeneous computing systems
• Uniform interface to various local resource managers and schedulers
• Priorities and policies on resource (CPU, data, network) usage
• Bookkeeping and ‘web’ user interface
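A toy cost model for the data-vs-code migration decision, under invented numbers: given the input-data size, the bandwidth between sites and the CPU conditions where the data sits, estimate whether it is cheaper to move the job to the data or the data to the job.

```python
# Illustrative co-allocation decision for a "grid-aware" job: compare code
# migration (run where the data is) against data migration (move the data).
# The cost model and all numbers are invented for the example.

def transfer_time(data_gb: float, bandwidth_mbps: float) -> float:
    return (data_gb * 8 * 1024) / bandwidth_mbps  # seconds

def schedule(data_gb, bandwidth_mbps, cpu_time_s, slowdown_at_data_site):
    # Option A: code migration, possibly on a slower/busier CPU.
    run_at_data = cpu_time_s * slowdown_at_data_site
    # Option B: data migration, then run at full speed.
    move_data = transfer_time(data_gb, bandwidth_mbps) + cpu_time_s
    return "code migration" if run_at_data <= move_data else "data migration"

if __name__ == "__main__":
    print(schedule(data_gb=50, bandwidth_mbps=155,
                   cpu_time_s=3600, slowdown_at_data_site=1.4))
```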
Project req. (cont.)
Networked Data Management:
• Universal name space: transparent, location independent (a replica-catalogue sketch follows below)
• Data replication and caching
• Data mover (scheduled/interactive, at OBJ/file/DB granularity)
• Loose synchronization between replicas
• Application metadata, interfaced with a DBMS (e.g. Objectivity, …)
• Network services definition for a given application
• End-system network protocol tuning
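A minimal sketch of a location-independent name space, assuming a replica catalogue that maps a logical file name to its physical copies and returns the "closest" one; the catalogue entries, URLs and distance table are invented for illustration.

```python
# Illustrative replica catalogue: resolve a logical file name to the
# physical replica at the nearest site. All data below is invented.

REPLICA_CATALOGUE = {
    "lfn:/cms/hlt/minbias_001.db": [
        "gsiftp://cnaf.infn.it/data/minbias_001.db",
        "gsiftp://lnl.infn.it/store/minbias_001.db",
    ],
}

SITE_DISTANCE = {"cnaf.infn.it": 2, "lnl.infn.it": 1}  # from the client site

def resolve(lfn: str) -> str:
    replicas = REPLICA_CATALOGUE[lfn]
    # url.split("/")[2] extracts the host part of the gsiftp URL.
    return min(replicas, key=lambda url: SITE_DISTANCE[url.split("/")[2]])

if __name__ == "__main__":
    print(resolve("lfn:/cms/hlt/minbias_001.db"))
```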
Project req. (cont.)
Application Monitoring/Management:
• Performance: “instrumented systems” with timing information and analysis tools (sketched below)
• Run-time analysis of collected application events
• Bottleneck analysis
• Dynamic monitoring of GRID resources to optimize resource allocation
• Failure management
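A small sketch of what "instrumented" could mean in practice: run-time events are collected with timing information and the slowest phase is reported as the bottleneck. The phase names and sleep durations are placeholders for real application steps.

```python
# Illustrative application instrumentation: time named phases and report
# the slowest one as the bottleneck.

import time
from contextlib import contextmanager

EVENTS = []  # (phase name, duration in seconds)

@contextmanager
def timed(phase: str):
    start = time.monotonic()
    try:
        yield
    finally:
        EVENTS.append((phase, time.monotonic() - start))

def bottleneck():
    return max(EVENTS, key=lambda e: e[1]) if EVENTS else None

if __name__ == "__main__":
    with timed("stage-in"):
        time.sleep(0.2)   # stand-in for input-file staging
    with timed("processing"):
        time.sleep(0.5)   # stand-in for the reconstruction step
    with timed("stage-out"):
        time.sleep(0.1)   # stand-in for output retrieval
    phase, dt = bottleneck()
    print(f"bottleneck: {phase} ({dt:.2f}s)")
```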
Project req. (cont.)
Computing Fabric and general utilities for a globally managed Grid:
• Configuration management of computing facilities
• Automatic software installation and maintenance
• System, service and network monitoring, with global alarm notification and automatic recovery from failures (sketched below)
• Resource usage accounting
• Security of GRID resources and infrastructure usage
• Information service
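A minimal sketch of monitoring with alarm notification and automatic recovery, under stated assumptions: each service is probed periodically, a failed probe raises an alarm and triggers a recovery action. The service name, probe and recovery command are hypothetical placeholders.

```python
# Illustrative fabric monitor: probe services, notify on failure and run a
# recovery action. Service definitions below are invented for the example.

import subprocess
import time

SERVICES = {
    # name: (health-check command, recovery command), both hypothetical
    "gatekeeper": (["pgrep", "globus-gatekeeper"],
                   ["echo", "restarting gatekeeper"]),
}

def check(cmd) -> bool:
    return subprocess.run(cmd, capture_output=True).returncode == 0

def monitor_once():
    for name, (probe, recover) in SERVICES.items():
        if not check(probe):
            print(f"ALARM: {name} is down")   # global alarm notification
            subprocess.run(recover)           # automatic recovery attempt
        else:
            print(f"{name}: OK")

if __name__ == "__main__":
    while True:
        monitor_once()
        time.sleep(60)   # polling interval, arbitrary
```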