EU DataGrid WP4

EU DataGrid WP4 Large-Scale Cluster Computing Workshop FNAL, May 24 2001 Olof Bärring, CERN http://cern.ch/hep-proj-grid-fabric

Outline
• Background
• Architecture
• Short term prototypes (September 2001)
• GRID issues
• Conclusions

Background
• 3-year EU-funded project led by Fabrizio Gagliardi, CERN
• Started 1/1/2001
• 6 principal contractors: CERN, CNRS, ESA, INFN, FOM, PPARC
• 15 assistant contractors

Workpackages
• WP1: Workload Management
• WP2: Grid Data Management
• WP3: Grid Monitoring Services
• WP4: Fabric Management
• WP5: Mass Storage Management
• WP6: Integration Testbed – Production quality International Infrastructure
• WP7: Network Services
• WP8: High-Energy Physics Applications
• WP9: Earth Observation Science Applications
• WP10: Biology Science Applications
• WP11: Information Dissemination and Exploitation
• WP12: Project Management

WP4: Fabric Management
“To deliver a computing fabric comprised of all the necessary tools to manage a center providing grid services on clusters of thousands of nodes.”

WP4: Fabric Management
• ~14 FTEs (6 funded by the EU) for 3 years, split over 6 partners: CERN, FOM/NIKHEF, ZIB, Heidelberg Univ., PPARC, INFN
• The work is divided into 6 subtasks:
  • Configuration management
  • Automatic software installation & maintenance
  • Monitoring
  • Fault tolerance
  • Resource management
  • “Gridification”

[Architecture diagram. Labels: Grid Monitoring & Information Service, Grid Scheduler, Gridification, Monitoring, Resource Management, Configuration Management, Installation Management, Fault Tolerance, Cluster, Dependencies, GRID, Fabric]

Configuration management
• HLD = High Level Description
• LLD = Low Level Description
• MLD = Machine Level Description
[Diagram. Labels: GUI, CLI, CDB, HLD, LLD, Manipulations (read/write), Compilation (one-way), Fetching only, Client machine, Translation, Cached LLD, MLD]
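The slide gives only diagram labels, so the following is a minimal sketch of one possible reading: the HLD is compiled one-way into a per-node LLD, a client machine fetches its LLD read-only and keeps a cached copy, and the LLD is translated locally into an MLD. All function names, the dictionary layout, and the version strings are hypothetical and are not taken from the WP4 design.

```python
# Hypothetical sketch: names, data layout and values are illustration only.

def compile_hld(hld, node):
    """One-way compilation: flatten the high-level description (HLD)
    into the low-level description (LLD) of a single node."""
    lld = dict(hld["defaults"])
    for service in hld["nodes"][node]:
        lld.update(hld["services"][service])  # service settings override defaults
    return lld


class Client:
    """Client machine: fetches its LLD (read-only) and keeps a cached copy."""

    def __init__(self, node, fetch):
        self.node = node
        self.fetch = fetch        # callable returning the LLD; may raise OSError
        self.cached_lld = None

    def get_lld(self):
        try:
            self.cached_lld = self.fetch(self.node)
        except OSError:
            if self.cached_lld is None:
                raise             # nothing cached yet, cannot continue
        return self.cached_lld

    def translate(self):
        """Translate the LLD into a machine-level description (MLD),
        represented here as sorted 'key=value' lines."""
        return [f"{key}={value}" for key, value in sorted(self.get_lld().items())]


HLD = {
    "defaults": {"kernel": "2.2", "glibc": "2.1"},
    "services": {"A": {"glibc": "2.2", "batch_queue": "A"}},
    "nodes": {"node01": ["A"], "node02": []},
}

client = Client("node01", lambda node: compile_hld(HLD, node))
mld = client.translate()
```

In this sketch the cache is what lets a node keep working from its last known LLD when the configuration database cannot be reached.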

Installation management
• SRS = Software Repository
• NMS = Node Management
• BSS = Bootstrap Service
[Diagram. Labels: SRS, Software Maintainers, Configuration Management, Resource Management, Local Node, BSS, Fault Tolerance, NMS, Monitoring]

Scheduling of Actions
• Node autonomy approach (chaotic)
  • A high-level configuration change is propagated to all affected nodes
  • Monitoring senses the change of configuration
  • Fault tolerance fires an actuator to bring the node to its configured state (could be “re-install”)
• What happens to running jobs?
• Who tells the scheduler that the node is in maintenance?
• How are dependent actions handled (e.g. server intervention)?
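The node autonomy approach can be illustrated with a small sketch in which monitoring compares the configured state with the actual state and fault tolerance fires one actuator per difference, falling back to a re-install. Every name here (`detect_drift`, `reconcile`, `set_value`, `reinstall`) and the state dictionaries are hypothetical; this is not the WP4 implementation.

```python
# Hypothetical sketch of the node-autonomy approach; all names are assumptions.

def detect_drift(configured, actual):
    """Monitoring: keys whose actual value differs from the configured one."""
    return sorted(k for k, v in configured.items() if actual.get(k) != v)


def set_value(state, configured, key):
    """Targeted actuator: repair a single setting."""
    state[key] = configured[key]


def reinstall(state, configured, key):
    """Heavy-handed actuator: bring the whole node back to its configured state."""
    state.clear()
    state.update(configured)


def reconcile(configured, actual, actuators, fallback=reinstall):
    """Fault tolerance: fire an actuator for each drifted key."""
    fired = []
    for key in detect_drift(configured, actual):
        if actual.get(key) == configured[key]:
            continue  # already repaired by an earlier actuator (e.g. re-install)
        actuator = actuators.get(key, fallback)
        actuator(actual, configured, key)
        fired.append(actuator.__name__)
    return fired


configured = {"glibc": "2.2", "sshd": "on"}
node_a = {"glibc": "2.1", "sshd": "on"}
node_b = {"glibc": "2.1", "sshd": "off", "stray": "x"}

fired_a = reconcile(configured, node_a, {"glibc": set_value})
fired_b = reconcile(configured, node_b, {})  # no specific actuator: re-install
```

Nothing in this loop looks at running jobs or informs a scheduler, which is exactly the gap the open questions on the slide point to.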

Scheduling of Actions
• Decompose complex actions into simple “atomic” actions that can be serialized centrally
• Each configuration change would generate a simple action on the affected nodes
• Scripts bundle the actions together and execute them in a sensible order
• Use APIs to the different sub-components
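One way to picture central serialization is a script that holds a set of atomic actions with declared dependencies and runs them one at a time in dependency order. The sketch below uses Python's standard `graphlib.TopologicalSorter` (Python 3.9+) as a stand-in ordering mechanism; the action names and the `plan`/`run` helpers are hypothetical and are not part of WP4.

```python
from graphlib import TopologicalSorter

# Hypothetical sketch: "actions" maps a name to (callable, names it depends on).

def plan(actions):
    """Return the action names in an order that respects the dependencies."""
    graph = {name: set(deps) for name, (_, deps) in actions.items()}
    return list(TopologicalSorter(graph).static_order())


def run(actions):
    """Execute the atomic actions one at a time, in planned order."""
    log = []
    for name in plan(actions):
        actions[name][0]()
        log.append(name)
    return log


done = []
actions = {
    "upgrade_client_node01": (lambda: done.append("node01"), ["upgrade_server"]),
    "upgrade_client_node02": (lambda: done.append("node02"), ["upgrade_server"]),
    "upgrade_server": (lambda: done.append("server"), []),
}

log = run(actions)
```

Here the server intervention always runs before the client actions that depend on it, whatever order the actions were declared in.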

Change glibc on service A
1. Get list of nodes L belonging to service A
2. For all nodes (L1…Ln)
   • Disable Li in scheduler queue A
3. Wait for completion of 2
4. For all nodes (L1…Ln)
   • Submit admin job to node Li
5. Wait for completion of 4
6. For all nodes (L1…Ln)
   • Re-enable node Li in scheduler queue A
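The glibc procedure can be sketched as a single script driving a scheduler API. `FakeScheduler` and all of its methods are hypothetical stand-ins for the real sub-component APIs, and "wait for completion" of the disable phase is taken here to mean that running jobs have drained, which is an assumption.

```python
# Hypothetical sketch; FakeScheduler stands in for the real sub-component APIs.

class FakeScheduler:
    def __init__(self, queues):
        self.queues = queues  # queue name -> list of node names
        self.enabled = {n: True for nodes in queues.values() for n in nodes}
        self.log = []

    def nodes_in(self, queue):
        return list(self.queues[queue])

    def disable(self, node, queue):
        self.enabled[node] = False
        self.log.append(("disable", node))

    def wait_drained(self, node):
        # A real scheduler would block here until running jobs have finished.
        self.log.append(("drained", node))

    def submit_admin_job(self, node, job):
        self.log.append((job, node))

    def wait_admin_job(self, node):
        # A real scheduler would block here until the admin job has completed.
        self.log.append(("admin_done", node))

    def enable(self, node, queue):
        self.enabled[node] = True
        self.log.append(("enable", node))


def change_on_service(sched, queue, job):
    nodes = sched.nodes_in(queue)         # get the list of nodes of the service
    for node in nodes:                    # disable every node in the queue
        sched.disable(node, queue)
    for node in nodes:                    # wait for the disable phase to complete
        sched.wait_drained(node)
    for node in nodes:                    # submit the admin job to every node
        sched.submit_admin_job(node, job)
    for node in nodes:                    # wait for all admin jobs
        sched.wait_admin_job(node)
    for node in nodes:                    # re-enable every node
        sched.enable(node, queue)


sched = FakeScheduler({"A": ["n1", "n2"]})
change_on_service(sched, "A", "upgrade-glibc")
```

Each phase finishes on all nodes before the next one starts, so no node receives the admin job while it can still accept user jobs.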

For September 2001
• First prototype of the configuration management system
  • Low level (node) query interface
  • Caching
• “Interim” installation system
  • LCFG for upgrades and maintenance
  • SystemImager for initial system install and VACM console control for system preparation

GRID issues
• “Gridification”: protect the fabric against GRID jobs
• Local farms will still be used by local users
• Firewalls (channeling of job I/O, interactive jobs, MPI over WAN, …)
• Local authorization of grid users
• Job information

Conclusions
• DataGrid WP4 is not so much about the G-word. It is really about automating cluster management
• In the process of defining the global architecture. How do we best put the bits and pieces together?
• Ambitious delivery plans already for September
