EU DataGrid WP4 Large-Scale Cluster Computing Workshop FNAL, May 24 2001 Olof Bärring, CERN http://cern.ch/hep-proj-grid-fabric
Outline
• Background
• Architecture
• Short term prototypes (September 2001)
• GRID issues
• Conclusions
Background
• 3-year EU-funded project led by Fabrizio Gagliardi, CERN
• Started 1/1/2001
• 6 principal contractors: CERN, CNRS, ESA, INFN, FOM, PPARC
• 15 assistant contractors
Workpackages
• WP1: Workload Management
• WP2: Grid Data Management
• WP3: Grid Monitoring Services
• WP4: Fabric management
• WP5: Mass Storage Management
• WP6: Integration Testbed – Production quality International Infrastructure
• WP7: Network Services
• WP8: High-Energy Physics Applications
• WP9: Earth Observation Science Applications
• WP10: Biology Science Applications
• WP11: Information Dissemination and Exploitation
• WP12: Project Management
WP4: Fabric Management
“To deliver a computing fabric comprised of all the necessary tools to manage a center providing grid services on clusters of thousands of nodes.”
WP4: Fabric Management
• ~14 FTEs (6 funded by the EU) for 3 years, split over 6 partners: CERN, FOM/NIKHEF, ZIB, Heidelberg Univ., PPARC, INFN
• The work is divided into 6 subtasks:
  • Configuration management
  • Automatic software installation & maintenance
  • Monitoring
  • Fault tolerance
  • Resource management
  • “Gridification”
[Figure: WP4 architecture diagram. Labels: Grid Monitoring & Information Service; Grid Scheduler; Gridification; Monitoring; Resource Management; Configuration Management; Installation Management; Fault Tolerance; Cluster Dependencies; GRID; Fabric]
Configuration management
[Figure: configuration management data flow. Labels: GUI; CLI; CDB; HLD; LLD; Manipulations (read/write); Compilation (one-way); Fetching only; Client machine; Cached LLD; Translation; MLD]
• HLD = High Level Description
• LLD = Low Level Description
• MLD = Machine Level Description
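The flow named on this slide (one-way compilation of the HLD into per-node LLDs, fetch-only access from the client, a cached LLD, and translation into an MLD) could look roughly like the sketch below. Everything concrete in it is an assumption made for illustration: the dictionary layout, the `compile_hld` and `NodeClient` names, the node names, and the glibc version are hypothetical and not taken from the WP4 design.

```python
# Hypothetical HLD: service-level settings plus node-to-service membership.
HLD = {
    "services": {"batch": {"glibc": "2.2.2", "queue": "A"}},
    "nodes": {"node01": ["batch"], "node02": ["batch"]},
}


def compile_hld(hld):
    """One-way compilation: expand the HLD into one flat LLD per node."""
    lld = {}
    for node, services in hld["nodes"].items():
        profile = {}
        for service in services:
            profile.update(hld["services"][service])
        lld[node] = profile
    return lld


class NodeClient:
    """Client machine: fetch-only access to its LLD, with a local cache."""

    def __init__(self, node, server_lld):
        self.node = node
        self._server = server_lld  # stands in for the configuration server
        self._cache = None         # the "cached LLD" of the slide

    def fetch(self, server_up=True):
        """Refresh the cache if the server is reachable, else reuse it."""
        if server_up:
            self._cache = dict(self._server[self.node])
        if self._cache is None:
            raise RuntimeError("no cached LLD available")
        return self._cache

    def translate(self):
        """Translate the cached LLD into machine-level key=value lines."""
        cached = self.fetch(server_up=False)
        return [f"{key}={value}" for key, value in sorted(cached.items())]


lld = compile_hld(HLD)
client = NodeClient("node01", lld)
client.fetch()
print(client.translate())
```

The client never writes back to the server, which mirrors the "fetching only" arrow, and the cache lets a node translate its last known configuration while the server is unreachable.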
Installation management
[Figure: installation management diagram. Labels: SRS; Software Maintainers; Configuration Management; Resource Management; Local Node; BSS; Fault Tolerance; NMS; Monitoring]
• SRS = Software Repository
• NMS = Node Management
• BSS = Bootstrap Service
Scheduling of Actions
• Node autonomy approach (chaotic)
  • High level configuration change propagated to all affected nodes
  • Monitoring senses a change of configuration
  • Fault tolerance fires an actuator to bring the node to its configured state (could be “re-install”)
• What happens to running jobs?
• Who tells scheduler that node is in maintenance?
• How are dependent actions handled (e.g. server intervention)?
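As a rough illustration of the node autonomy approach, the sketch below compares a node's configured state with its actual state and fires an actuator for every difference. The function names, the state dictionaries, and the example values are hypothetical; the slides do not specify how monitoring or the actuators are implemented.

```python
def detect_drift(configured, actual):
    """Monitoring: return the configured settings the node does not match."""
    return {key: value for key, value in configured.items()
            if actual.get(key) != value}


def set_value(state, key, wanted):
    """A trivial actuator: bring a single setting to its configured value."""
    state[key] = wanted


def reconcile(configured, actual, actuators):
    """Fault tolerance: fire one actuator per drifted setting.

    Returns the keys that were acted on. Nothing here asks the scheduler
    whether jobs are running on the node.
    """
    fired = []
    for key, wanted in sorted(detect_drift(configured, actual).items()):
        actuator = actuators.get(key, set_value)
        actuator(actual, key, wanted)
        fired.append(key)
    return fired


# Hypothetical example values.
configured = {"glibc": "2.2.2", "kernel": "2.4"}
actual = {"glibc": "2.1.3", "kernel": "2.4"}
print(reconcile(configured, actual, {}))
```

Because each node acts on its own as soon as it sees a difference, the sketch has no place to answer the questions raised on the slide about running jobs, scheduler state, or dependent actions.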
Scheduling of Actions
• Decompose complex actions into simple “atomic” actions that can be serialized centrally
• Each configuration change would generate a simple action on the affected nodes
• Scripts bundle the actions together and execute them in a sensible order
• Use APIs to the different sub-components
Change glibc on service A
1. Get list of nodes L belonging to service A
2. For all nodes (L1…Ln): disable Li in scheduler queue A
3. Wait for completion of 2
4. For all nodes (L1…Ln): submit admin job to node Li
5. Wait for completion of 4
6. For all nodes (L1…Ln): re-enable node Li in scheduler queue A
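The six steps above can be sketched as a small centrally run script. The `Scheduler` class is a hypothetical in-memory stand-in, not the API of any real batch system, and the "wait" steps are modeled by synchronous calls that return only when the work is done; a real script would have to poll the scheduler and the admin jobs instead.

```python
class Scheduler:
    """Hypothetical stand-in for a batch scheduler's admin API."""

    def __init__(self, queues):
        self.queues = queues
        self.enabled = {node: True
                        for nodes in queues.values() for node in nodes}

    def nodes_in(self, queue):
        return list(self.queues[queue])

    def disable(self, node, queue):
        self.enabled[node] = False

    def enable(self, node, queue):
        self.enabled[node] = True


def change_on_service(scheduler, queue, admin_job):
    """Run the atomic actions in order and return a log of what was done."""
    log = []
    nodes = scheduler.nodes_in(queue)      # step 1: list nodes of the service
    for node in nodes:                     # step 2: disable every node
        scheduler.disable(node, queue)
        log.append(("disable", node))
    # step 3: the loop above has completed before we continue
    for node in nodes:                     # step 4: run the admin job
        admin_job(node)
        log.append(("admin", node))
    # step 5: the admin jobs have completed before we continue
    for node in nodes:                     # step 6: re-enable every node
        scheduler.enable(node, queue)
        log.append(("enable", node))
    return log


scheduler = Scheduler({"A": ["n1", "n2"]})
upgraded = []
for entry in change_on_service(scheduler, "A", upgraded.append):
    print(entry)
```

Keeping each phase in its own loop means no node is upgraded until every node of the service has been taken out of the queue, which is the kind of ordering the central serialization is meant to guarantee.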
For September 2001
• First prototype of the configuration management system
  • Low level (node) query interface
  • Caching
• “Interim” installation system
  • LCFG for upgrades and maintenance
  • SystemImager for initial system install and VACM console control for system preparation
GRID issues
• “Gridification”: protect the fabric against GRID jobs
• Local farms will still be used by local users
• Firewalls (channeling of job I/O, interactive jobs, MPI over WAN, …)
• Local authorization of grid users
• Job information
Conclusions
• DataGrid WP4 is not so much about the G-word. It is really about automating cluster management
• In the process of defining the global architecture. How do we best put the bits and pieces together?
• Ambitious delivery plans already for September