EU DataGrid WP4

Large-Scale Cluster Computing Workshop, FNAL, May 24 2001. Olof Bärring, CERN.


Presentation Transcript


  1. EU DataGrid WP4
     Large-Scale Cluster Computing Workshop, FNAL, May 24 2001
     Olof Bärring, CERN
     http://cern.ch/hep-proj-grid-fabric

  2. Outline
     • Background
     • Architecture
     • Short term prototypes (September 2001)
     • GRID issues
     • Conclusions

  3. Background
     • 3-year EU-funded project led by Fabrizio Gagliardi, CERN
     • Started 1/1/2001
     • 6 principal contractors: CERN, CNRS, ESA, INFN, FOM, PPARC
     • 15 assistant contractors

  4. Workpackages
     • WP1: Workload Management
     • WP2: Grid Data Management
     • WP3: Grid Monitoring Services
     • WP4: Fabric Management
     • WP5: Mass Storage Management
     • WP6: Integration Testbed – Production quality International Infrastructure
     • WP7: Network Services
     • WP8: High-Energy Physics Applications
     • WP9: Earth Observation Science Applications
     • WP10: Biology Science Applications
     • WP11: Information Dissemination and Exploitation
     • WP12: Project Management

  5. WP4: Fabric Management
     “To deliver a computing fabric comprised of all the necessary tools to manage a center providing grid services on clusters of thousands of nodes.”

  6. WP4: Fabric Management
     • ~14 FTEs (6 funded by the EU) for 3 years, split over 6 partners: CERN, FOM/NIKHEF, ZIB, Heidelberg Univ., PPARC, INFN
     • The work is divided into 6 subtasks:
       • Configuration management
       • Automatic software installation & maintenance
       • Monitoring
       • Fault tolerance
       • Resource management
       • “Gridification”

  7. [Architecture diagram. Labels: GRID, Fabric, Cluster, Dependencies, Grid Monitoring & Information Service, Grid Scheduler, Gridification, Monitoring, Resource Management, Configuration Management, Installation Management, Fault Tolerance]

  8. Configuration management
     • HLD = High Level Description
     • LLD = Low Level Description
     • MLD = Machine Level Description
     [Diagram labels: GUI, CLI, CDB, HLD, LLD, Manipulations (read/write), Compilation (one-way), Fetching only, Client machine, Translation, Cached LLD, MLD]
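One possible reading of this slide, as a minimal Python sketch: a high-level description is compiled one way into a flat low-level description per node, the client only fetches its LLD and keeps a cached copy, and a translation step turns the LLD into a machine-level form. The slide does not specify any data format or API, so every name here (`HLD`, `compile_hld`, `NodeClient`, `translate_to_mld`) and every key and value is an assumption made for illustration.

```python
from typing import Dict, Optional

Config = Dict[str, str]

# Hypothetical HLD: service-wide settings plus optional per-node overrides.
# All keys and values are made up for illustration.
HLD = {
    "services": {"batch": {"glibc": "2.2.2", "queue": "A"}},
    "nodes": {
        "node01": {"service": "batch"},
        "node02": {"service": "batch", "overrides": {"queue": "B"}},
    },
}


def compile_hld(hld: dict) -> Dict[str, Config]:
    """One-way compilation: expand the HLD into one flat LLD per node."""
    llds: Dict[str, Config] = {}
    for name, node in hld["nodes"].items():
        lld = dict(hld["services"][node["service"]])  # service defaults
        lld.update(node.get("overrides", {}))         # node-specific values win
        llds[name] = lld
    return llds


class NodeClient:
    """Client side: fetch-only access to the node's LLD, with a local cache."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.cached_lld: Optional[Config] = None

    def fetch(self, server: Optional[Dict[str, Config]]) -> Config:
        # server=None stands in for an unreachable configuration server.
        if server is not None:
            self.cached_lld = dict(server[self.name])
        if self.cached_lld is None:
            raise RuntimeError("no server and no cached LLD")
        return self.cached_lld


def translate_to_mld(lld: Config) -> str:
    """Translate an LLD into a machine-level form, here plain key=value lines."""
    return "\n".join(f"{key}={value}" for key, value in sorted(lld.items()))


llds = compile_hld(HLD)
client = NodeClient("node02")
client.fetch(llds)                 # normal fetch fills the cache
offline_lld = client.fetch(None)   # server unreachable: served from the cache
print(translate_to_mld(offline_lld))
```

The cache is what lets a node keep working from its last known LLD when the configuration server cannot be reached; a node that has never fetched anything has nothing to fall back on and raises an error in this sketch.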

  9. Installation management
     • SRS = Software Repository
     • NMS = Node Management
     • BSS = Bootstrap Service
     [Diagram labels: SRS, NMS, BSS, Software Maintainers, Configuration Management, Resource Management, Local Node, Fault Tolerance, Monitoring]

  10. Scheduling of Actions
     • Node autonomy approach (chaotic)
     • High-level configuration change propagated to all affected nodes
     • Monitoring senses a change of configuration
     • Fault tolerance fires an actuator to bring the node to its configured state (could be “re-install”)
     • What happens to running jobs?
     • Who tells the scheduler that the node is in maintenance?
     • How are dependent actions handled (e.g. server intervention)?
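The node-autonomy loop on this slide can be sketched roughly as follows: the node compares its observed state with its configured state, and an actuator is fired for each difference. This is only an illustration under stated assumptions; `drift`, `converge`, `install_glibc` and the version strings are hypothetical and do not come from the WP4 design.

```python
from typing import Callable, Dict, List

State = Dict[str, str]
Actuator = Callable[[State, str], None]


def drift(configured: State, observed: State) -> State:
    """Monitoring step: return the configured values the node does not match."""
    return {k: v for k, v in configured.items() if observed.get(k) != v}


def converge(configured: State, observed: State,
             actuators: Dict[str, Actuator]) -> List[str]:
    """Fault-tolerance step: fire one actuator per drifted key."""
    fired = []
    for key, wanted in sorted(drift(configured, observed).items()):
        actuators[key](observed, wanted)
        fired.append(key)
    return fired


def install_glibc(node: State, version: str) -> None:
    # Stand-in for a real package action (or a full re-install).
    node["glibc"] = version


configured = {"glibc": "2.2.2"}   # made-up target version
node_state = {"glibc": "2.1.3"}   # made-up current version
actuators = {"glibc": install_glibc}

first_pass = converge(configured, node_state, actuators)
second_pass = converge(configured, node_state, actuators)
print(first_pass, second_pass, node_state)
```

Note that nothing in this loop consults the batch scheduler or checks for running jobs before acting, which is exactly the gap the three questions on the slide point at.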

  11. Scheduling of Actions
     • Decompose complex actions into simple “atomic” actions that can be serialized centrally
     • Each configuration change would generate a simple action on the affected nodes
     • Scripts bundle the actions together and execute them in a sensible order
     • Use APIs to the different sub-components

  12. Change glibc on service A
     1. Get list of nodes L belonging to service A
     2. For all nodes (L1…Ln): disable Li in scheduler queue A
     3. Wait for completion of 2
     4. For all nodes (L1…Ln): submit admin job to node Li
     5. Wait for completion of 4
     6. For all nodes (L1…Ln): re-enable node Li in scheduler queue A
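The procedure on this slide can be sketched as a centrally serialized script. The scheduler API below (`FakeScheduler` with `nodes_in`, `disable`, `wait_drained`, `enable`) is a hypothetical stand-in, not a real batch system interface, and "wait for completion" of the disable step is assumed to mean that running jobs have drained. Re-enabling only the nodes whose admin job succeeded is also a choice made for this sketch; the slide re-enables all nodes.

```python
from typing import Callable, Dict, List, Set


class FakeScheduler:
    """Hypothetical stand-in for a batch scheduler API."""

    def __init__(self, queues: Dict[str, List[str]]) -> None:
        self.queues = queues
        self.disabled: Set[str] = set()
        self.log: List[str] = []

    def nodes_in(self, queue: str) -> List[str]:
        return list(self.queues[queue])

    def disable(self, queue: str, node: str) -> None:
        self.disabled.add(node)
        self.log.append(f"disable {node}")

    def wait_drained(self, queue: str, nodes: List[str]) -> None:
        # A real implementation would block until running jobs finish.
        self.log.append("drained")

    def enable(self, queue: str, node: str) -> None:
        self.disabled.discard(node)
        self.log.append(f"enable {node}")


def change_on_service(sched: FakeScheduler, queue: str,
                      admin_job: Callable[[str], bool]) -> List[str]:
    """Run admin_job on every node of a queue; return the nodes that failed."""
    nodes = sched.nodes_in(queue)                 # step 1: list the nodes
    for node in nodes:                            # step 2: stop new jobs
        sched.disable(queue, node)
    sched.wait_drained(queue, nodes)              # step 3: barrier
    # Steps 4 and 5: the jobs run synchronously here, so the loop is the barrier.
    results = {node: admin_job(node) for node in nodes}
    failed = []
    for node in nodes:                            # step 6: back into service
        if results[node]:
            sched.enable(queue, node)
        else:
            failed.append(node)                   # left disabled for follow-up
    return failed


sched = FakeScheduler({"A": ["n1", "n2", "n3"]})
failed = change_on_service(sched, "A", lambda node: node != "n2")
print(failed, sorted(sched.disabled))
```

Keeping a failed node out of the queue means no job lands on a half-upgraded machine, at the cost of needing a separate follow-up for those nodes.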

  13. For September 2001
     • First prototype of the configuration management system
     • Low-level (node) query interface
     • Caching
     • “Interim” installation system
     • LCFG for upgrades and maintenance
     • SystemImager for initial system install and VACM console control for system preparation

  14. GRID issues
     • “Gridification”: protect the fabric against GRID jobs
     • Local farms will still be used by local users
     • Firewalls (channeling of job I/O, interactive jobs, MPI over WAN, …)
     • Local authorization of grid users
     • Job information

  15. Conclusions
     • DataGrid WP4 is not so much about the G-word. It is really about automating cluster management
     • In the process of defining the global architecture. How do we best put the bits and pieces together?
     • Ambitious delivery plans already for September
