Welcome to Research Group Meeting on Reliability and Robustness in Grid Computing Systems

Chris Dabrowski (cdabrowski@nist.gov) and Geoff Fox (gcf@indiana.edu)
OGF21, Seattle, Washington, USA
October 17, 2007

Proposed Meeting Agenda
I. Introduction
II. Presentation/Review of Draft OGF Informational Document “Reliability in Grid Computing Systems”
• A work in progress
III. Discussion
IV. Close

Grid Reliability and Robustness RG
Purpose: Make recommendations and explore methods for improving reliability and robustness of standards-based grid systems.
Main Product: Produce OGF Informational Document that summarizes the state of work on Grid system reliability and identifies reliability and robustness issues/requirements for grid systems
• First draft in progress
• Contributions, review needed!
Additional Products:
• Facilitate collaborations between researchers on grid reliability
• Preliminary requirements for reliability measurement methods and tools
Web pages and reflector:
• Official: https://forge.gridforum.org/sf/projects/gridrel-rg
• Unofficial: http://gridreliability.nist.gov/
• List of resources (in progress)
• Reflector: gridrel-rg@ogf.org

OGF Informational Document
Title: Reliability in Grid Computing Systems
Purpose:
• Summarizes the state of work on Grid system reliability based on input from grid system practitioners/researchers
• Identifies issues that must be addressed/solved to ensure reliability and robustness in grid systems
• Provides a basis for identifying requirements for establishing and maintaining high levels of reliability in large-scale Grids
• Provides a basis for preliminary requirements for methods and tools to measure grid system reliability
• Focuses on current practices and research that provide insight on how WS and grid specifications may affect grid reliability
• Serves as a resource on reliability issues for OGF working groups developing specifications and for grid developers

Document basis: previous workshops on grid reliability
First workshop (GGF16, Athens, Greece)
• Site Assessment and Probabilistic Risk Analysis (PRA) of Grid Computing Facilities, by Joe Higgins and Robert Sewell of Sun Microsystems: methods for analyzing risks involved in deploying and configuring grid computing sites
• Reliable Messaging for Grids and Web Services, by Geoffrey Fox, Shrideep Pallickara, Damodar Yemme, Hasan Bulut and Sima Patel, Community Grids Lab, Indiana University: NaradaBrokering, a scalable, standards-based management architecture for fault-tolerant grids
• Providing Fault-tolerance for Parallel Programs on Grid (FT-MPICH), by Heon Y. Yeom of Distributed Computing Systems Laboratory, Seoul National University: fault-tolerant MPI (FT-MPICH) with coordinated checkpointing of interacting, parallel processes
• QoS-Aware Fault Tolerance in Grid Computing, by L. Valcarenghi, F. Cugini, F. Paolucci, and P. Castoldi, Scuola Superiore of Sant’Anna and CNIT, Pisa, Italy: fault tolerance through integrating replicated services and a QoS-capable network protocol layer
• A Program of Work for Understanding Emergent Behavior in Global Grid Systems, by Kevin Mills and Chris Dabrowski, of the U.S. NIST: developing methods for understanding and controlling complex systems behavior in grids

Document basis: previous workshops on grid reliability
Second workshop (OGF19, Chapel Hill, USA)
• Using a Large-Scale Survivability Architecture to Control Grids: A Status Report, by Zach Hill, Jonathan Rowanhill, Jim Basney, Glenn Wasson, John Knight, Anh Nguyen-Tuong, Andrew Grimshaw and Marty Humphrey, University of Virginia and NCSA/University of Illinois, Urbana-Champaign: reconfigurable Grid system architecture (Willow) for promoting survivability & dependability
• Platform Symphony Reliability, by Nick Werstiuk, Platform Computing: Grid architecture for promoting reliability & dependability through failure detection and failover
• Managing Grid and Web Services and their exchanged messages, by Harshawardhan Gadgil, Geoffrey Fox, Shrideep Pallickara, and Marlon Pierce, Indiana University: results showing performance, scalability and cost-effectiveness of NaradaBrokering architecture
• Reliability Assessment of Grid Software Systems Using Emergent Features, by Carol Song, Umut Topkara, Jungha Woo, and Sang Phill Park, Purdue University: method for identifying centralized software components likely to impact grid system reliability
• Reflections on Reliability Issues in OGSA, by Matti Hiltunen, AT&T Labs: summary of requirements for ensuring reliability and availability of OGSA-based services

Document Outline: Reliability in Grid Computing Systems • Introduction • Definitions • Current Practices on Grid System Reliability • Reliability of Grid Applications • Reliability Issues and Preliminary Requirements • Reliability Metrics and Preliminary Measurement Requirements • Summary • Resources

2. Definitions: • Source • Avizienis, A., Laprie, J., Randell, B., and Landwehr, C. “Basic Concepts and Taxonomy of Dependable and Secure Computing,” • Key definitions: • Reliability, availability, dependability, and fault tolerance • Grid resources • Decomposition of Grid Reliability concerns • Hardware and Software computing resources accessible via grid • Core infrastructure and resource management services • Allocate and manage grid resources • Example: discovery, negotiation, execution management, notification, security, etc. • Underlying connection and data transport facilities: grid network • Overall system perspective

3. Current practices/research on grid system reliability • Some main points: grid reliability methods • Still leverage redundancy • In deployed systems, are based on methods used in cluster computing • Must face scalability & administrative boundary issues • Areas covered • Fault tolerance of grid resources • Fault detection • Recovery methods for grid resources: checkpoint and recovery through process migration, grid resource replication, replication in data grids • Fault removal through testing and code certification • Reliability of supporting infrastructure and management services • Grid connection and transport reliability • Specifications, fault tolerant grid networks, reliable multicasting • Reliability from overall system perspective • Architectural perspective, complex systems perspective
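As a rough illustration of the checkpoint-and-recovery idea listed above, the following is a minimal single-process sketch in Python. It is not the coordinated checkpointing used by FT-MPICH, nor process migration; the function names, the JSON checkpoint format, and the `fail_at` crash switch are all assumptions made for the example.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state atomically so a crash never leaves a half-written file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(state, handle)
    os.replace(tmp_path, path)  # atomic replacement of the old checkpoint


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    if os.path.exists(path):
        with open(path) as handle:
            return json.load(handle)
    return {"next_index": 0, "total": 0}


def run_job(items, path, fail_at=None):
    """Sum items, checkpointing after each one.

    fail_at is a hypothetical switch that simulates a resource failure
    just before the item at that index is processed.
    """
    state = load_checkpoint(path)
    for index in range(state["next_index"], len(items)):
        if fail_at is not None and index == fail_at:
            raise RuntimeError("simulated resource failure")
        state["total"] += items[index]
        state["next_index"] = index + 1
        save_checkpoint(path, state)
    return state["total"]
```

Calling `run_job` again with the same checkpoint path after a failure resumes from the last saved index instead of restarting the job, which is the essential trade: extra checkpoint writes in exchange for less repeated work.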

4. Reliability of grid applications • Some main points: • Grid applications may/should ensure their reliability themselves (perspective of GCPR WG?) • Merging of grid user/client FT methods and provider FT methods? • What’s being done for FT in grid workflows? • Areas covered • Fault tolerance of remote application processes • Fault tolerance of grid resource compositions and workflows • Workflows composed with languages/tools for grid environments • Workflows composed with languages/tools for generic web service environments • Merging application and provider fault tolerance strategies

5. Reliability issues and preliminary requirements • Fault removal • Cost-benefits of testing grid components to determine which functions and kind of tests needed (component, integration, or interaction tests) • Fault Tolerance • Fault detection: need for scalability of methods, fault taxonomies • Recovery: tradeoffs between methods, understanding which methods to use and when, and coordinated checkpoint methods. • Special requirements for infrastructure and resource management services • Criticality of services leads to different tradeoff dynamics • Fault tolerance for grid networking and data transport • FT/control in overlays, combining overlays, dedicated networks, enhance specs for reliability(?), reliable multicasting? • Fault tolerance of grid applications • User vs provider FT, FT considerations for workflow languages?
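To make the fault-detection item concrete, here is a minimal sketch of a timeout-based heartbeat detector in Python. The class, its method names, and the resource names are hypothetical; the slides do not prescribe a detection method. Time is passed in explicitly so the logic can be exercised without real clocks.

```python
class HeartbeatDetector:
    """Suspect a resource as failed when its heartbeat is overdue."""

    def __init__(self, timeout):
        self.timeout = timeout  # seconds of silence tolerated
        self.last_seen = {}     # resource name -> time of last heartbeat

    def heartbeat(self, resource, now):
        """Record that `resource` reported in at time `now`."""
        self.last_seen[resource] = now

    def suspected(self, now):
        """Return sorted names of resources silent for longer than timeout."""
        return sorted(
            name
            for name, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


detector = HeartbeatDetector(timeout=5.0)
detector.heartbeat("node-a", now=0.0)
detector.heartbeat("node-b", now=0.0)
detector.heartbeat("node-a", now=4.0)
late = detector.suspected(now=6.0)  # only node-b has been silent > 5 s
```

A single monitor like this keeps state for every resource and scans all of it on each check, which hints at why the scalability of detection methods is raised as an open issue for large grids.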

6. Metrics and preliminary measurement requirements • Preliminary work on grid reliability metrics • OGF Network Measurement Working Group (2004), analysis of reliability of a grid by Xie and colleagues (2004). • Preliminary requirements for metrics, three classes: • OGF NM WG • Metrics to measure availability and reliability of individual grid resources (needed by grid users for evaluation purposes) • Metrics to measure reliability of entire grid or significant subsections (as above)
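The two resource-level metric classes above can be illustrated with standard textbook formulas. The sketch below is an assumption-laden example in Python, not the metrics proposed in the draft document: it estimates the availability of one resource from observed up and down periods, then combines per-resource values for replicated and chained components under the assumption that failures are independent.

```python
from math import prod


def availability(uptimes, downtimes):
    """Steady-state availability estimate: MTTF / (MTTF + MTTR)."""
    mttf = sum(uptimes) / len(uptimes)      # mean time to failure
    mttr = sum(downtimes) / len(downtimes)  # mean time to repair
    return mttf / (mttf + mttr)


def replicated_availability(availabilities):
    """Availability of a service that is up if any one replica is up.

    Assumes replicas fail independently, which real grids may violate.
    """
    return 1.0 - prod(1.0 - a for a in availabilities)


def series_availability(availabilities):
    """Availability of a chain in which every component must be up.

    Also assumes independent failures.
    """
    return prod(availabilities)


# Hypothetical observations, in hours, for a single grid resource.
node = availability(uptimes=[90.0, 110.0], downtimes=[10.0, 10.0])
```

Under these assumptions, replication raises availability above that of any single replica, while a chain of required services is never more available than its weakest member.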

7. Summary • TBD
8. Resources • Over 180 cited • Organized topically in an appendix • Additional sources to be worked in

Presentation Summary • Document work in progress • Please review and comment! • Please contribute!
