Clustered Systems for Massive Parallelism

N. Xiong, Georgia State University

Review and Introduction

Chapter 04 Main Contents • Design Objectives of Clusters and MPPs • Cluster and MPP System Architectures • Design Principles of Clustered Systems • Multiple Job Scheduling and Management • Virtual Clustering and Resource Provisioning • Homework Problems

Design Objectives of Clustered Systems • Scalability • Packaging • Control • Homogeneity • Security

Design Objectives of Clustered Systems

Fundamental Cluster Design Issues • Scalable Performance • Single System Image • Availability Support • Cluster Job Management • Internode Communication • Fault Tolerance and Recovery • Growth of Servers in HPC and HTC Systems

Resource-Sharing in Cluster Systems

An Idealized Cluster Architecture • Conventional databases and OLTP monitors offer users a desktop environment • Supports parallel programming based on standard languages and communication libraries • A user-interface subsystem combines the advantages of the Web interface and the Windows GUI

Node Architectures and System Packaging • Two types of cluster nodes • compute nodes • service nodes

Compute Node Examples

Modular Packaging of IBM BlueGene/L System

Cluster System Interconnects

High-Bandwidth Interconnects

An InfiniBand Cluster Interconnection Network

High-bandwidth Interconnects in Top-500 Systems

Hardware, Software, and Middleware Support

Design Principles of Clusters • Single-System-Image (SSI) Features • Single System • Single Control • Symmetry • Location Transparency

Design Principles of Clusters • Single-System-Image Layers • Application Software Layer • Hardware or Kernel Layer • Middleware Layer

Design Principles of Clusters • Single-System-Image Composition • Single Entry Point • Single File Hierarchy • Single I/O, Networking, and Memory Space • Other Desired SSI Features

Single Entry Point

Single File Hierarchy • It is persistent. • It is fault tolerant to some degree. • Network File System (NFS) and Andrew File System (AFS).

Single File Hierarchy

Single I/O, Networking, and Memory Space • Single Input/Output • Single Networking • Single Point of Control • Single Memory Space

Single I/O, Networking, and Memory Space

An Example

Other Desired SSI Features • Single Job Management System • Single User Interface • Single Process Space

Middleware Support for SSI Clustering

High Availability Through Redundancy • Reliability • Availability • Serviceability

Availability and Failure Rate
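The slide title above does not spell out a formula. A commonly used definition is steady-state availability = MTTF / (MTTF + MTTR), where MTTF is the mean time to failure and MTTR is the mean time to repair. The sketch below applies that definition; the function names and the 8,760-hour year are illustrative assumptions, not taken from the slides.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTTF / (MTTF + MTTR)."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(avail: float, hours_per_year: float = 8760.0) -> float:
    """Expected downtime per year implied by a given availability."""
    if not 0.0 <= avail <= 1.0:
        raise ValueError("availability must lie in [0, 1]")
    return (1.0 - avail) * hours_per_year


# Hypothetical node: fails on average every 1000 h, takes 10 h to repair.
a = availability(1000.0, 10.0)
print(f"availability = {a:.4f}")
print(f"downtime     = {downtime_hours_per_year(a):.1f} h/year")
```

Shortening MTTR raises availability just as lengthening MTTF does, which is why fast failover and repair matter as much as reliable hardware.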

Availability Values of Several Representative Systems

Redundancy Techniques

Fault-Tolerant Cluster Configurations • Hot Standby • Mutual Takeover • Fault-Tolerance

Recovery Schemes • Backward recovery • Forward recovery: in real-time systems
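Backward recovery can be illustrated with a toy loop: the program periodically saves its state, and when a fault is detected it rolls back to the last saved state and re-executes from there. This is a minimal sketch under stated assumptions. The `SimulatedFault` exception, the in-memory checkpoint, and the "fault occurs once at a given step" model are all hypothetical stand-ins for real fault detection and stable storage.

```python
import copy


class SimulatedFault(Exception):
    """Hypothetical transient fault injected for the demonstration."""


def run_with_backward_recovery(n_steps, checkpoint_every, fail_at):
    """Sum 1..n_steps, rolling back to the last checkpoint on each fault.

    fail_at: set of step numbers at which a fault is injected once.
    Returns (final_total, number_of_rollbacks).
    """
    state = {"step": 0, "total": 0}
    checkpoint = copy.deepcopy(state)      # initial state is the first checkpoint
    pending_faults = set(fail_at)
    rollbacks = 0

    while state["step"] < n_steps:
        try:
            next_step = state["step"] + 1
            if next_step in pending_faults:
                pending_faults.discard(next_step)   # transient: fails only once
                raise SimulatedFault(f"fault at step {next_step}")
            state["total"] += next_step
            state["step"] = next_step
            if state["step"] % checkpoint_every == 0:
                checkpoint = copy.deepcopy(state)   # save a consistent state
        except SimulatedFault:
            state = copy.deepcopy(checkpoint)       # roll back and redo lost work
            rollbacks += 1

    return state["total"], rollbacks


print(run_with_backward_recovery(10, 3, {5}))
```

The work done between the last checkpoint and the fault is repeated, which is the cost backward recovery pays for simplicity.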

Checkpointing and Recovery Techniques • Kernel, Library, and Application Levels • Checkpoint Overheads • Choosing an Optimal Checkpoint Interval
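The slides do not state which formula the lecture uses for the optimal checkpoint interval. One well-known first-order estimate is Young's approximation, T ≈ sqrt(2 · C · MTBF), where C is the time to write one checkpoint and MTBF is the mean time between failures. The sketch below uses that approximation together with a simple cost model (checkpoint overhead C/T plus expected rework T/(2 · MTBF)); both are assumptions introduced for illustration.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order estimate of the optimal checkpoint interval."""
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def wasted_fraction(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """Approximate fraction of time lost to checkpointing plus rework.

    checkpoint_cost / interval : time spent writing checkpoints
    interval / (2 * mtbf)      : expected recomputation after a failure
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


# Hypothetical values: 2-minute checkpoints, 100-minute MTBF.
t_opt = young_interval(2.0, 100.0)
for t in (10.0, t_opt, 40.0):
    print(f"interval {t:5.1f} min -> wasted {wasted_fraction(t, 2.0, 100.0):.3f}")
```

Checkpointing too often wastes time on saving state; checkpointing too rarely wastes time on recomputation after a failure. The estimate balances the two terms.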

Checkpointing Parallel Programs

Cluster Job Scheduling and Management • Cluster Job Management Issues • A user server • A job scheduler • A resource manager

Cluster Job Types • Serial jobs • Parallel jobs • Interactive jobs • Batch jobs • Foreign jobs

Multi-Job Scheduling Schemes

Share Cluster Nodes • Dedicated Mode • Space Sharing • Time Sharing
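Space sharing means several jobs run at the same time on disjoint sets of nodes. A minimal sketch of one possible allocation policy follows. The function name, the job tuples, and the first-fit policy (jobs are considered in submission order, and a job that does not fit waits while later, smaller jobs may still start) are illustrative assumptions, not the scheme from the slides.

```python
def space_share(total_nodes, jobs):
    """Assign disjoint node sets to jobs, in submission order.

    jobs: list of (job_name, nodes_needed) tuples.
    Returns (running, waiting): running maps job name -> list of node ids,
    waiting lists the jobs that did not fit in the remaining free nodes.
    """
    free = list(range(total_nodes))
    running, waiting = {}, []
    for name, need in jobs:
        if 0 < need <= len(free):
            running[name] = free[:need]    # these nodes belong to this job only
            free = free[need:]
        else:
            waiting.append(name)
    return running, waiting


running, waiting = space_share(8, [("A", 4), ("B", 6), ("C", 3)])
print(running)
print(waiting)
```

In dedicated mode the same function would be called with a single job owning all nodes; time sharing would instead let several jobs take turns on the same node.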

Migration Scheme Issues • Node Availability • Migration Overhead • Recruitment Threshold: the amount of time a workstation stays unused before the cluster considers it an idle node
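The recruitment threshold can be expressed as a simple test on each workstation's last observed activity. The sketch below is illustrative only: the function name, the dictionary of timestamps, and the use of seconds as the time unit are assumptions, not part of the slides.

```python
def idle_nodes(last_activity, now, recruitment_threshold):
    """Return workstations unused for at least recruitment_threshold seconds.

    last_activity: dict mapping node name -> time of last user activity (s).
    now: current time in the same units.
    """
    return sorted(
        name
        for name, last_seen in last_activity.items()
        if now - last_seen >= recruitment_threshold
    )


# Hypothetical timestamps, in seconds since some common reference point.
activity = {"ws1": 100.0, "ws2": 950.0, "ws3": 400.0}
print(idle_nodes(activity, now=1000.0, recruitment_threshold=600.0))
```

A low threshold recruits nodes quickly but risks migrating work away again when the owner returns; a high threshold is safer but leaves idle cycles unused.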

Virtual Clustering and Resource Provisioning

Five Virtual Cluster Research Projects

Live VM Migration and Cluster Management

Effect by Live Migration

Dynamic Virtual Resource Provisioning

Autonomic Adaptation of Virtual Environments

Some References and Further Reading

Homework Problems
