Virtual Streams: Performance-Robust Parallel I/O
Z. Morley Mao, Noah Treuhaft
CS258, 5/17/99, Professor Culler
Introduction
• Clusters exhibit performance heterogeneity: static and dynamic, arising from both hardware and software
• Consistent peak performance demands adaptive software: building performance-robust parallel software means keeping heterogeneity in mind
• This work explores the adaptivity appropriate for I/O-bound parallel programs, and how to provide that adaptivity
Heterogeneity demands adaptivity
[Diagram: processes reading from disks across cluster nodes]
• Physical I/O streams are simple to build and use
• But their performance is highly variable: different drive models, bad blocks, multizone behavior, file layout, competing programs, host bottlenecks
• I/O-bound parallel programs run at the rate of the slowest disk
Virtual Streams
[Diagram: processes accessing disks through a Virtual Streams layer]
• Performance-robust programs want virtual streams that...
  • eliminate dependence on individual disk behavior
  • continually equalize throughput delivered to processes
Graduated Declustering (GD): a Virtual Streams implementation
[Diagram: client processes A and B use the GD client library to reach GD servers holding replicas of A and B]
• Data is replicated (mirrored) for availability
• Use the replicas to provide performance availability, too
• A fast network makes remote disk access comparable to local access
• Distributed algorithm for adaptivity:
  • each client provides information about its progress
  • each server reacts by scheduling requests to even out progress
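The distributed scheduling loop above can be sketched as a small simulation. This is an illustrative model, not the original implementation: each server mirrors data for two clients and, on every scheduling step, serves whichever of its clients has made the least progress so far.

```python
# Hypothetical sketch of progress-based GD scheduling (names are
# illustrative, not from the original implementation).

def schedule_round(progress, server_clients, server_rates):
    """One round: every server sends one block's worth of data to its laggard.

    progress       -- blocks delivered so far, per client id
    server_clients -- for each server, the pair of client ids it can serve
    server_rates   -- blocks each server can deliver this round
    """
    for server, clients in enumerate(server_clients):
        # Serve the client that is furthest behind among this server's replicas.
        laggard = min(clients, key=lambda c: progress[c])
        progress[laggard] += server_rates[server]
    return progress

# Four clients, four servers; server i mirrors data for clients i and (i+1)%4.
server_clients = [(i, (i + 1) % 4) for i in range(4)]
progress = [0, 0, 0, 0]
rates = [1, 1, 1, 1]            # all disks healthy: one block per round
for _ in range(8):
    schedule_round(progress, server_clients, rates)
# With no perturbation, every client advances in lockstep.
```

Lowering one entry of `rates` models a perturbed disk: the local "serve the laggard" rule then shifts requests toward the healthy replicas, which is exactly the equalizing behavior the next slide illustrates.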
GD in action
[Diagram: four clients reading from four mirrored servers. Before perturbation, every server delivers bandwidth B and every client receives B. After Server1 is perturbed down to B/2, the servers redistribute request service so that every client receives 7B/8.]
• Local decisions yield global behavior
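The 7B/8 figure on this slide is just the aggregate-bandwidth arithmetic, which a few lines make explicit:

```python
# Aggregate-bandwidth arithmetic behind the slide: one of four mirrored
# servers drops to B/2, and GD spreads the loss evenly over the clients.
B = 1.0
server_bw = [B, B / 2, B, B]    # Server1 perturbed
aggregate = sum(server_bw)      # 3.5 * B total across the cluster
per_client = aggregate / 4      # equalized share per client
```

So instead of one unlucky client running at B/2 and the rest at B, all four settle at 7B/8 of the unperturbed rate.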
Evaluation of the original (progress-based) GD implementation
[Graph: bandwidth measurements showing seek overhead]
• Seek overhead arises because every client reads from all of its replicas
Deficiency of the original GD implementation: seek overhead
• Under sequential data access, seeks occur even when there is no perturbation
• Seeks become more significant as disk transfer rates increase
• We need a new algorithm that...
  • reads mostly from a single disk when there is no perturbation
  • dynamically adjusts to perturbation when necessary
  • achieves both performance adaptivity and minimal overhead
Proposed solution: response-rate-based GD
• The number of requests a client sends to a server depends on that server's response rate
• Servers use request-queue lengths to make scheduling decisions
• Uses implicit information and is "historyless": no bandwidth information is transmitted between server and client
• Advantage: each client can have a primary server
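The slide's idea can be sketched as a request-window simulation. This is a hypothetical model of the policy, not the actual implementation: a client keeps a fixed window of outstanding requests and sends each new request to whichever server just responded, so faster servers automatically receive more requests with no bandwidth numbers ever exchanged.

```python
# Illustrative sketch of historyless, response-rate-based request flow.
import heapq

def simulate(server_delay, blocks, window=4):
    """Return how many blocks each server delivers to one client."""
    clock = 0.0
    delivered = [0] * len(server_delay)
    events = []  # heap of (completion_time, server)
    # Prime the window round-robin across the replica servers.
    for i in range(window):
        s = i % len(server_delay)
        heapq.heappush(events, (clock + server_delay[s], s))
    for _ in range(blocks):
        clock, s = heapq.heappop(events)
        delivered[s] += 1
        # Historyless policy: reissue the next request to the server
        # that just responded, whatever its reason for being fast.
        heapq.heappush(events, (clock + server_delay[s], s))
    return delivered

shares = simulate([1.0, 2.0], blocks=300)
# The server with half the latency ends up delivering about twice the blocks.
```

Because request routing follows responses rather than reported bandwidth, the scheme is "historyless" in the sense of the next slide: no control traffic, at the cost of slower convergence.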
Evaluation of response-rate-based GD
[Graph: bandwidth vs. number of perturbed disk nodes, showing reduced seek overhead]
Historyless vs. history-based adaptivity
• History-based (progress-based):
  • adjustment to perturbation occurs gradually over time
  • close to perfect knowledge, if the information is not outdated
  • extra overhead in sending control information
• Historyless (response-rate-based):
  • primary-server designation is possible
  • increases sensitivity to real perturbation by creating "artificial" perturbation
  • accounts for the varying performance of data consumers
  • takes longer to converge
Stability and Convergence
• How long does the system take to converge?
  • linear in the number of nodes
  • depends on the last occurrence of perturbation
  • influenced by the style of communication (implicit vs. explicit)
Server request handoff
• When a server finishes all of its requests, it contacts other servers holding the same replicas and helps serve their clients (work stealing)
• Handoff keeps all disks busy whenever possible
• Design decision: how many requests to hand off? It depends on the bandwidth history of both servers and on the size of the request queue
• Benefit vs. cost tradeoff
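A minimal work-stealing sketch makes the handoff mechanics concrete. This is a simplification: the slide says the real policy weighs the bandwidth history of both servers and the queue size, whereas the version below just takes half of the busy peer's pending queue.

```python
# Simplified server request handoff (steal-half policy; the actual
# policy also considered bandwidth history, per the slides).

def handoff(idle_queue, busy_queue):
    """Move half of the busy server's pending requests to the idle one."""
    steal = len(busy_queue) // 2
    if steal:
        # Take from the tail so the busy server keeps its oldest requests.
        idle_queue.extend(busy_queue[-steal:])
        del busy_queue[-steal:]
    return idle_queue, busy_queue

idle, busy = handoff([], list(range(10)))
```

Steal-half is the classic work-stealing compromise: it keeps both disks busy without ping-ponging requests, since each handoff at most halves the imbalance.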
Writes
• Identical to reads, except...
  • writes create incomplete replicas with "holes"
  • track the holes in metadata
  • afterward, do "hole-filling", both for availability and for performance robustness
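The hole-tracking idea can be sketched with a toy replica structure. The slides only say that holes are recorded in metadata and filled afterward, so the representation below (a per-replica set of missing block numbers) is an assumption for illustration:

```python
# Hypothetical hole-tracking sketch: each replica records which blocks it
# missed during a write burst; a background hole-filler later copies the
# missing blocks from a complete peer replica.

class Replica:
    def __init__(self, nblocks):
        self.data = [None] * nblocks
        self.holes = set(range(nblocks))   # blocks not yet written here

    def write(self, block, payload):
        self.data[block] = payload
        self.holes.discard(block)

    def fill_from(self, peer):
        """Hole-filling: copy every missing block from a complete replica."""
        for block in sorted(self.holes):
            self.write(block, peer.data[block])

primary, mirror = Replica(4), Replica(4)
for blk in range(4):
    primary.write(blk, f"data{blk}")
mirror.write(1, "data1")        # the mirror missed blocks 0, 2, and 3
mirror.fill_from(primary)       # afterward, both replicas are complete
```

Once hole-filling completes, the mirror is again a full replica, restoring both availability and the performance headroom that GD's read path relies on.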
Conclusions
• What did we achieve?
  • a new, response-rate-based load-balancing algorithm
  • delivers equal bandwidth to parallel-program processes in the face of performance heterogeneity
  • demonstrated the stability of the system
  • reduced seek overhead
  • server request handoff
  • writes
  • a useful abstraction for streaming I/O in clusters
Future Work
• Hot-file replication
• Regaining peak bandwidth after perturbation ceases
• Achieving orderly replies
• A multiple-disk abstraction