Explore the evolution of high-performance clusters, their key challenges, breakthrough technologies, and future directions in computer science. Learn why clusters are essential for capacity, scalability, and cost-effectiveness in internet services.
High-Performance Clusters, Part 1: Performance
David E. Culler
Computer Science Division, U.C. Berkeley
PODC/SPAA Tutorial, Sunday, June 28, 1998
Clusters have Arrived • … the SPAA / PODC testbed going forward
Berkeley NOW • http://now.cs.berkeley.edu/
NOW’s Commercial Version • 240 processors, Active Messages, Myrinet, ...
Berkeley Massive Storage Cluster • serving Fine Art at www.thinker.org/imagebase/ • or try
Commercial Scene
What’s a Cluster? • Collection of independent computer systems working together as if a single system. • Coupled through a scalable, high bandwidth, low latency interconnect.
Outline for Part 1 • Why Clusters NOW? • What is the Key Challenge? • How is it overcome? • How much performance? • Where is it going?
Why Clusters? • Capacity • Availability • Scalability • Cost-effectiveness
Traditional Availability Clusters • VAX Clusters => IBM Sysplex => Wolf Pack [Figure: clients connected to Server A and Server B, which share Disk array A and Disk array B over an interconnect]
Why HP Clusters NOW? • Time to market => performance • Technology • internet services [Figure: node performance vs. engineering lag time in large systems]
Technology Breakthrough • Killer micro => Killer switch • single chip building block for scalable networks • high bandwidth • low latency • very reliable
Opportunity: Rethink System Design • Remote memory and processor are closer than local disks! • Networking Stacks ? • Virtual Memory ? • File system design ? • It all looks like parallel programming • Huge demand for scalable, available, dedicated internet servers • big I/O, big compute
Example: Traditional File System • Server resources at a premium • Client resources poorly utilized • The server is the bottleneck: expensive, complex, non-scalable, a single point of failure [Figure: clients with local private file caches connect over a fast channel (HIPPI) to a central server holding the global shared file cache and RAID disk storage]
Truly Distributed File System • VM: page to remote memory • Cluster caching: each node keeps a local file cache, linked by a scalable, low-latency communication network • Network RAID striping • G = Node Comm BW / Disk BW [Figure: processors, each with its own file cache, connected by the network]
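To make G concrete, here is a hedged back-of-the-envelope calculation. The 160 MB/s link rate comes from the Myrinet NIC figure later in the deck; the per-disk bandwidth is an assumption typical of the era, not a number from the talk.

```c
/* Back-of-the-envelope for G = Node Comm BW / Disk BW.
 * link_bw is the Myrinet figure quoted later in this deck;
 * disk_bw is an assumed per-disk rate, not a measured number. */
#include <stdio.h>

int main(void) {
    double link_bw = 160.0;   /* MB/s into one node over Myrinet */
    double disk_bw = 10.0;    /* MB/s per disk (assumed) */
    double G = link_bw / disk_bw;

    /* With G around 16, a single node can absorb data striped across
     * roughly 16 remote disks before its network link saturates. */
    printf("G = %.0f\n", G);
    return 0;
}
```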
Fast Communication Challenge • Fast processors and fast networks • The time is spent in crossing between them [Figure: "killer platforms" (ns-scale processors) reach the "killer switch" through comm. software and network interface hardware, where costs climb from µs to ms]
Opening: Intelligent Network Interfaces • Dedicated processing power and storage embedded in the network interface • An I/O card today • Tomorrow on chip? [Figure: Sun Ultra 170 nodes (processor, caches, memory) attach a Myricom NIC via the I/O bus (S-Bus, 50 MB/s) to the Myricom network (160 MB/s)]
Our Attack: Active Messages • Request / reply small active messages (RPC) • Bulk transfer (store & get) • Highly optimized communication layer on a range of HW [Figure: a request message runs a request handler at the destination, and the reply runs a reply handler back at the source]
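As a rough illustration of the request/reply idiom, here is a minimal, self-contained C sketch in the spirit of Active Messages. The names (am_deliver, handler_table, ...) and the single-process "network" are stand-ins, not the real AM-II interface.

```c
/* Minimal sketch of the Active Messages idea: each message names a
 * handler that runs on arrival and may issue a reply.  Names here are
 * illustrative, not the actual AM-II API. */
#include <stdio.h>

typedef void (*am_handler_t)(int src_node, int arg);

static void request_handler(int src_node, int arg);
static void reply_handler(int src_node, int arg);

/* handler table indexed by the handler id carried in the message */
static am_handler_t handler_table[] = { request_handler, reply_handler };

/* "network": deliver a message by dispatching to the named handler */
static void am_deliver(int dst_node, int src_node, int handler_id, int arg) {
    (void)dst_node;                      /* single-process stand-in */
    handler_table[handler_id](src_node, arg);
}

static void request_handler(int src_node, int arg) {
    int result = arg * arg;              /* small computation at the remote node */
    am_deliver(src_node, 1, 1, result);  /* immediately issue the reply */
}

static void reply_handler(int src_node, int arg) {
    printf("reply from node %d: %d\n", src_node, arg);
}

int main(void) {
    am_deliver(1, 0, 0, 7);              /* node 0 sends a request to node 1 */
    return 0;
}
```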
NOW System Architecture [Figure, layered: parallel apps and large sequential apps on top; programming interfaces (Sockets, Split-C, MPI, HPF, vSM); Global Layer UNIX providing process migration, distributed files, network RAM, and resource management; UNIX workstations, each with comm. SW and network interface HW; all connected by a fast commercial switch (Myrinet)]
Cluster Communication Performance
LogP • L: latency in sending a (small) message between modules • o: overhead felt by the processor on sending or receiving a msg • g: gap between successive sends or receives (1/rate) • P: processors • Round trip time: 2 x (2o + L) [Figure: P processor–memory pairs attached to a limited-volume interconnection network (at most L/g messages in flight per processor)]
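A small worked example may help. The parameter values below are placeholders rather than measured numbers from the talk; only the round-trip formula comes directly from the slide, and the burst formula is the standard LogP pipelining argument.

```c
/* Back-of-the-envelope LogP calculator with assumed parameter values. */
#include <stdio.h>

int main(void) {
    double L = 5.0;   /* latency (us), assumed */
    double o = 3.0;   /* send/receive overhead (us), assumed */
    double g = 6.0;   /* gap between messages (us), assumed */
    int    n = 100;   /* number of back-to-back small messages */

    /* one-way small-message time and the round trip from the slide */
    double one_way    = o + L + o;
    double round_trip = 2.0 * (2.0 * o + L);

    /* a pipelined burst of n messages is limited by max(o, g) per message */
    double per_msg = (o > g) ? o : g;
    double burst   = o + (n - 1) * per_msg + L + o;

    printf("one-way    = %.1f us\n", one_way);
    printf("round trip = %.1f us\n", round_trip);
    printf("burst(%d) = %.1f us\n", n, burst);
    return 0;
}
```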
LogP Comparison • Direct, user-level network access • Generic AM, FM (UIUC), PM (RWC), U-Net (Cornell), … [Chart: latency and 1/BW for each system]
MPI over AM: ping-pong bandwidth
MPI over AM: start-up
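The plots themselves are not reproduced in this text, but the measurement behind them is the usual MPI ping-pong. A minimal sketch follows; the message size and iteration count are arbitrary choices here, and it should be run with exactly 2 ranks.

```c
/* Minimal MPI ping-pong of the kind behind start-up (latency) and
 * bandwidth plots; small messages probe the start-up cost. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, iters = 1000, bytes = 8;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(bytes);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    if (rank == 0)   /* half the average round trip = one-way time */
        printf("one-way time: %.2f us\n", 1e6 * (t1 - t0) / (2.0 * iters));
    free(buf);
    MPI_Finalize();
    return 0;
}
```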
Cluster Application Performance: NAS Parallel Benchmarks
NPB2: NOW vs SP2
NPB2: NOW vs SGI Origin
Where the Time Goes: LU
Where the Time Goes: SP
LU Working Set • 4-processor • traditional curve for small caches • Sharp knee >256KB (1 MB total)
LU Working Set (CPS scaling) • Knee at global cache > 1MB • machine experiences drop in miss rate at specific size
Application Sensitivity to Communication Performance
Adjusting L, o, and g (and G) in situ • Martin et al., ISCA 97 • o: stall the Ultra on msg write / msg read • L: defer marking msg as valid until Rx + L • g: delay the LANai after msg injection (after each fragment for bulk transfers) [Figure: two host workstations running the AM lib over LANai NICs connected by Myrinet]
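For the overhead knob specifically, here is a hedged sketch of the idea: extra send/receive overhead is emulated by spinning the host CPU after each message operation, while the LANai-side delays emulate added L and g. The function names, the cycle constant, and the dummy send are illustrative; Martin et al. calibrate the actual delay loops against the hardware.

```c
/* Illustrative emulation of added overhead: burn host cycles after a send,
 * so the processor cannot overlap that time with useful work. */
#include <stdio.h>
#include <stdint.h>

#define CYCLES_PER_US 167ULL   /* assumed 167 MHz host clock; calibrated in practice */

static void stall_us(unsigned us) {
    volatile uint64_t spin = (uint64_t)us * CYCLES_PER_US;
    while (spin--) ;           /* busy-wait on the host CPU */
}

static void dummy_send(const char *msg) {   /* stands in for the real AM send */
    printf("sent: %s\n", msg);
}

/* effective overhead becomes o' = o + delta_o_us */
static void send_with_added_overhead(const char *msg, unsigned delta_o_us) {
    dummy_send(msg);
    stall_us(delta_o_us);
}

int main(void) {
    send_with_added_overhead("probe", 10);   /* add 10 us of artificial overhead */
    return 0;
}
```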
Calibration
Split-C Applications

Program       Description            Input                      P=16    P=32   Msg Interval (us)   Msg Type
Radix         Integer radix sort     16M 32-bit keys            13.7    7.8    6.1                 msg
EM3D (write)  Electro-magnetic       80K nodes, 40% rmt         88.6    38.0   8.0                 write
EM3D (read)   Electro-magnetic       80K nodes, 40% rmt         230.0   114.0  13.8                read
Sample        Integer sample sort    32M 32-bit keys            24.7    13.2   13.0                msg
Barnes        Hierarchical N-Body    1 million bodies           77.9    43.2   52.8                cached read
P-Ray         Ray Tracer             1 million pixel image      23.5    17.9   156.2               cached read
MurPHI        Protocol Verification  SCI protocol, 2 proc       67.7    35.3   183.5               Bulk
Connect       Connected Comp         4M nodes, 2-D mesh, 30%    2.3     1.2    212.6               BSP
NOW-sort      Disk-to-Disk Sort      32M 100-byte records       127.2   56.9   817.4               I/O
Radb          Bulk version Radix     16M 32-bit keys            7.0     3.7    852.7               Bulk
Sensitivity to Overhead
Comparative Impact
Sensitivity to bulk BW (1/G)
Cluster Communication Performance • Overhead, Overhead, Overhead • hypersensitive due to increased serialization • Sensitivity to gap reflects bursty communication • Surprisingly latency tolerant • Plenty of room for overhead improvement - How sensitive are distributed systems?
Extrapolating to Low Overhead
Direct Memory Messaging • Send region and receive region for each end of communication channel • Write through send region into remote rcv region
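A hedged sketch of the usage pattern: in a real Memory Channel / SCI / SHRIMP system the send and receive regions live on different machines and the interconnect hardware propagates the stores, whereas here both regions are modeled by one array in a single process, so only the store-then-poll pattern is real.

```c
/* Direct memory messaging, illustratively: the sender stores a message
 * into its mapped send region; the receiver polls its receive region.
 * No system call is involved in the data path. */
#include <stdio.h>
#include <string.h>

#define MSG_BYTES 64

struct channel {
    volatile char data[MSG_BYTES];
    volatile int  valid;              /* written last, so receiver sees a full message */
};

static struct channel region;          /* stands in for the mapped send/rcv pages */

static void sender(const char *msg) {
    memcpy((void *)region.data, msg, strlen(msg) + 1);
    region.valid = 1;                  /* ordinary store acts as the "doorbell" */
}

static void receiver(void) {
    while (!region.valid) ;            /* poll the receive region */
    printf("received: %s\n", (const char *)region.data);
}

int main(void) {
    sender("hello over the memory channel");
    receiver();
    return 0;
}
```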
Direct Memory Interconnects • DEC Memory Channel: ~3 µs end-to-end, ~1 µs o and L • SCI • SGI • SHRIMP (Princeton) [Figure: AlphaServer SMP (Alpha processors, caches, memory) with a PCI (33 MHz) bus interface and link interface (tx/rx control, receive DMA) onto the 100 MB/s Memory Channel interconnect]
Scalability, Availability, and Performance • Scale disk, memory, proc independently • Random node serves query, all search • On (hw or sw) failure, lose random cols of index • On overload, lose random rows [Figure: Inktomi — front ends (FE) and a rack of processors on Myrinet serving a 100-million-document index]
Summary • Performance => Generality (see Part 2) • From Technology “Shift” to Technology “Trend” • Cluster communication becoming cheap • Gigabit Ethernet • System Area Networks becoming commodity • Myricom OEM, Tandem/Compaq ServerNet, SGI, HAL, Sun • Improvements in interconnect BW • gigabyte per second and beyond • Bus connections improving • PCI, ePCI, Pentium II cluster slot, … • Operating system out of the way • VIA
Advice • Clusters are cheap, easy to build, flexible, powerful, general purpose, and fun • Everybody doing SPAA or PODC should have one to try out their ideas • Can use Berkeley NOW through NPACI • www.npaci.edu