Explore the evolution of high-performance clusters, their key challenges, breakthrough technologies, and future directions in computer science. Learn why clusters are essential for capacity, scalability, and cost-effectiveness in internet services.
High-Performance Clusters, Part 1: Performance
David E. Culler
Computer Science Division, U.C. Berkeley
PODC/SPAA Tutorial, Sunday, June 28, 1998
Clusters have Arrived • … the SPAA / PODC testbed going forward
Berkeley NOW • http://now.cs.berkeley.edu/
NOW’s Commercial Version • 240 processors, Active Messages, Myrinet, ...
Berkeley Massive Storage Cluster • serving Fine Art at www.thinker.org/imagebase/ • or try
Commercial Scene
What’s a Cluster? • Collection of independent computer systems working together as if a single system. • Coupled through a scalable, high bandwidth, low latency interconnect.
Outline for Part 1 • Why Clusters NOW? • What is the Key Challenge? • How is it overcome? • How much performance? • Where is it going?
Why Clusters? • Capacity • Availability • Scalability • Cost-effectiveness
Traditional Availability Clusters • VAX Clusters => IBM Sysplex => Wolf Pack [Figure: clients connected to Server A and Server B, which share Disk array A and Disk array B over an interconnect]
Why HP Clusters NOW? • Time to market => performance • Technology • internet services [Figure: node performance vs. engineering lag time in large systems]
Technology Breakthrough • Killer micro => Killer switch • single chip building block for scalable networks • high bandwidth • low latency • very reliable
Opportunity: Rethink System Design • Remote memory and processor are closer than local disks! • Networking Stacks ? • Virtual Memory ? • File system design ? • It all looks like parallel programming • Huge demand for scalable, available, dedicated internet servers • big I/O, big compute
Example: Traditional File System • Server resources at a premium • Client resources poorly utilized • The server is the bottleneck: expensive, complex, non-scalable, a single point of failure [Figure: clients with local private file caches connect over a fast channel (HIPPI) to a central server holding the global shared file cache and RAID disk storage]
Truly Distributed File System • VM: page to remote memory • Cluster caching: each node keeps a local file cache, linked by a scalable, low-latency communication network • Network RAID striping • G = Node Comm BW / Disk BW [Figure: processors, each with its own file cache, connected by the network]
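To make G concrete, here is a hedged back-of-the-envelope calculation. The 160 MB/s link rate comes from the Myrinet NIC figure later in the deck; the per-disk bandwidth is an assumption typical of the era, not a number from the talk.

```c
/* Back-of-the-envelope for G = Node Comm BW / Disk BW.
 * link_bw is the Myrinet figure quoted later in this deck;
 * disk_bw is an assumed per-disk rate, not a measured number. */
#include <stdio.h>

int main(void) {
    double link_bw = 160.0;   /* MB/s into one node over Myrinet */
    double disk_bw = 10.0;    /* MB/s per disk (assumed) */
    double G = link_bw / disk_bw;

    /* With G around 16, a single node can absorb data striped across
     * roughly 16 remote disks before its network link saturates. */
    printf("G = %.0f\n", G);
    return 0;
}
```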
Fast Communication Challenge • Fast processors and fast networks • The time is spent in crossing between them [Figure: "killer platforms" (ns-scale processors) reach the "killer switch" through comm. software and network interface hardware, where costs climb from µs to ms]
Opening: Intelligent Network Interfaces • Dedicated processing power and storage embedded in the network interface • An I/O card today • Tomorrow on chip? [Figure: Sun Ultra 170 nodes (processor, caches, memory) attach a Myricom NIC via the I/O bus (S-Bus, 50 MB/s) to the Myricom network (160 MB/s)]
Our Attack: Active Messages • Request / reply small active messages (RPC) • Bulk transfer (store & get) • Highly optimized communication layer on a range of HW [Figure: a request message runs a request handler at the destination, and the reply runs a reply handler back at the source]
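As a rough illustration of the request/reply idiom, here is a minimal, self-contained C sketch in the spirit of Active Messages. The names (am_deliver, handler_table, ...) and the single-process "network" are stand-ins, not the real AM-II interface.

```c
/* Minimal sketch of the Active Messages idea: each message names a
 * handler that runs on arrival and may issue a reply.  Names here are
 * illustrative, not the actual AM-II API. */
#include <stdio.h>

typedef void (*am_handler_t)(int src_node, int arg);

static void request_handler(int src_node, int arg);
static void reply_handler(int src_node, int arg);

/* handler table indexed by the handler id carried in the message */
static am_handler_t handler_table[] = { request_handler, reply_handler };

/* "network": deliver a message by dispatching to the named handler */
static void am_deliver(int dst_node, int src_node, int handler_id, int arg) {
    (void)dst_node;                      /* single-process stand-in */
    handler_table[handler_id](src_node, arg);
}

static void request_handler(int src_node, int arg) {
    int result = arg * arg;              /* small computation at the remote node */
    am_deliver(src_node, 1, 1, result);  /* immediately issue the reply */
}

static void reply_handler(int src_node, int arg) {
    printf("reply from node %d: %d\n", src_node, arg);
}

int main(void) {
    am_deliver(1, 0, 0, 7);              /* node 0 sends a request to node 1 */
    return 0;
}
```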
NOW System Architecture [Figure, layered: parallel apps and large sequential apps on top; programming interfaces (Sockets, Split-C, MPI, HPF, vSM); Global Layer UNIX providing process migration, distributed files, network RAM, and resource management; UNIX workstations, each with comm. SW and network interface HW; all connected by a fast commercial switch (Myrinet)]
Cluster Communication Performance
LogP • L: latency in sending a (small) message between modules • o: overhead felt by the processor on sending or receiving a msg • g: gap between successive sends or receives (1/rate) • P: processors • Round trip time: 2 x (2o + L) [Figure: P processor–memory pairs attached to a limited-volume interconnection network (at most L/g messages in flight per processor)]
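A small worked example may help. The parameter values below are placeholders rather than measured numbers from the talk; only the round-trip formula comes directly from the slide, and the burst formula is the standard LogP pipelining argument.

```c
/* Back-of-the-envelope LogP calculator with assumed parameter values. */
#include <stdio.h>

int main(void) {
    double L = 5.0;   /* latency (us), assumed */
    double o = 3.0;   /* send/receive overhead (us), assumed */
    double g = 6.0;   /* gap between messages (us), assumed */
    int    n = 100;   /* number of back-to-back small messages */

    /* one-way small-message time and the round trip from the slide */
    double one_way    = o + L + o;
    double round_trip = 2.0 * (2.0 * o + L);

    /* a pipelined burst of n messages is limited by max(o, g) per message */
    double per_msg = (o > g) ? o : g;
    double burst   = o + (n - 1) * per_msg + L + o;

    printf("one-way    = %.1f us\n", one_way);
    printf("round trip = %.1f us\n", round_trip);
    printf("burst(%d) = %.1f us\n", n, burst);
    return 0;
}
```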
LogP Comparison • Direct, user-level network access • Generic AM, FM (UIUC), PM (RWC), U-Net (Cornell), … [Chart: latency and 1/BW for each system]
MPI over AM: ping-pong bandwidth
MPI over AM: start-up
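The plots themselves are not reproduced in this text, but the measurement behind them is the usual MPI ping-pong. A minimal sketch follows; the message size and iteration count are arbitrary choices here, and it should be run with exactly 2 ranks.

```c
/* Minimal MPI ping-pong of the kind behind start-up (latency) and
 * bandwidth plots; small messages probe the start-up cost. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    int rank, iters = 1000, bytes = 8;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(bytes);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();
    if (rank == 0)   /* half the average round trip = one-way time */
        printf("one-way time: %.2f us\n", 1e6 * (t1 - t0) / (2.0 * iters));
    free(buf);
    MPI_Finalize();
    return 0;
}
```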
Cluster Application Performance: NAS Parallel Benchmarks
NPB2: NOW vs SP2
NPB2: NOW vs SGI Origin
Where the Time Goes: LU
Where the Time Goes: SP
LU Working Set • 4-processor • traditional curve for small caches • Sharp knee >256KB (1 MB total)
LU Working Set (CPS scaling) • Knee at global cache > 1MB • machine experiences drop in miss rate at specific size
Application Sensitivity to Communication Performance
Adjusting L, o, and g (and G) in situ • Martin et al., ISCA 97 • o: stall the Ultra on msg write / msg read • L: defer marking msg as valid until Rx + L • g: delay the LANai after msg injection (after each fragment for bulk transfers) [Figure: two host workstations running the AM lib over LANai NICs connected by Myrinet]
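For the overhead knob specifically, here is a hedged sketch of the idea: extra send/receive overhead is emulated by spinning the host CPU after each message operation, while the LANai-side delays emulate added L and g. The function names, the cycle constant, and the dummy send are illustrative; Martin et al. calibrate the actual delay loops against the hardware.

```c
/* Illustrative emulation of added overhead: burn host cycles after a send,
 * so the processor cannot overlap that time with useful work. */
#include <stdio.h>
#include <stdint.h>

#define CYCLES_PER_US 167ULL   /* assumed 167 MHz host clock; calibrated in practice */

static void stall_us(unsigned us) {
    volatile uint64_t spin = (uint64_t)us * CYCLES_PER_US;
    while (spin--) ;           /* busy-wait on the host CPU */
}

static void dummy_send(const char *msg) {   /* stands in for the real AM send */
    printf("sent: %s\n", msg);
}

/* effective overhead becomes o' = o + delta_o_us */
static void send_with_added_overhead(const char *msg, unsigned delta_o_us) {
    dummy_send(msg);
    stall_us(delta_o_us);
}

int main(void) {
    send_with_added_overhead("probe", 10);   /* add 10 us of artificial overhead */
    return 0;
}
```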
Calibration
Split-C Applications

Program       Description            Input                      P=16    P=32   Msg Interval (us)   Msg Type
Radix         Integer radix sort     16M 32-bit keys            13.7    7.8    6.1                 msg
EM3D (write)  Electro-magnetic       80K nodes, 40% rmt         88.6    38.0   8.0                 write
EM3D (read)   Electro-magnetic       80K nodes, 40% rmt         230.0   114.0  13.8                read
Sample        Integer sample sort    32M 32-bit keys            24.7    13.2   13.0                msg
Barnes        Hierarchical N-Body    1 million bodies           77.9    43.2   52.8                cached read
P-Ray         Ray Tracer             1 million pixel image      23.5    17.9   156.2               cached read
MurPHI        Protocol Verification  SCI protocol, 2 proc       67.7    35.3   183.5               Bulk
Connect       Connected Comp         4M nodes, 2-D mesh, 30%    2.3     1.2    212.6               BSP
NOW-sort      Disk-to-Disk Sort      32M 100-byte records       127.2   56.9   817.4               I/O
Radb          Bulk version Radix     16M 32-bit keys            7.0     3.7    852.7               Bulk
Sensitivity to Overhead
Comparative Impact
Sensitivity to bulk BW (1/G)
Cluster Communication Performance • Overhead, Overhead, Overhead • hypersensitive due to increased serialization • Sensitivity to gap reflects bursty communication • Surprisingly latency tolerant • Plenty of room for overhead improvement - How sensitive are distributed systems?
Extrapolating to Low Overhead
Direct Memory Messaging • Send region and receive region for each end of communication channel • Write through send region into remote rcv region
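A hedged sketch of the usage pattern: in a real Memory Channel / SCI / SHRIMP system the send and receive regions live on different machines and the interconnect hardware propagates the stores, whereas here both regions are modeled by one array in a single process, so only the store-then-poll pattern is real.

```c
/* Direct memory messaging, illustratively: the sender stores a message
 * into its mapped send region; the receiver polls its receive region.
 * No system call is involved in the data path. */
#include <stdio.h>
#include <string.h>

#define MSG_BYTES 64

struct channel {
    volatile char data[MSG_BYTES];
    volatile int  valid;              /* written last, so receiver sees a full message */
};

static struct channel region;          /* stands in for the mapped send/rcv pages */

static void sender(const char *msg) {
    memcpy((void *)region.data, msg, strlen(msg) + 1);
    region.valid = 1;                  /* ordinary store acts as the "doorbell" */
}

static void receiver(void) {
    while (!region.valid) ;            /* poll the receive region */
    printf("received: %s\n", (const char *)region.data);
}

int main(void) {
    sender("hello over the memory channel");
    receiver();
    return 0;
}
```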
Direct Memory Interconnects • DEC Memory Channel: ~3 µs end-to-end, ~1 µs o and L • SCI • SGI • SHRIMP (Princeton) [Figure: AlphaServer SMP (Alpha processors, caches, memory) with a PCI (33 MHz) bus interface and link interface (tx/rx control, receive DMA) onto the 100 MB/s Memory Channel interconnect]
Scalability, Availability, and Performance • Scale disk, memory, proc independently • Random node serves query, all search • On (hw or sw) failure, lose random cols of index • On overload, lose random rows [Figure: Inktomi — front ends (FE) and a rack of processors on Myrinet serving a 100-million-document index]
Summary • Performance => Generality (see Part 2) • From Technology “Shift” to Technology “Trend” • Cluster communication becoming cheap • Gigabit Ethernet • System Area Networks becoming commodity • Myricom OEM, Tandem/Compaq ServerNet, SGI, HAL, Sun • Improvements in interconnect BW • gigabyte per second and beyond • Bus connections improving • PCI, ePCI, Pentium II cluster slot, … • Operating system out of the way • VIA
Advice • Clusters are cheap, easy to build, flexible, powerful, general purpose, and fun • Everybody doing SPAA or PODC should have one to try out their ideas • Can use Berkeley NOW through NPACI • www.npaci.edu