What’s So Different about Cluster Architectures? David E. Culler Computer Science Division U.C. Berkeley http://now.cs.berkeley.edu IPPS 98
High Performance Clusters “happen” • Many groups have built them. • Many more are using them. • Industry is running with it • Virtual Interface Architecture • System Area Networks • A powerful, flexible new design technique
Outline • Quick “guided tour” of Clusters at Berkeley • Three Important Advances => Virtual Networks (Alan Mainwaring) => Implicit Co-scheduling (Andrea Arpaci-Dusseau) => Scalable I/O (Remzi Arpaci-Dusseau) • What it means
Stop 1: HP/fddi Prototype • FDDI on the HP/735 graphics bus. • First fast msg layer on non-reliable network
Stop 2: SparcStation NOW • ATM was going to take over the world. • The original INKTOMI
Stop 3: Large Ultra/Myrinet NOW
Stop 4: Massive Cheap Storage • Basic unit: 2 PCs double-ending four SCSI chains • Currently serving Fine Art at http://www.thinker.org/imagebase/
Stop 5: Cluster of SMPs (CLUMPS) • Four Sun E5000s • 8 processors • 3 Myricom NICs • Multiprocessor, Multi-NIC, Multi-Protocol • see S. Lumetta IPPS98
Stop 6: Information Servers • Basic Storage Unit: • Ultra 2, 300 GB raid, 800 GB tape stacker, ATM • scalable backup/restore • Dedicated Info Servers • web, • security, • mail, … • VLANs project into dept.
Stop 7: Millennium PC Clumps • Inexpensive, easy to manage Cluster • Replicated in many departments • Prototype for very large PC cluster
So What’s So Different? • Commodity parts? • Communications Packaging? • Incremental Scalability? • Independent Failure? • Intelligent Network Interfaces? • Complete System on every node • virtual memory • scheduler • files • ...
Three important system design aspects • Virtual Networks • Implicit co-scheduling • Scalable File Transfer
Communication Performance: Direct Network Access • LogP: Latency, Overhead, and Bandwidth • Active Messages: lean layer supporting programming models [Figure: axes labeled Latency and 1/BW]
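The LogP terms named on this slide can be turned into a small cost calculator. The sketch below uses the standard LogP accounting (send overhead, network latency, receive overhead, and a per-message gap equal to 1/bandwidth). The class name, method names, and the numbers in the usage lines are illustrative assumptions, not measurements from the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogP:
    """Hypothetical container for LogP parameters, all in microseconds."""
    latency_us: float   # L: time a message spends in the network
    overhead_us: float  # o: CPU time to send or to receive one message
    gap_us: float       # g: minimum interval between messages (1/bandwidth)

    def one_way_us(self) -> float:
        # Sender overhead + network latency + receiver overhead.
        return self.overhead_us + self.latency_us + self.overhead_us

    def round_trip_us(self) -> float:
        # A request followed by a reply of the same size.
        return 2 * self.one_way_us()

    def burst_us(self, n: int) -> float:
        # n back-to-back messages: injection is limited by max(o, g),
        # then the last message still needs a full one-way trip.
        return (n - 1) * max(self.overhead_us, self.gap_us) + self.one_way_us()


# Illustrative values only (assumed, not from the slides).
model = LogP(latency_us=5.0, overhead_us=3.0, gap_us=6.0)
print(model.one_way_us(), model.round_trip_us(), model.burst_us(10))
```

Separating overhead from latency matters for the rest of the talk: overhead is processor time the application loses, while latency can be overlapped with other work.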
General purpose requirements • Many timeshared processes • each with direct, protected access • User and system • Client/Server, Parallel clients, parallel servers • they grow, shrink, handle node failures • Multiple packages in a process • each may have own internal communication layer • Use communication as easily as memory
Virtual Networks • Endpoint abstracts the notion of “attached to the network” • Virtual network is a collection of endpoints that can name each other. • Many processes on a node can each have many endpoints, each with own protection domain.
How are they managed? • How do you get direct hardware access for performance with a large space of logical resources? • Just like virtual memory • active portion of large logical space is bound to physical resources [Figure: Processes 1 through n in host memory, the processor, and the network interface with NIC memory]
Endpoint Transition Diagram • HOT: R/W, NIC memory • WARM: R/O, paged host memory • COLD: paged host memory • Transitions shown: Evict, Write, MsgArrival, Read, Swap
Network Interface Support • NIC has endpoint frames • Services active endpoints • Signals misses to driver • using a system endpoint [Figure: NIC endpoint frames 0 through 7, transmit and receive paths, endpoint miss]
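The endpoint frames on the NIC act like a small cache over a much larger set of logical endpoints, in the same way physical pages back a larger virtual address space. The following toy model sketches that idea. The class name `EndpointCache`, the least-recently-used eviction policy, and the counters are assumptions made for illustration; the slides give only the number of frames (0 through 7) and say that misses are signaled to the driver.

```python
from collections import OrderedDict


class EndpointCache:
    """Toy model: a few NIC frames caching a larger set of endpoints."""

    def __init__(self, num_frames: int = 8):
        self.num_frames = num_frames
        self.frames = OrderedDict()  # resident endpoint ids, oldest first
        self.misses = 0
        self.evictions = 0

    def touch(self, endpoint_id) -> bool:
        """Use an endpoint (send or message arrival). Returns True on a hit."""
        if endpoint_id in self.frames:
            self.frames.move_to_end(endpoint_id)  # mark as most recently used
            return True
        # Miss: in the real system the NIC would signal the driver here.
        self.misses += 1
        if len(self.frames) >= self.num_frames:
            # Assumed policy: evict the least recently used endpoint
            # back to host memory to free a frame.
            self.frames.popitem(last=False)
            self.evictions += 1
        self.frames[endpoint_id] = None
        return False

    def resident(self):
        return list(self.frames)


cache = EndpointCache(num_frames=2)
for ep in ["a", "b", "a", "c", "b"]:
    cache.touch(ep)
print(cache.resident(), cache.misses, cache.evictions)
```

As with virtual memory, the approach pays off when the set of endpoints in active use at any moment fits in the available frames.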
Solaris System Abstractions • Segment Driver • manages portions of an address space • Device Driver • manages I/O device [Figure label: Virtual Network Driver]
LogP Performance • Competitive latency • Increased NIC processing • Difference mostly • ack processing • protection check • data structures • code quality • Virtualization cheap
Bursty Communication among many [Figure: several clients and servers; labels: msg burst, work]
Perspective on Virtual Networks • Networking abstractions are vertical stacks • new function => new layer • poke through for performance • Virtual Networks provide a horizontal abstraction • basis for building new, fast services
Beyond the Personal Supercomputer • Able to timeshare parallel programs • with fast, protected communication • Mix with sequential and interactive jobs • Use fast communication in OS subsystems • parallel file system, network virtual memory, … • Nodes have powerful, local OS scheduler • Problem: local schedulers do not know to run parallel jobs in parallel
Local Scheduling • Schedulers act independently w/o global control • Program waits while trying to communicate with its peers that are not running • 10 - 100x slowdowns for fine-grain programs! => need coordinated scheduling
Explicit Coscheduling • Global context switch according to precomputed schedule • How do you build it? Does it work?
Typical Cluster Subsystem Structures • Master-Slave • Peer-to-Peer [Figure legend: A = Applications, LS = Local service, GS = Global Service; a Master in the master-slave case; Communication links between components]
Ideal Cluster Subsystem Structure • Obtain coordination without explicit subsystem interaction, only the events in the program • very easy to build • potentially very robust to component failures • inherently “service on-demand” • scalable • Local service component can evolve. [Figure: GS, LS, and A components]
Three approaches examined in NOW • GLUNIX explicit master-slave (user level) • matrix algorithm to pick PP (parallel program) • uses stops & signals to try to force desired PP to run • Explicit peer-peer scheduling assist with VNs • co-scheduling daemons decide on PP and kick the Solaris scheduler • Implicit • modify the parallel run-time library to allow it to get itself co-scheduled with standard scheduler [Figure: M, GS, LS, and A components for each approach]
Problems with explicit coscheduling • Implementation complexity • Need to identify parallel programs in advance • Interacts poorly with interactive use and load imbalance • Introduces new potential faults • Scalability
Why implicit coscheduling might work • Active message request-reply model • Infer non-local state from local observations; react to maintain coordination • observation: fast response; implication: partner scheduled; action: spin • observation: delayed response; implication: partner not scheduled; action: block [Figure: Jobs A and B timeshared on workstations WS 1 through WS 4; Job A's request, response, spin, and sleep]
Obvious Questions • Does it work? • How long do you spin? • What are the requirements on the local scheduler?
How Long to Spin? • Answer: round trip time + 5 x wake-up time • round-trip to stay scheduled together • plus wake-up to get scheduled together • plus wake-up to be competitive with blocking cost • plus 3 x wake-up to meet “pairwise” cost
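The spin rule on this slide combines naturally with the spin-or-block decision of implicit coscheduling: spin for a bounded time, and block only if the reply has still not arrived. The sketch below models the reply as a `threading.Event`. The function names and the use of Python threads are assumptions for illustration, not the NOW run-time library's interface.

```python
import threading
import time


def spin_limit(round_trip: float, wake_up: float) -> float:
    """Spin budget from the slide: round trip time + 5 x wake-up time."""
    return round_trip + 5 * wake_up


def wait_for_reply(reply: threading.Event, round_trip: float, wake_up: float) -> str:
    """Two-phase wait: spin up to the limit, then block.

    Returns "spun" if the reply arrived while spinning (partner was
    probably scheduled) or "blocked" if the caller gave up the CPU.
    """
    deadline = time.monotonic() + spin_limit(round_trip, wake_up)
    while time.monotonic() < deadline:
        if reply.is_set():
            return "spun"
    # Delayed response: assume the partner is not scheduled and yield the CPU.
    reply.wait()
    return "blocked"


# Illustrative usage with assumed timings, in seconds.
fast = threading.Event()
fast.set()
print(wait_for_reply(fast, round_trip=0.001, wake_up=0.001))
```

Bounding the spin keeps the cost of a wrong guess small: a process whose partner is not running wastes at most the spin budget before releasing the processor.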
Does it work?
Synthetic Bulk-synchronous Apps • Range of granularity and load imbalance • spin wait 10x slowdown
With mixture of reads • Block-immediate 4x slowdown
Timesharing Split-C Programs
Many Questions • What about • mix of jobs? • sequential jobs? • unbalanced placement? • Fairness? • Scalability? • How broadly can implicit coordination be applied in the design of cluster subsystems?
A look at Serious File I/O • Traditional I/O system • NOW I/O system • Benchmark Problem: sort large number of 100 byte records with 10 byte keys • start on disk, end on disk • accessible as files (use the file system) • Datamation sort: 1 million records • Minute sort: quantity sorted in a minute [Figure: a single Proc-Mem system compared with several P-M nodes]
NOW-Sort Algorithm: 1 pass • Read • N/P records from disk -> memory • Distribute • send keys to processors holding result buckets • Sort • partial radix sort on each bucket • Write • gather and write records to disk
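The four phases above can be sketched in memory for a handful of simulated nodes. This is only an illustration under stated assumptions: lists stand in for disks and the network, the top byte of each key selects the destination bucket (which assumes roughly uniform keys), and Python's built-in sort replaces the partial radix sort. The names `one_pass_sort` and `make_records` are hypothetical.

```python
import random


def make_records(count: int, seed: int, record_len: int = 100):
    """Hypothetical helper: random records whose first 10 bytes are the key."""
    rng = random.Random(seed)
    return [bytes(rng.getrandbits(8) for _ in range(record_len)) for _ in range(count)]


def one_pass_sort(node_inputs, key_len: int = 10):
    """Simulate Read, Distribute, Sort, Write across P nodes.

    node_inputs[i] is the list of records initially on node i's disk.
    Returns one sorted list per node; concatenated in node order they
    form the globally sorted output.
    """
    num_nodes = len(node_inputs)
    buckets = [[] for _ in range(num_nodes)]
    # Read + Distribute: each node scans its local records and sends each
    # one to the node that owns the key range.
    for local_records in node_inputs:
        for record in local_records:
            dest = record[0] * num_nodes // 256  # top key byte picks the bucket
            buckets[dest].append(record)
    # Sort + Write: each node sorts its bucket locally and emits it.
    return [sorted(bucket, key=lambda r: r[:key_len]) for bucket in buckets]


inputs = [make_records(50, seed=node) for node in range(4)]
outputs = one_pass_sort(inputs)
print([len(bucket) for bucket in outputs])
```

Because bucket ownership is monotone in the key, no merge step is needed after the local sorts, which is what makes a single pass possible when the data fits in aggregate memory.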
Key Implementation Techniques • Performance Isolation: highly tuned local disk-to-disk sort • manage local memory • manage disk striping • memory mapped I/O with madvise, buffering • manage overlap with threads • Efficient Communication • completely hidden under disk I/O • competes for I/O bus bandwidth • Self-tuning Software • probe available memory, disk bandwidth, trade-offs
World-Record Disk-to-Disk Sort • Sustain 500 MB/s disk bandwidth and 1,000 MB/s network bandwidth
Towards a Cluster File System • Remote disk system built on a virtual network [Figure: Client with RDlib, RD server, Active msgs]
Streaming Transfer Experiment
Results • Data distribution affects resource utilization • Not delivered bandwidth
I/O Bus crossings
Conclusions • Complete system on every node makes clusters a very powerful architecture. • Extend the system globally • virtual memory systems, • schedulers, • file systems, ... • Efficient communication enables new solutions to classic systems challenges. • Opens a rich set of issues for parallel processing beyond the personal supercomputer.