
What’s So Different about Cluster Architectures?


Presentation Transcript


  1. What’s So Different about Cluster Architectures? David E. Culler Computer Science Division U.C. Berkeley http://now.cs.berkeley.edu IPPS 98

  2. High Performance Clusters “happen” • Many groups have built them. • Many more are using them. • Industry is running with it: Virtual Interface Architecture, System Area Networks • A powerful, flexible new design technique

  3. Outline • Quick “guided tour” of Clusters at Berkeley • Three Important Advances => Virtual Networks (Alan Mainwaring) => Implicit Co-scheduling (Andrea Arpaci-Dusseau) => Scalable I/O (Remzi Arpaci-Dusseau) • What it means

  4. Stop 1: HP/FDDI Prototype • FDDI on the HP/735 graphics bus. • First fast message layer on an unreliable network

  5. Stop 2: SparcStation NOW • ATM was going to take over the world. • The original INKTOMI

  6. Stop 3: Large Ultra/Myrinet NOW

  7. Stop 4: Massive Cheap Storage • Basic unit: 2 PCs double-ending four SCSI chains • Currently serving Fine Art at http://www.thinker.org/imagebase/

  8. Stop 5: Cluster of SMPs (CLUMPS) • Four Sun E5000s • 8 processors • 3 Myricom NICs • Multiprocessor, Multi-NIC, Multi-Protocol • see S. Lumetta, IPPS 98

  9. Stop 6: Information Servers • Basic Storage Unit: Ultra 2, 300 GB RAID, 800 GB tape stacker, ATM • scalable backup/restore • Dedicated Info Servers: web, security, mail, … • VLANs project into dept.

  10. Stop 7: Millennium PC Clumps • Inexpensive, easy to manage Cluster • Replicated in many departments • Prototype for very large PC cluster

  11. So What’s So Different? • Commodity parts? • Communications Packaging? • Incremental Scalability? • Independent Failure? • Intelligent Network Interfaces? • Complete System on every node: virtual memory, scheduler, files, ...

  12. Three important system design aspects • Virtual Networks • Implicit co-scheduling • Scalable File Transfer

  13. Communication Performance => Direct Network Access • LogP: Latency, Overhead, and Bandwidth • Active Messages: lean layer supporting programming models [Figure labels: Latency, 1/BW]
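
The LogP terms on slide 13 can be turned into a small cost calculator. The following is a minimal sketch of the standard LogP accounting for small messages, not code from the talk; the class name `LogP`, its method names, and the sample numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LogP:
    """LogP parameters for small messages (all times in microseconds)."""
    latency: float   # L: network transit time
    overhead: float  # o: processor time spent on each send or receive
    gap: float       # g: minimum interval between messages; 1/g is the message rate
    procs: int       # P: number of processors

    def one_way(self) -> float:
        # Sender overhead, then wire latency, then receiver overhead.
        return self.overhead + self.latency + self.overhead

    def round_trip(self) -> float:
        # A request followed by a reply of the same size.
        return 2 * self.one_way()

    def burst(self, n: int) -> float:
        """Time until the last of n back-to-back messages has been received."""
        if n < 1:
            raise ValueError("n must be at least 1")
        # The sender can issue no faster than the larger of gap and overhead.
        issue_interval = max(self.gap, self.overhead)
        return (n - 1) * issue_interval + self.one_way()


# Hypothetical parameters, chosen only to make the arithmetic easy to follow.
model = LogP(latency=5.0, overhead=3.0, gap=6.0, procs=32)
print(model.one_way(), model.round_trip(), model.burst(10))
```

Separating overhead from latency matters for the rest of the talk: overhead is processor time the application loses, while latency can be overlapped with other work.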

  14. General purpose requirements • Many timeshared processes • each with direct, protected access • User and system • Client/Server, Parallel clients, parallel servers • they grow, shrink, handle node failures • Multiple packages in a process • each may have own internal communication layer • Use communication as easily as memory

  15. Virtual Networks • Endpoint abstracts the notion of “attached to the network” • Virtual network is a collection of endpoints that can name each other. • Many processes on a node can each have many endpoints, each with its own protection domain.

  16. How are they managed? • How do you get direct hardware access for performance with a large space of logical resources? • Just like virtual memory • active portion of large logical space is bound to physical resources [Figure labels: Host Memory, Processor, Process 1 … Process n, NIC Mem, Network Interface]

  17. Endpoint Transition Diagram • States: HOT (R/W, NIC memory), WARM (R/O, paged host memory), COLD (paged host memory) • Transition labels: Evict, Write, MsgArrival, Read, Swap

  18. Network Interface Support • NIC has endpoint frames • Services active endpoints • Signals misses to driver • using a system endpoint [Figure labels: Frame 0 … Frame 7, Transmit, Receive, EndPoint Miss]
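
Slides 16 and 18 describe endpoint frames on the NIC as a cache over a larger logical space, in the manner of virtual memory. A minimal sketch of that idea follows. The class `NicFrames` is hypothetical, and LRU replacement is an assumption, since the slides do not name an eviction policy; the default of eight frames follows the Frame 0 to Frame 7 labels on slide 18.

```python
from collections import OrderedDict


class NicFrames:
    """Toy model of NIC endpoint frames acting as a cache of active endpoints."""

    def __init__(self, num_frames: int = 8):
        self.free = list(range(num_frames))  # frame indices not yet in use
        self.resident = OrderedDict()        # endpoint id -> frame, least recent first
        self.misses = 0

    def access(self, endpoint_id):
        """Return the frame holding endpoint_id, loading it on a miss."""
        if endpoint_id in self.resident:
            # Hit: the endpoint is already bound to a frame.
            self.resident.move_to_end(endpoint_id)
            return self.resident[endpoint_id]
        # Miss: in the talk the NIC signals the driver; here we only count it.
        self.misses += 1
        if self.free:
            frame = self.free.pop(0)
        else:
            # Assumed policy: evict the least recently used endpoint.
            _, frame = self.resident.popitem(last=False)
        self.resident[endpoint_id] = frame
        return frame


nic = NicFrames(num_frames=2)
for ep in ["a", "b", "a", "c"]:
    print(ep, "-> frame", nic.access(ep))
print("misses:", nic.misses)
```

With two frames, the fourth access evicts endpoint `b`, because `a` was touched more recently.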

  19. Solaris System Abstractions • Segment Driver: manages portions of an address space • Device Driver: manages I/O device [Figure label: Virtual Network Driver]

  20. LogP Performance • Competitive latency • Increased NIC processing • Difference mostly: ack processing, protection check, data structures, code quality • Virtualization cheap

  21. Msg burst work • Bursty Communication among many [Figure labels: Client, Server]

  22. Multiple VNs, Single-threaded Server

  23. Multiple VNs, Multithreaded Server

  24. Perspective on Virtual Networks • Networking abstractions are vertical stacks • new function => new layer • poke through for performance • Virtual Networks provide a horizontal abstraction • basis for building new, fast services

  25. Beyond the Personal Supercomputer • Able to timeshare parallel programs • with fast, protected communication • Mix with sequential and interactive jobs • Use fast communication in OS subsystems • parallel file system, network virtual memory, … • Nodes have powerful, local OS scheduler • Problem: local schedulers do not know to run parallel jobs in parallel

  26. Local Scheduling • Schedulers act independently w/o global control • Program waits while trying to communicate with peers that are not running • 10 - 100x slowdowns for fine-grain programs! => need coordinated scheduling

  27. Explicit Coscheduling • Global context switch according to precomputed schedule • How do you build it? Does it work?

  28. Typical Cluster Subsystem Structures • Master-Slave • Peer-to-Peer [Figure legend: Master; LS = Local service; GS = Global Service; A = Applications; links labeled Communication]

  29. Ideal Cluster Subsystem Structure • Obtain coordination without explicit subsystem interaction, only the events in the program • very easy to build • potentially very robust to component failures • inherently “service on-demand” • scalable • Local service component can evolve. [Figure: GS, LS, and A components on each node]

  30. Three approaches examined in NOW • GLUNIX explicit master-slave (user level) • matrix algorithm to pick PP • uses stops & signals to try to force desired PP to run • Explicit peer-peer scheduling assist with VNs • co-scheduling daemons decide on PP and kick the Solaris scheduler • Implicit • modify the parallel run-time library to allow it to get itself co-scheduled with standard scheduler [Figure: M, GS, LS, and A components for each approach]

  31. Problems with explicit coscheduling • Implementation complexity • Need to identify parallel programs in advance • Interacts poorly with interactive use and load imbalance • Introduces new potential faults • Scalability

  32. Why implicit coscheduling might work • Active message request-reply model • Infer non-local state from local observations; react to maintain coordination • observation => implication => action: fast response => partner scheduled => spin; delayed response => partner not scheduled => block [Figure: Jobs A and B timeshared on WS 1 to WS 4, with request, response, spin, and sleep marked]
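
The observation, implication, action rule on slide 32 amounts to a two-phase wait: spin while a fast reply is still plausible, then block. A minimal sketch follows, assuming the caller supplies the polling and blocking primitives; the function `wait_for_reply` and its parameters are hypothetical and stand in for whatever the run-time library actually provides.

```python
import time


def wait_for_reply(reply_arrived, block_until_reply, spin_limit_s):
    """Two-phase wait: spin for up to spin_limit_s seconds, then block.

    reply_arrived: zero-argument callable that polls for the reply.
    block_until_reply: zero-argument callable that sleeps until the reply comes.
    Returns "spin" if the reply arrived while spinning, otherwise "block".
    """
    deadline = time.monotonic() + spin_limit_s
    while time.monotonic() < deadline:
        if reply_arrived():
            # Fast response: the partner is probably scheduled, so stay on the CPU.
            return "spin"
    # Delayed response: the partner is probably not scheduled, so give up the CPU.
    block_until_reply()
    return "block"


# A partner that answers immediately keeps the caller spinning.
print(wait_for_reply(lambda: True, lambda: None, spin_limit_s=0.01))
# A partner that never answers within the limit causes a block.
print(wait_for_reply(lambda: False, lambda: None, spin_limit_s=0.001))
```

No global scheduler appears anywhere in the sketch; each process acts only on what it observes locally, which is the point of the slide.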

  33. Obvious Questions • Does it work? • How long do you spin? • What are the requirements on the local scheduler?

  34. How Long to Spin? • Answer: round trip time + 5 x wake-up time • round-trip to stay scheduled together • plus wake-up to get scheduled together • plus wake-up to be competitive with blocking cost • plus 3 x wake-up to meet “pairwise” cost
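
The spin budget on slide 34 is a sum of four terms. The sketch below adds them up so the factor of 5 is visible; the function name `spin_time` and the sample inputs are hypothetical, and the slide gives no measured values.

```python
def spin_time(round_trip_us, wakeup_us):
    """Return (total spin time, per-term breakdown) following slide 34."""
    terms = {
        "stay scheduled together": round_trip_us,
        "get scheduled together": wakeup_us,
        "competitive with blocking cost": wakeup_us,
        "pairwise cost": 3 * wakeup_us,
    }
    return sum(terms.values()), terms


# Hypothetical inputs: 20 us round trip, 50 us wake-up.
total, terms = spin_time(round_trip_us=20, wakeup_us=50)
for name, value in terms.items():
    print(f"{name}: {value} us")
print("total:", total, "us")
```

The three wake-up terms (1 + 1 + 3) account for the 5 x wake-up in the slide's answer.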

  35. Does it work?

  36. Synthetic Bulk-synchronous Apps • Range of granularity and load imbalance • Spin-wait: 10x slowdown

  37. With mixture of reads • Block-immediate: 4x slowdown

  38. Timesharing Split-C Programs

  39. Many Questions • What about • mix of jobs? • sequential jobs? • unbalanced placement? • Fairness? • Scalability? • How broadly can implicit coordination be applied in the design of cluster subsystems?

  40. A look at Serious File I/O • Traditional I/O system • NOW I/O system • Benchmark Problem: sort large number of 100-byte records with 10-byte keys • start on disk, end on disk • accessible as files (use the file system) • Datamation sort: 1 million records • Minute sort: quantity in a minute [Figure labels: Proc-Mem, P-M]

  41. NOW-Sort Algorithm: 1 pass • Read: N/P records from disk -> memory • Distribute: send keys to processors holding result buckets • Sort: partial radix sort on each bucket • Write: gather and write records to disk
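
The four phases on slide 41 can be sketched in a single process. This is an illustration under stated assumptions, not the NOW-Sort implementation: the nodes are plain Python lists, whole records are moved rather than keys, buckets are range-partitioned on the first key byte (which assumes roughly uniform keys), and the built-in sort stands in for the partial radix sort. All function names are hypothetical; the record and key sizes follow slide 40.

```python
import random

RECORD_LEN = 100  # bytes per record
KEY_LEN = 10      # leading bytes of each record form the key


def make_records(n, seed=0):
    """Generate n pseudo-random records for the demonstration."""
    rng = random.Random(seed)
    return [bytes(rng.randrange(256) for _ in range(RECORD_LEN)) for _ in range(n)]


def bucket_of(record, num_procs):
    """Map the first key byte to a bucket so bucket i sorts before bucket i+1."""
    return record[0] * num_procs // 256


def one_pass_sort(per_node_input):
    """Read, Distribute, Sort, Write over P simulated nodes."""
    p = len(per_node_input)
    buckets = [[] for _ in range(p)]
    # Read + Distribute: each node scans its N/P records and sends each one
    # to the node that owns its result bucket.
    for local_records in per_node_input:
        for rec in local_records:
            buckets[bucket_of(rec, p)].append(rec)
    # Sort: each node sorts its own bucket (built-in sort, not radix sort).
    for bucket in buckets:
        bucket.sort(key=lambda r: r[:KEY_LEN])
    # Write: return the buckets; concatenated in order they are globally sorted.
    return buckets


records = make_records(1000)
nodes = [records[i::4] for i in range(4)]  # N/P records per node, P = 4
result = one_pass_sort(nodes)
print([len(b) for b in result])
```

Because bucket ownership is decided by key range, no merge step is needed after the local sorts, which is what makes a single pass sufficient.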

  42. Key Implementation Techniques • Performance Isolation: highly tuned local disk-to-disk sort • manage local memory • manage disk striping • memory-mapped I/O with madvise, buffering • manage overlap with threads • Efficient Communication • completely hidden under disk I/O • competes for I/O bus bandwidth • Self-tuning Software • probe available memory, disk bandwidth, trade-offs

  43. World-Record Disk-to-Disk Sort • Sustain 500 MB/s disk bandwidth and 1,000 MB/s network bandwidth

  44. Towards a Cluster File System • Remote disk system built on a virtual network [Figure labels: Client, RDlib, RD server, Active msgs]

  45. Streaming Transfer Experiment

  46. Results • Data distribution affects resource utilization, not delivered bandwidth

  47. I/O Bus crossings

  48. Conclusions • Complete system on every node makes clusters a very powerful architecture. • Extend the system globally: virtual memory systems, schedulers, file systems, ... • Efficient communication enables new solutions to classic systems challenges. • Opens a rich set of issues for parallel processing beyond the personal supercomputer.
