Cluster Architectures and the NPACI Berkeley NOW David E. Culler Computer Science Division U.C. Berkeley http://now.cs.berkeley.edu
Architectural Drivers • Node architecture dominates performance • processor, cache, bus, and memory • design and engineering $ => performance • Greatest demand for performance is on large systems • must track the leading edge of technology without lag • MPP network technology => mainstream • system area networks • System on every node is a powerful enabler • very high speed I/O, virtual memory, scheduling, … • Incremental scalability (up, down, and across) • Complete software tools • Wide class of applications
Berkeley NOW • 100 Sun UltraSparcs • 200 disks • Myrinet SAN • 160 MB/s • Fast comm. • AM, MPI, ... • Ether/ATM switched external net • Global OS • Self Config
Basic Components [block diagram: Sun Ultra 170 nodes, each with processor (P), cache ($), memory (M), and a Myricom NIC on the I/O bus, connected by the Myrinet SAN at 160 MB/s]
Massive Cheap Storage Cluster • Basic unit: 2 PCs double-ending four SCSI chains of 8 disks each • Currently serving Fine Art at http://www.thinker.org/imagebase/
Cluster of SMPs (CLUMPS) • Four Sun E5000s • 8 processors • 4 Myricom NICs each • Multiprocessor, Multi-NIC, Multi-Protocol • NPACI => Sun 450s
Millennium PC Clumps • Inexpensive, easy-to-manage cluster • Replicated in many departments • Prototype for very large PC cluster
So What’s So Different? • Commodity parts? • Communications Packaging? • Incremental Scalability? • Independent Failure? • Intelligent Network Interfaces? • Complete System on every node • virtual memory • scheduler • files • ...
Communication Performance: Direct Network Access • LogP: Latency, Overhead, and Bandwidth • Active Messages: lean layer supporting programming models [plot: message cost broken into Latency and 1/BW components]
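The LogP parameters named above lend themselves to a back-of-the-envelope cost estimate. A minimal sketch of that model (the formula is the standard LogP small-message cost; the parameter values in the usage note are illustrative, not measurements from the NOW):

```python
def logp_time(L, o, g, n=1):
    """Estimated time to deliver n small messages under LogP:
    L = network latency, o = per-message send/receive overhead,
    g = gap between successive injections (reciprocal of bandwidth).
    Messages are injected every max(o, g) time units; the last one
    then pays the network latency plus the receive overhead."""
    return (n - 1) * max(o, g) + o + L + o
```

For example, with L=5, o=1, g=2 (arbitrary units), a single message costs 7 units while a pipelined burst of four costs 13: for bursts, the gap term (1/BW), not latency, dominates.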
World-Record Disk-to-Disk Sort • Sustain 500 MB/s disk bandwidth and 1,000 MB/s network bandwidth
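The aggregate figures above can be sanity-checked per node. A rough calculation, assuming the sort ran across all 100 NOW UltraSparcs (the even per-node split is an assumption, not stated on the slide):

```python
nodes = 100            # NOW UltraSparc count (assumed all participate)
disk_bw_agg = 500.0    # MB/s aggregate disk bandwidth sustained
net_bw_agg = 1000.0    # MB/s aggregate network bandwidth sustained

disk_bw_per_node = disk_bw_agg / nodes  # MB/s of disk traffic per node
net_bw_per_node = net_bw_agg / nodes    # MB/s of network traffic per node
```

Under that assumption each node moves about 5 MB/s to disk and 10 MB/s over the network, comfortably below the 160 MB/s Myrinet links, so the disks rather than the SAN are the limiting resource.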
General purpose Parallel System • Many timeshared processes • each with direct, protected access • partition it any way you like • User and system • Client/Server, Parallel clients, parallel servers • they grow, shrink, handle node failures • Multiple packages in a process • each may have own internal communication layer • Use communication as easily as memory
Virtual Networks • Endpoint abstracts the notion of “attached to the network” • Virtual network is a collection of endpoints that can name each other. • Many processes on a node can each have many endpoints, each with own protection domain.
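The endpoint/virtual-network relationship above can be sketched as a toy data structure. This is a hypothetical illustration of the abstraction; the class and method names are invented for this sketch, not the actual AM-II API:

```python
class VirtualNetwork:
    """A collection of endpoints that can name each other."""
    def __init__(self):
        self.endpoints = []

    def can_name(self, src, dst):
        # Endpoints may address each other only within this virtual network.
        return src in self.endpoints and dst in self.endpoints


class Endpoint:
    """Abstracts 'attached to the network': owned by one process's
    protection domain, member of exactly one virtual network."""
    def __init__(self, vnet, owner):
        self.vnet = vnet
        self.owner = owner   # owning process / protection domain
        vnet.endpoints.append(self)
```

A process can create many endpoints, each joined to a different virtual network, which is how one node supports many communicating applications with protection.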
How are they managed? • How do you get direct hardware access for performance with a large space of logical resources? • Just like virtual memory • active portion of large logical space is bound to physical resources [diagram: processes 1…n in host memory, with active endpoints bound into memory on the network interface (NIC)]
Network Interface Support • NIC has endpoint frames • Services active endpoints • Signals misses to driver • using a system endpoint [diagram: endpoint frames 0 through 7, each with transmit and receive areas; a miss escapes to the driver]
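The frame/miss mechanism above parallels a TLB or page cache: a small set of physical frames backs a large space of logical endpoints. A hypothetical sketch, assuming a fixed frame count and a deliberately naive eviction policy (real systems would use something like LRU):

```python
class NIC:
    """Toy model of a NIC that caches active endpoints in frames.
    Sending on an unbound endpoint 'misses'; the host driver then
    binds it to a frame, evicting another endpoint if necessary,
    analogous to a page fault in virtual memory."""
    def __init__(self, num_frames=8):
        self.frames = [None] * num_frames  # endpoint id or None
        self.misses = 0

    def send(self, endpoint):
        if endpoint not in self.frames:
            self.misses += 1
            self._load(endpoint)
        # ... transmit via the frame now holding `endpoint` ...
        return self.frames.index(endpoint)

    def _load(self, endpoint):
        # Prefer a free frame; otherwise evict frame 0's occupant
        # (illustrative policy only).
        slot = self.frames.index(None) if None in self.frames else 0
        self.frames[slot] = endpoint
```

Sends on already-bound endpoints proceed at full hardware speed; only the miss path involves the driver, which is what makes direct, protected access affordable for many processes.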
Communication under Load [plot: message bursts and work phases among client and server processes]
Beyond the Personal Supercomputer • Able to timeshare parallel programs • with fast, protected communication • Mix with sequential and interactive jobs • Use fast communication in OS subsystems • parallel file system, network virtual memory, … • Nodes have powerful, local OS scheduler • Simple implicit scheduling techniques provide coordinated scheduling => ride workstation/PC node and Internet-server technology => focus CS partners on RAS (reliability, availability, serviceability) for long-running apps