High Performance Linux Clusters Guru Session, Usenix, Boston June 30, 2004 Greg Bruno, SDSC
Overview of San Diego Supercomputer Center • Founded in 1985 • Non-military access to supercomputers • Over 400 employees • Mission: Innovate, develop, and deploy technology to advance science • Recognized as an international leader in: • Grid and Cluster Computing • Data Management • High Performance Computing • Networking • Visualization • Primarily funded by NSF
My Background • 1984 - 1998: NCR - Helped to build the world’s largest database computers • Saw the transition from proprietary parallel systems to clusters • 1999 - 2000: HPVM - Helped build Windows clusters • 2000 - Now: Rocks - Helping to build Linux-based clusters
Cluster Pioneers • In the mid-1990s, Network of Workstations project (UC Berkeley) and the Beowulf Project (NASA) asked the question: Can You Build a High Performance Machine From Commodity Components?
The Answer is: Yes (Source: Dave Pierce, SIO)
Types of Clusters • High Availability • Generally small (less than 8 nodes) • Visualization • High Performance • Computational tools for scientific computing • Large database machines
High Availability Cluster • Composed of redundant components and multiple communication paths
Visualization Cluster • Each node in the cluster drives a display
High Performance Cluster • Constructed with many compute nodes and often a high-performance interconnect
Cluster Processors • Pentium/Athlon • Opteron • Itanium
Processors: x86 • Most prevalent processor used in commodity clustering • Fastest integer processor on the planet: • 3.4 GHz Pentium 4, SPEC2000int: 1705
Processors: x86 • Capable floating point performance • #5 machine on Top500 list built with Pentium 4 processors
Processors: Opteron • Newest 64-bit processor • Excellent integer performance • SPEC2000int: 1655 • Good floating point performance • SPEC2000fp: 1691 • #10 machine on Top500
Processors: Itanium • First systems released June 2001 • Decent integer performance • SPEC2000int: 1404 • Fastest floating-point performance on the planet • SPEC2000fp: 2161 • Impressive Linpack efficiency: 86%
But What Do You Really Build? • Itanium: Dell PowerEdge 3250 • Two 1.4 GHz CPUs (1.5 MB cache) • 11.2 Gflops peak • 2 GB memory • 36 GB disk • $7,700 • Two 1.5 GHz CPUs (6 MB cache) bring the system cost to ~$17,700 • 1.4 GHz vs. 1.5 GHz: ~7% slower, but the 1.5 GHz system costs ~130% more
Opteron • IBM eServer 325 • Two 2.0 GHz Opteron 246 CPUs • 8 Gflops peak • 2 GB memory • 36 GB disk • $4,539 • Two 2.4 GHz CPUs: $5,691 • 2.0 GHz vs. 2.4 GHz: ~17% slower, but the 2.4 GHz system costs ~25% more
Pentium 4 Xeon • HP DL140 • Two 3.06 GHz CPUs • 12 Gflops peak • 2 GB memory • 80 GB disk • $2,815 • Two 3.2 GHz CPUs: $3,368 • 3.06 GHz vs. 3.2 GHz: ~4% slower, but the 3.2 GHz system costs ~20% more
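The peak-Gflops figures above are just clock rate times floating-point operations per clock per socket; the flops-per-clock values below (4 for Itanium 2, 2 for Opteron and Xeon) are implied by the peaks quoted on these slides. A small sketch of the price/performance arithmetic:

```python
# Price/performance arithmetic for the three node configurations above.
# The flops-per-clock values are implied by the peak Gflops quoted on the slides.
nodes = {
    # name: (cpus, GHz, flops per clock, price in USD)
    "Itanium, Dell PowerEdge 3250": (2, 1.40, 4, 7700),
    "Opteron, IBM eServer 325":     (2, 2.00, 2, 4539),
    "Pentium 4 Xeon, HP DL140":     (2, 3.06, 2, 2815),
}

for name, (cpus, ghz, flops_per_clock, price) in nodes.items():
    peak_gflops = cpus * ghz * flops_per_clock
    print(f"{name:30s} {peak_gflops:5.1f} Gflops peak  ${price / peak_gflops:4.0f}/Gflop")
```

At these list prices the Xeon node comes in around $230 per peak Gflop, the Opteron around $570, and the Itanium around $690.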
What People Are Buying • Gartner study • Servers shipped in 1Q04 • Itanium: 6,281 • Opteron: 31,184 • Opteron shipped 5x more servers than Itanium
What Are People Buying • Gartner study • Servers shipped in 1Q04 • Itanium: 6,281 • Opteron: 31,184 • Pentium: 1,000,000 • Pentium shipped 30x more than Opteron
Interconnects • Ethernet • Most prevalent on clusters • Low-latency interconnects • Myrinet • Infiniband • Quadrics • Ammasso
Why Low-Latency Interconnects? • Performance • Lower latency • Higher bandwidth • Accomplished through OS-bypass
How Low-Latency Interconnects Work • Decrease latency for a packet by reducing the number of memory copies per packet
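To make the memory-copy argument concrete, here is a toy one-way latency model: each kernel-mediated copy adds a per-byte cost on top of a fixed overhead and the wire time, and OS-bypass removes those copies. The constants are illustrative assumptions, not measurements from the talk.

```python
# Toy one-way latency model: fewer memory copies means lower latency.
# All constants are assumed for illustration only.
def one_way_latency_us(msg_bytes, copies,
                       copy_us_per_kb=1.0,      # assumed cost of one memory copy
                       fixed_overhead_us=10.0,  # assumed driver/NIC overhead
                       wire_gbps=1.0):          # link speed
    copy_time = copies * (msg_bytes / 1024.0) * copy_us_per_kb
    wire_time = (msg_bytes * 8) / (wire_gbps * 1000.0)  # bytes -> microseconds
    return fixed_overhead_us + copy_time + wire_time

# Kernel-mediated path with several copies vs. an OS-bypass path with none:
print(one_way_latency_us(1024, copies=4))  # copies add to small-message latency
print(one_way_latency_us(1024, copies=0))  # OS-bypass
```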
Bisection Bandwidth • Definition: if you split the system in half, what is the maximum amount of data that can pass between the two halves? • Assuming 1 Gb/s links: • Bisection bandwidth = 1 Gb/s
Bisection Bandwidth • Assuming 1 Gb/s links: • Bisection bandwidth = 2 Gb/s
Bisection Bandwidth • Definition: a network has full bisection bandwidth if its topology can support N/2 simultaneous communication streams • That is, the nodes on one half of the network can communicate with the nodes on the other half at full speed
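As a rough sketch of the definition (the diagrams from the original slides are not reproduced here), the bandwidth across the cut is limited both by the hosts in one half and by the links that cross the cut:

```python
# Toy bisection-bandwidth calculation: cut the system in half and count
# what can cross. Link speed in Gb/s, matching the 1 Gb/s links assumed above.
def bisection_gbps(hosts_per_half, links_crossing_cut, link_gbps=1.0):
    # One half cannot inject more than its hosts' links allow, and the
    # cut cannot carry more than the links crossing it allow.
    return min(hosts_per_half, links_crossing_cut) * link_gbps

print(bisection_gbps(hosts_per_half=1, links_crossing_cut=1))  # 1.0 Gb/s
print(bisection_gbps(hosts_per_half=2, links_crossing_cut=2))  # 2.0 Gb/s
# Full bisection bandwidth means N/2 links cross the cut for N hosts, so every
# host in one half can talk to a host in the other half at full link speed.
```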
Large Networks • When you run out of ports on a single switch, you must add another network stage • In the example above, assuming 1 Gb/s links, the uplinks from stage-1 switches to stage-2 switches must carry at least 6 Gb/s
Large Networks • With low-port-count switches, you need many switches on large systems in order to maintain full bisection bandwidth • A 128-node system built with 32-port switches requires 12 switches and 256 total cables
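The 128-node figure falls out of the two-stage arithmetic: half of each 32-port switch's ports face the nodes, the other half are uplinks, and the uplinks need their own stage of switches. A quick sketch of that counting:

```python
# Switch and cable count for a two-stage, full-bisection network, using
# the 128-node / 32-port example above.
import math

def two_stage_full_bisection(nodes, ports_per_switch):
    down = ports_per_switch // 2                     # leaf ports facing the nodes
    leaves = math.ceil(nodes / down)                 # stage-1 (leaf) switches
    uplinks = leaves * down                          # one uplink per downlink
    spines = math.ceil(uplinks / ports_per_switch)   # stage-2 (spine) switches
    cables = nodes + uplinks                         # node cables + uplink cables
    return leaves + spines, cables

print(two_stage_full_bisection(128, 32))  # -> (12, 256), as on the slide
```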
Myrinet • Long-time interconnect vendor • Delivering products since 1995 • Delivers a single 128-port full-bisection-bandwidth switch • MPI Performance: • Latency: 6.7 us • Bandwidth: 245 MB/s • Cost/port (based on 64-port configuration): $1000 • Switch + NIC + cable • http://www.myri.com/myrinet/product_list.html
Myrinet • Recently announced 256-port switch • Available August 2004
Myrinet • #5 System on Top500 list • System sustains 64% of peak performance • But smaller Myrinet-connected systems hit 70-75% of peak
Quadrics • QsNetII E-series • Released at the end of May 2004 • Deliver 128-port standalone switches • MPI Performance: • Latency: 3 us • Bandwidth: 900 MB/s • Cost/port (based on 64-port configuration): $1800 • Switch + NIC + cable • http://doc.quadrics.com/Quadrics/QuadricsHome.nsf/DisplayPages/A3EE4AED738B6E2480256DD30057B227
Quadrics • #2 on Top500 list • Sustains 86% of peak • Other Quadrics-connected systems on Top500 list sustain 70-75% of peak
Infiniband • Newest cluster interconnect • Currently shipping 32-port switches and 192-port switches • MPI Performance: • Latency: 6.8 us • Bandwidth: 840 MB/s • Estimated cost/port (based on 64-port configuration): $1700 - 3000 • Switch + NIC + cable • http://www.techonline.com/community/related_content/24364
Ethernet • Latency: 80 us • Bandwidth: 100 MB/s • The Top500 list has Ethernet-based systems sustaining between 35% and 59% of peak
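The latency and bandwidth figures quoted for Myrinet, Quadrics, Infiniband, and Ethernet are MPI-level numbers of the sort a ping-pong microbenchmark reports. A minimal sketch of such a benchmark, using mpi4py (an assumption of mine, not a tool from the talk):

```python
# Minimal MPI ping-pong latency/bandwidth sketch (mpi4py assumed available).
# Run with: mpirun -np 2 python pingpong.py
import time
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
assert comm.Get_size() == 2, "run with exactly 2 ranks"

def half_round_trip(nbytes, iters=1000):
    buf = np.zeros(nbytes, dtype=np.uint8)
    comm.Barrier()
    t0 = time.time()
    for _ in range(iters):
        if rank == 0:
            comm.Send(buf, dest=1)
            comm.Recv(buf, source=1)
        else:
            comm.Recv(buf, source=0)
            comm.Send(buf, dest=0)
    # Each iteration is one round trip; half of it approximates one-way time.
    return (time.time() - t0) / (2 * iters)

latency = half_round_trip(1)        # tiny message -> latency
big = 1 << 20
big_time = half_round_trip(big)     # 1 MB message -> bandwidth
if rank == 0:
    print("latency   ~ %.1f us" % (latency * 1e6))
    print("bandwidth ~ %.0f MB/s" % (big / big_time / 1e6))
```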
Ethernet • What we did with 128 nodes and a $13,000 Ethernet network • $101/port • $28/port with our latest Gigabit Ethernet switch • Sustained 48% of peak • With Myrinet, we would have sustained ~1 Tflop • At a cost of ~$130,000 • Roughly 1/3 the cost of the system
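The per-port numbers follow directly from the totals above; a quick check of that arithmetic, with the Myrinet per-port price taken from the earlier Myrinet slide:

```python
# Cost-per-port arithmetic for the 128-node Ethernet network, and the
# approximate cost of a Myrinet network of the same size.
nodes = 128
ethernet_network_cost = 13000   # dollars, as built
myrinet_cost_per_port = 1000    # switch + NIC + cable, from the Myrinet slide

print("Ethernet: $%.0f per port" % (ethernet_network_cost / nodes))  # ~$101
print("Myrinet network: ~$%d" % (myrinet_cost_per_port * nodes))     # ~$128,000
```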
Rockstar Topology • 24-port switches • Not a symmetric network • Best case - 4:1 bisection bandwidth • Worst case - 8:1 • Average - 5.3:1
Low-Latency Ethernet • Brings OS-bypass to Ethernet • Projected performance: • Latency: less than 20 us • Bandwidth: 100 MB/s • Could potentially merge the management and high-performance networks • Vendor: Ammasso
Local Storage • Exported to compute nodes via NFS
Network Attached Storage • A NAS box is an embedded NFS appliance
Storage Area Network • Provides a disk block interface over a network (Fibre Channel or Ethernet) • Moves the shared disks out of the servers and onto the network • Still requires a central service to coordinate file system operations
Parallel Virtual File System • PVFS version 1 has no fault tolerance • PVFS version 2 (in beta) has fault tolerance mechanisms