How to build a Myria-cluster (In the metric system, “myria” = 10,000)
Charles L. Seitz, Myricom, Inc. (chuck@myri.com)
CCGSC, 24 September 2000
Background
• The assumed time frame is 2001-2002.
• There are a few 1,000+-host clusters in operation today, with many more planned.
• Just 4 years ago a 100-host cluster was considered to be remarkable.
The subject of this brief talk is to consider, from an engineering viewpoint, the feasibility of a cluster of 10,000 hosts. This talk is intended to be interactive. I shall frequently ask the members of this distinguished audience for their estimates, opinions, and predictions of the future.
The performance/cost argument for clusters
[Figure: performance vs. cost for a single processor, showing cost-effective processors below a region of diminishing returns.]
Cluster computing shifts the burden of extending performance from the processors, where it is limited by physics and technology, to algorithms and programming (distributing an application across a collection of cost-effective hosts).
Assumed Host Characteristics
• The best peak performance/cost is likely with 2-4 processors per host (small-scale SMP architecture).
• Example: 2-processor or 4-processor Alpha EV-7 host:
  • >>100 SPECfp95; >>50 SPECint95 per processor
  • Main-memory peak data rate >10 GB/s (?)
  • Peak I/O rate (PCI-X) ~1 GB/s per PCI slot
A useful “rule of thumb” for clusters is that the data rate to or from a host need not be more than a modest fraction – perhaps 10-20% – of the host’s memory bandwidth. A distributed computation that consumes so much memory bandwidth that it impacts compute performance is over-distributed. Thus, the network connection should support a 1-2 GB/s data rate.
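The 1-2 GB/s figure follows directly from that rule of thumb. A minimal sketch of the arithmetic (the function name is mine; the 10-20% fraction and the >10 GB/s memory bandwidth are the assumptions stated above):

```python
# Rule-of-thumb sizing from the slide: the network port should carry
# roughly 10-20% of the host's main-memory bandwidth.

def required_network_rate(mem_bw_gbs: float, fraction: float) -> float:
    """Suggested network data rate (GB/s) for a host whose main memory
    sustains mem_bw_gbs GB/s, given the fraction of that bandwidth the
    network is allowed to consume."""
    return mem_bw_gbs * fraction

# Assumed host: >10 GB/s main-memory peak data rate.
for frac in (0.10, 0.20):
    print(f"{frac:.0%} of 10 GB/s -> {required_network_rate(10, frac):.1f} GB/s")
# Prints 1.0 and 2.0 GB/s, i.e. the 1-2 GB/s network connection cited above.
```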
Physical Size
• A rack, including front and back access space, occupies ~1 m² of floor space.
• Somewhat pessimistic estimate: at 10 hosts/rack, 10,000 hosts require 1,000 racks, i.e., ~1,000 m² ≈ (32 m)² of floor space.
• Implication: Most connectivity will be fiber if a “good” topology is employed.
• Best data rate per cost will be with 2.5 GBaud (2 Gbit/s after 8b/10b encoding) VCSEL optical components and multimode fiber -- the fiber PHY level of 1x InfiniBand, Myrinet-2000, and other fast networks.
[Photo: Part of the “Los Lobos” cluster at the University of New Mexico. The 256 hosts are 2-processor IBM Netfinity units, the operating system is Linux, and the interconnect is Myrinet. The system supplier was IBM.]
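Both numbers on this slide are simple arithmetic; a small sketch (variable names are mine) makes them explicit:

```python
# Floor-space estimate: pessimistic 10 hosts per rack, ~1 m^2 per rack.
hosts = 10_000
hosts_per_rack = 10
rack_footprint_m2 = 1.0

racks = hosts / hosts_per_rack
floor_m2 = racks * rack_footprint_m2
print(f"{racks:.0f} racks, {floor_m2:.0f} m^2, roughly a {floor_m2 ** 0.5:.0f} m square")
# -> 1000 racks, 1000 m^2, roughly a 32 m square

# Link rate: 2.5 GBaud signalling with 8b/10b encoding carries 8 data bits
# per 10 baud, hence 2 Gbit/s of data per fiber.
line_rate_gbaud = 2.5
print(f"{line_rate_gbaud * 8 / 10:.1f} Gbit/s")  # -> 2.0 Gbit/s
```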
The topology should be a Clos network
• Clos networks are named for Charles Clos, who introduced them in a paper titled “A Study of Non-Blocking Switching Networks,” published in the Bell System Technical Journal in March 1953.
• A Clos network is rearrangeable, which means that it can route any permutation without blocking. Although the property of being rearrangeable is rarely exploited directly in clusters, a rearrangeable network necessarily exhibits full (maximal) bisection, a property that is crucial to any network that claims to be scalable.
• A crossbar switch is a rearrangeable, full-bisection network, but technology sets an upper bound on the degree of crossbar switches.
Scalable Clos Networks
[Figure: Example of a Clos network for 320 hosts, composed of 16-port switches; the hosts are drawn as five groups of 64.]
A small, intuitive explanation
[Figure: 64 host links and 64 inter-switch links; the line drawn across the inter-switch links cuts 64 links, preserving 64-link bandwidth in that direction; a vertical dashed line partitions the hosts.]
The vertical dashed line cuts 32 links. In fact, the number of links between any {32, 32} partition of hosts is 32 (maximal).
Why Clos Networks?
• Maximal performance under arbitrary traffic patterns
  • Minimum bisection is the largest possible
  • “Rearrangeable network” (can route any permutation)
• Network looks the same from any host (simplifies cluster management)
• Multiple paths
  • All progressive routes are deadlock-free
  • Use multiple paths for redundancy
  • Use multiple paths to avoid hot spots (random dispersion)
• Scales well (see the sketch after this list). For n hosts (minimum bisection = n/2):
  • Diameter varies as log(n)
  • Cost varies as n log(n)
• Modular
  • Economies of sharing power and system monitoring among many switches, and of implementing many of the inter-switch links on circuit boards rather than cables.
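As a rough consistency check on these scaling claims, here is a small sketch (not from the talk; it assumes a fully populated, full-bisection folded Clos built from k-port crossbars, and the helper name is mine) that estimates how many switch stages, what diameter, and roughly how many switches n host ports require. With 16-port switches and 20,000 host ports it reproduces the 9-switch diameter quoted in the recap below.

```python
# Rough scaling of a full-bisection folded Clos built from k-port crossbar
# switches (assumption: fully populated network; every non-top switch uses
# k/2 ports toward the hosts and k/2 ports toward the next stage).
import math

def clos_parameters(n_host_ports: int, k: int = 16):
    """Return (stages, diameter_in_switches, total_switches)."""
    # m stages of k-port switches support up to 2 * (k/2)**m host ports.
    m = 1
    while 2 * (k // 2) ** m < n_host_ports:
        m += 1
    # Worst-case route goes up to the top stage and back down: 2m - 1 switches.
    diameter = 2 * m - 1
    # Stages 1..m-1 each need n/(k/2) switches; the top stage needs n/k.
    switches = (m - 1) * math.ceil(n_host_ports / (k // 2)) + math.ceil(n_host_ports / k)
    return m, diameter, switches

print(clos_parameters(20_000, k=16))
# -> (5, 9, 11250): 5 stages, diameter of 9 switches, ~11,000 16-port switches;
#    diameter grows as log(n) and cost as n*log(n), as claimed above.
```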
Recap of the Configuration
• 10,000 2-processor or 4-processor hosts (10 (?) peak Gflops per host)
• 2 x (250+250 MB/s) ports per host (1 GB/s total data rate)
  • 2 ports per NIC is attractive both for performance and for failover
• 20,000-host-port Clos network
  • Diameter = 9 switches if based on 16-port switches
Reliability / Availability
• At a 50K-hour MTBF, an average of ~5 hosts fail each day (arithmetic sketched below).
• At a 4M-hour MTBF, a NIC fails about every 2 weeks.
• A localized failure in the central switch occurs about every 2 weeks (?).
Clearly, such a system must be designed so that all components can be hot-swapped, and the system-management software must monitor the system status and allocate resources accordingly.
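A minimal sketch of the failure-rate arithmetic behind those bullets (assuming independent failures, one dual-port NIC per host, and simply dividing fleet-hours by MTBF; the helper name is mine):

```python
# Expected failure rates implied by the MTBF figures above.
HOURS_PER_DAY = 24

def failures_per_day(units: int, mtbf_hours: float) -> float:
    """Expected failures per day across `units` identical components,
    each with the given mean time between failures (hours)."""
    return units * HOURS_PER_DAY / mtbf_hours

print(failures_per_day(10_000, 50_000))         # hosts: 4.8/day, i.e. ~5 host failures per day
print(1 / failures_per_day(10_000, 4_000_000))  # NICs: one failure per ~16.7 days, i.e. ~2 weeks
```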
Difficult questions
Cost
• ~$250M, based on $20K/host plus interconnect and integration.
Operating System
• Linux, of course ;-). Would you want to use a proprietary OS?
Why?
• A sufficiently important set of computing problems.
• Bragging rights (at several levels).
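The cost figure is straightforward arithmetic; a tiny sketch (the ~$50M interconnect-and-integration share is the remainder implied by the slide, not a quoted number):

```python
# Cost estimate from the slide: ~$250M total at $20K per host.
hosts = 10_000
host_cost = hosts * 20_000        # $200M for the hosts alone
total_estimate = 250_000_000      # ~$250M quoted above
print(f"hosts: ${host_cost / 1e6:.0f}M, "
      f"implied interconnect + integration: ~${(total_estimate - host_cost) / 1e6:.0f}M")
```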