Building a Low-Cost Supercomputer • Dr. Tim McGuire • Sam Houston State University • ACET 2000, Austin, TX
Acknowledgments • Most treatments of cluster computing (including this one) are heavily based on the seminal work of Greg Pfister (IBM Research, Austin), In Search of Clusters • The concept of Beowulf clusters originated with Donald J. Becker and Thomas Sterling at the Center of Excellence in Space Data and Information Sciences (CESDIS), NASA Goddard Space Flight Center
Introduction • There are three ways to do anything faster: • Work harder • "Crunch Time" is familiar to all of us • Work smarter • Better to find a way to reduce the work needed • Get help • Certainly works, but we all know about committees ...
In a computer ... • Working harder: get a faster processor • Working smarter: use a better algorithm • Getting help: parallel processing
Working Harder -- Faster Processors • The effect of faster processors is astonishing • The effective speed of the x86 family of processors has increased nearly 50% per year • RISC architectures have sustained a 60% compound annual growth rate • These trends will likely continue for the foreseeable future
Working Smarter -- Better Algorithms • The increases in speed made possible by better algorithms dwarf the accomplishments of faster hardware • Binary search on 1 billion items takes 30 comparisons, versus a maximum of one billion comparisons using linear search
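To make the algorithmic gap concrete, here is a minimal C sketch (ours, not from the talk) that counts the comparisons binary search makes over an implicit sorted sequence of one billion items; the count comes out at 30, versus up to a billion for a linear scan:

    #include <stdio.h>

    /* Binary search over the implicit sorted sequence a[i] = i, so we can
       "search" a billion items without allocating a billion-element array. */
    static long search(long n, long key, long *cmps) {
        long lo = 0, hi = n - 1;
        *cmps = 0;
        while (lo <= hi) {
            long mid = lo + (hi - lo) / 2;
            (*cmps)++;                     /* one three-way comparison per step */
            if (mid == key) return mid;
            if (mid < key) lo = mid + 1;
            else hi = mid - 1;
        }
        return -1;
    }

    int main(void) {
        long n = 1000000000L;              /* one billion items */
        long cmps;
        search(n, n - 1, &cmps);           /* near-worst case: the last element */
        printf("binary search: %ld comparisons (linear: up to %ld)\n", cmps, n);
        return 0;
    }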
Getting Help -- Parallel Processing • Covert parallelism -- pipelining, vector processing, etc. -- is really equivalent to faster hardware • Overt parallelism is done via software • "Parallelism is the wave of the future -- and always will be"
Early Attempts at Parallelism • Von Neumann thought it was too hard, and gave us the "Von Neumann bottleneck" • The 1960s ILLIAC IV project was the first great attempt at parallel processing (as well as an attempt to advance circuit and software technology) • The Japanese Fifth Generation Project launched another wave, including the Grand Challenge problems
Microprocessor Revolution • Microprocessors have had a superior price/performance ratio • "All you have to do is gang a whole bunch of them together" • The problem is "All you also have to do is program them to work together" • Programming costs much more than hardware
Highly Parallel Computing • Finally, in the early '90s, microprocessors became fast and powerful enough that a practical-sized aggregation of them seemed the only feasible way to exceed supercomputer speeds • Even Cray Research (with the T3D) got into the act
"Lowly" Parallel Processing • Mid-to-late 90's -- military downsizing (among other things) caused funding to dry up • However … • Microprocessors kept getting faster … a lot faster • With overall performance doubling each year, in 4 years what needed 256 processors can be done with 16 instead. • System availability became a mass market issue • Since computers are so cheap, buy two (or more) for redundancy in case one fails and use them both, interconnected by a network
SMP -- One Form of "Cheap" Parallelism • Symmetric multiprocessors have been around for some time and have certain advantages over clusters • Typically, these have been shared memory systems -- few communication problems
The Big Distinction -- Programming • How you program SMP systems is substantially different from programming clusters: Their programming models are different • If you explicitly exploit SMP in an application, it's essentially impossible to efficiently exploit clusters in the same program
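As an illustration of the SMP programming model, here is a minimal shared-memory sketch using POSIX threads (our choice of API; the talk names no specific one): every thread works directly on one shared array and one shared accumulator, so coordination is a matter of locking, not communication. A contrasting message-passing sketch appears later, under "Thick Glue".

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define N 1000000                 /* divisible by NTHREADS */

    static double data[N];            /* one address space, visible to all */
    static double total = 0.0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    /* Each thread sums its slice of the *shared* array directly. */
    static void *partial_sum(void *arg) {
        long id = (long)arg;
        double local = 0.0;
        for (long i = id * (N / NTHREADS); i < (id + 1) * (N / NTHREADS); i++)
            local += data[i];
        pthread_mutex_lock(&lock);    /* shared memory demands explicit locking */
        total += local;
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        for (long i = 0; i < N; i++) data[i] = 1.0;
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&tid[i], NULL, partial_sum, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(tid[i], NULL);
        printf("total = %.0f\n", total);   /* prints 1000000 */
        return 0;
    }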
Why Clusters? • The Standard Litany • Why Now ? • Why Not Now?
The Standard Litany • Performance • Availability • Price/Performance Ratio • Incremental Growth • Scaling • Scavenging
Performance • No matter what form or measure of performance one is seeking -- throughput, response time, turnaround time, etc. -- it is straightforward to claim that one can get even more of it by using a bunch of machines at the same time. • Only occasionally does one hear the admission that a "tad bit" of new programming will be needed for anything to work correctly.
Availability • Having a computer shrivel up into an expensive paperweight can be a lot less traumatic if it's not unique, but rather one of a herd. • The work done by the dear departed sibling can be redistributed among the others (fail-soft computing)
Price/Performance Ratio • Clusters and other forms of computer aggregation are typically collections of machines that individually have very good performance for their price. • The promise is that the aggregate retains the price/performance of its individual members.
Incremental Growth • To the degree that one really does attain greater performance and availability with a group of computers, one should be able to enhance both by merely adding more machines. • Replacing machines should not be necessary.
Scaling • "Scalable" is, unfortunately, a buzzword • What it does deal with is how big a computer system can usably get. • It is a crucial element in the differentiation between clusters and symmetric multiprocessors.
Scavenging • "Look at all those unused CPU cycles spread across all the desktops in our network…" • Unused cycles are free. • However, how do you get and manage them? -- this complicates cluster support very significantly
The Benefits are Real • But, how does one take advantage of it? • The hardware provides the potential. • The fulfillment lies in the software, and unfortunately, software isn't riding the exponential growth curve.
Why Now? • Three Trends • Fat Boxes -- very high performance microprocessors • Fat Pipes -- standard high-speed communication • Thick Glue -- standard tools for distributed computing • One Market Requirement • High Availability
Fat Boxes • Microprocessors have kept, and will keep getting faster. • Supercomputers in the classic style are extinct for practical purposes • Mass-market, inexpensive microprocessors have crawled up the tailpipe of the workstation market just like workstations crawled up the tailpipe of minicomputers and mainframes earlier. • There are no more supercomputers, there is only supercomputing.
Fat Pipes • Commodity off-the-shelf (COTS) networking parts have achieved communication performance that was previously possible only with expensive, proprietary techniques • Standardized communication facilities such as • ATM -- Asynchronous Transfer Mode • Switched Gigabit Ethernet • FCS -- Fibre Channel Standard • Performance of gigabits per second is possible.
Thick Glue • Standard tools for distributed computing such as TCP/IP • Intranets, the Internet, and the World Wide Web • Tool sets for distributed system administration • PVM (Parallel Virtual Machine) and MPI (Message Passing Interface)
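For contrast with the shared-memory sketch shown earlier, here is a minimal message-passing version of the same summation using MPI (again a sketch of ours, not code from the talk). Each process sums a purely local slice, and an explicit MPI_Reduce call carries the partial sums across the network to rank 0:

    #include <mpi.h>
    #include <stdio.h>

    #define N_PER_NODE 250000

    int main(int argc, char **argv) {
        int rank, size;
        double local = 0.0, total = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* No shared memory here: each process sees only its own slice. */
        for (long i = 0; i < N_PER_NODE; i++)
            local += 1.0;

        /* The partial sums travel over the network as messages. */
        MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("total across %d processes = %.0f\n", size, total);

        MPI_Finalize();
        return 0;
    }

Built with mpicc and launched with something like mpirun -np 4 (details vary by MPI implementation), the same binary runs unchanged on one box or across a whole Beowulf cluster.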
Requirement for High Availability • Nobody has ever wanted computers to break. • However, never before has high availability been a significant issue in a mass-market computer arena. • Clusters are uniquely capable of answering the needs of both ends of the spectrum -- high performance and high availability -- and are much cheaper than hardware-based fault-tolerant approaches.
Why Not Now? • If they're so good, why haven't clusters become the most common mode of computation? • Lack of "single system image" software • Limited exploitation
Lack of Single System Image Software • Replacing a single large computer with a cluster means that many systems will have to be managed rather than one. • The available distributed management tools are just that -- tools, not turnkey systems • 50% of the cost of a computer system is staffing, rather than hardware, software, or maintenance
Limited Exploitation • Only relatively few types of subsystems now exploit the ability of clusters to provide both scalable performance and high availability. • This is a direct result of substantial difficulties that arise in parallel programming. • The problem is not hardware, it's software
An Exception • For one kind of parallel system, the software issues have been addressed to a large degree: the symmetric multiprocessor (SMP) • By its very nature, it presents a single system image
Definitions, Distinctions, and Comparisons • Definition • Distinction from Parallel Systems • Distinctions from Distributed Systems • Comparisons and Contrasts
Definition • A cluster is a type of parallel or distributed system that: • consists of a collection of interconnected stand-alone computers, and • is used as a single, unified computing resource • We define them as a subparadigm of distributed (or parallel) systems
Distinction from Parallel Systems • A useful analogy: • This is A Dog • (a single computer)
A Pack of Dogs • And this is a pack of dogs (running in parallel) • (a cluster)
A Savage Multiheaded Pooch • … or, pardon the abbreviation, "SMP" • (This pooch is no relation to Kerberos (Cerberus in Latin), who guards both the gates of Hades and distributed systems -- Cerberus has only three heads.)
Dog Packs and SMPs are Similar • Both are more potent than just plain dogs • They can both bring down larger prey than a plain single dog. • They eat more and eat faster than a single dog
Dog Packs and SMPs are Different • Scaling • Availability • System Management • Software Licensing
Scaling Differences • The Savage Multiheaded Pooch can take many bites at once • What happens when it tries to swallow? • It needs a larger throat, stomach, intestines, etc. • Similarly, to scale SMPs, you must beef up the entire machine • When you add another dog to a dog pack, you add a whole dog. You don't have to do anything to the other dogs. • Likewise, clusters
Availability • If an SMP breaks a leg … "that dog won't hunt" … no matter how many heads it has. • If a member of the pack is injured, the rest of the pack can still bring down prey.
System Management • You only have to walk an SMP once. • It takes a good deal more effort to train a pack of dogs to behave. • With the SMP, all you have to do is get the heads to learn basic cooperation (and that should be built into the operating system).
Licensing (Dogs or Software) • If you get a license for an SMP, you'll probably need only one license • For a cluster of dogs, you'll need one per dog
Distinctions from Distributed Systems • The distinction between clusters and distributed systems is not as clear (and a lot of people confuse the two). • We'll try to draw it. The salient points are: • Internal Anonymity • Peer Relationship • Clusters as part of a Distributed System
Internal Anonymity • Nodes in a distributed system necessarily retain their own individual identities • The elements of a cluster are usually viewed from outside the cluster as anonymous • Internally, they may be differentiated, but externally the jobs are submitted to the cluster, not, for example, to cluster node #4
Peer Relationship • Distributed systems • use an underlying communication layer that is peer-to-peer • at a higher level, they are often organized into a client-server paradigm • Clusters • underlying communication is peer-to-peer • organization is also peer-to-peer (with some minor exceptions)
Clusters as part of a Distributed System • Clusters usually exist in the context of a distributed system • In this case, they are viewed by the distributed system as a single node • For example, the cluster could serve as a compute engine • It could also serve as, say, a DBMS server in the client-server paradigm (but that's not the organization we want to consider in this presentation)
Beowulf Clusters • The Beowulf project was initiated in 1994 under the sponsorship of the NASA HPCC program to explore how computing could be made "cheaper, better, faster". • They termed this approach PoPC -- a Pile of PCs
The "Pile of PCs" Approach • Very similar to COW (cluster of workstations) and shares the roots of NOW (network of workstations,) but emphasizes: • COTS (commodity off the shelf) components • dedicated processors (rather than scavenging cycles from idle workstations) • a private system area network (enclosed SAN rather than exposed LAN)
What Beowulf Adds • Beowulf adds to the PoPC model by emphasizing • no custom components • easy replication from multiple vendors • scalable I/O • a freely available software base • using freely available distributed computing tools with minimal changes • a collaborative design
Advantages of the Beowulf Approach • No single vendor owns the rights to the product -- not vulnerable to single vendor decisions • Approach permits technology tracking -- using the best, most recent components at the best price • Allows "just in place" configuration -- permits flexible and user driven decisions