Parallel Computing Systems, Part I: Introduction. Dror Feitelson, Hebrew University
Topics • Overview of the field • Architectures: vectors, MPPs, SMPs, and clusters • Networks and routing • Scheduling parallel jobs • Grid computing • Evaluating performance
Today (and next week?) • What is parallel computing • Some history • The Top500 list • The fastest machines in the world • Trends and predictions
What is a Parallel System? In particular, what is the difference between parallel and distributed computing?
What is a Parallel System? Chandy: it is related to concurrency. • In distributed computing, concurrency is part of the problem. • In parallel computing, concurrency is part of the solution.
Distributed Systems • Concurrency because of physical distribution • Desktops of different users • Servers across the Internet • Branches of a firm • Central bank computer and ATMs • Need to coordinate among autonomous systems • Need to tolerate failures and disconnections
Parallel Systems • High-performance computing: solve problems that are too big for a single machine • Get the solution faster (weather forecast) • Get a better solution (physical simulation) • Need to parallelize algorithm • Need to control overhead • Can assume friendly system?
The Convergence Use distributed resources for parallel processing • Networks of workstations – use available desktop machines within organization • Grids – use available resources (servers?) across organizations • Internet computing – use personal PCs across the globe (SETI@home)
Early HPC • Parallel systems in academia/research • 1974: C.mmp • 1974: Illiac IV • 1978: Cm* • 1983: Goodyear MPP
Illiac IV • 1974 • SIMD: all processors execute the same instruction • Numerical calculations at NASA • Now in the Boston Computer Museum
The Illiac IV in Numbers • 64 processors arranged as an 8x8 grid • Each processor has 10^4 ECL transistors • Each processor has 2K 64-bit words of memory (total is 8 Mbit) • Arranged in 2^10 boards • Packed in 16 cabinets • 500 Mflops peak performance • Cost: $31 million
Sustained vs. Peak • Peak performance: product of clock rate and number of functional units ("a rate that the vendor guarantees will not be exceeded") • Sustained rate: what you actually achieve on a real application • Sustained is typically much lower than peak • Application does not require all functional units • Need to wait for data to arrive from memory • Need to synchronize • Best (closest to peak) for dense matrix operations (Linpack)
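As a back-of-the-envelope sketch of this distinction, the small C program below uses the Cray 1 figures quoted in these slides (80 MHz clock, two chained floating-point results per tick); the 30% efficiency factor is a purely hypothetical assumption for illustration, not a measured number.

#include <stdio.h>

int main(void) {
    /* Peak = clock rate x results per clock tick
     * (e.g. Cray 1: 80 MHz, add + multiply chained -> 2 results/tick) */
    double clock_mhz = 80.0;          /* clock frequency in MHz                  */
    double results_per_cycle = 2.0;   /* functional-unit results per clock tick  */
    double peak_mflops = clock_mhz * results_per_cycle;

    /* Sustained is what a real application achieves;
     * assume 30% efficiency here (hypothetical figure). */
    double assumed_efficiency = 0.30;
    double sustained_mflops = peak_mflops * assumed_efficiency;

    printf("peak      = %.0f Mflops\n", peak_mflops);
    printf("sustained = %.0f Mflops (at %.0f%% efficiency)\n",
           sustained_mflops, assumed_efficiency * 100.0);
    return 0;
}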
Early HPC • Parallel systems in academia/research • 1974: C.mmp • 1974: Illiac IV • 1978: Cm* • 1983: Goodyear MPP • Vector systems by Cray and Japanese firms • 1976: Cray 1 rated at 160 Mflops peak • 1982: Cray X-MP, later Y-MP, C90, … • 1985: Cray 2, NEC SX-2
Cray’s Achievements • Architectural innovations • Vector operations on vector registers • All memory is equally close: no cache • Trade off accuracy and speed • Packaging • Short and equally long wires • Liquid cooling systems • Style
Vector Supercomputers • Vector registers store vectors of values for fast access • Vector instructions operate on whole vectors of values • Overhead of instruction decode is paid only once per vector • Pipelined execution of the instruction on the vector elements: one result per clock tick (at least once the pipeline is full) • Possible to chain vector operations: start feeding the second functional unit before the first one finishes
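To make the vector model concrete, here is the kind of loop these machines were built for, written as a plain C sketch (the function name is illustrative). A vectorizing compiler turns the loop into vector load, vector multiply, vector add, and vector store instructions covering, say, 64 elements each; chaining lets the adder start consuming products while the multiplier is still producing them.

/* y = a*x + y over n elements: a classic vectorizable loop (DAXPY).
 * On a vector supercomputer this becomes a handful of vector
 * instructions, each processing a whole vector register of elements,
 * with the multiply and add units chained. */
void daxpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}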
Cray 1 • 1975 • 80 MHz clock • 160 Mflops peak • Liquid cooling • World’s most expensive love seat • Power supply and cooling under the seat • Available in red, blue, black… • No operating system
Cray 1 Wiring • Round configuration for small and uniform distances • Longest wire: 4 feet • Wires connected manually by extra-small engineers
Cray X-MP • 1982 • 1 Gflop • Multiprocessor with 2 or 4 Cray1-like processors • Shared memory
Cray 2 • 1985 • Smaller and more compact than Cray 1 • 4 (or 8) processors • Total immersion liquid cooling
Cray Y-MP • 1988 • 8 proc’s • Achieved 1 Gflop
Cray Y-MP – From the Back [Pictured: power supply and cooling]
Cray C90 • 1992 • 1 Gflop per processor • 8 or more processors
The MPP Boom • 1985: Thinking Machines introduces the Connection Machine CM-1 • 16K single-bit processors, SIMD • Followed by CM-2, CM-200 • Similar machines by MasPar • mid ’80s: hypercubes become successful • Also: Transputers used as building blocks • Early ’90s: big companies join • IBM, Cray
SIMD Array Processors • ’80s favorites • Connection Machine • MasPar • Very many single-bit processors with attached memory – proprietary hardware • Single control unit: everything is totally synchronized (SIMD = single instruction, multiple data) • Massive parallelism even with “correct counting” (i.e. dividing the processor count by the 32-bit word width)
Connection Machine CM-2 • Cube of 64K proc’s • Acts as backend • Hypercube topology • Data vault for parallel I/O
Hypercubes • Early ’80s: Caltech 64-node Cosmic Cube • Mid to late ’80s: Commercialized by several companies • Intel iPSC, iPSC/2, iPSC/860 • nCUBE, nCUBE 2 (later turned into a VoD server…) • Early ’90s: replaced by mesh/torus • Intel Paragon – i860 processors • Cray T3D, T3E – Alpha processors
Transputers • A microprocessor with built-in support for communication • Programmed using Occam • Used in Meiko and other systems

PAR
  SEQ
    x := 13
    c ! x
  SEQ
    c ? y
    z := y    -- z is 13

Synchronous communication: an assignment across processes
Attack of the Killer Micros • Commodity microprocessors advance at a faster rate than vector processors • Takeover point was around year 2000 • Even before that, using many together could provide lots of power • 1992: TMC uses SPARC in CM-5 • 1992: Intel uses i860 in Paragon • 1993: IBM SP uses RS/6000, later PowerPC • 1993: Cray uses Alpha in T3D • Berkeley NoW project
Connection Machine CM-5 • 1992 • SPARC-based • Fat-tree network • Dominant in early ’90s • Featured in Jurassic Park • Support for gang scheduling!
Intel Paragon • 1992 • 2 i860 proc’s per node: • Compute • Communication • Mesh interconnect with spiffy display
Cray T3D/T3E • 1993 – Cray T3D • Uses commodity microprocessors (DEC Alpha) • 3D Torus interconnect • 1995 – Cray T3E
IBM SP • 1993 • 16 RS/6000 processors per rack • Each runs AIX (full Unix) • Multistage network • Flexible configurations • First large IUCC machine
Berkeley NoW • The building is the computer • Just need some glue software…
Not Everybody is Convinced… • Japan’s computer industry continues to build vector machines • NEC • SX series of supercomputers • Hitachi • SR series of supercomputers • Fujitsu • VPP series of supercomputers • Albeit with less style
More Recent History • 1994 – 1995 slump • Cold war is over • Thinking Machines files for Chapter 11 • Kendall Square Research files for Chapter 11 • Late ’90s much better • IBM, Cray retain parallel machine market • Later also SGI, Sun, especially with SMPs • ASCI program is started • 21st century: clusters take over • Based on SMPs
SMPs • Machines with several CPUs • Initially small scale: 8-16 processors • Later achieved large scale of 64-128 processors • Global shared memory accessed via a bus • Hard to scale further due to shared memory and cache coherence
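A minimal sketch of the shared-memory programming style these machines support, written in C with standard OpenMP (the array size and the reduction example are arbitrary choices for illustration): all threads read and write one global address space, and the cache-coherence hardware keeps each CPU’s cached copies consistent.

#include <stdio.h>
#include <omp.h>

#define N 1000000

/* All threads work on the same array in shared memory; on an SMP the
 * bus-based cache-coherence protocol keeps the CPUs' caches consistent. */
int main(void) {
    static double a[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        a[i] = 1.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f, threads = %d\n", sum, omp_get_max_threads());
    return 0;
}

(Compile with an OpenMP-enabled compiler, e.g. cc -fopenmp.)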
SGI Challenge • 1 to 16 processors • Bus interconnect • Dominated low end of Top500 list in mid ’90s • Not only graphics…
SGI Origin • MIPS processors • Remote memory access [Pictured: an Origin 2000 installed at IUCC]
Architectural Convergence • Shared memory used to be uniform (UMA) • Based on bus or crossbar • Conventional load/store operations • Distributed memory used message passing • Newer machines support remote memory access • Nonuniform (NUMA): access to remote memory costs more • Put/get operations (but handled by NIC) • Cray T3D/T3E, SGI Origin 2000/3000
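A minimal sketch of put/get-style remote memory access, using MPI’s standard one-sided operations (MPI_Win_create, MPI_Put, MPI_Win_fence). This illustrates the programming model only; machines like the T3E or Origin also have their own native interfaces, and there the NIC or memory system carries out the transfer without involving the remote CPU.

#include <mpi.h>
#include <stdio.h>

/* Each rank exposes one integer as a "window"; rank 0 writes directly
 * into rank 1's window with a one-sided put (no receive on rank 1).
 * Run with at least 2 processes. */
int main(int argc, char **argv) {
    int rank, buf = -1;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Win_create(&buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);              /* open the access epoch          */
    if (rank == 0) {
        int value = 42;
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);              /* close the epoch, data visible  */

    if (rank == 1)
        printf("rank 1 sees buf = %d\n", buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}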
The ASCI Program • 1996: nuclear test ban leads to need for simulation of nuclear explosions • Accelerated Strategic Computing Initiative: Moore’s law not fast enough… • Budget of a billion dollars
The Vision [Diagram: performance over time, contrasting ASCI requirements with market-driven progress; other labels: PathForward, technology transfer]
ASCI Milestones • 1996 – ASCI Red: 1 TF Intel • 1998 – ASCI Blue Mountain: 3 TF • 1998 – ASCI Blue Pacific: 3 TF • 2001 – ASCI White: 10 TF • 2003 – ASCI Purple: 30 TF? so far two thirds delivered
The ASCI Red Machine • 9260 processors – PentiumPro 200 • Arranged as 4-way SMPs in 86 cabinets • 573 GB memory total • 2.25 TB disk space total • 2 miles of cables • 850 KW peak power consumption • 44 tons (+300 tons air conditioning equipment) • Cost: $55 million
Clusters vs. MPPs • Mix and match approach • PCs/SMPs/blades used as processing nodes • Fast switched network for interconnect • Linux on each node • MPI for software development • Something for management • Lower cost to set up • Non-trivial to operate effectively
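On the software side, a minimal sketch of the message-passing style used on such clusters, with standard MPI point-to-point calls (typically compiled with mpicc and launched with mpirun; details vary by MPI implementation):

#include <mpi.h>
#include <stdio.h>

/* Each non-zero rank sends its rank number to rank 0, which prints them.
 * This is the basic two-sided message-passing model used on clusters. */
int main(int argc, char **argv) {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        for (int src = 1; src < size; src++) {
            int msg;
            MPI_Recv(&msg, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 0 received %d from rank %d\n", msg, src);
        }
    } else {
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}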
SMP Nodes • PCs, workstations, or servers with several CPUs • Small scale (4-8) used as nodes in MPPs or clusters • Access to shared memory via shared L2 cache • SMP support (cache coherence) built into modern microprocessors