Parallel Computing Systems, Part I: Introduction. Dror Feitelson, Hebrew University
Topics • Overview of the field • Architectures: vectors, MPPs, SMPs, and clusters • Networks and routing • Scheduling parallel jobs • Grid computing • Evaluating performance
Today (and next week?) • What is parallel computing • Some history • The Top500 list • The fastest machines in the world • Trends and predictions
What is a Parallel System? In particular, what is the difference between parallel and distributed computing?
What is a Parallel System? Chandy: it is related to concurrency. • In distributed computing, concurrency is part of the problem. • In parallel computing, concurrency is part of the solution.
Distributed Systems • Concurrency because of physical distribution • Desktops of different users • Servers across the Internet • Branches of a firm • Central bank computer and ATMs • Need to coordinate among autonomous systems • Need to tolerate failures and disconnections
Parallel Systems • High-performance computing: solve problems that are too big for a single machine • Get the solution faster (weather forecast) • Get a better solution (physical simulation) • Need to parallelize algorithm • Need to control overhead • Can assume friendly system?
The Convergence Use distributed resources for parallel processing • Networks of workstations – use available desktop machines within organization • Grids – use available resources (servers?) across organizations • Internet computing – use personal PCs across the globe (SETI@home)
Early HPC • Parallel systems in academia/research • 1974: C.mmp • 1974: Illiac IV • 1978: Cm* • 1983: Goodyear MPP
Illiac IV • 1974 • SIMD: all processors execute the same instruction • Numerical calculations at NASA • Now in the Boston Computer Museum
The Illiac IV in Numbers • 64 processors arranged as an 8x8 grid • Each processor has 10^4 ECL transistors • Each processor has 2K 64-bit words of memory (total is 8 Mbit) • Arranged in 2^10 boards • Packed in 16 cabinets • 500 Mflops peak performance • Cost: $31 million
Sustained vs. Peak • Peak performance: product of clock rate and number of functional units ("a rate that the vendor guarantees will not be exceeded") • Sustained rate: what you actually achieve on a real application • Sustained is typically much lower than peak • Application does not require all functional units • Need to wait for data to arrive from memory • Need to synchronize • Best (closest to peak) for dense matrix operations (Linpack)
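As a back-of-the-envelope sketch of this distinction, the small C program below uses the Cray 1 figures quoted in these slides (80 MHz clock, two chained floating-point results per tick); the 30% efficiency factor is a purely hypothetical assumption for illustration, not a measured number.

#include <stdio.h>

int main(void) {
    /* Peak = clock rate x results per clock tick
     * (e.g. Cray 1: 80 MHz, add + multiply chained -> 2 results/tick) */
    double clock_mhz = 80.0;          /* clock frequency in MHz                  */
    double results_per_cycle = 2.0;   /* functional-unit results per clock tick  */
    double peak_mflops = clock_mhz * results_per_cycle;

    /* Sustained is what a real application achieves;
     * assume 30% efficiency here (hypothetical figure). */
    double assumed_efficiency = 0.30;
    double sustained_mflops = peak_mflops * assumed_efficiency;

    printf("peak      = %.0f Mflops\n", peak_mflops);
    printf("sustained = %.0f Mflops (at %.0f%% efficiency)\n",
           sustained_mflops, assumed_efficiency * 100.0);
    return 0;
}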
Early HPC • Parallel systems in academia/research • 1974: C.mmp • 1974: Illiac IV • 1978: Cm* • 1983: Goodyear MPP • Vector systems by Cray and Japanese firms • 1976: Cray 1 rated at 160 Mflops peak • 1982: Cray X-MP, later Y-MP, C90, … • 1985: Cray 2, NEC SX-2
Cray’s Achievements • Architectural innovations • Vector operations on vector registers • All memory is equally close: no cache • Trade off accuracy and speed • Packaging • Short and equally long wires • Liquid cooling systems • Style
Vector Supercomputers • Vector registers store vectors of values for fast access • Vector instructions operate on whole vectors of values • Overhead of instruction decode is paid only once per vector • Pipelined execution of the instruction on the vector elements: one result per clock tick (at least once the pipeline is full) • Possible to chain vector operations: start feeding the second functional unit before the first one finishes
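To make the vector model concrete, here is the kind of loop these machines were built for, written as a plain C sketch (the function name is illustrative). A vectorizing compiler turns the loop into vector load, vector multiply, vector add, and vector store instructions covering, say, 64 elements each; chaining lets the adder start consuming products while the multiplier is still producing them.

/* y = a*x + y over n elements: a classic vectorizable loop (DAXPY).
 * On a vector supercomputer this becomes a handful of vector
 * instructions, each processing a whole vector register of elements,
 * with the multiply and add units chained. */
void daxpy(int n, double a, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}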
Cray 1 • 1975 • 80 MHz clock • 160 Mflops peak • Liquid cooling • World’s most expensive love seat • Power supply and cooling under the seat • Available in red, blue, black… • No operating system
Cray 1 Wiring • Round configuration for small and uniform distances • Longest wire: 4 feet • Wires connected manually by extra-small engineers
Cray X-MP • 1982 • 1 Gflop • Multiprocessor with 2 or 4 Cray1-like processors • Shared memory
Cray 2 • 1985 • Smaller and more compact than Cray 1 • 4 (or 8) processors • Total immersion liquid cooling
Cray Y-MP • 1988 • 8 proc’s • Achieved 1 Gflop
Cray Y-MP – From the Back [Pictured: power supply and cooling]
Cray C90 • 1992 • 1 Gflop per processor • 8 or more processors
The MPP Boom • 1985: Thinking Machines introduces the Connection Machine CM-1 • 16K single-bit processors, SIMD • Followed by CM-2, CM-200 • Similar machines by MasPar • mid ’80s: hypercubes become successful • Also: Transputers used as building blocks • Early ’90s: big companies join • IBM, Cray
SIMD Array Processors • ’80s favorites • Connection Machine • MasPar • Very many single-bit processors with attached memory – proprietary hardware • Single control unit: everything is totally synchronized (SIMD = single instruction, multiple data) • Massive parallelism even with “correct counting” (i.e. dividing the processor count by the 32-bit word width)
Connection Machine CM-2 • Cube of 64K proc’s • Acts as backend • Hypercube topology • Data vault for parallel I/O
Hypercubes • Early ’80s: Caltech 64-node Cosmic Cube • Mid to late ’80s: Commercialized by several companies • Intel iPSC, iPSC/2, iPSC/860 • nCUBE, nCUBE 2 (later turned into a VoD server…) • Early ’90s: replaced by mesh/torus • Intel Paragon – i860 processors • Cray T3D, T3E – Alpha processors
Transputers • A microprocessor with built-in support for communication • Programmed using Occam • Used in Meiko and other systems

PAR
  SEQ
    x := 13
    c ! x
  SEQ
    c ? y
    z := y    -- z is 13

Synchronous communication: an assignment across processes
Attack of the Killer Micros • Commodity microprocessors advance at a faster rate than vector processors • Takeover point was around year 2000 • Even before that, using many together could provide lots of power • 1992: TMC uses SPARC in CM-5 • 1992: Intel uses i860 in Paragon • 1993: IBM SP uses RS/6000, later PowerPC • 1993: Cray uses Alpha in T3D • Berkeley NoW project
Connection Machine CM-5 • 1992 • SPARC-based • Fat-tree network • Dominant in early ’90s • Featured in Jurassic Park • Support for gang scheduling!
Intel Paragon • 1992 • 2 i860 proc’s per node: • Compute • Communication • Mesh interconnect with spiffy display
Cray T3D/T3E • 1993 – Cray T3D • Uses commodity microprocessors (DEC Alpha) • 3D Torus interconnect • 1995 – Cray T3E
IBM SP • 1993 • 16 RS/6000 processors per rack • Each runs AIX (full Unix) • Multistage network • Flexible configurations • First large IUCC machine
Berkeley NoW • The building is the computer • Just need some glue software…
Not Everybody is Convinced… • Japan’s computer industry continues to build vector machines • NEC • SX series of supercomputers • Hitachi • SR series of supercomputers • Fujitsu • VPP series of supercomputers • Albeit with less style
More Recent History • 1994 – 1995 slump • Cold war is over • Thinking Machines files for Chapter 11 • Kendall Square Research files for Chapter 11 • Late ’90s much better • IBM, Cray retain parallel machine market • Later also SGI, Sun, especially with SMPs • ASCI program is started • 21st century: clusters take over • Based on SMPs
SMPs • Machines with several CPUs • Initially small scale: 8-16 processors • Later achieved large scale of 64-128 processors • Global shared memory accessed via a bus • Hard to scale further due to shared memory and cache coherence
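A minimal sketch of the shared-memory programming style these machines support, written in C with standard OpenMP (the array size and the reduction example are arbitrary choices for illustration): all threads read and write one global address space, and the cache-coherence hardware keeps each CPU’s cached copies consistent.

#include <stdio.h>
#include <omp.h>

#define N 1000000

/* All threads work on the same array in shared memory; on an SMP the
 * bus-based cache-coherence protocol keeps the CPUs' caches consistent. */
int main(void) {
    static double a[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        a[i] = 1.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f, threads = %d\n", sum, omp_get_max_threads());
    return 0;
}

(Compile with an OpenMP-enabled compiler, e.g. cc -fopenmp.)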
SGI Challenge • 1 to 16 processors • Bus interconnect • Dominated low end of Top500 list in mid ’90s • Not only graphics…
SGI Origin • MIPS processors • Remote memory access [Pictured: an Origin 2000 installed at IUCC]
Architectural Convergence • Shared memory used to be uniform (UMA) • Based on bus or crossbar • Conventional load/store operations • Distributed memory used message passing • Newer machines support remote memory access • Nonuniform (NUMA): access to remote memory costs more • Put/get operations (but handled by NIC) • Cray T3D/T3E, SGI Origin 2000/3000
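A minimal sketch of put/get-style remote memory access, using MPI’s standard one-sided operations (MPI_Win_create, MPI_Put, MPI_Win_fence). This illustrates the programming model only; machines like the T3E or Origin also have their own native interfaces, and there the NIC or memory system carries out the transfer without involving the remote CPU.

#include <mpi.h>
#include <stdio.h>

/* Each rank exposes one integer as a "window"; rank 0 writes directly
 * into rank 1's window with a one-sided put (no receive on rank 1).
 * Run with at least 2 processes. */
int main(int argc, char **argv) {
    int rank, buf = -1;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Win_create(&buf, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);              /* open the access epoch          */
    if (rank == 0) {
        int value = 42;
        MPI_Put(&value, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);              /* close the epoch, data visible  */

    if (rank == 1)
        printf("rank 1 sees buf = %d\n", buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}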
The ASCI Program • 1996: nuclear test ban leads to need for simulation of nuclear explosions • Accelerated Strategic Computing Initiative: Moore’s law not fast enough… • Budget of a billion dollars
The Vision [Diagram: performance over time, contrasting ASCI requirements with market-driven progress; other labels: PathForward, technology transfer]
ASCI Milestones • 1996 – ASCI Red: 1 TF Intel • 1998 – ASCI Blue Mountain: 3 TF • 1998 – ASCI Blue Pacific: 3 TF • 2001 – ASCI White: 10 TF • 2003 – ASCI Purple: 30 TF? so far two thirds delivered
The ASCI Red Machine • 9260 processors – PentiumPro 200 • Arranged as 4-way SMPs in 86 cabinets • 573 GB memory total • 2.25 TB disk space total • 2 miles of cables • 850 KW peak power consumption • 44 tons (+300 tons air conditioning equipment) • Cost: $55 million
Clusters vs. MPPs • Mix and match approach • PCs/SMPs/blades used as processing nodes • Fast switched network for interconnect • Linux on each node • MPI for software development • Something for management • Lower cost to set up • Non-trivial to operate effectively
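On the software side, a minimal sketch of the message-passing style used on such clusters, with standard MPI point-to-point calls (typically compiled with mpicc and launched with mpirun; details vary by MPI implementation):

#include <mpi.h>
#include <stdio.h>

/* Each non-zero rank sends its rank number to rank 0, which prints them.
 * This is the basic two-sided message-passing model used on clusters. */
int main(int argc, char **argv) {
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        for (int src = 1; src < size; src++) {
            int msg;
            MPI_Recv(&msg, 1, MPI_INT, src, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 0 received %d from rank %d\n", msg, src);
        }
    } else {
        MPI_Send(&rank, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}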
SMP Nodes • PCs, workstations, or servers with several CPUs • Small scale (4-8) used as nodes in MPPs or clusters • Access to shared memory via shared L2 cache • SMP support (cache coherence) built into modern microprocessors