Parallel Computer Organization and Design EDA282
Why Study Parallel Computers? • Almost ALL computers are now parallel • Understanding hardware is important for producing good software (converse also true!) • It’s fun!
Logistics • EL43 1:15-3:00 T/Th (often 1:15 F, too) • Expected participation • Attend lectures, participate in discussion • Complete labs (including a satisfactory writeup); dates/times TBD • Read papers • Complete quizzes • Write (short) survey article (in teams) • Finish (short) take-home exam • Canvas course-management system • https://canvas.instructure.com/courses/777378 • Link: http://www.cse.chalmers.se/~mckee/eda282
Personnel • Prof. Sally McKee • Office hours: arrange meetings via email • Available for discussions after class • mckee@chalmers.se • Jacob Lidman • lidman@chalmers.se
Course Materials • “Parallel Computer Organization and Design” by Dubois, Annavaram, Stenström (at Cremona) • Research and survey papers (linked to web page)
Course Structure/Contents • Intro today • Programming models • Data parallelism • Shared address spaces • Message passing • Hybrid • Design principles/tradeoffs(this is the bulk of the material) • Small-scale systems • Scalable systems • Interconnects
For Each Big Topic, We’ll Discuss . . . • History • How concepts originated in old machines • How they show up in current machines • Basics required in any parallel machine • Memory coherence • Communication • Synchronization
How Did We Get Here? • Transistor count doubling every ~2 years • Transistor feature sizes shrinking • Costs changing • Clock speeds hitting limits • Parallelism per processor increasing • Looking at trends is important when designing new systems!
Costs of Parallel Machines Things to keep in mind when designing a machine . . . • What does it cost to design the mechanism? • What does it cost to verify? • What does it cost to manufacture? • What does it cost to test? • What does it cost to program it? • What does it cost to deploy (turn on)? • What does it cost to keep it running? (power costs, maintenance) • What does it cost to use it? • What does it cost to dispose of it at the end of its lifetime? (how long is a "lifetime"?)
Interesting Questions (i.e., course content) • What do we mean by parallel? • Task parallelism (SPMD, MPMD) • Data parallelism (SIMD) • Thread parallelism (Hyperthreading, SMT) • How do the processors coordinate their work? • Shared memory/message passing • Interconnection network (at least one!) • Synchronization primitives • Many combinations/variations • What’s the best way to put these pieces together? • What do you want to run? • How fast do you have to run it? • How much can you spend? • How much energy can you use?
History • Pascal adding machine, 1642 • Leibniz adder/multiplier, ~1670 • Babbage analytical engine, 1837 (punch cards, memory, printer!) • Hollerith punch cards, 1890 (used for US census data) • Aiken digital computer, 1940s (Harvard) • Von Neumann stored-program computer, 1945 • Eckert/Mauchly ENIAC GP computer, 1946
Evolution of Electronic Computers • Vacuum tubes replaced by transistors, late 1950s • Smaller, faster, more versatile logic elements • Lower power • Longer lifetime • Integrated Circuits, late 1960s • Many transistors fabricated on silicon substrate • Wires plated in place • Lower price • Smaller size • Lower failure rate • LSI/VLSI/microprocessors, 1970s • 1000s of interconnected transistors etched into silicon • Could check 8 switches at once → 8-bit “byte”
History of Supercomputers • IBM 7030 Stretch, 1961 • 2K sq. ft. • Fastest computer in the world at the time • Slower than expected! • Cost initially $13M, dropped to $8.5M • Instruction pipelining, prefetching/decoding, memory interleaving • CDC 6600, 1964 • Size ~= 4 filing cabinets • Cost $8M ($60M today) • 40MHz, 3 MFLOPS peak • Freon cooled • CPU == 10 FUs, multiple PCBs • 60-bit words/regs
History of Supercomputers (2) • Cray 1, 1976 • 64-bit words • 80 MHz • 136 MFLOPS! • Speed-critical parts placed inside the horseshoe-shaped chassis • 1662 PCBs w/ 144 ICs • 80 sold in 10 years • $5-8M ($25M now)
History of Supercomputers (3) • Cray X-MP, 1982 • Up to 4 CPUs in 1 chassis • Up to 16M 64-bit words (128 MB, all SRAM!) • Up to 32 1.2GB disks • 105 MHz • Up to 800 MFLOPS (200/CPU) • Double memory bandwidth wrt Cray 1 • Cray 2, 1985 • Again ICs packed on logic boards • Again, horseshoe shape • Boards packed so tightly they were submerged in Fluorinert for cooling (see http://archive.computerhistory.org/resources/text/Cray/Cray.Cray2.1985.102646185.pdf) • Up to 8 CPUs, 1.9 GFLOPS • Mainstream software/Unix System V OS
History of Supercomputers (4) • Intel Paragon, 1989 • i860-based • 32- or 64-bit • Up to 4K CPUs • 2D mesh topology, MIMD • Poor memory bandwidth utilization • ASCI Red, 1996 • First to use off-the-shelf CPUs (Pentium Pros, Xeons) • 6K CPUs • Broke 1 TFLOP barrier • Cost $46M ($67M now) • Upgrade had 9298 Xeons for 3.1 TFLOPS • Over 1 MW power!
History of Supercomputers (5) • Hitachi SR2201, 1996 • H-shaped chassis • 2048 CPUs • 600 GFLOPS peak • Other similar machines (many Japanese) • 100s of CPUs • 2D or 3D networks (e.g., Cray torus) • MIMD • Seymour Cray leaves Cray Research • Cray Computer Corp (CCC) • Cray 3: first gallium arsenide chips • Cray 4 failed → bankruptcy • SRC Computers (see http://www.srccomp.com/about/aboutus.asp)
Biggest Machine Today • Sequoia: IBM BlueGene/Q machine at the U.S. Dept. of Energy Lawrence Livermore National Lab
Types of Parallelism • Instruction-Level Parallelism (ILP) • Superscalar issue • Out-of-order execution • Very Long Instruction Word (VLIW) • Thread-Level Parallelism (TLP) • Loop-level • Multithreading • Explicit • Speculative • Simultaneous/Hyperthreading • Task-Level Parallelism • Program-Level Parallelism • Data-Level Parallelism
Parallelism in Sequential Programs
for i = 0 to N-1
  a[(i+1) mod N] := b[i] + c[i];
for i = 0 to N-1
  d[i] := C*a[i];
Iteration:      0     1     …   N-1
Loop 1 writes:  a[1]  a[2]  …   a[0]
Loop 2 reads:   a[0]  a[1]  …   a[N-1]   ← data dependencies between the loops
• Programming model: C (sequential) • Architecture: superscalar • ILP • Communication through registers • Synchronization through pipeline interlocks
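The example above can be written out as ordinary C. The sketch below is illustrative only (N, C, and the array contents are made-up values, not from the slides); the comments note where ILP is and is not available.

/* Sequential baseline of the two example loops (illustrative sketch). */
#include <stdio.h>
#define N 1024                      /* assumed problem size */

int main(void) {
    static double a[N], b[N], c[N], d[N];
    const double C = 3.0;           /* assumed constant */
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2 * i; }

    /* Loop 1: iterations are mutually independent, so a superscalar,
     * out-of-order core can overlap them (ILP through registers). */
    for (int i = 0; i < N; i++)
        a[(i + 1) % N] = b[i] + c[i];

    /* Loop 2: each d[i] depends on the a[i] produced by iteration
     * (i - 1 + N) mod N of loop 1; in a single sequential program this
     * ordering is respected automatically (the slide's "synchronization
     * through pipeline interlocks"). */
    for (int i = 0; i < N; i++)
        d[i] = C * a[i];

    printf("d[0] = %f\n", d[0]);
    return 0;
}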
Parallel Programming Models • Extend semantics to express • Units of parallelism • Instructions • Threads • Programs • Communication and coordination between units via • Registers • Memory • I/O
Model vs. Architecture
[Layered figure: parallel applications (CAD, databases, scientific modeling, multiprogramming) → programming models (shared address, message passing, data parallel) → compiler or library → communication abstraction (user/system boundary) → operating system support (hardware/software boundary) → communication hardware → physical communication medium]
• Communication abstraction supports model • Communication architecture (ISA + comm/sync) implements part of model • Hw/sw boundary defines which parts of comm arch implemented in which
Shared Address Space Model
[Figure: processors P, P, P sharing one Memory]
for_all i = 0 to P-1
  for j = i0[i] to in[i]
    a[(j+1) mod N] := b[j] + c[j];
barrier;
for_all i = 0 to P-1
  for j = i0[i] to in[i]
    d[j] := C*a[j];
• Communication abstraction supported by HW/SW interface • TLP • Communication/coordination among threads via shared global address space
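As a concrete (hedged) illustration of this model, here is a minimal shared-address-space version in C with OpenMP. OpenMP is a choice made for this sketch, not something prescribed by the slides, and N and C are again made-up values; the implicit barrier at the end of the first "omp for" plays the role of the explicit barrier in the pseudocode.

/* Shared-address-space sketch (OpenMP chosen for illustration). */
#include <stdio.h>
#define N 1024                      /* assumed problem size */

int main(void) {
    static double a[N], b[N], c[N], d[N];   /* shared by all threads */
    const double C = 3.0;                   /* assumed constant */
    for (int i = 0; i < N; i++) { b[i] = i; c[i] = 2 * i; }

    #pragma omp parallel
    {
        #pragma omp for                     /* loop 1 split across threads */
        for (int j = 0; j < N; j++)
            a[(j + 1) % N] = b[j] + c[j];
        /* Implicit barrier here: every a[] element is written before
         * any thread starts loop 2 (the "barrier" in the pseudocode). */
        #pragma omp for                     /* loop 2 split across threads */
        for (int j = 0; j < N; j++)
            d[j] = C * a[j];
    }
    printf("d[0] = %f\n", d[0]);
    return 0;
}

Compile with an OpenMP-aware compiler (e.g., gcc -fopenmp); without the flag the pragmas are ignored and the code runs sequentially.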
Message Passing Model
for_all i = 0 to P-1
  for j = i0[i] to in[i]
    index = (j+1) mod N;
    a[index] := b[j] + c[j];
    if j = in[i] then send(a[index], (j+1) mod P, a[j]);
  end_for
barrier;
for_all i = 0 to P-1
  for j = i0[i] to in[i]
    if j = i0[i] then recv(tmp, (P+j-1) mod P, a[j]);
    d[j] := C * tmp;
  end_for
• Process-level parallelism (separate addr spaces) • Communication/coordination via messages
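A hedged MPI sketch of the same computation follows. MPI is an illustrative choice (the slides only assume some send/recv primitives), the block distribution and the assumption that N divides evenly by the number of ranks are mine, and for simplicity every rank allocates full-size arrays even though it only touches its own block plus one boundary element.

/* Message-passing sketch of the same loops (MPI chosen for illustration). */
#include <mpi.h>
#include <stdio.h>
#define N 1024                      /* assumed problem size */

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    int chunk = N / nprocs;         /* assumes N divisible by nprocs */
    int lo = rank * chunk, hi = lo + chunk;

    double a[N], b[N], c[N], d[N];  /* separate copy in each address space */
    for (int j = 0; j < N; j++) { b[j] = j; c[j] = 2 * j; }
    const double C = 3.0;           /* assumed constant */

    /* Loop 1: each rank computes a[(j+1) mod N] for its own j range. */
    for (int j = lo; j < hi; j++)
        a[(j + 1) % N] = b[j] + c[j];

    /* The last iteration wrote an element owned by the next rank, so
     * exchange boundary values; MPI_Sendrecv avoids deadlock. */
    int next = (rank + 1) % nprocs, prev = (rank + nprocs - 1) % nprocs;
    double boundary = a[hi % N];
    MPI_Sendrecv(&boundary, 1, MPI_DOUBLE, next, 0,
                 &a[lo],    1, MPI_DOUBLE, prev, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Loop 2: now uses only locally held values. */
    for (int j = lo; j < hi; j++)
        d[j] = C * a[j];

    printf("rank %d: d[%d] = %f\n", rank, lo, d[lo]);
    MPI_Finalize();
    return 0;
}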
Data Parallelism (SIMD)
parallel (i:0->N-1) a[(i+1) mod N] := b[i] + c[i];
parallel (i:0->N-1) d[i] := C * a[i];
• Programming model • Operations done in parallel on multiple data elements • Single thread of control • Architectural model • Array of simple, cheap processors w/ little memory • Attached to control proc that issues instructions • Specialized + general comm, cheap sync
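For the data-parallel model, a hedged sketch: a real SIMD machine would broadcast one instruction to an array of processing elements, each holding a slice of the arrays; the closest everyday approximation in C is asking the compiler to vectorize the loops, shown below with OpenMP simd pragmas (again an illustrative choice, with made-up N and C).

/* Data-parallel (SIMD-style) sketch: one logical operation over all elements. */
#include <stdio.h>
#define N 1024                      /* assumed problem size */

int main(void) {
    static float a[N], b[N], c[N], d[N];
    const float C = 3.0f;           /* assumed constant */
    for (int i = 0; i < N; i++) { b[i] = (float)i; c[i] = 2.0f * i; }

    /* parallel (i:0->N-1) a[(i+1) mod N] := b[i] + c[i]; */
    #pragma omp simd
    for (int i = 0; i < N; i++)
        a[(i + 1) % N] = b[i] + c[i];

    /* parallel (i:0->N-1) d[i] := C * a[i]; */
    #pragma omp simd
    for (int i = 0; i < N; i++)
        d[i] = C * a[i];

    printf("d[1] = %f\n", d[1]);
    return 0;
}

(Compile with -fopenmp-simd or -fopenmp; without the flag the pragmas are simply ignored and the loops run sequentially.)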
Coarser-Grain Data Parallelism • Single-Program Multiple-Data (SPMD) • More broadly applicable than SIMD
Creating a Parallel Program • ID work that can be done in parallel • Computation • Data access • I/O • Partition work/data among entities • Processes • Threads • Manage data access, comm, sync Speedup(P) = Performance(P)/Performance(1) = Time(1)/Time(P)
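For example (numbers invented purely for illustration): if the sequential program takes Time(1) = 120 s and the 8-process version takes Time(8) = 20 s, then Speedup(8) = 120/20 = 6, i.e., a parallel efficiency of 6/8 = 75%.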
Steps • Decomposition • Assignment • Orchestration • Mapping • Can be done by • Programmer • Compiler • Runtime • Hardware (speculatively)
Parallelization
[Figure: Sequential Computation → Decomposition → Tasks → Assignment → Processes (P0–P3) → Orchestration → Parallel Program → Mapping → Processors; decomposition and assignment are architecture independent, orchestration and mapping are architecture dependent]
Concepts • Task • Arbitrary piece of work from computation • Sequentially executed • Could be fine- or coarse-grained • Process (or thread) • What gets executed by a core • Abstract entity that performs tasks assigned to it • Processes comm & sync to perform tasks • Processor (core) • Physical engine on which processes run • Virtualized machine view for programmer
Decomposition • Purpose: Break up computation into tasks to be divided among processes • Tasks may become available dynamically • Number of available tasks may vary with time • i.e., identify concurrency and decide level at which to exploit it • Goal: keep processes busy, but keep management reasonable • Number of tasks creates upper bound on speedup • Too many tasks require too much coordination
Assignment • Specify mechanism to divide work among processes • Strive for balance • Reduce communication, management • Structured approach recommended • Inspect code • Apply well known heuristics • Programmer focuses on decomp/assign 1st • Largely independent of architecture/programming model • Choice of primitives (cost/complexity) affects decisions • Architects assume program(mer) does decent job
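As a hedged sketch of two common static assignment policies (N, P, and the function names below are illustrative, not from the slides):

/* Block vs. cyclic assignment of N loop iterations to P processes. */
#include <stdio.h>
#define N 1024                      /* assumed number of tasks/iterations */
#define P 4                         /* assumed number of processes */

/* Block: process p gets one contiguous chunk (good locality). */
static void block_range(int p, int *lo, int *hi) {
    int chunk = (N + P - 1) / P;    /* ceiling division */
    *lo = p * chunk;
    *hi = (*lo + chunk < N) ? *lo + chunk : N;
}

int main(void) {
    for (int p = 0; p < P; p++) {
        int lo, hi;
        block_range(p, &lo, &hi);
        printf("block:  process %d gets iterations [%d, %d)\n", p, lo, hi);

        /* Cyclic: process p gets p, p+P, p+2P, ... (better balance when
         * per-iteration cost varies systematically with the index). */
        int count = 0;
        for (int i = p; i < N; i += P) count++;
        printf("cyclic: process %d gets %d interleaved iterations\n", p, count);
    }
    return 0;
}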
Orchestration • Purpose • Name data, structure comm/sync • Organize data structures, schedule tasks (temporally) • Goals • Reduce costs of comm/sync from processor POV • Improve data locality • Reduce overhead of managing parallelism • Choices depend heavily on comm abstraction, efficiency of primitives • Architects must provide appropriate, efficient primitives
Mapping • Two aspects • Which processes to run on same processor • Which process runs on which processor • One extreme: space-sharing • Partition machine s.t. only 1 app at a time in a subset • Pin processes to cores (or let OS balance workloads) • Another extreme • Control complete resource management in OS • Use performance techniques for dynamic balancing • Real world is between the two • User specifies desires in some aspects • System may ignore
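One concrete way to "pin processes to cores" is sketched below. This is a Linux/glibc-specific illustration (pthread_setaffinity_np, sched_getcpu), not part of the course material; error handling is omitted and core 2 is an arbitrary choice.

/* Pin the calling thread to one core (Linux/glibc-specific sketch). */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

int main(void) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(2, &set);               /* core 2, chosen arbitrarily */
    pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
    printf("now running on core %d\n", sched_getcpu());
    return 0;
}

(Compile with -pthread; on other operating systems the equivalent affinity APIs differ.)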
High-Level Goals • High performance • Low resource usage • Low development effort • Low power consumption • Implications for algorithm designers and architects • Algorithm designers: high-performance, low resource needs • Architects: high-performance, low cost, reduced programming effort