Computer Architecture Principles. Dr. Mike Frank. CDA 5155 (UF) / CA 714-R (NTU), Summer 2003. Module #34: Introduction to Multiprocessing
Introduction • Application Domains • Symmetric Shared Memory Architectures & their performance • Distributed Shared Memory Architectures & their performance • Synchronization • Memory consistency • Multithreading • Crosscutting Issues • Example: Sun Wildfire • Multithreading example • Embedded multiprocessors • Fallacies & Pitfalls • Concluding remarks • Historical perspective (H&P chapter 6: Multiprocessing). But I will begin with some of my own material on the cost-efficiency and scalability of physically realistic parallel architectures.
Capacity Scaling – Some History • How can we increase the size & complexity of the computations that can be performed? Quantified as the number of bits of memory required. • Capacity scaling models (listed roughly from worst to best): • Finite State Machines (a.k.a. Deterministic Finite Automata): increasing the bits of state → exponential increase in the number of states & transitions and the size of the state-transition table. Infeasible to scale to a large # of bits: complex design, unphysical. • Uniprocessor (serial) models: Turing machine, von Neumann machine (RAM machine) (1940s). Leave processor complexity constant, just add more memory! But this is not a cost-effective way to scale capacity! • Multiprocessor models: von Neumann's Cellular Automaton (CA) models (1950s). Keep individual processors simple, just have more of them. Design complexity stays manageable; scale the amount of processing & memory together.
Why Multiprocessing? • A pretty obvious idea: any given serial processor has a maximum speed, say X operations/second. Therefore, N such processors together have a larger total maximum raw performance than this, namely N·X operations per second. • If a computational task can be divided among these N processors, we may reduce its execution time by some speedup factor S. • Usually S is at least slightly less than N, due to overheads. The exact factor depends on the nature of the application. • In extreme cases, the speedup factor may be much less than N, or even 1 (no speedup).
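To see why the speedup factor S typically falls below N, here is a minimal C sketch under a simple Amdahl-style model; the serial fraction and the per-processor coordination overhead are invented parameters for illustration, not figures from the slides.

```c
#include <stdio.h>

/* Minimal sketch: estimated speedup of an N-processor run under a simple,
 * made-up overhead model (serial_frac and ovh_per_proc are illustrative
 * parameters, not measurements). */
static double speedup(int n, double serial_frac, double ovh_per_proc)
{
    /* Amdahl-style estimate: the serial part stays, the parallel part divides
     * by n, and a coordination overhead grows with n. */
    double t_n = serial_frac + (1.0 - serial_frac) / n + ovh_per_proc * n;
    return 1.0 / t_n;   /* relative to a serial time of 1.0 */
}

int main(void)
{
    for (int n = 1; n <= 64; n *= 2)
        printf("N = %2d  ->  speedup ~ %.2f\n", n, speedup(n, 0.05, 0.001));
    return 0;
}
```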
Multiprocessing & Cost-Efficiency • For a given application, which is more cost-effective, a uniprocessor or a multiprocessor? • N-processor system cost, 1st-order approximation: C_N = C_fixed + N·C_proc. • N-processor execution time: T_N = T_ser / (N·eff_N), where T_ser is the serial execution time and eff_N is the parallel efficiency. • Focus on overall cost-per-performance (est.): (C/P)_N = C_N · T_N. This measures the cost to rent a machine for the job (assuming a fixed depreciation lifetime).
Cost-Efficiency Cont. • Uniprocessor cost/performance: (C/P)_uni = (C_fixed + C_proc) · T_ser. • N-way multiprocessor cost/performance: (C/P)_N = (C_fixed + N·C_proc) · T_ser / (N·eff_N). • The multiprocessor wins if and only if: (C/P)_N < (C/P)_uni, i.e. eff_N > (1/N + r) / (1 + r), equivalently r < (eff_N - 1/N) / (1 - eff_N), where r = C_proc / C_fixed. • Pick N to maximize (1 + r)·N·eff_N / (1 + N·r).
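As a quick sanity check on these formulas, here is a small C sketch that evaluates both sides of the comparison; the costs, the serial time, and the efficiency curve eff(N) are made-up illustrative values, not numbers from the lecture.

```c
#include <stdio.h>

/* Numeric sketch of the cost/performance formulas above.  All constants
 * (C_fixed, C_proc, T_ser) and the efficiency model eff(N) are invented. */
int main(void)
{
    double c_fixed = 1000.0;   /* fixed system cost: chassis, I/O, ...  */
    double c_proc  = 250.0;    /* incremental cost per processor        */
    double t_ser   = 3600.0;   /* serial execution time, seconds        */
    double r = c_proc / c_fixed;

    double cp_uni = (c_fixed + c_proc) * t_ser;              /* (C/P)_uni */

    for (int n = 1; n <= 64; n *= 2) {
        double eff  = 1.0 / (1.0 + 0.02 * (n - 1));          /* assumed eff_N  */
        double cp_n = (c_fixed + n * c_proc) * t_ser / (n * eff); /* (C/P)_N   */
        double need = (1.0 / n + r) / (1.0 + r);             /* break-even eff */
        printf("N=%2d  eff_N=%.2f (need > %.2f)  (C/P)_N=%8.0f  -> %s\n",
               n, eff, need, cp_n,
               cp_n < cp_uni ? "multiproc wins" : "uniproc wins");
    }
    return 0;
}
```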
Parallelizability • An application or algorithm is parallelizable to the extent that adding more processors can reduce its execution time. • A parallelizable application is: • Communication-intensive if its performance is primarily limited by communication latencies. • Requires a tightly-coupled parallel architecture. • Means, low communication latencies between CPUs • Computation-intensive if its performance is primarily limited by speeds of individual CPUs. • May use a loosely-coupled parallel architecture • Loose coupling may even help! (b/c of heat removal.)
Performance Models • For a given architecture, a performance model of the architecture is: • an abstract description of the architecture that allows one to predict what the execution time of given parallel programs will be on that architecture. • Naïve performance models might make dangerous simplifying assumptions, such as: • Any processor will be able to access shared memory at the maximum bandwidth at any time. • A message from any processor to any other will arrive within n seconds. • Watch out! Such assumptions may be flawed...
Classifying Parallel Architectures • What’s parallelized? Instructions / data / both? • SISD: Single Instruction, Single Data (uniproc.) • SIMD: Single Instruction, Multiple Data (vector) • MIMD: Multiple Instruction, Mult. Data (multprc.) • MISD: (Special purpose stream processors) • Memory access architectures: • Centralized shared memory (fig. 6.1) • Uniform Memory Access (UMA) • Distributed shared memory (fig. 6.2) • Non-Uniform Memory Access (NUMA) • Distributed, non-shared memory • Message Passing Machines / Multicomputers / Clusters
Centralized Shared Memory • A.k.a. symmetric multiprocessor. A typical example architecture. • Typically, only 2 to a few dozen processors. After this, memory BW becomes very restrictive.
Distributed Shared Memory Advantages: Memory BW scales w. #procs; local mem. latency kept small
DSM vs. Multicomputers • Distributed shared-memory (DSM) architectures: • Although each processor is close to some memory, all processors still share the same address space. • The memory system is responsible for maintaining consistency between each processor's view of the address space. • Distributed non-shared memory architectures: • Each processor has its own address space. • Many independent computers → "multicomputer" • COTS computers + network → "cluster" • Processors communicate w. explicit messages. • Can still layer shared-object abstractions on top of this infrastructure via software.
Communications in Multiprocs. • Communications performance metrics: • Node bandwidth – bit-rate in/out of each proc. • Bisection bandwidth – bit-rate between machine halves • Latency – propagation delay across the machine diameter • Tightly coupled (localized) vs. loosely coupled (distributed) multiprocessors: • Tightly coupled: high bisection BW, low latency • Loosely coupled: low bisection BW, high latency • Of course, you can also have a loosely-coupled (wide-area) network of (internally) tightly-coupled clusters.
Shared mem. vs. Message-Passing • Advantages of shared memory: • Straightforward, compatible interfaces (e.g., OpenMP; see the sketch below). • Ease of application programming & compiler design. • Lower comm. overhead for small items • Due to HW support • Use automatic caching to reduce comm. needs • Advantages of message passing: • Hardware is simpler • Communication explicit → easier to understand • Forces programmer to think about comm. costs • Encourages improved design of parallel algorithms • Enables more efficient parallel algs. than automatic caching could ever provide
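For concreteness, a minimal OpenMP example of the shared-memory style referenced above; the array and reduction are just an illustration (compile with an OpenMP-enabled compiler, e.g. gcc -fopenmp).

```c
#include <stdio.h>
#include <omp.h>

/* Shared-memory parallelism: one pragma parallelizes the loop, all threads
 * read the shared array directly, and the coherent memory system (not
 * explicit messages) moves the data. */
#define N 1000000

static double a[N];

int main(void)
{
    for (int i = 0; i < N; i++)
        a[i] = 1.0 / (i + 1);

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)   /* no explicit messages */
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f (using up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}
```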
Scalability & Maximal Scalability • A multiprocessor architecture & accompanying performance model is scalable if: • it can be “scaled up” to arbitrarily large problem sizes, and/or arbitrarily large numbers of processors, without the predictions of the performance model breaking down. • An architecture (& model) is maximally scalable for a given problem if • it is scalable, and if no other scalable architecture can claim asymptotically superior performance on that problem • It is universally maximally scalable (UMS) if it is maximally scalable on all problems! • I will briefly mention some characteristics of architectures that are universally maximally scalable
Universal Maximum Scalability • Existence proof for universally maximally scalable (UMS) architectures: • Physics itself can be considered a universal maximally scalable “architecture” because any real computer is just a special case of a physical system. • So, obviously, no real class of computers can beat the performance of physical systems in general. • Unfortunately, physics doesn’t give us a very simple or convenient programming model. • Comprehensive expertise at “programming physics” means mastery of all physical engineering disciplines: chemical, electrical, mechanical, optical, etc. • We’d like an easier programming model than this!
Simpler UMS Architectures • (I propose) any practical UMS architecture will have the following features: • Processing elements characterized by constant parameters (independent of # of processors) • Mesh-type message-passing interconnection network, arbitrarily scalable in 2 dimensions • w. limited scalability in 3rd dimension. • Processing elements that can be operated in an arbitrarily reversible way, at least, up to a point. • Enables improved 3-d scalability in a limited regime • (In long term) Have capability for quantum-coherent operation, for extra perf. on some probs.
Shared Memory isn’t Scalable • Any implementation of shared memory requires communication between nodes. • As the # of nodes increases, we get: • Extra contention for any shared BW • Increased latency (inevitably). • Can hide communication delays to a limited extent, by latency hiding: • Find other work to do during the latency delay slot. • But the amount of “other work” available is limited by node storage capacity, parallelizability of the set of running applications, etc.
Global Unit-Time Message Passing Isn’t Scalable! • Naïve model: “Any node can pass a message to any other in a single constant-time interval” • independent of the total number of nodes • Has same scaling problems as shared memory • Even if we assume that BW contention (traffic) isn’t a problem, unit-time assumption is still a problem. • Not possible for all N, given speed-of-light limit! • Need cube root of N asymptotic time, at minimum.
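A back-of-the-envelope C calculation of that cube-root bound, assuming a purely illustrative packing density of one node per cubic millimetre:

```c
#include <stdio.h>
#include <math.h>

/* Speed-of-light argument: if N nodes are packed at a fixed density
 * (assumed: 1 node per mm^3), the machine's diameter grows like N^(1/3),
 * and so does the minimum one-way signalling time across it. */
int main(void)
{
    const double c        = 3.0e8;    /* speed of light, m/s             */
    const double node_vol = 1.0e-9;   /* assumed volume per node: 1 mm^3 */

    for (double n = 1e3; n <= 1e15; n *= 1000.0) {
        double side = cbrt(n * node_vol);   /* edge of the bounding cube */
        printf("N = %.0e nodes -> size ~ %6.2f m, crossing time >= %.2e s\n",
               n, side, side / c);
    }
    return 0;
}
```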
Many Interconnect Topologies Aren’t Scalable! • Suppose we don’t require a node can talk to any other in 1 time unit, but only to selected others. • Some such schemes still have scalability problems, e.g.: • Hypercubes, fat hypercubes • Binary trees, fat-trees • Crossbars, butterfly networks • Any topology in which the number of unit-time hops to reach any of N nodes is of order less than N^(1/3) is necessarily doomed to failure! See last year’s exams.
Only Meshes (or subgraphs of meshes) Are Scalable • 1-D meshes • Linear chain, ring, star (w. fixed # of arms) • 2-D meshes • Square grid, hex grid, cylinder, 2-sphere, 2-torus,… • 3-D meshes • Crystal-like lattices, w. various symmetries • Amorphous networks w. local interactions in 3d • An important caveat: • Scalability in 3rd dimension is limited by energy/information I/O considerations! More later… (Vitányi, 1988)
Which Approach Will Win? • Perhaps, the best of all worlds? • Here’s one example of a near-future, parallel computing scenario that seems reasonably plausible: • SMP architectures within the smallest groups of processors on the same chip (chip multiprocessors), sharing a common bus and on-chip DRAM bank. • DSM architectures w. flexible topologies to interconnect larger (but still limited-size) groups of processors in a package-level or board-level network. • Message-passing w. mesh topologies for communication between different boards in a cluster-in-a-box (blade server), or a higher-level conglomeration of machines. • But, what about the heat removal problem?
Landauer’s Principle (famous IBM researcher’s 1961 paper) • We know low-level physics is reversible: • Means, the time-evolution of a state is bijective • Change is deterministic looking backwards in time, as well as forwards • Physical information (like energy) is conserved: • It cannot ever be created or destroyed, • only reversibly rearranged and transformed! • This explains the 2nd Law of Thermodynamics: • Entropy (unknown info.) in a closed, unmeasured system can only increase (as we lose track of its state) • Irreversible bit “erasure” really just moves the bit into the surroundings, increasing entropy & heat
Landauer’s Principle from basic quantum theory • [Figure: states before vs. after bit erasure, showing how the unitary (one-to-one) evolution forces the erased bit’s information into the environment.] • Increase in entropy: ΔS = log 2 = k ln 2. Energy lost to heat: ΔS·T = kT ln 2.
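To attach a number to Landauer’s bound, here is a short C calculation of kT ln 2 at an example temperature of 300 K (room temperature is my chosen example, not a figure from the slide):

```c
#include <stdio.h>
#include <math.h>

/* The Landauer bound kT ln 2, evaluated at an example temperature of 300 K. */
int main(void)
{
    const double k  = 1.380649e-23;    /* Boltzmann constant, J/K         */
    const double T  = 300.0;           /* example temperature (room temp) */
    const double eV = 1.602176634e-19; /* joules per electron-volt        */

    double e = k * T * log(2.0);
    printf("kT ln 2 at %.0f K = %.3g J = %.3g eV per erased bit\n",
           T, e, e / eV);
    return 0;
}
```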
Scaling in 3rd Dimension? • Computing based on ordinary irreversible bit operations only scales in 3d up to a point. • All discarded information & associated energy must be removed thru surface. But energy flux is limited! • Even a single layer of circuitry in a high-performance CPU can barely be kept cool today! • Computing with reversible, “adiabatic” operations does better: • Scales in 3d, up to a point… • Then with square root of further increases in thickness, up to a point. (Scales in 2.5 dimensions!) • Enables much larger thickness than irreversible!
Reversible 3-D Mesh • [Figure: note the differing power laws!]
Cost-Efficiency of Reversibility • Scenario: $1,000, 100-Watt conventional computer, w. 3-year lifetime, vs. reversible computers of the same storage capacity. • [Chart: bit-operations per US dollar over time, for conventional irreversible computing vs. worst-case and best-case reversible computing (roughly ~1,000× and ~100,000× advantages, respectively). All curves would →0 if leakage is not reduced.]
Example Parallel Applications • Computation-intensive (“embarrassingly parallel”) applications: • Factoring large numbers, cracking codes • Combinatorial search & optimization problems: • Find a proof of a theorem, or a solution to a puzzle • Find an optimal engineering design or data model, over a large space of possible design parameter settings • Solving a game-theory or decision-theory problem • Rendering an animated movie • Communication-intensive applications: • Physical simulations (sec. 6.2 has some examples) • Also multiplayer games, virtual work environments • File serving, transaction processing in distributed database systems
Introduction • Application Domains • Symmetric Shared Memory Architectures & their performance • Distributed Shared Memory Architectures & their performance • Synchronization • Memory consistency • Multithreading • Crosscutting Issues • Example: Sun Wildfire • Multithreading example • Embedded multiprocessors • Fallacies & Pitfalls • Concluding remarks • Historical perspective (H&P chapter 6: Multiprocessing)
More about SMPs (6.3) • Caches help reduce each processor’s memory bandwidth • Means many processors can share the total memory BW • Microprocessor-based Symmetric MultiProcessors (SMPs) emerged in the ’80s • Very cost-effective, up to the limit of memory BW • Early SMPs had 1 CPU per board (off a backplane) • Now multiple per board, per MCM, or even per die • The memory system caches both shared and private (local) data • Private data in 1 cache only • Shared data may be replicated
Cache Coherence Problem • Goal: All processors should have a consistent view of the shared memory contents, and how they change. • Or, as nearly consistent as we can manage. • The fundamental difficulty: written information takes time to propagate! • E.g., A writes, then B writes, then A reads (like a WAW hazard): A might see the value from A, instead of the value from B. • A simple, but inefficient solution: have all writes cause all processors to stall (or at least, not perform any new accesses) until all have received the result of the write. • Reads, on the other hand, can be reordered amongst themselves. • But: incurs a worst-case memory stall on each write step! • Can alleviate this by allowing writes to occur only periodically • But this reduces bandwidth for writes • And increases avg. latency for communication through shared memory
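A toy C illustration of the propagation problem: two software “cached” copies of a shared location that nothing keeps coherent, so one processor keeps reading a stale value after the other has written. This is purely a software model of the hazard, not how real coherent hardware behaves.

```c
#include <stdio.h>

int main(void)
{
    int memory = 1;        /* shared location X in main memory     */
    int cacheA = memory;   /* processor A reads X and caches it    */
    int cacheB = memory;   /* processor B reads X and caches it    */

    cacheB = 2;            /* B writes X = 2 ...                   */
    memory = cacheB;       /* ... and the write reaches memory     */

    /* With no invalidate or update, A still sees its old private copy: */
    printf("memory holds %d, but A reads %d\n", memory, cacheA);
    return 0;
}
```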
Another Interesting Method (research by Chris Carothers at RPI) • Maintain a consistent system “virtual time” modeled by all processors. • Each processor asynchronously tracks its local idea of the current virtual time (Local Virtual Time, LVT). • On a write, asynchronously send invalidate messages timestamped with the writer’s LVT. • On receiving an invalidate message stamped earlier than the reader’s LVT: • Roll back the local state to that earlier time • There are efficient techniques for doing this • If timestamped later than the reader’s LVT: • Queue it up until the reader’s LVT reaches that time • (This is an example of speculation.)
Frank-Lewis Rollback Method (Steve Lewis’ MS thesis, UF, 2001: Reversible MIPS Emulator & Debugger) • Fixed-size window limits how far back you can go. • Periodically store checkpoints of machine state • Each checkpoint records the changes needed • to get back to that earlier state from the next checkpoint, • or from the current state if it’s the last checkpoint • Cull out older checkpoints periodically • so the total number stays logarithmic in the size of the window. • Also, store messages received during the time window • To go backwards Δt steps (to time t_old = t_cur - Δt): • Revert machine state to the latest checkpoint preceding time t_old • Apply the changes recorded in the checkpoints from t_cur on backwards • Compute forwards from there to time t_old • The technique is fairly time- and space-efficient
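The following is a much-simplified C sketch of checkpoint-and-replay rollback, not the actual Lewis implementation: it stores full snapshots at a fixed interval and omits the logarithmic culling and the message log, and the toy machine state and step function are invented for illustration.

```c
#include <stdio.h>

#define INTERVAL 8
#define MAXCKPT  64

typedef struct { long t; long regs[4]; } State;

static State ckpt[MAXCKPT];
static int   nckpt;

static void step(State *s)                  /* one deterministic "instruction" */
{
    s->regs[s->t % 4] += s->t;
    s->t++;
}

static void run_to(State *s, long target)   /* execute forward, checkpointing  */
{
    while (s->t < target) {
        if (s->t % INTERVAL == 0 && nckpt < MAXCKPT &&
            (nckpt == 0 || ckpt[nckpt - 1].t < s->t))
            ckpt[nckpt++] = *s;             /* periodic full snapshot          */
        step(s);
    }
}

static void rollback(State *s, long t_old)  /* go back to earlier time t_old   */
{
    while (nckpt > 1 && ckpt[nckpt - 1].t > t_old)
        nckpt--;                            /* drop checkpoints after t_old    */
    *s = ckpt[nckpt - 1];                   /* revert to preceding snapshot    */
    run_to(s, t_old);                       /* recompute forward to t_old      */
}

int main(void)
{
    State s = {0, {0, 0, 0, 0}};
    run_to(&s, 100);
    printf("t=%ld  r0=%ld\n", s.t, s.regs[0]);
    rollback(&s, 37);
    printf("rolled back: t=%ld  r0=%ld\n", s.t, s.regs[0]);
    return 0;
}
```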
Definition of Coherence • A weaker condition than full consistency. • A memory system is called coherent if: • Reads return the most recent value written locally, • if no other processor wrote the location in the meantime. • A read can return the value written by another processor, if the times are far enough apart, • and if nobody else wrote the location in between. • Writes to any given location are serialized: • if A writes a location and then B writes the location, all processors first see the value written by A, then (later) the value written by B. • Avoids WAW hazards leaving the cache in the wrong state.
Cache Coherence Protocols • Two common types: (Differ in how they track blocks’ sharing state) • Directory-based: • sharing status of a block is kept in a centralized directory • Snooping (or “snoopy”): • Sharing status of each block is maintained (redundantly) locally by each cache • All caches monitor or snoop (eavesdrop) on the memory bus, • to notice events relevant to sharing status of blocks they have • Snooping tends to be more popular
Write Invalidate Protocols • When a processor wants to write to a block, • It first “grabs ownership” of that block, • By telling all other processors to invalidate their own local copy. • This ensures coherence, because • A block recently written is cached in 1 place only: • The cache of the processor that most recently wrote it • Anyone else who wants to write that block will first have to grab back the most recent copy. • The block is also written to memory at that time. Analogous to using RCS to lock files
Meaning of Bus Messages • Write miss on block B: • “Hey, I want to write block B. Everyone, give me the most recent copy if you’re the one who has it. And everyone, also throw away your own copy.” • Read miss on block B: • “Hey, I want to read block B. Everyone, give me the most recent copy, if you have it. But you don’t have to throw away your own copy.” • Writeback of block B: • “Here is the most recent copy of block B, which I produced. I promise not to make any more changes until after I ask for ownership back and receive it.”
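To make these message semantics concrete, here is a toy single-block software model of a snooping write-invalidate protocol in C: a simplified MSI-style state machine of my own, with an illustrative bus model, not a specific hardware design.

```c
#include <stdio.h>

enum state { INVALID, SHARED, MODIFIED };
enum msg   { READ_MISS, WRITE_MISS };

#define NPROC 3

static enum state st[NPROC];  /* each cache's state for the one block */
static int memory;            /* the block's value in main memory     */
static int cache[NPROC];      /* each cache's local copy              */

/* Every other cache snoops the bus message and reacts. */
static void snoop(int requester, enum msg m)
{
    for (int p = 0; p < NPROC; p++) {
        if (p == requester) continue;
        if (st[p] == MODIFIED)            /* supply (write back) newest copy */
            memory = cache[p];
        if (m == WRITE_MISS)
            st[p] = INVALID;              /* writer takes ownership          */
        else if (st[p] == MODIFIED)
            st[p] = SHARED;               /* reader may keep a shared copy   */
    }
}

static int proc_read(int p)
{
    if (st[p] == INVALID) {               /* read miss: broadcast on the bus */
        snoop(p, READ_MISS);
        cache[p] = memory;
        st[p] = SHARED;
    }
    return cache[p];
}

static void proc_write(int p, int value)
{
    if (st[p] != MODIFIED) {              /* write miss: invalidate others   */
        snoop(p, WRITE_MISS);
        cache[p] = memory;
        st[p] = MODIFIED;
    }
    cache[p] = value;
}

int main(void)
{
    proc_write(0, 42);                     /* P0 now owns the block           */
    printf("P1 reads %d\n", proc_read(1)); /* P0's copy is written back first */
    proc_write(2, 7);                      /* P0 and P1 get invalidated       */
    printf("P0 reads %d\n", proc_read(0)); /* P0 sees P2's most recent write  */
    return 0;
}
```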
Write-Update Coherence Protocol • Also called write broadcast. • Strategy: Update all cached copies of a block when the block is written. • Comparison versus write-invalidate: • More bus traffic for multiple writes by 1 processor • Less latency for data to be passed between proc’s. • Bus & memory bandwidth is a key limiting factor! • Write-invalidate usually gives best overall perf.