Computer Architecture Principles. Dr. Mike Frank. CDA 5155 (UF) / CA 714-R (NTU), Summer 2003. Module #34: Introduction to Multiprocessing
Introduction • Application Domains • Symmetric Shared Memory Architectures & their performance • Distributed Shared Memory Architectures & their performance • Synchronization • Memory consistency • Multithreading • Crosscutting Issues • Example: Sun Wildfire • Multithreading example • Embedded multiprocessors • Fallacies & Pitfalls • Concluding remarks • Historical perspective (H&P chapter 6: Multiprocessing). But I will begin with some of my own material on the cost-efficiency and scalability of physically realistic parallel architectures.
Capacity Scaling – Some History • How can we increase the size & complexity of the computations that can be performed? Quantified as the number of bits of memory required. • Capacity scaling models (listed roughly from worst to best): • Finite State Machines (a.k.a. Deterministic Finite Automata): increasing the bits of state → exponential increase in the number of states & transitions and the size of the state-transition table. Infeasible to scale to a large # of bits: complex design, unphysical. • Uniprocessor (serial) models: Turing machine, von Neumann machine (RAM machine) (1940s). Leave processor complexity constant, just add more memory! But this is not a cost-effective way to scale capacity! • Multiprocessor models: von Neumann's Cellular Automaton (CA) models (1950s). Keep individual processors simple, just have more of them. Design complexity stays manageable; scale the amount of processing & memory together.
Why Multiprocessing? • A pretty obvious idea: any given serial processor has a maximum speed, say X operations/second. Therefore, N such processors together have a larger total maximum raw performance than this, namely N·X operations per second. • If a computational task can be divided among these N processors, we may reduce its execution time by some speedup factor S. • Usually S is at least slightly less than N, due to overheads. The exact factor depends on the nature of the application. • In extreme cases, the speedup factor may be much less than N, or even 1 (no speedup).
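To see why the speedup factor S typically falls below N, here is a minimal C sketch under a simple Amdahl-style model; the serial fraction and the per-processor coordination overhead are invented parameters for illustration, not figures from the slides.

```c
#include <stdio.h>

/* Minimal sketch: estimated speedup of an N-processor run under a simple,
 * made-up overhead model (serial_frac and ovh_per_proc are illustrative
 * parameters, not measurements). */
static double speedup(int n, double serial_frac, double ovh_per_proc)
{
    /* Amdahl-style estimate: the serial part stays, the parallel part divides
     * by n, and a coordination overhead grows with n. */
    double t_n = serial_frac + (1.0 - serial_frac) / n + ovh_per_proc * n;
    return 1.0 / t_n;   /* relative to a serial time of 1.0 */
}

int main(void)
{
    for (int n = 1; n <= 64; n *= 2)
        printf("N = %2d  ->  speedup ~ %.2f\n", n, speedup(n, 0.05, 0.001));
    return 0;
}
```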
Multiprocessing & Cost-Efficiency • For a given application, which is more cost-effective, a uniprocessor or a multiprocessor? • N-processor system cost, 1st-order approximation: C_N = C_fixed + N·C_proc. • N-processor execution time: T_N = T_ser / (N·eff_N), where T_ser is the serial execution time and eff_N is the parallel efficiency. • Focus on overall cost-per-performance (est.): (C/P)_N = C_N · T_N. This measures the cost to rent a machine for the job (assuming a fixed depreciation lifetime).
Cost-Efficiency Cont. • Uniprocessor cost/performance: (C/P)_uni = (C_fixed + C_proc) · T_ser. • N-way multiprocessor cost/performance: (C/P)_N = (C_fixed + N·C_proc) · T_ser / (N·eff_N). • The multiprocessor wins if and only if: (C/P)_N < (C/P)_uni, i.e. eff_N > (1/N + r) / (1 + r), equivalently r < (eff_N - 1/N) / (1 - eff_N), where r = C_proc / C_fixed. • Pick N to maximize (1 + r)·N·eff_N / (1 + N·r).
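As a quick sanity check on these formulas, here is a small C sketch that evaluates both sides of the comparison; the costs, the serial time, and the efficiency curve eff(N) are made-up illustrative values, not numbers from the lecture.

```c
#include <stdio.h>

/* Numeric sketch of the cost/performance formulas above.  All constants
 * (C_fixed, C_proc, T_ser) and the efficiency model eff(N) are invented. */
int main(void)
{
    double c_fixed = 1000.0;   /* fixed system cost: chassis, I/O, ...  */
    double c_proc  = 250.0;    /* incremental cost per processor        */
    double t_ser   = 3600.0;   /* serial execution time, seconds        */
    double r = c_proc / c_fixed;

    double cp_uni = (c_fixed + c_proc) * t_ser;              /* (C/P)_uni */

    for (int n = 1; n <= 64; n *= 2) {
        double eff  = 1.0 / (1.0 + 0.02 * (n - 1));          /* assumed eff_N  */
        double cp_n = (c_fixed + n * c_proc) * t_ser / (n * eff); /* (C/P)_N   */
        double need = (1.0 / n + r) / (1.0 + r);             /* break-even eff */
        printf("N=%2d  eff_N=%.2f (need > %.2f)  (C/P)_N=%8.0f  -> %s\n",
               n, eff, need, cp_n,
               cp_n < cp_uni ? "multiproc wins" : "uniproc wins");
    }
    return 0;
}
```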
Parallelizability • An application or algorithm is parallelizable to the extent that adding more processors can reduce its execution time. • A parallelizable application is: • Communication-intensive if its performance is primarily limited by communication latencies. • Requires a tightly-coupled parallel architecture. • Means, low communication latencies between CPUs • Computation-intensive if its performance is primarily limited by speeds of individual CPUs. • May use a loosely-coupled parallel architecture • Loose coupling may even help! (b/c of heat removal.)
Performance Models • For a given architecture, a performance model of the architecture is: • an abstract description of the architecture that allows one to predict what the execution time of given parallel programs will be on that architecture. • Naïve performance models might make dangerous simplifying assumptions, such as: • Any processor will be able to access shared memory at the maximum bandwidth at any time. • A message from any processor to any other will arrive within n seconds. • Watch out! Such assumptions may be flawed...
Classifying Parallel Architectures • What’s parallelized? Instructions / data / both? • SISD: Single Instruction, Single Data (uniproc.) • SIMD: Single Instruction, Multiple Data (vector) • MIMD: Multiple Instruction, Mult. Data (multprc.) • MISD: (Special purpose stream processors) • Memory access architectures: • Centralized shared memory (fig. 6.1) • Uniform Memory Access (UMA) • Distributed shared memory (fig. 6.2) • Non-Uniform Memory Access (NUMA) • Distributed, non-shared memory • Message Passing Machines / Multicomputers / Clusters
Centralized Shared Memory • A.k.a. symmetric multiprocessor. A typical example architecture. • Typically, only 2 to a few dozen processors. After this, memory BW becomes very restrictive.
Distributed Shared Memory Advantages: Memory BW scales w. #procs; local mem. latency kept small
DSM vs. Multicomputers • Distributed shared-memory (DSM) architectures: • Although each processor is close to some memory, all processors still share the same address space. • The memory system is responsible for maintaining consistency between each processor's view of the address space. • Distributed non-shared memory architectures: • Each processor has its own address space. • Many independent computers → "multicomputer" • COTS computers + network → "cluster" • Processors communicate w. explicit messages. • Can still layer shared-object abstractions on top of this infrastructure via software.
Communications in Multiprocs. • Communications performance metrics: • Node bandwidth – bit-rate in/out of each proc. • Bisection bandwidth – bit-rate between machine halves • Latency – propagation delay across the machine diameter • Tightly coupled (localized) vs. loosely coupled (distributed) multiprocessors: • Tightly coupled: high bisection BW, low latency • Loosely coupled: low bisection BW, high latency • Of course, you can also have a loosely-coupled (wide-area) network of (internally) tightly-coupled clusters.
Shared mem. vs. Message-Passing • Advantages of shared memory: • Straightforward, compatible interfaces (e.g., OpenMP; see the sketch below). • Ease of application programming & compiler design. • Lower comm. overhead for small items • Due to HW support • Use automatic caching to reduce comm. needs • Advantages of message passing: • Hardware is simpler • Communication explicit → easier to understand • Forces programmer to think about comm. costs • Encourages improved design of parallel algorithms • Enables more efficient parallel algs. than automatic caching could ever provide
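For concreteness, a minimal OpenMP example of the shared-memory style referenced above; the array and reduction are just an illustration (compile with an OpenMP-enabled compiler, e.g. gcc -fopenmp).

```c
#include <stdio.h>
#include <omp.h>

/* Shared-memory parallelism: one pragma parallelizes the loop, all threads
 * read the shared array directly, and the coherent memory system (not
 * explicit messages) moves the data. */
#define N 1000000

static double a[N];

int main(void)
{
    for (int i = 0; i < N; i++)
        a[i] = 1.0 / (i + 1);

    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)   /* no explicit messages */
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f (using up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}
```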
Scalability & Maximal Scalability • A multiprocessor architecture & accompanying performance model is scalable if: • it can be “scaled up” to arbitrarily large problem sizes, and/or arbitrarily large numbers of processors, without the predictions of the performance model breaking down. • An architecture (& model) is maximally scalable for a given problem if • it is scalable, and if no other scalable architecture can claim asymptotically superior performance on that problem • It is universally maximally scalable (UMS) if it is maximally scalable on all problems! • I will briefly mention some characteristics of architectures that are universally maximally scalable
Universal Maximum Scalability • Existence proof for universally maximally scalable (UMS) architectures: • Physics itself can be considered a universal maximally scalable “architecture” because any real computer is just a special case of a physical system. • So, obviously, no real class of computers can beat the performance of physical systems in general. • Unfortunately, physics doesn’t give us a very simple or convenient programming model. • Comprehensive expertise at “programming physics” means mastery of all physical engineering disciplines: chemical, electrical, mechanical, optical, etc. • We’d like an easier programming model than this!
Simpler UMS Architectures • (I propose) any practical UMS architecture will have the following features: • Processing elements characterized by constant parameters (independent of # of processors) • Mesh-type message-passing interconnection network, arbitrarily scalable in 2 dimensions • w. limited scalability in 3rd dimension. • Processing elements that can be operated in an arbitrarily reversible way, at least, up to a point. • Enables improved 3-d scalability in a limited regime • (In long term) Have capability for quantum-coherent operation, for extra perf. on some probs.
Shared Memory isn’t Scalable • Any implementation of shared memory requires communication between nodes. • As the # of nodes increases, we get: • Extra contention for any shared BW • Increased latency (inevitably). • Can hide communication delays to a limited extent, by latency hiding: • Find other work to do during the latency delay slot. • But the amount of “other work” available is limited by node storage capacity, parallelizability of the set of running applications, etc.
Global Unit-Time Message Passing Isn’t Scalable! • Naïve model: “Any node can pass a message to any other in a single constant-time interval” • independent of the total number of nodes • Has same scaling problems as shared memory • Even if we assume that BW contention (traffic) isn’t a problem, unit-time assumption is still a problem. • Not possible for all N, given speed-of-light limit! • Need cube root of N asymptotic time, at minimum.
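A back-of-the-envelope C calculation of that cube-root bound, assuming a purely illustrative packing density of one node per cubic millimetre:

```c
#include <stdio.h>
#include <math.h>

/* Speed-of-light argument: if N nodes are packed at a fixed density
 * (assumed: 1 node per mm^3), the machine's diameter grows like N^(1/3),
 * and so does the minimum one-way signalling time across it. */
int main(void)
{
    const double c        = 3.0e8;    /* speed of light, m/s             */
    const double node_vol = 1.0e-9;   /* assumed volume per node: 1 mm^3 */

    for (double n = 1e3; n <= 1e15; n *= 1000.0) {
        double side = cbrt(n * node_vol);   /* edge of the bounding cube */
        printf("N = %.0e nodes -> size ~ %6.2f m, crossing time >= %.2e s\n",
               n, side, side / c);
    }
    return 0;
}
```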
Many Interconnect Topologies Aren’t Scalable! • Suppose we don’t require a node can talk to any other in 1 time unit, but only to selected others. • Some such schemes still have scalability problems, e.g.: • Hypercubes, fat hypercubes • Binary trees, fat-trees • Crossbars, butterfly networks • Any topology in which the number of unit-time hops to reach any of N nodes is of order less than N^(1/3) is necessarily doomed to failure! See last year’s exams.
Only Meshes (or subgraphs of meshes) Are Scalable • 1-D meshes • Linear chain, ring, star (w. fixed # of arms) • 2-D meshes • Square grid, hex grid, cylinder, 2-sphere, 2-torus,… • 3-D meshes • Crystal-like lattices, w. various symmetries • Amorphous networks w. local interactions in 3d • An important caveat: • Scalability in 3rd dimension is limited by energy/information I/O considerations! More later… (Vitányi, 1988)
Which Approach Will Win? • Perhaps, the best of all worlds? • Here’s one example of a near-future, parallel computing scenario that seems reasonably plausible: • SMP architectures within the smallest groups of processors on the same chip (chip multiprocessors), sharing a common bus and on-chip DRAM bank. • DSM architectures w. flexible topologies to interconnect larger (but still limited-size) groups of processors in a package-level or board-level network. • Message-passing w. mesh topologies for communication between different boards in a cluster-in-a-box (blade server), or a higher-level conglomeration of machines. • But, what about the heat removal problem?
Landauer’s Principle (famous IBM researcher’s 1961 paper) • We know low-level physics is reversible: • Means, the time-evolution of a state is bijective • Change is deterministic looking backwards in time, as well as forwards • Physical information (like energy) is conserved: • It cannot ever be created or destroyed, • only reversibly rearranged and transformed! • This explains the 2nd Law of Thermodynamics: • Entropy (unknown info.) in a closed, unmeasured system can only increase (as we lose track of its state) • Irreversible bit “erasure” really just moves the bit into the surroundings, increasing entropy & heat
Landauer’s Principle from basic quantum theory • [Figure: states before vs. after bit erasure, showing how the unitary (one-to-one) evolution forces the erased bit’s information into the environment.] • Increase in entropy: ΔS = log 2 = k ln 2. Energy lost to heat: ΔS·T = kT ln 2.
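To attach a number to Landauer’s bound, here is a short C calculation of kT ln 2 at an example temperature of 300 K (room temperature is my chosen example, not a figure from the slide):

```c
#include <stdio.h>
#include <math.h>

/* The Landauer bound kT ln 2, evaluated at an example temperature of 300 K. */
int main(void)
{
    const double k  = 1.380649e-23;    /* Boltzmann constant, J/K         */
    const double T  = 300.0;           /* example temperature (room temp) */
    const double eV = 1.602176634e-19; /* joules per electron-volt        */

    double e = k * T * log(2.0);
    printf("kT ln 2 at %.0f K = %.3g J = %.3g eV per erased bit\n",
           T, e, e / eV);
    return 0;
}
```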
Scaling in 3rd Dimension? • Computing based on ordinary irreversible bit operations only scales in 3d up to a point. • All discarded information & associated energy must be removed thru surface. But energy flux is limited! • Even a single layer of circuitry in a high-performance CPU can barely be kept cool today! • Computing with reversible, “adiabatic” operations does better: • Scales in 3d, up to a point… • Then with square root of further increases in thickness, up to a point. (Scales in 2.5 dimensions!) • Enables much larger thickness than irreversible!
Reversible 3-D Mesh • [Figure: note the differing power laws!]
Cost-Efficiency of Reversibility • Scenario: $1,000, 100-Watt conventional computer, w. 3-year lifetime, vs. reversible computers of the same storage capacity. • [Chart: bit-operations per US dollar over time, for conventional irreversible computing vs. worst-case and best-case reversible computing (roughly ~1,000× and ~100,000× advantages, respectively). All curves would →0 if leakage is not reduced.]
Example Parallel Applications • Computation-intensive (“embarrassingly parallel”) applications: • Factoring large numbers, cracking codes • Combinatorial search & optimization problems: • Find a proof of a theorem, or a solution to a puzzle • Find an optimal engineering design or data model, over a large space of possible design parameter settings • Solving a game-theory or decision-theory problem • Rendering an animated movie • Communication-intensive applications: • Physical simulations (sec. 6.2 has some examples) • Also multiplayer games, virtual work environments • File serving, transaction processing in distributed database systems
Introduction • Application Domains • Symmetric Shared Memory Architectures & their performance • Distributed Shared Memory Architectures & their performance • Synchronization • Memory consistency • Multithreading • Crosscutting Issues • Example: Sun Wildfire • Multithreading example • Embedded multiprocessors • Fallacies & Pitfalls • Concluding remarks • Historical perspective (H&P chapter 6: Multiprocessing)
More about SMPs (6.3) • Caches help reduce each processor’s memory bandwidth • Means many processors can share the total memory BW • Microprocessor-based Symmetric MultiProcessors (SMPs) emerged in the ’80s • Very cost-effective, up to the limit of memory BW • Early SMPs had 1 CPU per board (off a backplane) • Now multiple per board, per MCM, or even per die • The memory system caches both shared and private (local) data • Private data in 1 cache only • Shared data may be replicated
Cache Coherence Problem • Goal: All processors should have a consistent view of the shared memory contents, and how they change. • Or, as nearly consistent as we can manage. • The fundamental difficulty: written information takes time to propagate! • E.g., A writes, then B writes, then A reads (like a WAW hazard): A might see the value from A, instead of the value from B. • A simple, but inefficient solution: have all writes cause all processors to stall (or at least, not perform any new accesses) until all have received the result of the write. • Reads, on the other hand, can be reordered amongst themselves. • But: incurs a worst-case memory stall on each write step! • Can alleviate this by allowing writes to occur only periodically • But this reduces bandwidth for writes • And increases avg. latency for communication through shared memory
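A toy C illustration of the propagation problem: two software “cached” copies of a shared location that nothing keeps coherent, so one processor keeps reading a stale value after the other has written. This is purely a software model of the hazard, not how real coherent hardware behaves.

```c
#include <stdio.h>

int main(void)
{
    int memory = 1;        /* shared location X in main memory     */
    int cacheA = memory;   /* processor A reads X and caches it    */
    int cacheB = memory;   /* processor B reads X and caches it    */

    cacheB = 2;            /* B writes X = 2 ...                   */
    memory = cacheB;       /* ... and the write reaches memory     */

    /* With no invalidate or update, A still sees its old private copy: */
    printf("memory holds %d, but A reads %d\n", memory, cacheA);
    return 0;
}
```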
Another Interesting Method (research by Chris Carothers at RPI) • Maintain a consistent system “virtual time” modeled by all processors. • Each processor asynchronously tracks its local idea of the current virtual time (Local Virtual Time, LVT). • On a write, asynchronously send invalidate messages timestamped with the writer’s LVT. • On receiving an invalidate message stamped earlier than the reader’s LVT: • Roll back the local state to that earlier time • There are efficient techniques for doing this • If timestamped later than the reader’s LVT: • Queue it up until the reader’s LVT reaches that time • (This is an example of speculation.)
Frank-Lewis Rollback Method (Steve Lewis’ MS thesis, UF, 2001: Reversible MIPS Emulator & Debugger) • Fixed-size window limits how far back you can go. • Periodically store checkpoints of machine state • Each checkpoint records the changes needed • to get back to that earlier state from the next checkpoint, • or from the current state if it’s the last checkpoint • Cull out older checkpoints periodically • so the total number stays logarithmic in the size of the window. • Also, store messages received during the time window • To go backwards Δt steps (to time t_old = t_cur - Δt): • Revert machine state to the latest checkpoint preceding time t_old • Apply the changes recorded in the checkpoints from t_cur on backwards • Compute forwards from there to time t_old • The technique is fairly time- and space-efficient
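The following is a much-simplified C sketch of checkpoint-and-replay rollback, not the actual Lewis implementation: it stores full snapshots at a fixed interval and omits the logarithmic culling and the message log, and the toy machine state and step function are invented for illustration.

```c
#include <stdio.h>

#define INTERVAL 8
#define MAXCKPT  64

typedef struct { long t; long regs[4]; } State;

static State ckpt[MAXCKPT];
static int   nckpt;

static void step(State *s)                  /* one deterministic "instruction" */
{
    s->regs[s->t % 4] += s->t;
    s->t++;
}

static void run_to(State *s, long target)   /* execute forward, checkpointing  */
{
    while (s->t < target) {
        if (s->t % INTERVAL == 0 && nckpt < MAXCKPT &&
            (nckpt == 0 || ckpt[nckpt - 1].t < s->t))
            ckpt[nckpt++] = *s;             /* periodic full snapshot          */
        step(s);
    }
}

static void rollback(State *s, long t_old)  /* go back to earlier time t_old   */
{
    while (nckpt > 1 && ckpt[nckpt - 1].t > t_old)
        nckpt--;                            /* drop checkpoints after t_old    */
    *s = ckpt[nckpt - 1];                   /* revert to preceding snapshot    */
    run_to(s, t_old);                       /* recompute forward to t_old      */
}

int main(void)
{
    State s = {0, {0, 0, 0, 0}};
    run_to(&s, 100);
    printf("t=%ld  r0=%ld\n", s.t, s.regs[0]);
    rollback(&s, 37);
    printf("rolled back: t=%ld  r0=%ld\n", s.t, s.regs[0]);
    return 0;
}
```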
Definition of Coherence • A weaker condition than full consistency. • A memory system is called coherent if: • Reads return the most recent value written locally, • if no other processor wrote the location in the meantime. • A read can return the value written by another processor, if the times are far enough apart, • and if nobody else wrote the location in between. • Writes to any given location are serialized: • if A writes a location and then B writes the location, all processors first see the value written by A, then (later) the value written by B. • Avoids WAW hazards leaving the cache in the wrong state.
Cache Coherence Protocols • Two common types: (Differ in how they track blocks’ sharing state) • Directory-based: • sharing status of a block is kept in a centralized directory • Snooping (or “snoopy”): • Sharing status of each block is maintained (redundantly) locally by each cache • All caches monitor or snoop (eavesdrop) on the memory bus, • to notice events relevant to sharing status of blocks they have • Snooping tends to be more popular
Write Invalidate Protocols • When a processor wants to write to a block, • It first “grabs ownership” of that block, • By telling all other processors to invalidate their own local copy. • This ensures coherence, because • A block recently written is cached in 1 place only: • The cache of the processor that most recently wrote it • Anyone else who wants to write that block will first have to grab back the most recent copy. • The block is also written to memory at that time. Analogous to using RCS to lock files
Meaning of Bus Messages • Write miss on block B: • “Hey, I want to write block B. Everyone, give me the most recent copy if you’re the one who has it. And everyone, also throw away your own copy.” • Read miss on block B: • “Hey, I want to read block B. Everyone, give me the most recent copy, if you have it. But you don’t have to throw away your own copy.” • Writeback of block B: • “Here is the most recent copy of block B, which I produced. I promise not to make any more changes until after I ask for ownership back and receive it.”
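To make these message semantics concrete, here is a toy single-block software model of a snooping write-invalidate protocol in C: a simplified MSI-style state machine of my own, with an illustrative bus model, not a specific hardware design.

```c
#include <stdio.h>

enum state { INVALID, SHARED, MODIFIED };
enum msg   { READ_MISS, WRITE_MISS };

#define NPROC 3

static enum state st[NPROC];  /* each cache's state for the one block */
static int memory;            /* the block's value in main memory     */
static int cache[NPROC];      /* each cache's local copy              */

/* Every other cache snoops the bus message and reacts. */
static void snoop(int requester, enum msg m)
{
    for (int p = 0; p < NPROC; p++) {
        if (p == requester) continue;
        if (st[p] == MODIFIED)            /* supply (write back) newest copy */
            memory = cache[p];
        if (m == WRITE_MISS)
            st[p] = INVALID;              /* writer takes ownership          */
        else if (st[p] == MODIFIED)
            st[p] = SHARED;               /* reader may keep a shared copy   */
    }
}

static int proc_read(int p)
{
    if (st[p] == INVALID) {               /* read miss: broadcast on the bus */
        snoop(p, READ_MISS);
        cache[p] = memory;
        st[p] = SHARED;
    }
    return cache[p];
}

static void proc_write(int p, int value)
{
    if (st[p] != MODIFIED) {              /* write miss: invalidate others   */
        snoop(p, WRITE_MISS);
        cache[p] = memory;
        st[p] = MODIFIED;
    }
    cache[p] = value;
}

int main(void)
{
    proc_write(0, 42);                     /* P0 now owns the block           */
    printf("P1 reads %d\n", proc_read(1)); /* P0's copy is written back first */
    proc_write(2, 7);                      /* P0 and P1 get invalidated       */
    printf("P0 reads %d\n", proc_read(0)); /* P0 sees P2's most recent write  */
    return 0;
}
```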
Write-Update Coherence Protocol • Also called write broadcast. • Strategy: Update all cached copies of a block when the block is written. • Comparison versus write-invalidate: • More bus traffic for multiple writes by 1 processor • Less latency for data to be passed between proc’s. • Bus & memory bandwidth is a key limiting factor! • Write-invalidate usually gives best overall perf.