Discover the concepts of ILP, memory, and synchronization in parallelism through dynamic and static exploitation methods like VLIW, superscalar, and Tomasulo's algorithm. Learn about hazards, data dependency conflicts, scheduling approaches, loop unrolling, software pipelining, and advanced pipelining techniques in modern architectures.
ILP, Memory and Synchronization Joseph B. Manzano
Instruction Level Parallelism • Parallelism that is found between instructions • Dynamic and Static Exploitation • Dynamic: Hardware related. • Static: Software related (compiler and system software) • VLIW and Superscalar • Micro-Dataflow and Tomasulo’s Algorithm
Hazards • Structural Hazards • Non-pipelined functional units • One-port register bank and one-port memory bank • Data Hazards • For some: forwarding • For others: pipeline interlock • Example: LD R1, A followed by an add that reads R1 (+ R4, R1, R7) needs a bubble / stall
Data Dependency: A Review
• Flow dependency (RAW conflicts): B + C → A, then A + D → E
• Anti dependency (WAR conflicts): A + C → B, then E + D → A
• Output dependency (WAW conflicts): B + C → A, then E + D → A
• RAR are not really a problem
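The three kinds of conflict in the review above can be checked mechanically. The following is a minimal sketch, assuming each instruction is written as a (destination, sources) pair; the function name `classify` and that encoding are illustrative choices, not part of the slides.

```python
def classify(first, second):
    """Return the hazards created when `second` follows `first`.

    Each instruction is (dest, sources), e.g. ("A", ("B", "C")) for B + C -> A.
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    hazards = set()
    if dest1 in srcs2:
        hazards.add("RAW")  # flow dependency: second reads what first writes
    if dest2 in srcs1:
        hazards.add("WAR")  # anti dependency: second overwrites what first reads
    if dest1 == dest2:
        hazards.add("WAW")  # output dependency: both write the same location
    return hazards


# The three instruction pairs from the review slide.
print(classify(("A", ("B", "C")), ("E", ("A", "D"))))  # flow
print(classify(("B", ("A", "C")), ("A", ("E", "D"))))  # anti
print(classify(("A", ("B", "C")), ("A", ("E", "D"))))  # output
```

Two instructions that only read the same location (RAR) produce an empty set, which matches the remark that RAR is not a real problem.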
Instruction Level Parallelism • Static Scheduling • Simple Scheduling • Loop Unrolling • Loop Unrolling + Scheduling • Software Pipelining • Dynamic Scheduling • Out of order execution • Data Flow computers • Speculation
Advanced Pipelining • Instruction reordering and scheduling within the loop body • Loop Unrolling • Code size suffers • Superscalar • Compact code • Multiple issue of different instruction types • VLIW
An Example X[i] + a
Loop: LD   F0, 0 (R1)     ; load the vector element
      ADDD F4, F0, F2     ; add the scalar in F2
      SD   0 (R1), F4     ; store the vector element
      SUB  R1, R1, #8     ; decrement the pointer by 8 bytes (per DW)
      BNEZ R1, Loop       ; branch when it's not zero
Load can bypass the store. Assume that the latency for integer ops is zero and the latency for an integer load is 1.
An Example X[i] + a (issue cycle shown on the right)
Loop: LD   F0, 0 (R1)     1
      STALL               2
      ADDD F4, F0, F2     3
      STALL               4
      STALL               5
      SD   0 (R1), F4     6
      SUB  R1, R1, #8     7
      BNEZ R1, Loop       8
      STALL               9
Stall labels on the slide: Load Latency, FP ALU Latency, Load Latency. This requires 9 cycles per iteration.
An Example X[i] + a: Scheduling
Loop: LD   F0, 0 (R1)     1
      STALL               2
      ADDD F4, F0, F2     3
      SUB  R1, R1, #8     4
      BNEZ R1, Loop       5
      SD   8 (R1), F4     6
This requires 6 cycles per iteration.
An Example X[i] + a: Unrolling
Loop: LD   F0, 0 (R1)       1
      NOP                   2
      ADDD F4, F0, F2       3
      NOP                   4
      NOP                   5
      SD   0 (R1), F4       6
      LD   F6, -8 (R1)      7
      NOP                   8
      ADDD F8, F6, F2       9
      NOP                   10
      NOP                   11
      SD   -8 (R1), F8      12
      LD   F10, -16 (R1)    13
      NOP                   14
      ADDD F12, F10, F2     15
      NOP                   16
      NOP                   17
      SD   -16 (R1), F12    18
      LD   F14, -24 (R1)    19
      NOP                   20
      ADDD F16, F14, F2     21
      NOP                   22
      NOP                   23
      SD   -24 (R1), F16    24
      SUB  R1, R1, #32      25
      BNEZ R1, Loop         26
      NOP                   27
27 cycles for 4 iterations: this requires about 6.8 cycles per iteration.
An Example X[i] + a: Unrolling + Scheduling
Loop: LD   F0, 0 (R1)       1
      LD   F6, -8 (R1)      2
      LD   F10, -16 (R1)    3
      LD   F14, -24 (R1)    4
      ADDD F4, F0, F2       5
      ADDD F8, F6, F2       6
      ADDD F12, F10, F2     7
      ADDD F16, F14, F2     8
      SD   0 (R1), F4       9
      SD   -8 (R1), F8      10
      SD   -16 (R1), F12    11
      SUB  R1, R1, #32      12
      BNEZ R1, Loop         13
      SD   8 (R1), F16      14
14 cycles for 4 iterations: this requires 3.5 cycles per iteration.
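The cycles-per-iteration figures quoted across the four versions of the loop follow from dividing the cycle count of one pass through the loop body by the number of original iterations that pass covers. A small sketch of that arithmetic, with the cycle counts taken from the slides and the helper name `cycles_per_iteration` chosen here for illustration:

```python
def cycles_per_iteration(total_cycles, iterations_covered):
    """Cycles spent per original loop iteration for one pass of the loop body."""
    return total_cycles / iterations_covered


# (cycles for one pass of the loop body, original iterations covered by that pass)
versions = {
    "original": (9, 1),
    "scheduled": (6, 1),
    "unrolled x4": (27, 4),
    "unrolled x4 + scheduled": (14, 4),
}

baseline = cycles_per_iteration(*versions["original"])
for name, (cycles, iters) in versions.items():
    cpi = cycles_per_iteration(cycles, iters)
    print(f"{name:26s} {cpi:5.2f} cycles/iteration, speedup {baseline / cpi:4.2f}x")
```

The unrolled version comes out at 6.75 cycles per iteration, which the slide rounds to 6.8.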
ILP • ILP of a program • Average Number of Instructions that a superscalar processor might be able to execute at the same time • Data dependencies • Latencies and other processor difficulties • ILP of a machine • The ability of a processor to take advantage of the ILP • Number of instructions that can be fetched and executed at the same time by such processor
Multi Issue Architectures • Super Scalar • Machines that issue multiple independent instructions per clock cycle when they are properly scheduled by the compiler and runtime scheduler • Very Long Instruction Word • A machine where the compiler has complete responsibility for creating a package of instructions that can be simultaneously issued, and the hardware does not dynamically make any decisions about multiple issue Patterson & Hennessy P317 and P318
Multiple Instruction Issue • Multiple Issue + Static Scheduling VLIW • Dynamic Scheduling • Tomasulo • Scoreboarding • Multiple Issue + Dynamic Scheduling Superscalar • Decoupled Architectures • Static Scheduling of R-R Instructions • Dynamic Scheduling of Memory Ops • Buffers
Software Pipeline • Reorganizing loops such that each iteration is composed of instruction sequences chosen from different iterations • Uses less code than unrolling • Some architectures provide specific support for software pipelining • Rotating register banks • Predicated instructions
Software Pipelining • Overlap instructions without unrolling the loop • Given the vector M in memory, and ignoring the start-up and finishing code, we have:
Loop: SD   0 (R1), F4      ; stores into M[i]
      ADDD F4, F0, F2      ; adds to M[i-1]
      LD   F0, -16 (R1)    ; loads M[i-2]
      BNEZ R1, Loop
      SUB  R1, R1, #8      ; subtract in delay slot
This loop can be run at a rate of 5 cycles per result, ignoring the start-up and clean-up portions.
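To make the role of the start-up and clean-up code concrete, here is a minimal Python sketch of the same transformation. It is an illustration under stated assumptions, not the slide's code: it walks the array upward for readability (the assembly walks downward), and the names `add_scalar_plain` and `add_scalar_pipelined` are hypothetical. Each pass of the kernel stores the result of one iteration, adds for the next, and loads for the one after that.

```python
def add_scalar_plain(x, a):
    """Reference version: load, add, store for each element in turn."""
    return [value + a for value in x]


def add_scalar_pipelined(x, a):
    """Software-pipelined version of the same loop."""
    n = len(x)
    if n < 2:
        return add_scalar_plain(x, a)  # too short to fill the pipeline
    out = [None] * n

    # Prologue: fill the pipeline.
    loaded = x[0]          # load for iteration 0
    summed = loaded + a    # add for iteration 0
    loaded = x[1]          # load for iteration 1

    # Kernel: three different iterations are in flight in every pass.
    for i in range(n - 2):
        out[i] = summed      # store for iteration i
        summed = loaded + a  # add for iteration i + 1
        loaded = x[i + 2]    # load for iteration i + 2

    # Epilogue: drain the pipeline.
    out[n - 2] = summed
    out[n - 1] = loaded + a
    return out


data = [1.0, 2.0, 3.0, 4.0, 5.0]
print(add_scalar_pipelined(data, 10.0) == add_scalar_plain(data, 10.0))
```

The prologue and epilogue are each paid once regardless of the trip count, which is the overhead the comparison with unrolling refers to.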
Figure: number of overlapped instructions over time, for (a) software-pipelined code, with its prologue and epilogue, and (b) an unrolled loop. Overhead for software pipelining: paid twice, once for the prologue and once for the epilogue. Overhead for an unrolled loop: paid M / N times, for M loop executions and N-way unrolling.
Loop Unrolling vs. Software Pipelining • When not running at maximum rate • Unrolling: pay the overhead m/n times for m iterations and n-way unrolling • Software pipelining: pay it twice • Once at the prologue and once at the epilogue • Moreover • Code compactness • Optimal runtime • Storage constraints
Limitations of VLIW • Limited parallelism in (statically scheduled) code • Basic blocks may be too small • Global code motion is difficult • Limited hardware resources • Code size • Memory port limitations • A stall is serious • Cache is difficult to use effectively • I-cache misses have the potential to multiply the miss rate by a factor of n, where n is the issue width • Cache miss penalty is increased because of the length of the instruction word
A VLIW Example: TMS320C62x/C67x Block Diagram (figure not reproduced) Source: TMS320C6000 Technical Brief, February 1999
A VLIW Example: TMS320C62x/C67x Data Paths and Assembly Example (figures not reproduced) Source: TMS320C6000 Technical Brief, February 1999
Instruction Issue Policy • It determines the processor's look-ahead policy • Ability to examine instructions beyond the current PC • Look-ahead must ensure correctness at all costs • Issue policy • Protocol used to issue instructions • Note: issue, execution and completion
Achieving High Performance in Multiple-Issue Machines • Detection and resolution of storage conflicts • Extra “shadow” registers • Special bit for reservation • Organization and control of the buses between the various units in the PU • Special controllers to detect write backs and reads
Data Dependencies & SuperScalar • Hardware mechanism (dynamic scheduling) • Scoreboarding • Limited out-of-order issue/completion • Centralized control • Renaming with a reorder buffer is another attractive approach (based on Tomasulo's algorithm) • Micro dataflow • Advantage: exact runtime information • Load/cache miss • Resolve storage-location-related dependences
Scoreboarding • Named after CDC 6600 • Effective when there are enough resources and no data dependencies • Out-of-order execution • Issue: checking scoreboard and WAW will cause a stall • Read operand • checking availability of operand and resolve RAW dynamically at this step • WAR will not cause stall • EX • Write result • WAR will be checked and will cause stall
Figure: the basic structure of a DLX processor with a scoreboard. The functional units (FP mult, FP mult, FP divide, FP add, integer unit) connect to the registers over data buses; the scoreboard exchanges control/status signals with the registers and the functional units.
Scoreboarding [CDC6600, Thornton70], [WeissSmith84] • A bit (called the “scoreboard bit”) is associated with each register; bit = 1 means the register is reserved by a pending write • An instruction that has a source operand with bit = 1 will be issued, but put into an instruction window, with the register identifier denoting the “to-be-written” operand • Copies of valid operands are also read with the pending inst (solves anti-dependence) • When the missing operand is finally written, the register id in the pending inst is compared and the value written, so the inst can be issued • An inst whose result register R is reserved will stall, so the output dependence (WAW) is handled correctly by the stall
Micro Data Flow • Fundamental Concepts • “Data Flow” • Instructions can only be fired when operands are available • Single assignment and register renaming • Implementation • Tomasulo’s Algorithm • Reorder Buffer
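Renaming / Single Assignment
Original code (the slide shows its dependence graph over instructions 1 to 5):
  R0 = R2 / R4       (1)
  R6 = R0 + R8       (2)
  R1[0] = R6         (3)
  R8 = R10 - R14     (4)
  R6 = R10 * R8      (5)
After renaming (the dependence graph keeps only the true dependences):
  R0 = R2 / R4       (1)
  S = R0 + R8        (2)
  R1[0] = S          (3)
  T = R10 - R14      (4)
  R6 = R10 * T       (5)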
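Renaming can be sketched as a single pass that gives every written register a fresh name and redirects later reads to the newest name. The sketch below is a simplification, not the slide's exact result: it renames every destination (the slide renames only the two that conflict), and the names `rename` and `P0`, `P1`, ... are hypothetical. Instructions are (dest, sources) pairs, with `None` as the destination of the store.

```python
from itertools import count


def rename(instrs):
    """Give every destination a fresh name so only true (RAW) dependences remain."""
    latest = {}                          # architectural register -> newest name
    fresh = (f"P{i}" for i in count())   # supply of unused physical names
    renamed = []
    for dest, srcs in instrs:
        # Sources are translated first, so an instruction reads the old value
        # of a register it also writes.
        new_srcs = tuple(latest.get(s, s) for s in srcs)
        new_dest = None
        if dest is not None:
            new_dest = next(fresh)
            latest[dest] = new_dest
        renamed.append((new_dest, new_srcs))
    return renamed


program = [
    ("R0", ("R2", "R4")),    # (1) R0 = R2 / R4
    ("R6", ("R0", "R8")),    # (2) R6 = R0 + R8
    (None, ("R1", "R6")),    # (3) R1[0] = R6 (store, no register result)
    ("R8", ("R10", "R14")),  # (4) R8 = R10 - R14
    ("R6", ("R10", "R8")),   # (5) R6 = R10 * R8
]
for dest, srcs in rename(program):
    print(dest, srcs)
```

After renaming, instruction (4) no longer writes the R8 that (2) reads, and (5) no longer writes the R6 that (2) writes and (3) reads, so (4) and (5) can run ahead of (1) to (3).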
Baseline Superscalar Model (pipeline stages, in order): Inst Fetch → Inst Decode → Renaming → Issue Window (Wake Up, Select) → Register File → Execution / Bypass → Data Cache Access → Register Write & Instruction Commit
Micro Data Flow: Conceptual Model
Code:
  A → R1         (load)
  R1 * B → R2
  R2 / C → R1
  R4 + R1 → R4
Figure: the corresponding dataflow graph, in which the Load, *, / and + nodes are fed through operand registers OR1 to OR6.
ROB Stages • Issue • Dispatch an instruction from the instruction queue • Reserve a ROB entry and a reservation station • Execute • Stall for operands • RAW resolved • Write Result • Write back to any reservation stations waiting for it and to the ROB • Commit • Normal commit: update registers • Store commit: update memory • Mispredicted branch: flush the ROB and restart execution
Tomasulo’s Algorithm • Tomasulo, R.M. “An Efficient Algorithm for Exploiting Multiple Arithmetic Units”, IBM J. of R&D 11:1 (Jan. 1967), pp. 25-33 • IBM 360/91 (three years after the CDC 6600 and just before caches) • Features: • CDB: Common Data Bus • Reservation Stations: hardware features which allow the fetch, use and reuse of data as soon as it becomes available. They allow register renaming and are decentralized in nature (as opposed to scoreboarding)
Tomasulo’s Algorithm • Control and Buffers distributed with Functional Units. • HW renaming of registers • CDB broadcasting • Load / Store buffers Functional Units • Reservation Stations: • Hazard detection and Instruction control • 4-bit tag field to specify which station or buffer will produce the result • Register Renaming • Tag Assigned on IS • Tag discarded after write back
Comparison
Scoreboarding • Centralized data structure and control • Register bit • Simple, low cost • Structural hazards solved by FU • Solves RAW by register bit • Solves WAR in write • Solves WAW by stalling on issue
Tomasulo’s Algorithm • Distributed control • Tagged registers + register renaming • Structural hazards stall on reservation station • Solves RAW by CDB • Solves WAR by copying operand to reservation station • Solves WAW by renaming • Limited: CDB broadcast, 1 per cycle
The Architecture (figure) - 3 Adders - 2 Multipliers - Load buffers (6) - Store buffers (3) - FP Queue - FP registers - CDB: Common Data Bus. Floating-point operations arrive from the instruction unit and load data arrives from memory; the operand and operation buses feed the reservation stations in front of the FP adders and FP multipliers; results return on the common data bus (CDB) to the reservation stations, the FP registers and the store buffers, which write to memory.
Tomasulo’s Algorithm’s Steps • Issue • Issue if an empty reservation station is found; fetch operands if they are in registers, otherwise assign a tag • If no empty reservation station is found, stall and wait for one to become free • Renaming is performed here, and WAW and WAR are resolved • Execute • If operands are not ready, monitor the CDB for them • RAWs are resolved • When they are ready, execute the op in the FU • Write Back • Send the results to the CDB and update the registers and the store buffers • Store buffers will write to memory during this step • Exception Behavior • During Execute: no instruction is allowed to begin execution until all branches before it have completed
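The issue, execute and write-back steps above can be sketched as a toy model. This is a deliberately reduced illustration, not the IBM 360/91 design: it has an unlimited number of reservation stations, no load/store buffers, no timing, and the class and method names (`TinyTomasulo`, `issue`, `execute_and_write_back`) are hypothetical. What it does keep is the tag mechanism: operands are either copied at issue or replaced by the tag of their producer, and a result updates a register only if that register is still waiting on the producing tag.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


class TinyTomasulo:
    def __init__(self, regs):
        self.regs = dict(regs)  # architectural register file
        self.status = {}        # register -> tag of the station that will write it
        self.stations = {}      # tag -> reservation station contents
        self.next_id = 0

    def issue(self, op, dest, src1, src2):
        tag = f"RS{self.next_id}"
        self.next_id += 1
        vals, tags = [], []
        for src in (src1, src2):
            if src in self.status:          # producer still pending: keep its tag
                vals.append(None)
                tags.append(self.status[src])
            else:                           # value available: copy it now (no WAR)
                vals.append(self.regs[src])
                tags.append(None)
        self.stations[tag] = {"op": op, "dest": dest, "vals": vals, "tags": tags}
        self.status[dest] = tag             # newest writer wins (no WAW)
        return tag

    def ready(self, tag):
        return all(t is None for t in self.stations[tag]["tags"])

    def execute_and_write_back(self, tag):
        if not self.ready(tag):
            raise RuntimeError(f"{tag} is still waiting for operands")
        rs = self.stations.pop(tag)
        result = OPS[rs["op"]](*rs["vals"])
        # CDB broadcast: every station waiting on this tag captures the value.
        for other in self.stations.values():
            for k, waiting_on in enumerate(other["tags"]):
                if waiting_on == tag:
                    other["vals"][k] = result
                    other["tags"][k] = None
        # The register is updated only if this station is still its latest writer.
        if self.status.get(rs["dest"]) == tag:
            self.regs[rs["dest"]] = result
            del self.status[rs["dest"]]
        return result


m = TinyTomasulo({"R1": 0, "R2": 3, "R3": 4, "R4": 0, "R5": 10})
t0 = m.issue("mul", "R1", "R2", "R3")  # R1 = R2 * R3
t1 = m.issue("add", "R4", "R1", "R5")  # R4 = R1 + R5, waits on t0
t2 = m.issue("add", "R1", "R5", "R5")  # R1 = R5 + R5, a second pending write to R1
m.execute_and_write_back(t2)           # the younger write completes first
m.execute_and_write_back(t0)           # the older result reaches t1 but not R1
m.execute_and_write_back(t1)
print(m.regs["R1"], m.regs["R4"])      # 20 22
```

In this run the two pending writes to R1 complete out of order, yet R1 ends with the value of the later instruction and R4 is computed from the earlier one, because consumers follow tags rather than register names.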
Tomasulo’s Algorithm • Note that: • Upon entering a reservation station, source operands are either filled with values or renamed • The new names are in 1-to-1 correspondence with FU names • Questions: • How are output dependencies resolved? • Two pending writes to a register • How to determine that a read will get the most recent value if they complete out of order
Features of T. Alg. • The value of an operand (for any inst already issued to a reservation station) will be read from the CDB; it will not be read from the register field. • Instructions can be issued even before their operands have been produced (the station knows they will arrive on the CDB)
Programming Execution Models • A set of rules to create programs • Message Passing Model • De Facto Multicomputer Programming Model • Multiple Address Space • Explicit Communication / Implicit Synchronization • Shared Memory Models • De Facto Multiprocessor Programming Model • Single Address Space • Implicit Communication / Explicit Synchronization
Shared Memory Execution Model • Thread Model: a set of rules for thread creation, scheduling and destruction • Memory Model: a group of rules that deals with data replication, coherency, and memory ordering • Synchronization Model: rules that deal with access to shared data • Thread Virtual Machine • Shared Data: data that can be accessed by other threads • Private Data: data that is not visible to other threads
Grand Challenge Problems • Shared memory multiprocessors that are effective at a scale of thousands of units • Optimize and compile parallel applications • Main areas: assumptions about • Memory Coherency • Memory Consistency
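Memory [Cache] Coherency: The Problem (figure): processors P1, P2 and P3 share a location U whose value in memory is 5. Events 1 to 5 are marked on the figure: copies U:5 are cached, one processor writes U:7, and P1 and P2 then read U:?. What values will P1 and P2 read?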
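The question on this slide can be explored with a toy model of private caches that have no coherence protocol. The sketch assumes the event order usually drawn for this figure (P1 reads U, P3 reads U, P3 writes 7, P1 reads U, P2 reads U); that order, the `Cache` class and the `run` helper are assumptions for illustration, since the figure itself did not survive extraction.

```python
class Cache:
    """A private cache with no invalidation or update messages."""

    def __init__(self, memory, write_through):
        self.memory = memory
        self.lines = {}
        self.write_through = write_through

    def read(self, addr):
        if addr not in self.lines:          # miss: fetch from memory
            self.lines[addr] = self.memory[addr]
        return self.lines[addr]             # hit: possibly stale copy

    def write(self, addr, value):
        self.lines[addr] = value
        if self.write_through:              # write-back caches keep it local
            self.memory[addr] = value


def run(write_through):
    memory = {"U": 5}
    p1, p2, p3 = (Cache(memory, write_through) for _ in range(3))
    p1.read("U")        # event 1: P1 caches U = 5
    p3.read("U")        # event 2: P3 caches U = 5
    p3.write("U", 7)    # event 3: P3 writes U = 7
    return p1.read("U"), p2.read("U")   # events 4 and 5


print("write-back:   ", run(write_through=False))
print("write-through:", run(write_through=True))
```

With write-back caches both P1 and P2 see the old value 5. With write-through caches P2 sees 7 from memory while P1 still reads its stale cached 5, so neither policy alone gives a coherent view.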
MCM: Category of Access, as presented in [Mosberger 93]
Memory Access
• Private
• Shared
  • Non-Competing
  • Competing
    • Non-synchronization
    • Synchronization
      • Release
      • Acquire
        • Non-exclusive
        • Exclusive
Uniform vs. Hybrid
Conventional MCM • Sequential Consistency • “… the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program.” [Lamport 79] ELEG652-07F
Memory Consistency Problem
  First thread:   B = 0; … ; A = 1; L1: print B
  Second thread:  A = 0; … ; B = 1; L2: print A
Assume that L1 and L2 are issued only after the other 4 instructions have completed. What are the possible values that are printed on the screen? Is 0, 0 a possible combination? The MCM: a software and hardware contract
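Under sequential consistency the question can be answered by brute force: enumerate every interleaving of the four writes that respects each thread's program order, then read B and A at the end. The sketch below assumes the first three statements belong to one thread and the last three to another, and that the elided statements ("…") touch neither A nor B; the name `sc_outcomes` is illustrative.

```python
from itertools import permutations

THREAD_1 = [("B", 0), ("A", 1)]   # B = 0; ...; A = 1   then L1: print B
THREAD_2 = [("A", 0), ("B", 1)]   # A = 0; ...; B = 1   then L2: print A


def sc_outcomes():
    """All (printed B, printed A) pairs allowed by sequential consistency."""
    outcomes = set()
    # Each distinct arrangement of "1122" says which thread performs the next
    # write; taking writes from each thread in order preserves program order.
    for schedule in set(permutations("1122")):
        pending = {"1": iter(THREAD_1), "2": iter(THREAD_2)}
        memory = {}
        for thread in schedule:
            var, value = next(pending[thread])
            memory[var] = value
        # Both prints happen after all four writes have completed.
        outcomes.add((memory["B"], memory["A"]))
    return outcomes


print(sorted(sc_outcomes()))
print((0, 0) in sc_outcomes())
```

The enumeration yields (0, 1), (1, 0) and (1, 1) but never (0, 0): printing 0 for both would require each thread's first write to come after the other thread's second write, which is a cycle. A machine that does print 0, 0 is therefore implementing a weaker memory model than sequential consistency.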