310 likes | 576 Views
Chapter3 Limitations on Instruction-Level Parallelism. Dr. Bernard Chen Ph.D. University of Central Arkansas Spring 2010. Overcome Data Hazards with Dynamic Scheduling. If there is a data dependence, the hazard detection hardware stalls the pipeline
E N D
Chapter3 Limitations on Instruction-Level Parallelism Dr. Bernard Chen Ph.D. University of Central Arkansas Spring 2010
Overcome Data Hazards with Dynamic Scheduling • If there is a data dependence, the hazard detection hardware stalls the pipeline • No new instructions are fetched or issued until the dependence is cleared • Dynamic Scheduling:the hardware rearrange the instruction execution to reduce the stalls while maintaining data flow and exception behavior
RAW • If two instructions are data dependent, they cannot execute simultaneously or be completely overlapped • If data dependence caused a hazard in pipeline, called a Read After Write (RAW) hazard I: add r1,r2,r3 J: sub r4,r1,r3
Overcome Data Hazards with Dynamic Scheduling • Key idea: Allow instructions behind stall to proceed DIV F0 <- F2/F4 ADD F10<- F0+F8 SUB F12<- F8-F14
Overcome Data Hazards with Dynamic Scheduling • Key idea: Allow instructions behind stall to proceed DIV F0 <- F2/F4 SUB F12<- F8-F14 ADD F10<- F0+F8
Overcome Data Hazards with Dynamic Scheduling • Key idea: Allow instructions behind stall to proceed DIV F0 <- F2/F4 SUB F12<- F8-F14 ADD F10<- F0+F8 • Enables out-of-order execution and allows out-of-order completion(e.g., SUB) • In a dynamically scheduled pipeline, all instructions still pass through issue stage in order (in-order issue)
Overcome Data Hazards with Dynamic Scheduling • It offers several advantages: • Simplifies the compiler • It allows code that compiled for one pipeline to run efficiently on a different pipeline • (Allow the processor to tolerate unpredictable delays such as cache misses)
Overcome Data Hazards with Dynamic Scheduling • However, Dynamic execution creates WAR and WAW hazards and makes exceptions harder • Name dependence:when 2 instructions use same register or memory location, called a name, but no flow of data between the instructions associated with that name; • There are 2 versions of name dependence
I: sub r4,r1,r3 J: add r1,r2,r3 K: mul r6,r1,r7 WAR • InstrJ writes operand before InstrI reads it • If it caused a hazard in the pipeline, called a Write After Read (WAR) hazard
I: sub r1,r4,r3 J: add r1,r2,r3 K: mul r6,r1,r7 WAW • InstrJ writes operand before InstrI writes it. • If anti-dependence caused a hazard in the pipeline, called a Write After Write (WAW) hazard
Example DIV r0 <- r2 / r4 ADD r6 <- r0 + r8 SUB r8 <- r10 – r14 MUL r6 <- r10 * r7 OR r3 <- r5 or r9
For you to practice • DIV r0 <- r2 / r4 • ADD r6 <- r0 + r8 • ST r1 <- r6 • SUB r8 <- r10 - r14 • MUL r6 <- r10 * r8
Overcome Data Hazards with Dynamic Scheduling • Instructions involved in a name dependence can execute simultaneously if name used in instructions is changed so instructions do not conflict • Register renaming resolves name dependence for regs • Either by compiler or by HW
Limits to ILP Assumptions for ideal/perfect machine to start: 1. Register renaming – infinite virtual registers => all register WAW & WAR hazards are avoided 2. Branch prediction – perfect; no mispredictions 3. Perfect Cache
Performance beyond single thread ILP • There can be much higher natural parallelism in some applications • Such as “Online processing system”: which has natural parallelism among the multiple queries and updates that are presented by requests • Such as Data Mining algorithms
Thread-level parallelism (TLP) • Thread: process with own instructions and data • thread may be a process part of a parallel program of multiple processes, or it may be an independent program • Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute • (Ch4: Data Level Parallelism: Perform identical operations on data, and lots of data)
Thread-level parallelism (TLP) • TLP explicitly represented by the use of multiple threads of execution that are inherently parallel • Goal: Use multiple instruction streams to improve • Throughput of computers that run many programs • Execution time of multi-threaded programs • TLP could be more cost-effective to exploit than ILP
New Approach: Mulithreaded Execution • Multithreading: multiple threads to share the functional units of 1 processor via overlapping • Processor must duplicate independent state of each thread e.g., a separate copy of register file, a separate PC, and for running independent programs, a separate page table
New Approach: Mulithreaded Execution • When switch? • Alternate instruction per thread (fine grain) • When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)
Fine-Grained Multithreading • Switches between threads on each instruction, causing the execution of multiples threads to be interleaved • Usually done in a round-robin fashion, skipping any stalled threads • CPU must be able to switch threads every clock
Fine-Grained Multithreading • Advantage is it can hide both short and long stalls, since instructions from other threads executed when one thread stalls • Disadvantage is it slows down execution of individual threads, since a thread ready to execute without stalls will be delayed by instructions from other threads
Course-Grained Multithreading • Switches threads only on costly stalls, such as L2 cache misses • Advantages • Relieves need to have very fast thread-switching • Doesn’t slow down thread, since instructions from other threads issued only when the thread encounters a costly stall
Course-Grained Multithreading • Disadvantage is hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs • Since CPU issues instructions from 1 thread, when a stall occurs, the pipeline must be emptied or frozen • New thread must fill pipeline before instructions can complete • Because of this start-up overhead, coarse-grained multithreading is better for reducing penalty of high cost stalls, where pipeline refill << stall time
Multithreaded Categories Thread 4 Thread 1 Thread 2 Thread 3 Thread 5
Multithreaded Categories Superscalar Fine-Grained Coarse-Grained (2clock cycle) Time (processor cycle) Thread 1 Thread 3 Thread 5 Thread 2 Thread 4 Idle slot
Multiprocessing Multiprocessing Thread 1 Thread 2