Final Review Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2010
Overcome Data Hazards with Dynamic Scheduling
• Key idea: allow instructions behind a stall to proceed. In program order, ADD must wait for DIV's result (F0), and the independent SUB is stuck behind the stall:
  DIV F0  <- F2/F4
  ADD F10 <- F0+F8
  SUB F12 <- F8-F14
• Dynamic scheduling lets the independent SUB execute while DIV is still in progress:
  DIV F0  <- F2/F4
  SUB F12 <- F8-F14
  ADD F10 <- F0+F8
• Enables out-of-order execution and allows out-of-order completion (e.g., SUB)
• In a dynamically scheduled pipeline, all instructions still pass through the issue stage in order (in-order issue)
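To make the bypass decision concrete, here is a minimal sketch (my illustration, not from the lecture; instructions are modeled simply as destination/source register sets) of the dependence check that decides whether a later instruction may proceed past a stalled earlier one:

```python
# Hedged sketch: classify whether `second` can safely execute before
# the stalled `first` completes. Each instruction is modeled as
# (destination register, set of source registers).

def independent(first, second):
    dst1, srcs1 = first
    dst2, srcs2 = second
    raw = dst1 in srcs2          # second reads first's result
    war = dst2 in srcs1          # second overwrites a source of first
    waw = dst2 == dst1           # both write the same register
    return not (raw or war or waw)

div = ("F0", {"F2", "F4"})       # DIV F0  <- F2/F4
add = ("F10", {"F0", "F8"})      # ADD F10 <- F0+F8  (RAW on F0)
sub = ("F12", {"F8", "F14"})     # SUB F12 <- F8-F14 (no conflict)

print(independent(div, add))     # False: ADD must wait for DIV
print(independent(div, sub))     # True: SUB may proceed past the stall
```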
Overcome Data Hazards with Dynamic Scheduling
• However, dynamic execution creates WAR and WAW hazards and makes exceptions harder
• Name dependence: two instructions use the same register or memory location (a "name"), but there is no flow of data between the instructions associated with that name
• There are 2 versions of name dependence
WAR (anti-dependence)
  I: sub r4,r1,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7
• InstrJ writes its operand (r1) before InstrI reads it
• If this anti-dependence causes a hazard in the pipeline, it is called a Write After Read (WAR) hazard
WAW (output dependence)
  I: sub r1,r4,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7
• InstrJ writes its operand (r1) before InstrI writes it
• If this output dependence causes a hazard in the pipeline, it is called a Write After Write (WAW) hazard
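Since WAR and WAW are name dependences rather than true data dependences, hardware removes them by register renaming. A minimal sketch (illustrative only; the physical register names p0, p1, ... are my invention, not the course's notation), applied to the WAW example above:

```python
# Hedged sketch: give every write a fresh physical register, so I and J
# no longer collide on r1; only the true RAW dependence (K reads J's
# result) survives renaming.

def rename(instrs):
    free = (f"p{i}" for i in range(100))       # fresh physical registers
    mapping = {}                                # architectural -> physical
    renamed = []
    for dst, srcs in instrs:
        srcs = [mapping.get(s, s) for s in srcs]   # read current names
        mapping[dst] = next(free)                   # new name per write
        renamed.append((mapping[dst], srcs))
    return renamed

prog = [("r1", ["r4", "r3"]),   # I: sub r1,r4,r3
        ("r1", ["r2", "r3"]),   # J: add r1,r2,r3  (WAW with I)
        ("r6", ["r1", "r7"])]   # K: mul r6,r1,r7  (RAW on J's r1)

for dst, srcs in rename(prog):
    print(dst, "<-", srcs)
# p0 <- ['r4', 'r3']
# p1 <- ['r2', 'r3']   I and J now write different registers
# p2 <- ['p1', 'r7']   K still reads J's value, as it must
```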
Thread-level parallelism (TLP)
• Thread: process with its own instructions and data
• A thread may be a process that is part of a parallel program of multiple processes, or it may be an independent program
• Each thread has all the state (instructions, data, PC, register state, and so on) necessary to allow it to execute
• (Ch. 4, Data-Level Parallelism: perform identical operations on data, and lots of data)
New Approach: Multithreaded Execution
• Multithreading: multiple threads share the functional units of one processor via overlapping
• The processor must duplicate the independent state of each thread: e.g., a separate copy of the register file, a separate PC, and, for running independent programs, a separate page table
New Approach: Multithreaded Execution
• When to switch?
• Alternate instructions per thread every cycle (fine grain)
• When a thread is stalled, perhaps for a cache miss, another thread can be executed (coarse grain)
Fine-Grained Multithreading
• Switches between threads on each instruction, causing the execution of multiple threads to be interleaved
• Usually done in a round-robin fashion, skipping any stalled threads
• The CPU must be able to switch threads every clock cycle
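A minimal sketch of this policy (an assumed model, not from the slides): one issue slot per cycle, handed out round-robin while skipping any thread stalled in that cycle:

```python
# Hedged sketch of fine-grained thread selection.
def fine_grained_schedule(stalls, n_threads):
    """stalls: per-cycle sets of stalled thread ids.
    Returns the thread that issues each cycle (None = idle slot)."""
    issued, nxt = [], 0
    for stalled in stalls:
        pick = None
        for i in range(n_threads):
            t = (nxt + i) % n_threads    # probe in round-robin order
            if t not in stalled:
                pick = t
                break
        issued.append(pick)
        if pick is not None:
            nxt = (pick + 1) % n_threads
    return issued

# Thread 1 is stalled (say, a cache miss) in cycles 2 and 3 and is
# simply skipped; the pipeline stays busy.
print(fine_grained_schedule([set(), set(), {1}, {1}, set()], 3))
# -> [0, 1, 2, 0, 1]
```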
Coarse-Grained Multithreading
• Switches threads only on costly stalls, such as L2 cache misses
• Advantages:
• Relieves the need for very fast thread switching
• Doesn't slow down an individual thread, since instructions from other threads are issued only when that thread encounters a costly stall
Coarse-Grained Multithreading
• Disadvantage: it is hard to overcome throughput losses from shorter stalls, due to pipeline start-up costs
• Since the CPU issues instructions from one thread, when a stall occurs the pipeline must be emptied or frozen
• The new thread must fill the pipeline before instructions can complete
• Because of this start-up overhead, coarse-grained multithreading is better at reducing the penalty of high-cost stalls, where pipeline refill time << stall time
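A back-of-the-envelope illustration (the cycle counts are assumptions chosen for the example, not figures from the course) of why the refill cost makes switching worthwhile only for long stalls:

```python
# Hedged sketch: switching threads costs one pipeline refill, so it
# only recovers cycles when the stall is much longer than the refill.
refill = 8                        # assumed cycles to refill the pipeline
for stall in (3, 25, 200):        # short stall, mid-level miss, L2 miss
    saved = stall - refill        # cycles recovered by switching
    verdict = "switch" if saved > 0 else "stay"
    print(f"stall={stall:>3} cycles: {verdict} (net {saved:+} cycles)")
```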
Multithreaded Categories
[Figure: issue slots over time (processor cycles) for Superscalar, Fine-Grained, and Coarse-Grained (2-clock-cycle switch) execution; colored squares mark instructions from Threads 1-5, empty squares mark idle slots]
Flynn's Taxonomy
M. J. Flynn, "Very High-Speed Computing Systems," Proc. of the IEEE, vol. 54, pp. 1901-1909, Dec. 1966.
Back to Basics
• "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast."
• Parallel Architecture = Computer Architecture + Communication Architecture
• 2 classes of multiprocessors WRT memory:
• Centralized Memory Multiprocessor: up to a few dozen processor chips; small enough to share a single, centralized memory
• Physically Distributed-Memory Multiprocessor: larger numbers of chips and cores; bandwidth demands mean memory is distributed among the processors
2 Models for Communication and Memory Architecture
• In the first kind, communication occurs through a shared address space
• Centralized-memory multiprocessors use this type of communication and are called symmetric shared-memory multiprocessors
2 Models for Communication and Memory Architecture
• In the first kind, communication occurs through a shared address space
• Even physically separate memories can be addressed as one logically shared address space
• Meaning that a memory reference can be made by any processor to any memory location (assuming it has the access rights)
• These multiprocessors are called distributed shared memory (DSM) multiprocessors
2 Models for Communication and Memory Architecture
• Communication occurs through a shared address space (via loads and stores): shared-memory multiprocessors, either
• symmetric shared memory (centralized-memory MP)
• distributed shared memory (distributed-memory MP)
• Communication occurs by explicitly passing messages among the processors: message-passing multiprocessors (distributed-memory MP)
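To make the two models tangible, here is a minimal sketch using Python's standard multiprocessing module as a stand-in (my example, not from the slides): the first worker communicates through a shared location, the second by passing an explicit message:

```python
from multiprocessing import Process, Value, Pipe

def shared_writer(x):
    x.value = 42                  # a plain "store" any process can see

def message_writer(conn):
    conn.send(42)                 # an explicit message instead of a store
    conn.close()

if __name__ == "__main__":
    # Shared address space: communicate via loads and stores.
    x = Value("i", 0)
    p = Process(target=shared_writer, args=(x,))
    p.start(); p.join()
    print(x.value)                # 42, read with an ordinary load

    # Message passing: communicate by send/receive.
    parent, child = Pipe()
    q = Process(target=message_writer, args=(child,))
    q.start()
    print(parent.recv())          # 42, received explicitly
    q.join()
```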
Multiprocessor Performance
• Amdahl's Law
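The slide gives only the name; the usual multiprocessor form of the law, with a worked example (the 80% / 4-processor numbers are mine, for illustration), is:

```latex
\[
\text{Speedup}_{\text{overall}}
  = \frac{1}{\left(1 - F_{\text{parallel}}\right)
             + \dfrac{F_{\text{parallel}}}{N}}
\]
% Worked example: if 80% of the work parallelizes across N = 4
% processors, Speedup = 1 / (0.2 + 0.8/4) = 1 / 0.4 = 2.5.
```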
2 Classes of Cache Coherence Protocols
• Snooping — every cache with a copy of the data also has a copy of the sharing status of the block, but no centralized state is kept
• Directory based — the sharing status of a block of physical memory is kept in just one location, the directory
Snooping
• Write through: the information is written both to the block in the cache and to the block in the lower-level memory
• Write back: the information is written only to the block in the cache; the modified cache block is written to main memory only when it is replaced or needed by another processor
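A minimal sketch (a toy model, not a real cache) contrasting the two policies; note that the write-back cache touches memory only in evict:

```python
class WriteThroughCache:
    """Every store updates both the cache and lower-level memory."""
    def __init__(self, memory):
        self.memory, self.block = memory, {}
    def write(self, addr, value):
        self.block[addr] = value
        self.memory[addr] = value        # memory updated immediately

class WriteBackCache:
    """Stores mark the block dirty; memory is updated on eviction."""
    def __init__(self, memory):
        self.memory, self.block, self.dirty = memory, {}, set()
    def write(self, addr, value):
        self.block[addr] = value
        self.dirty.add(addr)             # memory is now stale
    def evict(self, addr):
        if addr in self.dirty:           # write back only if modified
            self.memory[addr] = self.block[addr]
            self.dirty.discard(addr)
        self.block.pop(addr, None)

mem = {0: 7}
wb = WriteBackCache(mem)
wb.write(0, 6)
print(mem[0])                            # 7: memory still stale
wb.evict(0)
print(mem[0])                            # 6: written back on eviction
```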
Directory-Based Cache Coherence Protocols
• To implement the operations, a directory must track the state of each cache block:
• Shared (S): one or more processors have the block cached, and the value is up to date
• Uncached (U): no processor has a copy of the cache block
• Modified/Exclusive (E): exactly one processor has a copy of the cache block; that processor is called the owner of the block
Directory-based Protocol: Example
[Figure sequence: three CPUs, each with a cache and a local memory, connected by an interconnection network; the directory entry for block X holds its state plus a presence-bit vector, one bit per CPU]
• Initially: X is Uncached (U, bits 000); memory holds X = 7
• CPU 0 reads X: Shared (S, 100); CPU 0 caches the value 7
• CPU 2 reads X: S, 101; CPU 0 and CPU 2 both cache 7
• CPU 0 writes 6 to X: the other copy is invalidated; Exclusive (E, 100); CPU 0's cache holds 6, memory still holds 7
• CPU 1 reads X: CPU 0's dirty value is written back; S, 110; memory, CPU 0, and CPU 1 all hold 6
• CPU 2 writes 5 to X (write back): E, 001; CPU 2's cache holds 5, memory holds 6
• CPU 0 writes 4 to X: CPU 2's dirty copy is written back (memory now holds 5) and invalidated; E, 100; CPU 0's cache holds 4
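A minimal sketch (a simplified invalidation model; real protocols also exchange explicit write-back and ownership messages) of the directory bookkeeping that drives the walkthrough above:

```python
class DirectoryEntry:
    """Directory state (U/S/E) plus a presence-bit vector for one block."""
    def __init__(self, n_cpus, mem_value):
        self.state = "U"
        self.bits = [0] * n_cpus
        self.value = mem_value           # current value of the block

    def read(self, cpu):
        self.bits[cpu] = 1               # requester gets a copy
        self.state = "S"                 # E degrades to S; owner keeps copy
        return self.value

    def write(self, cpu, value):
        self.bits = [1 if i == cpu else 0
                     for i in range(len(self.bits))]  # invalidate others
        self.state = "E"                 # writer becomes the owner
        self.value = value

x = DirectoryEntry(3, mem_value=7)       # initially U 000
x.read(0)                                # CPU 0 reads   -> S 100
x.read(2)                                # CPU 2 reads   -> S 101
x.write(0, 6)                            # CPU 0 writes  -> E 100
x.read(1)                                # CPU 1 reads   -> S 110
x.write(2, 5)                            # CPU 2 writes  -> E 001
x.write(0, 4)                            # CPU 0 writes  -> E 100
print(x.state, x.bits, x.value)          # E [1, 0, 0] 4
```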
Evaluating Switch Topologies
• Diameter: the distance between the two farthest nodes
• Bisection width: the minimum number of edges in a cut that divides the network into two roughly equal halves; determines the minimum bandwidth of the network
• Degree: the number of edges per node; a constant degree means the switch boards can be mass-produced
• Constant edge length? (yes/no)
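As a closing illustration, a sketch computing these metrics for two standard topologies (the formulas are the usual closed forms; the 16-node examples are mine):

```python
import math

def mesh_2d(p):
    """Metrics for an n x n two-dimensional mesh, p = n*n nodes."""
    n = math.isqrt(p)
    return {"diameter": 2 * (n - 1),     # corner to opposite corner
            "bisection_width": n,        # edges cut across the middle
            "degree": 4,                 # constant: boards mass-producible
            "constant_edge_length": True}

def hypercube(p):
    """Metrics for a d-dimensional hypercube, p = 2**d nodes."""
    d = int(math.log2(p))
    return {"diameter": d,
            "bisection_width": p // 2,
            "degree": d,                 # grows with p
            "constant_edge_length": False}

print(mesh_2d(16))     # diameter 6, bisection width 4
print(hypercube(16))   # diameter 4, bisection width 8
```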