
Review

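Review. Professor Alvin R. Lebeck, CompSci 220 / ECE 252, Fall 2008. Amdahl’s Law: $\text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times \left[ (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right]$, so $\text{Speedup}_{\text{overall}} = \frac{\text{ExTime}_{\text{old}}}{\text{ExTime}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$.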


Presentation Transcript


  1. Review. Professor Alvin R. Lebeck, CompSci 220 / ECE 252, Fall 2008

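  2. Amdahl’s Law
     $$\text{ExTime}_{\text{new}} = \text{ExTime}_{\text{old}} \times \left[ (1 - \text{Fraction}_{\text{enhanced}}) + \frac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}} \right]$$
     $$\text{Speedup}_{\text{overall}} = \frac{\text{ExTime}_{\text{old}}}{\text{ExTime}_{\text{new}}} = \frac{1}{(1 - \text{Fraction}_{\text{enhanced}}) + \dfrac{\text{Fraction}_{\text{enhanced}}}{\text{Speedup}_{\text{enhanced}}}}$$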
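As a quick numeric check of Amdahl's Law, the following is a minimal sketch in Python. The function name `amdahl_speedup` and the example numbers (40% of run time sped up 10x) are illustrative assumptions, not values from the slides.

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only part of the execution time is enhanced."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be in [0, 1]")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    # Speedup_overall = 1 / ((1 - F) + F / S)
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


# Hypothetical example: 40% of the run time is made 10x faster.
print(round(amdahl_speedup(0.4, 10.0), 4))  # 1.5625
```

Even with an arbitrarily large `speedup_enhanced`, the result here cannot exceed 1 / (1 - 0.4), which is why the unenhanced fraction dominates.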

  3. Review: Performance
     $$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$
     • “Average Cycles Per Instruction”
     • “Instruction Frequency”
     • Invest resources where time is spent!
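The performance equation can be sketched in a few lines of Python. This assumes the usual reading of the two quoted labels, namely that average CPI is the sum over instruction classes of (instruction frequency × class CPI). The function names and the instruction mix below are hypothetical.

```python
def average_cpi(mix):
    """mix: iterable of (instruction_frequency, cycles_per_instruction) pairs.

    Assumes the frequencies sum to 1.
    """
    return sum(freq * cpi for freq, cpi in mix)


def cpu_time_seconds(instruction_count, cpi, clock_hz):
    # Instructions/Program x Cycles/Instruction x Seconds/Cycle
    return instruction_count * cpi / clock_hz


# Hypothetical instruction mix: (frequency, CPI) per class.
mix = [
    (0.5, 1),  # ALU
    (0.2, 2),  # load
    (0.1, 2),  # store
    (0.2, 2),  # branch
]
cpi = average_cpi(mix)
print(round(cpi, 3))                                    # about 1.5
print(round(cpu_time_seconds(1_000_000_000, cpi, 1e9), 3))  # about 1.5 seconds
```

In this made-up mix the ALU class and the other three classes contribute about 0.5 and 1.0 cycles per instruction respectively, which is the kind of breakdown that shows where time is spent.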

  4. Little’s Law
     • Key relationship between latency and bandwidth:
     • Average number in system = arrival rate × mean holding time
     • Example: How big a wine cellar should we build?
     • We drink (and buy) an average of 4 bottles per week
     • On average, I want to age my wine 5 years
     • Bottles in cellar = 4 bottles/week × 52 weeks/year × 5 years = 1040 bottles
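The wine-cellar arithmetic can be reproduced directly. This is a small sketch; the function name `littles_law` is an assumption made for illustration.

```python
def littles_law(arrival_rate, mean_holding_time):
    """Average number in system = arrival rate x mean holding time.

    Both arguments must use the same time unit (here: years).
    """
    return arrival_rate * mean_holding_time


bottles_per_year = 4 * 52          # 4 bottles/week x 52 weeks/year
years_aged = 5
print(littles_law(bottles_per_year, years_aged))  # 1040
```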

  5. Review: The Five Stages of a Load
     [Figure: a Load passing through Ifetch, Reg/Dec, Exec, Mem, WrB in Cycles 1 to 5]
     • Ifetch: Instruction Fetch. Fetch the instruction from the Instruction Memory
     • Reg/Dec: Registers Fetch and Instruction Decode
     • Exec: Calculate the memory address
     • Mem: Read the data from the Data Memory
     • WrB: Write the data back to the register file

  6. It’s Not That Easy for Computers
     • What could go wrong?
     • Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
     • Structural hazards: HW cannot support this combination of instructions
     • Data hazards: instruction depends on the result of a prior instruction still in the pipeline (RAW, WAW, WAR)
     • Control hazards: pipelining of branches & other instructions

  7. Although Beq is fetched during Cycle 4:
     • Target address is NOT written into the PC until the end of Cycle 7
     • Branch’s target is NOT fetched until Cycle 8
     • 3-instruction delay before the branch takes effect
     This is called a Control Hazard.
     [Figure: pipeline timing diagram, Cycles 4 to 11, each instruction passing through Ifetch, Reg/Dec, Exec, Mem, Wr. Instructions in order: 12: Beq (target is 1000); 16: R-type; 20: R-type; 24: R-type; 1000: Target of Br]

  8. Dynamic Branch Prediction
     • Solution: 2-bit counter where the prediction changes only after two mispredictions
     • Increment for taken, decrement for not-taken
     • States: 00, 01, 10, 11
     • Helps when target is known before condition
     [Figure: four-state diagram, two Predict Taken states and two Predict Not Taken states, with transitions labeled T (taken) and NT (not taken)]
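The increment/decrement scheme can be sketched as a saturating counter. This is a minimal illustration rather than any particular processor's design; the class name, the choice of states 10 and 11 as "predict taken", and the loop-branch trace are assumptions.

```python
class TwoBitCounter:
    """2-bit saturating counter: states 0..3, predict taken when state >= 2."""

    def __init__(self, state=0):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Increment for taken, decrement for not-taken, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_correct(outcomes, start_state=3):
    counter = TwoBitCounter(start_state)
    correct = 0
    for taken in outcomes:
        if counter.predict() == taken:
            correct += 1
        counter.update(taken)
    return correct


# Hypothetical loop branch: taken 9 times, then falls through, run twice.
trace = ([True] * 9 + [False]) * 2
print(count_correct(trace), "of", len(trace))  # 18 of 20
```

Starting from the strongest taken state, the single not-taken outcome at each loop exit costs one misprediction but does not flip the prediction for the next loop entry, which is the point of requiring two mispredictions.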

  9. Need Address @ Same Time as Prediction
     • Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch target address (if taken)
     • Note: must check for branch match now, since can’t use wrong branch address (Figure 4.22, p. 273)
     • Branch folding?
     • Procedure return addresses predicted with a stack
     [Figure: BTB with entries 0 … n-1. The PC of the instruction to fetch is compared (=) against stored branch PCs; each entry holds a predicted PC and a branch prediction (taken or not taken). Match: "Yes, use predicted PC". No match: "No, not branch".]
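A BTB lookup with the branch-match check might look like the sketch below. It assumes a direct-mapped table, 4-byte instructions, and a policy of storing only taken branches; none of these details come from the slide, and the class and method names are hypothetical.

```python
class BranchTargetBuffer:
    """Toy direct-mapped BTB. Each entry holds (branch_pc, predicted_target)."""

    def __init__(self, entries=16):
        self.entries = entries
        self.table = [None] * entries

    def _index(self, pc):
        # Assumes 4-byte instructions, so the low 2 bits carry no information.
        return (pc >> 2) % self.entries

    def lookup(self, pc):
        """Return the predicted target, or None if pc is not a known taken branch."""
        entry = self.table[self._index(pc)]
        # The stored PC acts as a tag: without this match check, a different
        # instruction that maps to the same entry would use a wrong target.
        if entry is not None and entry[0] == pc:
            return entry[1]
        return None

    def update(self, pc, target, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = (pc, target)
        elif self.table[i] is not None and self.table[i][0] == pc:
            self.table[i] = None  # branch fell through: stop predicting taken


def next_fetch_pc(btb, pc):
    predicted = btb.lookup(pc)
    return predicted if predicted is not None else pc + 4


btb = BranchTargetBuffer()
print(next_fetch_pc(btb, 12))      # 16: branch at PC 12 not seen yet
btb.update(12, 1000, taken=True)   # Beq at 12 resolves as taken, target 1000
print(next_fetch_pc(btb, 12))      # 1000
print(next_fetch_pc(btb, 76))      # 80: same table entry as PC 12, but no match
```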

  10. Hybrid/Competitive/Selective Branch Predictor
      • Different predictors work better for different branches
      • Pick the predictor that works best for a given branch

  11. Unrolled Loop That Minimizes Stalls
      • What assumptions were made when the code was moved?
      • OK to move the store past SUBI even though SUBI changes the register
      • OK to move loads before stores: get right data?
      • When is it safe for the compiler to make such changes?

      LOOP: LD    F0,0(R1)
            LD    F6,-8(R1)
            LD    F10,-16(R1)
            LD    F14,-24(R1)
            ADDD  F4,F0,F2
            ADDD  F8,F6,F2
            ADDD  F12,F10,F2
            ADDD  F16,F14,F2
            SD    0(R1),F4
            SD    -8(R1),F8
            SD    -16(R1),F12
            SUBI  R1,R1,#32
            BNEZ  R1,LOOP
            SD    8(R1),F16     ; 8-32 = -24

      14 clock cycles, or 3.5 per iteration

  12. SW Pipelining Example

      Before: Unrolled 3 times

      LOOP: LD    F0,0(R1)
            ADDD  F4,F0,F2
            SD    0(R1),F4
            LD    F6,-8(R1)
            ADDD  F8,F6,F2
            SD    -8(R1),F8
            LD    F10,-16(R1)
            ADDD  F12,F10,F2
            SD    -16(R1),F12
            SUBI  R1,R1,#24
            BNEZ  R1,LOOP

      After: Software Pipelined

            LD    F0,0(R1)
            ADDD  F4,F0,F2
            LD    F0,-8(R1)
      LOOP: SD    0(R1),F4      ; Stores M[i]
            ADDD  F4,F0,F2      ; Adds to M[i-1]
            LD    F0,-16(R1)    ; Loads M[i-2]
            SUBI  R1,R1,#8
            BNEZ  R1,LOOP
            SD    0(R1),F4
            ADDD  F4,F0,F2
            SD    -8(R1),F4

      [Figure: SD, ADDD, and LD overlapped in the pipeline (IF, ID, EX, Mem, WB), annotated with Read F4, Read F0, Write F4, Write F0]

  13. Tomasulo Organization
      [Figure: block diagram. Labels: From Instruction Unit, From Memory, FP Registers, Load Buffers, FP op queue, Operand Bus, Store Buffers, To Memory, FP multipliers, FP adders, Common Data Bus (CDB)]

  14. Tomasulo Summary
      • Prevents registers from becoming a bottleneck
      • Avoids the WAR and WAW hazards of the scoreboard
      • Allows loop unrolling in HW
      • Not limited to basic blocks (provided branch prediction)
      • Lasting contributions: dynamic scheduling, register renaming, load/store disambiguation

  15. Speculation (getting more ILP)
      • Speculation: allow an instruction that depends on a branch predicted taken to issue, with no consequences (including exceptions) if the branch is not actually taken (“HW undo” squash)
      • Often combined with dynamic scheduling
      • Separate speculative bypassing of results from real bypassing of results
      • When an instruction is no longer speculative, write results (instruction commit)
      • Execute out of order but commit in order
      • Memory operations (memory disambiguation)
      • Interrupts -> maintaining precise exceptions

  16. HW Support for More ILP
      • Need HW buffer for results of uncommitted instructions: reorder buffer
      • Reorder buffer can be operand source
      • Once operand commits, result is found in register
      • 3 fields: instr. type, destination, value
      • Use reorder buffer number instead of reservation station
      • Instructions commit in order
      • As a result, it’s easy to undo speculated instructions on mispredicted branches or on exceptions
      [Figure: block diagram with Reorder Buffer, FP Op Queue, FP Regs, two sets of Res Stations, and two FP Adders]
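The in-order commit and squash behavior described above can be sketched with a queue. This is a toy model under stated assumptions: the class and method names are hypothetical, entries carry the three fields from the slide plus a `done` flag, and reservation stations and operand forwarding are left out.

```python
from collections import deque


class ReorderBuffer:
    """Toy reorder buffer: results wait here until they commit in program order."""

    def __init__(self):
        self.entries = deque()  # oldest instruction at the left
        self.registers = {}     # committed (architectural) register state

    def issue(self, kind, dest):
        # Fields from the slide: instruction type, destination, value.
        entry = {"kind": kind, "dest": dest, "value": None, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value=None):
        # Execution may finish in any order.
        entry["value"] = value
        entry["done"] = True

    def commit(self):
        # Retire only from the head, so commits happen in program order.
        retired = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            if entry["dest"] is not None:
                self.registers[entry["dest"]] = entry["value"]
            retired.append(entry["kind"])
        return retired

    def squash_younger_than(self, entry):
        # Undo speculation: discard everything issued after 'entry'.
        # Assumes 'entry' is still in the buffer.
        while self.entries and self.entries[-1] is not entry:
            self.entries.pop()


rob = ReorderBuffer()
add = rob.issue("ADDD", "F4")
load = rob.issue("LD", "F0")
rob.complete(load, 2.0)   # the younger load finishes first
print(rob.commit())       # []: ADDD at the head is not done yet
rob.complete(add, 1.0)
print(rob.commit())       # ['ADDD', 'LD']: both retire, oldest first
```

Because squashed entries never reach `registers`, recovering from a mispredicted branch only requires dropping the younger entries.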

  17. Recovering from Incorrect Speculation
      • Reorder Buffer
      • Register Update Unit: reorder buffer and reservation stations combined
      • P6 style: reorder buffer separate from reservation stations
      • R10K style
      • Separate physical register file from reorder buffer
      • Must maintain a map of logical to physical registers
      • Enables easy recovery from misprediction & exceptions
      • Memory Disambiguation
      • Load/store queue (Memory Order Buffer)

  18. Superscalar & VLIW
      • Wider pipelines
      • Superscalar, multiple PCs
      • VLIW, multiple operations for each PC
      • Problems w/ Superscalar
      • Wide fetch
      • Dependence check
      • Bypassing
      • Need large window to find independent ops

  19. Trace Scheduling
      • Reorder these instructions to improve ILP
      • Fix-up instructions in case we were wrong

  20. Trace Cache
      • Store traces
      • Enables fetch past next branch
      • Enables branch folding

  21. Predicated/Conditional Execution (more ILP)
      • Avoid branch prediction by turning branches into conditionally executed instructions: if (x) then A = B op C else NOP
      • If false, then neither store result nor cause exception
      • Expanded ISAs of Alpha, MIPS, PowerPC, and SPARC have conditional move; PA-RISC can annul any following instruction; IA-64 has predicated execution
      • Drawbacks to conditional instructions
      • Still takes a clock even if “annulled”
      • Stall if condition evaluated late
      • Complex conditions reduce effectiveness; condition becomes known late in pipeline

  22. Review: ABCs of Caches
      • Associativity (A)
      • Block size (B)
      • Capacity (C)
      • Number of sets S = C/(BA)
      • 1-way (direct-mapped): A = 1, S = C/B
      • N-way set-associative
      • Fully associative: S = 1, C = BA
      • Know how a specific piece of data is found
      • Index, tag, block offset
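To make S = C/(BA) and the index/tag/block-offset split concrete, here is a small sketch. It assumes byte addresses and power-of-two sizes; the function names and the example cache (32 KiB, 64-byte blocks, 4-way) are illustrative choices, not values from the slides.

```python
def num_sets(capacity, block_size, associativity):
    """S = C / (B * A), with capacity and block size in bytes."""
    return capacity // (block_size * associativity)


def split_address(addr, capacity, block_size, associativity):
    """Return (tag, index, block_offset) for a byte address."""
    sets = num_sets(capacity, block_size, associativity)
    offset = addr % block_size            # byte within the block
    index = (addr // block_size) % sets   # which set to search
    tag = addr // (block_size * sets)     # compared against every way in the set
    return tag, index, offset


C, B = 32 * 1024, 64
print(num_sets(C, B, 4))                       # 128 sets, 4-way
print(num_sets(C, B, 1))                       # 512 sets, direct-mapped
print(num_sets(C, B, C // B))                  # 1 set, fully associative
print(split_address(0x12345678, C, B, 4))      # (37282, 89, 56)
```

In the fully associative case the index is always 0, so the whole block number serves as the tag.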

  23. Cache Performance
      CPU time = (CPU execution clock cycles + Memory stall clock cycles) × clock cycle time
      Memory stall clock cycles = (Reads × Read miss rate × Read miss penalty) + (Writes × Write miss rate × Write miss penalty)
      Memory stall clock cycles = Memory accesses × Miss rate × Miss penalty
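The stall-cycle formulas translate directly into code. The sketch below uses hypothetical function names and made-up workload numbers purely to show how the terms combine.

```python
def memory_stall_cycles(accesses, miss_rate, miss_penalty):
    # Memory accesses x Miss rate x Miss penalty (penalty in clock cycles)
    return accesses * miss_rate * miss_penalty


def cpu_time(exec_cycles, stall_cycles, cycle_time):
    # (CPU execution clock cycles + Memory stall clock cycles) x clock cycle time
    return (exec_cycles + stall_cycles) * cycle_time


# Hypothetical workload: 1M execution cycles, 300k memory accesses,
# 2% miss rate, 100-cycle miss penalty, 1 ns clock.
stalls = memory_stall_cycles(300_000, 0.02, 100)
print(round(stalls))                                # about 600000 stall cycles
print(round(cpu_time(1_000_000, stalls, 1e-9), 6))  # about 0.0016 seconds
```

With these invented numbers, memory stalls account for more than a third of total cycles, even at a 2% miss rate.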

  24. Improving Cache Performance
      Average memory access time = Hit time + (Miss rate × Miss penalty)
      1. Reduce the miss rate,
      2. Reduce the miss penalty, or
      3. Reduce the time to hit in the cache.
      • NUCA
      • Indirect Index Cache
      • Continual Flow Pipelines
      • Checkpoint recovery
      • Nonblocking queues to tolerate long latency operations
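A short sketch shows how each of the three levers changes the average memory access time. The function name `amat` and all numbers below are assumptions chosen only for illustration.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time = hit time + miss rate x miss penalty."""
    return hit_time + miss_rate * miss_penalty


# Hypothetical baseline: 1-cycle hit, 5% miss rate, 100-cycle miss penalty.
baseline = amat(1, 0.05, 100)
halve_miss_rate = amat(1, 0.025, 100)
halve_miss_penalty = amat(1, 0.05, 50)
slower_hit_fewer_misses = amat(2, 0.02, 100)  # e.g. a larger, slower cache

for name, value in [
    ("baseline", baseline),
    ("halve miss rate", halve_miss_rate),
    ("halve miss penalty", halve_miss_penalty),
    ("slower hit, fewer misses", slower_hit_fewer_misses),
]:
    print(f"{name}: {value:.2f} cycles")
```

The last case is a reminder that the three terms trade off against each other: a change that lowers the miss rate can still be a net loss if it raises the hit time too much.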

  25. Virtual Memory
      • Paged VM
      • Segmented VM
      • TLBs (where is it)
      • Page coloring
      • Main Memory
      • Interleaving
      • Banked

  26. Multithreading & Multiprocessing
      • Multithreaded vs. multicore
      • Simultaneous multithreaded
      • Shared Memory vs. Message Passing
      • SIMD vs. MIMD
      • Snooping Coherence Protocol
      • Directory Coherence Protocol
      • Locks & Synchronization
      • Memory Consistency
      • Papers
      • Power as Design Target
      • Choice in SMT
      • Multiscalar
      • Niagara

  27. Interconnection Networks
      • Topologies
      • Routing
      • Deadlock
      • Livelock

  28. Papers
      • Reliability
      • Diva
      • Argus
      • Security
      • Rifle
      • Virtual Machines
      • Grid Processors

  29. Projects
      • Project Presentations in D344 LSRC
      • 10am to noon

      Tuesday Dec 2 (D344 LSRC)
      • Zachary Drillings, David Eitel, Alex Edelsburg, "Detouring 2: Toward Industry"
      • Jun Pang, Meng Zhang, "FPGA Implementation and Improvement of Core Cannabalization Architecture"
      • Risi Thonangi, Amre Shakimov, "Speeding up B-Tree Accesses using Flash-memory"
      • Abhinav Kapur, Alex Hunter, Paul Huang, "Architectural Support for Computational Biology"
      • Archana Ramamoorthy, Yulin Zhang, "Dynamic Branch Predication"
      • Xuhan Peng, Mengyuan Huang, "Improving Correlated Branch Predictors"

      Thursday Dec 4 (D344 LSRC)
      • George Rossin, Ben Shelton, Philp Eithier, "Adaptive Cache Replacement Algorithms"
      • Laura Angle, Andrew First, Preeyanka Shah, "Argus Testing and Analysis"
      • Mustafa Lokhandwal, Dongtao Liu, "Evaluating Parallel Programming Models"
      • Pablo Gainza, Ryan Scudellari, "Evaluation of Prefix-Sum Parallel Application using CUDA, Pthreads and message passing"
      • Keven Brown, Aleks Klimas, "Dynamically Allocated Functional Units (DAFU)"
      • Kai Wang, Yang Jiang, Xiaoyan Yin, "Store Queue Optimization"
      • Amrita Halappanavar, J. P. Carafo, Abhinav Mohan, "Speculative Issue Logic"
