Slide 1:Design Tradeoffs in Instruction Window of Superscalar Processors
Presented by: Chunming Gao
MS Project Proposal
Committee members: Dr. Soner Onder (Chair), Dr. Steven Carr, Dr. David Poplawski, Dr. Jianping Dong
Slide 2:Outline of the presentation
Part one: Introduction
Part two: Background
Part three: Instruction window organizations
Part four: Work plan and preliminary results
Slide 3:Part One Introduction
Slide 4:Motivation
1. Exploiting more instruction-level parallelism requires a large instruction window.
2. Pursuing a high clock speed limits the size of the instruction window.
Slide 5:What Will We Study
1. Central window design
2. Distributed window design
3. Dependence-based window design
4. Cluster-based window design
5. PEWs (parallel execution windows)
6. Direct wake-up based window design
Slide 6:How Do We Define Performance
1. IPC (instructions per cycle)
2. Clock cycle time
3. Ratio of each design's IPC to that of a baseline processor
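The three measures can be related with a little arithmetic. The sketch below is illustrative only: the function names and the sample numbers are assumptions, not values from this study. It shows why IPC and clock cycle time have to be read together.

```python
def ipc(instructions: int, cycles: int) -> float:
    """Instructions per cycle over a simulated run."""
    return instructions / cycles


def relative_ipc(design_ipc: float, baseline_ipc: float) -> float:
    """IPC of a window design as a ratio to the baseline processor's IPC."""
    return design_ipc / baseline_ipc


def execution_time_ns(instructions: int, ipc_value: float, cycle_time_ns: float) -> float:
    """Run time = instruction count / IPC * clock cycle time."""
    return instructions / ipc_value * cycle_time_ns


# Hypothetical numbers: a design with a lower IPC can still finish sooner
# if its simpler window allows a shorter clock cycle.
baseline = execution_time_ns(3000, ipc_value=2.0, cycle_time_ns=1.0)
candidate = execution_time_ns(3000, ipc_value=1.5, cycle_time_ns=0.5)
```

With these made-up inputs the candidate has a relative IPC of 0.75 yet a shorter run time, which is the tradeoff the Motivation slide describes.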
Slide 7:Part Two Background
Slide 8:Superscalar Processor Stages
[Figure: superscalar pipeline stages Fetch, Decode, Dispatch, Execute, Complete, and Retire, with the instruction buffer, dispatch buffer, issuing buffer, completion buffer, and store buffer.]
Slide 9:Bottlenecks of Superscalar Processors
1. Structural hazards: A conflict between multiple instructions that require the same resource at the same time.
2. Control hazards: Instructions following a branch cannot be executed until the branch is resolved.
3. Data hazards: An instruction depends on the result of a previous instruction.
Slide 10:Data Dependencies
1. True data dependencies:
RAW (Read after write)
i: add r3 r2 r1;
j: add r6 r3 r4;
2. False data dependencies:
WAR (Write after read)
k: add r6 r3 r4;
l: add r3 r7 r1;
WAW (Write after write)
m: add r3 r2 r1;
n: add r3 r7 r1;
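The three cases can be checked mechanically. The following is a minimal sketch, assuming the slide's instruction format of opcode, destination, then two sources; the `Instr`, `parse`, and `dependencies` names are made up for illustration.

```python
from typing import NamedTuple, Set


class Instr(NamedTuple):
    op: str
    dest: str
    src1: str
    src2: str


def parse(text: str) -> Instr:
    """Parse 'add r3 r2 r1;' as opcode, destination, source 1, source 2."""
    op, dest, src1, src2 = text.replace(";", "").split()
    return Instr(op, dest, src1, src2)


def dependencies(earlier: Instr, later: Instr) -> Set[str]:
    """Return the dependencies of `later` on `earlier` (program order)."""
    found = set()
    if earlier.dest in (later.src1, later.src2):
        found.add("RAW")  # later reads what earlier writes
    if later.dest in (earlier.src1, earlier.src2):
        found.add("WAR")  # later overwrites what earlier reads
    if later.dest == earlier.dest:
        found.add("WAW")  # both write the same register
    return found
```

Applied to the pairs on this slide, `dependencies(parse("add r3 r2 r1;"), parse("add r6 r3 r4;"))` reports the RAW case, and the k/l and m/n pairs report WAR and WAW respectively.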
Slide 11:Tomasulo's Algorithm
A hardware algorithm for dynamically issuing multiple instructions in a pipelined processor.
It provides a general mechanism for register forwarding and data hazard detection.
It successfully deals with RAW, WAW, and WAR data dependencies.
Two kinds of techniques are used:
Register renaming
Shelving
Slide 12:Register Renaming
Example:
add r3 r2 r1; # r2 + r1 -> r3;
div r6 r3 r4; # r3 / r4 -> r6; (RAW)
sub r3 r7 r1; # r7 - r1 -> r3; (WAR, WAW)
Register renaming:
r3 -> rr1
r6 -> rr2
r3 -> rr3
New instruction sequence:
add rr1 r2 r1; # r2 + r1 -> rr1;
div rr2 rr1 r4; # rr1 / r4 -> rr2; (RAW)
sub rr3 r7 r1; # r7 - r1 -> rr3;
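The renaming in this example can be reproduced with a simple map from architectural registers to their latest rename register. This is a minimal sketch under stated assumptions: an unlimited supply of rename registers named `rr1`, `rr2`, ..., no freeing of old names, and a hypothetical `rename` helper that is not part of any real tool.

```python
from typing import Dict, List, Tuple

Instr = Tuple[str, str, str, str]  # (opcode, destination, source 1, source 2)


def rename(program: List[Instr]) -> List[Instr]:
    """Give every destination a fresh rename register (assumed unlimited)."""
    latest: Dict[str, str] = {}  # architectural register -> latest rename register
    renamed = []
    for n, (op, dest, src1, src2) in enumerate(program, start=1):
        # Sources read the most recent name of each register, which keeps RAW intact.
        s1 = latest.get(src1, src1)
        s2 = latest.get(src2, src2)
        # A fresh destination name removes WAR and WAW on the old name.
        new_dest = f"rr{n}"
        latest[dest] = new_dest
        renamed.append((op, new_dest, s1, s2))
    return renamed


program = [
    ("add", "r3", "r2", "r1"),
    ("div", "r6", "r3", "r4"),
    ("sub", "r3", "r7", "r1"),
]
result = rename(program)
```

Only the true dependence from `add` to `div` survives in `result`; the `sub` no longer shares a destination with `add` or overwrites a register that `div` reads.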
Slide 13:Shelving
Reservation station: A buffer that holds decoded instructions while they wait to be issued for execution.
Independent instructions are detected and RAW true data dependencies are handled here.
Possible reservation station entry components: Op, Qj/Vj, VBj, Qk/Vk, VBk, Dest, BusyBit
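One way to picture such an entry is as a small record whose operand fields hold either a producer tag or a value. The sketch below is an assumption-laden illustration: the field names follow the component list on this slide, while the `capture` and `ready` methods and the use of rename-register names as tags are hypothetical.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class RSEntry:
    op: str
    qj_vj: Union[str, int]  # producer tag (Qj) while waiting, value (Vj) once valid
    vbj: bool               # valid bit for the first operand
    qk_vk: Union[str, int]  # producer tag (Qk) while waiting, value (Vk) once valid
    vbk: bool               # valid bit for the second operand
    dest: str
    busy: bool = True

    def capture(self, tag: str, value: int) -> None:
        """Take a completed result if an operand is still waiting on `tag`."""
        if not self.vbj and self.qj_vj == tag:
            self.qj_vj, self.vbj = value, True
        if not self.vbk and self.qk_vk == tag:
            self.qk_vk, self.vbk = value, True

    def ready(self) -> bool:
        """An entry can be issued once it is busy and both operands are valid."""
        return self.busy and self.vbj and self.vbk


# Hypothetical entry for 'div rr2 rr1 r4': the first operand waits on rr1, r4 already holds 8.
entry = RSEntry(op="div", qj_vj="rr1", vbj=False, qk_vk=8, vbk=True, dest="rr2")
before = entry.ready()
entry.capture("rr1", 40)
after = entry.ready()
```

The RAW dependence is resolved at the moment `capture` fills in the missing operand, which is what lets the entry become ready without re-reading the register file.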
Slide 14:What's the Instruction Window About
[Figure: Instruction Decode feeds the Instruction Window, which feeds four functional units (FU).]
The instruction window is responsible for:
Holding decoded instructions
Fetching operands
Waking up instructions
Selecting and issuing instructions
Slide 15:Instruction Window Design Space (1)
1. Reservation stations may vary: individual RSs, group RSs, or a central RS.
[Figure: reservation stations (RS) feeding execution units (EU) in the individual, group, and central organizations.]
Slide 16:Instruction Window Design Space (2)
2. Operand fetching scheme may vary:
Scheme 1: Direct check of the scoreboard bits
Scheme 2: Check of the explicit status bits
[Figure: two arrangements of the register file, reservation station, and EU, one for each scheme.]
Slide 17:Part Three Instruction Window Organizations
Slide 18:Central Window Design Structure
1. One centralized reservation station holds instructions of every kind after decoding.
2. It serves all the functional units.
[Figure: decoded instructions enter a single reservation station, which sends ready instructions to the EUs.]
Slide 19:Central Window Design Components
[Figure: components of the central window design. Labels: decoded instructions (Rs1, Rs2, Rd); register file (entry, value, valid bit, latest valid identifier); reservation station (OC, Os1/Is1, Vs1, Os2/Is2, Vs2, Rd); EUs (OC, Os1, Os2, Rd); result with Rd/identifier; update Rd and set V-bit; associative update of Is1 and Is2 with V-bits.]
Slide 20:Central Window Design Merits and Drawbacks
Advantages:
1. A large register file is used, so more registers can be renamed.
2. A large reservation station is used, so more independent instructions can be detected.
3. Associative search is used, so more parallelism can be exploited.
Disadvantages:
1. More ports are required.
2. Long wires are required.
3. A possibly longer clock cycle is induced.
Slide 21:Distributed Window Design Structure
1. Two or more reservation stations hold decoded instructions.
2. They serve different functional units.
[Figure: decoded instructions enter Reservation Station 1 and Reservation Station 2, which send ready instructions to the EUs.]
Slide 22:Distributed Window Design Structure
[Figure: components of the distributed window design. Labels: decoded instructions (Rs1, Rs2, Rd); register file (entry, value, valid bit, latest valid identifier); ReservationStation1 and ReservationStation2 (OC, Rs1, Rs2, Rd); EUs; result with Rd/identifier; update Rd and set V-bit.]
Slide 23:Distributed Window Design Merits and Drawbacks
Advantages:
1. Reservation stations are less complicated.
2. A possibly shorter clock cycle is achieved.
Disadvantages:
1. Steering is random or round-robin.
2. The load on the different reservation stations may be unbalanced.
3. More ports are still required to check the availability of the operands.
Slide 24:Dependence-based Window Design Structure
1. Reservation stations are distributed.
2. The decoded instructions are steered into different FIFO queues according to their dependencies.
[Figure: Rename and Steering feed the dependence-based FIFOs, which feed the EUs; results update the register file.]
Slide 25:Dependence-based Window Design Steering Algorithm
For a decoded instruction I:
1. If all the operands are ready, I is steered to a new (empty) FIFO.
2. If one operand is not ready and no instruction sits behind the instruction that produces it in a FIFO, I is placed in that FIFO; otherwise I is placed in a new FIFO.
3. If more than one operand is not ready, apply rule 2 to the first operand; if that is not suitable, apply it to the second operand.
4. If all the FIFOs are full or no empty FIFO is available, stall.
After the last instruction in a FIFO is issued, the FIFO is freed.
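The steering rules above can be sketched as a small model. Several things here are assumptions made for illustration: the `Steering` class and its method names are hypothetical, instructions are identified by their (already renamed, hence unique) destination register, and when neither pending operand offers a suitable FIFO the instruction falls back to an empty FIFO.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Sequence


class Steering:
    def __init__(self, num_fifos: int, depth: int) -> None:
        self.fifos: List[Deque[str]] = [deque() for _ in range(num_fifos)]
        self.depth = depth
        # Destination register of each queued instruction -> index of its FIFO.
        self.producer_fifo: Dict[str, int] = {}

    def steer(self, dest: str, pending_srcs: Sequence[str]) -> Optional[int]:
        """Place an instruction and return its FIFO index, or None on a stall.

        `pending_srcs` lists the source registers that are not ready, in operand order.
        """
        target = None
        for src in pending_srcs:  # rules 2 and 3: try each pending operand in turn
            idx = self.producer_fifo.get(src)
            if idx is None:
                continue  # the producer has already left its FIFO
            fifo = self.fifos[idx]
            # Suitable only if the producer is the tail and the FIFO has room.
            if fifo and fifo[-1] == src and len(fifo) < self.depth:
                target = idx
                break
        if target is None:  # rule 1, and the assumed fallback: take an empty FIFO
            target = next((i for i, f in enumerate(self.fifos) if not f), None)
        if target is None:
            return None  # rule 4: stall
        self.fifos[target].append(dest)
        self.producer_fifo[dest] = target
        return target

    def issue(self, idx: int) -> str:
        """Issue the head of a FIFO; an emptied FIFO is free for reuse."""
        dest = self.fifos[idx].popleft()
        self.producer_fifo.pop(dest, None)
        return dest
```

Because each FIFO holds a dependence chain, only the FIFO heads need to be examined for issue; the model does not simulate operand readiness or execution, only placement.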
Slide 26:Dependence-based Window Design Merits and Drawbacks
Advantages:
1. Issuing windows are distributed.
2. Only the heads of the FIFOs are checked, so broadcast for wakeup is avoided.
Disadvantage:
An independent instruction always requires an additional FIFO to steer into; if no FIFO is available, it stalls, which hurts overall performance.
Slide 27:Cluster-based Window Design Structure
1. It is based on the dependence-based window design.
2. The FIFOs are clustered, with each cluster using its own copy of the register file.
[Figure: Rename and Steering feed two clusters; Cluster1 and Cluster2 each contain dependence-based FIFOs, a register file copy (Register File1, Register File2), and EUs.]
Slide 28:Cluster-based Window Design Merits and Drawbacks (1)
Advantages:
1. Issuing windows are distributed.
2. Only the heads of the FIFOs are checked, so broadcast for wakeup is avoided.
3. The number of ports on each register file can be reduced, and the register file copies are updated in parallel.
4. Local bypasses are used much more frequently than inter-cluster bypasses.
Slide 29:Cluster-based Window Design Merits and Drawbacks (2)
Disadvantages:
1. An independent instruction always requires an additional FIFO to steer into; if no FIFO is available, it stalls, which hurts overall performance.
2. Inter-cluster bypasses decrease the overall performance.
Slide 30:Parallel Execution Windows (PEWs) Structure
The design splits the instruction window into separate execution windows (pews), each with its own reservation station and register file. The pews communicate with each other to obtain the register data they need.
[Figure: a Distributor and four execution windows, pew0, pew1, pew2, and pew3.]
Slide 31:PEWs Merits and Drawbacks
Advantages:
1. Issuing windows are distributed.
2. Local operand fetching and update are efficient.
Disadvantage:
Extra clock cycles of delay are needed to pass results to remote pews.
Slide 32:Direct Wakeup Window Design Structure
[Figure: direct wakeup window structure. Labels: Rename, Steering; Reorder Buffer; instruction I with Wait_rslt, Wait_lop, and wait_rop; Not ready / Ready; Wakeup_input_queue; wait_queues; Cnt=0 / Cnt<>0; ready_queues; EUs; wakeup of wait_lop and wait_rop; wakeup of wait_rslt.]
Slide 33:Direct Wakeup Window Design Merits and Drawbacks
Advantages:
1. Broadcast is avoided. Only the dependent instructions are woken up.
2. Stalls happen only after the resources are fully occupied, so resource utilization is high.
Disadvantage:
An extra pipeline stage is introduced to accommodate the complicated wakeup process, which increases the misprediction rollback penalty.
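The idea of waking only dependent instructions can be illustrated with a toy model. This is a sketch under assumptions, not the design studied here: the `DirectWakeup` class is hypothetical, each waiting instruction keeps a count of outstanding operands, and each producer keeps an explicit list of its consumers instead of broadcasting a tag to every window entry.

```python
from collections import deque
from typing import Deque, Dict, List, Sequence


class DirectWakeup:
    def __init__(self) -> None:
        self.cnt: Dict[str, int] = {}              # outstanding operands per instruction
        self.consumers: Dict[str, List[str]] = {}  # producer -> instructions waiting on it
        self.ready: Deque[str] = deque()           # instructions whose count reached zero

    def dispatch(self, name: str, producers: Sequence[str]) -> None:
        """`producers` are in-flight instructions whose results `name` still needs."""
        self.consumers[name] = []
        self.cnt[name] = len(producers)
        for p in producers:
            self.consumers[p].append(name)
        if self.cnt[name] == 0:
            self.ready.append(name)

    def complete(self, name: str) -> List[str]:
        """Wake only the direct dependents of `name`; return those that became ready."""
        woken = []
        for c in self.consumers.pop(name):
            self.cnt[c] -= 1
            if self.cnt[c] == 0:
                self.ready.append(c)
                woken.append(c)
        return woken
```

The cost of a completion is proportional to the number of consumers of that one result, not to the window size, which is the property the first advantage refers to.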
Slide 34:Part Four Work Plan and Preliminary Simulations
Slide 35:Implementation Plan
1. Study the implemented designs:
Central window design;
Dependence-based design;
Direct wakeup based design.
2. Finish and verify the following designs:
Distributed window design;
Cluster-based window design;
PEWs-based window design.
Slide 36:Test Plan
1. Test using integer benchmarks and floating-point benchmarks.
2. Test using different architecture set-ups:
Vary the issue width;
Vary the window size;
Vary the register file size;
Vary the number of functional units.
3. Write the report.
Slide 37:Preliminary Results (1)
[Figure: chart comparing the Central Window, Distributed Window, Dependence-based, Cluster-based, and Direct wakeup designs on the integer benchmarks 126.gcc, 129.compress, 130.li, 099.go, and 134.perl.]
Slide 38:Preliminary Results (2)
[Figure: chart comparing the Central window, Distributed window, Dependence-based, Cluster-based, and Direct wakeup designs on the floating-point benchmarks 101.tomcatv, 102.swim, 103.su2cor, 104.hydro2d, and 107.mgrid.]