180 likes | 406 Views
Super-Scalar MIPS Processor. With predicted branches/jumps and write back cache. Outline. Enhanced Super Scalar datapath 256-bit Memory bus and write back cache Branch/Jump Prediction Performance Testing. Super-Scalar Datapath. Duplicated original 5-stage pipeline
E N D
Super-Scalar MIPS Processor With predicted branches/jumps and write back cache CS152 Fall 2003; N. Sadhal, I. Koulchine, V. Venkatesh, A. Lee
Outline • Enhanced Super Scalar datapath • 256-bit Memory bus and write back cache • Branch/Jump Prediction • Performance • Testing CS152 Fall 2003; N. Sadhal, I. Koulchine, V. Venkatesh, A. Lee
Super-Scalar Datapath • Duplicated original 5-stage pipeline • Dual ported instruction cache • Dual ported register file • 4 read ports and 2 write ports • 4 times the forwarding logic • 4 times the stall logic CS152 Fall 2003; N. Sadhal, I. Koulchine, V. Venkatesh, A. Lee
Enhanced Super-Scalar • Allow memory operations in both pipelines • Requires dual ported write buffer and data cache • Also must protect against special memory hazards CS152 Fall 2003; N. Sadhal, I. Koulchine, V. Venkatesh, A. Lee
Upgrades to memory system • Allow two writes to write buffer and two reads from data cache • Rest of hierarchy stays same • Data cache uses dual ported SRAMs • Write buffer made of registers • Easily extensible CS152 Fall 2003; N. Sadhal, I. Koulchine, V. Venkatesh, A. Lee
New Memory Hazards • Lw/sw in parallel to same address • Stall sw via write buffer until lw completes • Cache miss is possible • Sw/lw in parallel to same address • No problem because sw goes to WB, lw checks there first • Sw/sw in parallel to same address • Sw in pipe A is ignored unless… CS152 Fall 2003; N. Sadhal, I. Koulchine, V. Venkatesh, A. Lee
Queuing writes in MMIO • If sw/sw in parallel to same hex-LED • Accept both into queue in MMIO module • Pipe A’s sw ahead of pipe B’s • Queue is emptied at rate of 1 store/cycle • Keep track of which is most recent write • In case we read from LEDs • Each LED has own queue CS152 Fall 2003; N. Sadhal, I. Koulchine, V. Venkatesh, A. Lee
Write back cache • Product of complete rewrite • Expanded all memory buses to 8 words • Allows for burst reads and writes • No need for asynchronous FIFO • Burst writes works well w/ write back • Write buffer no longer talks to arbiter, but only to cache CS152 Fall 2003; N. Sadhal, I. Koulchine, V. Venkatesh, A. Lee
Diagram of Write back cache writeKick readKick writeMiss readMiss Idle writeReady readDone writeHit readRetry writeKickBack CS152 Fall 2003; N. Sadhal, I. Koulchine, V. Venkatesh, A. Lee
Branch/Jump prediction • Local prediction of target based on instruction’s address • Same hardware for branches and jumps • Cache all jumps (not just jr) • Predictor replaces signal coming from ID in old 5-stage pipeline (single pipe) CS152 Fall 2003; N. Sadhal, I. Koulchine, V. Venkatesh, A. Lee
Branch Target Table • 256 entries, 58 bits wide: • Valid bit [57] • Tag [56:35] • Target [34:3] • Branch Confidence [2:1] • Jump Confidence [0] CS152 Fall 2003; N. Sadhal, I. Koulchine, V. Venkatesh, A. Lee
Using Branch Target Table • Combination of Translation buffer and history table in one SRAM • Table is always written following prediction • Data determined by outputs of branch hardware in ID CS152 Fall 2003; N. Sadhal, I. Koulchine, V. Venkatesh, A. Lee
Overall Prediction Logic • Look up PC in parallel with IF • If found then predict based on stored target and confidence bits • Else return PC+8 • On misprediction • Determined by comparing values from ID with values last predicted • Update table with new confidence bits/tag CS152 Fall 2003; N. Sadhal, I. Koulchine, V. Venkatesh, A. Lee
Prediction Logic/State • Standard 2 bit branch prediction • 2 states predict taken, 2 not taken • 1 bit jump prediction • Always taken, so bit is only confidence • Used only for updating table CS152 Fall 2003; N. Sadhal, I. Koulchine, V. Venkatesh, A. Lee
Branch in pipe B? • Branch hardware (prediction and actual) only works for pipe A • If branch decoded in pipe B, IF is flushed and branch refetched in pipe A • Handled by hazard/issue unit • Cheaper hardware, but takes more cycles compared to moving instructions via MUX CS152 Fall 2003; N. Sadhal, I. Koulchine, V. Venkatesh, A. Lee
Branch/Jump performance • Pipe A, Predicted correctly: no penalty • Pipe A, mispredicted: 2 NOPs • Pipe B, Predicted correctly: 3 NOPs • Pipe B, mispredicted: 5 NOPs • Expensive, but result is less MUXes in critical path CS152 Fall 2003; N. Sadhal, I. Koulchine, V. Venkatesh, A. Lee
Overall Performance • Current Status • Passed test0/Infinite Loop, test1/basic.v • Passed test3 with cycle count of X190 • Hardware Statistics • Number of Block RAMS: 67% • Number of Slices: 51% • Timing Statistics • Critical Path ~125ns • Clock Frequency 8MHz CS152 Fall 2003; N. Sadhal, I. Koulchine, V. Venkatesh, A. Lee
Testing • Integrated Testing • Ability to run instructions from boot ROM • Gradually expand set of instructions as new features are added/supported • Eventually use proper boot ROM to load instructions To DRAM • Assembly test code • Basic programs to test functionality • Simple branch, arithmetic, and memory operations • Complex programs to test overall operation • Fibonacci, Radix Sort • Used to dig up more corner cases CS152 Fall 2003; N. Sadhal, I. Koulchine, V. Venkatesh, A. Lee