Understanding the TigerSHARC ALU pipeline

Determining the speed of one stage of IIR filter – Part 2: Understanding the pipeline

Understanding the TigerSHARC ALU pipeline
• TigerSHARC has many pipelines
• If these pipelines stall, then the processor speed goes down
• Need to understand how the ALU pipeline works
• Learn to use the pipeline viewer
• Understand what the pipeline viewer tells you in detail
• Avoid having to use the pipeline viewer
• Improve code efficiency
• Excel and Project (Gantt charts) are useful tools
Speed IIR -- stage 1, M. Smith, ECE, University of Calgary, Canada

Register File and COMPUTE Units

Simple Example: IIR -- Biquad
• For (Stages = 0 to 3) Do
• S0 = Xin * H5 + S2 * H3 + S1 * H4
• Yout = S0 * H0 + S1 * H1 + S2 * H2
• S2 = S1
• S1 = S0
Horrible IIR code example, as it can’t be re-used in a loop. Works as a simple example for understanding the TigerSHARC pipeline.
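The update equations on this slide can be sketched in Python as follows. This is a minimal sketch of a single stage, not the course's assembly code: the function name `biquad_stage`, the coefficient layout `h = [H0 .. H5]`, and the sample coefficient values are assumptions made for illustration only.

```python
def biquad_stage(x_in, h, state):
    """One pass of the slide's biquad update.

    h     -- coefficients [H0, H1, H2, H3, H4, H5] (layout assumed)
    state -- (S1, S2), the two delayed values
    Returns (y_out, new_state).
    """
    s1, s2 = state
    s0 = x_in * h[5] + s2 * h[3] + s1 * h[4]
    y_out = s0 * h[0] + s1 * h[1] + s2 * h[2]
    # S2 = S1, then S1 = S0: shift the delay line by one sample
    return y_out, (s0, s1)


# Hypothetical coefficients, chosen only to make the arithmetic easy to follow.
h = [1.0, 0.0, 0.0, 0.0, 0.5, 1.0]
state = (0.0, 0.0)
outputs = []
for x in [1.0, 0.0, 0.0]:
    y, state = biquad_stage(x, h, state)
    outputs.append(y)
print(outputs)  # [1.0, 0.5, 0.25]
```

Each call performs six multiplications and four additions; how those operations are ordered is what decides how often the ALU pipeline stalls.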

Code returns a float when using the XR8 register – NOTE: NOT XFR8

Step 2 – Using C++ code as comments, set up the coefficients
• XFR0 = 0.0;; Does not exist
• XR0 = 0.0;; DOES EXIST
• Bit-patterns require integer registers
• Leave what you wanted to do behind as comments

Expect to take 8 cycles to execute

PIPELINE STAGES
See page 8-34 of Processor manual
• 10 pipeline stages, but may be completely desynchronized (happen semi-independently)
• Instruction fetch -- F1, F2, F3 and F4
• Integer ALU – PreDecode, Decode, Integer, Access
• Compute Block – EX1 and EX2

Pipeline Viewer Result
• XR0 = 1.0 enters the PD stage at cycle 39825, enters the E2 stage at cycle 39830, and is stored into XR0 at cycle 39831 -- 7 cycles execution time

Pipeline Viewer Result
• XR6 = 5.5 enters the PD stage at cycle 39832, enters the E2 stage at cycle 39837, and is stored into XR6 at cycle 39838 -- 7 cycles execution time
• Each instruction takes 7 cycles, but one new result is produced each cycle
• Result – ONCE the pipeline is filled, 8 cycles = 8 register transfer operations
• Key – don’t break the pipeline with any jumps
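The difference between latency (7 cycles per instruction) and throughput (one result per cycle) can be checked with a small calculation. This is a sketch under stated assumptions: every instruction has the 7-cycle latency quoted on the slide, one instruction enters PD per cycle, and nothing stalls. The helper name `completion_cycle` is hypothetical.

```python
LATENCY = 7  # cycles from entering PD to the result being stored (from the slide)


def completion_cycle(first_pd_cycle, index, latency=LATENCY):
    """Cycle in which instruction number `index` (0-based) stores its result,
    assuming one instruction enters PD per cycle and no stalls occur."""
    pd_cycle = first_pd_cycle + index
    return pd_cycle + latency - 1


# Eight back-to-back register loads, the first entering PD at cycle 0.
done = [completion_cycle(0, i) for i in range(8)]
print(done)                    # [6, 7, 8, 9, 10, 11, 12, 13]
print(done[-1] - done[0] + 1)  # 8 cycles cover all 8 results
```

A jump would empty the pipeline, so the 7-cycle fill cost would be paid again before results resume at one per cycle.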

Doing filter operations – generates different results
• XR8 = XR6 enters PD at 39833, enters EX2 at 39838, stored 39839 – 7 cycles
• XFR23 = R9 * R4 enters PD at 39834, enters EX2 at 39839, stored 39840 – 7 cycles
• XFR0 = R0 + R23 enters PD at 39835, enters EX2 at 39841, stored 39842 – 8 cycles
WHY? – FIND OUT WITH MOUSE CLICK ON S MARKER THEN CONTROL

Instruction 0x17e XFR8 = R8 + R23 is STALLED (waiting) for instruction 0x17d XFR23 = R8 * R4 to complete
Bubble B means that the pipeline is doing “nothing”, meaning that the instruction shown is a “place holder” (garbage)

Information on Window Event Icons

Result of Analysis
• Can’t use a float result immediately after its calculation
• Writing
    XFR23 = R8 * R4;;
    XFR8 = R8 + R23;;   // MUST WAIT FOR XFR23 calculation to be completed
  is the same as coding
    XFR23 = R8 * R4;;
    NOP;;               // Note DOUBLE ;; -- extra cycle because of stall
    XFR8 = R8 + R23;;
• Proof – write the code with the stalls shown in it
• Writing this way means we don’t have to use the pipeline viewer all the time
• Pipeline viewer is only available with the (slow) simulator
• #define SHOW_ALU_STALL nop
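The stall rule above can be turned into a small cycle counter. This is a simplified model, not the real TigerSHARC pipeline: it assumes one instruction line issues per cycle and that a compute result can first be read two cycles after the producing instruction issues, which reproduces the single stall seen in the pipeline viewer. The function name `count_cycles` and the `(dest, sources)` tuple format are hypothetical.

```python
def count_cycles(program, result_delay=2):
    """Issue `program` in order, one instruction line per cycle, stalling
    whenever a source register is not yet available.

    program      -- list of (dest, sources) tuples
    result_delay -- cycles after issue before dest can be read (assumed 2,
                    matching the one-cycle stall seen on the slide)
    Returns (total_issue_cycles, stalls).
    """
    ready = {}   # register -> first cycle in which it may be read
    cycle = 0
    stalls = 0
    for dest, sources in program:
        earliest = max([ready.get(r, 0) for r in sources] + [cycle])
        stalls += earliest - cycle      # cycles spent waiting for operands
        cycle = earliest
        ready[dest] = cycle + result_delay
        cycle += 1
    return cycle, stalls


# XFR23 = R8 * R4;;   XFR8 = R8 + R23;;
program = [("R23", ["R8", "R4"]),
           ("R8", ["R8", "R23"])]
print(count_cycles(program))  # (3, 1): two instructions plus one stall
```

Under this model the two-line sequence costs three cycles, the same as the version with the explicit NOP written in.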

Code with stalls shown
• 8 code lines
• 5 expected stalls
• Expect 13 cycles to complete if theory is correct

Analysis approach IS correct
Same speed with and without nops

Process for coding for improved speed – code re-organization
• Make a copy of the code so that you can test both iirASM( ) and iirASM_Optimized( ) to make sure you get the correct result
• Make a table of the code showing ALU resource usage (paper, EXCEL, Project (Gantt chart))
• Identify data dependencies
• Make all “temp operations” use different registers
• Move instructions “forward” to fill delay slots, BUT don’t break data dependencies
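The "identify data dependencies" step can be sketched as a pairwise check over the instruction list. This is an illustrative sketch, not a tool from the course: the function name `find_dependencies`, the `(dest, sources)` format, and the four-instruction sequence are assumptions. The check is deliberately conservative and reports every conflicting pair, even when a later write makes an older one irrelevant.

```python
def find_dependencies(program):
    """Return a list of (kind, earlier_index, later_index, register).

    RAW -- later instruction reads what the earlier one wrote (true dependency)
    WAR -- later instruction overwrites a register the earlier one reads
    WAW -- both instructions write the same register
    """
    deps = []
    for j, (dest_j, srcs_j) in enumerate(program):
        for i in range(j):
            dest_i, srcs_i = program[i]
            if dest_i in srcs_j:
                deps.append(("RAW", i, j, dest_i))
            if dest_j in srcs_i:
                deps.append(("WAR", i, j, dest_j))
            if dest_j == dest_i:
                deps.append(("WAW", i, j, dest_j))
    return deps


# Hypothetical sequence that re-uses R23 as its only temporary register.
program = [("R23", ["R8", "R4"]),    # 0: R23 = R8 * R4
           ("R0", ["R0", "R23"]),    # 1: R0  = R0 + R23
           ("R23", ["R9", "R5"]),    # 2: R23 = R9 * R5
           ("R0", ["R0", "R23"])]    # 3: R0  = R0 + R23
deps = find_dependencies(program)
for dep in deps:
    print(dep)
```

Only the RAW entries are forced by the arithmetic. The WAR and WAW entries on R23 exist purely because the temporary register is shared.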

Copy and paste to make IIRASM_Optimized( )

Need to re-order instructions to fill delay slots with useful instructions
• After refactoring code to fill delay slots, must run tests to ensure that the code still gives the correct result
• Change – and “retest”
• NOT EASY TO DO
• MUST HAVE A SYSTEMATIC PLAN TO HANDLE OPTIMIZATION
• I USE EXCEL

Show resource usage and data dependencies
• All temporary register usage involves the SAME XFR23 register
• This typically stalls out the processor

Change all temporary registers to use different register names
Then check that the code produces the correct answer
• All temporary register usage involves a DIFFERENT register
• ALWAYS FOLLOW THIS PROCESS WHEN OPTIMIZING
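The rename-then-retest step can be sketched as follows. This is an illustration under stated assumptions, not the course's code: the helper names `run` and `rename_temp`, the `(dest, op, src_a, src_b)` tuple format, the replacement registers R24 and R25, and the starting register values are all hypothetical.

```python
import operator

OPS = {"*": operator.mul, "+": operator.add}


def run(program, regs):
    """Execute (dest, op, src_a, src_b) tuples in order on a copy of regs."""
    regs = dict(regs)
    for dest, op, a, b in program:
        regs[dest] = OPS[op](regs[a], regs[b])
    return regs


def rename_temp(program, temp, fresh_names):
    """Give every write to `temp` its own register from `fresh_names`,
    and redirect later reads to the most recent renamed copy."""
    fresh = iter(fresh_names)
    current = temp
    renamed = []
    for dest, op, a, b in program:
        a = current if a == temp else a
        b = current if b == temp else b
        if dest == temp:
            current = next(fresh)
            dest = current
        renamed.append((dest, op, a, b))
    return renamed


original = [("R23", "*", "R8", "R4"),
            ("R0", "+", "R0", "R23"),
            ("R23", "*", "R9", "R5"),
            ("R0", "+", "R0", "R23")]
renamed = rename_temp(original, "R23", ["R24", "R25"])

# Re-test after the change: both versions must give the same answer.
start = {"R0": 0.0, "R4": 2.0, "R5": 3.0, "R8": 1.5, "R9": 0.5}
assert run(original, start)["R0"] == run(renamed, start)["R0"]
print(renamed)
```

After renaming, the second multiplication no longer writes a register that the first addition still needs, so it becomes free to move earlier in the sequence.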

Move instructions forward, without breaking data dependencies
• What appears possible!
• DO one thing at a time and then check that the code still works
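Moving instructions forward into stall slots can be sketched as a greedy list scheduler. This is a simplified model rather than the real hardware: it assumes one instruction per cycle (it does not model issuing instructions in parallel on one line), a result delay of two cycles, and an illustrative four-instruction sequence. The function name `schedule` is hypothetical.

```python
def schedule(program, result_delay=2):
    """Greedy list scheduler: each cycle, issue the earliest instruction in
    program order whose dependencies are satisfied; otherwise stall.

    program -- list of (dest, sources).
    Returns the issue order as instruction indices, None marking a stall.
    """
    n = len(program)
    preds = {j: [] for j in range(n)}
    for j, (dest_j, srcs_j) in enumerate(program):
        for i in range(j):
            dest_i, srcs_i = program[i]
            if dest_i in srcs_j:
                preds[j].append((i, result_delay))  # true dependency: wait for result
            elif dest_j in srcs_i or dest_j == dest_i:
                preds[j].append((i, 1))             # ordering only: issue after i
    issued = {}
    timeline = []
    cycle = 0
    while len(issued) < n:
        for j in range(n):
            if j in issued:
                continue
            if all(i in issued and issued[i] + d <= cycle for i, d in preds[j]):
                issued[j] = cycle
                timeline.append(j)
                break
        else:
            timeline.append(None)  # nothing ready: stall cycle
        cycle += 1
    return timeline


shared = [("R23", ["R8", "R4"]), ("R0", ["R0", "R23"]),
          ("R23", ["R9", "R5"]), ("R0", ["R0", "R23"])]
renamed = [("R24", ["R8", "R4"]), ("R0", ["R0", "R24"]),
           ("R25", ["R9", "R5"]), ("R0", ["R0", "R25"])]
print(schedule(shared))   # [0, None, 1, 2, None, 3]
print(schedule(renamed))  # [0, 2, 1, None, 3]
```

With the shared temporary nothing can move and two stalls remain. With distinct temporaries the second multiplication fills the first stall slot, saving one cycle, while the stall before the final addition remains because it depends on the preceding addition.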

Check that code still operates: 1 cycle saved
• Have put “our” marker stall instruction in parallel with the moved instruction, using ; rather than ;;
• Move this instruction up in the code sequence to fill the delay slot
• Check that the code still runs after this optimization stage

Move next multiplication up. NOTE: certain stalls remain, although the reason for each STALL changes from why it was inserted before

Move up the R10 and R9 assignment operations -- check 4 cycle improvement?

CHECK THE PIPELINE AFTER TESTING

Are there still more improvements possible? (I can see 4 more moves)

Problems with approach
• Identifying all the data dependencies
• Keeping track of how the data dependencies change as you move the code around
• Handling all of this “automatically”
• I started the following design tool as something that might work, but it actually turned out very useful.
M. R. Smith and J. Miller, "Microprocessor Scheduling -- the irony of using Microsoft Project", "Don’t say “CAN’T do it” - Say “Gantt it”! The irony of organizing microprocessors with a big business tool", Circuit Cellar magazine, Vol. 184, pp 26 - 35, November 2005.

Using Microsoft Project – Step 1

Add dependencies and resource usage – then activate level

Microsoft Project as a microprocessor design tool
• Will look at this in more detail when we start using memory operations to fill the coefficient and state arrays

