Detailed look at the TigerSHARC pipeline
Cycle counting for the IALU version of the DC_Removal algorithm
To be tackled today
• Expected and actual cycle count for J-IALU version of DC_Removal algorithm
• Understanding why the stalls occur and how to fix them
• Differences between first time into a function (cache empty) and second time into the function
DC_Removal algorithm performance
Set up time
In principle 1 cycle / instruction
2 + 4 instructions
First key element – Sum Loop – Order (N)
4 instructions + N * 5 instructions
Second key element – Shift Loop – Order (log2N)
1 + 2 * log2N instructions
Third key element – FIFO circular buffer – Order (N)
• Update outgoing parameters: 6 instructions
• Update FIFO: 3 + 6 * N instructions
• Function return: 2 instructions
TigerSHARC pipeline
Using the “Pipeline Viewer”
• Available with the TigerSHARC simulator ONLY
• VIEW | Debug Windows | Pipeline viewer
• F1 to F4 – instruction fetch unit pipeline
• PD, D, I – Integer ALU pipeline
• A, EX1, EX2 – Compute Block pipeline
Pipeline symbols (Control-click)
• A – Abort
• B – Bubble
• H – BTB Hit (Jumps)
• S – Stall
• W – Wait
• X – Illegal fetch (F1 – F4)
• X – Illegal instruction (PD – E2)
Time in theory
Set up pointers to buffers      2
Insert values into buffers      4
SUM LOOP                        4 + N * 5
SHIFT LOOP                      1 + 2 * log2N
Update outgoing parameters      6
Update FIFO                     3 + 6 * N
Function return                 2
---------------------------
Total                           22 + 11 N + 2 log2N
N = 128: instructions = 1444
• 1444 cycles + 1100 delay cycles
• C++ debug mode – 9500 cycles???????
Note: other tests were executed before this test, which means “cache filled”.
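The tally above can be checked with a few lines of arithmetic. This is a minimal sketch in Python, not TigerSHARC code; the function name `expected_instructions` is hypothetical, and N is assumed to be a power of two so that log2N is an integer.

```python
def expected_instructions(n):
    """Add up the per-section instruction counts from the slide for buffer length n."""
    log2n = n.bit_length() - 1  # exact log2 when n is a power of two (assumption)
    sections = {
        "set up pointers to buffers": 2,
        "insert values into buffers": 4,
        "sum loop": 4 + n * 5,
        "shift loop": 1 + 2 * log2n,
        "update outgoing parameters": 6,
        "update FIFO": 3 + 6 * n,
        "function return": 2,
    }
    return sum(sections.values())


if __name__ == "__main__":
    # Matches the closed form 22 + 11 N + 2 log2N for N = 128.
    print(expected_instructions(128))  # 1444
```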
Test environment
Examine the pipeline the 2nd time around the loop. “Caches filled”?
Set up time
• Expected: 2 + 4 instructions
• Actual: 2 + 4 instructions + 2 stalls
• Why not 4 stalls?
First time round sum loop
• Expected: 9 instructions
• LC0 load – 3 stalls
• Each memory fetch – 4 stalls
• Actual: 9 + 11 stalls
Other times around the loop
• Expected: 5 instructions
• Each memory fetch – 4 stalls
• Actual: 5 + 8 stalls
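The sum-loop figures on the last two slides can be turned into a rough cycle estimate. The sketch below is Python arithmetic only, with hypothetical function names. It assumes two memory fetches per pass (inferred from 11 = 3 + 2 * 4 and 8 = 2 * 4) and that every pass after the first behaves identically, which holds only for the "cache filled" case.

```python
LC0_LOAD_STALLS = 3      # from the slide: LC0 load costs 3 stalls
FETCH_STALLS = 4         # from the slide: each memory fetch costs 4 stalls
FETCHES_PER_PASS = 2     # assumption inferred from the 11 and 8 stall totals


def sum_loop_stalls(n):
    """Stall cycles for n passes of the sum loop with the cache filled."""
    first_pass = LC0_LOAD_STALLS + FETCHES_PER_PASS * FETCH_STALLS  # 11
    later_pass = FETCHES_PER_PASS * FETCH_STALLS                    # 8
    return first_pass + (n - 1) * later_pass


def sum_loop_cycles(n):
    """Instructions plus stalls: 9 on the first pass, 5 on each later pass."""
    instructions = 9 + (n - 1) * 5
    return instructions + sum_loop_stalls(n)


if __name__ == "__main__":
    print(sum_loop_stalls(128))   # 1027
    print(sum_loop_cycles(128))   # 1671
```

Under these assumptions the sum loop alone accounts for most of the roughly 1100 delay cycles reported for N = 128.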
Shift Loop – 1st time around
• Expected: 3 instructions
• No stalls on LC0 load?
• 4 stalls on ASHIFTR
• BTB hit followed by 5 aborts
Shift loop – 2nd and later times around
• Expect 2
• Get 2
Store back of &left, &right
• Expect 6
• Actual 6 + 3 stalls
Exercise 1
• Based on what we know to this point, determine the expected stalls during the last piece of code – FIFO buffer operation
Third key element – FIFO circular buffer – Order (N)
• Update outgoing parameters: 6 instructions
• Update FIFO: 3 + 6 * N instructions
• Function return: 2 instructions
Answer
Second time into function
What happens if the cache is not full (first time the function is called)?
• Was 2 + 2 stalls in loop
• Now 11 + 12 stalls in loop
First time function called – 2nd time around the loop
Ditto for the 3rd, 4th, 5th, 6th, 7th and 8th times
9th time around the loop
Ditto for the 17th, 25th, 33rd, 41st and 49th times
What is happening?
• With cache filled – memory read accesses require 4 cycles
• Unfilled – first one requires “12 cycles”
• Then next 7 require 4 cycles
• Total guess – is the extra time associated with doing extra reads to fill the cache?
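The observed pattern (one slow read followed by seven fast ones) can be written down as a toy model. This Python sketch is only an illustration of the guess on the slide, not a description of how the TigerSHARC memory system really works; the group size of 8 and the names `read_cycles` and `slow_passes` are assumptions.

```python
FILLED_READ_CYCLES = 4    # from the slide: read with cache filled
UNFILLED_READ_CYCLES = 12 # from the slide: first read when cache not filled
GROUP_SIZE = 8            # assumption: one slow read, then 7 fast reads


def read_cycles(pass_number, cache_filled=False):
    """Cycles for the memory read on a given loop pass (1-based)."""
    if cache_filled:
        return FILLED_READ_CYCLES
    # The first pass of every group of 8 pays the longer access time.
    if (pass_number - 1) % GROUP_SIZE == 0:
        return UNFILLED_READ_CYCLES
    return FILLED_READ_CYCLES


def slow_passes(last_pass):
    """Loop passes up to last_pass that see the 12-cycle read."""
    return [p for p in range(1, last_pass + 1) if read_cycles(p) == UNFILLED_READ_CYCLES]


if __name__ == "__main__":
    print(slow_passes(49))  # [1, 9, 17, 25, 33, 41, 49]
```

The model reproduces the slow passes seen in the pipeline viewer (1st, 9th, 17th, 25th and so on), which is consistent with, but does not prove, the cache-fill explanation.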
Tackled today
• Expected and actual cycle count for J-IALU version of DC_Removal algorithm
• Understanding why the stalls occur and how to fix them
• Differences between first time into a function (cache empty) and second time into the function
• Further unknowns – how memory operations really work