A detailed look at TigerSHARC pipeline cycle counting for the IALU version of the DC_Removal algorithm: expected and actual cycle counts, why stalls occur and how to fix them, and how to use the Pipeline Viewer tool. Also covered: set-up time, the key elements of the algorithm, and how cache state and memory operations affect the cycle count.
Detailed look at the TigerSHARC pipeline: cycle counting for the IALU version of the DC_Removal algorithm
To be tackled today
• Expected and actual cycle count for the J-IALU version of the DC_Removal algorithm
• Understanding why the stalls occur and how to fix them
• Differences between the first time into a function (cache empty) and the second time into the function
DC_Removal algorithm performance
Set up time
In principle, 1 cycle per instruction
2 + 4 instructions (set up pointers to buffers, then insert values into buffers)
First key element – Sum Loop – Order (N): 4 instructions + N * 5 instructions
Second key element – Shift Loop – Order (log2N): 1 + 2 * log2N instructions
Third key element – FIFO circular buffer – Order (N)
Update outgoing parameters: 6 instructions
Update FIFO: 3 + 6 * N instructions
Function return: 2 instructions
TigerSHARC pipeline
Using the “Pipeline Viewer”
• Available with the TigerSHARC simulator ONLY
• VIEW | Debug Windows | Pipeline viewer
• F1 to F4 – instruction fetch unit pipeline
• PD, D, I – Integer ALU pipeline
• A, EX1, EX2 – Compute Block pipeline
Pipeline symbols (Control-click)
A – Abort
B – Bubble
H – BTB Hit (Jumps)
S – Stall
W – Wait
X – Illegal fetch (F1 – F4)
X – Illegal instruction (PD – E2)
Time in theory

Set up pointers to buffers      2
Insert values into buffers      4
SUM LOOP                        4 + N * 5
SHIFT LOOP                      1 + 2 * log2N
Update outgoing parameters      6
Update FIFO                     3 + 6 * N
Function return                 2
---------------------------
Total                           22 + 11 N + 2 log2N

N = 128 – instructions = 1444
Measured: 1444 cycles + 1100 delay cycles
C++ debug mode – 9500 cycles???????
Note: other tests were executed before this test, which means “cache filled”.
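As a check on the arithmetic in the "Time in theory" slide, the per-section instruction counts can be summed in a few lines. This is a minimal sketch in Python (not TigerSHARC code); the function name `theoretical_instructions` and the dictionary keys are made up for illustration, and it assumes N is a power of two so that log2N is an integer.

```python
import math


def theoretical_instructions(n):
    """Sum the per-section instruction counts from the 'Time in theory' slide.

    Assumes n is a power of two, so log2(n) is an exact integer.
    """
    log2n = int(math.log2(n))
    if 2 ** log2n != n:
        raise ValueError("n must be a power of two")

    sections = {
        "set up pointers to buffers": 2,
        "insert values into buffers": 4,
        "sum loop": 4 + n * 5,
        "shift loop": 1 + 2 * log2n,
        "update outgoing parameters": 6,
        "update FIFO": 3 + 6 * n,
        "function return": 2,
    }
    return sum(sections.values())


print(theoretical_instructions(128))  # 1444, i.e. 22 + 11*128 + 2*7
```

At one cycle per instruction this gives the 1444-cycle theoretical figure; the stall cycles examined on the following slides come on top of it.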
Test environment
Examine the pipeline the 2nd time around the loop. “Caches filled”?
Set up time
Expected: 2 + 4 instructions
Actual: 2 + 4 instructions + 2 stalls
Why not 4 stalls?
First time round sum loop
Expected: 9 instructions
LC0 load – 3 stalls
Each memory fetch – 4 stalls
Actual: 9 + 11 stalls
Other times around the loop
Expected: 5 instructions
Each memory fetch – 4 stalls
Actual: 5 + 8 stalls
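The sum-loop figures on these two slides can be combined into a rough cycle estimate for the whole loop. The sketch below is in Python and is only a model of the slide numbers, not a measurement. It assumes two memory fetches per pass (inferred from 8 = 2 × 4 and 11 = 3 + 2 × 4, not stated on the slides) and that every later pass behaves like the one shown; the function and parameter names are hypothetical.

```python
def sum_loop_cycles(n, fetches_per_pass=2, fetch_stall=4, lc0_stall=3):
    """Estimate (instructions, stalls) for n passes of the sum loop.

    First pass: 9 instructions (4 set-up + 5 loop body) plus the LC0 load stall.
    Later passes: 5 instructions each.
    fetches_per_pass=2 is an inference from the slide totals, not a stated fact.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    instructions = 9 + 5 * (n - 1)
    first_pass_stalls = lc0_stall + fetches_per_pass * fetch_stall   # 3 + 8 = 11
    later_pass_stalls = fetches_per_pass * fetch_stall               # 8
    stalls = first_pass_stalls + later_pass_stalls * (n - 1)
    return instructions, stalls


instructions, stalls = sum_loop_cycles(128)
print(instructions, stalls, instructions + stalls)  # 644 1027 1671
```

Under these assumptions the sum loop alone accounts for about 1027 stall cycles at N = 128, which is most of the roughly 1100 delay cycles reported for the whole function.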
Shift Loop – 1st time around
Expected: 3 instructions
No stalls on LC0 load?
4 stalls on ASHIFTR
BTB hit followed by 5 aborts
Shift Loop – 2nd and later times around
Expected: 2
Actual: 2
Store back of &left, &right
Expected: 6
Actual: 6 + 3 stalls
Exercise 1
• Based on what we know up to this point, determine the expected stalls during the last piece of code: the FIFO buffer operation
Third key element – FIFO circular buffer – Order (N)
Update outgoing parameters: 6 instructions
Update FIFO: 3 + 6 * N instructions
Function return: 2 instructions
Answer
Second time into function
What happens if the cache is not full (first time the function is called)?
Was: 2 + 2 stalls in loop
Now: 11 + 12 stalls in loop
First time function called: 2nd time around the loop
Ditto the 3rd, 4th, 5th, 6th, 7th and 8th times
9th time around the loop
Ditto the 17th, 25th, 33rd, 41st, 49th
What is happening?
• With the cache filled, memory read accesses require 4 cycles
• Unfilled, the first one requires “12 cycles”
• Then the next 7 require 4 cycles
• Total guess: is the extra time associated with doing extra reads to fill the cache?
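The pattern described above (one slow read followed by seven fast ones, repeating on the 9th, 17th, 25th, ... pass) can be written down as a simple cost model. This is a sketch in Python of the slide's guess, not a description of how the TigerSHARC cache actually works; the function name, the parameter names and the grouping of 8 reads per fill are assumptions taken from the observed pattern.

```python
def read_access_cycles(n_reads, cache_filled, reads_per_fill=8,
                       fast_cost=4, slow_cost=12):
    """Total cycles for n_reads sequential memory reads under the slide's guess.

    cache_filled=True : every read costs fast_cost cycles.
    cache_filled=False: the 1st, 9th, 17th, ... read costs slow_cost cycles,
                        and the 7 reads after each of those cost fast_cost.
    """
    if cache_filled:
        return n_reads * fast_cost
    # Number of slow reads: one per started group of reads_per_fill reads.
    slow_reads = (n_reads + reads_per_fill - 1) // reads_per_fill
    return slow_reads * slow_cost + (n_reads - slow_reads) * fast_cost


print(read_access_cycles(128, cache_filled=True))   # 512
print(read_access_cycles(128, cache_filled=False))  # 640
```

In this model an empty cache adds 8 cycles per group of 8 reads, so 128 sequential reads cost 128 extra cycles compared with the cache-filled case.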
Tackled today
• Expected and actual cycle count for the J-IALU version of the DC_Removal algorithm
• Understanding why the stalls occur and how to fix them
• Differences between the first time into a function (cache empty) and the second time into the function
• Further unknowns: how memory operations really work