
Lab 2 Ideas


Presentation Transcript


  1. Lab 2 Ideas Various forms of non-optimized FIR code

  2. Demonstrate progress on Lab. 1. 1.5% of the term mark is associated with demonstrating progress on developing C++ code (FIR) and associated tests. Demonstrate at the start of the lab. Speed IIR -- stage 5. M. Smith, ECE, University of Calgary, Canada

  3. Lab. notes. Some information and suggestions about Lab. 2 are in the laboratory notes; more information here. Minor changes in the code will be needed once the first asm FIR is running. Keep old versions for reference and possible reanalysis as you learn more.

  4. Lab. 2 – Preparation for Lab. 3, where we optimize code • Step 1 – Generate ASM tests based on the C++ tests from Lab. 1 • Step 2 – Convert 2 C++ routines into assembly code and test them • FIR_ONLINE_ASM(Xin, Yout, FIRcoeffs, FIR_N) – for all data points, call FIR_ONLINE_ASM( ) • Time C++ code calling FIR_ONLINE_ASM with one loop • FIR_OFFLINE_ASM(XinArray, YoutArray, M, FirCoeffs, FIR_N) • Time FIR_OFFLINE_ASM with a double zero-overhead loop
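The ASM routines in Step 2 need matching C++ references to test against. A minimal sketch of the two routines is below, assuming float samples and a caller-supplied delay line; the slide's prototypes do not show a delay-line argument, so its placement here is an assumption — match whatever convention your Lab 1 code uses.

```cpp
#include <cassert>

// On-line FIR: processes one sample per call, keeping past inputs in a
// caller-supplied delay line (delay[0] = newest sample).
float FIR_ONLINE_REF(float Xin, float *delay,
                     const float *FIRcoeffs, int FIR_N) {
    for (int i = FIR_N - 1; i > 0; --i)    // shift the delay line
        delay[i] = delay[i - 1];
    delay[0] = Xin;
    float Yout = 0.0f;
    for (int i = 0; i < FIR_N; ++i)        // multiply-accumulate
        Yout += delay[i] * FIRcoeffs[i];
    return Yout;
}

// Off-line FIR: filters a whole block of M samples with one outer loop,
// calling the on-line routine for each data point.
void FIR_OFFLINE_REF(const float *XinArray, float *YoutArray, int M,
                     float *delay, const float *FirCoeffs, int FIR_N) {
    for (int m = 0; m < M; ++m)
        YoutArray[m] = FIR_ONLINE_REF(XinArray[m], delay, FirCoeffs, FIR_N);
}
```

With coefficients {0.5, 0.5} and inputs {2, 4}, the off-line routine produces {1, 3} — an exact (power-of-two) test case to reuse against the ASM versions.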

  5. Lab. 2 – no optimization • However, as mentioned before: • Write the code with “plan to optimize” in mind • Get the ASM code to work in the best way you can • Then “prepare for optimization” – called “refactoring for speed” – see the next slides

  6. Version 1 and 2 – no parallel code
     FOR (I = 0 to N-1, I++)
        read data[I]   ; J-Bus
        read coeff[I]  ; J-Bus
        multiply       ; X-COMPUTE
        add            ; X-COMPUTE
     END_FOR
     Time with the software loop and then the hardware loop – leave both code versions behind.
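Timing the software-loop and hardware-loop versions needs a harness. On the TigerSHARC you would bracket the loop with reads of the cycle counter; a host-side sketch using std::chrono (an assumption for illustration, not the lab's timing method) looks like:

```cpp
#include <cassert>
#include <chrono>

// Host-side timing sketch: average the wall-clock cost of calling f()
// reps times. Averaging over many repetitions reduces the error from
// getting into / out of the timing code.
template <typename F>
long long average_time_ns(F f, int reps = 1000) {
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
        f();                                   // code under test
    auto t1 = std::chrono::steady_clock::now();
    auto total = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0);
    return total.count() / reps;               // average per call
}
```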

  7. Expected part of report • Use post-modify addressing • Test the code for N = 32 • Time the code for large N to minimize the timing errors from getting into / out of the timing code • Calculate the theoretical time for the loop • Number of instructions plus number of stalls • Show the stalls in the code • Expect theoretical time = actual time within 1%
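The "theory within 1%" check above is simple arithmetic. A small helper might look like the following; the instruction and stall counts passed in are placeholders to be replaced by your own pipeline analysis:

```cpp
#include <cassert>
#include <cmath>

// Theoretical loop time in cycles: N iterations, each costing its
// instruction count plus its stall count, plus any fixed overhead.
double theoretical_cycles(int N, int instr_per_iter, int stalls_per_iter,
                          int overhead = 0) {
    return static_cast<double>(N) * (instr_per_iter + stalls_per_iter)
           + overhead;
}

// Report |theory - actual| / actual as a percentage; the lab target is
// agreement within 1%.
double percent_error(double theory, double actual) {
    return 100.0 * std::fabs(theory - actual) / actual;
}
```

For example, with 4 instructions and 2 stalls per iteration and N = 32, the model predicts 192 cycles; if the measured time were 190 cycles the error would be just over 1%, which would warrant a second look at the stall count.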

  8. Version 2 – no parallel code. Put the hardware loop jump with the add – why not?
     FOR (I = 0 to N-1, I++)              Time = N * 6
        read data[I]   ; J-Bus
        read coeff[I]  ; J-Bus
        MEMORY NO_OP?
        multiply       ; X-COMPUTE
        COMPUTE NO_OP
        add            ; X-COMPUTE        loop jump // add
     END_FOR

  9. Version 3 – no parallel code. Unroll the loop.
     FOR (I = 0 to N-1, I += 2)           -- N a multiple of 2    Time = N / 2 * 12
        read data[I]       ; J-Bus
        read coeff[I]      ; J-Bus
        MEMORY NO_OP?
        multiply           ; X-COMPUTE
        COMPUTE NO_OP
        add                ; X-COMPUTE
        read data[I + 1]   ; J-Bus
        read coeff[I + 1]  ; J-Bus
        MEMORY NO_OP?
        multiply           ; X-COMPUTE
        COMPUTE NO_OP
        add                ; X-COMPUTE    loop jump // add
     END_FOR
     No speed difference expected.
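In C++ the Version 3 transformation is a plain unroll-by-2. A sketch, assuming float types and N a multiple of 2:

```cpp
#include <cassert>

// Unroll-by-2 mirror of Version 3: two MACs per trip, halving the
// loop-control overhead. Still one accumulator, so successive adds
// depend on each other -- no parallelism yet, and no speedup expected.
float fir_unrolled2(const float *data, const float *coeff, int N) {
    float sum = 0.0f;
    for (int i = 0; i < N; i += 2) {       // N assumed a multiple of 2
        sum += data[i]     * coeff[i];
        sum += data[i + 1] * coeff[i + 1];
    }
    return sum;
}
```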

  10. Version 3A – no parallel code. Unroll the loop, but with “extra temporary registers” to prepare for making the code parallel later.
      FOR (I = 0 to N-1, I += 2)          -- N a multiple of 2    Time = N / 2 * 12
         read data[I]       ; J-Bus
         read coeff[I]      ; J-Bus
         MEMORY NO_OP?
         multiply           ; X-COMPUTE
         COMPUTE NO_OP
         add                ; X-COMPUTE
         read data[I + 1]   ; J-Bus
         read coeff[I + 1]  ; J-Bus
         MEMORY NO_OP?
         multiply           ; X-COMPUTE
         COMPUTE NO_OP
         add                ; X-COMPUTE   loop jump // add
      END_FOR
      No speed difference expected.
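The Version 3A idea, sketched in C++: two independent accumulators stand in for the "extra temporary registers", breaking the dependency between successive adds so the two MAC chains can later be issued in parallel compute blocks. Float types are assumed.

```cpp
#include <cassert>

// Unroll-by-2 with two independent partial sums. Each MAC chain touches
// only its own accumulator, so neither add waits on the other -- the
// refactoring that prepares this loop for parallel execution in Lab 3.
float fir_unrolled2_2acc(const float *data, const float *coeff, int N) {
    float sum0 = 0.0f, sum1 = 0.0f;        // independent accumulators
    for (int i = 0; i < N; i += 2) {       // N assumed a multiple of 2
        sum0 += data[i]     * coeff[i];
        sum1 += data[i + 1] * coeff[i + 1];
    }
    return sum0 + sum1;                    // combine once, at the end
}
```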

  11. Version 4 – no parallel code. Shift to using the K-bus for coeff[ ].
      FOR (I = 0 to N-1, I += 2)          -- N a multiple of 2    Time = N / 2 * 12
         read data[I]       ; J-Bus
         read coeff[I]      ; K-Bus
         MEMORY NO_OP?
         multiply           ; X-COMPUTE
         COMPUTE NO_OP
         add                ; X-COMPUTE
         read data[I + 1]   ; J-Bus
         read coeff[I + 1]  ; K-Bus
         MEMORY NO_OP?
         multiply           ; X-COMPUTE
         COMPUTE NO_OP
         add                ; X-COMPUTE   loop jump // add
      END_FOR
      No speed difference expected.

  12. Version 4A – no parallel execution – J and K-bus access the same memory block
      FOR (I = 0 to N-1, I += 2)          -- N a multiple of 2    Time = N / 2 * 12
         read data[I]    ; J-Bus , read coeff[I]    ; K-Bus
         MEMORY NO_OP?  MEMORY NO_OP?
         multiply        ; X-COMPUTE
         COMPUTE NO_OP
         add             ; X-COMPUTE
         read data[I + 1]; J-Bus , read coeff[I + 1]; K-Bus
         MEMORY NO_OP?  MEMORY NO_OP?
         multiply        ; X-COMPUTE
         COMPUTE NO_OP
         add             ; X-COMPUTE      loop jump // add
      END_FOR
      No speed difference expected.

  13. Expected cache issues
      FOR (I = 0 to N-1, I++)
         read data[I]   ; J-Bus
         read coeff[I]  ; J-Bus
         multiply       ; X-COMPUTE
         add            ; X-COMPUTE
      END_FOR
      Time with the software loop and then the hardware loop – leave both code versions behind.

  14. First time into the loop – cache on
      FOR (I = 0 to N-1, I++)
         read data[I]   ; J-Bus
         read coeff[I]  ; J-Bus
         multiply       ; X-COMPUTE
         add            ; X-COMPUTE
      END_FOR
      Time = 2 * N * Mtime(not cached) + 2 * N + N * (# of stalls)

  15. Second time into the loop – cache on
      FOR (I = 0 to N-1, I++)
         read data[I]   ; J-Bus
         read coeff[I]  ; J-Bus
         multiply       ; X-COMPUTE
         add            ; X-COMPUTE
      END_FOR
      Time = N * (data fetch time) + N * (coeff fetch time) + 2 * N + N * stalls

  16. Second time into the loop – cache on
      FOR (I = 0 to N-1, I++)
         read data[I]   ; J-Bus
         read coeff[I]  ; J-Bus
         multiply       ; X-COMPUTE
         add            ; X-COMPUTE
      END_FOR
      Time = (N – 1) * Mtime(cached) + Mtime(cache flush, cache reload) for the data fetches
           + N * Mtime(cached) for the coeff fetches + 2 * N + N * stalls

  17. Different types of memory timing for read operations • Read from external memory • Read from external memory + cache store • Read from internal memory • Read from internal memory + cache store • Read from cache • Note – what happens if the cache is full?

  18. First time into the loop – cache on, but now the processor is doing quad fetches into cache
      FOR (I = 0 to N-1, I++)
         read data[I]   ; J-Bus
         read coeff[I]  ; J-Bus
         multiply       ; X-COMPUTE
         add            ; X-COMPUTE
      END_FOR
      Time = 2 * (N / 4) * Mtime(not cached) + 2 * (3N / 4) * Mtime(cached) + 2 * N + N * (# of stalls)
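The quad-fetch timing model on this slide can be written out directly so that predictions can be compared against measurements. All parameter values are placeholders for the memory times you measure on the hardware:

```cpp
#include <cassert>

// First-pass model with quad fetches: of the 2*N reads, 1 in 4 misses
// (paying the uncached memory time while a quad line is pulled into
// cache); the other 3 in 4 hit the freshly filled line.
double first_pass_time(int N, double m_uncached, double m_cached,
                       int stalls_per_iter) {
    return 2.0 * N / 4.0 * m_uncached       // one miss per quad of reads
         + 2.0 * 3.0 * N / 4.0 * m_cached   // three hits per quad
         + 2.0 * N                          // compute instructions
         + static_cast<double>(N) * stalls_per_iter;
}
```

For example, with N = 4, an uncached read of 8 cycles, a cached read of 1 cycle, and 2 stalls per iteration, the model gives 16 + 6 + 8 + 8 = 38 cycles.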

  19. Note: the hardware is doing quad fetches into cache. You are NOT doing quad fetches in your code. So why would that help?

  20. Typical cache behaviour – true for TigerSHARC? Don’t know! • You issue a memory read request • The processor sends 2 memory read requests • One to true memory • One to the cache • If the cache replies “I have that value”, then the value is fetched from the cache and the memory read request is aborted

  21. Typical cache behaviour – true for TigerSHARC? Don’t know! • You issue a memory read request • The processor sends 2 memory read requests • One to true memory • One to the cache • If the cache replies “no value”, then the value is fetched from memory, stored in the cache, and sent to the user • There is no rule that says memory has to give only one value to the cache

  22. What if the cache is full? Expected behaviour • One existing cache line is thrown away • Least used – or random • Write operations can change the cache • If the cache line being thrown away has changed, then that value must be written to memory before the cache line is replaced • Does that happen in parallel with user code? Depends on the algorithm characteristics

  23. See the TigerSHARC hardware manual for cache details. If the timing behaviour is not what you are expecting, then work out why. In your report, explain your analysis.

  24. Final part of Lab. 2 – Version 4B. Run assembly code timing tests with the data placed in dm memory and the FIR coefficients placed in pm memory by the compiler. Version 4 will only need a name change to meet the prototype change: FIR_ASM(*data, *fir, N) becomes FIR_ASM(dm *data, dm *fir, N) for Version 4, and FIR_ASM(dm *data, pm *fir, N) is the Version 4B C++ prototype.
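One way the Version 4B prototype might be kept portable: dm and pm are VisualDSP++ memory-space qualifiers, so macros let the same prototype compile on a host compiler, where they expand to nothing. The __ADSPTS__ detection macro is an assumption — check your toolchain's predefines — and the host-side body below is only a reference for testing; the real Lab 2 version is the assembly routine.

```cpp
#include <cassert>

// dm/pm are VisualDSP++ memory-space qualifiers. On a host compiler the
// macros expand to nothing, so the same prototype compiles everywhere.
// __ADSPTS__ as the TigerSHARC detection macro is an assumption.
#if defined(__ADSPTS__)
  #define DM dm
  #define PM pm
#else
  #define DM
  #define PM
#endif

// Version 4B prototype: data in dm memory, coefficients in pm memory.
// Host-side reference body so the sketch is self-contained and testable.
float FIR_ASM(DM const float *data, PM const float *fir, int N) {
    float sum = 0.0f;
    for (int i = 0; i < N; ++i)
        sum += data[i] * fir[i];               // multiply-accumulate
    return sum;
}
```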
