
Lab 2 Ideas


Presentation Transcript


  1. Lab 2 Ideas Various forms of non-optimized FIR code

  2. Demonstrate progress on Lab. 1. 1.5% of the term mark is associated with demonstrating progress on developing C++ code (FIR) and associated tests. Demonstrate at the start of the lab. Speed IIR -- stage 5. M. Smith, ECE, University of Calgary, Canada

  3. Lab. notes. Some information and suggestions about Lab. 2 are in the laboratory notes; more information here. Minor changes in the code will be needed once the first asm FIR is running. Keep old versions for reference and possible reanalysis as you learn more.

  4. Lab. 2 – Preparation for Lab. 3, where we optimize code • Step 1 – Generate ASM tests based on the C++ tests from Lab. 1 • Step 2 – Convert 2 C++ routines into assembly code and test them • FIR_ONLINE_ASM(Xin, Yout, FIRcoeffs, FIR_N) – for all data points, call FIR_ONLINE_ASM( ) • Time C++ code calling FIR_ONLINE_ASM with one loop • FIR_OFFLINE_ASM(XinArray, YoutArray, M, FirCoeffs, FIR_N) • Time FIR_OFFLINE_ASM with a double zero-overhead loop
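The ASM routines in Step 2 need matching C++ references to test against. A minimal sketch of the two routines is below, assuming float samples and a caller-supplied delay line; the slide's prototypes do not show a delay-line argument, so its placement here is an assumption — match whatever convention your Lab 1 code uses.

```cpp
#include <cassert>

// On-line FIR: processes one sample per call, keeping past inputs in a
// caller-supplied delay line (delay[0] = newest sample).
float FIR_ONLINE_REF(float Xin, float *delay,
                     const float *FIRcoeffs, int FIR_N) {
    for (int i = FIR_N - 1; i > 0; --i)    // shift the delay line
        delay[i] = delay[i - 1];
    delay[0] = Xin;
    float Yout = 0.0f;
    for (int i = 0; i < FIR_N; ++i)        // multiply-accumulate
        Yout += delay[i] * FIRcoeffs[i];
    return Yout;
}

// Off-line FIR: filters a whole block of M samples with one outer loop,
// calling the on-line routine for each data point.
void FIR_OFFLINE_REF(const float *XinArray, float *YoutArray, int M,
                     float *delay, const float *FirCoeffs, int FIR_N) {
    for (int m = 0; m < M; ++m)
        YoutArray[m] = FIR_ONLINE_REF(XinArray[m], delay, FirCoeffs, FIR_N);
}
```

With coefficients {0.5, 0.5} and inputs {2, 4}, the off-line routine produces {1, 3} — an exact (power-of-two) test case to reuse against the ASM versions.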

  5. Lab. 2 – no optimization • However, as mentioned before: • Write the code with “plan to optimize” in mind • Get the ASM code to work in the best way you can • Then “prepare for optimization” – called “refactoring for speed” – see the next slides

  6. Version 1 and 2 – no parallel code
     FOR (I = 0 to N-1, I++)
        read data[I]   ; J-Bus
        read coeff[I]  ; J-Bus
        multiply       ; X-COMPUTE
        add            ; X-COMPUTE
     END_FOR
     Time with the software loop and then the hardware loop – leave both code versions behind.
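Timing the software-loop and hardware-loop versions needs a harness. On the TigerSHARC you would bracket the loop with reads of the cycle counter; a host-side sketch using std::chrono (an assumption for illustration, not the lab's timing method) looks like:

```cpp
#include <cassert>
#include <chrono>

// Host-side timing sketch: average the wall-clock cost of calling f()
// reps times. Averaging over many repetitions reduces the error from
// getting into / out of the timing code.
template <typename F>
long long average_time_ns(F f, int reps = 1000) {
    auto t0 = std::chrono::steady_clock::now();
    for (int r = 0; r < reps; ++r)
        f();                                   // code under test
    auto t1 = std::chrono::steady_clock::now();
    auto total = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0);
    return total.count() / reps;               // average per call
}
```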

  7. Expected part of report • Use post-modify addressing • Test the code for N = 32 • Time the code for large N to minimize the timing errors from getting into / out of the timing code • Calculate the theoretical time for the loop • Number of instructions plus number of stalls • Show the stalls in the code • Expect theoretical time = actual time within 1%
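The "theory within 1%" check above is simple arithmetic. A small helper might look like the following; the instruction and stall counts passed in are placeholders to be replaced by your own pipeline analysis:

```cpp
#include <cassert>
#include <cmath>

// Theoretical loop time in cycles: N iterations, each costing its
// instruction count plus its stall count, plus any fixed overhead.
double theoretical_cycles(int N, int instr_per_iter, int stalls_per_iter,
                          int overhead = 0) {
    return static_cast<double>(N) * (instr_per_iter + stalls_per_iter)
           + overhead;
}

// Report |theory - actual| / actual as a percentage; the lab target is
// agreement within 1%.
double percent_error(double theory, double actual) {
    return 100.0 * std::fabs(theory - actual) / actual;
}
```

For example, with 4 instructions and 2 stalls per iteration and N = 32, the model predicts 192 cycles; if the measured time were 190 cycles the error would be just over 1%, which would warrant a second look at the stall count.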

  8. Version 2 – no parallel code. Put the hardware loop jump with the add – why not?
     FOR (I = 0 to N-1, I++)              Time = N * 6
        read data[I]   ; J-Bus
        read coeff[I]  ; J-Bus
        MEMORY NO_OP?
        multiply       ; X-COMPUTE
        COMPUTE NO_OP
        add            ; X-COMPUTE        loop jump // add
     END_FOR

  9. Version 3 – no parallel code. Unroll the loop.
     FOR (I = 0 to N-1, I += 2)           -- N a multiple of 2    Time = N / 2 * 12
        read data[I]       ; J-Bus
        read coeff[I]      ; J-Bus
        MEMORY NO_OP?
        multiply           ; X-COMPUTE
        COMPUTE NO_OP
        add                ; X-COMPUTE
        read data[I + 1]   ; J-Bus
        read coeff[I + 1]  ; J-Bus
        MEMORY NO_OP?
        multiply           ; X-COMPUTE
        COMPUTE NO_OP
        add                ; X-COMPUTE    loop jump // add
     END_FOR
     No speed difference expected.
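In C++ the Version 3 transformation is a plain unroll-by-2. A sketch, assuming float types and N a multiple of 2:

```cpp
#include <cassert>

// Unroll-by-2 mirror of Version 3: two MACs per trip, halving the
// loop-control overhead. Still one accumulator, so successive adds
// depend on each other -- no parallelism yet, and no speedup expected.
float fir_unrolled2(const float *data, const float *coeff, int N) {
    float sum = 0.0f;
    for (int i = 0; i < N; i += 2) {       // N assumed a multiple of 2
        sum += data[i]     * coeff[i];
        sum += data[i + 1] * coeff[i + 1];
    }
    return sum;
}
```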

  10. Version 3A – no parallel code. Unroll the loop, but with “extra temporary registers” to prepare for making the code parallel later.
      FOR (I = 0 to N-1, I += 2)          -- N a multiple of 2    Time = N / 2 * 12
         read data[I]       ; J-Bus
         read coeff[I]      ; J-Bus
         MEMORY NO_OP?
         multiply           ; X-COMPUTE
         COMPUTE NO_OP
         add                ; X-COMPUTE
         read data[I + 1]   ; J-Bus
         read coeff[I + 1]  ; J-Bus
         MEMORY NO_OP?
         multiply           ; X-COMPUTE
         COMPUTE NO_OP
         add                ; X-COMPUTE   loop jump // add
      END_FOR
      No speed difference expected.
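The Version 3A idea, sketched in C++: two independent accumulators stand in for the "extra temporary registers", breaking the dependency between successive adds so the two MAC chains can later be issued in parallel compute blocks. Float types are assumed.

```cpp
#include <cassert>

// Unroll-by-2 with two independent partial sums. Each MAC chain touches
// only its own accumulator, so neither add waits on the other -- the
// refactoring that prepares this loop for parallel execution in Lab 3.
float fir_unrolled2_2acc(const float *data, const float *coeff, int N) {
    float sum0 = 0.0f, sum1 = 0.0f;        // independent accumulators
    for (int i = 0; i < N; i += 2) {       // N assumed a multiple of 2
        sum0 += data[i]     * coeff[i];
        sum1 += data[i + 1] * coeff[i + 1];
    }
    return sum0 + sum1;                    // combine once, at the end
}
```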

  11. Version 4 – no parallel code. Shift to using the K-bus for coeff[ ].
      FOR (I = 0 to N-1, I += 2)          -- N a multiple of 2    Time = N / 2 * 12
         read data[I]       ; J-Bus
         read coeff[I]      ; K-Bus
         MEMORY NO_OP?
         multiply           ; X-COMPUTE
         COMPUTE NO_OP
         add                ; X-COMPUTE
         read data[I + 1]   ; J-Bus
         read coeff[I + 1]  ; K-Bus
         MEMORY NO_OP?
         multiply           ; X-COMPUTE
         COMPUTE NO_OP
         add                ; X-COMPUTE   loop jump // add
      END_FOR
      No speed difference expected.

  12. Version 4A – no parallel execution – J and K-bus access the same memory block
      FOR (I = 0 to N-1, I += 2)          -- N a multiple of 2    Time = N / 2 * 12
         read data[I]    ; J-Bus , read coeff[I]    ; K-Bus
         MEMORY NO_OP?  MEMORY NO_OP?
         multiply        ; X-COMPUTE
         COMPUTE NO_OP
         add             ; X-COMPUTE
         read data[I + 1]; J-Bus , read coeff[I + 1]; K-Bus
         MEMORY NO_OP?  MEMORY NO_OP?
         multiply        ; X-COMPUTE
         COMPUTE NO_OP
         add             ; X-COMPUTE      loop jump // add
      END_FOR
      No speed difference expected.

  13. Expected cache issues
      FOR (I = 0 to N-1, I++)
         read data[I]   ; J-Bus
         read coeff[I]  ; J-Bus
         multiply       ; X-COMPUTE
         add            ; X-COMPUTE
      END_FOR
      Time with the software loop and then the hardware loop – leave both code versions behind.

  14. First time into the loop – cache on
      FOR (I = 0 to N-1, I++)
         read data[I]   ; J-Bus
         read coeff[I]  ; J-Bus
         multiply       ; X-COMPUTE
         add            ; X-COMPUTE
      END_FOR
      Time = 2 * N * Mtime(not cached) + 2 * N + N * (# of stalls)

  15. Second time into the loop – cache on
      FOR (I = 0 to N-1, I++)
         read data[I]   ; J-Bus
         read coeff[I]  ; J-Bus
         multiply       ; X-COMPUTE
         add            ; X-COMPUTE
      END_FOR
      Time = N * (data fetch time) + N * (coeff fetch time) + 2 * N + N * stalls

  16. Second time into the loop – cache on
      FOR (I = 0 to N-1, I++)
         read data[I]   ; J-Bus
         read coeff[I]  ; J-Bus
         multiply       ; X-COMPUTE
         add            ; X-COMPUTE
      END_FOR
      Time = (N – 1) * Mtime(cached) + Mtime(cache flush, cache reload) for the data fetches
           + N * Mtime(cached) for the coeff fetches + 2 * N + N * stalls

  17. Different types of memory timing for read operations • Read from external memory • Read from external memory + cache store • Read from internal memory • Read from internal memory + cache store • Read from cache • Note – what happens if the cache is full?

  18. First time into the loop – cache on, but now the processor is doing quad fetches into cache
      FOR (I = 0 to N-1, I++)
         read data[I]   ; J-Bus
         read coeff[I]  ; J-Bus
         multiply       ; X-COMPUTE
         add            ; X-COMPUTE
      END_FOR
      Time = 2 * (N / 4) * Mtime(not cached) + 2 * (3N / 4) * Mtime(cached) + 2 * N + N * (# of stalls)
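The quad-fetch timing model on this slide can be written out directly so that predictions can be compared against measurements. All parameter values are placeholders for the memory times you measure on the hardware:

```cpp
#include <cassert>

// First-pass model with quad fetches: of the 2*N reads, 1 in 4 misses
// (paying the uncached memory time while a quad line is pulled into
// cache); the other 3 in 4 hit the freshly filled line.
double first_pass_time(int N, double m_uncached, double m_cached,
                       int stalls_per_iter) {
    return 2.0 * N / 4.0 * m_uncached       // one miss per quad of reads
         + 2.0 * 3.0 * N / 4.0 * m_cached   // three hits per quad
         + 2.0 * N                          // compute instructions
         + static_cast<double>(N) * stalls_per_iter;
}
```

For example, with N = 4, an uncached read of 8 cycles, a cached read of 1 cycle, and 2 stalls per iteration, the model gives 16 + 6 + 8 + 8 = 38 cycles.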

  19. Note: the hardware is doing quad fetches into cache. You are NOT doing quad fetches in your code. So why would that help?

  20. Typical cache behaviour – true for TigerSHARC? Don’t know! • You issue a memory read request • The processor sends 2 memory read requests • One to true memory • One to the cache • If the cache replies “I have that value”, then the value is fetched from the cache and the memory read request is aborted

  21. Typical cache behaviour – true for TigerSHARC? Don’t know! • You issue a memory read request • The processor sends 2 memory read requests • One to true memory • One to the cache • If the cache replies “no value”, then the value is fetched from memory, stored in the cache, and sent to the user • There is no rule that says memory has to give only one value to the cache

  22. What if the cache is full? Expected behaviour • One existing cache line is thrown away • Least used – or random • Write operations can change the cache • If the cache line being thrown away has changed, then that value must be written to memory before the cache line is replaced • Does that happen in parallel with user code? Depends on the algorithm characteristics

  23. See the TigerSHARC hardware manual for cache details. If the timing behaviour is not what you are expecting, then work out why. In your report, explain your analysis.

  24. Final part of Lab. 2 – Version 4B. Run assembly code timing tests with the data placed in dm memory and the FIR coefficients placed in pm memory by the compiler. Version 4 will only need a name change to meet the prototype change: FIR_ASM(*data, *fir, N) becomes FIR_ASM(dm *data, dm *fir, N) for Version 4, and FIR_ASM(dm *data, pm *fir, N) is the Version 4B C++ prototype.
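One way the Version 4B prototype might be kept portable: dm and pm are VisualDSP++ memory-space qualifiers, so macros let the same prototype compile on a host compiler, where they expand to nothing. The __ADSPTS__ detection macro is an assumption — check your toolchain's predefines — and the host-side body below is only a reference for testing; the real Lab 2 version is the assembly routine.

```cpp
#include <cassert>

// dm/pm are VisualDSP++ memory-space qualifiers. On a host compiler the
// macros expand to nothing, so the same prototype compiles everywhere.
// __ADSPTS__ as the TigerSHARC detection macro is an assumption.
#if defined(__ADSPTS__)
  #define DM dm
  #define PM pm
#else
  #define DM
  #define PM
#endif

// Version 4B prototype: data in dm memory, coefficients in pm memory.
// Host-side reference body so the sketch is self-contained and testable.
float FIR_ASM(DM const float *data, PM const float *fir, int N) {
    float sum = 0.0f;
    for (int i = 0; i < N; ++i)
        sum += data[i] * fir[i];               // multiply-accumulate
    return sum;
}
```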
