
CACHE-DSP Tool How to avoid having a SHARC thrashing on a cache-line

M. Smith, University of Calgary, Canada; B. Howse, Cell-Loc, Calgary, Canada. Contact: smithmr@ucalgary.ca


Presentation Transcript


  1. CACHE-DSP Tool: How to avoid having a SHARC thrashing on a cache-line
     M. Smith, University of Calgary, Canada
     B. Howse, Cell-Loc, Calgary, Canada
     Contact: smithmr@ucalgary.ca

  2. Series of Talks and Workshops
     • CACHE-DSP: Talk on a simple process tool to identify cache conflicts in DSP code.
     • SQUISH-DSP: Talk on using a project management tool to automate identification of parallel DSP processor instructions.
     • SHARC Ecology 101: Workshop showing how to systematically write parallel 2106X code.
     • SHARC Ecology 201: Workshop on SQUISH-DSP and CACHE-DSP tools.
     Cache-DSP Tool smithmr@ucalgary.ca

  3. Concepts to be discussed
     • Concept behind 2106X instruction cache
     • Cache operation
     • Introduction of CACHE THRASHING
     • Solutions to avoid a Cache Thrash without delaying product release
     • Basis of Cache-DSP tool
     • Acknowledgements

  4. Purpose of SHARC instruction cache
     • Harvard processor architecture
       • One bus for fetching instructions
       • Another bus for fetching data
       • Twin-bus architecture avoids instruction/data fetch conflicts
     • DSP algorithms
       • Addition and multiplication intensive
       • Multiple simultaneous accesses to data structures are typically needed
       • Twin-bus architecture does not avoid data/data fetch conflicts

  5. Solutions to data/data fetch conflicts
     • Cache a single instruction
       • Single-instruction loop
       • Frees up the instruction bus for use as a data bus to fetch from separate data memory
       • Very limited in application
     • Three-bus processor
       • Expensive to implement for all memory
     The ADSP21XXX approach is to make a three-bus processor architecture available for a limited number of instructions on an 'as needed' basis: the instruction cache.

  6. Example
     • C code converts a temperature array from Celsius to Fahrenheit
     • Assembly code has 6 PM( ) operations

  7. Example

  8. First time round loop: STALL

     Cycle  Fetch                                       Decode                     Execute
     1      Instr. on PM: F1=, r0=dm
     2      Instr. on PM: F13=, r2=dm, pm=              Instr.: F1=, r0=dm
     3      Instr. on PM: F8=, r0=dm                    Instr.: F13=, r2=dm, pm=   Data on DM: F1=, r0=dm
     4      STALL                                       Instr.: F8=, r0=dm         Data on DM, PM: F13=, r2=dm, pm=
     5      Instr. on PM / to cache: F12=, r2=dm, pm=

  9. Second time round loop: 3-bus operation

     Cycle  Fetch                                       Decode                     Execute
     1      Instr. on PM: F1=, r0=dm
     2      Instr. on PM: F13=, r2=dm, pm=              Instr.: F1=, r0=dm
     3      Instr. on PM: F8=, r0=dm                    Instr.: F13=, r2=dm, pm=   Data on DM: F1=, r0=dm
     4      Instr. from cache: F12=, r2=dm, pm=         Instr.: F8=, r0=dm         Data on DM, PM: F13=, r2=dm, pm=
     5                                                  Instr.: F12=, r2=dm, pm=   Data on DM: F8=, r0=dm

  10. Instruction Cache Characteristics
      • 32 cache locations
        • 32 locations looks small
        • but the cache is used ONLY when a data access on the PM bus conflicts with an instruction fetch on the PM bus
      • Typically satisfactory for tight DSP algorithm loops of up to 100+ atomic operations

  11. MAJOR LIMITATION POSSIBLE
      • Cache is 2-way associative
        • 32 cache locations arranged as 16 groups of 2
        • Instruction storage location in cache determined by last 4 bits of address
        • Instruction N stored at cache location N modulo 16
        • Each group also has a least-recently-used (LRU) bit
        • LRU instruction replaced on a cache miss
      • Possible to induce a CACHE THRASH
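
The address-to-location mapping described on this slide can be sketched in a few lines of Python. This is only an illustration: the function name `cache_set` and the example addresses are hypothetical, while the geometry (16 groups of 2, selected by the last 4 address bits) is taken from the slide.

```python
NUM_SETS = 16  # 32 cache locations / 2 ways, per the slide


def cache_set(address: int) -> int:
    """Return the cache group selected by the last 4 bits of an instruction address."""
    return address & (NUM_SETS - 1)  # same as address % 16 for non-negative addresses


# Hypothetical instruction addresses that all end in the same 4 bits (0x5).
# Three such instructions compete for the 2 locations of one group.
conflicting = [0x105, 0x115, 0x125]
print([cache_set(a) for a in conflicting])  # [5, 5, 5]
```

Any three cached instructions whose addresses agree in their last 4 bits are candidates for the thrash discussed in the following slides.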

  12. Simple Example
      • Assume that the cache is 2-way associative with 8 (not 32) locations
      • 6 cache operations to be placed into 8 cache locations
      Instruction address to cache line (last 2 bits of address):
        %00: 0, 4, 8, 12
        %01: 1, 5, 9
        %10: 2, 6, 10
        %11: 3, 7, 11

  13. Simple Example -- First Cache Op
      • Instruction 2 forces Instruction 4 into cache line %00
      Cache contents:
        %00: 4

  14. Simple Example
      • Next 2 cache operations place instructions 6 and 9 into cache
      Cache contents:
        %00: 4
        %01: 9
        %10: 6

  15. Simple Example
      • 4th and 5th cache operations set LRU bits for cache lines %00 and %10
      Cache contents:
        %00: 4 (LRU), 12
        %01: 9
        %10: 6 (LRU), 10

  16. Execution of Instruction 12
      • Execution of instruction 12 occurs during fetch of instruction 2 in loop
      • 3rd cache operation involving cache line %10
      Instruction 2 goes to cache line %10
      Cache contents before the operation:
        %00: 4 (LRU), 12
        %01: 9
        %10: 6 (LRU), 10
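
A minimal sketch of why instruction 12 causes instruction 2 to be cached, assuming a 3-stage fetch/decode/execute pipeline (so the fetch stage runs two instructions ahead of the execute stage) and assuming the loop spans instructions 1 to 12. The function name is hypothetical; the PM( ) instruction numbers 2, 4, 7, 8, 10 and 12 are the ones used in the example loop.

```python
def fetched_during_execute(executing: int, loop_start: int, loop_end: int) -> int:
    """Instruction being fetched while `executing` is in the execute stage.

    Assumes fetch is two instructions ahead of execute and that the
    fetch wraps from loop_end back to loop_start.
    """
    length = loop_end - loop_start + 1
    return loop_start + (executing - loop_start + 2) % length


# Instructions with a PM( ) data access in the example loop (1..12).
pm_instructions = [2, 4, 7, 8, 10, 12]

# Each PM( ) access blocks the fetch two instructions later, so that
# instruction is the one that must come from the cache.
cached = [fetched_during_execute(i, 1, 12) for i in pm_instructions]
print(cached)  # [4, 6, 9, 10, 12, 2]
```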

  17. Summary of Cache Operations
      • First time round loop
        • Instr. 2 pushes Instr. 4 to cache line %00
        • Instr. 4 pushes Instr. 6 to cache line %10
        • Instr. 7 pushes Instr. 9 to cache line %01
        • Instr. 8 pushes Instr. 10 to cache line %10
        • Instr. 10 pushes Instr. 12 to cache line %00
        • INSTR. 12 pushes INSTR. 2 to cache line %10 WHERE IT REPLACES INSTR. 6 (LRU for %10)

  18. Cache Thrash starts operating
      • Second time round loop
        • Instr. 4 from cache line %00
        • Instr. 4 pushes Instr. 6 to cache line %10 REPLACING INSTR. 10 (LRU for %10)
        • Instr. 9 from cache line %01
        • Instr. 8 pushes Instr. 10 to cache line %10 REPLACING INSTR. 2 (LRU for %10)
        • Instr. 12 from cache line %00
        • Instr. 12 pushes Instr. 2 to cache line %10 REPLACING INSTR. 6 (LRU for %10)
      • Losing 3 cycles each time around loop
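
The thrash can be reproduced with a small simulation of the 8-location, 2-way cache from the simple example. This is a sketch under stated assumptions, not the authors' tool: the function name `simulate` is hypothetical, and each cache line is modelled as a short list ordered from least to most recently used.

```python
def simulate(cached_addrs, iterations, num_sets=4, ways=2):
    """Return the number of cache misses in each loop iteration (LRU replacement)."""
    sets = [[] for _ in range(num_sets)]  # each list is ordered oldest -> newest
    misses_per_iteration = []
    for _ in range(iterations):
        misses = 0
        for addr in cached_addrs:
            line = sets[addr % num_sets]
            if addr in line:
                line.remove(addr)      # hit: re-appended below as most recent
            else:
                misses += 1
                if len(line) == ways:
                    line.pop(0)        # evict the least recently used entry
            line.append(addr)          # addr is now the most recently used
        misses_per_iteration.append(misses)
    return misses_per_iteration


# Instructions that need the cache, in loop order (4, 6, 9, 10, 12, 2).
print(simulate([4, 6, 9, 10, 12, 2], iterations=4))  # [6, 3, 3, 3]
```

The first pass fills the cache (6 misses). Every later pass misses on instructions 6, 10 and 2, which all map to line %10 and keep evicting each other.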

  19. Easy to fix in this example
      • Can delay PM from INSTR. 2 till 3
      • This forces INSTR 5 to cache (%01) where it does not replace anything
      Figure labels: 2 -- %10, 4 -- %00, 5 -- %01, 6 -- %10, 9 -- %01 (LRU), 10 = %10, 11 = %11, 12 = %00, PM =

  20. Real Life more difficult
      • Larger number of instructions in loop
      • Jump operations (conditional or not)
      • Register dependencies
      • May need to move many PM operations
      • All this takes time
      • Need a systematic approach to gain speed while getting the product out the door in the shortest time
      • ADD-A-NOP: waste 1 cycle to gain 3

  21. ADD A CACHE FREEZE at end of the loop
      • CACHE THRASH (3 cycles wasted) replaced by STALL (instruction can't go into cache) and Freeze instruction (2 cycles wasted)
      Instruction 1 stalls
      Address map as in the simple example, extended with 13 = %01
      Cache contents:
        %00: 4 (LRU), 12
        %01: 9 (LRU)
        %10: 6 (LRU), 10
      Instructions used:
        BIT SET MODE2 CAFRZ    (Cache Freeze)
        BIT CLR MODE2 CAFRZ    (Cache Unfreeze)

  22. ADD A NOP at end of the loop
      • CACHE THRASH (3 cycles wasted) IS AVOIDED with a loss of only 1 cycle/loop because of the additional NOP instruction
      Instruction 1 goes to cache line %01
      Address map as in the simple example, extended with 13 = %01 (the NOP)
      Cache contents before Instruction 1 is cached:
        %00: 4 (LRU), 12
        %01: 9 (LRU)
        %10: 6 (LRU), 10
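
A rough cycle count for the NOP fix, as a sketch only. It assumes the simple 8-location cache, one cycle per instruction, and one lost cycle per steady-state cache miss; the function name `steady_state_misses` is hypothetical. With the NOP at address 13, the PM( ) access in instruction 12 now blocks the fetch of instruction 1 rather than instruction 2.

```python
def steady_state_misses(cached_addrs, num_sets=4, ways=2, warmup=3):
    """Cache misses per loop iteration once the cache has warmed up (LRU)."""
    sets = [[] for _ in range(num_sets)]  # each list is ordered oldest -> newest
    misses = 0
    for _ in range(warmup + 1):
        misses = 0                         # keep only the last iteration's count
        for addr in cached_addrs:
            line = sets[addr % num_sets]
            if addr in line:
                line.remove(addr)
            else:
                misses += 1
                if len(line) == ways:
                    line.pop(0)            # evict the least recently used entry
            line.append(addr)
    return misses


# Assumed cost model: one cycle per instruction plus one per miss.
original = 12 + steady_state_misses([4, 6, 9, 10, 12, 2])   # 12-instruction loop
with_nop = 13 + steady_state_misses([4, 6, 9, 10, 12, 1])   # NOP added as instr. 13
print(original, with_nop)  # 15 13
```

Under these assumptions the extra NOP costs one cycle and removes three misses, a net gain of two cycles per iteration.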

  23. Cache-DSP tool concept
      • Original code: loop cycles = C1
        1, 2, 3, 4, 5, 6, 7, endloop
      • Trial 1: loop cycles = C2
        1, 2, 3, 4, 5, 6, 7, NOP, endloop
      • Trial 2: loop cycles = C3
        1, 2, 3, 4, 5, 6, NOP, 7, endloop
      • Trial 3: loop cycles = C4
        1, 2, 3, 4, 5, NOP, 6, 7, endloop

  24. Cache-DSP tool
      • Identifies the number of cache operations and cache thrashes in the current code
      • Calculates how much adding a NOP after/before each instruction in the loop reduces cache thrashes
      • Remembers the best-case scenario
      • Then determines the effect of placing 2 NOPs (then 3, 4, etc.) somewhere in the code (preferably at the end of the loop)
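
The brute-force search described above can be sketched as follows. This is not the Cache-DSP tool itself; it is a minimal illustration that assumes the simple 8-location cache, a fetch stage two instructions ahead of execute, one cycle per instruction, and one lost cycle per steady-state miss. All function names are hypothetical.

```python
def steady_state_misses(cached_addrs, num_sets=4, ways=2, warmup=3):
    """Cache misses per loop iteration once the cache has warmed up (LRU)."""
    sets = [[] for _ in range(num_sets)]
    misses = 0
    for _ in range(warmup + 1):
        misses = 0
        for addr in cached_addrs:
            line = sets[addr % num_sets]
            if addr in line:
                line.remove(addr)
            else:
                misses += 1
                if len(line) == ways:
                    line.pop(0)            # evict the least recently used entry
            line.append(addr)
    return misses


def loop_cycles(has_pm, first_addr=1):
    """Cycles per iteration: one per instruction plus one per steady-state miss."""
    n = len(has_pm)
    # A PM( ) access at position i blocks the fetch two instructions later.
    cached = [first_addr + (i + 2) % n for i, pm in enumerate(has_pm) if pm]
    return n + steady_state_misses(cached)


def best_single_nop(has_pm):
    """Try a NOP at every position; return (cycles, position) of the best trial."""
    trials = []
    for pos in range(len(has_pm), -1, -1):   # end of loop first, as preferred
        trial = has_pm[:pos] + [False] + has_pm[pos:]
        trials.append((loop_cycles(trial), pos))
    return min(trials, key=lambda t: t[0])   # first minimum wins ties


# Example loop: instructions 1..12 with PM( ) accesses in 2, 4, 7, 8, 10, 12.
loop = [addr in {2, 4, 7, 8, 10, 12} for addr in range(1, 13)]
print(loop_cycles(loop), best_single_nop(loop))  # 15 (13, 12)
```

Position 12 means the NOP follows the last instruction, which matches the ADD A NOP fix. Extending the search to 2 or more NOPs is a matter of repeating the trial loop on the best result so far or enumerating combinations, which stays cheap because DSP loops are short.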

  25. Advantages
      • Typical DSP loops are small
        • Can use a brute-force approach to identify where NOPs should be placed
      • If the code meets the time constraints of your project, ship with the NOPs included
      • If it does not meet the time constraints, the position of the NOPs gives hints as to which PM( ) operations to delay
      • Works with any processor architecture

  26. Hint -- Instruction PM( ) Key
      • Reformat loop so that Instr. 1 is outside the loop and repeated as Instr. 13, with the Instr. 12 PM( ) operation moved to it
      • Now we have removed the cache thrash with no wasted cycles
      Instruction 1 outside loop
      Instruction 3 goes to cache line %11
      Address map as in the simple example, extended with 13 = %01
      Cache contents:
        %00: 4 (LRU), 12
        %01: 9
        %10: 6 (LRU), 10
      Instruction 13: F1=, r0=dm( ), pm( ) =
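
A sketch of the restructured loop, under the same assumptions as the simple example (8-location 2-way cache, fetch two instructions ahead of execute, one lost cycle per miss). The loop is taken to span instructions 2 to 13, with the PM( ) access moved from instruction 12 to instruction 13; the function names are hypothetical.

```python
def fetched_during_execute(executing, loop_start, loop_end):
    """Instruction fetched while `executing` is in the execute stage (2 ahead)."""
    length = loop_end - loop_start + 1
    return loop_start + (executing - loop_start + 2) % length


def steady_state_misses(cached_addrs, num_sets=4, ways=2, warmup=3):
    """Cache misses per loop iteration once the cache has warmed up (LRU)."""
    sets = [[] for _ in range(num_sets)]
    misses = 0
    for _ in range(warmup + 1):
        misses = 0
        for addr in cached_addrs:
            line = sets[addr % num_sets]
            if addr in line:
                line.remove(addr)
            else:
                misses += 1
                if len(line) == ways:
                    line.pop(0)            # evict the least recently used entry
            line.append(addr)
    return misses


# Loop is now instructions 2..13; PM( ) moved from instruction 12 to 13.
pm_instructions = [2, 4, 7, 8, 10, 13]
cached = [fetched_during_execute(i, 2, 13) for i in pm_instructions]
misses = steady_state_misses(cached)
print(cached, 12 + misses)  # [4, 6, 9, 10, 12, 3] 12
```

Instruction 3 lands in the otherwise unused line %11, so no line holds more than two cached instructions and the 12-instruction loop runs without lost cycles in this model.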

  27. Problems to overcome
      • Jumps inside loops
        • Complicate which instructions get cached
        • A conditional jump changes which instruction gets cached (dynamic effect)
        • Complicated by the effect of placing a NOP into a delay slot and displacing an instruction out of the delay slot
      • Effect of loops inside loops

  28. Concepts discussed
      • Concept behind ADI instruction cache
      • Cache operation
      • Introduction of CACHE THRASHING
      • Solutions to avoid a Cache Thrash without delaying product release
        • Introduction of NOP instructions into code, wasting 1 cycle to save 3 cycles
        • Identification of PM( ) operations to move
      • Basis of Cache-DSP tool

  29. Acknowledgements
      • Financial support of the Natural Sciences and Engineering Research Council of Canada (NSERC) and the University of Calgary
      • Financial support from Analog Devices through the ADI University professorship for 2001/2002 (Dr. Smith)
      • Future work will be financed in part by the Alberta Government through the Alberta Software Engineering Research Consortium (ASERC)
