CACHE-DSP Tool How to avoid having a SHARC thrashing on a cache-line

M. Smith, University of Calgary, Canada
B. Howse, Cell-Loc, Calgary, Canada
Contact: smithmr@ucalgary.ca

Series of Talks and Workshops
• CACHE-DSP – Talk on a simple process tool to identify cache conflicts in DSP code.
• SQUISH-DSP – Talk on using a project management tool to automate identification of parallel DSP processor instructions.
• SHARC Ecology 101 – Workshop showing how to systematically write parallel 2106X code.
• SHARC Ecology 201 – Workshop on SQUISH-DSP and CACHE-DSP tools.
Cache-DSP Tool smithmr@ucalgary.ca

Concepts to be discussed
• Concept behind 2106X instruction cache
• Cache operation
• Introduction of CACHE THRASHING
• Solutions to avoid a Cache Thrash without delaying product release
• Basis of Cache-DSP tool
• Acknowledgements

Purpose of SHARC instruction cache
• Harvard Processor Architecture
  • One bus for fetching instructions
  • Another bus for fetching data
  • Twin bus architecture avoids instruction/data fetch conflicts
• DSP algorithms
  • Addition and multiplication intensive
  • Multiple simultaneous accesses to data structures are typically needed
  • Twin bus architecture does not avoid data/data fetch conflicts

Solutions to data/data fetch conflicts
• Cache single instruction
  • Single instruction loop
  • Frees up instruction bus for use as data bus to fetch from separate data memory
  • Very limited in application
• Three bus processor
  • Expensive to implement for all memory
The ADSP21XXX approach is to make a three bus processor architecture available for a limited number of instructions on an ‘as needed’ basis: the instruction cache.

Example
• C code converts a temperature array from Celsius to Fahrenheit
• Assembly code has 6 PM( ) operations

Pipeline, first time round loop -- STALL
Fetch | Decode | Execute
Instr. on PM: F1=, r0=dm | |
Instr. on PM: F13=, r2=dm, pm= | Instr.: F1=, r0=dm |
Instr. on PM: F8=, r0=dm | Instr.: F13=, r2=dm, pm= | Data on DM: F1=, r0=dm
(STALL) | Instr.: F8=, r0=dm | Data on DM, PM: F13=, r2=dm, pm=
Instr. on PM / to cache: F12=, r2=dm, pm= | |

Pipeline, 2nd time round loop -- 3 bus operation
Fetch | Decode | Execute
Instr. on PM: F1=, r0=dm | |
Instr. on PM: F13=, r2=dm, pm= | Instr.: F1=, r0=dm |
Instr. on PM: F8=, r0=dm | Instr.: F13=, r2=dm, pm= | Data on DM: F1=, r0=dm
Instr. from cache: F12=, r2=dm, pm= | Instr.: F8=, r0=dm | Data on DM, PM: F13=, r2=dm, pm=
 | Instr.: F12=, r2=dm, pm= | Data on DM: F8=, r0=dm

Instruction Cache Characteristics
• 32 cache locations
• 32 locations looks small, but the cache is used ONLY when a data access on the PM bus conflicts with an instruction fetch on the PM bus
• Typically satisfactory for tight DSP algorithm loops of up to 100+ atomic operations

MAJOR LIMITATION POSSIBLE
• Cache is 2-way associative
• 32 cache locations arranged in groups of 2 (16 groups)
• Instruction storage location in cache is determined by the last 4 bits of its address
• Instruction N is stored in cache group N modulo 16
• Each group also has a least recently used (LRU) bit
• The LRU instruction is replaced on a cache miss
• Possible to induce a CACHE THRASH
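A minimal Python sketch of the placement rule on this slide, assuming instruction addresses are plain integers. The name `cache_set` and the sample addresses are hypothetical and chosen only for illustration.

```python
N_ENTRIES = 32               # cache locations, from the slide
WAYS = 2                     # 2-way associative
N_SETS = N_ENTRIES // WAYS   # 16 groups of 2

def cache_set(address):
    """Group index = last 4 bits of the address, i.e. address modulo 16."""
    return address & (N_SETS - 1)

# Hypothetical addresses spaced 16 apart all land in the same group,
# so three of them compete for only two entries.
addresses = [0x100, 0x110, 0x120]
print([cache_set(a) for a in addresses])
```

Any three cached instructions whose addresses agree in their last 4 bits are candidates for a thrash, however empty the rest of the cache is.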

Simple Example
• Assume that cache is 2-way associative with 8 (not 32) locations
• 6 cache operations to be placed into 8 cache locations
Address-to-cache-line map:
0 = %00, 1 = %01, 2 = %10, 3 = %11
4 = %00, 5 = %01, 6 = %10, 7 = %11
8 = %00, 9 = %01, 10 = %10, 11 = %11
12 = %00

Simple Example -- First Cache Op
• Instruction 2 forces Instruction 4 into cache line %00
Address map: addresses 0 to 12 map to cache line (address mod 4)
Cache contents: line %00: 4

Simple Example
• Next 2 cache operations place instructions 6 and 9 into cache
Address map: addresses 0 to 12 map to cache line (address mod 4)
Cache contents: line %00: 4; line %10: 6; line %01: 9

Simple Example
• 4th and 5th cache operations set LRU bits for cache lines %00 and %10
Address map: addresses 0 to 12 map to cache line (address mod 4)
Cache contents: line %00: 4 (LRU), 12; line %10: 6 (LRU), 10; line %01: 9

Execution of Instruction 12
• Execution of instruction 12 occurs during the fetch of instruction 2 in the loop
• 3rd cache operation involving cache line %10
• Instruction 2 goes to cache line %10
Address map: addresses 0 to 12 map to cache line (address mod 4)
Cache contents: line %00: 4 (LRU), 12; line %10: 6 (LRU), 10; line %01: 9

Summary of Cache Operations
• First time round loop
  • Instr. 2 pushes Instr. 4 to cache line %00
  • Instr. 4 pushes Instr. 6 to cache line %10
  • Instr. 7 pushes Instr. 9 to cache line %01
  • Instr. 8 pushes Instr. 10 to cache line %10
  • Instr. 10 pushes Instr. 12 to cache line %00
  • INSTR. 12 pushes INSTR. 2 to cache line %10 WHERE IT REPLACES INSTR. 6 (LRU for %10)
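The pairing in this summary can be sketched in Python. This is an illustration under assumptions read off the slides, not the tool itself: the loop body is taken to be instructions 1 to 12, the instruction being fetched is taken to be two places ahead of the one executing (wrapping at the loop end), and the names `cached_fetch`, `PM_OPS`, and `pairs` are hypothetical.

```python
LOOP_START, LOOP_END = 1, 12    # assumed loop body, numbered as on the slides
PM_OPS = [2, 4, 7, 8, 10, 12]   # instructions that perform a PM data access
N_SETS = 4                      # 8 locations / 2 ways in the simplified cache

def cached_fetch(addr, start=LOOP_START, end=LOOP_END):
    """Instruction being fetched while `addr` executes (assumed two ahead, wrapping)."""
    length = end - start + 1
    return start + (addr + 2 - start) % length

# (executing instruction, instruction pushed to cache, cache line index)
pairs = [(a, cached_fetch(a), cached_fetch(a) % N_SETS) for a in PM_OPS]
for executing, fetched, line in pairs:
    print(f"Instr. {executing} pushes Instr. {fetched} to cache line %{line:02b}")
```

Three of the six cached instructions (6, 10, and 2) share line %10, which holds only two entries.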

Cache Thrash starts operating
• Second time round loop
  • Instr. 4 from cache line %00
  • Instr. 4 pushes Instr. 6 to cache line %10 REPLACING INSTR. 10 (LRU for %10)
  • Instr. 9 from cache line %01
  • Instr. 8 pushes Instr. 10 to cache line %10 REPLACING INSTR. 2 (LRU for %10)
  • Instr. 12 from cache line %00
  • Instr. 12 pushes Instr. 2 to cache line %10 REPLACING INSTR. 6 (LRU for %10)
• Losing 3 cycles each time around loop
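A small Python trace of the simplified cache makes the thrash visible. It is a sketch under assumptions, not the Cache-DSP tool: the loop body is taken to be instructions 1 to 12, the cached instruction is taken to be the one two places ahead of the executing PM instruction, and `trace_loop` is a hypothetical name.

```python
def trace_loop(start, end, pm_ops, n_sets=4, ways=2, iterations=3):
    """Return (iteration, cached instruction, evicted instruction or None) per miss."""
    length = end - start + 1
    sets = [[] for _ in range(n_sets)]      # each list is ordered LRU first
    log = []
    for iteration in range(1, iterations + 1):
        for addr in sorted(pm_ops):
            fetched = start + (addr + 2 - start) % length   # wraps at loop end
            line = sets[fetched % n_sets]
            if fetched in line:             # hit: mark as most recently used
                line.remove(fetched)
                line.append(fetched)
                continue
            evicted = line.pop(0) if len(line) == ways else None
            line.append(fetched)            # miss: fill, replacing the LRU entry
            log.append((iteration, fetched, evicted))
    return log

log = trace_loop(1, 12, [2, 4, 7, 8, 10, 12])
second_pass = [(f, e) for it, f, e in log if it == 2]
print(second_pass)
```

Under these assumptions every pass after the first logs three misses, all in line %10: 6 replaces 10, 10 replaces 2, and 2 replaces 6.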

Easy to fix in this example
• Can delay PM from INSTR. 2 till 3
• This forces INSTR 5 to cache (%01) where it does not replace anything
Address map: addresses 0 to 12 map to cache line (address mod 4)
Cache annotations on slide: 2 -- %10, 4 -- %00, 5 -- %01, 6 -- %10, 9 -- %01 (LRU), 10 = %10, 11 = %11, 12 = %00

Real Life more difficult
• Larger number of instructions in loop
• Jump operations (conditional or not)
• Register dependencies
• May need to move many PM operations
• All this takes time
• Need a systematic approach to gain speed while getting the product out the door in the shortest time
• ADD-A-NOP – waste 1 cycle to gain 3

ADD A CACHE FREEZE at end of the loop
• CACHE THRASH (3 cycles wasted) replaced by STALL (instruction can’t go into cache) and Freeze instruction (2 cycles wasted)
• Instruction 1 stalls
Address map: addresses 0 to 13 map to cache line (address mod 4)
Cache contents: line %00: 4 (LRU), 12; line %10: 6 (LRU), 10; line %01: 9 (LRU)
BIT SET MODE2 CAFRZ    Cache Freeze
BIT CLR MODE2 CAFRZ    Cache Unfreeze

ADD A NOP at end of the loop
• CACHE THRASH (3 cycles wasted) IS AVOIDED with a loss of only 1 cycle/loop because of the additional NOP instruction
• Instruction 1 goes to cache line %01
Address map: addresses 0 to 13 map to cache line (address mod 4); the NOP is instruction 13
Cache contents: line %00: 4 (LRU), 12; line %10: 6 (LRU), 10; line %01: 9 (LRU)
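Why the extra NOP helps can be checked statically by counting how many cached instructions share each cache line. The sketch below is an assumption-laden illustration: the loop body is taken to be instructions 1 to 12 (1 to 13 with the NOP), the cached instruction is taken to be two places ahead of each PM instruction, and `overfull_lines` is a hypothetical helper.

```python
from collections import defaultdict

def overfull_lines(start, end, pm_ops, n_sets=4, ways=2):
    """Cache lines asked to hold more instructions than they have entries."""
    length = end - start + 1
    by_line = defaultdict(set)
    for addr in pm_ops:
        fetched = start + (addr + 2 - start) % length   # wraps at loop end
        by_line[fetched % n_sets].add(fetched)
    return {line: sorted(instrs) for line, instrs in by_line.items()
            if len(instrs) > ways}

PM_OPS = [2, 4, 7, 8, 10, 12]
original = overfull_lines(1, 12, PM_OPS)   # loop without the NOP
with_nop = overfull_lines(1, 13, PM_OPS)   # NOP appended as instruction 13
print(original, with_nop)
```

Lengthening the loop by one changes which instruction is fetched while instruction 12 executes (1 instead of 2), so line %10 is no longer oversubscribed.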

Cache-DSP tool concept
• Original Code – Loop Cycles = C1: 1, 2, 3, 4, 5, 6, 7, endloop
• Trial 1 – Loop Cycles = C2: 1, 2, 3, 4, 5, 6, 7, NOP, endloop
• Trial 2 – Loop Cycles = C3: 1, 2, 3, 4, 5, 6, NOP, 7, endloop
• Trial 3 – Loop Cycles = C4: 1, 2, 3, 4, 5, NOP, 6, 7, endloop

Cache-DSP tool
• Identifies the number of cache operations and cache thrashes in the current code
• Calculates how much adding a NOP after/before each instruction in the loop reduces cache thrashes
• Remembers the best case scenario
• Then determines the effect of placing 2 NOPs (3, 4, etc.) somewhere in the code (preferably at the end of the loop)
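The brute-force idea can be sketched in Python on the simplified 8-location example. This is not the Cache-DSP tool's implementation: it assumes a loop body of instructions 1 to 12, one cycle per instruction, one lost cycle per cache miss, and a cached instruction two places ahead of each PM instruction. The names `steady_misses` and `try_nops` are hypothetical.

```python
def steady_misses(start, end, pm_ops, n_sets=4, ways=2, passes=4):
    """Misses in the last of several passes round the loop (LRU replacement)."""
    length = end - start + 1
    sets = [[] for _ in range(n_sets)]      # each list is ordered LRU first
    misses = 0
    for _ in range(passes):
        misses = 0
        for addr in sorted(pm_ops):
            fetched = start + (addr + 2 - start) % length
            line = sets[fetched % n_sets]
            if fetched in line:
                line.remove(fetched)        # hit: refreshed below
            else:
                misses += 1
                if len(line) == ways:
                    line.pop(0)             # evict the LRU entry
            line.append(fetched)
    return misses

def try_nops(start, end, pm_ops):
    """Cycles per pass with one NOP inserted after position p (p = start-1: before loop body)."""
    results = {}
    for p in range(start - 1, end + 1):
        shifted = [a + 1 if a > p else a for a in pm_ops]   # later instructions move down
        new_length = end - start + 2
        results[p] = new_length + steady_misses(start, end + 1, shifted)
    return results

PM_OPS = [2, 4, 7, 8, 10, 12]
baseline = 12 + steady_misses(1, 12, PM_OPS)
results = try_nops(1, 12, PM_OPS)
best = min(results, key=results.get)
print(baseline, results, best)
```

In this model a NOP is worth keeping only where its one-cycle cost is smaller than the misses it removes; positions that leave the thrash in place simply make the loop one cycle slower.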

Advantages
• Typical DSP loops are small
• Can use a brute force approach to identify where NOPs should be placed
• If the code meets the time constraints of your project, ship with the NOPs included
• If it does not, the positions of the NOPs hint at which PM( ) operations to delay
• Works with any processor architecture

Hint -- Instruction PM( ) Key
• Reformat the loop so that Instr. 1 is outside the loop and repeated as Instr. 13, with the Instr. 12 PM( ) moved into it
• Now we have removed the cache thrash with no wasted cycles
• Instruction 1 is outside the loop; Instruction 3 goes to cache line %11
Address map: addresses 0 to 13 map to cache line (address mod 4)
Cache contents: line %00: 4 (LRU), 12; line %10: 6 (LRU), 10; line %01: 9
Instruction shown: F1=, r0=dm( ), pm( ) =

Problems to overcome
• Jumps inside loops
  • Complicate which instructions get cached
  • A conditional jump changes which instruction gets cached (dynamic effect)
  • Further complicated by the effect of placing a NOP into a delay slot and displacing an instruction out of the delay slot
• Effect of loops inside loops

Concepts discussed
• Concept behind ADI instruction cache
• Cache operation
• Introduction of CACHE THRASHING
• Solutions to avoid a Cache Thrash without delaying product release
• Introduction of NOP instructions into code -- wasting one cycle to save 3 cycles
• Identification of PM( ) operations to move
• Basis of Cache-DSP tool

Acknowledgements
• Financial support of the Natural Sciences and Engineering Research Council of Canada (NSERC) and the University of Calgary
• Financial support from Analog Devices through the ADI University professorship for 2001/2002 (Dr. Smith)
• Future work will be financed in part by the Alberta Government through the Alberta Software Engineering Research Consortium (ASERC)
