Scratchpad Memories: A Design Alternative for Cache On-chip Memory in Embedded Systems

Scratchpad Memories: A Design Alternative for Cache On-chip Memory in Embedded Systems - Nalini Kumar, Gaurav Chitroda, Komal Kasat

OUTLINE • Introduction • Scratch pad memory • Cache memory • Proposed methodology • Results • Conclusions Spring 2010, EEL 6935, Embedded Systems

INTRODUCTION • Scratch pad memory • Cache memory • Proposed methodology • Results • Conclusions

INTRODUCTION • Scratch pad memory: a high-speed internal memory used for temporary storage of calculations, data, and other work in progress. • It is the next closest memory to the ALU after the internal registers. • Scratch pad based systems have NUMA (Non-Uniform Memory Access) latencies and use explicit instructions to move data; DMA-based data transfer is often used. • On-chip caches using SRAM consume 25% to 45% of the total chip power. • Current embedded processors for multimedia applications have on-chip scratch pad memories.

INTRODUCTION • Scratchpad vs. cache: • A scratchpad does not contain a copy of data that is stored in the main memory. • Scratchpad memory is directly manipulated by applications. • In cache memory systems, the mapping of program elements is done at run time; in scratch pad memory systems, it is done either by the user or by the compiler using a suitable algorithm. • Prior studies on scratch pad memories do not address the impact on area.

CONTRIBUTIONS • The paper proposes scratchpad memory as an alternative to cache memory as on-chip memory for computationally intensive applications. • The CACTI tool is used to compute area and energy for the AT91M40400 target architecture. • The results establish scratchpad memory as a low-power alternative in most situations, with an average energy reduction of 40%.

Introduction • SCRATCH PAD MEMORY • Cache memory • Proposed methodology • Results • Conclusions

SCRATCH PAD MEMORY • Memory array with the decoding and the column circuitry logic • Memory objects are mapped to the scratch pad in the last stage of the compiler • It occupies one distinct part of the memory address space, so there is no need to check whether data or instructions are available in the scratch pad • This removes the need for the comparator and the miss/hit acknowledging circuitry • Figure: Scratch pad memory array (panels: memory array; memory cell, a six-transistor static RAM)

SCRATCH PAD MEMORY • Area of the scratchpad: $A_s = A_{sde} + A_{sda} + A_{sco} + A_{spr} + A_{sse} + A_{sou}$ • Components: data decoder, data array, column multiplexers, precharge circuit, data sense amplifiers, and output driver circuitry • Energy consumption is estimated from the energy consumption of the components: $E_{scratchpad} = E_{decoder} + E_{memcol}$ • The memory array is the major consumer of energy • The CACTI tool first computes the capacitances for each unit, then estimates the energy

ESTIMATING THE ENERGY CONSUMPTION • For the memory array: $E_{memcol} = C_{memcol} \cdot V_{dd}^2 \cdot P_{0 \to 1}$ • $C_{memcol}$ is the capacitance of the memory array unit, calculated as $C_{memcol} = n_{cols} \cdot (C_{pre} + C_{readwrite})$ • $P_{0 \to 1}$ is the probability of a bit toggle, taken as 0.5 • Only two word lines are switched regardless of the change in the address bits • Total energy spent in the scratch pad memory: $E_{sptotal} = SP_{access} \cdot E_{scratchpad}$ • The only case that applies to a scratch pad is a read or write access
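
The energy formulas on this slide can be evaluated directly. The following is a minimal Python sketch under stated assumptions: the function names and every numeric input (column count, capacitances, supply voltage, decoder energy, access count) are placeholders chosen for illustration and are not values from the paper or from CACTI.

```python
# Sketch of the scratch pad energy model described on the slide.
# All numeric inputs in the example are placeholders, not measured values.

def memcol_capacitance(ncols, c_pre, c_readwrite):
    """C_memcol = ncols * (C_pre + C_readwrite), in farads."""
    return ncols * (c_pre + c_readwrite)


def memcol_energy(c_memcol, vdd, p_toggle=0.5):
    """E_memcol = C_memcol * Vdd^2 * P(0->1), in joules."""
    return c_memcol * vdd ** 2 * p_toggle


def scratchpad_total_energy(sp_accesses, e_decoder, e_memcol):
    """E_sptotal = SP_access * (E_decoder + E_memcol), in joules."""
    return sp_accesses * (e_decoder + e_memcol)


if __name__ == "__main__":
    # Hypothetical configuration, for illustration only.
    c = memcol_capacitance(ncols=128, c_pre=50e-15, c_readwrite=30e-15)
    e_col = memcol_energy(c, vdd=3.3)
    total = scratchpad_total_energy(1_000_000, e_decoder=1e-11, e_memcol=e_col)
    print(f"C_memcol={c:.3e} F  E_memcol={e_col:.3e} J  E_sptotal={total:.3e} J")
```

Because the per-access energy is a constant in this model, the total scales linearly with the number of scratch pad accesses taken from the trace.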

Introduction • Scratch pad memory • CACHE MEMORY • Proposed methodology • Results • Conclusions

CACHE MEMORY • The area model is based on the transistor count in the circuitry • Area of the cache: $A_c = A_{tag} + A_{data}$, where $A_{tag} = A_{dt} + A_{ta} + A_{co} + A_{pr} + A_{se} + A_{com} + A_{mu}$ and $A_{data} = A_{de} + A_{da} + A_{col} + A_{pre} + A_{sen} + A_{out}$ • Figure: Cache memory organization (tag array and data array)

Introduction • Scratch pad memory • Cache memory • PROPOSED METHODOLOGY • Results • Conclusions

EXPERIMENTAL SETUP • Compare a cache with a scratchpad memory of the same size (for the same technology, the cache has the higher delay) • Identification and assignment of critical data structures to the scratch pad is based on a packing algorithm • The total number of clock cycles determines the performance • The larger the number of clock cycles, the lower the performance, because the on-chip memory configuration does not change the clock period

SCRATCH PAD MEMORY ACCESS • Performance is estimated from the trace file. • An appropriate latency is added to the overall program delay for each memory access: • one cycle for a scratch pad read/write access, • one cycle plus one wait cycle for a 16-bit main memory access, • one cycle plus three wait states for a 32-bit main memory access
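
The latency rules above amount to a table lookup over the trace. The sketch below is a minimal illustration in Python: the trace record format `(region, width_bits)` and the names `LATENCY` and `total_access_cycles` are assumptions made for this example, not the ARMulator trace format, and only memory-access latency is counted.

```python
# Latencies in cycles, as stated on the slide:
#   scratch pad read/write:      1 cycle
#   16-bit main memory access:   1 cycle + 1 wait cycle  = 2
#   32-bit main memory access:   1 cycle + 3 wait states = 4
# The (region, width_bits) record format is hypothetical.
LATENCY = {
    ("scratchpad", 16): 1,
    ("scratchpad", 32): 1,
    ("main", 16): 2,
    ("main", 32): 4,
}


def total_access_cycles(trace):
    """Sum the access latency of every (region, width_bits) record."""
    return sum(LATENCY[(region, width)] for region, width in trace)


if __name__ == "__main__":
    # Hypothetical four-access trace.
    trace = [("scratchpad", 32), ("main", 16), ("main", 32), ("scratchpad", 16)]
    print(total_access_cycles(trace))  # 1 + 2 + 4 + 1 = 8
```

Moving a frequently accessed 32-bit object from main memory to the scratch pad saves three cycles per access under this table, which is why the assignment step matters for performance.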

CACHE MEMORY ACCESS • The authors assume a write-through cache. • Read hit: the tag array is accessed; no write to the cache and no access to main memory. • Read miss: one cache read operation and L (line size) words written to the cache; one main memory read event of size L and no main memory write. • Write hit: a cache write followed by a main memory write. • Write miss: one cache tag read and a main memory write; no cache update.
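
The four cases above can be tallied per access to obtain the event counts that an energy model would then weight. This is a rough Python sketch under assumptions: the event names (`cache_read`, `cache_write_words`, `tag_read`, `mem_read_words`, `mem_write`) and the function names are invented for illustration, and a read hit is counted as a single cache read.

```python
from collections import Counter


def cache_events(kind, line_words):
    """Events triggered by one access under the write-through model.

    kind: "read_hit", "read_miss", "write_hit" or "write_miss".
    line_words: L, the cache line size in words.
    Event names are hypothetical labels for this sketch.
    """
    if kind == "read_hit":
        # No cache write, no main memory access.
        return Counter(cache_read=1)
    if kind == "read_miss":
        # One cache read, L words written to cache, one memory read of size L.
        return Counter(cache_read=1, cache_write_words=line_words,
                       mem_read_words=line_words)
    if kind == "write_hit":
        # Cache write followed by a main memory write.
        return Counter(cache_write_words=1, mem_write=1)
    if kind == "write_miss":
        # Tag read and main memory write; the cache is not updated.
        return Counter(tag_read=1, mem_write=1)
    raise ValueError(f"unknown access kind: {kind}")


def tally(accesses, line_words):
    """Accumulate event counts over a sequence of access kinds."""
    total = Counter()
    for kind in accesses:
        total += cache_events(kind, line_words)
    return total


if __name__ == "__main__":
    print(tally(["read_miss", "read_hit", "write_hit", "write_miss"], line_words=4))
```

Multiplying each count by a per-event energy would give a cache energy total comparable to the scratch pad total; those per-event energies are not given on the slide.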

FLOW DIAGRAM • Figure: Experimental flow. Recoverable block labels: C benchmark; energy-aware compiler; compiler support; mapping algorithm; cache/scratch pad size; ARMulator; trace analysis; cache number of cycles; scratchpad number of cycles; CACTI; analytical model; energy estimates; area estimates

EXPERIMENTAL SETUP • Target architecture: • AT91M40400, based on the ARM7TDMI embedded processor • High-performance RISC processor with very low power consumption • On-chip scratch pad memory of 4 KB, a 32-bit data path, and two instruction sets • encc, an energy-aware compiler, uses a special packing algorithm (a knapsack algorithm) to assign code and data blocks to the scratch pad memory • The binary output of the compiler is simulated on the ARMulator to produce a trace file • ARMulator accepts the cache size as a parameter for the on-chip cache configuration and reports the performance as a number of cycles • The area and performance estimates are made for 0.5 µm technology
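
To illustrate the kind of packing problem the compiler solves, here is a generic 0/1 knapsack sketch in Python. It is not encc's actual implementation: the function name, the block names, the sizes, and the per-block "benefit" scores (for example, estimated energy saved by placing the block in the scratch pad) are all hypothetical.

```python
def knapsack_assign(blocks, capacity):
    """0/1 knapsack: pick blocks maximizing total benefit within capacity.

    blocks: list of (name, size_bytes, benefit); hypothetical inputs.
    capacity: scratch pad size in bytes.
    Returns (best_benefit, sorted list of chosen block names).
    """
    # best[c] holds (benefit, chosen names) achievable with capacity c.
    best = [(0, ())] * (capacity + 1)
    for name, size, benefit in blocks:
        # Walk capacities downward so each block is placed at most once.
        for c in range(capacity, size - 1, -1):
            candidate = best[c - size][0] + benefit
            if candidate > best[c][0]:
                best[c] = (candidate, best[c - size][1] + (name,))
    value, chosen = best[capacity]
    return value, sorted(chosen)


if __name__ == "__main__":
    # Hypothetical code/data blocks competing for a 4 KB scratch pad.
    blocks = [
        ("loop_body", 1024, 900),
        ("table", 3072, 1200),
        ("buffer", 2048, 1000),
        ("init", 512, 50),
    ]
    print(knapsack_assign(blocks, 4096))  # (2100, ['loop_body', 'table'])
```

The dynamic program runs in time proportional to the number of blocks times the capacity, which is small for a 4 KB scratch pad.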

Introduction • Scratch pad memory • Cache memory • Proposed methodology • RESULTS • Conclusions

RESULTS • The average area, time, and area-time (AT) product reductions are 34%, 18%, and 46%, respectively. • Table: Energy per access of various devices • Table: Area/performance ratios for bubble-sort

RESULTS • Figure: Energy consumed by the memory system • Figure: Comparison of cache and scratch pad memory area

Introduction • Scratch pad memory • Cache memory • Proposed methodology • Results • CONCLUSION

CONCLUSION • The paper presents an approach for selecting on-chip memory configurations. • Results show that scratch pad based compile-time memory outperforms cache-based run-time memory on almost all counts. • 40% average energy reduction for the application considered. • The authors propose studying DRAM-based memory comparisons, since memory bandwidth and on-chip memory capacity are limiting factors for many applications. • The energy models for both cache and scratchpad still need to be validated by real measurements.

QUESTIONS
