Block Cache for Embedded Systems

Dominic Hillenbrand and Jörg Henkel, Chair for Embedded Systems (CES), University of Karlsruhe, Karlsruhe, Germany

Outline • Motivation • Related Work • State of the art: “Instruction Cache” • Our approach: “Block Cache” • Workflow (Instruction Selection / Simulation) • Assumptions & Constraints • Algorithm • Results • Summary

Motivation • On-chip area is expected to increase enormously • David A. Patterson, “Latency lags bandwidth”, Commun. ACM 2004 • Generally, caches consume more power than on-chip memory [1,2,3] • Concerns: efficiency, power consumption, area • Block cache: 1..N memory blocks of instructions (SRAM cells) • (Figure: off-chip memory; on-chip CPUs, each with an I-Cache, and a block cache)

Related Work • S. Steinke, L. Wehmeyer, B. Lee, P. Marwedel, “Assigning Program and Data Objects to Scratchpad for Energy Reduction” – DATE ’02: statically partition on- and off-chip memory • S. Steinke, N. Grunwald, L. Wehmeyer, R. Banakar, M. Balakrishnan, P. Marwedel, “Reducing energy consumption by dynamic copying of instructions to on-chip memory” – ISSS ’02: statically determine code copying points • P. Francesco, P. Marchal, D. Atienza, L. Benini, F. Catthoor, J. Mendias, “An integrated hw/sw-approach for run-time scratchpad management” – DAC ’04: DMA for acceleration in on-chip memory for data • B. Egger, J. Lee, H. Shin, “Scratchpad memory management for portable systems with a memory management unit” – EMSOFT ’06: MMU to map between on- and off-chip memory (we share the µTLB)

“State of the Art”: Instruction Cache • (Figure: off-chip memory; on-chip CPUs, each with an I-Cache, and a block cache)

Architecture: Instruction Cache • (Figure: the instruction address is split into Tag, Set and Offset fields; stored tags (T) are compared (=) with the address tag, and multiplexers (MUX) select the data)
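The address split used by a conventional instruction cache can be illustrated with a small sketch. The line size and number of sets below are assumed values, not figures from the slides, and `split_address` is a hypothetical helper name.

```python
def split_address(addr, line_size=32, num_sets=128):
    """Split a byte address into (tag, set_index, offset).

    line_size and num_sets are assumed example values; both must be
    powers of two for the bit masks below to be valid.
    """
    offset_bits = line_size.bit_length() - 1   # log2(line_size)
    set_bits = num_sets.bit_length() - 1       # log2(num_sets)
    offset = addr & (line_size - 1)            # byte within the cache line
    set_index = (addr >> offset_bits) & (num_sets - 1)
    tag = addr >> (offset_bits + set_bits)     # compared against stored tags
    return tag, set_index, offset


tag, set_index, offset = split_address(0x12345)
print(tag, set_index, offset)
```

Every access needs this split plus one tag comparison per way, which is the per-access logic the block cache tries to avoid.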

Our approach: Block Cache • (Figure: off-chip memory; on-chip CPUs, each with an I-Cache, and a block cache)

Our approach: Block Cache • Memory blocks B1 .. BN (SRAM cells) + logic

Architectural Overview: Block Cache • (Figure: on-chip CPU, memory blocks B1 .. BN, control unit, µTLB and DMA; off-chip memory. Labels: instructions, instruction address, comparator (=), block load) • Exploit burst transfers (DRAM memory) • Area efficient (SRAM cells) • Scalable (up to application size)
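One possible reading of this overview, as a toy software model: a dictionary stands in for the µTLB, and a miss triggers a whole-block load (the DMA burst). The class name, the round-robin victim choice and the counters are assumptions made for illustration; the slides do not specify the control unit at this level.

```python
class BlockCache:
    """Toy model: N on-chip blocks, a dict standing in for the µTLB."""

    def __init__(self, num_blocks, block_size):
        self.num_blocks = num_blocks
        self.block_size = block_size
        self.tlb = {}            # off-chip block number -> on-chip slot
        self.next_slot = 0       # round-robin victim pointer (assumed policy)
        self.misses = 0
        self.bytes_transferred = 0

    def fetch(self, addr):
        """Return (slot, offset) for an instruction address, loading on a miss."""
        block_no, offset = divmod(addr, self.block_size)
        if block_no not in self.tlb:
            self._load(block_no)
        return self.tlb[block_no], offset

    def _load(self, block_no):
        slot = self.next_slot
        # Drop the mapping of whichever block occupied the victim slot.
        self.tlb = {b: s for b, s in self.tlb.items() if s != slot}
        self.tlb[block_no] = slot
        self.next_slot = (slot + 1) % self.num_blocks
        self.misses += 1
        self.bytes_transferred += self.block_size  # one burst per block


cache = BlockCache(num_blocks=2, block_size=1024)
for address in (0, 1030, 4, 2048, 0):
    print(address, cache.fetch(address))
print(cache.misses, cache.bytes_transferred)
```

The model transfers a full block on every miss, which is where the burst-transfer argument on the slide comes in.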

Architectural Overview: Block Cache • Each on-chip memory block (B1 .. BN) holds 1..N functions (F1 .. FN) • A function is assembler code (PUSH R1, PUSH R2, …., POP R2, POP R1, RET) stored in binary form (010101010010110111 100010)

Function to Block Mapping • (Figure: functions F1 .. FN are assigned to blocks B1, B2, B3, ..; F6 appears split into F6a, F6b, F6c) • Eviction: LRU, Round Robin, ARC, Belady
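To make the eviction options concrete, here is a minimal sketch that counts block misses for three of the listed policies on a hypothetical block access trace. ARC is left out for brevity. The trace and capacity are made-up inputs, and Belady's policy needs the full future trace, so it only serves as an offline lower bound.

```python
from collections import OrderedDict


def misses_lru(trace, capacity):
    """Evict the least recently used block."""
    resident = OrderedDict()
    misses = 0
    for block in trace:
        if block in resident:
            resident.move_to_end(block)        # mark as most recently used
        else:
            misses += 1
            if len(resident) == capacity:
                resident.popitem(last=False)   # drop least recently used
            resident[block] = True
    return misses


def misses_round_robin(trace, capacity):
    """Evict slots in fixed cyclic order, ignoring usage."""
    slots, victim, misses = [], 0, 0
    for block in trace:
        if block in slots:
            continue
        misses += 1
        if len(slots) < capacity:
            slots.append(block)
        else:
            slots[victim] = block
            victim = (victim + 1) % capacity
    return misses


def misses_belady(trace, capacity):
    """Evict the block whose next use lies farthest in the future."""
    resident, misses = set(), 0
    for i, block in enumerate(trace):
        if block in resident:
            continue
        misses += 1
        if len(resident) == capacity:
            future = trace[i + 1:]

            def next_use(b):
                return future.index(b) if b in future else len(future)

            resident.remove(max(resident, key=next_use))
        resident.add(block)
    return misses


trace = [1, 2, 3, 1, 4, 1, 2, 3]   # hypothetical block numbers
for policy in (misses_lru, misses_round_robin, misses_belady):
    print(policy.__name__, policy(trace, capacity=3))
```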

Design Flow: Analysis • Inputs: software component, input data / parameters • Instrumented execution / simulation produces the executed instruction trace (function enter/exit, function address) • The trace yields the dynamic call graph • Disassembly yields the static call graph • Functions not called during profiling still need to be included
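A minimal sketch of how an enter/exit trace could be turned into a dynamic call graph with call counts on the edges. The event tuple format `(kind, function_address)` and the function name are assumptions; the slides only state that the trace records function enter/exit and the function address.

```python
from collections import Counter


def build_dynamic_call_graph(trace):
    """Return a Counter mapping (caller, callee) -> number of calls.

    trace is a sequence of ("enter" | "exit", function_address) events
    and is assumed to be well nested.
    """
    stack = []
    edges = Counter()
    for kind, func in trace:
        if kind == "enter":
            if stack:                          # top of stack is the caller
                edges[(stack[-1], func)] += 1
            stack.append(func)
        elif kind == "exit":
            if not stack or stack[-1] != func:
                raise ValueError("unbalanced trace at %#x" % func)
            stack.pop()
        else:
            raise ValueError("unknown event kind: %r" % kind)
    return edges


trace = [
    ("enter", 0x100),
    ("enter", 0x200), ("exit", 0x200),
    ("enter", 0x200), ("exit", 0x200),
    ("enter", 0x300), ("exit", 0x300),
    ("exit", 0x100),
]
graph = build_dynamic_call_graph(trace)
for (caller, callee), count in graph.items():
    print(hex(caller), "->", hex(callee), count)
```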

Besides:Assumptions & Constrains • Software Behavior Analysis • Component level • Trace composition reflects deployment usage( parameters / input set ) • Hardware • External memory: High bandwidth / high latency • Block size (fixed) / Number of code blocks (fixed) • Compiler / Linker • Function splitting (function size < block size)

Design Flow: Block composition • Inputs: dynamic call graph, static call graph • Block composition algorithm • Output: linker file

Design Flow: Re-linking • Inputs: original binary, linker file • Output: re-linked binary • (Figure: Function 1 .. Function 6, …. of the original binary are placed into Code block 1 .. Code block 4)
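The slides do not show the linker file itself, so the following is only a sketch of the address layout such a file would have to express: each code block starts at a block-aligned address and its functions are packed back to back. The function name, base address and sizes are hypothetical.

```python
def layout_blocks(blocks, sizes, block_size, base=0):
    """Assign a start address to every function.

    blocks: list of lists of function names (one inner list per code block)
    sizes:  function name -> size in bytes
    Block i starts at base + i * block_size.
    """
    addresses = {}
    for index, functions in enumerate(blocks):
        offset = 0
        for name in functions:
            addresses[name] = base + index * block_size + offset
            offset += sizes[name]
        if offset > block_size:
            raise ValueError("code block %d overflows by %d bytes"
                             % (index, offset - block_size))
    return addresses


sizes = {"f1": 300, "f2": 500, "f3": 900}      # hypothetical sizes in bytes
placement = layout_blocks([["f1", "f2"], ["f3"]], sizes,
                          block_size=1024, base=0x8000)
for name, addr in placement.items():
    print(name, hex(addr))
```

The overflow check mirrors the stated constraint that functions (or their split parts) must fit within a block.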

Design Flow: Re-linking • (Figure: data section size, original code section size, and code section size after re-linking; Code block 1 is marked) • Compiler supplies: relocation table, symbol table, ELF headers • Reference kinds shown: data reference, function pointer, function reference

Overview: Algorithm • Input: dynamic function call graph (node = function) • Output: block graph (node = 1..n functions) • Challenge: “Merge appropriate functions into a block” • 3 steps (differing in merging distance): (1) combine_neighbor (2) merge_direct_children (3) bubble_merge
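The slides name the three steps but do not give their details. As a generic stand-in (not the authors' algorithm), the sketch below shows the basic idea of call-graph-driven block composition: visit edges from most to least frequently called and merge caller and callee into one block whenever the combined size still fits. Function names, sizes and call counts are hypothetical.

```python
def compose_blocks(edges, sizes, block_size):
    """Greedy heaviest-edge-first merging under a block size limit.

    edges: (caller, callee) -> call count
    sizes: function name -> size in bytes
    Returns a list of blocks, each a list of function names.
    """
    block_of = {f: f for f in sizes}           # function -> block id
    members = {f: [f] for f in sizes}          # block id -> functions
    block_bytes = dict(sizes)                  # block id -> bytes used
    by_weight = sorted(edges.items(), key=lambda kv: kv[1], reverse=True)
    for (caller, callee), _count in by_weight:
        a, b = block_of[caller], block_of[callee]
        if a == b or block_bytes[a] + block_bytes[b] > block_size:
            continue                           # already merged or too large
        for f in members[b]:
            block_of[f] = a
        members[a] += members.pop(b)
        block_bytes[a] += block_bytes.pop(b)
    return list(members.values())


sizes = {"F1": 200, "F2": 300, "F3": 400, "F5": 500, "F6": 300}
edges = {("F1", "F2"): 100, ("F1", "F3"): 4,
         ("F3", "F5"): 30, ("F3", "F6"): 10**6}
print(compose_blocks(edges, sizes, block_size=1024))
```

Here F3 and F6 end up together because their edge is the heaviest, while F5 stays alone because adding it would exceed the block size.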

Algorithm Step 1/3: combine_neighbor • (Figure: dynamic call graph with nodes F1 .. F9; edge labels 100, 1, 4, 30, 1e6, 1e4, 1e6, 1e8, 1e10; node annotation: function size (architecture)) • (Second build: nodes are annotated with values between 0.00 and 1.00; centrality measure: F4,7)

Algorithm Step 2/3: merge_direct_children • (Figure: F3 with children F5, F6, F7, F8; edge labels 30, 1e6, 1e4, 1e6) • Builds show F6 and F7 merged into F6,7, then extended to F6,7,8; F5 stays separate • The edge from F3 to the merged node F6,7,8 is labeled 1e6+1e6+1e4
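The effect of merging children on the edge weights can be sketched as a graph contraction: the merged functions become one node and parallel edges are summed. The assignment of labels to edges (F3 to F5: 30, F6: 1e6, F7: 1e4, F8: 1e6) is read from the flattened figure and is an assumption, as is the helper name.

```python
def merge_nodes(edges, group, merged_name):
    """Contract the functions in `group` into one node.

    edges: (caller, callee) -> call count
    Edges between members of the group disappear (calls inside the
    block); parallel edges to or from the group are summed.
    """
    merged = {}
    for (src, dst), weight in edges.items():
        src = merged_name if src in group else src
        dst = merged_name if dst in group else dst
        if src == dst:
            continue
        merged[(src, dst)] = merged.get((src, dst), 0) + weight
    return merged


edges = {("F3", "F5"): 30, ("F3", "F6"): 10**6,
         ("F3", "F7"): 10**4, ("F3", "F8"): 10**6}
print(merge_nodes(edges, {"F6", "F7", "F8"}, "F6,7,8"))
```

With these inputs the edge from F3 to the merged node carries 1e6+1e6+1e4, matching the label on the slide.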

Algorithm Step 3/3: bubble_merge • (Figure: dynamic call graph with nodes F1 .. F9; edge labels 100, 1, 4, 30, 1e6, 1e4, 1e6, 1e8, 1e10; shown over several builds) • The final build shows the merged node F3,F8

Results • What is interesting? • Memory efficiency: block fragmentation • Technology scaling: misses • Energy: amount of transferred data • Performance: number of cycles • Benchmark: MediaBench (CJPEG)
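The slides do not define the metrics formally. Under the assumption that block fragmentation means the unused fraction of the allocated code blocks, and that every miss transfers one full block, two of the metrics can be sketched as follows (function names and input values are hypothetical):

```python
def block_fragmentation(blocks, sizes, block_size):
    """Fraction of allocated block bytes not occupied by code (assumed definition)."""
    used = sum(sizes[f] for block in blocks for f in block)
    capacity = len(blocks) * block_size
    return (capacity - used) / capacity


def transferred_bytes(misses, block_size):
    """Bytes copied from off-chip memory if every miss loads a full block."""
    return misses * block_size


sizes = {"f1": 300, "f2": 500, "f3": 900}      # hypothetical sizes in bytes
blocks = [["f1", "f2"], ["f3"]]
print(block_fragmentation(blocks, sizes, block_size=1024))
print(transferred_bytes(misses=6, block_size=1024))
```

Larger blocks tend to raise fragmentation and the bytes moved per miss, while smaller blocks tend to raise the miss count, which is the trade-off the result plots explore.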

Results: Function size distribution / Block Fragmentation • (Plots; x-axis labels: binary size [Byte], block size [Byte]) • CJPEG – JPEG encoding (MediaBench)

Results: Misses (LRU, 6-12 blocks) • (Plot; x-axis: total cache size [Byte]) • CJPEG – JPEG encoding (MediaBench)

Results: Transferred Code (LRU, 6-12 blocks) • (Plot; x-axis: total cache size [Byte]) • CJPEG – JPEG encoding (MediaBench)

Results: Transferred Code (LRU / ARC / RR, 8 blocks) • (Plot; x-axis: total cache size [Byte]) • CJPEG – JPEG encoding (MediaBench)

Results: Copy cycles (LRU, 6-12 blocks) • (Plot; x-axis: total cache size [Byte]) • CJPEG – JPEG encoding (MediaBench)

Summary • Introduced: Block Cache for Embedded Systems • Area increase / external memory latency • Utilization / suitability of traditional designs • Scalability: on-chip memories (megabytes) • Block Cache hardware: simple structure, logic + memory (SRAM, not cache memory) • Design flow: execute software component, block composition (algorithm, 3 steps), re-link the binary • Results: exploits high-bandwidth memory; good performance

References • [1] D. A. Patterson, “Latency lags bandwidth”, Commun. ACM, 2004 • [2] R. Banakar, S. Steinke, B. Lee, M. Balakrishnan, P. Marwedel, “Scratchpad memory: Design alternative for cache on-chip memory in embedded systems”, CODES, 2002 • [3] F. Angiolini, F. Menichelli, A. Ferrero, L. Benini, M. Olivieri, “A post compiler approach to scratchpad mapping of code”, CASES, 2004 • [4] S. Steinke, L. Wehmeyer, B. Lee, P. Marwedel, “Assigning program and data objects to scratchpad for energy reduction”, DATE, 2002

Motivation • Bandwidth improves but latency does not [1] • Generally, caches consume more power than on-chip memory [2,3,4] • On-chip area will increase enormously • A significant amount of power will be spent in the memory hierarchy • (Figure: off-chip memory; three CPUs, each with an I-Cache)

Motivation • (Figure: off-chip memory; three CPUs, each with an I-Cache; a second build adds a block cache (B-Cache))

Architectural Overview: Block Cache • (Figure: on-chip CPU, code blocks B1, B2, B3, …, control unit, µTLB and DMA; off-chip memory. Labels: instructions, instruction address, comparator (=), block load, block status with block addresses (Addr. B1, ……)) • Exploit burst transfers (DRAM memory) • Area efficient (SRAM cells) • Scalable (up to application size)

Function to Block Mapping • (Figure: functions F1 .. FN are assigned to blocks B1, B2, ..)
