Memory-aware compilation enables fast, energy-efficient, timing-predictable memory accesses
Peter Marwedel 1,2, Heiko Falk 1, Christian Ferdinand 3, Paul Lokuciejewski 1, Manish Verma 1, Lars Wehmeyer 1,2
1 Universität Dortmund, Informatik 12
2 Informatik Centrum Dortmund (ICD)
3 AbsInt GmbH, Saarbrücken
Key properties of embedded systems
Strong correlation between embedded and real-time systems.
Strong correlation between embedded and reactive systems: "A reactive system is one which is in continual interaction with its environment and executes at a pace determined by that environment" [Bergé, 1995].
Serious mismatch
"Despite considerable progress in software and hardware techniques, when embedded computing systems absolutely must meet tight timing constraints, many of the advances in computing become part of the problem rather than part of the solution. What would it take to achieve concurrent and networked embedded software that was absolutely positively on time? ... What is needed is nearly a reinvention of computer science."
Edward A. Lee: Absolutely Positively On Time: What Would It Take?, Embedded Systems Column, IEEE Computer, July 2005
Technology "advances" will make the situation worse
[Graph: speed over years; CPU performance grows by a factor of 1.5-2 per year (2x every 2 years), DRAM speed by only about 1.07 per year]
Increasing gap between processor and memory speeds.
Future semiconductor technology will be inherently unreliable, e.g. due to quantum effects, and will require fault-tolerance mechanisms. Will timing "redundancy" be used?
Scratchpads seen to help with timing problems
"Fortunately, there is quite a bit to draw on. To name a few examples, architecture techniques such as software-managed caches (scratchpad memories) promise to deliver much of the benefit of memory hierarchy without the timing unpredictability..."
[E. Lee, 2005]
Scratchpad memories (SPM): fast, energy-efficient, timing-predictable
[Figure: address space from 0 to FFF..; a small scratchpad memory mapped at low addresses, main memory above]
Small; no tag memory.
Example: ARM7TDMI cores, well known for low power consumption; called "tightly coupled memory" by ARM.
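Because the SPM occupies a fixed slice of the address space, a compiler or programmer can pin individual memory objects into it. A minimal sketch of how this typically looks in a gcc-based embedded tool flow: the section name `.spm` and the linker-script mapping are assumptions, not details from the slides.

```c
/* Hypothetical placement: a gcc section attribute puts the array into
 * an output section (".spm" is an assumed name) which a matching
 * linker script would map onto the scratchpad address range.
 * On a host machine this compiles and runs as ordinary data. */
__attribute__((section(".spm")))
static int coeff[8] = {1, 2, 3, 4, 5, 6, 7, 8};

int dot8(const int *x)
{
    int sum = 0;
    for (int i = 0; i < 8; i++)
        sum += coeff[i] * x[i];  /* every coeff access would hit the SPM */
    return sum;
}
```

Since the SPM has no tags and a fixed single-cycle latency, every such access is both cheap and trivially predictable for WCET analysis.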
Worst-case timing analysis using aiT
[Tool flow: C program + SPM size → encc → executable; the executable goes to the ARMulator for actual performance and to aiT for the WCET]
Results for G.721
[Graphs: WCET using a scratchpad vs. using a unified cache]
L. Wehmeyer, P. Marwedel: Influence of On-chip Scratchpad Memories on WCET. 4th Intl. Workshop on Worst-Case Execution Time Analysis (WCET), 2004
L. Wehmeyer, P. Marwedel: Influence of Memory Hierarchies on Predictability for Time Constrained Embedded Software. Design Automation and Test in Europe (DATE), 2005
Impact on access time and energy consumption
Small memories also provide faster access times and reduced energy consumption.
[Graphs: energy and access times vs. memory size, from the CACTI model for SRAM]
Static allocation of memory objects
Which objects (arrays, functions, etc.) should be stored in the SPM (capacity SSP)?
[Figure: processor board with main memory and scratchpad; arrays, functions and scalars are candidate objects]
Determine gain gk and size sk for each object k.
Static memory allocation: maximize the total gain G = Σk gk·xk, respecting the SPM size: Σk sk·xk ≤ SSP, with xk ∈ {0, 1}.
Solution: knapsack algorithm.
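The selection problem above is a standard 0/1 knapsack and can be solved by dynamic programming. A small illustrative sketch (gains, sizes, and capacity are made-up numbers, not results from the slides):

```c
/* Toy 0/1 knapsack for static SPM allocation: choose the subset of
 * memory objects that maximizes total gain G under capacity S_SP. */
#define N_OBJS 4
#define SPM_CAP 8  /* scratchpad capacity S_SP, e.g. in KB */

int spm_knapsack(const int gain[N_OBJS], const int size[N_OBJS], int cap)
{
    int best[SPM_CAP + 1] = {0};  /* best[c] = max gain using capacity c */
    for (int k = 0; k < N_OBJS; k++)
        for (int c = cap; c >= size[k]; c--)  /* reverse scan: object used once */
            if (best[c - size[k]] + gain[k] > best[c])
                best[c] = best[c - size[k]] + gain[k];
    return best[cap];
}
```

For example, with gains {10, 7, 5, 3} and sizes {5, 4, 3, 2}, a capacity of 8 selects the objects of size 5 and 3 for a total gain of 15.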
Dynamic replacement within scratchpad
[Figure: CPU with SPM and main memory; objects are moved between them at run time]
Effectively results in a kind of compiler-controlled swapping for the SPM.
Address assignment within the SPM is required (paging- or segmentation-like).
M. Verma, P. Marwedel (U. Dortmund): Dynamic Overlay of Scratchpad Memory for Energy Minimization, ISSS, 2004
Dynamic replacement of data within scratchpad: based on liveness analysis
[Example CFG: SPM size = |A| = |T3|. A is defined in B1, modified and used in the upper region, and used again in B8/B10; T3 is used only in the middle region (B3-B6). The live ranges of A and T3 do not overlap.]
Solution: map both A → SP and T3 → SP, inserting SPILL_STORE(A); SPILL_LOAD(T3); when entering T3's region and SPILL_LOAD(A); when A becomes live again.
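The core test behind this overlay is simple: two objects may occupy the same SPM space if their live ranges never overlap. A minimal sketch of that check, with basic blocks modeled as bit positions (the block numbering is illustrative, not taken from the slide's CFG):

```c
#include <stdbool.h>

/* Toy liveness model: bit i of the mask is set iff the object is
 * live in basic block i. Two objects can be overlaid onto the same
 * SPM addresses iff their live-range masks are disjoint. */
typedef unsigned live_t;

bool can_share_spm(live_t a, live_t b)
{
    return (a & b) == 0;  /* disjoint live ranges -> overlay is safe */
}
```

When the check succeeds, the compiler emits the SPILL_STORE/SPILL_LOAD pairs at the region boundaries so that each object is resident exactly while it is live.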
Dynamic replacement within scratchpad: results for edge detection
[Graph: energy relative to static allocation]
Impact of partitioning scratchpads
[Address map, starting at 0: scratchpad 0 (256 entries), scratchpad 1 (2 k entries), scratchpad 2 (16 k entries), then "main" memory]
Results for parts of GSM coder/decoder
A key advantage of partitioned scratchpads for multiple applications is their ability to adapt to the size of the current "working set".
Multiple processes: non-saving context switch
[Figure: SPM divided into fixed regions for processes P1, P2, P3]
• Non-Saving Context Switch (Non-Saving)
• Partitions the SPM into disjoint regions
• Each process is assigned its own SPM region
• Contents are copied in during initialization
• Good for large scratchpads
Saving/restoring context switch
[Figure: the whole SPM is saved and restored at each context switch]
• Saving Context Switch (Saving)
• Utilizes the SPM as a common region shared by all processes
• Process contents are copied on/off the SPM at each context switch
• Good for small scratchpads
Hybrid context switch
[Figure: SPM with both per-process regions and a region shared by P1, P2, P3]
• Hybrid Context Switch (Hybrid)
• Combines disjoint and shared SPM regions
• Good for all scratchpad sizes
• Analysis is similar to the non-saving approach
• Runtime: O(n·M³)
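The intuition behind "saving for small SPMs, non-saving for large ones" can be made concrete with a toy cost model. All numbers and the model itself are illustrative assumptions, not figures from the slides: benefit saturates once a region covers a process's working set, and the saving approach additionally pays for copying the SPM at context switches.

```c
static double mind(double a, double b) { return a < b ? a : b; }

/* Energy saved by SPM hits, saturating at the working-set size. */
static double benefit(double region_kb, double ws_kb, double save_per_kb)
{
    return mind(region_kb, ws_kb) * save_per_kb;
}

/* Non-saving: each of n processes only benefits from its private
 * SPM region of size spm_kb / n; no copy traffic. */
double cost_non_saving(double spm_kb, int n, double ws_kb, double save_per_kb)
{
    return -benefit(spm_kb / n, ws_kb, save_per_kb);
}

/* Saving: every process sees the full SPM, but the whole SPM is
 * copied on/off at context switches (copy_per_kb aggregates the
 * per-switch cost over the run). Lower cost = better. */
double cost_saving(double spm_kb, double ws_kb, double save_per_kb,
                   double copy_per_kb)
{
    return -benefit(spm_kb, ws_kb, save_per_kb) + copy_per_kb * spm_kb;
}
```

With a 4 KB working set and 4 processes, a 2 KB SPM favors saving (the private regions would be tiny), while a 32 KB SPM favors non-saving (every region already covers the working set, so copying is pure overhead); the hybrid approach interpolates between the two.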
Multi-process scratchpad allocation: results
(SPA: single-process approach; benchmarks: edge detection, adpcm, g721, mpeg)
• Hybrid is the best approach for all SPM sizes.
• Energy reduction at 4 kB SPM is 27% for the hybrid approach.
• Avoids the poor timing predictability of a cache-based system after a context switch.
Multi-processor ARM (MPARM) framework
[Architecture: four ARM7T processors, each with a private SPM, connected via an interconnect (AMBA or STBus) to shared main memory, an interrupt device, and a semaphore device]
• Homogeneous SMP, similar to the CELL processor
• Processing unit: ARM7T processor
• Shared coherent main memory
• Private memory: scratchpad memory
Using the optimization in a gcc-based tool flow
The source is split into two different files by a specially developed memory optimizer tool*: one for main memory and one for the SPM.
[Flow: application source (.c) and profile info (.txt) → memory optimizer (based on the ICD-C compiler) → main-memory source (.c) + SPM source (.c) + linker script (.ld) → ARM-GCC compiler → linker → executable (.exe)]
* Built with the new ICD-C tool design suite, available from ICD (see www.icd.de/es)
Results (MPARM)
DES encryption on 4 processors: 2 controllers + 2 compute engines. Energy values from STMicroelectronics.
Result of an ongoing cooperation between U. Bologna and U. Dortmund, supported by the ARTIST2 network of excellence.
Extension: WCET-aware compiler
[Tool flow: ANSI-C program → ANSI-C frontend → parse tree → IR code generator → medium-level IR → LLIR code generator → low-level IR → code generator → WCET-optimized assembly code. Optimization techniques operate on the low-level IR. Via LLIR2CRL and CRL2LLIR, the low-level IR is exchanged with aiT in its standard input format CRL2; aiT performs value analysis, loop-bound analysis, cache analysis, pipeline analysis, and path analysis, and returns CRL2 annotated with WCET info.] (ARTIST2)
Opportunities
• Precise WCET information for run-time optimizations
• Single implementation of hardware timing models
• Accurate information on pipeline influence
• Accurate information on memory timing
• Trade-off: cache vs. scratchpad optimization
• Passing additional information (flow facts) to aiT: potential for tighter bounds? (e.g. due to pointer disambiguation)
• Aggressive optimizations for code on the WCET path
• Respecting WCET constraints during compilation
• Reduction of jitter in multimedia applications
• Alternative input to aiT (compare compiler output)
Conclusion
• Timeliness and timing predictability are seriously missing in key concepts of current information technology.
• Scratchpads are seen as a potential contribution towards new architectural concepts.
• A comprehensive set of allocation methods has been developed: static allocation and dynamic allocation.
• Full integration of WCET tools into the compiler tool chain enables further explicit consideration of time.