
Memory-aware compilation enables fast, energy-efficient, timing predictable memory accesses




  1. Memory-aware compilation enables fast, energy-efficient, timing predictable memory accesses
Peter Marwedel 1,2, Heiko Falk 1, Christian Ferdinand 3, Paul Lokuciejewski 1, Manish Verma 1, Lars Wehmeyer 1,2
1 Universität Dortmund, Informatik 12
2 Informatik Centrum Dortmund (ICD)
3 AbsInt GmbH, Saarbrücken

  2. Key properties of embedded systems
• Strong correlation between embedded and real-time systems
• Strong correlation between embedded and reactive systems
"A reactive system is one which is in continual interaction with its environment and executes at a pace determined by that environment" [Bergé, 1995]
[Venn diagram: embedded, real-time, and reactive systems overlap]

  3. Serious mismatch
• "Despite considerable progress in software and hardware techniques, when embedded computing systems absolutely must meet tight timing constraints, many of the advances in computing become part of the problem rather than part of the solution. What would it take to achieve concurrent and networked embedded software that was absolutely positively on time …? … What is needed is nearly a reinvention of computer science."
Edward A. Lee: Absolutely Positively On Time: What Would It Take?, Editorial, draft version: May 18, 2005; published in: Embedded Systems Column, IEEE Computer, July 2005

  4. Technology "advances" will make the situation worse
[Chart: speed vs. years; CPU performance grows by 1.5–2× per year (≈ 2× every 2 years), DRAM speed by only 1.07× per year]
• Increasing gap between processor and memory speeds
• Future semiconductor technology will be inherently unreliable, e.g. due to quantum effects, and will require fault-tolerance mechanisms to be used. Will timing "redundancy" be used?

  5. Scratchpad seen to help with timing problems
• "Fortunately, there is quite a bit to draw on. To name a few examples, architecture techniques such as software-managed caches (scratchpad memories) promise to deliver much of the benefit of memory hierarchy without the timing unpredictability …"
• [E. Lee, 2005]

  6. Scratch pad memories (SPM): fast, energy-efficient, timing-predictable
[Diagram: address space from 0 to FFF.., with a small scratch pad memory mapped at the low addresses and main memory above]
• Small; no tag memory
• Example: ARM7TDMI cores, well-known for low power consumption; called "tightly coupled memory" by ARM

  7. Worst-case timing analysis using aiT
[Tool flow: C program + SPM size → encc compiler → executable → ARMulator (actual performance) and aiT (WCET)]

  8. Results for G.721
[Charts: WCET results using scratchpad vs. using a unified cache]
• L. Wehmeyer, P. Marwedel: Influence of On-chip Scratchpad Memories on WCET, 4th Intl. Workshop on Worst-Case Execution Time Analysis (WCET), 2004
• L. Wehmeyer, P. Marwedel: Influence of Memory Hierarchies on Predictability for Time Constrained Embedded Software, Design Automation and Test in Europe (DATE), 2005

  9. Impact on access time and energy consumption
• Small memories also provide faster access times and reduced energy consumption
[Charts: energy and access times vs. memory size, from the CACTI model for SRAM]

  10. Energy savings for the memory system
[Chart: memory system energy]

  11. Static allocation of memory objects
• Which objects (arrays, functions, etc.) should be stored in the SPM? Each object k has a gain gk and a size sk.
• Static memory allocation: maximize the total gain G = Σk xk·gk, respecting the size of the SPM: Σk xk·sk ≤ SSP, with xk ∈ {0, 1}.
• Solution: knapsack algorithm.
[Diagram: processor board with main memory and scratch pad memory of capacity SSP; example code with loops (for i { for j { while … repeat … call … } }) and objects (arrays, ints) to be mapped]
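The knapsack formulation above can be sketched with a standard 0/1 dynamic program. This is an illustrative sketch, not the encc implementation; all object names, sizes, and gains below are invented (a gain would typically be accesses × (E_main − E_spm)).

```python
# Hypothetical sketch of static SPM allocation as a 0/1 knapsack:
# maximize total gain subject to the SPM capacity constraint.

def allocate_spm(objects, capacity):
    """objects: list of (name, size, gain); returns (best_gain, chosen names)."""
    # dp[c] = (best gain, chosen objects) achievable with capacity c
    dp = [(0, [])] * (capacity + 1)
    for name, size, gain in objects:
        # iterate capacities downwards so each object is used at most once
        for c in range(capacity, size - 1, -1):
            cand = dp[c - size][0] + gain
            if cand > dp[c][0]:
                dp[c] = (cand, dp[c - size][1] + [name])
    return dp[capacity]

# Invented example: two arrays and two functions, 256-byte SPM
objects = [("A", 120, 90), ("B", 200, 60), ("f1", 80, 70), ("f2", 150, 40)]
gain, chosen = allocate_spm(objects, 256)
```

The greedy choice (largest gain first) would pick B here and waste capacity; the knapsack DP instead selects A and f1, which fit together and yield a higher total gain.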

  12. Dynamic replacement within scratch pad
• Effectively results in a kind of compiler-controlled swapping for the SPM
• Address assignment within the SPM required (paging- or segmentation-like)
[Diagram: CPU with SPM and main memory]
M. Verma, P. Marwedel (U. Dortmund): Dynamic Overlay of Scratchpad Memory for Energy Minimization, ISSS, 2004

  13. Dynamic replacement of data within scratch pad: based on liveness analysis
[Control-flow graph B1…B10: A is defined in B2 (DEF A), modified in B3 (MOD A), and used in B7 and B8 (USE A); T3 is used in B5 and B6 (USE T3). With SPM size = |A| = |T3|, the solution is A → SPM and T3 → SPM in their disjoint live ranges, with SPILL_STORE(A); SPILL_LOAD(T3) inserted before T3's live range and SPILL_LOAD(A) inserted after it.]
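The overlay above rests on one property from liveness analysis: two objects may occupy the same SPM space whenever their live ranges never overlap, with spill code inserted at the borders. A minimal sketch of that check, with live ranges simplified to linearized intervals (the real analysis works on the control-flow graph; all positions below are hypothetical):

```python
# Hypothetical sketch: objects A and B may share SPM bytes iff no pair
# of their live ranges (here: linearized [start, end] intervals) overlaps.

def overlaps(r1, r2):
    """True if intervals r1 = (start, end) and r2 intersect."""
    return r1[0] <= r2[1] and r2[0] <= r1[1]

def can_share_spm(ranges_a, ranges_b):
    """Disjoint live ranges: the compiler may overlay both objects in
    the same SPM locations, inserting spill code between the ranges."""
    return not any(overlaps(ra, rb) for ra in ranges_a for rb in ranges_b)

# A live around B2-B3 and B7-B8, T3 live around B5-B6 (invented positions):
A_ranges  = [(2, 3), (7, 8)]
T3_ranges = [(5, 6)]
shared = can_share_spm(A_ranges, T3_ranges)   # A and T3 can overlay
```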

  14. Dynamic replacement within scratch pad: results for edge detection (relative to static allocation)

  15. Impact of partitioning scratch pads
[Memory map from address 0: scratch pad 0 (256 entries), scratch pad 1 (2 k entries), scratch pad 2 (16 k entries), then "main" memory]

  16. Results for parts of GSM coder/decoder
A key advantage of partitioned scratchpads for multiple applications is their ability to adapt to the size of the current "working set".
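One way to picture this adaptation is a greedy placement onto the partitions of slide 15: the hottest objects go into the smallest (cheapest-to-access) partition that still fits them. This is only a toy sketch under that assumption, not the allocation algorithm used in the experiments; all sizes and access counts are invented.

```python
# Hypothetical sketch of allocation onto partitioned scratchpads:
# place the most frequently accessed objects into the smallest
# partitions first; whatever fits nowhere stays in main memory.

def partition_allocate(objects, partitions):
    """objects: (name, size, accesses); partitions: capacities in bytes,
    smallest first. Returns {partition index: [object names]}."""
    free = list(partitions)
    placed = {i: [] for i in range(len(partitions))}
    for name, size, accesses in sorted(objects, key=lambda o: -o[2]):
        for i, cap in enumerate(free):
            if size <= cap:          # smallest partition with room wins
                free[i] -= size
                placed[i].append(name)
                break
    return placed

# Invented working set; SPM0 = 256 B, SPM1 = 2 KB
objects = [("hist", 200, 5000), ("lut", 1500, 3000), ("buf", 100, 800)]
placed = partition_allocate(objects, [256, 2048])
```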

  17. Multiple processes: non-saving context switch
• Non-saving context switch (Non-Saving)
• Partitions the SPM into disjoint regions
• Each process is assigned an SPM region
• Contents are copied during initialization
• Good for large scratchpads
[Diagram: scratchpad divided into regions for processes P1, P2, P3]

  18. Saving/restoring context switch
• Saving context switch (Saving)
• Utilizes the SPM as a common region shared by all processes
• Process contents are copied on/off the SPM at each context switch
• Good for small scratchpads
[Diagram: scratchpad shared by P1, P2, P3; saving/restoring at context switch]

  19. Hybrid context switch
• Hybrid context switch (Hybrid)
• Disjoint + shared SPM regions
• Good for all scratchpad sizes
• Analysis is similar to the non-saving approach
• Runtime: O(nM³)
[Diagram: scratchpad with per-process regions for P1, P2, P3 plus a region shared by P1, P2, P3]
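The trade-off the three schemes navigate can be made concrete with a toy energy model: non-saving pays no copy cost but gives each process only a fraction of the SPM (lower hit rate), while saving uses the whole SPM but pays copy energy at every context switch. Every parameter below is invented for illustration and is not from the paper's energy model.

```python
# Hypothetical toy model of the non-saving vs. saving trade-off.
# All energies and counts are invented.

E_SPM, E_MAIN, E_COPY = 1.0, 10.0, 2.0   # energy per access / per copied word

def non_saving(accesses, hit_rate_small):
    """Private SPM region per process: no copy cost, but the small
    region captures fewer accesses."""
    hits = accesses * hit_rate_small
    return hits * E_SPM + (accesses - hits) * E_MAIN

def saving(accesses, hit_rate_full, switches, spm_words):
    """Whole SPM shared: higher hit rate, but contents are copied
    off and back on at every context switch."""
    hits = accesses * hit_rate_full
    copy_energy = 2 * switches * spm_words * E_COPY
    return hits * E_SPM + (accesses - hits) * E_MAIN + copy_energy

e_ns     = non_saving(100_000, 0.60)
e_s_few  = saving(100_000, 0.90, switches=10,   spm_words=1024)
e_s_many = saving(100_000, 0.90, switches=1000, spm_words=1024)
```

With few context switches the saving scheme wins; with many, the copy cost dominates and non-saving wins. The hybrid scheme of this slide sits between the two by keeping some regions private and sharing the rest.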

  20. Multi-process scratchpad allocation: results
SPA: single-process approach. Benchmarks: edge detection, adpcm, g721, mpeg.
• Hybrid is the best for all SPM sizes
• Energy reduction at 4 kB SPM is 27% for the hybrid approach
• Avoids the poor timing predictability of a cache-based system after a context switch

  21. Multi-processor ARM (MPARM) framework
• Homogeneous SMP, ~ CELL processor
• Processing unit: ARM7T processor
• Shared coherent main memory
• Private memory: scratchpad memory
[Diagram: four ARM cores, each with its own SPM, connected via an interconnect (AMBA or STBus) to the shared main memory, an interrupt device, and a semaphore device]

  22. Using the optimization in a gcc-based tool flow
• The application source is split into two different files by a specially developed memory optimizer tool*: a main-memory source file and an SPM source file.
[Tool flow: application source (.c) → Memory Optimizer → main-memory src (.c), SPM src (.c), and linker script (.ld); the ARM-GCC compiler and linker, together with profile info (.txt), produce the executable (.exe)]
* Built with the new tool design suite ICD-C, available from ICD (see www.icd.de/es)

  23. Results (MPARM)
DES encryption, 4 processors: 2 controllers + 2 compute engines. Energy values from STMicroelectronics.
Result of ongoing cooperation between U. Bologna and U. Dortmund, supported by the ARTIST2 network of excellence.

  24. State of the art of SPM algorithms

  25. Extension: WCET-aware compiler
[Tool flow: ANSI-C program → parse tree (ANSI-C frontend) → medium-level IR (IR code generator) → low-level IR (LLIR code generator) → optimization techniques and code generator → WCET-optimized assembly code. The low-level IR is exported via LLIR2crl as CRL2, the standard input to aiT, whose analyses comprise value analysis, loop-bounds analysis, cache analysis, pipeline analysis, and path analysis; CRL2 with WCET info is imported back via crl2llir. (ARTIST2)]

  26. Opportunities
• Precise WCET information for run-time optimizations
• Single implementation of hardware timing models
• Accurate information on pipeline influence
• Accurate information on timing of memory
• Trade-off: cache vs. scratchpad optimization
• Passing additional information (flow facts) to aiT: potential for tighter bounds? (e.g. due to pointer disambiguation)
• Aggressive optimizations for code on the WCET path
• Respecting WCET constraints during compilation
• Reduction of jitter in multimedia applications
• Alternative input to aiT (compare compiler output)

  27. Conclusion
• Timeliness and timing predictability are seriously missing in key concepts of current information technology
• Scratchpads are seen as a potential contribution towards new architectural concepts
• A comprehensive set of allocation methods has been developed:
  • Static allocation
  • Dynamic allocation
• Full integration of WCET tools into the compiler tool chain enables further explicit consideration of time
