David Atienza (DACYA/UCM), Stelios Mamagkakis (VLSI D&T Center, Xanthi), Marc Leeman (IMEC vzw, Leuven), Francky Catthoor (IMEC vzw, Leuven), José M. Mendías (DACYA/UCM), Dimitrios Soudris (VLSI D&T Center, Xanthi). Dynamic Memory Management for new embedded systems.
New embedded systems?
New consumer devices (e.g. mobile phones, PDAs). Main features:
1) More complex than traditional embedded devices (complex memory hierarchy, CPU, real-time OSes)
2) Portable, so battery capacity is limited
3) Must preserve “relatively” high performance (real time)
4) Many applications usually run concurrently
New applications?
Multimedia and wireless network protocols: 3D virtual reality games, scalable video rendering, wireless protocols. Main features:
1) Complex high-level design (e.g. C++, Java)
2) Very dynamic (variable use of resources)
3) Power hungry
4) Intensive memory use (accesses and footprint)
Why optimize these systems?
New applications running on these platforms without optimizations:
• No real-time behaviour!
• Out of memory or battery fast!
Users do not want these problems… new embedded systems need to be optimized!
Outline:
1) Memory subsystem in new embedded devices (static vs dynamic memory allocation)
2) Dynamic memory management mechanism
3) Dynamic memory subsystem refinement:
3.1) Dynamic Data Type Refinement (application level)
3.2) Dynamic Memory Management Refinement (OS level)
4) Real-life case studies and results
5) Enhanced dynamic memory management (multi-level): real-life case study and results
6) Conclusions and future work
Memory subsystem in new embedded devices
[Figure: processor core (ARM), MMU, instruction cache, data cache, scratchpad memory, AMBA bus, main memory]
- Highly optimized:
1) Multi-level memory hierarchy (e.g. caches)
2) High-performance buses (e.g. AMBA bus)
- Not enough… The memory subsystem accounts for up to 70% of total system power and can cause high performance degradation (>20x) [Vijaykrisnan2002, Catthoor2000]
Access and storage optimizations for DM are needed!
Static memory vs dynamic memory. Scenario 1: compile-time allocation (worst case)
[Figure: memory size over time (t1 to t4); scalable 3D decoding per object (Object 1 to Object 3) at low, medium or high quality]
Data: NO!
Static memory vs dynamic memory. Scenario 2: run-time allocation
[Figure: memory size over time (t1 to t5); scalable 3D decoding per object (Object 1 to Object 5) at low, medium or high quality]
Data: OK! Memory usage scales to the current input!
Overview of options for memory allocation (results of the 3D image reconstruction case study)
• Worst-case static memory solutions are not possible or do not work in extreme cases of input data
• Dynamic solutions achieve better results
• As shown later: well-designed custom solutions can further improve on standard DM mechanisms
DM management works at 2 levels
1) Applications use SW functions; in C++: new()/delete()
[Figure: heap contents over time (t1 to t6) for a sequence of requests new(O1) to new(O5) and delete(O2); the result is fragmentation!]
2) Real-time OS support: the RTOS dynamic memory manager manages the heap in main memory
1) Which parts use the DM SW functions?
Embedded application:
• Algorithms (functionality)
• Static data (e.g. frames)
• Dynamic data (e.g. objects to render)
• Structured data (sets of objects): Dynamic Data Types, DDT 1 … DDT n, created with new(Object)
[Figure: two-layer DDT example, with PA(key1) and AR(key1) nodes in layer 1 and PA(key2) and AR(key2) nodes in layer 2 holding the data]
2) RTOS support: which DM manager to use?
- Partition manager: suitable for one allocation size
• One fixed block size
• First fit allocation order
Simple, low energy consumption, but fragmentation!
- Region allocator: suitable for several allocation sizes
• Many block sizes, doubly-linked list
• First fit, splitting, coalescing (best fit approximation)
Complex, higher energy consumption
[Figure: block layouts of both managers, showing the global info of the manager, per-block headers, used and free blocks, and new requests]
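A minimal sketch of the partition manager idea, assuming a single fixed block size and a first-fit scan. The class name `PartitionManager` and its interface are hypothetical and not taken from any RTOS; a bitmap replaces the per-block headers of the slide to keep the example short.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical fixed-size partition manager: one block size, first-fit order.
class PartitionManager {
public:
    PartitionManager(std::size_t block_size, std::size_t block_count)
        : block_size_(block_size),
          storage_(block_size * block_count),
          used_(block_count, false) {}

    // First fit: hand out the first free block, or nullptr when the pool is full.
    void* allocate() {
        for (std::size_t i = 0; i < used_.size(); ++i) {
            if (!used_[i]) {
                used_[i] = true;
                return storage_.data() + i * block_size_;
            }
        }
        return nullptr;
    }

    // Mark the block that starts at p as free again.
    void deallocate(void* p) {
        std::size_t offset = static_cast<std::size_t>(
            static_cast<unsigned char*>(p) - storage_.data());
        used_[offset / block_size_] = false;
    }

    // Number of blocks currently handed out.
    std::size_t blocks_in_use() const {
        std::size_t n = 0;
        for (bool u : used_) n += u ? 1 : 0;
        return n;
    }

private:
    std::size_t block_size_;
    std::vector<unsigned char> storage_;  // backing memory of the partition
    std::vector<bool> used_;              // one flag per block
};
```

Every request consumes a whole block regardless of how many bytes are actually needed, which is where the fragmentation noted above comes from when allocation sizes vary.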
Design flow for new embedded systems Specification of the system at very high-level (e.g. C++ or UML) Refinement of DM subsystem (2 levels): 1. Dynamic Data Types Refinement (DDTR) 2. DM Managers Refinement (DMMR) Further optimizations in static data Final implementation
DDTs in new embedded applications
1) Complex control flow (many sub-algorithms) => many DDTs with different (and irregular) behaviour interacting over time
2) Implementations are designed from a functional point of view, not for efficient access or mapping to memory (e.g. clustering of data)
3) Complex implementations => combinations of trees, arrays, singly linked lists, doubly linked lists, …
Result: very expensive (e.g. in energy) if not well designed
Proposed Dynamic Data Type Refinement steps
1) Specification of the required DDTs (from the algorithm)
2) Working implementation of DDTs (e.g. C++)
3) Insertion of profiling
4) Run-time simulation and profiling acquisition
5) Refinement of current design (interaction and implementation of DDTs)
6) Final implementation of DDTs
PROBLEMS: error-prone; automated ways?; time-consuming; huge design space of DDTs; high-level estimations?
Solutions proposed: library of auto-profiled DDTs; structured reports generation; heuristics to limit the exploration; analytical high-level estimations
Library of auto-profiled DDTs
1.- Library of complex multi-level DDTs:
• Based on initial exploration in Matlab with traces of real programs
• Multiplatform compatible: ANSI C++ compliant
• Basic data types (e.g. int) or user-defined (e.g. objects)
• List of relevant DDTs (i.e. 14) for multimedia included:
1) Basic data types: singly and doubly linked lists, array (AR) and pointer-AR
2) 2-layered combinations of ARs and pointer-ARs
3) Singly linked list - AR
4) Doubly linked list - AR
5) Binary trees
6) DDTs with pattern optimizations (e.g. fast access to the last element)
2.- Extension: mechanism to create new DDTs in the library based on template classes (or “mixins”)
“Mixin” concept and how we use it
• Y. Smaragdakis (2002): a method to specify extensions of classes without defining up front exactly which classes
• Uses in our library of DM managers:
1) Specifying a subclass while the parent class is a template parameter:
template<class SuperClass>
class Mixin : public SuperClass {
  // mixin definitions
};
2) Using a template class inside another class:
template<class SuperClass>
class Myclass {
  SuperClass *data;
  // template class definitions
};
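As an illustration of the first use (parent class as a template parameter), the sketch below stacks two mixin layers on a plain base class. All class names here (`Counter`, `CountingCalls`, `Doubling`) are hypothetical and chosen only for this example; they are not part of the library described in the slides.

```cpp
#include <cstddef>

// Base layer: a plain accumulator (hypothetical example class).
class Counter {
public:
    void add(int v) { total_ += v; }
    int total() const { return total_; }
private:
    int total_ = 0;
};

// Mixin layer: the parent is a template parameter, so this extension can be
// stacked on any class that offers add(int).
template <class SuperClass>
class CountingCalls : public SuperClass {
public:
    void add(int v) {
        ++calls_;               // extra behaviour added by the mixin
        SuperClass::add(v);     // then delegate to the layer below
    }
    std::size_t calls() const { return calls_; }
private:
    std::size_t calls_ = 0;
};

// A second, independent mixin layer that doubles every value before delegating.
template <class SuperClass>
class Doubling : public SuperClass {
public:
    void add(int v) { SuperClass::add(2 * v); }
};

// Layers are composed at compile time, outermost first.
using DoubledCountedCounter = Doubling<CountingCalls<Counter>>;
```

Calling `add(3)` and then `add(4)` on a `DoubledCountedCounter` leaves `total()` at 14 and `calls()` at 2, since each call passes through both layers. Because the composition is resolved at compile time, the compiler can inline the delegating calls.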
Examples of DDT definitions:
• Generic and basic DDTs:
template<int Size, typename T, class Super> class Array { ... };
template<typename T, class Super> class DLList { ... };
template<typename T, class Super> class BTTree { ... };
• One-level concrete DDTs:
class F_array : public Array<256, float> {};
class I_DLList : public DLList<int, int> {};
class D_BTTree : public BTTree<double, double> {};
• New multi-level concrete DDTs:
class ARARInteg : public Array<20, int, Array<128, int> > {};
class DLLARInteg : public DLList<int, Array<128, int> > {};
class BTSLLDoub : public BTTree<double, SLList<double> > {};
class ARARAREmployee : public Array<2, Employee,
    Array<4, Employee, Array<2, Employee> > > {};
Structured reports of DDT implementations at run-time
1) Profiling is already inserted in all the DDT implementations of the library
2) Information reported at run-time from the DDTs:
• Read and write accesses
• Memory footprint behaviour
• Access pattern to the data => method calls to data (e.g. sequential)
3) Graphical tool (based on Gtk/Perl) to perform code parsing and profiling insertion in new DDTs
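The kind of information listed above could be collected roughly as follows. This is only a sketch under assumptions: `AccessProfile` and `ProfiledArray` are hypothetical names, and the library's real profiling interface is not shown in the slides.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical profile record shared by one or more DDT instances.
struct AccessProfile {
    std::size_t reads = 0;       // number of read accesses
    std::size_t writes = 0;      // number of write accesses
    std::size_t peak_bytes = 0;  // largest payload footprint observed
};

// Array-like DDT whose accessors update the profile on every call.
template <typename T>
class ProfiledArray {
public:
    explicit ProfiledArray(AccessProfile& profile) : profile_(profile) {}

    void push_back(const T& value) {
        data_.push_back(value);
        ++profile_.writes;
        std::size_t bytes = data_.size() * sizeof(T);
        if (bytes > profile_.peak_bytes) profile_.peak_bytes = bytes;
    }

    const T& read(std::size_t i) {
        ++profile_.reads;
        return data_[i];
    }

    void write(std::size_t i, const T& value) {
        ++profile_.writes;
        data_[i] = value;
    }

    std::size_t size() const { return data_.size(); }

private:
    std::vector<T> data_;
    AccessProfile& profile_;  // must outlive this container
};
```

The footprint here counts only payload bytes (`size() * sizeof(T)`), not the capacity reserved by `std::vector`; a real profiler would have to decide which of the two it reports.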
Our run-time exploration of DDTs
1) Heuristics based on clustering of blocks => possibly up to one per DDT
2) Unified exploration loop during usual execution:
[Figure: flow chart with the boxes Start exploration, Acquiring profiling, Heuristics evaluation, Finished? (Yes/No), Exploration finished and Evaluation; the application runs at normal execution speed with instrumented DDTs from the library of DDTs for multimedia and fills the profile objects]
3) Refinement is done in a post-processing phase.
Post-processing phase (refinement)
• Automated refinement process:
- Acquired run-time profiling information
- Analytical power model (0.8 to 0.13 tech. node), based on Cacti v3.0
- Graphical evaluation tool
- Global optimal (Pareto) points, trade-offs: power consumption / memory footprint / execution time
• Further refinements are possible with run-time behaviour information of the DDTs: global control flow simplification => Intermediate Variable Elimination (additional global gains!)
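Selecting Pareto points from profiled candidates can be sketched as below. The struct `Candidate` and the functions `dominates` and `pareto_front` are hypothetical names for this illustration; the three metrics are those named in the slide, all treated as lower-is-better.

```cpp
#include <cstddef>
#include <vector>

// One profiled DDT implementation candidate (hypothetical record).
struct Candidate {
    double power;      // power consumption estimate
    double footprint;  // memory footprint
    double time;       // execution time
};

// a dominates b if a is no worse in every metric and strictly better in one.
bool dominates(const Candidate& a, const Candidate& b) {
    bool no_worse = a.power <= b.power && a.footprint <= b.footprint &&
                    a.time <= b.time;
    bool better = a.power < b.power || a.footprint < b.footprint ||
                  a.time < b.time;
    return no_worse && better;
}

// Keep only candidates that no other candidate dominates (O(n^2) scan).
std::vector<Candidate> pareto_front(const std::vector<Candidate>& all) {
    std::vector<Candidate> front;
    for (std::size_t i = 0; i < all.size(); ++i) {
        bool dominated = false;
        for (std::size_t j = 0; j < all.size() && !dominated; ++j) {
            if (j != i && dominates(all[j], all[i])) dominated = true;
        }
        if (!dominated) front.push_back(all[i]);
    }
    return front;
}
```

The remaining points are the trade-offs a designer chooses from: none of them can be improved in one metric without getting worse in another.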
Simplification of control flow: Intermediate Variable Elimination phase
• Interaction between DDTs in a global context:
1) Complex algorithms consist of many smaller ones: data generation and consumption (DSP)
2) Each step performs “some” transformations: filtering of points, proximity, selection, …
3) Injective relationship: remove buffers when the index function is simple enough and no intermediate results are needed later
Very significant additional global gains!
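A small illustration of the idea, assuming a producer step and a consumer step that touch element i in the same order, so the buffer between them can be dropped. The functions `scale_point` and `selected` are hypothetical stand-ins for the "transformations" mentioned above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical producer step: transform one input point.
inline int scale_point(int x) { return 3 * x; }

// Hypothetical consumer step: select values above a threshold.
inline bool selected(int y) { return y > 10; }

// Original structure: the producer fills an intermediate buffer that the
// consumer reads back afterwards.
std::size_t count_with_buffer(const std::vector<int>& in) {
    std::vector<int> tmp(in.size());  // intermediate DDT
    for (std::size_t i = 0; i < in.size(); ++i) tmp[i] = scale_point(in[i]);
    std::size_t n = 0;
    for (std::size_t i = 0; i < tmp.size(); ++i) {
        if (selected(tmp[i])) ++n;
    }
    return n;
}

// After elimination: element i of the buffer depends only on element i of the
// input and is not needed later, so each value is consumed as it is produced.
std::size_t count_fused(const std::vector<int>& in) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (selected(scale_point(in[i]))) ++n;
    }
    return n;
}
```

Both functions return the same count, but the fused version allocates no buffer and avoids one write and one read per element.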
Design flow for new embedded systems Specification of the system at very high-level (e.g. C++ or UML) Refinement of DM subsystem (2 levels): 1. Dynamic Data Types Refinement (DDTR) 2. DM Managers Refinement (DMMR) Further optimizations in static data Final implementation
Problems to create custom DM managers
• DM management is left to the OS => general-purpose DM managers, not custom ones! (Lea allocator, Linux-based systems, 2003)
• Custom DM managers? No guidelines! Only designers’ experience (try-and-test phase):
1) Huge design space to explore manually (e.g. organization of memory blocks, fit algorithms) (Wilson et al. 1995)
2) Frameworks to build and profile custom DM managers are not available (Berger2001, Attardi98)
Small example of different choices in DM managers
Several options to decide:
• DM manager with information in the blocks? One block size / several block sizes
• DM manager with coalescing service or not? No coalescing / coalescing
Both options are possible; which is best for the current application?
New methodologies to decide the best options are needed!
1) Proposed DM management methodology
• Main phases:
1) Profiling of the application’s dynamic behavior to detect the most commonly occurring data type accesses
2) Systematic exploration of possible DM management solutions from a structured (orthogonalized) design space, for a certain cost function
3) Efficient code implementation and empirical evaluation of promising DM management solutions, using our own high-level C++ library to create them
Proposed Dynamic Memory Management Refinement steps
1) Profiling of the application’s dynamic behavior (identification of DDT access patterns)
2) Exploration of possible DM management solutions for certain constraints (e.g. power, performance)
3) Implementation and run-time evaluation of promising custom DM management candidates
4) Final selection of custom DM manager
Profiling of application’s DM behavior Dynamic data = organized data structures: Dynamic Data Types (DDTs) • Allocation sizes of each structure in the DDTs • Temporal behavior of each DDT • Memory footprint of each DDT • Interaction of the DDTs (spatial locality) => From our Dynamic Data Type Refinement
Exploration of possible DM management solutions to minimize memory footprint
1) Definition of a structured design space for DM management:
• Orthogonal categories to create custom DM managers
• Categories propagate dependencies and make the design space exploration feasible
2) Definition of a suitable order to traverse the design space that reduces one or more cost functions.
Our design space for DM management • According to basic blocks defined in DM managers: • Orthogonal decision trees inside each category
Complete DM management design space
• All important state-of-the-art DM managers are covered within our DM design space:
• Binary buddy, double buddy
• Simple segregated fit
• First fit, next fit, best fit allocation orders
• Kingsley allocator (among the fastest, N-Gage)
• Region allocators (fast, embedded RTOS: RTEMS)
• Complex region-segregated fit: Win XP real time (fast)
• Doug Lea allocator (Linux, best trade-off)
• Obstacks (custom, optimized for stack behavior, gcc)
• Xalloc (custom, variation of regions-stack, Apache)
• …
Main problem: huge design space! Order of decision trees?
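To make one point of this design space concrete: Kingsley-style allocators round each request up to a power-of-two size class, which is fast but wastes space inside the block. The sketch below only shows that rounding; the function names and the 8-byte minimum class are assumptions for illustration, and block headers are ignored.

```cpp
#include <cstddef>

// Round a request up to the next power-of-two size class.
// Assumes the request is small enough that the class fits in std::size_t.
std::size_t size_class(std::size_t request) {
    std::size_t cls = 8;  // assumed smallest class, in bytes
    while (cls < request) cls <<= 1;
    return cls;
}

// Bytes lost inside the block (internal fragmentation) for one request.
std::size_t internal_waste(std::size_t request) {
    return size_class(request) - request;
}
```

A 100-byte request lands in the 128-byte class and wastes 28 bytes, while a 129-byte request lands in the 256-byte class and wastes 127 bytes. This is the speed versus footprint trade-off that separates such allocators from best-fit designs with splitting and coalescing.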
Interdependencies help to explore the DM management design space
Interdependencies exist between orthogonal trees:
• A2) Block sizes in the DM manager: one or several? One block size / several block sizes
• A5) Flexible block size DM manager: coalesce or not? No coalescing / coalescing
These interdependencies make the exploration feasible: not all combinations of trees are realistic!
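The coalescing decision can be sketched with an address-ordered table of free blocks. `FreeList` is a hypothetical class for this illustration, not a component of the authors' library. It also hints at why the two trees interact: merging neighbours only pays off when the manager can hand out blocks of several sizes.

```cpp
#include <cstddef>
#include <iterator>
#include <map>

// Hypothetical free-block table: start offset -> size in bytes, ordered by address.
class FreeList {
public:
    explicit FreeList(bool coalesce) : coalesce_(coalesce) {}

    // Return a block to the free list, optionally merging it with neighbours.
    void release(std::size_t offset, std::size_t size) {
        auto it = blocks_.emplace(offset, size).first;
        if (!coalesce_) return;
        // Merge with the next block if it starts exactly where this one ends.
        auto next = std::next(it);
        if (next != blocks_.end() && it->first + it->second == next->first) {
            it->second += next->second;
            blocks_.erase(next);
        }
        // Merge with the previous block if it ends exactly where this one starts.
        if (it != blocks_.begin()) {
            auto prev = std::prev(it);
            if (prev->first + prev->second == it->first) {
                prev->second += it->second;
                blocks_.erase(it);
            }
        }
    }

    // Size of the largest free block, i.e. the largest request that can be served.
    std::size_t largest() const {
        std::size_t best = 0;
        for (const auto& b : blocks_) {
            if (b.second > best) best = b.second;
        }
        return best;
    }

    std::size_t count() const { return blocks_.size(); }

private:
    bool coalesce_;
    std::map<std::size_t, std::size_t> blocks_;
};
```

Releasing three adjacent 16-byte blocks leaves one 48-byte block with coalescing, and three separate 16-byte blocks without it, so a 48-byte request succeeds only in the first case.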
E.g. final order for reduced memory footprint (interdependencies and factors of influence)
[Figure: decision trees of the design space labelled with their traversal order, (1) to (12)]
Our approach to implement and evaluate custom DM managers
1) Object-oriented library to compose DM managers:
- ANSI C++ code with “mixins” that can be efficiently optimised by current compilers (e.g. gcc, Visual C++)
- Custom DM managers composed of basic components (e.g. fit algorithms, memory block organizations)
- Fast creation and debugging of custom DM managers
2) Profiling framework that can be easily inserted in the library to profile the candidates (e.g. memory footprint, energy)
3) Post-processing phase to compare the DM candidates
Advantages of our mixin-based approach for DM managers over traditional implementations
1) Direct equivalences between our DM design space and its implementation classes
2) Independent layers => good maintainability and fast changes in parts of DM managers (compare e.g. the Lea allocator: 15000 lines of complex C code)
3) Extensive code reuse => implementation classes are reused among different DM managers through common method interfaces
4) Profiling can be done in a similar way to DDTs
Graphical example of custom DM manager created by composition of our basic blocks
[Figure: DMMHeap, the final custom DM manager, chooses which manager to use according to the allocation size. One manager stacks Best Fit on a doubly-linked list blocks structure on an OS interface heap (4B physical blocks); the other stacks First Fit on a binary tree blocks structure on an OS interface heap (8B physical blocks)]
Example of custom DM manager created with our categories of basic blocks
/* Basic blocks for heap requests to the system */
template<typename MyT>
class BasicHeap : public TypeClass<MyT, mheap> {};

/* Data types of the Dynamic Memory (DM) manager:
   DLL -> Doubly Linked List and BTT -> Binary Tree */
template<typename MyT, class SuperClass>
class DLList { /* Implem. DLL */ };
template<typename MyType, class SuperClass>
class BTTree { /* Imp. BTT */ };

/* Two basic data types instantiated for the memory manager */
class I_DLList : public DLList<int, BasicHeap<int> > {};
class D_BTTree : public BTTree<double, BasicHeap<double> > {};

/* DM manager with 2 seg. fit lists of data types, best or first fit policy */
class DMMHeap : public SegLists<
    list_Sizes,          // List of sizes for segList
    numElemFirstList,    // Number of lists with type 1st segList
    BestFit<I_DLList>,   // 1st segList
    FirstFit<D_BTTree>   // 2nd segList
> {};
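The slide code relies on library classes (`TypeClass`, `SegLists`, `BestFit`, `FirstFit`) whose definitions are not shown. The self-contained sketch below imitates only the overall shape, a manager that routes requests by size to differently composed sub-heaps, using hypothetical classes (`SystemHeap`, `CountingHeap`, `SizeDispatchHeap`) and `std::malloc` as the system interface. It is not the authors' implementation.

```cpp
#include <cstddef>
#include <cstdlib>

// Bottom layer: obtains memory from the system.
class SystemHeap {
public:
    void* allocate(std::size_t bytes) { return std::malloc(bytes); }
    void deallocate(void* p) { std::free(p); }
};

// Mixin layer: counts live allocations on top of any heap layer.
template <class Super>
class CountingHeap : public Super {
public:
    void* allocate(std::size_t bytes) {
        ++live_;
        return Super::allocate(bytes);
    }
    void deallocate(void* p) {
        --live_;
        Super::deallocate(p);
    }
    std::size_t live() const { return live_; }
private:
    std::size_t live_ = 0;
};

// Top layer: chooses a sub-heap according to the allocation size.
// The caller passes the size again on deallocation (an assumption made here
// to avoid storing per-block headers in this short sketch).
template <std::size_t Threshold, class SmallHeap, class LargeHeap>
class SizeDispatchHeap {
public:
    void* allocate(std::size_t bytes) {
        return bytes <= Threshold ? small_.allocate(bytes)
                                  : large_.allocate(bytes);
    }
    void deallocate(void* p, std::size_t bytes) {
        if (bytes <= Threshold) small_.deallocate(p);
        else large_.deallocate(p);
    }
    const SmallHeap& small() const { return small_; }
    const LargeHeap& large() const { return large_; }
private:
    SmallHeap small_;
    LargeHeap large_;
};

// Composition at compile time, in the spirit of DMMHeap.
using DemoHeap =
    SizeDispatchHeap<64, CountingHeap<SystemHeap>, CountingHeap<SystemHeap>>;
```

Swapping a sub-heap for another composition only changes the `using` line, which is the maintainability argument made for the layered approach.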
Profiling framework for DM managers
1) Reuse of the interface for the profiling of DDTs
2) Profiling is already inserted in all the classes of our library of DM managers
3) Information reported at run-time from the DM managers:
• Memory footprint behaviour
• Access pattern of the DM managers to the data (e.g. allocations/deallocations)
• Fragmentation in the managers
• Classified for each implementation part of the DM managers (e.g. fit algorithms, internal data structures)
Code example of integration of profiling objects in our custom DM managers
- Easy insertion of our C++ profiling framework in the original structure of custom DM managers:
/* Declaration of profile objects of our common profile framework */
_profile *prof1, *prof2, *prof3;

class DMMHeap : public SegLists<
    prof1,                        // Profile object at this level
    list_Sizes,                   // List of memory block sizes for the segList
    numElemFirstList,             // Number of lists with type 1st segList
    BestFit<I_DLList<prof2> >,    // 1st segList with profile object
    FirstFit<D_BTTree<prof3> >    // 2nd segList with profile object
> {};
Few new parts are required!
Case study 1: 3D reconstruction algorithm
Matching of points in consecutive frames (“like” motion detection)
• 1,500 2D points to ‘match’ on average
• Size: 700000 lines of C++ code
• Sources of uncertainty:
1) Unknown input image sizes
2) Additional intermediate DDTs
Initial and optimised DDTs in the 3D reconstruction algorithm (DDTR phase)
[Figure: initial DDTs implementation vs final DDTs implementation]
Results obtained with our DMMR phase
• Memory footprint reduction:
[Figure: memory footprint of different DM managers (2 frames), in MBytes (axis 0 to 2.50), for Kingsley (Win32), Regions and our DM manager]
• Execution time reduced; the overhead added to total execution time is not significant (600 frames, 20 s):
[Figure: overhead due to DM managers, time in secs (axis 0 to 2), for our custom DM manager, Kingsley (Win32) and Regions]
Final results 3D Reconstruction case study Overall gains of almost 2 orders of magnitude!
Case study 2: Virtual reality game
• Real images interact with 3D generated objects
• Initially designed for embedded devices (e.g. Trimedia 1300)
• Unpredictable behaviour:
• Objects on the screen
• Wall detection
• User movements
[Figure: initial image and processed image]
Overall results trying to minimize power consumption
• Final comparison with original DM implementation:
• Total memory saving: 22.48%
• Total power consumption saving: 75.3%
[Figures: DDTs behaviour with 6 images; global Pareto points for the DDTs]
Examples of control flow simplification: Intermediate Variable Elimination
Case study 3: 3D rendering system
• Scalable rendering of objects according to available system resources (QoS)
• Size: 5000 lines of C++ code
• Usual “static” memory footprint: 12 MB (real scenes with 7 objects)
• Sources of uncertainty:
1) Movements of the user (QoS)
2) Several DDTs: 3D points and 3D faces
[Figure: scalable 3D coding (3D mesh + 2D texture), low-quality decoding vs high-quality decoding]
Results DDTR (reducing memory footprint)
• Memory footprint reduction (35%)
• Energy results with two different memory hierarchies (with and without cache): very different results!
[Figures: energy results without cache and with cache]
Results DMMR (reducing memory footprint)
• Memory footprint improved:
[Figure: memory footprint of different DM managers (5 objects), in MBytes (axis 0 to 4.50), for Lea (Linux), Kingsley (Win32), Obstacks and our DM manager]
• Execution may be slower (trade-offs between execution time, memory footprint and power):
[Figure: execution time of different DM managers, time in secs (axis 0 to 10), for our DM manager, Obstacks, Lea (Linux) and Kingsley (Win32)]