The GLIMPSES Toolkit: Rapid code prototyping for SPEs. Jaswanth Sreeram, Santosh Pande
Overview of Toolkit • GLIMPSES Toolkit : GLobal Interprocedural Memory and ParalleliSm Estimator for SPUs • Profile instrumentation support • Profile parsers and interpreters. • Analyzers for memory allocation & access behavior • Visualization Engine
GLIMPSES toolkit • One of two tools available in the public domain • Rapid Prototyping, Legacy Code Migration and Performance Tuning on Cell SPEs • Second one is asmvis • Released on SourceForge in mid-July: http://glimpses.sourceforge.net • OSI-certified open source license(s) • Has received interest for adoption in academia and industry • Samsung Korea, Codecs and Media Computing Group • Sony Computer Entertainment America (SCEA)
GLIMPSES : Motivation • Prototyping large codebases for porting to SPEs is challenging • Find a partition (set of functions) • Find a set of upward exposed references • DMA transfer them and lay them out – alignment • After execution store the results back • Make sure memory requirements do not exceed capacity
Motivation – contd. • Challenges due to architectural attributes • Limited local store • High branch penalty • Suited for vectorizable code rather than scalar code • SPE/PPE interactions • Provide programmer with tools to • Understand program behavior (esp. memory usage) • Quickly construct candidate partitions for SPE • Evaluate/Quantify partitions’ suitability for SPEs
GLIMPSES : Details • Memory Estimation tools enable programmer to: • Estimate static & dynamic memory usage • Code, Stack, Heap • Understand program behavior • Detect program objects affecting dynamic memory behavior • Show the correlation between these program objects and memory usage. • Rank program segments • Criteria: Memory requirements, vectorizability, branching, etc. • Visualize results interactively.
Features overview • Dynamic Call Graph visualization – ability to select a call tree • Memory Requirements • Dynamic • Analytical – ‘what if’ scenario calculator for memory capacity • Memory Access Patterns • Locality (spatial, temporal, neighbor affinity) • Ranking • Criteria based estimates • Alias and safe pre-fetching information • Multiple alias analyses available
Overview (flow diagram) • Pipeline: C/C++ program → LLVM compiler flow → Bytecode → Analysis & Instrumentation Passes → Instrumented Bytecode → Runtime Link → Execute • Other diagram elements: Test Inputs, Profile Trace, Analytical Memory Estimator, Partition Estimator, Dyn. Memory Estimator, GraphML Trace, Visualization Engine
Visualization (screenshot) • Graph Visualization Area • Results Display Panel
Visualization …contd • Zoom view • Shows dynamic call chains for a program run (in this case the program is mpeg2-decode)
Visualization …contd • Function Characteristics • Alias Analysis Algorithm used • Type of Aliases displayed (“Must Alias”, “May Alias”, “No Alias”) • Aliasing information for pairs of variables/memory regions
Analytical Memory Estimation • Correlate dynamic memory usage with program objects • Dynamic memory usage depends on inputs, etc. • Compiler Analysis • From each malloc, do a backward traversal to find instructions that influence the arguments to malloc. • Construct an arithmetic expression for amount of memory allocated, in terms of inputs or other program objects. • Handles control flow constructs (if-then-else, loops etc)
Memory Behavior: Analytical Estimation
Source fragment:
    if (cc==0) size = Picture_Width * Picture_Height;
    else size = Chroma_Width * Chroma_Height;
    ….. ……
    for(….) { if (…..) malloc(size); if (…..) malloc(size); }
Estimated allocation sizes:
    __Malloc_size__1 = Picture_Width*Picture_Height
    __Malloc_size__2 = Picture_Width*Picture_Height
    __Malloc_size__3 = Picture_Width*Picture_Height
    __Malloc_size__4 = Picture_Width*Picture_Height
    __Malloc_size__5 = Chroma_Width*Chroma_Height
    __Malloc_size__6 = Chroma_Width*Chroma_Height
    __Malloc_size__7 = Chroma_Width*Chroma_Height
    __Malloc_size__8 = Chroma_Width*Chroma_Height
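Once allocation sizes are expressed in terms of program inputs, a ‘what if’ question about heap demand reduces to evaluating those expressions for candidate input values. The C sketch below is illustrative only and is not GLIMPSES code or output. The function names are hypothetical, it assumes one byte per element, and it assumes that all eight allocations listed above execute, so the result is an upper bound.

```c
#include <stddef.h>

/* Hypothetical helper, not part of GLIMPSES: heap bytes implied by the eight
 * estimated allocation sizes (four picture-sized, four chroma-sized),
 * ASSUMING one byte per element and that every allocation executes. */
size_t estimated_heap_bytes(size_t picture_width, size_t picture_height,
                            size_t chroma_width, size_t chroma_height)
{
    size_t picture = picture_width * picture_height;
    size_t chroma = chroma_width * chroma_height;
    return 4 * picture + 4 * chroma;
}

/* Returns 1 if the estimate fits within capacity_bytes, 0 otherwise. */
int fits_in_capacity(size_t heap_bytes, size_t capacity_bytes)
{
    return heap_bytes <= capacity_bytes;
}
```

For example, a 352x288 picture with 176x144 chroma planes gives 506880 bytes under these assumptions, which exceeds the 256 KB (262144-byte) SPE local store even before code and stack are counted.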
Memory References • Memory reference metrics • Temporal (frequency) • Spatial • Neighbor affinity • Metrics measured per memory line • Per function metrics or per-partition metrics • Visually represented via a color map • Pale Violet (low) -> Bright Red (high)
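A per-line reference frequency of the kind that feeds such a color map can be sketched as a simple binning pass over an address trace. The trace representation (an array of addresses), the function name, and the handling of out-of-range addresses are assumptions made for illustration; the actual GLIMPSES trace format is not described here.

```c
#include <stddef.h>
#include <stdint.h>

/* Count references per memory line (hypothetical helper).
 * addrs:     trace of referenced addresses (assumed input format)
 * base:      lowest address covered by the map
 * line_size: bytes per memory line, e.g. 1024
 * counts:    output array of n_lines counters, zeroed by this function
 * Addresses outside [base, base + n_lines * line_size) are ignored. */
void count_line_refs(const uint64_t *addrs, size_t n_addrs, uint64_t base,
                     uint64_t line_size, unsigned *counts, size_t n_lines)
{
    for (size_t i = 0; i < n_lines; i++)
        counts[i] = 0;
    if (line_size == 0)
        return;
    for (size_t i = 0; i < n_addrs; i++) {
        if (addrs[i] < base)
            continue;
        uint64_t line = (addrs[i] - base) / line_size;
        if (line < n_lines)
            counts[line]++;
    }
}
```

Each counter can then be mapped to a color between the low and high ends of the scale.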
Memory Ref. Frequency (mpeg2decode) Memory Reference map (per partition) with 1024B memory lines
Neighbor Affinity • Metric to describe how well memory layout is suited to caching • Consider a slice S of length w of the whole memory access trace and two loads L1, L2 ∈ S • If |L1addr – L2addr| < line size, then L1 and L2 exhibit neighbor affinity for slice size w
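The definition above can be sketched as a pairwise check within one slice of the trace. Counting qualifying pairs is an illustrative choice, since how the toolkit aggregates affinity into a single metric is not stated; the function name and the array-of-addresses trace format are likewise assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Count pairs of loads in the slice addrs[start .. start+w-1] whose addresses
 * differ by less than line_size (hypothetical helper). The slice is clamped
 * to the end of the trace. */
size_t affinity_pairs_in_slice(const uint64_t *addrs, size_t n, size_t start,
                               size_t w, uint64_t line_size)
{
    size_t end = (start + w < n) ? start + w : n;
    size_t pairs = 0;
    for (size_t i = start; i < end; i++) {
        for (size_t j = i + 1; j < end; j++) {
            /* Absolute difference without signed arithmetic. */
            uint64_t d = addrs[i] > addrs[j] ? addrs[i] - addrs[j]
                                             : addrs[j] - addrs[i];
            if (d < line_size)
                pairs++;
        }
    }
    return pairs;
}
```

The inner loop is quadratic in w, which is acceptable for small slices; a production tool would more likely sort or bucket addresses by line.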
Alias Analysis for libode • Basic AA (least precise, fastest) • Aggressive local analysis • Non context sensitive • Non-flow sensitive • Total number of queries 119520497 • “No Alias” 35924925 • “May Alias” 83492482 • “Must Alias” 103090
Alias Analysis (contd) • Globals Mod/Ref • context-sensitive mod/ref and alias analysis for internal global variables • Very fast, very precise, limited scope • Total number of queries 119520497 • “No Alias” 35944215 • “May Alias” 83473192 • “Must Alias” 103090
Alias Analysis (contd) • Andersen’s AA algorithm • Subset-based, flow-insensitive, context-insensitive, and field-insensitive alias analysis • Very precise, but slow. • Total number of queries 119520497 • “No Alias” 35924925 with Basic AA versus 79361105 here • “No Alias” 79361105 • “May Alias” 40057171 • “Must Alias” 102221
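The three sets of query counts for libode are easier to compare as shares of the total. The helper below is a hypothetical convenience function, not part of GLIMPSES or LLVM; it uses integer arithmetic and truncates to hundredths of a percent.

```c
/* Share of queries with a given result, in hundredths of a percent
 * (basis points), truncated. Returns 0 when total is 0. */
unsigned long long share_bp(unsigned long long count, unsigned long long total)
{
    return total ? (count * 10000ULL) / total : 0;
}
```

Applied to the “No Alias” counts reported for libode, this gives about 30.1% for Basic AA, about 30.1% for Globals Mod/Ref, and about 66.4% for the Andersen-style analysis, out of 119520497 queries each.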
Ranking (MPEG2Encode) • Criteria based • Code Size (csize) • Stack Size (ssize) • Heap Size (hsize) • Branch density (br_density) • Autovectorizable loops (av_loops) • Is LS memory limit likely to be hit (ls_limit) • Rank = w1*csize + w2*ssize + w3*hsize + w4*br_density + w5/(1 + av_loops) + w6*ls_limit (wi is the weight for the i-th criterion)
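The rank formula translates directly into code. In the sketch below the struct layout, the units of each criterion, and the encoding of ls_limit as 0.0 or 1.0 are assumptions; only the formula itself comes from the slide.

```c
/* Per-function criteria named on the slide; units and scaling are assumed. */
struct criteria {
    double csize;       /* code size */
    double ssize;       /* stack size */
    double hsize;       /* heap size */
    double br_density;  /* branch density */
    double av_loops;    /* number of autovectorizable loops */
    double ls_limit;    /* assumed: 1.0 if the LS limit is likely hit, else 0.0 */
};

/* Rank = w1*csize + w2*ssize + w3*hsize + w4*br_density
 *        + w5/(1 + av_loops) + w6*ls_limit, with w[0..5] holding w1..w6. */
double rank_of(const struct criteria *c, const double w[6])
{
    return w[0] * c->csize + w[1] * c->ssize + w[2] * c->hsize
         + w[3] * c->br_density + w[4] / (1.0 + c->av_loops)
         + w[5] * c->ls_limit;
}
```

Note that av_loops appears in a denominator, so more autovectorizable loops lower the rank, while every other criterion raises it.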
Partitioning • Preprocessing: Propagate ranks upwards in the call graph Rank(n) = Rank(n) + ∑ Rank(n→child[i]) • Input: Call graph consisting of nodes annotated with ranks • Output: Graph partitions that are suitable for execution on the SPEs • A partition P is deemed “suitable” if Rank(P→root) < Threshold
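The preprocessing and suitability test can be sketched on a call tree as follows. This is a simplified illustration under stated assumptions: the node type is hypothetical, the call graph is assumed to be a tree (no shared callees or recursion), and the top-down choice of the highest qualifying node as a partition root is one plausible strategy, not necessarily the one GLIMPSES uses.

```c
#include <stddef.h>

#define MAX_CHILDREN 8

/* Hypothetical call-tree node. */
struct node {
    double rank;                      /* own rank, then accumulated rank */
    int is_partition_root;            /* set by select_partitions */
    size_t n_children;
    struct node *child[MAX_CHILDREN];
};

/* Rank(n) = Rank(n) + sum of Rank(child[i]), applied bottom-up. */
double propagate_ranks(struct node *n)
{
    for (size_t i = 0; i < n->n_children; i++)
        n->rank += propagate_ranks(n->child[i]);
    return n->rank;
}

/* Mark the highest nodes whose accumulated rank is below the threshold and
 * return how many were marked. Nodes below a marked root are not visited,
 * so their flags are left untouched. */
size_t select_partitions(struct node *n, double threshold)
{
    n->is_partition_root = 0;
    if (n->rank < threshold) {
        n->is_partition_root = 1;
        return 1;
    }
    size_t found = 0;
    for (size_t i = 0; i < n->n_children; i++)
        found += select_partitions(n->child[i], threshold);
    return found;
}
```

Call propagate_ranks once on the root before calling select_partitions. Raising the threshold moves partition roots higher in the tree, which produces fewer and larger partitions.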
Effect of threshold on partitions (mpeg2decode)
GLIMPSES status • Beta version available for download at: http://glimpses.sourceforge.net • 300 MB source code package (includes visualizer) • Lines of code (C/C++): 447,000 • Third-party tools integrated: LLVM (compiler), Prefuse (visualization) • Executable size: 422 MB (x86 binaries) • Typical trace size: 900 MB (LIBODE) • Man-hour effort: ~750 • Releases: v.0.8, based on LLVM version 1.8 (July 7th); v.1.0, based on LLVM version 2.0 (undergoing testing) • Tested to work with large codebases: LIBODE (115,000 lines of code), mpeg2 (10,000 lines of code), SPEC INT 2000, etc.
Ongoing and future work • More Validation • Compare partitions produced with those generated by expert programmers • An inter-procedural, flow-sensitive, context-sensitive alias analysis algorithm
Ongoing and future work • Function data dependence graph • Encapsulates data flow between functions • Arguments, aliases, globals • Important factor in partitioning decisions – “affinity between pairs of functions”