Semester report summary
Adam Leko, 1/25/2005
HCS Research Laboratory, University of Florida
Programming practices: CAMEL
• CAMEL: parallelization of an existing cipher written by members of the HCS lab
• MPI and UPC versions written
• Spinlock implementation in the MPI version forced a rewrite of the master/worker-style code
• Relatively easy to port the existing C code; only slight restructuring of the application was required
• Conclusions
  • Good overall performance
  • Not much difference between MPI/UPC or across platforms
  • MPI code roughly 100 lines of code longer than the UPC version
Programming practices: Bench9 (mod 2^N inverse) & convolution
• Convolution: simple image/signal-processing operation (see the sketch after this list)
  • Embarrassingly parallel operation
  • MPI, UPC, and SHMEM versions written
• Bench9: part of the NSA benchmark suite
  • Quick, embarrassingly parallel computation (memory intensive)
  • Bandwidth-intensive, sequential check phase
  • MPI, UPC, and SHMEM versions written
• Conclusions
  • UPC compiler can add overhead
  • MPI most difficult to write (all communication must be mapped out manually)
  • One-sided SHMEM get/put simplified things
  • Bench9
    • UPC easiest to write, but worst performance
    • UPC also most sensitive to performance optimizations
  • Convolution
    • Near-linear speedup obtained for all platforms and versions
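Why the convolution benchmark parallelizes so easily: each output element depends only on the input and the kernel, so disjoint output ranges can be computed with no communication. A minimal sketch in C, assuming a 1-D convolution where each process computes a contiguous slice of the output; the function name and signature are illustrative, not taken from the original benchmark code.

```c
#include <stddef.h>

/* Illustrative 1-D convolution slice. Disjoint output ranges [lo, hi) are
 * independent, which is why the benchmark is embarrassingly parallel. */
void conv_slice(const double *in, size_t n,
                const double *kernel, size_t k,
                double *out, size_t lo, size_t hi)
{
    for (size_t i = lo; i < hi && i + k <= n; i++) {
        double acc = 0.0;
        for (size_t j = 0; j < k; j++)
            acc += in[i + j] * kernel[j];
        out[i] = acc;   /* no communication needed between slices */
    }
}
```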
Programming practices: Concurrent wave equation
• Wave equation: parallelization of an existing program that simulates the waveform of a stationary plucked string
• Compute-bound, memory-intensive algorithm
• Conclusions
  • Near-linear speedup obtained
  • Construct performance difference: array+j performed slightly better than &(array[j]) (see the sketch below)
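In plain C the two constructs are equivalent by definition; the measured difference is an artifact of the UPC compiler's code generation for shared arrays. A minimal illustration of the two forms, with a placeholder array and loop that are not from the wave-equation code:

```c
/* Two ways to form a pointer to element j; semantically identical, but the
 * report found the pointer-arithmetic form slightly faster under the UPC
 * compiler used. Array name, size, and loop body are placeholders. */
#define N 1024
double wave[N];

void touch_all(void)
{
    for (int j = 0; j < N; j++) {
        double *p = wave + j;          /* pointer arithmetic: measured slightly faster */
        /* double *p = &(wave[j]); */  /* address-of-subscript: same address */
        *p += 1.0;
    }
}
```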
Programming practices: Depth-first search
• Depth-first search: tree-searching algorithm
  • Binary tree represented as an array (see the sketch below)
  • Simple to implement sequentially
• UPC implementation strategy: "spawn" workers as the depth of the search increases
• Conclusions
  • UPC does not directly support dynamically spawning threads!
  • Optimizations can have a large effect (see chart in the original slides)
  • Slowdown (speedup below 1) observed due to communication overhead
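A minimal sketch of the array representation and a sequential DFS over it, assuming the usual implicit layout where node i has children 2i+1 and 2i+2; the names, tree size, and node values are illustrative, not from the benchmark code.

```c
#include <stdio.h>

/* Complete binary tree stored in a plain array (could equally be a UPC shared
 * array): node i has children 2*i + 1 and 2*i + 2, so no pointers are needed. */
#define N 15
static int tree[N];

static void dfs(int i)
{
    if (i >= N)
        return;
    printf("visiting node %d (value %d)\n", i, tree[i]);
    dfs(2 * i + 1);   /* left subtree  */
    dfs(2 * i + 2);   /* right subtree */
}

int main(void)
{
    for (int i = 0; i < N; i++)
        tree[i] = i * 10;
    dfs(0);
    return 0;
}
```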
Optimizations reviewed
• Broken into categories according to when the optimization is performed:
  • Sequential compiler optimizations: specific to sequential compilers; includes techniques such as loop unrolling, software pipelining, etc.
  • Pre-compilation optimization methods: deal with high-level issues such as data placement and load balancing
  • Compile-time optimizations: strategies used by HPF, Co-Array Fortran, and OpenMP compilers
  • Runtime optimizations: dynamic load balancing, etc.
  • Post-runtime optimizations: analyze trace files, etc.
Sequential compiler optimizations (a few of these are illustrated in the sketch after this list)
• Reduction transformations
  • Purpose
    • Eliminate duplicated work
    • Transform individual statements into equivalent statements of lesser cost
  • Examples
    • Replace X^2 with X * X (algebraic simplification and strength reduction)
    • Store common subexpressions so they are computed only once (common subexpression elimination)
    • Short-circuit the evaluation of boolean expressions (short-circuiting)
• Function transformations
  • Purpose: reduce the overhead of function calls
  • Examples
    • Store arguments to functions in registers instead of on the stack (parameter promotion)
    • Replicate function code to eliminate function-call overhead (function inlining)
    • Cache results from functions that have no side effects (function memoization)
• Loop transformations
  • Purpose
    • Reduce computational complexity
    • Increase parallelism
    • Improve memory-access characteristics
  • Examples
    • Move loop-invariant code outside the loop to reduce computation per iteration (loop-invariant code motion)
    • Reorder instructions to pipeline memory accesses (loop pipelining)
    • Merge different loops to reduce loop-counter overhead (loop fusion)
    • Split loops into pieces to vectorize operations (strip mining, loop tiling)
• Memory-access transformations
  • Purpose
    • Reduce the cost of memory operations
    • Restructure the program to reduce the number of memory accesses
  • Examples
    • Pad arrays so they fit cache-line sizes (array padding)
    • Replicate code in the binary to improve I-cache efficiency (code co-location)
    • Keep commonly used memory locations pegged in registers (scalar replacement)
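A hand-written before/after sketch of three of the transformations named above (strength reduction, common subexpression elimination, loop-invariant code motion). Compilers normally apply these automatically; the function and variable names here are made up for the example.

```c
#include <math.h>

/* Before: pow() call, the same subexpression computed twice, and all of it
 * recomputed every iteration. */
double sum_before(const double *a, int n, double x, double y)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i] * (pow(x, 2.0) + y) + (pow(x, 2.0) + y);
    return s;
}

/* After strength reduction (pow -> multiply), common subexpression elimination,
 * and loop-invariant code motion (hoisting t out of the loop). */
double sum_after(const double *a, int n, double x, double y)
{
    double t = x * x + y;   /* computed once, outside the loop */
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i] * t + t;
    return s;
}
```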
Pre-compilation optimizations
• Tiling (see the sketch below)
  • Purpose: automatically parallelize sequential loops
  • Similar to the loop tiling performed by vectorizing sequential compilers
  • Works for programs that make heavy use of nested for loops
  • Transforms loops into atomic pieces that can be executed independently
  • Issues: tile shapes, mapping tiles to processors
• Augmented data access descriptors (ADADs)
  • Purpose: automatically parallelize Fortran do loops
  • Instead of analyzing loop dependencies, ADADs represent how sections of code affect each other
  • Loop fusion and other loop-parallelization techniques can be applied directly to ADADs
  • Potentially lets compilers use ADADs to choose between different optimization techniques
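A minimal loop-tiling sketch: the i/j loop nest is split into independent TILE x TILE blocks, each of which touches a bounded working set and, having no cross-block dependencies here, could be assigned to a different processor. The function name, tile size, and element-wise operation are assumptions for illustration, not from the report.

```c
#define TILE 64

/* Tiled element-wise add over an n x n matrix stored row-major.
 * Each (ii, jj) block is an atomic unit of work. */
void tiled_add(int n, double *a, const double *b)
{
    for (int ii = 0; ii < n; ii += TILE)
        for (int jj = 0; jj < n; jj += TILE)
            for (int i = ii; i < ii + TILE && i < n; i++)
                for (int j = jj; j < jj + TILE && j < n; j++)
                    a[i * n + j] += b[i * n + j];
}
```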
Compile-time optimizations
• General compile-time strategies
  • Purpose
    • Eliminate unnecessary communication
    • Reduce the cost of communication
  • Examples
    • Align arrays to fit shared-memory cache-line sizes (cache alignment)
    • Group data together before sending so as to reduce the number of messages sent (message vectorization, coalescing, and aggregation; see the sketch below)
    • Overlap communication and computation by splitting the receive operation into two phases (message pipelining)
• Existing compilers
  • PARADIGM
    • HPF compiler that uses an abstract model to determine how to decompose HPF statements
    • Optimizations performed: message coalescing, vectorization, and pipelining; overlapping of loops that cannot be parallelized due to loop-carried dependencies (coarse-grained pipelining)
  • McKinley's algorithm
    • Splits the compilation phase into 4 stages: optimization, fusing, parallelization, and enabling
    • Uses a wide variety of optimization techniques
    • Author argues all techniques are necessary to get good performance out of "dusty deck" (unmodified sequential) codes
  • ASTI compiler
    • Existing sequential compiler developed by IBM, extended to support SMP machines
    • Uses
      • Models of cache misses and TLB access costs, in addition to many sequential optimizations
      • "Function outlining" (the opposite of function inlining) to simplify thread storage
      • A dynamic self-scheduling load-balancing library
    • Not very good results on a 4-CPU machine compared to hand-tuned code
  • dHPF compiler
    • High Performance Fortran compiler developed at Rice to automatically parallelize HPF code
    • Uses many (previously listed) existing communication optimizations
    • Adds two that are necessary for good performance on the NAS benchmarks
      • Bringing in local copies of read-only, loop-invariant (minus antidependencies) variables for each thread
      • Replication of computation via a special LOCALIZE statement to avoid unnecessary communication for quick computations
    • Competitive results obtained on the NAS benchmarks
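A sketch of message vectorization/coalescing in plain MPI (illustrative code, not generated by any of the compilers listed above): instead of sending n one-element messages, rank 0 sends the whole buffer once, paying the per-message overhead a single time. The buffer size and data are placeholders.

```c
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, n = 256;
    double data[256];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < n; i++) data[i] = i;
        /* naive version: n separate MPI_Send calls, one element each */
        MPI_Send(data, n, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);   /* one coalesced message */
    } else if (rank == 1) {
        MPI_Recv(data, n, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```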
Runtime optimizations
• Why do optimizations at runtime?
  • Optimizing earlier is less costly
  • But for irregular applications, runtime is the only choice
• Inspector/executor scheme (see the sketch below)
  • Created for applications whose work distribution is not known until runtime
  • Inspector creates a "plan" for work distribution at runtime
  • Executor is in charge of orchestrating execution of the plan created by the inspector
  • Overhead of the inspector must be balanced against the overall work distribution
  • Implemented in the PARTI library
• Nikolopoulos' method
  • OpenMP-specific method that uses unmodified OpenMP APIs
  • Uses a few short probing iterations
  • Probing iterations indicate where work imbalance exists
  • Greedy method redistributes work among processors to even things out
  • Worked well for such a simple method (within 33% of hand-tuned MPI code)
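A minimal inspector/executor sketch, assuming the simplest possible "plan": the inspector looks at runtime work estimates and assigns each iteration to a worker, and the executor runs only the iterations its worker owns. This is an illustration of the idea, not the PARTI API; real inspectors also build communication schedules.

```c
#include <stdlib.h>

typedef struct {
    int *owner;   /* owner[i] = worker assigned to iteration i */
    int  n;
} plan_t;

/* Inspector: build the work-distribution plan from runtime information. */
plan_t inspect(const int *work_estimate, int n, int nprocs)
{
    plan_t p = { malloc(n * sizeof(int)), n };
    long *load = calloc(nprocs, sizeof(long));
    for (int i = 0; i < n; i++) {
        int best = 0;   /* toy greedy rule: least-loaded worker gets the iteration */
        for (int w = 1; w < nprocs; w++)
            if (load[w] < load[best]) best = w;
        p.owner[i] = best;
        load[best] += work_estimate[i];
    }
    free(load);
    return p;
}

/* Executor: carry out the plan; each worker runs only its own iterations. */
void execute(const plan_t *p, int my_rank, void (*body)(int))
{
    for (int i = 0; i < p->n; i++)
        if (p->owner[i] == my_rank)
            body(i);
}
```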
Post-runtime optimizations
• Ad-hoc methods
  • Rely on rudimentary analysis to guide the programmer on what to work on
  • Follow a code, instrument, run, analyze, code, instrument, … loop
  • Rely heavily on luck and the skill of the programmer
  • Most widely used method today!
• PARADISE
  • Analyzes trace files generated by the Charm++ parallel library/runtime system (developed at UIUC)
  • Suggested optimizations deal with distributed object-based systems
  • Can be applied automatically by means of a "hint" file given to the Charm++ runtime system
• KAPPA-PI
  • Knowledge-based system that identifies bottlenecks using a rule-based system
  • Bottlenecks are presented to the user, correlated with source code
  • Also gives recommendations on how to fix the problems identified
  • Seems very rudimentary and aimed at novice programmers
  • Difficult problem, but seems potentially very valuable
Performance modeling overview
• Why? Several reasons
  • Grid systems: need a way to estimate how long a program will take (billing/scheduling issues)
  • Could be used in conjunction with optimization methods to suggest improvements to the user
  • Can also guide the user on what kind of benefit to expect from optimizing aspects of the code
  • Figure out how far code is from optimal performance
  • Indirectly detect problems: if a section of code is not performing as predicted, it probably has cache-locality or similar problems
• Challenge
  • Many models already exist, with varying degrees of accuracy and speed
  • Choose the best model to fit into the UPC/SHMEM PAT
• Existing performance models fall into different categories
  • Formal models (process algebras, Petri nets)
  • General models that provide "mental pictures" of hardware/performance
  • Predictive models that try to estimate timing information
Formal performance models
• Least useful for our purposes
  • Formal methods are strongly rooted in math
  • Can make strong statements and guarantees
  • However, difficult to adapt and automate for new programs
• Petri nets
  • Specialized graphs that represent processes and systems
  • Very generic method of modeling many different things
  • Older (invented 1962) and more mature, but Petri nets don't provide much groundwork for parallel program modeling
• Process algebras
  • Formal algebras for specifying parallel processes and how they interact
  • Hoare's CSP, Milner's CCS
  • Entire books are devoted to this subject
  • Complicated to use, but can prove properties such as deadlock freedom
• Queueing theory
  • Very strongly rooted in math (ECE courses on the subject)
  • Hard to apply to real-world programs
• PAMELA
  • C-style imperative language used to model concurrent and time-related operations
  • Similar to process algebras, but geared towards simulation of models created in the PAMELA language
  • Much work required to create PAMELA models directly from source code or trace files
  • Models encode high-level parallel information about what is going on in a program
  • Reductions are necessary to shrink PAMELA models to feasible simulation times
General performance models
• Provide the user with a "mental picture"
  • Rules of thumb for the cost of operations
  • Guide strategies used while creating programs
• PRAM
  • Classic model that uses unit-cost operations for all memory accesses
  • Useful for determining the parallel complexity of an algorithm
  • Very easy to work with
  • Not very accurate
    • No synchronization costs
    • Uniform memory-access cost
    • Simplistic contention model (combinations of concurrent/exclusive reads and writes)
• BSP (cost expression sketched below)
  • Aims to provide a bridging tool between software and hardware, much as the von Neumann model has done for sequential programming
  • Breaks communication and computation into phases (supersteps)
  • Barriers are performed between all supersteps
  • Uses a simplistic communication model (processors are assumed to send a fixed number of messages in each superstep)
  • Reasonable accuracy (~20% for CFD)
• LogP (cost expression sketched below)
  • Model that only takes communication cost into consideration
  • Latency, overhead, gap, number of processors
  • Simple model to work with
  • Predicts network performance well (though extensions are needed for modern networks)
  • Has been applied to predict memory performance in the past, with only moderate success (memory LogP)
• Other interesting findings
  • One paper modeled the compiler overhead introduced by Dataparallel C compilers (scaling factor)
  • Application-specific models are not useful for a PAT
  • Adding tons of parameters from microbenchmarks gives complicated equations but not necessarily better accuracy
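For context, the standard cost expressions behind two of these models, in their usual textbook form (not taken from the report). Symbols: w = local computation per superstep, h = maximum messages sent or received by any processor in a superstep, g = per-message gap/bandwidth parameter, l = barrier cost, L = network latency, o = per-message send/receive overhead.

```latex
% BSP: cost of one superstep
T_{\mathrm{superstep}} = w + g \cdot h + l

% LogP: end-to-end time for one small point-to-point message
T_{\mathrm{msg}} \approx o_{\mathrm{send}} + L + o_{\mathrm{recv}} = L + 2o
```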
Predictive performance models [1]
• Models that specifically predict the performance of parallel codes
• Lost cycles analysis
  • Geared towards real-world usage
  • Simple idea: anything that is not computation is not useful
  • Program state recorded by setting flags (manual(?) instrumentation)
  • States are sampled or logged
  • Predicates are used to determine whether a program is losing cycles
    • E.g.: LoadImbalance(x) ≡ WorkExists ∧ ProcessorsIdle(x)
  • Authors assert that the predicates they use are orthogonal and complete
  • Good accuracy (within ~12.5% for FFT)
  • Not clear how to relate the information back to the source level
• Task graphs (a toy sketch follows below)
  • Common technique, similar to process algebras and PAMELA
  • Graphically model the amount of parallelism inherent in a given program
  • Also take into account dependencies between tasks
  • Complex program control is approximated via mean execution times (assumed deterministic)
  • Graphs are used in conjunction with system models to quickly predict performance
  • Good accuracy (although it depends on the quality of the system models)
  • Open-ended enough to adapt to a PAT
  • Analysis can also be incorporated into the task graph, since it represents program structure
  • Generating task graphs may be difficult, even from program traces
• VFCS
  • Vienna Fortran Compilation System: a parallelizing Fortran compiler that uses a predictive model to parallelize code
  • Uses a profiling stage on the sequential code to determine its characteristics
  • Predictive model uses analytic techniques (an earlier version used simulation)
  • GUI/IDE incorporates the "cost" of each statement during the coding phase
  • Cannot be extended (old, large code base), but useful for examining the techniques used by the system
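A toy task-graph sketch, assuming the simplest possible prediction: nodes carry mean execution times, edges are dependencies, and with unbounded processors and zero communication cost the predicted runtime is the critical-path length. The tasks, costs, and dependency matrix are made up for illustration; none of the tools above use exactly this representation.

```c
#include <stdio.h>

#define NTASKS 4

double cost[NTASKS]         = { 2.0, 3.0, 1.5, 2.5 };   /* mean execution times */
/* pred[i][j] != 0 means task j must finish before task i starts */
int    pred[NTASKS][NTASKS] = {
    {0,0,0,0},   /* task 0: no predecessors         */
    {1,0,0,0},   /* task 1 depends on task 0        */
    {1,0,0,0},   /* task 2 depends on task 0        */
    {0,1,1,0},   /* task 3 depends on tasks 1 and 2 */
};

int main(void)
{
    double finish[NTASKS], makespan = 0.0;
    for (int i = 0; i < NTASKS; i++) {          /* tasks listed in topological order */
        double ready = 0.0;
        for (int j = 0; j < i; j++)
            if (pred[i][j] && finish[j] > ready)
                ready = finish[j];              /* wait for the latest predecessor */
        finish[i] = ready + cost[i];
        if (finish[i] > makespan)
            makespan = finish[i];
    }
    printf("predicted makespan: %.1f\n", makespan);   /* critical-path length */
    return 0;
}
```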
Predictive performance models [2]
• PACE
  • Novel idea: generate predictive traces that can be viewed with existing tools (SvPablo)
  • Geared towards grid applications
  • Uses the performance language CHIP3S to model program performance
  • Models are compiled and can be quickly evaluated
  • No standard way of creating the performance models is illustrated
• Convolution
  • Uses several existing tools to predict the performance of MPI applications
  • "Convolves" system characteristics with application characteristics
    • System: memory performance (MAPS) and network performance (PMB)
    • Application: memory accesses (MetaSim tracer) and network accesses (MPITrace)
  • Fairly good accuracy (within ~20% when predicting matrix multiply on another platform)
  • Currently limited to the Alpha platform
  • Requires programs to be run before predictions can be made
  • The convolution method is not detailed in any available papers
• Several other models considered (see report, section 5)