AUTOMATICALLY TUNING PARALLEL AND PARALLELIZED PROGRAMS
Chirag Dave and Rudolf Eigenmann
Purdue University
GOALS
• Automatic parallelization without loss of performance
  • Use automatic detection of parallelism
  • Parallelization is overzealous: remove overhead-inducing parallelism
  • Ensure no performance loss over the original program
• Generic tuning framework
  • Empirical approach: use program execution to measure benefits
  • Offline tuning
AUTO vs. MANUAL PARALLELIZATION
• Manual: the source program is hand-parallelized into a parallel program, and the user then tunes it for performance; this takes significant development time
• Automatic: a parallelizing compiler transforms the source program; state-of-the-art auto-parallelization runs in the order of minutes
AUTO-PARALLELISM OVERHEAD
• Loop-level parallelism:

    int foo() {
        for (i = 0; i < 10; i++) {      /* outer loop carries a dependence on a[] */
            a[i] = c;
            #pragma omp parallel for private(j, t)
            for (j = 0; j < 10; j++) {
                t = a[i-1];
                b[j] = (t * b[j]) / 2.0;
            }
        }
    }

• The inner parallel loop forks and joins on every outer iteration: fork/join overheads and load balancing costs compete with the work in the parallel section
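For instance, if measurement shows that the inner loop's fork/join cost outweighs its parallel work, selective parallelization would emit it serial. A hypothetical tuned version of the nest above (declarations added to make it self-contained, and the lower bound shifted to 1 so a[i-1] stays in range):

    /* Hypothetical tuned output: the inner pragma has been removed because
     * its per-iteration fork/join and load-balancing overhead outweighed
     * the small amount of work in the parallel section. */
    double a[10], b[10], c;

    int foo(void)
    {
        int i, j;
        double t;
        for (i = 1; i < 10; i++) {
            a[i] = c;
            for (j = 0; j < 10; j++) {   /* runs serially: no fork/join */
                t = a[i - 1];
                b[j] = (t * b[j]) / 2.0;
            }
        }
        return 0;
    }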
NEED FOR AUTOMATIC TUNING
• Identify, at compile time, the optimization strategy for maximum performance
• Beneficial parallelism
  • Which loops to parallelize
  • Parallel loop coverage
OUR APPROACH
• Find the best combination of loops to parallelize
• Offline tuning
• Decisions based on actual execution time
SEARCH SPACE NAVIGATION
• Search space: the set of parallelizable loops
• Generic tuning algorithm
  • Capture interactions among loops
  • Use program execution time as the decision metric
• Combined Elimination
  • Each loop is an on/off optimization
  • Selective parallelization
• Pan, Z. and Eigenmann, R.: Fast and effective orchestration of compiler optimizations for automatic performance tuning. In: The 4th Annual International Symposium on Code Generation and Optimization (CGO), March 2006, pp. 319–330.
TUNING ALGORITHM
• Batch Elimination
  • Considers the effect of each optimization separately
  • Eliminates harmful optimizations instantly
• Iterative Elimination
  • Considers interactions
  • More tuning time
• Combined Elimination (sketched below)
  • Considers interactions amongst a smaller subset
  • Iterates over the subset, performs batch elimination, and sets a new base case each round
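A minimal sketch of combined elimination in C, assuming a measure_time() hook that stands in for generating, compiling, and timing one program version. The loop count, the synthetic timing model, and the one-loop-per-round elimination are illustrative simplifications, not the paper's exact procedure:

    /* Sketch: combined elimination over N on/off loop-parallelization
     * decisions. measure_time() is a hypothetical stand-in for running
     * one program version on the train data set. */
    #include <stdio.h>

    #define N 4                        /* number of parallelizable loops */

    static double measure_time(const int cfg[N])
    {
        /* Synthetic model for illustration: loop 2 is overhead-inducing. */
        double t = 10.0;
        if (cfg[0]) t -= 2.0;          /* clearly profitable in parallel */
        if (cfg[1]) t -= 0.5;          /* mildly profitable              */
        if (cfg[2]) t += 1.5;          /* fork/join overhead dominates   */
        if (cfg[3]) t -= 0.1;
        return t;
    }

    int main(void)
    {
        int cfg[N], i, changed = 1;
        for (i = 0; i < N; i++) cfg[i] = 1;   /* base: all loops parallel */
        double base = measure_time(cfg);

        while (changed) {                     /* iterate to a new base case */
            changed = 0;
            double rip[N];
            int worst = -1;
            /* Batch step: RIP of switching each remaining loop off. */
            for (i = 0; i < N; i++) {
                if (!cfg[i]) continue;
                cfg[i] = 0;
                rip[i] = (measure_time(cfg) - base) / base * 100.0;
                cfg[i] = 1;
                if (rip[i] < 0.0 && (worst < 0 || rip[i] < rip[worst]))
                    worst = i;                /* negative RIP: off is faster */
            }
            if (worst >= 0) {                 /* serialize the worst loop */
                cfg[worst] = 0;
                base = measure_time(cfg);     /* new base case */
                changed = 1;
            }
        }
        for (i = 0; i < N; i++)
            printf("loop %d: %s\n", i, cfg[i] ? "parallel" : "serial");
        return 0;
    }

With the synthetic model above, the search serializes loop 2 and keeps the other three loops parallel.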
CETUNE INTERFACE

    int foo() {
        #pragma cetus parallel ...
        for (i = 0; i < 50; i++) {
            t = a[i];
            a[i+50] = t + (a[i+50] + b[i]) / 2.0;
        }
        for (i = 0; i < 10; i++) {
            a[i] = c;
            #pragma cetus parallel ...
            for (j = 0; j < 10; j++) {
                t = a[i-1];
                b[j] = (t * b[j]) / 2.0;
            }
        }
    }

• cetus -ompGen -tune-ompGen="1,1" : parallelize both loops
• cetus -ompGen -tune-ompGen="1,0" or cetus -ompGen -tune-ompGen="0,1" : parallelize one loop and serialize the other
• cetus -ompGen -tune-ompGen="0,0" : serialize both loops
EMPIRICAL MEASUREMENT
• Input source code and train data set
• Automatic parallelization using Cetus yields the start configuration
• Tuning loop:
  • Next point in the search space
  • Version generation using tuner input
  • Back-end code generation (ICC)
  • Runtime performance measurement on the train data set (Intel Xeon, dual quad-core)
  • Decision based on RIP (relative improvement percentage)
• Final configuration
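One trip around this loop might look like the sketch below; the command lines, file names, output path, and compiler flags are assumptions for illustration, not the paper's actual harness:

    /* Hypothetical driver for one tuning step: generate a version from the
     * tuner's bit vector, compile it, and time one run on the train data. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static double run_version(const char *bits)
    {
        char cmd[256];
        struct timespec t0, t1;

        /* Version generation using tuner input (syntax as on the CETUNE slide). */
        snprintf(cmd, sizeof cmd, "cetus -ompGen -tune-ompGen=\"%s\" foo.c", bits);
        if (system(cmd) != 0) return -1.0;

        /* Back-end code generation; output location and icc flags assumed. */
        if (system("icc -openmp cetus_output/foo.c -o foo_tuned") != 0)
            return -1.0;

        /* Runtime performance measurement: wall-clock time of one run. */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (system("./foo_tuned < train.in > /dev/null") != 0) return -1.0;
        clock_gettime(CLOCK_MONOTONIC, &t1);
        return (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    }

    int main(void)
    {
        double t = run_version("1,0");   /* one point in the search space */
        if (t >= 0.0) printf("execution time: %.3f s\n", t);
        return 0;
    }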
CONTRIBUTIONS
• Described a compiler plus empirical system that detects parallel loops in serial and parallel programs and selects the combination of parallel loops that gives the highest performance
• Finding profitable parallelism can be done using a generic tuning method
• The method can be applied on a section-by-section basis, allowing fine-grained tuning of program sections
• Using a set of NAS and SPEC OMP 2001 benchmarks, we show that the auto-parallelized and tuned version nearly equals or improves on the performance of the original serial or parallel program