
AUTOMATICALLY TUNING PARALLEL AND PARALLELIZED PROGRAMS


Presentation Transcript


  1. AUTOMATICALLY TUNING PARALLEL AND PARALLELIZED PROGRAMS Chirag Dave and Rudolf Eigenmann Purdue University

  2. GOALS • Automatic parallelization without loss of performance • Use automatic detection of parallelism • Parallelization is overzealous • Remove overhead-inducing parallelism • Ensure no performance loss over the original program • Generic tuning framework • Empirical approach • Use program execution to measure benefits • Offline tuning

  3. AUTO vs. MANUAL PARALLELIZATION [flow diagram] Manual path: Source Program → hand-parallelized Parallel Program (significant development time; the user tunes the program for performance). Automatic path: Source Program → Parallelizing Compiler → Parallel Program (state-of-the-art auto-parallelization runs in the order of minutes).

  4. AUTO-PARALLELISM OVERHEAD Loop-level parallelism:

    int foo() {
      #pragma omp parallel for private(i, j, t)
      for (i = 0; i < 10; i++) {
        a[i] = c;
        #pragma omp parallel for private(j, t)
        for (j = 0; j < 10; j++) {
          t = a[i-1];
          b[j] = (t * b[j]) / 2.0;
        }
      }
    }

  Each parallel region incurs fork/join overheads and load-balancing cost around the work in the parallel section (fork … work … join).

  5. NEED FOR AUTOMATIC TUNING • Identify, at compile time, the optimization strategy for maximum performance • Beneficial parallelism • Which loops to parallelize • Parallel loop coverage

  6. OUR APPROACH • Best combination of loops to parallelize • Offline tuning • Decisions based on actual execution time

  7. CETUS: VERSION GENERATION

  8. SEARCH SPACE NAVIGATION • Search space → the set of parallelizable loops • Generic tuning algorithm • Capture interactions • Use program execution time as the decision metric • COMBINED ELIMINATION • Each loop is an on/off optimization • Selective parallelization • Pan, Z., Eigenmann, R.: Fast and effective orchestration of compiler optimizations for automatic performance tuning. In: The 4th Annual International Symposium on Code Generation and Optimization (CGO), March 2006, pp. 319–330

  9. TUNING ALGORITHM BATCH ELIMINATION • Considers the effect of each optimization separately • Instant elimination ITERATIVE ELIMINATION • Considers interactions • More tuning time COMBINED ELIMINATION • Considers interactions amongst a subset • Iterates over the smaller subset and performs batch elimination, establishing a new base case after each elimination
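Combined elimination over on/off loop flags can be sketched as follows. This is a minimal Python sketch, not the authors' implementation; `measure` is a hypothetical callback returning the program's execution time for a given configuration, and the RIP (relative improvement percentage) computation is an illustrative reading of the algorithm described above.

```python
def combined_elimination(n_loops, measure):
    """Sketch of Combined Elimination (Pan & Eigenmann, CGO'06) applied to
    selective parallelization: each parallelizable loop is an on/off option;
    measure(config) returns the execution time for a tuple of 0/1 flags."""
    config = [1] * n_loops                # start with all loops parallel
    base_time = measure(tuple(config))
    improved = True
    while improved:
        improved = False
        # RIP of switching each still-parallel loop off, vs. the base case.
        rips = []
        for i in range(n_loops):
            if config[i] == 0:
                continue
            trial = list(config)
            trial[i] = 0
            t = measure(tuple(trial))
            rips.append((100.0 * (base_time - t) / base_time, i))
        # Greedily serialize harmful loops, re-measuring after each change
        # so that interactions are captured (unlike batch elimination).
        for rip, i in sorted(rips, reverse=True):
            if rip <= 0:
                continue                  # parallelism here is beneficial
            trial = list(config)
            trial[i] = 0
            t = measure(tuple(trial))
            if t < base_time:             # accept: this is the new base case
                config, base_time = trial, t
                improved = True
    return config, base_time
```

With a toy cost model in which one loop's parallelism pays off and the other's fork/join overhead dominates, the sketch keeps the first loop parallel and serializes the second.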

  10. CETUNE INTERFACE

    int foo() {
      #pragma cetus parallel …
      for (i = 0; i < 50; i++) {
        t = a[i];
        a[i+50] = t + (a[i+50] + b[i]) / 2.0;
      }
      for (i = 0; i < 10; i++) {
        a[i] = c;
        #pragma cetus parallel …
        for (j = 0; j < 10; j++) {
          t = a[i-1];
          b[j] = (t * b[j]) / 2.0;
        }
      }
    }

  cetus -ompGen -tune-ompGen="1,1" parallelizes both loops; cetus -ompGen -tune-ompGen="1,0" and cetus -ompGen -tune-ompGen="0,1" parallelize one loop and serialize the other; cetus -ompGen -tune-ompGen="0,0" serializes both loops.
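The configuration the tuner hands to Cetus can be rendered mechanically. A small sketch, assuming only what slide 10 shows (one comma-separated bit per parallelizable loop, in program order); the helper names are illustrative, not part of the Cetus interface:

```python
from itertools import product

def tune_flag(config):
    """Render a per-loop on/off vector as the -tune-ompGen argument
    from slide 10: one 0/1 bit per parallelizable loop, comma-separated."""
    return '-tune-ompGen="%s"' % ",".join(str(b) for b in config)

def all_configs(n_loops):
    """The full search space: every on/off combination of n_loops
    parallelizable loops (2^n points; the tuner navigates this with
    combined elimination rather than enumerating it exhaustively)."""
    return list(product((0, 1), repeat=n_loops))
```

For the two-loop example above, `all_configs(2)` yields the four configurations shown on the slide.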

  11. EMPIRICAL MEASUREMENT Tuning loop (flow diagram): input source code with the train data set → automatic parallelization using Cetus → start configuration → version generation using tuner input → back-end code generation with ICC → runtime performance measurement on an Intel Xeon dual quad-core → decision based on RIP → next point in the search space, iterating until the final configuration is reached.
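The measurement step of this loop can be sketched in a tool-agnostic way; `build` and `run` are injected stand-ins (assumptions, not the paper's scripts) for Cetus+ICC version generation and for executing the binary on the train data set:

```python
import time

def measure_version(build, run, repeats=3):
    """One empirical-measurement step of the tuning loop on slide 11.
    `build` stands in for version generation plus back-end compilation
    (e.g. cetus -ompGen -tune-ompGen=... followed by icc); `run` executes
    the resulting binary on the train data set. Returns the best
    wall-clock time over `repeats` runs, a common way to damp noise."""
    build()                               # generate and compile this version
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run()                             # execute on the train data set
        best = min(best, time.perf_counter() - start)
    return best
```

The returned time is what a combined-elimination tuner would feed into its RIP computation for the current point in the search space.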

  12. RESULTS

  13. RESULTS

  14. RESULTS

  15. CONTRIBUTIONS • Described a compiler plus empirical tuning system that detects parallel loops in serial and parallel programs and selects the combination of parallel loops that gives the highest performance • Finding profitable parallelism can be done with a generic tuning method • The method can be applied on a section-by-section basis, allowing fine-grained tuning of program sections • Using a set of NAS and SPEC OMP2001 benchmarks, we show that the auto-parallelized and tuned version performs close to, or better than, the original serial or parallel program

  16. THANK YOU!
