
Prospector: A Toolchain To Help Parallel Programming

Minjang Kim, Hyesoon Kim (HPArch Lab), and Chi-Keung Luk (Intel). This work will also be supported by Samsung.


Presentation Transcript


  1. Prospector: A Toolchain To Help Parallel Programming. Minjang Kim, Hyesoon Kim (HPArch Lab), and Chi-Keung Luk (Intel). This work will also be supported by Samsung.

  2. Motivation (1/2) • Parallel programming is hard • What if there were a tool that helps with parallel programming? • Some tools, such as race detectors, already exist • However, few tools guide the parallelization itself • A programmer wants to parallelize serial code: • Where to parallelize? • How to parallelize?

  3. Motivation (2/2) • We propose Prospector • A set of dynamic program analyzers that help parallelize serial code • Goals • Give information for finding the right parallelization targets • Provide advice on writing correct and optimized parallel code

  4. Overview of Prospector • Parallelism Pattern Advisor • Parallel Performance Analyzer • Parallelizable Section Finder • Parallel Speedup Predictor • Architecture Advisor • Loop-Centric Profiler [Figure: source code or a binary is the input to the tools; example outputs include a per-loop profile, predicted speedup vs. # of cores, and predicted speedup on CPU vs. GPU]

  5. Prospector: Loop-Centric Profiler • Q: Which code sections would be good for parallelization? • Mostly, frequently executed loops • Conventional profilers report only hot functions and instructions • We provide details of loop execution • # of trip counts → Sufficient work? • # of invocations → Low fork/join overhead? • Statistics on loop-iteration length (min, max, stdev) → Balanced? • Example profile for Loop1: Invocations: 8, Iterations: 5,000, Max Iter: 1,600, Min Iter: 40
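The per-loop statistics listed above can be illustrated with a small summarizer. This is a hedged sketch, not Prospector's implementation: it assumes a tracer (not shown) has already recorded, for one loop, a list of invocations, each holding the measured length of every iteration. The names `LoopStats` and `summarize_loop` are hypothetical.

```python
"""Hypothetical sketch: summarizing a loop-centric profile for one loop.

Assumption: a tracer (not shown) recorded each invocation of the loop as a
list of per-iteration lengths (e.g., in cycles or instructions).
"""
from dataclasses import dataclass
from statistics import pstdev


@dataclass
class LoopStats:
    invocations: int   # times the loop was entered (each one implies a fork/join)
    iterations: int    # total trip count over all invocations (amount of work)
    min_iter: float    # shortest single iteration
    max_iter: float    # longest single iteration
    stdev_iter: float  # spread of iteration lengths (a load-balance hint)


def summarize_loop(invocations: list[list[float]]) -> LoopStats:
    # Flatten every iteration length across all invocations.
    lengths = [length for inv in invocations for length in inv]
    if not lengths:
        raise ValueError("loop was never executed")
    return LoopStats(
        invocations=len(invocations),
        iterations=len(lengths),
        min_iter=min(lengths),
        max_iter=max(lengths),
        stdev_iter=pstdev(lengths),  # population standard deviation
    )


if __name__ == "__main__":
    # Two invocations: the first runs 3 iterations, the second runs 2.
    print(summarize_loop([[40, 50, 60], [45, 1600]]))
```

A large gap between `min_iter` and `max_iter` (as in the Loop1 example, 40 vs. 1,600) suggests iterations of very different cost, which is the kind of imbalance the "Balanced?" question is meant to flag.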

  6. Prospector: Parallel Speedup Predictor (1/2) • Q: What speedup can be expected? • Analytical models (e.g., Amdahl's Law) are impractical for predicting speedup in the presence of locks • Our approach • Dynamically predict speedup based on lightweight profiling • Challenges • How to model architectural factors (e.g., caches, memory)? [Figure: predicted speedup vs. # of cores (2, 4, 8)]
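For reference, Amdahl's Law gives the ideal speedup on p cores when a fraction f of the serial execution time can run in parallel. The worked value below assumes f = 0.9 and p = 8 purely for illustration:

```latex
S(p) = \frac{1}{(1 - f) + \frac{f}{p}}, \qquad
S(8)\big|_{f = 0.9} = \frac{1}{0.1 + 0.1125} = \frac{1}{0.2125} \approx 4.71
```

The formula has no term for time that threads spend waiting on a lock. That waiting depends on how often, and for how long, critical sections are entered at runtime, which is why a static fraction f alone does not capture it.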

  7. Prospector: Parallel Speedup Predictor (2/2) • Mechanisms • Programmers annotate the serial code • Annotations describe the behavior of parallel execution and locks • Fast, lightweight profiling • Measure the time between annotations • Emulation • Estimate the parallel execution time, and from it the speedup • Modeling architectural parameters • Sample memory accesses • Use an analytical model to predict cache hits and misses
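The emulation step can be sketched as a small discrete-event simulation. This is an illustration under stated assumptions, not Prospector's emulator. It assumes that the annotations yield three measured durations per iteration of a loop that would run in parallel: `pre` (before Lock()), `locked` (inside the critical section), and `post` (after Unlock()). Emulated threads grab iterations dynamically and share one FIFO lock. Caches, memory bandwidth, and fork/join overhead are ignored. The functions `emulate` and `predicted_speedup` are hypothetical names.

```python
"""Hypothetical sketch: estimating parallel time by replaying serial timings.

Assumption: each iteration is described by (pre, locked, post) durations
measured between annotations in the serial run. A duration of locked == 0
means the iteration does not take the lock. Architectural effects are ignored.
"""
import heapq
import itertools


def emulate(iterations, num_threads):
    """Return the estimated parallel execution time (makespan)."""
    if num_threads < 1:
        raise ValueError("num_threads must be >= 1")
    tie = itertools.count()  # unique tie-breaker so equal times never compare payloads
    # Event = (time, tie, thread, pending). pending is None when the thread is
    # free to take work, or (locked, post) when it is about to request the lock.
    events = [(0.0, next(tie), t, None) for t in range(num_threads)]
    heapq.heapify(events)
    work = iter(iterations)
    lock_free_at = 0.0  # time at which the single shared lock is released
    makespan = 0.0

    while events:
        now, _, thread, pending = heapq.heappop(events)
        if pending is None:
            item = next(work, None)
            if item is None:
                makespan = max(makespan, now)  # no work left: thread waits for the join
                continue
            pre, locked, post = item
            heapq.heappush(events, (now + pre, next(tie), thread, (locked, post)))
        else:
            locked, post = pending
            start = now
            if locked > 0:
                # Events are popped in time order, so requests are granted FIFO.
                start = max(now, lock_free_at)
                lock_free_at = start + locked
            heapq.heappush(events, (start + locked + post, next(tie), thread, None))
    return makespan


def predicted_speedup(iterations, num_threads):
    serial_time = sum(pre + locked + post for pre, locked, post in iterations)
    return serial_time / emulate(iterations, num_threads)


if __name__ == "__main__":
    loop = [(3, 1, 0)] * 4  # 3 units of parallel work, then 1 unit under the lock
    for p in (1, 2, 4):
        print(p, predicted_speedup(loop, p))
```

With no locks, this sketch reproduces ideal scaling. With a heavily used lock, the critical sections serialize, and the predicted speedup falls below what Amdahl's Law would suggest from the parallel fraction alone.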

  8. Prospector: Parallelizable Section Finder (1/3) • Q: Is this code section parallelizable? • Data dependences determine parallelizability • Compiler analysis may not work well because of pointers and complex control flow • Our approach • Dynamic data-dependence profiling • Provides detailed dependence information for a given input • Challenges • Overhead is very high; a smart algorithm is needed [Figure: example code Func1(){ Loop1; Loop2; Func2(); } with a loop marked "Parallelizable!"]

  9. Prospector: Parallelizable Section Finder (2/3) • Mechanisms • A dynamic profiler based on instrumentation • Instrumentation can be at either the binary or the source level • At instrumentation (static) time • Analyze control flow graphs and loop structures • At runtime • Observe actual memory addresses (so no points-to analysis is needed) • Store and analyze these addresses to discover data dependences
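The runtime half of this mechanism can be sketched as a naive, shadow-memory-style checker. It is a minimal sketch, assuming instrumentation emits one event per memory access inside a loop, as `(iteration, kind, address)` with kind `"R"` or `"W"`, in program order. Because real addresses are observed, no points-to analysis is involved. The function name `loop_carried_deps` is hypothetical, and this naive version does not address the scalability problem discussed on the next slide.

```python
"""Hypothetical sketch: finding loop-carried dependences from an address trace.

Assumption: trace events are (iteration, kind, address) tuples, kind "R" or "W",
emitted in program order by instrumentation (not shown).
"""
from collections import defaultdict


def loop_carried_deps(trace):
    """Return {(dep_type, address, distance)} for dependences that cross iterations."""
    last_write = {}                       # address -> iteration of the latest write
    reads_since_write = defaultdict(set)  # address -> iterations that read it since then
    deps = set()
    for it, kind, addr in trace:
        if kind == "R":
            w = last_write.get(addr)
            if w is not None and w != it:
                deps.add(("RAW", addr, it - w))  # true (flow) dependence
            reads_since_write[addr].add(it)
        elif kind == "W":
            for r in reads_since_write[addr]:
                if r != it:
                    deps.add(("WAR", addr, it - r))  # anti dependence
            w = last_write.get(addr)
            if w is not None and w != it:
                deps.add(("WAW", addr, it - w))  # output dependence
            last_write[addr] = it
            reads_since_write[addr] = set()  # later writes only conflict with newer reads
        else:
            raise ValueError(f"unknown access kind: {kind!r}")
    return deps


if __name__ == "__main__":
    # for i in 1..3: a[i] = a[i-1] + 1  (address of a[k] modeled as k)
    chain = [ev for i in range(1, 4) for ev in ((i, "R", i - 1), (i, "W", i))]
    print(loop_carried_deps(chain))
```

An empty result for a given input means no loop-carried dependence was observed for that input. That is evidence, not proof, that the loop is parallelizable, since another input could behave differently.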

  10. Prospector: Parallelizable Section Finder (3/3) • Mechanisms • Scalability • Current tools require too much memory and time for data-dependence analysis • Prospector implements a new scalable algorithm for data-dependence profiling • Key ideas • Compression and parallelization (MICRO '10)
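The slide only states that the scalable algorithm relies on compression and parallelization, so the following is a generic illustration, not the published algorithm. Loop accesses often follow a constant stride, so a run of addresses can be stored as a `(base, stride, count)` triple instead of one entry per access. A single address can then be tested against a run without expanding it. Both function names are hypothetical.

```python
"""Hypothetical sketch: stride compression of a memory-address stream.

Illustrative only; this is one generic compression idea, not the MICRO '10
algorithm. Runs with a constant stride become (base, stride, count) triples.
"""


def compress_strides(addresses):
    runs = []
    base = stride = None
    count = 0
    for a in addresses:
        if count == 0:
            base, stride, count = a, 0, 1         # start a new run
        elif count == 1:
            stride, count = a - base, 2           # second address fixes the stride
        elif a == base + stride * count:
            count += 1                            # address continues the run
        else:
            runs.append((base, stride, count))    # pattern broke: close the run
            base, stride, count = a, 0, 1
    if count:
        runs.append((base, stride, count))
    return runs


def run_contains(run, addr):
    """True if addr is one of the addresses described by the compressed run."""
    base, stride, count = run
    if stride == 0:
        return addr == base
    offset = addr - base
    return offset % stride == 0 and 0 <= offset // stride < count


if __name__ == "__main__":
    runs = compress_strides([100, 104, 108, 112, 200, 300])
    print(runs)
    print(run_contains(runs[0], 108), run_contains(runs[0], 116))
```

Six addresses shrink to two triples here. For a long loop that walks an array, the saving grows with the trip count, which is where memory overhead is otherwise worst.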

  11. Prospector: Parallelism Pattern Advisor • Q: How can I transform the serial code? • If dependences are easily removable • E.g., embarrassingly parallel loops with some reductions • Guide the parallelization strategy directly • E.g., "Use an OpenMP pragma here" • If severe dependences exist • Can we advise how to avoid these dependences? • General solutions are extremely hard • Instead, analyze data-dependence patterns • E.g., pipeline parallelism, certain forms of locking [Figure: example loop Loop3 { Statements; Lock(); Statements; Unlock(); Statements; }]
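The first branch of this decision can be sketched as a tiny classifier. It is a hedged illustration, not Prospector's advisor. It assumes two inputs: loop-carried dependences as `(dep_type, address, distance)` tuples, such as a dependence profiler might report, and `reduction_addrs`, the addresses already known to hold reduction variables (detecting reductions is not shown). The function name, categories, and advice strings are all hypothetical.

```python
"""Hypothetical sketch: first-cut parallelization advice from dependence data.

Assumed inputs: `deps` is a set of (dep_type, address, distance) tuples for
loop-carried dependences; `reduction_addrs` lists addresses known to be
reduction variables. Categories and messages are illustrative only.
"""


def advise(deps, reduction_addrs=()):
    """Return (category, advice) for one loop."""
    carried = {addr for _, addr, _ in deps}
    if not carried:
        return ("doall",
                "No loop-carried dependences: e.g., use an OpenMP 'parallel for' pragma")
    if carried <= set(reduction_addrs):
        return ("reduction",
                "Only reduction variables carry dependences: e.g., add an OpenMP 'reduction' clause")
    return ("pattern",
            "Severe dependences: analyze patterns such as pipeline parallelism or locking")


if __name__ == "__main__":
    print(advise(set()))
    print(advise({("RAW", 500, 1), ("WAW", 500, 1)}, reduction_addrs={500}))
    print(advise({("RAW", 1, 1), ("RAW", 2, 1)}))
```

Only the "pattern" case needs the harder dependence-pattern analysis described on the slide. The first two cases map directly onto standard OpenMP constructs.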

  12. Prospector: Parallel Architecture Advisor • Q: Which parallel hardware would be better? • Can we predict performance on different hardware? • E.g., speedups on multicore CPUs and GPGPUs • Challenges • More architectural factors need to be modeled [Figure: predicted speedup on CPU vs. GPU]

  13. Prospector: Parallel Performance Analyzer • Q: What causes poor speedup? • Several profilers already exist for this purpose • They analyze the degree of concurrency • They profile lock contention (wait time) • However, their information is too low-level to reveal the underlying problems • Alternative • Macroscopic profiling of parallelized programs • New forms of visualization [Figure: example loop Loop3 { Statements; Lock(); Statements; Unlock(); Statements; }]

  14. Related Work • State-of-the-art tools • Parallel Advisor from Intel Parallel Studio 2011 • Speedup Predictor: cannot model architectures • Parallelizable Section Finder: scalability issues • vfAnalyst from VectorFabric • Parallelizable Section Finder: scalability issues

  15. Current Status and Timeline • June 2010 • The initial Prospector idea was presented at HotPar '10 • Dec 2010 • The scalable data-dependence profiling algorithm (for the Parallelizable Section Finder and Pattern Advisor) will be presented at MICRO '10 • A beta version will be released as open source • Loop-centric profiler • Parallelizable Section Finder (i.e., data-dependence profiler) • Parallel Speedup Predictor • Mar 2011 • Parallel Speedup Predictor will be released • Aug 2011 • First Parallelism Pattern Advisor will be released

  16. Conclusion • We need a new type of tool to help parallel programming • Prospector is a set of parallel programming advisors based on dynamic program analysis • Finds good parallelization targets • Analyzes serial code to understand its behavior • Predicts speedup • Provides advice on code changes

  17. Thank you! • Q&A • References • Overall tool architecture • Minjang Kim, Hyesoon Kim, Chi-Keung Luk, "Prospector: Helping Parallel Programming by A Data-Dependence Profiler", 2nd USENIX Workshop on Hot Topics in Parallelism (HotPar '10), June 2010. • Scalable data-dependence profiling • Minjang Kim, Hyesoon Kim, Chi-Keung Luk, "SD3: A Scalable Approach To Dynamic Data-Dependence Profiling", Proceedings of the 43rd IEEE/ACM International Symposium on Microarchitecture (MICRO), December 2010.
