
Programming Models for Multi-Cores

This document provides an introduction to programming models for multi-core processors, specifically focusing on the Cell/B.E. architecture. It discusses available models, their strengths and weaknesses, and explores whether a single standard model can be used. The document concludes with an overview of the OpenCL standard as a potential solution.


Presentation Transcript


  1. Programming Models for Multi-Cores Ana Lucia Varbanescu TUDelft / Vrije Universiteit Amsterdam with acknowledgements to Maik Nijhuis @ VU, Xavier Martorell @ UPC, Rosa Badia @ BSC

  2. Outline • An introduction • Programming the Cell/B.E. • available models • … can we compare them ?! • More processors, more models …? • CUDA, Brook, TBB, Ct, Sun Studio, … • … or a single standard one ? • OpenCL standard = the solution? • Conclusions

  3. An introduction

  4. The Problem • Cell/B.E. = High performance • Cell/B.E. != Programmability • Is there a way to match the two ?

  5. Cell/B.E. • 1 x PPE 64-bit PowerPC (L1: 32KB I$ + 32 KB D$; L2: 512 KB) • 8 x SPE cores (LS: 256KB, SIMD machines) • Hybrid memory model • Cell blades (QS20/21): 2xCell / PS3: 1xCell (6 SPEs only) • Thread-based model, push/pull data • Thread scheduling by user • Five layers of parallelism: • Task parallelism (MPMD) • Data parallelism (SPMD) • Data streaming parallelism (DMA double buffering) • Vector parallelism (SIMD – up to 16-way) • Pipeline parallelism (dual-pipelined SPEs)
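To make the thread-based model concrete, here is a minimal PPE-side sketch using the standard libspe2 API; the embedded program handle spe_hello is an assumption, standing in for any compiled SPE binary.

    /* PPE side: create, load, and run one SPE thread with libspe2 (sketch). */
    #include <libspe2.h>
    #include <stdio.h>

    extern spe_program_handle_t spe_hello;   /* assumed embedded SPE binary */

    int main(void)
    {
        unsigned int entry = SPE_DEFAULT_ENTRY;
        spe_context_ptr_t ctx = spe_context_create(0, NULL);
        if (ctx == NULL) { perror("spe_context_create"); return 1; }

        spe_program_load(ctx, &spe_hello);
        /* Blocks until the SPE program stops; argp/envp would carry user data. */
        spe_context_run(ctx, &entry, 0, NULL, NULL, NULL);
        spe_context_destroy(ctx);
        return 0;
    }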

  6. Programming the Cell/B.E. High-level Mapping Core-level • A view from the application: • High-level parallelization => application task-graph • Mapping/Scheduling => mapped graph • In-core optimizations => optimized code for each core • A high-level programming model should “capture” all three aspects of Cell applications!

  7. Expressing the task graph High-level Mapping Core-level • Task definition • A task is a tuple : <Inputs,Outputs,Computation[,Data-Par]> • Task interconnections • Express top level application parallelism and data dependencies • Task synchronization • Allow for barriers and other mechanisms, external from the tasks • Task composition • Tasks should be able to split/merge with other tasks.
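A minimal sketch of how such a task tuple could be represented in C; every name here is hypothetical and tied to no particular model from the survey.

    /* Hypothetical task node for <Inputs,Outputs,Computation[,Data-Par]>. */
    typedef struct task {
        void        **inputs;       /* input buffers (incoming dependencies) */
        void        **outputs;      /* output buffers                        */
        void        (*computation)(void **in, void **out); /* sequential kernel */
        int           data_par;     /* optional data-parallel width; 1 = none */
        struct task **successors;   /* edges: tasks consuming our outputs    */
        int           n_successors; /* number of outgoing edges              */
    } task_t;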

  8. Mapping/scheduling High-level Mapping Core-level • Task-graph “expansion” (Auto) • Data parallelism and synchronization are transformed in nodes and edges • Application mapping (User-aid) • All potential mappings should be considered • Mapping optimizations (User-aid) • Merge/split tasks to fit the target core and minimize communication • Scheduling (User) • Establish how to deal with contention at the core level

  9. Core-level High-level Mapping Core-level • Computation (User) • Allow the user to write the computation code (sequential) • Core optimizations (User/Auto) • Per-core optimizations (different on PPE and SPEs) • Memory access (Auto) • Hide explicit DMA • Optimize DMA (Auto) • Overlap computation with communication
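The "overlap computation with communication" point is usually realized by double buffering on the SPE. A sketch using the standard spu_mfcio.h DMA intrinsics; CHUNK and the compute() kernel are placeholders.

    /* SPE side: double-buffered DMA input stream (sketch). */
    #include <spu_mfcio.h>

    #define CHUNK 4096                       /* placeholder transfer size */
    static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

    extern void compute(volatile char *data, int n);   /* placeholder kernel */

    void process(unsigned long long ea, int nchunks)
    {
        int cur = 0;
        mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);        /* prefetch chunk 0 */
        for (int i = 0; i < nchunks; i++) {
            int next = cur ^ 1;
            if (i + 1 < nchunks)                        /* start next transfer early */
                mfc_get(buf[next], ea + (unsigned long long)(i + 1) * CHUNK,
                        CHUNK, next, 0, 0);
            mfc_write_tag_mask(1 << cur);               /* wait for current chunk */
            mfc_read_tag_status_all();
            compute(buf[cur], CHUNK);                   /* overlaps with the DMA */
            cur = next;
        }
    }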

  10. Extras High-level Mapping Core-level • Performance estimation • Application performance should be roughly predictable from the task graph • Warnings and hints • Better warnings and hints to replace the standard SDK messages

  11. Available Cell/B.E. programming models • SDK-based models • IDL, ALF • Code-reuse models • MPI (micro-tasks), OpenMP • CellSS • Abstract models • Sequoia • Charm++ and the Offload API • SP@CE • Industry • PeakStream, RapidMind, the MultiCore Framework • Other approaches • MultiGrain Parallelism Scheduling, BlockLib, Sieve++

  12. IDL and the Function-Offload Model • Offloads computation-intensive tasks on SPEs • Programmer provides: • Sequential code to run on PPE • SPE implementations for offloaded functions • IDL specification for function behaviour • Dynamic scheduling, based on distributed SPE queues

  13. Accelerated Library Framework (ALF) • SPMD applications on a host-accelerator platform • Programmer provides: • Accelerator libraries - collections of accelerated code • Application usage of the accelerator libraries • Runtime scheduling

  14. MPI micro-tasks • MPI front-end on the Cell/B.E. • Programmer provides: • MPI application • Preprocessor generates application graph with basic tasks • Basic tasks are merged so that the graph becomes series-parallel (SP) • The SP graph is mapped automatically • Core-level communication optimizations are automatic
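Since the programmer provides a plain MPI application, the input to the preprocessor is ordinary MPI code, for example:

    /* Ordinary MPI code; the micro-task preprocessor derives the task graph. */
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, data = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {                     /* producer task */
            data = 42;
            MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {              /* consumer task */
            MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        MPI_Finalize();
        return 0;
    }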

  15. OpenMP • Based on pragmas • Enables code re-use • Programmer provides: • OpenMP application • Core-level optimizations • DMA optimizations • Mapping and scheduling: automated • Most work on the compiler side
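A minimal illustration of the pragma-based, code-reuse style: the sequential loop stays untouched, and the pragma is the only change.

    /* Re-used sequential C; the pragma carries all parallelization intent. */
    void saxpy(int n, float a, const float *x, float *y)
    {
        #pragma omp parallel for            /* mapping/scheduling automated */
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }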

  16. Cell SuperScalar (CellSS) • Very good for quick porting of applications on the Cell/B.E. • Programmer provides: • Sequential C application • Pragmas to mark the functions to be offloaded • Additional data distribution information • Based on a compiler and a run-time system • The compiler separates the annotated application into a PPE application and the SPE application • The runtime system maintains a dynamic data dependency graph of the active tasks, updating it each time a task starts/ends • Dynamic scheduling • based on the runtime calculation of the data dependency graph.
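A sketch of the annotation style, loosely following published CellSs examples; the exact clause syntax may differ across SDK versions, and all identifiers here are illustrative.

    /* CellSs-style annotated blocked matrix multiply (sketch). */
    #define NB 8                             /* illustrative block count */
    typedef float block_t[64][64];

    #pragma css task input(a, b) inout(c)    /* offloaded to an SPE as a task */
    void block_mmul(block_t a, block_t b, block_t c);

    void mmul(block_t A[NB][NB], block_t B[NB][NB], block_t C[NB][NB])
    {
        for (int i = 0; i < NB; i++)
            for (int j = 0; j < NB; j++)
                for (int k = 0; k < NB; k++)
                    /* Each call spawns a task; the runtime adds it to the
                       dynamic dependency graph and schedules it. */
                    block_mmul(A[i][k], B[k][j], C[i][j]);
    }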

  17. Sequoia • High-level abstract model, suitable for divide-and-conquer applications • Uses memory locality as the primary parallelization criterion • Application = hierarchy of parameterized, recursively decomposed tasks • Tasks run in isolation (data locality) • Programmer provides: • Application hierarchical graph • A mapping of the graph on the platform • (Optimized) Code for the leaf-nodes • A flexible environment for tuning and testing application performance

  18. SP@CE • Dedicated to streaming applications • An application is a collection of kernels that communicate only by data streaming • Programmer provides: • Application streaming graph (XML) • (Library of) Optimized kernels for the SPEs • Dynamic scheduling, based on a centralized job-queue • Run-time system on the SPEs, to hide (some of) the communication overhead

  19. Charm++ and the Offload API • An application = a collection of chares • Communicate through messages • Created and/or destroyed at runtime • A chare has a list of work requests to run on SPE • PPE: uses the offload API to manage the work requests (data flow, execution, completion) • SPE: a small runtime system for local management and optimizations • Programmer provides: • Charm++ application • Work requests and their SPE code

  20. RapidMind • Based on “SPMD streaming” • tasks are executed on parallelized streams of data. • A kernel (“program”) is a computation on elements of a vector • An application is a combination of regular code and RapidMind code => compiler translates into PPE code and SPEs code • Programmer provides: • C++ application • Computation kernels inside the application • Kernels can execute asynchronously => achieve task-parallelism

  21. MultiCore Framework SDK (Mercury) • A master-worker model • focused on data parallelism and data distributions • An application = manager (on PPE) and workers (on SPEs) • Data communication is based on: • virtual channels: between manager and worker(s) • data objects: to specify data granularity and distribution • elements read/written are different at the channel ends • Programmer provides: • C code for the kernels • The channels interconnections via read/write ops • Data distribution objects for each channel • No parallelization support, no core optimizations, no application-level design.

  22. Brief overview

  23. Features - revisited

  24. How to compare performance? • Implement one application from scratch • Impractical and very time consuming • Use an already given benchmark • Matrix multiplication is available
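For reference, a naive sequential C baseline of the MMUL benchmark; each model then blocks, vectorizes, and distributes this kernel in its own way.

    /* Naive sequential baseline for the MMUL benchmark. */
    void mmul(int n, const float *a, const float *b, float *c)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                float s = 0.0f;
                for (int k = 0; k < n; k++)
                    s += a[i * n + k] * b[k * n + j];
                c[i * n + j] = s;
            }
    }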

  25. Performance • See examples …

  26. Are results relevant ? • Only partially ! • MMUL is NOT a good benchmark for high-level programming models • The results mainly reveal the success of the low-level optimizations • The implementations are VERY different • Hard to measure computation only • Data distribution issues are addressed very differently • Overall, a better approach for performance comparison is needed!!! • Benchmark application • Set of metrics

  27. Still … • Low-level optimizations are not part of the programming model’s targets => can/should be designed separately and heavily reused • The performance overhead induced by the design and/or implementation in a high-level model decreases with the size of the application • The programming effort spent on SPE optimizations increases the overall application implementation effort by a constant factor, independent of the chosen programming model.

  28. Usability

  29. The Answers [1/2] High-level > 90% Mapping 0-100% Core-level > 50% • High-level programming models cover enough features to support application design and implementation at all levels. • Low-level optimizations and high-level algorithm parallelization remain difficult tasks for the programmer. • No single Cell/B.E. programming model can address all application types

  30. The Answers [2/2] • Alleviate the programmability issue: 60 % • Preserve the high Cell/B.E. performance: 90 % • Are easy to use ? 10-90 % • Allow for automation ? 50 % • Is there an ideal one ? NO

  31. GPU Models [1/2] • GPGPU used to be fancy • OpenGL • Cg • RapidMind

  32. GPU Models [2/2] • NVIDIA GPUs • CUDA is an original HW-SW codesign approach • Extremely popular • Considered easy to use • ATI/AMD GPUs • Originally Brook • Currently ATI Stream SDK

  33. OpenCL [1/4] • Currently up and running for : • AMD/ATI, IBM, NVIDIA, Apple • Other members of the Khronos consortium to follow • ARM, Intel [?] • See examples …

  34. OpenCL [2/4] • Language Specification • C-based cross-platform programming interface • Subset of ISO C99 with language extensions - familiar to developers • Online or offline compilation and build of compute kernel executables • Platform Layer API • A hardware abstraction layer over diverse computational resources • Query, select and initialize compute devices • Create compute contexts and work-queues • Runtime API • Execute compute kernels • Manage scheduling, compute, and memory resources
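A condensed host-side sketch of the platform-layer and runtime calls listed above, with error handling omitted; the kernel name "scale" and the single float buffer are assumptions.

    /* OpenCL host side: pick a device, build a kernel, run it (sketch). */
    #include <CL/cl.h>

    void run(const char *src, float *data, size_t n)
    {
        cl_platform_id plat;
        cl_device_id dev;
        clGetPlatformIDs(1, &plat, NULL);                 /* platform layer */
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);  /* online compilation */
        cl_kernel k = clCreateKernel(prog, "scale", NULL);/* "scale" is assumed */

        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                    n * sizeof(float), data, NULL);
        clSetKernelArg(k, 0, sizeof(buf), &buf);
        clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, buf, CL_TRUE, 0, n * sizeof(float),
                            data, 0, NULL, NULL);         /* blocking read-back */
    }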

  35. OpenCL [3/4] – memory model • multi-level memory model • private memory visible only to the individual compute units in the device • global memory visible to all compute units on the device. • depending on the HW, memory spaces can be collapsed together. • 4 memory spaces • Private memory: a single compute unit (think registers). • Local memory: work-items in a work-group. • Constant memory: stores constant data for read-only access • Global memory: used by all the compute units on the device.
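The four spaces map directly onto address-space qualifiers in the kernel language; an illustrative (incomplete) kernel touching all four:

    /* OpenCL C: the four memory spaces as address-space qualifiers. */
    __kernel void scale_sum(__global const float *in,    /* global: whole device  */
                            __constant float *coeff,     /* constant: read-only   */
                            __local float *tile,         /* local: one work-group */
                            __global float *out)
    {
        float x = in[get_global_id(0)] * coeff[0];       /* x is private memory */
        tile[get_local_id(0)] = x;
        barrier(CLK_LOCAL_MEM_FENCE);                    /* group-wide visibility */
        if (get_local_id(0) == 0)                        /* reduction omitted */
            out[get_group_id(0)] = tile[0];
    }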

  36. OpenCL [4/4] – execution model • Execution model • Compute kernels can be thought of as either data-parallel (well-suited to GPUs) or task-parallel (well-matched to the architecture of CPUs) • A compute kernel is the basic unit of executable code and can be thought of as similar to a C function • Kernel execution can be in-order or out-of-order • Events let the developer check the status of runtime requests • The execution domain of a kernel • an N-dimensional computation domain • each element in the execution domain is a work-item • work-items can be clustered into work-groups for synchronization and communication.
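On the host side, the N-dimensional domain is expressed as a global size (all work-items) and a local size (the work-group shape); a 2-D sketch, with the command queue and kernel assumed to exist already:

    /* Host side: a 2-D execution domain split into work-groups (sketch). */
    #include <CL/cl.h>

    void launch2d(cl_command_queue q, cl_kernel k)
    {
        size_t global[2] = {1024, 1024};   /* all work-items in the domain     */
        size_t local[2]  = {16, 16};       /* one work-group = 16x16 work-items */
        cl_event done;
        clEnqueueNDRangeKernel(q, k, 2, NULL, global, local, 0, NULL, &done);
        clWaitForEvents(1, &done);         /* events expose request status */
    }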

  37. Conclusions • A multitude of programming models • Abundant for the Cell/B.E., due to the original lack of high-level programming support • Less so for GPUs, due to CUDA • Simple programming models are key to platform adoption • CUDA • Essential features are: • Tackling *all* parallelism layers of a platform • Both automagically and with user intervention • Portability • Ease-of-use, i.e., not a very steep learning curve (C-based works) • (Control over) Performance • Most of the time, efficiency

  38. Take home messages • Application parallelization remains the programmer's task • Programming models should facilitate quick implementation and evaluation • Programming models are hard to compare • Application-specific or platform-specific • Often user-specific • Low portability is considered worse than performance drops • Performance trade-offs are smaller than expected • OpenCL’s portability is responsible (so far) for its appeal

  39. Thank you! • Questions ? A.L.Varbanescu@tudelft.nl analucia@cs.vu.nl http://www.pds.ewi.tudelft.nl/~varbanescu
