This document provides an introduction to programming models for multi-core processors, specifically focusing on the Cell/B.E. architecture. It discusses available models, their strengths and weaknesses, and explores whether a single standard model can be used. The document concludes with an overview of the OpenCL standard as a potential solution.
Programming Models for Multi-Cores
Ana Lucia Varbanescu, TU Delft / Vrije Universiteit Amsterdam
With acknowledgements to Maik Nijhuis @ VU, Xavier Martorell @ UPC, and Rosa Badia @ BSC
Outline
• An introduction
• Programming the Cell/B.E.
  • Available models
  • … can we compare them?
• More processors, more models?
  • CUDA, Brook, TBB, Ct, Sun Studio, …
  • … or a single standard one?
  • The OpenCL standard = the solution?
• Conclusions
The Problem
• Cell/B.E. = high performance
• Cell/B.E. != programmability
• Is there a way to match the two?
Cell/B.E.
• 1 x PPE: 64-bit PowerPC (L1: 32 KB I$ + 32 KB D$; L2: 512 KB)
• 8 x SPE cores (LS: 256 KB; SIMD machines)
• Hybrid memory model
• Cell blades (QS20/21): 2 x Cell; PS3: 1 x Cell (only 6 SPEs available)
• Thread-based model, push/pull data
• Thread scheduling by the user
• Five layers of parallelism:
  • Task parallelism (MPMD)
  • Data parallelism (SPMD)
  • Data streaming parallelism (DMA double buffering)
  • Vector parallelism (SIMD – up to 16-way; see the sketch below)
  • Pipeline parallelism (dual-pipelined SPEs)
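To make the vector layer concrete, here is a minimal SIMD sketch for an SPE, using the spu_intrinsics.h API from the Cell SDK; the function name and the assumption that n is a multiple of 4 are illustrative.

    #include <spu_intrinsics.h>

    /* y[i] = a * x[i] + y[i]; n is assumed to be a multiple of 4 */
    void saxpy_simd(int n, float a, vector float *x, vector float *y)
    {
        vector float va = spu_splats(a);       /* replicate a into all 4 lanes */
        for (int i = 0; i < n / 4; i++)
            y[i] = spu_madd(va, x[i], y[i]);   /* 4-wide fused multiply-add    */
    }

Each 128-bit SPE register holds four floats, so one spu_madd covers four iterations of the scalar loop; the "16-way" figure above refers to operating on 16 single-byte elements at once.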
Programming the Cell/B.E.
(Design stages: High-level → Mapping → Core-level)
• A view from the application:
  • High-level parallelization => application task graph
  • Mapping/scheduling => mapped graph
  • In-core optimizations => optimized code for each core
• A high-level programming model should “capture” all three aspects of Cell applications!
Expressing the task graph
• Task definition
  • A task is a tuple: <Inputs, Outputs, Computation[, Data-Par]> (see the sketch below)
• Task interconnections
  • Express top-level application parallelism and data dependencies
• Task synchronization
  • Allow for barriers and other mechanisms, external to the tasks
• Task composition
  • Tasks should be able to split/merge with other tasks
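As a purely hypothetical illustration (none of these names come from any actual model's API), the task tuple could be rendered as a C structure:

    /* A sketch of the tuple <Inputs, Outputs, Computation[, Data-Par]>;
       all field names are illustrative. */
    typedef struct task {
        void  **inputs;                              /* data the task consumes     */
        void  **outputs;                             /* data the task produces     */
        void  (*computation)(void **in, void **out); /* sequential kernel code     */
        int     data_par;                            /* optional data-par width    */
    } task_t;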
Mapping/scheduling
• Task-graph “expansion” (auto)
  • Data parallelism and synchronization are transformed into nodes and edges
• Application mapping (user-aided)
  • All potential mappings should be considered
• Mapping optimizations (user-aided)
  • Merge/split tasks to fit the target core and minimize communication
• Scheduling (user)
  • Establish how to deal with contention at the core level
Core-level
• Computation (user)
  • Allow the user to write the (sequential) computation code
• Core optimizations (user/auto)
  • Per-core optimizations (different on the PPE and the SPEs)
• Memory access (auto)
  • Hide explicit DMA
• Optimize DMA (auto)
  • Overlap computation with communication (see the sketch below)
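The classic way to overlap computation with communication on an SPE is double buffering: while the current chunk is being processed, the DMA for the next one is already in flight. A minimal sketch using the spu_mfcio.h API; CHUNK, stream() and compute() are illustrative.

    #include <spu_mfcio.h>

    #define CHUNK 4096
    volatile char buf[2][CHUNK] __attribute__((aligned(128)));

    extern void compute(volatile char *data, int n);

    void stream(unsigned long long ea, int nchunks)
    {
        int cur = 0;
        mfc_get(buf[cur], ea, CHUNK, cur, 0, 0);      /* prefetch chunk 0         */
        for (int i = 0; i < nchunks; i++) {
            int next = cur ^ 1;
            if (i + 1 < nchunks)                      /* start the next DMA ...   */
                mfc_get(buf[next], ea + (i + 1ULL) * CHUNK, CHUNK, next, 0, 0);
            mfc_write_tag_mask(1 << cur);             /* ... wait for the current */
            mfc_read_tag_status_all();
            compute(buf[cur], CHUNK);                 /* overlaps with next DMA   */
            cur = next;
        }
    }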
Extras
• Performance estimation
  • Application performance should be roughly predictable from the task graph
• Warnings and hints
  • Better warnings and hints, to replace the standard SDK messages
Available Cell/B.E. programming models
• SDK-based models
  • IDL, ALF
• Code-reuse models
  • MPI (micro-tasks), OpenMP
• CellSS
• Abstract models
  • Sequoia
  • Charm++ and the Offload API
  • SP@CE
• Industry
  • PeakStream, RapidMind, the MultiCore Framework
• Other approaches
  • MultiGrain Parallelism Scheduling, BlockLib, Sieve++
IDL and the Function-Offload Model
• Offloads computation-intensive tasks onto the SPEs
• Programmer provides:
  • Sequential code to run on the PPE
  • SPE implementations of the offloaded functions
  • An IDL specification of each function’s behaviour
• Dynamic scheduling, based on distributed SPE queues
Accelerated Library Framework (ALF)
• SPMD applications on a host–accelerator platform
• Programmer provides:
  • Accelerator libraries – collections of accelerated code
  • The application’s usage of the accelerator libraries
• Runtime scheduling
MPI micro-tasks
• An MPI front-end for the Cell/B.E.
• Programmer provides:
  • An MPI application (see the sketch below)
• A preprocessor generates the application graph with basic tasks
• Basic tasks are merged together until the graph is series-parallel (SP)
• The SP graph is mapped automatically
• Core-level communication optimizations are automatic
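For reference, a toy example of the kind of plain MPI code the preprocessor takes as input; the ranks and the message exchange are illustrative.

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, data = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {                       /* producer task */
            data = 42;
            MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {                /* consumer task */
            MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }
        MPI_Finalize();
        return 0;
    }

From such code, the preprocessor derives the basic tasks and the communication edges of the application graph.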
OpenMP
• Based on pragmas
• Enables code re-use
• Programmer provides:
  • An OpenMP application (see the sketch below)
• Core-level optimizations, DMA optimizations, mapping and scheduling: all automated
• Most of the work is on the compiler side
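A minimal example of the input: standard, portable OpenMP in C. On the Cell port, the compiler outlines the parallel loop body, offloads it to the SPEs and generates the DMA transfers; the function itself is illustrative.

    #include <omp.h>

    void scale(int n, float a, float *x)
    {
        #pragma omp parallel for           /* iterations split across cores */
        for (int i = 0; i < n; i++)
            x[i] *= a;
    }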
Cell SuperScalar (CellSS)
• Very good for quickly porting applications to the Cell/B.E.
• Programmer provides:
  • A sequential C application
  • Pragmas marking the functions to be offloaded (see the sketch below)
  • Additional data-distribution information
• Based on a compiler and a run-time system
  • The compiler splits the annotated application into a PPE application and an SPE application
  • The runtime maintains a dynamic data-dependency graph of the active tasks, updating it each time a task starts/ends
• Dynamic scheduling
  • Based on the runtime’s data-dependency graph
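A sketch of the annotation style, with syntax approximated from the CellSs papers (treat the exact clause spelling as an assumption): the pragma marks a function as an offloadable task and declares the direction of each argument, from which the runtime builds its dependency graph.

    #define BS 64

    #pragma css task input(a, b) inout(c)
    void block_mmul(float a[BS][BS], float b[BS][BS], float c[BS][BS]);

    void mmul(int nb, float (*a)[BS][BS], float (*b)[BS][BS], float (*c)[BS][BS])
    {
        for (int i = 0; i < nb; i++)
            block_mmul(a[i], b[i], c[i]);  /* each call spawns a task; the
                                              runtime tracks a/b/c dependencies
                                              and schedules it on an SPE */
    }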
Sequoia
• High-level abstract model, suitable for divide-and-conquer applications
• Uses memory locality as the first parallelization criterion
• Application = a hierarchy of parameterized, recursively decomposed tasks
  • Tasks run in isolation (data locality)
• Programmer provides:
  • The application’s hierarchical graph
  • A mapping of the graph onto the platform
  • (Optimized) code for the leaf nodes
• A flexible environment for tuning and testing application performance
SP@CE
• Dedicated to streaming applications
• An application is a collection of kernels that communicate only by data streaming
• Programmer provides:
  • The application’s streaming graph (XML)
  • A (library of) optimized kernels for the SPEs
• Dynamic scheduling, based on a centralized job queue
• A run-time system on the SPEs hides (some of) the communication overhead
Charm++ and the Offload API
• An application = a collection of chares
  • Chares communicate through messages
  • Chares are created and/or destroyed at runtime
• A chare has a list of work requests to run on the SPEs
  • PPE: uses the Offload API to manage the work requests (data flow, execution, completion)
  • SPE: a small runtime system for local management and optimizations
• Programmer provides:
  • A Charm++ application
  • The work requests and their SPE code
RapidMind
• Based on “SPMD streaming”
  • Tasks are executed on parallelized streams of data
• A kernel (a “program”) is a computation on the elements of a vector
• An application is a mix of regular code and RapidMind code => the compiler translates it into PPE code and SPE code
• Programmer provides:
  • A C++ application
  • The computation kernels inside the application
• Kernels can execute asynchronously => task parallelism is achievable
MultiCore Framework SDK (Mercury)
• A master–worker model, focused on data parallelism and data distributions
• An application = a manager (on the PPE) and workers (on the SPEs)
• Data communication is based on:
  • Virtual channels between the manager and the worker(s)
  • Data objects, which specify data granularity and distribution
  • The elements read and written may differ at the two ends of a channel
• Programmer provides:
  • C code for the kernels
  • The channel interconnections, via read/write operations
  • A data-distribution object for each channel
• No parallelization support, no core optimizations, no application-level design
How to compare performance?
• Implement one application from scratch
  • Impractical and very time-consuming
• Use an existing benchmark
  • Matrix multiplication is available
Performance
• See examples …
Are the results relevant?
• Only partially!
  • MMUL is NOT a good benchmark for high-level programming models
  • The results mostly reveal how successful the low-level optimizations are
  • The implementations are VERY different
  • It is hard to measure computation only
  • Data-distribution issues are addressed very differently
• Overall, a better approach to performance comparison is needed:
  • A benchmark application
  • A set of metrics
Still …
• Low-level optimizations are not among a programming model’s targets => they can/should be designed separately and heavily reused
• The performance overhead induced by designing and/or implementing in a high-level model decreases with the size of the application
• The programming effort spent on SPE optimizations adds a constant factor to the overall implementation effort, independent of the chosen programming model
The Answers [1/2]
(Coverage per stage: High-level > 90%; Mapping 0–100%; Core-level > 50%)
• High-level programming models cover enough features to support application design and implementation at all levels
• Low-level optimizations and high-level algorithm parallelization remain difficult tasks for the programmer
• There is no single Cell/B.E. programming model that can address all application types
The Answers [2/2]
• Do they alleviate the programmability issue? 60%
• Do they preserve the high Cell/B.E. performance? 90%
• Are they easy to use? 10–90%
• Do they allow for automation? 50%
• Is there an ideal one? NO
GPU Models [1/2]
• GPGPU used to be fancy:
  • OpenGL, Cg
  • RapidMind
GPU Models [2/2]
• NVIDIA GPUs
  • CUDA is an original HW–SW co-design approach (see the sketch below)
  • Extremely popular
  • Considered easy to use
• ATI/AMD GPUs
  • Originally Brook
  • Currently the ATI Stream SDK
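For flavour, the CUDA version of the SAXPY kernel sketched earlier; the grid/block sizes and names are illustrative.

    __global__ void saxpy(int n, float a, const float *x, float *y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;  /* one thread per element */
        if (i < n)
            y[i] = a * x[i] + y[i];
    }

    /* host side: saxpy<<<(n + 255) / 256, 256>>>(n, 2.0f, d_x, d_y); */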
OpenCL [1/4]
• Currently up and running for:
  • AMD/ATI, IBM, NVIDIA, Apple
• Other members of the Khronos consortium to follow
  • ARM, Intel [?]
• See examples …
OpenCL [2/4]
• Language specification
  • C-based, cross-platform programming interface
  • Subset of ISO C99 with language extensions – familiar to developers
  • Online or offline compilation and build of compute-kernel executables
• Platform-layer API
  • A hardware-abstraction layer over diverse computational resources
  • Query, select and initialize compute devices
  • Create compute contexts and work-queues
• Runtime API
  • Execute compute kernels
  • Manage scheduling, compute, and memory resources
  • (See the host-side sketch below)
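A minimal host-side sketch touching all three layers, using standard OpenCL 1.x calls; error checks are omitted, and the kernel name "scale" and the buffer size are illustrative.

    #include <CL/cl.h>

    void run(const char *src, size_t n)           /* src: kernel source text */
    {
        cl_platform_id plat;  cl_device_id dev;
        clGetPlatformIDs(1, &plat, NULL);         /* -- platform layer --    */
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);
        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

        cl_program prog =                         /* -- online build --      */
            clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "scale", NULL);

        cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,  /* -- runtime -- */
                                    n * sizeof(float), NULL, NULL);
        clSetKernelArg(k, 0, sizeof(cl_mem), &buf);
        clEnqueueNDRangeKernel(q, k, 1, NULL, &n, NULL, 0, NULL, NULL);
        clFinish(q);
    }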
OpenCL [3/4] – memory model
• A multi-level memory model
  • Private memory is visible only to an individual work-item; global memory is visible to all compute units on the device
  • Depending on the HW, memory spaces can be collapsed together
• Four memory spaces (see the kernel sketch below)
  • Private memory: owned by a single work-item (think registers)
  • Local memory: shared by the work-items of a work-group
  • Constant memory: stores constant data for read-only access
  • Global memory: used by all the compute units on the device
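An illustrative kernel marking the four address spaces with their OpenCL C qualifiers; the computation itself is a placeholder.

    __kernel void spaces(__global float *out,       /* global: whole device     */
                         __constant float *coeff,   /* constant: read-only data */
                         __local float *scratch)    /* local: one work-group    */
    {
        float x = coeff[0];                         /* private: per work-item   */
        size_t lid = get_local_id(0);
        scratch[lid] = x * (float)get_global_id(0);
        barrier(CLK_LOCAL_MEM_FENCE);               /* sync within the group    */
        out[get_global_id(0)] = scratch[lid];
    }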
OpenCL [4/4] – execution model
• Execution model
  • Compute kernels can be thought of as either data-parallel (well-matched to GPUs) or task-parallel (well-matched to CPUs)
  • A compute kernel is the basic unit of executable code, and can be thought of as similar to a C function
  • Kernel execution can be in-order or out-of-order
  • Events allow the developer to check on the status of runtime requests
• The execution domain of a kernel (see the sketch below)
  • An N-dimensional computation domain
  • Each element in the execution domain is a work-item
  • Work-items can be clustered into work-groups for synchronization and communication
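A sketch of the execution domain in action: a 1-D NDRange whose work-groups each cooperatively reduce their slice of the input. The group size is assumed to be a power of two; names are illustrative.

    __kernel void group_sum(__global const float *in, __global float *out,
                            __local float *tmp)
    {
        size_t lid = get_local_id(0);
        size_t gsz = get_local_size(0);             /* work-group size           */
        tmp[lid] = in[get_global_id(0)];            /* each work-item loads one  */
        for (size_t s = gsz / 2; s > 0; s /= 2) {
            barrier(CLK_LOCAL_MEM_FENCE);           /* in-group synchronization  */
            if (lid < s)
                tmp[lid] += tmp[lid + s];           /* tree reduction in local   */
        }
        if (lid == 0)
            out[get_group_id(0)] = tmp[0];          /* one partial sum per group */
    }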
Conclusions
• A multitude of programming models
  • Abundant for the Cell/B.E., due to the original lack of high-level programming support
  • Fewer for GPUs, due to CUDA
• Simple programming models are key to platform adoption
  • CUDA
• Essential features are:
  • Tackling *all* the parallelism layers of a platform, both automagically and with user intervention
  • Portability
  • Ease of use – avoiding a very steep learning curve (C-based works)
  • (Control over) performance
  • Most of the time, efficiency
Take-home messages
• Application parallelization remains the programmer’s task
• Programming models should facilitate quick implementation and evaluation
• Programming models are hard to compare
  • Application-specific or platform-specific
  • Often user-specific
• Low portability is considered worse than performance drops
• Performance trade-offs are smaller than expected
• OpenCL’s portability is responsible (so far) for its appeal
Thank you! Questions?
A.L.Varbanescu@tudelft.nl
analucia@cs.vu.nl
http://www.pds.ewi.tudelft.nl/~varbanescu