Performance Portability and Programmability for Heterogeneous Many-core Architectures (PEPPHER). Siegfried Benkner (on behalf of the PEPPHER Consortium). Research Group Scientific Computing, Faculty of Computer Science, University of Vienna, Austria.
EU Project PEPPHER • Performance Portability & Programmability for Heterogeneous Many-core Architectures • ICT FP7, Computing Systems; 3 years; finished Feb. 2013 • 9 partners, coordinated by University of Vienna • http://www.peppher.eu • Goal: Enable portable, productive and efficient programming of single-node heterogeneous many-core systems. • Holistic Approach • Component-Based High-Level Program Development • Auto-tuned Algorithms & Data Structures • Compilation Strategies • Runtime Systems • Hardware Mechanisms
Performance. Portability. Programmability. • Methodology & framework for development of performance-portable code. • Execute the same application efficiently on different heterogeneous architectures. • Support multiple parallel APIs: OpenMP, OpenCL, CUDA, TBB, ... (diagram: Application (C/C++) → PEPPHER Framework → many-core CPU, CPU+GPU, PEPPHERSim, PePU (Movidius), Intel Xeon Phi) • Focus: Single-node/chip heterogeneous architectures • Approach • Multi-architectural, performance-aware components: multiple implementation variants of functions, each with a performance model • Task-based execution model & intelligent runtime system: runtime selection of the best task implementation variant for a given platform
PEPPHER Approach (diagram: the mainstream programmer writes a component-based application with annotations; expert programmers, compilers, or autotuners provide component implementation variants for different platforms, algorithms, inputs, ...; platform descriptors (PDL) characterize the target platforms; measured performance is fed back) • Programmer • Annotate calls to performance-critical functions (= components) • Provide implementation variants for different platforms (PDL) • Provide component meta-data
PEPPHER Approach, continued (diagram: the annotated application is transformed/composed into an intermediate task-based representation; the runtime system's heterogeneous task scheduler dynamically selects the "best" implementation variant for the target platform, with measured performance fed back) • Programmer • Annotate calls to performance-critical functions (= components) • Provide implementation variants for different platforms (PDL) • Provide component meta-data • PEPPHER framework • Management of components and implementation variants • Transformation / Composition • Implementation variant selection • Dynamic, performance-aware task scheduling (StarPU runtime)
PEPPHER Framework (layer diagram) • Applications: Embedded, General Purpose, HPC — C/C++ source code with annotated component calls • High-Level Coordination/Patterns/Skeletons: asynchronous calls, data distribution, patterns, SkePU skeletons • Components: C/C++, OpenMP, CUDA, OpenCL, TBB, Offload — component implementation variants for different core architectures, algorithms, ... • Autotuned Data Structures & Algorithms • Transformation Tool / Composition Tool: component glue code; static variant selection (if any) • PEPPHER Task Graph: component task graph with explicit data dependencies; performance models • PEPPHER Run-time (StarPU): performance-aware, data-aware dynamic scheduling of "best" component variants onto free execution units; scheduling strategies; drivers (CUDA, OpenCL, OpenMP) • Single-node heterogeneous many-core: CPU, GPU, Xeon Phi, SIM, PePU (SIM = PEPPHER simulator; PePU = PEPPHER processing unit, Movidius)
PEPPHER Components • Component Interface • Specification of functionality • Used by mainstream programmers • Implementation Variants • Different architectures/platforms • Different algorithms/data structures • Different input characteristics • Different performance goals • Written by expert programmers (or generated, e.g. by auto-tuning; cf. EU AutoTune Project) (diagram: «interface» C with f(param-list) and interface meta-data; «variant» C1 ... «variant» Cn, each providing f(param-list){...} with its own variant meta-data) • Features • Different programming languages (C/C++, OpenCL, CUDA, OpenMP) • Task & data parallelism • Constraints • No side effects; non-preemptive • Stateless; composition on CPU only
Platform Description Language (PDL) • Goal: Make platform-specific information explicit for tools and users. • Processing Units (PUs) • Master (initiates program execution) • Worker (executes delegated tasks) • Hybrid (master & worker) • Memory Regions • Express key characteristics of the memory hierarchy • Can be defined for all processing units • Interconnects • Describe communication facilities (data movement) between PUs • Hardware and Software Properties • e.g., core count, memory sizes, available libraries
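A descriptor covering the concepts listed above might look like the following sketch. The element and attribute names here are illustrative guesses, not the normative PDL schema:

```xml
<!-- Hypothetical PDL sketch for a CPU+GPU node; tag names are
     illustrative, not the official PEPPHER schema. -->
<platform name="cpu-gpu-node">
  <processing-unit id="cpu0" role="master" cores="4"/>
  <processing-unit id="gpu0" role="worker" cores="448"/>
  <memory-region id="host-ram" size="24GB" units="cpu0"/>
  <memory-region id="gpu-ram" size="3GB" units="gpu0"/>
  <interconnect from="host-ram" to="gpu-ram" type="PCIe"/>
  <property name="software" value="CUDA, OpenCL"/>
</platform>
```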
PEPPHER Coordination Language • Component calls • asynchronous & synchronous • Patterns • e.g. pipeline pattern

#pragma pph call // read A, write B -> meta-data
cf1(A, N, B, M);
#pragma pph call
cf2(B, M);
#pragma pph call sync
cf(A, N);

#pragma pph pipeline
while (inputstream >> file) {
  readImage(file, image);
  #pragma pph stage replicate(N)
  {
    resizeAndColorConvert(image);
    detectFace(image, outImage);
    ...
  }
}

• Other Features: • Specification of optimization goals (time vs. power) and execution targets • Data partitioning; array access patterns; parameter assertions • Memory consistency control
Transformation System • Source-to-Source Transformation • based on ROSE • generates C++ with calls to the coordination layer and the StarPU runtime • Coordination Layer • Support for parallel patterns (pipelining) • Submission of tasks to StarPU • Heterogeneous Runtime System • Based on INRIA's StarPU • Selection of implementation variants based on available hardware resources • Data-aware & performance-aware task scheduling onto heterogeneous PUs (diagram: application with annotations + PDL platform descriptor + PEPPHER component repository → transformation tool → coordination layer / PEPPHER component framework → task-based heterogeneous runtime → hybrid hardware: SMP, GPU, MIC)
Performance Results • OpenCV Face Detection • 3425 images • Image resolution: 640x480 (VGA) • Different implementation variants for middle stages (CPU vs. GPU) • Comparison to plain OpenCV version and hand-coded Intel TBB (pipeline) version • Architecture: 2 Xeon X5550 (4 core), 2 NVIDIA C2050, 1 NVIDIA C1060
Major Results of PEPPHER • Component Framework • Multi-architectural, resource- & performance-aware components • PDL adopted by Open Community Runtime (OCR) – US X-Stack program • Transformation, Composition, Compilation • Transformation Tool (U. Vienna) • Composition Tool & SkePU (U. Linköping) • Offload C++ compiler used by game industry (Codeplay) • Runtime System (U. Bordeaux) • StarPU part of the Linux (Debian) distribution and the MAGMA library • Superior parallel algorithms and data structures (KIT, Chalmers) • PePU Experimental Hardware Platform & Simulator (Movidius) • PeppherSIM used in industry
Example: Cholesky factorization

FOR k = 0..TILES-1
  POTRF(A[k][k])
  FOR m = k+1..TILES-1
    TRSM(A[k][k], A[m][k])
  FOR n = k+1..TILES-1
    SYRK(A[n][k], A[n][n])
    FOR m = n+1..TILES-1
      GEMM(A[m][k], A[n][k], A[m][n])

Utilize expert-written components: BLAS kernels from MAGMA and PLASMA • Implementation variants: • multi-core CPU (PLASMA) • GPU (MAGMA) • PEPPHER component: • Interface, implementation variants + meta-data
PEPPHER Approach • Transformation/Composition • Processing of user annotations/meta-data • Generation of task-based representation (DAG) • Static pre-selection of variants (task DAG: POTRF → TRSM ... → SYRK ... → GEMM, where each GEMM task may execute as CPU-GEMM or GPU-GEMM) • Task-based Execution Model • Runtime task variant selection & scheduling • Data/topology-aware: minimize data transfer • Performance-aware: minimize makespan, or another objective (power, ...) • Multi-level parallelism • Coarse-grained inter-component parallelism • Fine(r)-grained intra-component parallelism • Exploit ALL execution units
Component Meta-Data • Interface Meta-Data (XML) • Parameter intent (read/write) • Supported performance aspects (execution time, power) • Implementation Variant Meta-Data (XML) • Supported target platforms (PDL) • Performance model • Input data constraints (if any) • Tunable parameters (if any) • Required components (if any) • Key issues • Make platform-specific optimizations/dependencies explicit. • Make components performance- and resource-aware. • Support runtime variant selection. • Support code transformation and auto-tuning. (XML schemas exist for both interface and variant meta-data)
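A variant descriptor covering the fields listed above might look like this sketch. The tag names are illustrative, not the actual PEPPHER XML schema:

```xml
<!-- Hypothetical variant meta-data; tag names are illustrative. -->
<componentVariant interface="dgemm" name="dgemm_cuda">
  <targetPlatform pdlRef="nvidia-fermi"/>        <!-- PDL reference -->
  <performanceModel type="history-based"/>
  <inputConstraint expression="N &gt;= 512"/>    <!-- variant pays off only for large N -->
  <tunableParameter name="tileSize" range="16:128"/>
  <requiredComponent interface="transpose"/>
</componentVariant>
```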
Performance-Aware Components • Each component is associated with an abstract performance model. • Invocation Context: captures performance-relevant information of the input data (problem size, data layout, etc.) • Resource Context: specifies main HW/SW characteristics (cores, memory, ...) • Performance Descriptor: usually includes (relative) runtime and power estimates • Generic performance prediction function:

PerfDsc getPrediction(InvocationContextDsc icd, ResourceContextDsc rcd)
Basic Coordination Language • Memory Consistency • flush: ensures consistency between host and workers • Component calls • implicit memory consistency across workers

#pragma pph call
cf1(A, N);
...
#pragma pph flush(A)  // block until A has become available
int first = A[0];     // explicit flush required since A is accessed on the master

#pragma pph call
cf1(A, N);   // A: read/write
...          // implicit memory consistency on workers only
...          // no explicit flush is needed here provided A
...          // is not accessed within the master process
#pragma pph call
cf2(A, N);   // A: read; actual values of A produced by cf1()
Basic Coordination Language • Parameter Assertions • influence component variant selection • Optimization Goals • specify optimization goals to be taken into account by the runtime scheduler • Execution Target • specify a pre-defined target library (e.g., OPENCL) or a processing-unit group from the PDL platform descriptor

#pragma pph call parameter(size < 1000)
cf1(A, size);

#pragma pph call optimize(TIME)
cf1(A, size);
...
#pragma pph call optimize(POWER < 100 && TIME < 10)
cf2(A, size);

#pragma pph call target(OPENCL)
cf(A, size);
Basic Coordination Language • Data Partitioning • generate multiple component calls, one for each partition (cf. HPF) • Access to Array Sections • specify which array section is accessed in a component call (cf. Fortran array sections)

#pragma pph call partition(A(size:BLOCK(size/2)))
cf1(A, size);

#pragma pph call access(A(size:50:size-1))
cf(A+50, size-50);
Performance Results • Leukocyte Tracking • Adapted from the Rodinia benchmark suite • Different implementation variants for Motion Gradient Vector Flow (CPU vs. GPU) • Comparison to OpenMP version • Architecture: • 2 Xeon X5550 (4 core) • 2 NVIDIA C2050 • 1 NVIDIA C1060 • 4 different configurations -> PDL descriptors (speedup chart: SEQ, OMP, original Rodinia, and PEPPHER with 8 CPU cores, 7 CPU + 1 GPU, 6 CPU + 2 GPU, and 5 CPU + 3 GPU)
Future Work • EU AutoTune Project • Autotuning of high-level patterns (pipeline replication factor, …) • Tunable parameters specified in component descriptors • Work on Energy Efficiency • Energy-aware components;Runtime scheduling for energy efficiency • User-specified optimization goals • Tradeoff execution time vs. energy consumption; QoS support • Extension towards Clusters • Combine with global MPI layer across nodes